Integrating the Probabilistic Model BM25/BM25F into Lucene

Joaquín Pérez-Iglesias¹, José R. Pérez-Agüera², Víctor Fresno¹ and Yuval Z. Feinstein³

¹ NLP&IR Group, Universidad Nacional de Educación a Distancia, Spain
² University of North Carolina at Chapel Hill, USA
³ Answers Corporation, Jerusalem 91481, Israel

arXiv:0911.5046v2 [cs.IR] 1 Dec 2009





Abstract. This document describes the BM25 and BM25F implementation using the Lucene Java Framework. The implementation described here can be downloaded from [Pérez-Iglesias 08a]. Both models have stood out at TREC for their performance and are considered state-of-the-art in the IR community. BM25 is applied to retrieval on plain-text documents, that is, documents that do not contain fields, while BM25F is applied to documents with structure.



Introduction

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search. Lucene is scalable, offers high-performance indexing, and has become one of the most widely used search engine libraries in both academia and industry [Lucene 09].
The Lucene ranking function, the core of any search engine, determines how relevant a document is to a given query. It is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information Retrieval. The main idea behind the Lucene approach is that the more times a query term appears in a document relative to the number of times it appears in the whole collection, the more relevant that document will be to the query [Lucene 09]. Lucene also uses the Boolean model to first narrow down the set of documents to be scored, based on the Boolean logic in the query specification.
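As a rough, hypothetical illustration of this two-stage idea (a sketch, not Lucene's actual code), a Boolean filter can first select candidate documents, and a frequency-ratio score can then rank them:

```python
# Hypothetical sketch of the two-stage approach described above:
# a Boolean AND filter narrows the candidate set, then a VSM-style
# score ranks documents by how often each query term occurs in the
# document relative to the whole collection.

def boolean_filter(query_terms, docs):
    """Keep only documents containing every query term (AND semantics)."""
    return [d for d in docs if all(t in d for t in query_terms)]

def vsm_score(query_terms, doc, collection):
    """Sum, over query terms, of term frequency in the document
    divided by the term's total frequency in the collection."""
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)
        cf = sum(d.count(t) for d in collection)  # collection frequency
        if cf > 0:
            score += tf / cf
    return score

docs = [["lucene", "search", "lucene"],
        ["search", "engine"],
        ["lucene", "engine", "search"]]
query = ["lucene", "search"]

candidates = boolean_filter(query, docs)
ranked = sorted(candidates, key=lambda d: vsm_score(query, d, docs),
                reverse=True)
```

Here the second document is dropped by the Boolean filter (it lacks "lucene"), and the first document ranks highest because it contains "lucene" twice.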
In this paper, the implementation of the BM25 probabilistic model and its extension for semi-structured IR, BM25F, is described in detail. One of the main constraints preventing wider use of Lucene in the IR community is the lack of implementations of alternative retrieval models. Our goal with this work is to offer the IR community a more advanced ranking model that can be compared with other IR software, such as Terrier, Lemur, CLAIRlib or Xapian.

Motivation

There exist previous implementations of alternative Information Retrieval models for Lucene. The most representative case is the Language Model implementation⁴ from the Intelligent Systems Lab Amsterdam. Another example is described in [Doron 07], where Lucene is compared with the Juru system; in this case Lucene's document length normalization is changed in order to improve the performance of the Lucene ranking function.
BM25 has been widely used by IR researchers and engineers to improve search engine relevance, so from our point of view a BM25/BM25F implementation for Lucene is necessary to make Lucene more popular in the IR community.


Included Models

The developed models are based on the information that can be found in [Robertson 07]. More specifically, the implemented ranking functions are as follows:


BM25
R(q, d) = \sum_{t \in q} \frac{occurs^d_t}{k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs^d_t}

where occurs^d_t is the term frequency of t in d; l_d is the length of document d; avl_d is the average document length across the collection; k_1 is a free parameter, usually set to 2; and b ∈ [0, 1] (usually 0.75). Setting b to 0 is equivalent to disabling length normalisation, so document length will not affect the final score. If b is 1, full length normalisation is carried out.
The classical inverse document frequency is computed as follows:

idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}

where N is the number of documents in the collection and df(t) is the number of documents in which the term t appears.
A different version of this formula, as found on Wikipedia⁵, multiplies the obtained BM25 weight by the constant (k_1 + 1) in order to normalise the weight of terms with a frequency equal to 1 occurring in documents of average length.
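The two formulas above can be combined into a short sketch. This is an illustrative implementation, not the paper's Lucene code; it assumes, as in standard BM25, that the per-term saturation weight is multiplied by idf(t):

```python
import math

# Hedged sketch of the BM25 ranking function described above.
# Names follow the paper's notation: occurs (term frequency in the
# document), l_d (document length), avl_d (average document length),
# k1 and b (free parameters), idf (inverse document frequency).

def idf(term, docs):
    """Classical idf: log((N - df(t) + 0.5) / (df(t) + 0.5))."""
    N = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((N - df + 0.5) / (df + 0.5))

def bm25(query, doc, docs, k1=2.0, b=0.75):
    """Score `doc` against `query` over the collection `docs`."""
    l_d = len(doc)
    avl_d = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for t in query:
        occurs = doc.count(t)
        # Length-normalised saturation denominator:
        # k1 * ((1 - b) + b * l_d / avl_d) + occurs
        norm = k1 * ((1 - b) + b * l_d / avl_d)
        score += idf(t, docs) * occurs / (norm + occurs)
    return score
```

With b = 0 the denominator reduces to k1 + occurs, so two documents with the same term frequency receive the same score regardless of their lengths, matching the remark about disabling length normalisation.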


BM25F

First we obtain the accumulated weight of a term over all fields, as follows:
⁴ http://ilps.science.uva.nl/resources/lm-lucene
⁵ http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)
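The accumulated-weight formula itself does not appear in the text above. As a hedged sketch based on the standard BM25F formulation in [Robertson 07] (field names, boosts and normalisation parameters here are purely illustrative, not the paper's own), the per-field combination might look like:

```python
# Hypothetical sketch of BM25F field accumulation, following the
# common formulation from [Robertson 07]: each field's term frequency
# is length-normalised per field, scaled by a per-field boost, and the
# results are summed into a single pseudo-frequency for the term.

def accumulated_weight(term, doc_fields, field_boosts, field_b, avg_field_len):
    """Combine a term's per-field frequencies into one accumulated weight.

    doc_fields:    mapping field name -> list of tokens in that field.
    field_boosts:  per-field boost factors (illustrative values).
    field_b:       per-field length-normalisation parameters in [0, 1].
    avg_field_len: average length of each field across the collection.
    """
    weight = 0.0
    for f, tokens in doc_fields.items():
        occurs = tokens.count(term)
        # Per-field length normalisation: 1 + b_f * (l_f / avl_f - 1)
        length_norm = 1 + field_b[f] * (len(tokens) / avg_field_len[f] - 1)
        weight += field_boosts[f] * occurs / length_norm
    return weight
```

The accumulated weight would then replace occurs^d_t in a BM25-style saturation term, which is what lets BM25F rank structured documents while still saturating repeated occurrences across fields.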