Integrating the Probabilistic Model
BM25/BM25F into Lucene.
Joaquı́n Pérez-Iglesias1, José R. Pérez-Agüera2, Vı́ctor Fresno1 and
Yuval Z. Feinstein3
1
NLP&IR Group, Universidad Nacional de Educación a Distancia, Spain
arXiv:0911.5046v2 [cs.IR] 1 Dec 2009
2
University of North Carolina at Chapel Hill, USA
3
Answers Corporation, Jerusalem 91481, Israel
joaquin.perez@lsi.uned.es, jaguera@email.unc.edu, vfresno@lsi.uned.es,
yuvalf@answers.com
Abstract. This document describes the BM25 and BM25F implemen-
tation using the Lucene Java Framework. The implementation described
here can be downloaded from [Pérez-Iglesias 08a]. Both models have
stood out at TREC by their performance and are considered as state-
of-the-art in the IR community. BM25 is applied to retrieval on plain
text documents, that is for documents that do not contain fields, while
BM25F is applied to documents with structure.
Introduction
Apache Lucene is a high-performance and full-featured text search engine library
written entirely in Java. It is a technology suitable for nearly any application
that requires full-text search. Lucene is scalable and offers high-performance
indexing, and has become one of the most used search engine libraries in both
academia and industry [Lucene 09].
Lucene ranking function, the core of any search engine applied to determine
how relevant a document is to a given query, is built on a combination of the
Vector Space Model (VSM) and the Boolean model of Information Retrieval.
The main idea behind Lucene approach is the more times a query term appears
in a document relative to the number of times the term appears in the whole
collection, the more relevant that document will be to the query [Lucene 09].
Lucene uses also the Boolean model to first narrow down the documents that
need to be scored based on the use of boolean logic in the query specification.
In this paper, the implementation of BM25 probabilistic model and its ex-
tension for semi-structured IR, BM25F, is described in detail.
One of the main Lucene’s constraints to be widely used by IR community is
the lack of different retrieval models implementations. Our goal with this work is
to offer to IR community a more advanced ranking model which can be compared
with other IR software, like Terrier, Lemur, CLAIRlib or Xapian.
, 1 Motivation
There exists previous implementations of alternative Information Retrieval Mod-
els for Lucene. The most representative case of that is the Language Model im-
plementation4 from Intelligent Systems Lab Amsterdam. Another example is
described at [Doron 07] where Lucene is compared with Juru system. In this
case Lucene document length normalization is changed in order to improve the
Lucene ranking function performance.
BM25 has been widely use by IR researchers and engineers to improve search
engine relevance, so from our point of view, a BM25/BM25F implementation for
Lucene becomes necessary to make Lucene more popular for IR community.
Included Models
The developed models are based in the information that can be found at [Robertson 07].
More specifically the implemented ranking functions are as next:
BM25
X occursdt
R(q, d) = ld
t∈q
k1 ((1 − b) + b avl d
) + occursdt
where occursdt is the term frequency of t in d; ld is the document d length; avld is
the document average length along the collection; k1 is a free parameter usually
chosen as 2 and b ∈ [0, 1] (usually 0.75). Assigning 0 to b is equivalent to avoid
the process of normalisation and therefore the document length will not affect
the final score. If b takes 1, we will be carrying out a full length normalisation.
The classical inverse document frequency is computed as next:
N − df (t) + 0.5
idf (t) = log
df (t) + 0.5
where N is the number of documents in the collection and df is the number of
documents where appears the term t.
A different version of this formula, as can be found at Wikipedia5 , multiplies
the obtained bm25 weight by the constant (k1 + 1) in order to normalize the
weight of terms with a frequency equals to 1 that occurs in documents with an
average length.
BM25F
First we obtain the accumulated weight of a term over all fields as next:
4
http://ilps.science.uva.nl/resources/lm-lucene
5
http://en.wikipedia.org/wiki/Probabilistic relevance model (BM25)
The benefits of buying summaries with Stuvia:
Guaranteed quality through customer reviews
Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.
Quick and easy check-out
You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.
Focus on what matters
Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!
Frequently asked questions
What do I get when I buy this document?
You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.
Satisfaction guarantee: how does it work?
Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.
Who am I buying these notes from?
Stuvia is a marketplace, so you are not buying this document from us, but from seller intergalactictrainee. Stuvia facilitates payment to the seller.
Will I be stuck with a subscription?
No, you only buy these notes for $6.47. You're not tied to anything after your purchase.