information retrieval boolean index ranked postings lists vocabulary normalization stemming lemmatization queries phase queries biword indexes dictionaries wildcard queries vector space model vector space
Connected book
Book Title:
Author(s):
Edition:
ISBN:
Edition:
More summaries for
Samenvatting voor Information Retrieval Exam (X_400435)
Summary Data Analysis & Retrieval Midterm
Lectures Information Retrieval
All for this textbook (4)
Written for
Vrije Universiteit Amsterdam (VU)
Artificial Intelligence
Information Retrieval
All documents for this subject (3)
Seller
Follow
cdh
Reviews received
Content preview
Introduction to Information Retrieval
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Chapter 1 Boolean Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers).
The term “unstructured data” refers to data which does not have clear, semantically overt,
easy-for-a-computer structure. In reality, almost no data are truly “unstructured”.
IR is also used to facilitate “semistructured” search such as finding a document where the
nd the body contains threading.
title contains java a
Information retrieval systems can also be distinguished by the scale at which they operate,
and it is useful to distinguish three prominent scales.
● In web search, the system has to provide search over billions of documents stored on
millions of computers.
○ Distinctive issues: needing to gather documents for indexing, being able to
build systems that work efficiently at this enormous scale, and handling
particular aspects of the web.
● Personal information retrieval.
○ Distinctive issues: handling the broad range of document types on a typical
personal computer, and making the search system maintenance free and
sufficiently lightweight in terms of startup, processing, and disk space usage
that it can run on one machine without annoying its owner.
● Enterprise, institutional, and domain-specific search, w here retrieval might be
provided. Documents will typically be stored on centralized file systems and one or a
handful of dedicated machines will provide search over the collection.
1.1 An example information retrieval problem
● Grepping : For a computer the simplest form of document retrieval by doing a linear
scan through document.
● Indexing : Avoid linearly scanning the texts for each query by indexing the documents
in advance.
○ The result is a binary term-document incidence matrix.
○ Terms a re the indexed units; they are usually words.
○ Depending on whether we look at the matrix rows or columns, we can have a
vector for each term, which shows the documents it appears in, or a vector for
each document, showing the terms that occur in it.
,The Boolean retrieval model i s a model for information retrieval in which we can pose any
query which is in the form of a Boolean expression of terms, that is, in which terms are
combined with the operators AND, OR, and NOT.
In the most standard information retrieval test, ad hoc retrieval task, a system aims to
provide documents from within the collection that are relevant to an arbitrary user
information need, communicated to the system by means of a one-off, user-initiated query.
An information need i s the topic about which the user desires to know more, and is
differentiated from a query, w hich is what the user conveys to the computer in an attempt to
communicate the information need. A document is relevant if it is one that the user perceives
as containing information of value with respect to their personal information need.
f an IR system (i.e., the quality of its search results), a user will
To assess the effectiveness o
usually want to know two key statistics about the system’s returned results for a query:
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the
system?
The matrix is extremely sparse, that is, it has few non-zero entries. A much better
representation is to record only the things that do occur, that is, the 1 positions.
Inverted index
We keep a dictionary of terms. Then for each term, we have a list that records which
documents the term occurs in,. Each item in the list - which records that a term appeared in
a document - is conventionally called a posting. The list is then called a postings list (or
inverted list), and all the postings lists taken together are referred to as the postings.
1.2 A first take at building an inverted index
1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the
indexing terms.
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
1.3 Processing Boolean queries
Processing the simple conjunctive query:
aA ND b
1. Locate a in the Dictionary
2. Retrieve its postings
3. Locate b in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists
peration is the crucial one: we need to efficiently intersect postings lists so
The intersection o
as to be able to quickly find documents that contain both terms. (merging postings lists).
, Query optimization i s the process of selecting how to organize the work of answering a
query so that the least total amount of work needs to be done by the system. If we start by
intersecting the two smallest postings lists, then all intermediate results must be no bigger
than the smallest postings list, and we are therefore likely to do the least amount of total
work.
The extended Boolean model versus ranked retrieval
In ranked retrieval models such as the vector space model, users largely use free text
queries, m eaning, just typing one or more words rather than using a precise language with
operators for building up query expressions, and the system decides which documents best
satisfy the query.
The extended Boolean model has additional operators (next to the basic Boolean
operations: AND, OR and NOT), such as term proximity operators.
A proximity operator is a way of specifying that two terms in a query must occur close to
each other in a document, where closeness may be measured by limiting the allowed
number of intervening words or by reference to a structural unit such as a sentence or
paragraph.
Boolean queries are precise: a document either matches the query or it does not. This offers
the user greater control and transparency over what is retrieved. And some domains, such
as legal materials, allow an effective means of document ranking within a Boolean model.
A general problem with Boolean search is that using AND operators tends to produce high
precision but low recall searches, while using OR operators gives low precisions but high
recall searches, and it is difficult or impossible to find a satisfactory middle ground.
The benefits of buying summaries with Stuvia:
Guaranteed quality through customer reviews
Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.
Quick and easy check-out
You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.
Focus on what matters
Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!
Frequently asked questions
What do I get when I buy this document?
You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.
Satisfaction guarantee: how does it work?
Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.
Who am I buying these notes from?
Stuvia is a marketplace, so you are not buying this document from us, but from seller cdh. Stuvia facilitates payment to the seller.
Will I be stuck with a subscription?
No, you only buy these notes for $4.28. You're not tied to anything after your purchase.