Summary: Introduction to Information Retrieval

Course
Information Retrieval

Institution
Vrije Universiteit Amsterdam (VU)

Summary of the book Introduction to Information Retrieval

Entire book summarized? No
Which part of the book is summarized? Chapters 1 through 6
Uploaded on 13 April 2020
Number of pages: 23
Written in 2009/2010
Type: Summary

information
retrieval
boolean
index
ranked
postings lists
vocabulary
normalization
stemming
lemmatization
queries
phrase queries
biword indexes
dictionaries
wildcard queries
vector space model
vector s

Seller: cdh

Introduction to Information Retrieval
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze

Chapter 1 Boolean Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers).
The term “unstructured data” refers to data which does not have clear, semantically overt,
easy-for-a-computer structure. In reality, almost no data are truly “unstructured”.
IR is also used to facilitate “semistructured” search, such as finding a document where the
title contains java and the body contains threading.
Information retrieval systems can also be distinguished by the scale at which they operate,
and it is useful to distinguish three prominent scales.
● In web search, the system has to provide search over billions of documents stored on
millions of computers.
○ Distinctive issues: needing to gather documents for indexing, being able to
build systems that work efficiently at this enormous scale, and handling
particular aspects of the web.
● Personal information retrieval.
○ Distinctive issues: handling the broad range of document types on a typical
personal computer, and making the search system maintenance free and
sufficiently lightweight in terms of startup, processing, and disk space usage
that it can run on one machine without annoying its owner.
● Enterprise, institutional, and domain-specific search, where retrieval might be
provided for collections such as a corporation’s internal documents or a database of
patents. Documents will typically be stored on centralized file systems and one or a
handful of dedicated machines will provide search over the collection.

1.1 An example information retrieval problem
● Grepping: the simplest form of document retrieval for a computer, a linear
scan through the documents.
● Indexing: avoid linearly scanning the texts for each query by indexing the documents
in advance.
○ The result is a binary term-document incidence matrix.
○ Terms are the indexed units; they are usually words.
○ Depending on whether we look at the matrix rows or columns, we can have a
vector for each term, which shows the documents it appears in, or a vector for
each document, showing the terms that occur in it.

The Boolean retrieval model is a model for information retrieval in which we can pose any
query which is in the form of a Boolean expression of terms, that is, in which terms are
combined with the operators AND, OR, and NOT.
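
As a minimal sketch of how a Boolean query can be answered from a term-document incidence matrix, the example below stores each term's row as a Python integer used as a bit vector. The toy collection and the helper names `incidence_vectors` and `matching_docs` are assumptions made for illustration, not part of the book.

```python
# Toy collection (made up for illustration); keys are document IDs.
docs = {
    0: "brutus killed caesar",
    1: "caesar praised brutus and calpurnia",
    2: "calpurnia warned caesar",
}

def incidence_vectors(docs):
    """Map each term to an int whose bit d is 1 if the term occurs in document d."""
    vectors = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            vectors[term] = vectors.get(term, 0) | (1 << doc_id)
    return vectors

def matching_docs(bits, n_docs):
    """Return the document IDs whose bit is set in the result vector."""
    return [d for d in range(n_docs) if (bits >> d) & 1]

vectors = incidence_vectors(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

# Query: brutus AND caesar AND NOT calpurnia
result = vectors["brutus"] & vectors["caesar"] & (all_docs & ~vectors["calpurnia"])
print(matching_docs(result, len(docs)))  # [0]
```

AND, OR, and NOT map directly onto the bitwise operators `&`, `|`, and `~`; the mask keeps NOT restricted to documents that actually exist.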

In the ad hoc retrieval task, the most standard information retrieval task, a system aims to
provide documents from within the collection that are relevant to an arbitrary user
information need, communicated to the system by means of a one-off, user-initiated query.
An information need is the topic about which the user desires to know more, and is
differentiated from a query, which is what the user conveys to the computer in an attempt to
communicate the information need. A document is relevant if it is one that the user perceives
as containing information of value with respect to their personal information need.

To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will
usually want to know two key statistics about the system’s returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the
system?
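
The two definitions above can be sketched as a small computation over sets of document IDs. The function name `precision_recall` and the example IDs are assumptions chosen for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from returned and relevant document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 5 relevant in the collection, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```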

The term-document incidence matrix is extremely sparse, that is, it has few non-zero entries.
A much better representation is to record only the things that do occur, that is, the 1 positions.

Inverted index
We keep a dictionary of terms. Then for each term, we have a list that records which
documents the term occurs in. Each item in the list - which records that a term appeared in
a document - is conventionally called a posting. The list is then called a postings list (or
inverted list), and all the postings lists taken together are referred to as the postings.

1.2 A first take at building an inverted index
1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the
indexing terms.
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.
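
The four steps above can be sketched roughly as follows. This is a simplified illustration: the regular-expression tokenizer and the lowercasing-only `normalize` are stand-ins for real tokenization and linguistic preprocessing, and all function names are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split text into tokens (here: runs of letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: stand-in for linguistic preprocessing (lowercasing only)."""
    return token.lower()

def build_inverted_index(docs):
    """Step 4: map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Step 1: collect the documents to be indexed (toy examples).
docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
print(index["julius"])  # [1]
```

The dictionary keys play the role of the dictionary of terms, and each value is that term's postings list, kept sorted by document ID.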

1.3 Processing Boolean queries
Processing the simple conjunctive query:
a AND b
1. Locate a in the Dictionary
2. Retrieve its postings
3. Locate b in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists
The intersection operation is the crucial one: we need to efficiently intersect postings lists so
as to be able to quickly find documents that contain both terms. This operation is also called
merging postings lists.
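
A minimal sketch of the intersection step, assuming both postings lists are sorted by document ID. The function name `intersect` and the example lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller document ID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```

Because each step advances at least one pointer, the work is linear in the combined length of the two lists.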

Query optimization is the process of selecting how to organize the work of answering a
query so that the least total amount of work needs to be done by the system. If we start by
intersecting the two smallest postings lists, then all intermediate results must be no bigger
than the smallest postings list, and we are therefore likely to do the least amount of total
work.
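
The heuristic can be sketched for a conjunction of several terms by processing postings lists in order of increasing length. The function name `intersect_many` is an assumption, and for brevity the sketch filters with set membership rather than the linear merge of sorted lists.

```python
def intersect_many(postings_lists):
    """AND together several postings lists, shortest list first."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = list(ordered[0])  # start from the smallest postings list
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        keep = set(postings)
        result = [doc_id for doc_id in result if doc_id in keep]
    return result

print(intersect_many([[1, 2, 3, 4, 5, 6], [2, 4, 6], [4, 6, 8, 10]]))  # [4, 6]
```

Starting from the shortest list means every intermediate result is at most that long, which is the point of the ordering.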

The extended Boolean model versus ranked retrieval
In ranked retrieval models such as the vector space model, users largely use free text
queries, meaning just typing one or more words rather than using a precise language with
operators for building up query expressions, and the system decides which documents best
satisfy the query.
The extended Boolean model has additional operators (next to the basic Boolean
operations: AND, OR and NOT), such as term proximity operators.
A proximity operator is a way of specifying that two terms in a query must occur close to
each other in a document, where closeness may be measured by limiting the allowed
number of intervening words or by reference to a structural unit such as a sentence or
paragraph.
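
As a rough illustration of a proximity operator that limits the number of intervening words, the sketch below checks a single document's text directly. The function name `within_k_words` is hypothetical, and whitespace splitting stands in for real tokenization; a real system would answer such queries from an index rather than by scanning text.

```python
def within_k_words(text, term1, term2, k):
    """True if term1 and term2 occur with at most k words between them."""
    tokens = text.lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    # abs(i - j) - 1 is the number of intervening words between two positions.
    return any(abs(i - j) - 1 <= k for i in pos1 for j in pos2 if i != j)

text = "the quick brown fox jumps"
print(within_k_words(text, "quick", "fox", 1))  # True
print(within_k_words(text, "quick", "fox", 0))  # False
```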
Boolean queries are precise: a document either matches the query or it does not. This offers
the user greater control and transparency over what is retrieved. And some domains, such
as legal materials, allow an effective means of document ranking within a Boolean model.
A general problem with Boolean search is that using AND operators tends to produce
high-precision but low-recall searches, while using OR operators gives low-precision but
high-recall searches, and it is difficult or impossible to find a satisfactory middle ground.
