100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Samenvatting Introduction to Information Retrieval €3,99
In winkelwagen

Samenvatting

Samenvatting Introduction to Information Retrieval

 62 keer bekeken  1 keer verkocht

Samenvatting van het boek Introduction to Information Retrieval

Voorbeeld 3 van de 23  pagina's

  • Nee
  • H1 tm h6
  • 13 april 2020
  • 23
  • 2009/2010
  • Samenvatting
book image

Titel boek:

Auteur(s):

  • Uitgave:
  • ISBN:
  • Druk:
Alle documenten voor dit vak (3)
avatar-seller
cdh
Introduction to Information Retrieval
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze

Chapter 1 Boolean Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers).
The term “unstructured data” refers to data which does not have clear, semantically overt,
easy-for-a-computer structure. In reality, almost no data are truly “unstructured”.
IR is also used to facilitate “semistructured” search such as finding a document where the
​ nd the body contains ​threading.
title contains ​java a
Information retrieval systems can also be distinguished by the scale at which they operate,
and it is useful to distinguish three prominent scales.
● In ​web search, ​the system has to provide search over billions of documents stored on
millions of computers.
○ Distinctive issues: needing to gather documents for indexing, being able to
build systems that work efficiently at this enormous scale, and handling
particular aspects of the web.
● Personal information retrieval.
○ Distinctive issues: handling the broad range of document types on a typical
personal computer, and making the search system maintenance free and
sufficiently lightweight in terms of startup, processing, and disk space usage
that it can run on one machine without annoying its owner.
● Enterprise, institutional, and domain-specific search, w ​ here retrieval might be
provided. Documents will typically be stored on centralized file systems and one or a
handful of dedicated machines will provide search over the collection.

1.1 An example information retrieval problem
● Grepping ​: For a computer the simplest form of document retrieval by doing a linear
scan through document.
● Indexing : ​Avoid linearly scanning the texts for each query by indexing the documents
in advance.
○ The result is a binary term-document ​incidence matrix.
○ Terms a ​ re the indexed units; they are usually words.
○ Depending on whether we look at the matrix rows or columns, we can have a
vector for each term, which shows the documents it appears in, or a vector for
each document, showing the terms that occur in it.

,The ​Boolean retrieval model i​ s a model for information retrieval in which we can pose any
query which is in the form of a Boolean expression of terms, that is, in which terms are
combined with the operators ​AND, OR, ​and ​NOT.

In the most standard information retrieval test, ​ad hoc retrieval task, ​a system aims to
provide documents from within the collection that are relevant to an arbitrary user
information need, communicated to the system by means of a one-off, user-initiated query.
An ​information need i​ s the topic about which the user desires to know more, and is
differentiated from a ​query, w​ hich is what the user conveys to the computer in an attempt to
communicate the information need. A document is ​relevant ​if it is one that the user perceives
as containing information of value with respect to their personal information need.

​ f an IR system (i.e., the quality of its search results), a user will
To assess the ​effectiveness o
usually want to know two key statistics about the system’s returned results for a query:

Precision: ​What fraction of the returned results are relevant to the information need?
Recall: ​What fraction of the relevant documents in the collection were returned by the
system?

The matrix is extremely sparse, that is, it has few non-zero entries. A much better
representation is to record only the things that do occur, that is, the 1 positions.

Inverted index
We keep a ​dictionary ​of terms. Then for each term, we have a list that records which
documents the term occurs in,. Each item in the list - which records that a term appeared in
a document - is conventionally called a ​posting. ​The list is then called a ​postings list ​(or
inverted list), and all the postings lists taken together are referred to as the ​postings.

1.2 A first take at building an inverted index
1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the
indexing terms.
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.

1.3 Processing Boolean queries
Processing the ​simple conjunctive query:
aA ​ ND b
1. Locate ​a​ in the Dictionary
2. Retrieve its postings
3. Locate ​b​ in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists
​ peration is the crucial one: we need to efficiently intersect postings lists so
The ​intersection o
as to be able to quickly find documents that contain both terms. (​merging postings lists).

, Query optimization i​ s the process of selecting how to organize the work of answering a
query so that the least total amount of work needs to be done by the system. If we start by
intersecting the two smallest postings lists, then all intermediate results must be no bigger
than the smallest postings list, and we are therefore likely to do the least amount of total
work.

The extended Boolean model versus ranked retrieval
In ranked retrieval models such as the vector space model, users largely use ​free text
queries, m​ eaning, just typing one or more words rather than using a precise language with
operators for building up query expressions, and the system decides which documents best
satisfy the query.
The extended Boolean model has additional operators (next to the basic Boolean
operations: AND, OR and NOT), such as term proximity operators.
A ​proximity operator ​is a way of specifying that two terms in a query must occur close to
each other in a document, where closeness may be measured by limiting the allowed
number of intervening words or by reference to a structural unit such as a sentence or
paragraph.
Boolean queries are precise: a document either matches the query or it does not. This offers
the user greater control and transparency over what is retrieved. And some domains, such
as legal materials, allow an effective means of document ranking within a Boolean model.
A general problem with Boolean search is that using AND operators tends to produce high
precision but low recall searches, while using OR operators gives low precisions but high
recall searches, and it is difficult or impossible to find a satisfactory middle ground.

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper cdh. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €3,99. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 50843 samenvattingen verkocht

Opgericht in 2010, al 14 jaar dé plek om samenvattingen te kopen

Start met verkopen
€3,99  1x  verkocht
  • (0)
In winkelwagen
Toegevoegd