College aantekeningen

Information Retrieval Summary

Name: Information Retrieval Summary 2021-2022
SKU: doc_1444004
Rating: 3.80 (5 reviews)
Author: bastudent

5 beoordelingen

22 keer verkocht

Vak
Information Retrieval

Instelling
Vrije Universiteit Amsterdam (VU)

Detailed document on all the lectures 1-13.

[Meer zien]

Voorbeeld 4 van de 44 pagina's

Bekijk voorbeeld

Geupload op 15 december 2021
Aantal pagina's 44
Geschreven in 2021/2022
Type College aantekeningen
Docent(en) Tobia kuhn
Bevat Alle colleges

information retrieval
ir

5 beoordelingen

Door: waterdijk • 1 jaar geleden

It is not a summary. It is basically a document that has all the slides written on it without any reasoning between them. The slides were more useful to study.

Door: ilmajaganjac1 • 2 jaar geleden

Door: ramapamudji • 2 jaar geleden

Door: berendwiewel • 3 jaar geleden

Door: jorissnl • 3 jaar geleden

Volgen

bastudent Lid sinds 4 jaar 59 documenten verkocht

€7,49

In winkelwagen

Opslaan

100% tevredenheidsgarantie
Direct beschikbaar na je betaling
Lees online óf als PDF
Geen vaste maandelijkse kosten

INFORMATION RETRIEVAL

WEEK 1
LECTURE 1: Introduction

What is information retrieval?
Information retrieval: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
that satisfies an information need from within large collections (usually stored on computers).

History of information retrieval
- Information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar
professional searchers. Nowadays, everyone engages in IR every day e.g. search engines or searching an inbox.
- Before computers information retrieval existed in some form. E.g. at the library, one could find books by searching through
index cards.
- Automated information retrieval:
o The first idea for an automated information retrieval system was described in 1945 by Vannevar Bush in ‘As We May
Think’.
o Microfilm stored in a large desk that could be accessed through knobs and screens that would show the microfilm
recordings. It was a mechanical process that could pull out the right book. It also made use of hyperlinks that you could
use to get from one book to the next or find books about a certain topic.
o In the 1960s the field of information retrieval emerged when text documents became available in digital form.
- Evolution of information retrieval:
o 1960-70s: the era of Boolean retrieval
o 1975: the first vector space model
o 1980s: large document database systems run by companies became available e.g. LexisNexis and MedLine
o 1990s: FTP search and the dawn of Web search e.g. Lycos, Yahoo, and Altavista.
- Information retrieval in the 2000s:
o Link analysis and ranking e.g. Google
o Question answering
o Multimedia IR (image and video analysis)
o Cross-language IR
o Semantic Web Technologies e.g. DBpedia
- Information retrieval since 2010:
o Categorization and clustering, and recommendation:
• iTunes - “top songs"
• Amazon - “people who bought this also bought"
• IBM's Watson system recommendations in Netflix, Spotify, YouTube, etc.
o Machine Learning for IR: learn to rank
o Knowledge graphs using Semantic Web Technologies e.g. the structured facts and photos that appear when you search
for someone on the Web.

Information retrieval vs. databases
- Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style
searching.
Information retrieval Databases
(Mostly) unstructured data e.g. text Structured data e.g. tables
Set of keywords (loose semantics) Well defined query (regular expression, SQL)
Incomplete query specification, partial matching Complete query specification, exact match
Relevant items for result, errors are tolerable Single error results in a failure
Probabilistic models Deterministic models
- Unstructured data: data which does not have clear, semantically overt, easy-for-a-computer structure. No data is truly
unstructured e.g. headings, paragraphs etc.
- Structured data: a relational database such as those used to maintain product inventories and personnel records.

IR scale
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Enterprise, institutional, and domain-specific search: retrieval might be provided for collections such as a corporation’s
internal documents, a database of patents, or research articles on biochemistry.
3. Personal information retrieval: consumer operating systems have integrated IR e.g. email provide search but also text
classification.

1

,What is needed to build a search engine?
- The most important part of a search engine or the information retrieval framework is the index.

The information retrieval framework

What makes a search engine good?
- Search speed:
o How fast does it search?
o Latency as a function of index size.
- Index speed:
o How fast does it index?
o Number of documents per hour.
- Expressiveness of query language:
o Ability to express complex information needs.
o Speed on complex queries.
- Disk space requirements
- User interface features
- User happiness

Search speed (latency) from fastest to slowest:
1. Main memory reference (read random byte from memory).
2. Zip 1KB of data (compress 1000 bytes in memory).
3. SSD random read (read random byte from solid-state drive).
4. Round trip within same datacenter (send one byte to another
computer in the same fast datacenter network and back).
5. Hard disk seek (read random byte from hard disk).
6. Send one byte from the Netherlands to California and back.

2

,LECTURE 2: Indexing and Boolean retrieval

Searching unstructured data
- Example of a Boolean query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- A simple and naïve first attempt to process this query: brute force-approach in the form of a linear search through text
documents, called grepping. Useful for simple querying of words however, not efficient for large document collections,
flexible matching operations, or ranked retrieval.
- Therefore, we need the information retrieval framework, where a lot of preliminary effort is required to create a structure
that makes it efficient to answer a query.
- The index is the main structure to be created.

Term-document incidence matrix
- Incidence matrix: a binary-term document.
o Terms are the indexed units that are usually words.
o Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the
documents it appears in, or a vector for each document, showing the terms that occur in it.

- If we take the rows, we have a vector with 0/1 values for each term.
- To answer the query:
o We take the vectors for Brutus and the vector for Caesar and the
complemented (NOT = 0 and 1 switched) vector for Calpurnia.
o We then apply a bit-wise logical AND
110100 AND 110111 AND 101111 = 100100
o The answer is the first and the fourth document: Antony and Cleopatra and Hamlet.

Bigger collections
- With a slightly more realistic example: if N = 1 000 000 documents of each 1 000 words (tokens) long, and 1 word is on
average 6 bytes, then the document collection is 6 GB in size (1 000 000 000 bytes = 1 GB).
- In this case, our vocabulary V consists of about |V|= 500 000 distinct words (terms).

The matrix is getting very big
- 500 000 (row for each distinct term) x 1 000 000 (column for each document) matrix has half-a-trillion 0s and 1s, which is
too many to fit in a computer’s memory. 500 000 000 000 = 500Gb/8 = 62.5GB (1 byte (B) = 8 bits (b)).
- The matrix is extremely sparse.
- Sparse matrix: a matrix with few non-zero entries, thus comprised mostly of zeros.
o The number of words in each document is relatively small compared to the number of distinct terms. E.g. each column
can have a maximum of 1000 1s and there are at least 499 000 0s. Therefore, 99.8% of the cells are zero.
o A term-document matrix is extremely sparse.
- A much better representation is to record only the things that do occur, that is, the 1 position. This is central to the idea of
inverted index.

Inverted index
- Records only the 1 positions and is also known as an inverted file.
- There is a dictionary of items.
- For each term (word) t, we store a list of all documents that contain t, called
a postings list, which is a variable-size list.
- Data structure of a postings list:
o On disk: a continuous run of postings without explicit pointers is best.
o In memory: can use linked lists or variable-length arrays.

3

, - The only way to access a list is to go through the list from the start till the end.
- Identify each document by a document ID: a unique serial document number.
- Each item in the list, which records that a term appeared in a document is called a posting.
- Although there are still 500 000 terms, the length of each postings list is much smaller than before, hence the size is smaller.

Inverted index construction

- Steps for building the index after collecting the documents to be indexed:
1. Token sequence: tokenize the text, turning each document into a list of tokens.
o (Normalized token, document ID) pairs
2. Sort: do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
o Sort by terms (alphabetically) and then by document ID.

3. Dictionary and postings: index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
o Multiple term entries in a single document are merged.
o Split into dictionary and postings.
o Document frequency information is added.
• Document frequency: the number of documents which contain the term. Same as the length of each posting.
We use this information for improving query time efficiency and, later, for weighting in ranked retrieval
models.

- Since a term often occurs in a number of documents, this data organization reduces the storage requirements of the index.
- This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
- We pay for storage of both the dictionary and the postings lists:
o Dictionary is kept in memory.
o Postings lists are kept on disk since it is larger.

4

Dit zijn jouw voordelen als je samenvattingen koopt bij Stuvia:

Bewezen kwaliteit door reviews

Studenten hebben al meer dan 850.000 samenvattingen beoordeeld. Zo weet jij zeker dat je de beste keuze maakt!

In een paar klikken geregeld

Geen gedoe — betaal gewoon eenmalig met iDeal, creditcard of je Stuvia-tegoed en je bent klaar. Geen abonnement nodig.

Direct to-the-point

Studenten maken samenvattingen voor studenten. Dat betekent: actuele inhoud waar jij écht wat aan hebt. Geen overbodige details!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper bastudent. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €7,49. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 65539 samenvattingen verkocht

Opgericht in 2010, al 15 jaar dé plek om samenvattingen te kopen

Begin nu gratis

Laatst bekeken door jou

Samenvatting ·

(2)

College aantekeningen

Information Retrieval Summary

Document informatie

Onderwerpen

Geschreven voor

5 beoordelingen

Verkoper

Ontvangen beoordelingen

Voorbeeld van de inhoud

Dit zijn jouw voordelen als je samenvattingen koopt bij Stuvia:

Bewezen kwaliteit door reviews

In een paar klikken geregeld

Direct to-the-point

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Tevredenheidsgarantie: hoe werkt dat?

Van wie koop ik deze samenvatting?

Zit ik meteen vast aan een abonnement?

Is Stuvia te vertrouwen?

Laatst bekeken door jou

Samenvatting ·

TIO HEM1 Samenvatting VTT Gastvrij powerpoints en boek Gastvrij! 1 Werken in de horeca