100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Information Retrieval Summary €7,49   In winkelwagen

College aantekeningen

Information Retrieval Summary

5 beoordelingen
 263 keer bekeken  22 keer verkocht

Detailed document on all the lectures 1-13.

Voorbeeld 4 van de 44  pagina's

  • 15 december 2021
  • 44
  • 2021/2022
  • College aantekeningen
  • Tobia kuhn
  • Alle colleges
Alle documenten voor dit vak (1)

5  beoordelingen

review-writer-avatar

Door: waterdijk • 11 maanden geleden

It is not a summary. It is basically a document that has all the slides written on it without any reasoning between them. The slides were more useful to study.

review-writer-avatar

Door: ilmajaganjac1 • 2 jaar geleden

review-writer-avatar

Door: ramapamudji • 2 jaar geleden

review-writer-avatar

Door: berendwiewel • 2 jaar geleden

review-writer-avatar

Door: jorissnl • 2 jaar geleden

avatar-seller
bastudent
INFORMATION RETRIEVAL

WEEK 1
LECTURE 1: Introduction

What is information retrieval?
Information retrieval: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
that satisfies an information need from within large collections (usually stored on computers).

History of information retrieval
- Information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar
professional searchers. Nowadays, everyone engages in IR every day e.g. search engines or searching an inbox.
- Before computers information retrieval existed in some form. E.g. at the library, one could find books by searching through
index cards.
- Automated information retrieval:
o The first idea for an automated information retrieval system was described in 1945 by Vannevar Bush in ‘As We May
Think’.
o Microfilm stored in a large desk that could be accessed through knobs and screens that would show the microfilm
recordings. It was a mechanical process that could pull out the right book. It also made use of hyperlinks that you could
use to get from one book to the next or find books about a certain topic.
o In the 1960s the field of information retrieval emerged when text documents became available in digital form.
- Evolution of information retrieval:
o 1960-70s: the era of Boolean retrieval
o 1975: the first vector space model
o 1980s: large document database systems run by companies became available e.g. LexisNexis and MedLine
o 1990s: FTP search and the dawn of Web search e.g. Lycos, Yahoo, and Altavista.
- Information retrieval in the 2000s:
o Link analysis and ranking e.g. Google
o Question answering
o Multimedia IR (image and video analysis)
o Cross-language IR
o Semantic Web Technologies e.g. DBpedia
- Information retrieval since 2010:
o Categorization and clustering, and recommendation:
• iTunes - “top songs"
• Amazon - “people who bought this also bought"
• IBM's Watson system recommendations in Netflix, Spotify, YouTube, etc.
o Machine Learning for IR: learn to rank
o Knowledge graphs using Semantic Web Technologies e.g. the structured facts and photos that appear when you search
for someone on the Web.

Information retrieval vs. databases
- Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style
searching.
Information retrieval Databases
(Mostly) unstructured data e.g. text Structured data e.g. tables
Set of keywords (loose semantics) Well defined query (regular expression, SQL)
Incomplete query specification, partial matching Complete query specification, exact match
Relevant items for result, errors are tolerable Single error results in a failure
Probabilistic models Deterministic models
- Unstructured data: data which does not have clear, semantically overt, easy-for-a-computer structure. No data is truly
unstructured e.g. headings, paragraphs etc.
- Structured data: a relational database such as those used to maintain product inventories and personnel records.

IR scale
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Enterprise, institutional, and domain-specific search: retrieval might be provided for collections such as a corporation’s
internal documents, a database of patents, or research articles on biochemistry.
3. Personal information retrieval: consumer operating systems have integrated IR e.g. email provide search but also text
classification.


1

,What is needed to build a search engine?
- The most important part of a search engine or the information retrieval framework is the index.

The information retrieval framework




What makes a search engine good?
- Search speed:
o How fast does it search?
o Latency as a function of index size.
- Index speed:
o How fast does it index?
o Number of documents per hour.
- Expressiveness of query language:
o Ability to express complex information needs.
o Speed on complex queries.
- Disk space requirements
- User interface features
- User happiness

Search speed (latency) from fastest to slowest:
1. Main memory reference (read random byte from memory).
2. Zip 1KB of data (compress 1000 bytes in memory).
3. SSD random read (read random byte from solid-state drive).
4. Round trip within same datacenter (send one byte to another
computer in the same fast datacenter network and back).
5. Hard disk seek (read random byte from hard disk).
6. Send one byte from the Netherlands to California and back.




2

,LECTURE 2: Indexing and Boolean retrieval

Searching unstructured data
- Example of a Boolean query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- A simple and naïve first attempt to process this query: brute force-approach in the form of a linear search through text
documents, called grepping. Useful for simple querying of words however, not efficient for large document collections,
flexible matching operations, or ranked retrieval.
- Therefore, we need the information retrieval framework, where a lot of preliminary effort is required to create a structure
that makes it efficient to answer a query.
- The index is the main structure to be created.

Term-document incidence matrix
- Incidence matrix: a binary-term document.
o Terms are the indexed units that are usually words.
o Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the
documents it appears in, or a vector for each document, showing the terms that occur in it.




- If we take the rows, we have a vector with 0/1 values for each term.
- To answer the query:
o We take the vectors for Brutus and the vector for Caesar and the
complemented (NOT = 0 and 1 switched) vector for Calpurnia.
o We then apply a bit-wise logical AND
110100 AND 110111 AND 101111 = 100100
o The answer is the first and the fourth document: Antony and Cleopatra and Hamlet.

Bigger collections
- With a slightly more realistic example: if N = 1 000 000 documents of each 1 000 words (tokens) long, and 1 word is on
average 6 bytes, then the document collection is 6 GB in size (1 000 000 000 bytes = 1 GB).
- In this case, our vocabulary V consists of about |V|= 500 000 distinct words (terms).

The matrix is getting very big
- 500 000 (row for each distinct term) x 1 000 000 (column for each document) matrix has half-a-trillion 0s and 1s, which is
too many to fit in a computer’s memory. 500 000 000 000 = 500Gb/8 = 62.5GB (1 byte (B) = 8 bits (b)).
- The matrix is extremely sparse.
- Sparse matrix: a matrix with few non-zero entries, thus comprised mostly of zeros.
o The number of words in each document is relatively small compared to the number of distinct terms. E.g. each column
can have a maximum of 1000 1s and there are at least 499 000 0s. Therefore, 99.8% of the cells are zero.
o A term-document matrix is extremely sparse.
- A much better representation is to record only the things that do occur, that is, the 1 position. This is central to the idea of
inverted index.

Inverted index
- Records only the 1 positions and is also known as an inverted file.
- There is a dictionary of items.
- For each term (word) t, we store a list of all documents that contain t, called
a postings list, which is a variable-size list.
- Data structure of a postings list:
o On disk: a continuous run of postings without explicit pointers is best.
o In memory: can use linked lists or variable-length arrays.

3

, - The only way to access a list is to go through the list from the start till the end.
- Identify each document by a document ID: a unique serial document number.
- Each item in the list, which records that a term appeared in a document is called a posting.
- Although there are still 500 000 terms, the length of each postings list is much smaller than before, hence the size is smaller.

Inverted index construction




- Steps for building the index after collecting the documents to be indexed:
1. Token sequence: tokenize the text, turning each document into a list of tokens.
o (Normalized token, document ID) pairs
2. Sort: do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
o Sort by terms (alphabetically) and then by document ID.




3. Dictionary and postings: index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
o Multiple term entries in a single document are merged.
o Split into dictionary and postings.
o Document frequency information is added.
• Document frequency: the number of documents which contain the term. Same as the length of each posting.
We use this information for improving query time efficiency and, later, for weighting in ranked retrieval
models.




- Since a term often occurs in a number of documents, this data organization reduces the storage requirements of the index.
- This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
- We pay for storage of both the dictionary and the postings lists:
o Dictionary is kept in memory.
o Postings lists are kept on disk since it is larger.




4

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper bastudent. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €7,49. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 67474 samenvattingen verkocht

Opgericht in 2010, al 14 jaar dé plek om samenvattingen te kopen

Start met verkopen
€7,49  22x  verkocht
  • (5)
  Kopen