INFORMATION RETRIEVAL
WEEK 1
LECTURE 1: Introduction
What is information retrieval?
Information retrieval (IR): finding material (usually documents) of an unstructured nature (usually text) that satisfies an
information need from within large collections (usually stored on computers).
History of information retrieval
- Information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar
professional searchers. Nowadays, everyone engages in IR every day, e.g. when using a search engine or searching an email inbox.
- Before computers, information retrieval already existed in some form, e.g. at the library one could find books by searching through
index cards.
- Automated information retrieval:
o The first idea for an automated information retrieval system was described in 1945 by Vannevar Bush in ‘As We May
Think’.
o Bush's proposed device (the memex) stored microfilm in a large desk and was accessed through knobs and screens that
displayed the microfilm recordings. A mechanical process could pull out the right book, and hyperlink-like links let you
move from one book to the next or find books about a certain topic.
o In the 1960s the field of information retrieval emerged when text documents became available in digital form.
- Evolution of information retrieval:
o 1960-70s: the era of Boolean retrieval
o 1975: the first vector space model
o 1980s: large document database systems run by companies became available e.g. LexisNexis and MEDLINE
o 1990s: FTP search and the dawn of Web search e.g. Lycos, Yahoo, and AltaVista.
- Information retrieval in the 2000s:
o Link analysis and ranking e.g. Google
o Question answering
o Multimedia IR (image and video analysis)
o Cross-language IR
o Semantic Web Technologies e.g. DBpedia
- Information retrieval since 2010:
o Categorization and clustering, and recommendation:
• iTunes - "top songs"
• Amazon - "people who bought this also bought"
• IBM's Watson system
• Recommendations in Netflix, Spotify, YouTube, etc.
o Machine Learning for IR: learn to rank
o Knowledge graphs using Semantic Web Technologies e.g. the structured facts and photos that appear when you search
for someone on the Web.
Information retrieval vs. databases
- Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style
searching.
Information retrieval | Databases
(Mostly) unstructured data, e.g. text | Structured data, e.g. tables
Set of keywords (loose semantics) | Well-defined query (regular expression, SQL)
Incomplete query specification, partial matching | Complete query specification, exact match
Relevant items as results, errors are tolerable | Single error results in a failure
Probabilistic models | Deterministic models
- Unstructured data: data which does not have a clear, semantically overt, easy-for-a-computer structure. Hardly any data is
truly unstructured, though: even plain text has some structure, such as headings and paragraphs.
- Structured data: a relational database such as those used to maintain product inventories and personnel records.
IR scale
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Enterprise, institutional, and domain-specific search: retrieval might be provided for collections such as a corporation’s
internal documents, a database of patents, or research articles on biochemistry.
3. Personal information retrieval: consumer operating systems have integrated IR, e.g. email clients provide search but also
text classification (such as spam filtering).
What is needed to build a search engine?
- The most important part of a search engine or the information retrieval framework is the index.
The information retrieval framework
What makes a search engine good?
- Search speed:
o How fast does it search?
o Latency as a function of index size.
- Index speed:
o How fast does it index?
o Number of documents per hour.
- Expressiveness of query language:
o Ability to express complex information needs.
o Speed on complex queries.
- Disk space requirements
- User interface features
- User happiness
Search speed (latency) from fastest to slowest:
1. Main memory reference (read random byte from memory).
2. Zip 1KB of data (compress 1000 bytes in memory).
3. SSD random read (read random byte from solid-state drive).
4. Round trip within same datacenter (send one byte to another
computer in the same fast datacenter network and back).
5. Hard disk seek (read random byte from hard disk).
6. Send one byte from the Netherlands to California and back.
LECTURE 2: Indexing and Boolean retrieval
Searching unstructured data
- Example of a Boolean query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- A simple and naïve first attempt to process this query is a brute-force approach: a linear scan through the text of the
documents, called grepping. This works for simple word queries, but it is not efficient for large document collections,
flexible matching operations, or ranked retrieval.
- Therefore, we need the information retrieval framework, where a lot of preliminary effort is required to create a structure
that makes it efficient to answer a query.
- The index is the main structure to be created.
Term-document incidence matrix
- Incidence matrix: a binary term-document matrix.
o Terms are the indexed units that are usually words.
o Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the
documents it appears in, or a vector for each document, showing the terms that occur in it.
- If we take the rows, we have a vector with 0/1 values for each term.
- To answer the query:
o We take the vector for Brutus, the vector for Caesar, and the
complemented (NOT: 0s and 1s switched) vector for Calpurnia.
o We then apply a bit-wise logical AND
110100 AND 110111 AND 101111 = 100100
o The answer is the first and the fourth document: Antony and Cleopatra and Hamlet.
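The query evaluation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the 0/1 rows for Brutus, Caesar, and Calpurnia are the ones given above, and the remaining play names are filled in from the standard textbook version of this example.
```python
# Sketch: evaluating "Brutus AND Caesar AND NOT Calpurnia"
# over the term-document incidence matrix from the Shakespeare example.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 row per term (1 = the term occurs in that play).
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],   # 110100
    "Caesar":    [1, 1, 0, 1, 1, 1],   # 110111
    "Calpurnia": [0, 1, 0, 0, 0, 0],   # 010000
}

def AND(a, b):
    return [x & y for x, y in zip(a, b)]

def NOT(a):
    return [1 - x for x in a]

result = AND(AND(incidence["Brutus"], incidence["Caesar"]),
             NOT(incidence["Calpurnia"]))
answer = [play for play, bit in zip(plays, result) if bit]
print(result)   # [1, 0, 0, 1, 0, 0]  i.e. 100100
print(answer)   # ['Antony and Cleopatra', 'Hamlet']
```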
Bigger collections
- With a slightly more realistic example: if N = 1 000 000 documents, each 1 000 words (tokens) long, and a word is on
average 6 bytes, then the document collection is 6 GB in size (1 000 000 000 bytes = 1 GB).
- In this case, our vocabulary V consists of about |V|= 500 000 distinct words (terms).
The matrix is getting very big
- A 500 000 (one row per distinct term) x 1 000 000 (one column per document) matrix has half a trillion 0s and 1s, which is
too many to fit in a computer’s memory: 500 000 000 000 bits = 500 Gb, and 500 Gb / 8 = 62.5 GB (1 byte (B) = 8 bits (b)).
- The matrix is extremely sparse.
- Sparse matrix: a matrix with few non-zero entries, thus comprised mostly of zeros.
o The number of words in each document is relatively small compared to the number of distinct terms. E.g. each column
can have a maximum of 1000 1s and there are at least 499 000 0s. Therefore, 99.8% of the cells are zero.
o A term-document matrix is extremely sparse.
- A much better representation is to record only the things that do occur, that is, the 1 positions. This is central to the idea of
the inverted index.
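A quick sanity check of the arithmetic above (a sketch using the stated assumptions: 1 000 000 documents, 1 000 tokens per document, 6 bytes per token, 500 000 distinct terms):
```python
# Back-of-the-envelope check of the collection and matrix sizes above.
N_DOCS = 1_000_000          # documents
DOC_LEN = 1_000             # tokens per document
BYTES_PER_TOKEN = 6
N_TERMS = 500_000           # distinct terms, |V|

collection_bytes = N_DOCS * DOC_LEN * BYTES_PER_TOKEN
print(collection_bytes / 1e9, "GB of text")            # 6.0 GB

matrix_cells = N_TERMS * N_DOCS                        # 0/1 entries (bits)
print(matrix_cells, "cells")                           # 500 000 000 000
print(matrix_cells / 8 / 1e9, "GB if stored as bits")  # 62.5 GB

sparsity = 1 - DOC_LEN / N_TERMS                       # at most 1 000 ones per column
print(f"at least {sparsity:.1%} of cells are zero")    # 99.8%
```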
Inverted index
- Records only the 1 positions and is also known as an inverted file.
- There is a dictionary of terms.
- For each term (word) t, we store a list of all documents that contain t, called
a postings list, which is a variable-size list.
- Data structure of a postings list:
o On disk: a continuous run of postings without explicit pointers is best.
o In memory: can use linked lists or variable-length arrays.
- The only way to access a postings list is to traverse it from the start to the end.
- Identify each document by a document ID: a unique serial document number.
- Each item in the list, which records that a term appeared in a document, is called a posting.
- Although there are still 500 000 terms, the length of each postings list is much smaller than before, hence the size is smaller.
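A minimal sketch of this structure in Python (illustrative only; the terms and document IDs below are made up): the dictionary maps each term to its document frequency and a sorted postings list of document IDs.
```python
# Illustrative sketch of an inverted index: dictionary entry -> postings list.
# Each posting here is just a document ID; each postings list is kept sorted.
inverted_index = {
    # term: (document frequency, postings list of doc IDs)
    "brutus":    (4, [1, 2, 4, 11]),
    "caesar":    (5, [1, 2, 4, 5, 6]),
    "calpurnia": (1, [2]),
}

def postings(term):
    """Return the postings list for a term (empty if the term is unknown)."""
    return inverted_index.get(term, (0, []))[1]

print(postings("brutus"))   # [1, 2, 4, 11]
```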
Inverted index construction
- Steps for building the index after collecting the documents to be indexed:
1. Token sequence: tokenize the text, turning each document into a list of tokens.
o (Normalized token, document ID) pairs
2. Sort: do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
o Sort by terms (alphabetically) and then by document ID.
3. Dictionary and postings: index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
o Multiple term entries in a single document are merged.
o Split into dictionary and postings.
o Document frequency information is added.
• Document frequency: the number of documents which contain the term, i.e. the length of the term's postings list.
We use this information for improving query-time efficiency and, later, for weighting in ranked retrieval
models.
- Since a term often occurs in a number of documents, this data organization reduces the storage requirements of the index.
- This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
- We pay for storage of both the dictionary and the postings lists:
o Dictionary is kept in memory.
o Postings lists are kept on disk since they are larger.
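A compact sketch of the three construction steps in Python (illustrative only; the two example documents and the naive tokenization/normalization are assumptions, not the lecture's implementation):
```python
# Illustrative sketch of inverted index construction:
# 1) tokenize, 2) sort (term, docID) pairs, 3) merge into dictionary + postings.
import re

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: token sequence -> (normalized token, docID) pairs.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):   # naive normalization
        pairs.append((token, doc_id))

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicates into dictionary + postings.
index = {}   # term -> sorted postings list of docIDs
for term, doc_id in pairs:
    plist = index.setdefault(term, [])
    if not plist or plist[-1] != doc_id:   # merge multiple entries per document
        plist.append(doc_id)

for term in ("caesar", "brutus"):
    print(term, "df =", len(index[term]), "postings =", index[term])
# caesar df = 2 postings = [1, 2]
# brutus df = 1 postings = [2]
```
Note how multiple occurrences of a term within the same document are merged into a single posting, and how the document frequency is simply the length of the postings list.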