100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Summary Data Analysis & Retrieval Midterm €8,99
In winkelwagen

Samenvatting

Summary Data Analysis & Retrieval Midterm

 19 keer bekeken  1 keer verkocht

All subjects that are discussed in the first part of the Data Analysis & Retrieval (INFOB3DAR) course for the midterm, clearly summarized. Based on the lectures and the book.

Voorbeeld 3 van de 22  pagina's

  • Nee
  • Onbekend
  • 1 juni 2022
  • 22
  • 2021/2022
  • Samenvatting
book image

Titel boek:

Auteur(s):

  • Uitgave:
  • ISBN:
  • Druk:
Alle documenten voor dit vak (2)
avatar-seller
Suniht
Data-analysis & retrieval
Midterm

Indexing 2

MapReduce 5

Scoring & Ranking 7
Top-k searching 9

Frequent item sets 11

Linear algebra 13
Gaussian elimination 14

PageRank 15

Approximate string matching 16

,Indexing
Text searching
- Collection: Fixed set of documents
- Goal: retrieving documents relevant to the user’s information need
- User’s need for information usually expressed by one or more search terms

Quality measures:
- Precision: fraction of retrieved documents that is relevant to user’s information need
(also called selectivity)
- Recall: fraction of relevant docs in collection that are retrieved (also called sensitivity)

Boolean retrieval
- Basic model for IR (Information Retrieval)
- Uses logical operators (AND, OR, NOT) and brackets
- Term-document incidence matrix: matrix that shows if a term appears in a document
(with 1 or 0 for true or false)
- For each term, you get a bit array where each bit is determined by whether or not
the term is contained in the corresponding document
- You can use the above mentioned bit array in bitwise operations to run queries
- For example, get documents from a collection of 6 that meet the query ‘Brutus
AND Caesar BUT NOT Calpurnia’ could correspond to the bitwise operation
110100 && 110111 && (NOT 010000) = 100100
- Problem: collections are often rather large, too large for the use of such a matrix
- Solved by the sparse matrix approach
- Sparse matrix: documents identified by unique docID and terms are organised in a
dictionary, with each term having its own posting list
- Posting list: ordered list of documents containing the corresponding term
- Dictionary is implemented as a hash table or tree like structure
- Implementations of postings lists:
- Internal memory, static situation: arrays
- Internal memory, dynamic situation: linked list
- External memory: linked list (block structure)




Tree like structures
- B-tree: binary tree, but with a maximum of 4 branches leading out from a node (4
branches needs 3 values to determine which branch to follow
- Trie (prefix tree): leafs are the terms, built up by various nodes adding prefixes

, Indexing process
1. From the documents to be indexed, get the relevant tokens (terms)
2. Modify the tokens to be more general (no capital letters etc.)
3. Get posting lists for the terms using indexer

Boolean query processing
- Query = term1 AND term2
1. Locate postings list for p1 for term1
2. Locate postings list p2 for term2
3. Calculate the intersection of p1 and p2 by list merging
- List merging: keep only the docID’s that occur in all input lists (intersection)
1. Get docID of both lists
2. If they are equal, add to result
3. Otherwise, get next docID of list where current docID is lower than other
4. Repeat steps 2 and 3 with new docID
- Query = term1 AND NOT term2
- Go through the first list and add only the docID’s that do not occur in the other
input list to the result
- Query = term1 AND term2 AND … AND termn
- 𝑛! possibilities of merging one by one
- Use the length of postings list, merge lists with smallest length first
- Skip pointers: pointer to docID further down the list
- Use when the docID that the pointer points to
is lower than the current docID of the other list
- For example, on the right, the pointer from 11
to 31 will be used when merging the two lists
- Many skip pointers → more comparisons,
more skips, higher memory cost
- Few skip pointers → fewer comparisons, less frequent skips, longer
jumps, lower memory cost
- Rule of thumb: 𝑛 skip pointers for postings list of length 𝑛
- Instead of merging the lists one by one, you can merge them all at the same time
by making a slight adjustment to the algorithm for list merging
- This approach allows more efficient use of skip pointers

Phrase queries
- Juxtapositions of terms: difference between “fight club” and “fight” AND “club”
- Solution 1: biword index (make an index for each term)

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper Suniht. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €8,99. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 52355 samenvattingen verkocht

Opgericht in 2010, al 14 jaar dé plek om samenvattingen te kopen

Start met verkopen
€8,99  1x  verkocht
  • (0)
In winkelwagen
Toegevoegd