Summary 2-page Overview Data Analysis & Retrieval Midterm (INFOB3DAR)
24 views 1 purchase
Course
Data Analyse En Retrieval (INFOB3DAR)
Institution
Universiteit Utrecht (UU)
2-page overview (cheatsheet) of all the subjects discussed in the first part of the Data Analysis & Retrieval course, very concisely summarized. Summary of the summary. Includes several examples of MapReduce algorithms.
Precision/selec vity: frac on of retrieved Scoring & Ranking No Random Access Algorithm (NRA)
docs that is relevant to user’s need Term frequency ( t,d): how many mes does - Can’t find correspondin oid value other list
Recall/sensi vity: relevant frac on of all doc term occur in document - Store range of values of f in buffer, update
Log-frequency weigh ng (wt,d): when necessary, so if max-B decreased
Indexing wt,d = 0 if t,d = 0 since last round, update buffer entries for
Boolean retrieval wt,d = 1 + log( t,d) if t,d > 0 which value of B was not determined yet
- Term-document incidence matrix: shows - Upper limit of range is needed when lower
what docs a term appears in with 1 and 0 Document frequency (dft): nr of documents limit of a tuple exceeds upper limit of all
- ‘x AND y BUT NOT z’ ⇒ 1101 && 1100 in collec on that contain the term t entries in buffer not yet in top-k (will
&& (NOT 0101) = 1000 Inverse df (idft): idft = log(N / dft), where N is definitely have higher score than those)
- Sparse matrix: only store docIds that do the total nr of documents in collec on
contain the term in a pos ng list - Measure of rareness of a term
- Dic onary as hash table/B-tree/trie - log(1) = 0, so term that occurs in every
document does not contribute to score
List merging
1. Add current docID to result if in both lists -idf weigh ng: wt,d ⋅ idft
2. Otherwise, get next docID of list where - Can’t handle single term queries
current is lower than other - Favours long documents
- Skip pointer: when docID that can
be skipped to is lower than current Vector-space model
docID in other list, skip to it - Each doc represented as D-dimensional
vector, with D the nr of all known terms
Posi onal index: each term has a list of - Similarity is defined as the angle between
tuples <docID, list of posi ons> query vector and doc vector, normalised
- “to be or not to be”, ‘be’ must appear - Use the cosine of the angle
twice, 4 posi ons apart - Inner product: x • y = ||x||⋅||y||⋅ cos(𝛉)
𝐷 Frequent item sets
Wild-card queries - ||𝑥|| = ||𝑥||2 = ∑ 𝑥𝑖
2
Table r(U) with items U, columns are the
- pre* ⇒ all terms between pre - prf 𝑖=1 items,rows are transac ons.
- *post ⇒ inverted B-tree, tsop - tsoq - s(X): support of set of items X; nr of
- pre*post ⇒ (1) intersect pre* and *post Top-k searching transac ons where all of X was bought
(2) permuterm index Monotone: func on whose return value will - s(XY): rela on X → Y, s(XY) = s(X ∪ Y)
not increase/decrease when the value of - confidence: s(XY)/s(X)
Permuterm index one of the params is increased - Of all t’s that contain X, s(XY)/s(X) of
- For term hello, create entries hello$, them also contain Y
ello$h, llo$he, …, o$hell, connected to A ributes which are arguments of f are
same pos ng list as hello represented by separate lists. Algorithm
- Query he*o ⇒ add $, rotate un l * at end, For main-memory RDBMS with column 1. Find all frequent sets Z (exceeding some
e*o$h - o$he*, matches o$hello store, sorted access by efficient internal threshold for their support)
sor ng techniques, random access by index - Exceeding support threshold ensures
MapReduce structure or duplicate list sorted by oid. that you can split Z up into subsets
Map :: <k,v> ⇒ [<k1’,v1’>,...,<km’,vm’>] that together form rela on that also
- Input split into disjoint chunks of key-value Threshold Algorithm (TA) supports threshold (union of subsets)
pairs, assigned to 1 Map worker - Max-A/B: max value of A or B that could 2. For all non-empty subsets X of Z, test if X
- Map worker performs calc on each pair be found in coming rounds → (Z - X) holds with sufficient confidence
- Outputs from all workers grouped on key - Threshold T: max possible future value of f
- With f = A + B, T = max-A + max-B Gaussian elimina on
Reduce :: <k’,[v1’,...,vn’]> ⇒ [<k1’,v1’’>,...] - Buffer: tuples [oid,f], where f is calc using - Get all columns from le to right to
- Par on func on: hash(key) % R, where R the values for that oid in both lists represent the iden ty matrix, the
is nr of reduce tasks (>> nr of workers) - Par al top-k: set of tuples in buffer f ≥ T augmented column is then the solu on
- Sets of pairs distributed over Reduce - Update these values each round and move - Apply rules:
workers, which perform calc on each pair tuples with f ≥ T to the top-k - Interchanging two rows
- Reduce the list of values to just one value - Mul plying row by non-zero constant
- Adding mul ple of one row to other
Failure handling - First row is a, second b, third c
- Robustness: mask hardware failures by 1. b -= - 2a 5. c -= b
replica ng files in different racks 2. c -= a 6. c *= -½
- Master pings workers periodically 3. b *= -1 7. b -= 3c
- Map worker fail: completed/in-progress 4. a -= b 8. a += c
tasks set to idle and rescheduled to others
- Reduce worker fail: in-progress tasks idle
- Master fails: all is lost, ripperoni
Early combining
- k mes emit(<w,1>), just one emit(<w,k>)
- Init_Map() ⇒ Map() ⇒ Finalize_Map()
- Possible outcomes:
The benefits of buying summaries with Stuvia:
Guaranteed quality through customer reviews
Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.
Quick and easy check-out
You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.
Focus on what matters
Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!
Frequently asked questions
What do I get when I buy this document?
You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.
Satisfaction guarantee: how does it work?
Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.
Who am I buying these notes from?
Stuvia is a marketplace, so you are not buying this document from us, but from seller Suniht. Stuvia facilitates payment to the seller.
Will I be stuck with a subscription?
No, you only buy these notes for $3.25. You're not tied to anything after your purchase.