Precision/selectivity: fraction of retrieved docs that is relevant to the user's need
Recall/sensitivity: fraction of all relevant docs that is retrieved

Indexing
Boolean retrieval
- Term-document incidence matrix: shows what docs a term appears in with 1 and 0
- 'x AND y BUT NOT z' ⇒ 1101 && 1100 && (NOT 0101) = 1000
- Sparse matrix: only store docIDs that do contain the term in a posting list
- Dictionary as hash table/B-tree/trie

List merging
1. Add current docID to result if in both lists
2. Otherwise, get next docID of the list whose current docID is lower than the other's
- Skip pointer: when the docID that can be skipped to is lower than the current docID in the other list, skip to it

Positional index: each term has a list of tuples <docID, list of positions>
- "to be or not to be": 'be' must appear twice, 4 positions apart

Wild-card queries
- pre* ⇒ all terms between pre - prf
- *post ⇒ inverted B-tree, tsop - tsoq
- pre*post ⇒ (1) intersect pre* and *post (2) permuterm index

Permuterm index
- For term hello, create entries hello$, ello$h, llo$he, …, o$hell, connected to the same posting list as hello
- Query he*o ⇒ add $, rotate until * at end: he*o$ - e*o$h - … - o$he*, matches o$hell

Scoring & Ranking
Term frequency (tf_t,d): how many times does the term occur in the document
Log-frequency weighting (w_t,d):
- w_t,d = 0 if tf_t,d = 0
- w_t,d = 1 + log(tf_t,d) if tf_t,d > 0
Document frequency (df_t): nr of documents in the collection that contain the term t
Inverse df (idf_t): idf_t = log(N / df_t), where N is the total nr of documents in the collection
- Measure of the rareness of a term
- log(1) = 0, so a term that occurs in every document does not contribute to the score
tf-idf weighting: w_t,d ⋅ idf_t
- Can't handle single-term queries
- Favours long documents

Vector-space model
- Each doc represented as a D-dimensional vector, with D the nr of all known terms
- Similarity is defined as the angle between query vector and doc vector, normalised
- Use the cosine of the angle
- Inner product: x • y = ||x|| ⋅ ||y|| ⋅ cos(θ)
- ||x|| = ||x||₂ = √(∑_{i=1}^{D} x_i²)

Top-k searching
Monotone: function whose return value does not decrease when the value of one of its parameters is increased
Attributes which are arguments of f are represented by separate lists. For a main-memory RDBMS with column store: sorted access by efficient internal sorting techniques, random access by an index structure or a duplicate list sorted by oid.

Threshold Algorithm (TA)
- Max-A/max-B: max value of A or B that could still be found in coming rounds
- Threshold T: max possible future value of f; with f = A + B, T = max-A + max-B
- Buffer: tuples [oid, f], where f is calculated using the values for that oid in both lists
- Partial top-k: set of tuples in the buffer with f ≥ T
- Update these values each round and move tuples with f ≥ T to the top-k

No Random Access Algorithm (NRA)
- Can't find the corresponding oid value in the other list
- Store a range of values of f in the buffer and update it when necessary: if max-B decreased since the last round, update the buffer entries for which the value of B was not determined yet
- The upper limit of the range is needed: a tuple moves to the top-k when its lower limit exceeds the upper limit of all entries in the buffer not yet in the top-k (it will definitely have a higher score than those)

Frequent item sets
Table r(U) with items U; columns are the items, rows are transactions.
- s(X): support of set of items X; nr of transactions where all of X was bought
- s(XY): for relation X → Y, s(XY) = s(X ∪ Y)
- Confidence: s(XY)/s(X)
  - Of all transactions that contain X, s(XY)/s(X) of them also contain Y

Algorithm
1. Find all frequent sets Z (exceeding some threshold for their support)
- Exceeding the support threshold ensures that you can split Z up into subsets X and Z − X that form a relation whose support (that of their union, i.e. s(Z)) also meets the threshold
2. For all non-empty subsets X of Z, test if X → (Z − X) holds with sufficient confidence

Gaussian elimination
- Get all columns from left to right to represent the identity matrix; the augmented column is then the solution
- Apply rules:
  - Interchanging two rows
  - Multiplying a row by a non-zero constant
  - Adding a multiple of one row to another
- First row is a, second b, third c:
  1. b -= -2a
  2. c -= a
  3. b *= -1
  4. a -= b
  5. c -= b
  6. c *= -½
  7. b -= 3c
  8. a += c

MapReduce
Map :: <k,v> ⇒ [<k1',v1'>, …, <km',vm'>]
- Input split into disjoint chunks of key-value pairs, each assigned to one Map worker
- Map worker performs the calculation on each pair
- Outputs from all workers are grouped on key
Reduce :: <k',[v1',…,vn']> ⇒ [<k1',v1''>, …]
- Partition function: hash(key) % R, where R is the nr of reduce tasks (>> nr of workers)
- Sets of pairs distributed over Reduce workers, which perform the calculation on each pair
- Reduce the list of values to just one value

Failure handling
- Robustness: mask hardware failures by replicating files in different racks
- Master pings workers periodically
- Map worker fails: completed and in-progress tasks set to idle and rescheduled to other workers
- Reduce worker fails: in-progress tasks set to idle
- Master fails: all is lost, ripperoni
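A minimal sketch of the Threshold Algorithm from the Top-k searching section, assuming f = A + B, that both lists score the same set of oids, and that sorted access is simulated by pre-sorting two Python dicts (all names here are illustrative, not from the notes):

```python
import heapq

def ta_top_k(A, B, k):
    """Top-k by the monotone function f(a, b) = a + b using TA.

    A, B: dicts oid -> score; they double as the random-access side.
    Sorted access is simulated by iterating each dict sorted on score.
    """
    sorted_a = sorted(A.items(), key=lambda kv: -kv[1])
    sorted_b = sorted(B.items(), key=lambda kv: -kv[1])
    buffer = {}  # oid -> f, filled by random access on first sighting
    for (oid_a, a), (oid_b, b) in zip(sorted_a, sorted_b):
        for oid in (oid_a, oid_b):
            buffer.setdefault(oid, A[oid] + B[oid])  # random access
        threshold = a + b  # T = max-A + max-B: best possible unseen f
        top = heapq.nlargest(k, buffer.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # k-th score already beats anything unseen
    return heapq.nlargest(k, buffer.items(), key=lambda kv: kv[1])

# ta_top_k({'x': 10, 'y': 8, 'z': 1}, {'x': 5, 'y': 9, 'z': 2}, 2)
# → [('y', 17), ('x', 15)], stopping before 'z' is ever read
```

The early stop is exactly the notes' rule: tuples with f ≥ T can never be overtaken by an unseen tuple.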
Early combining
- Instead of k times emit(<w,1>), just one emit(<w,k>)
- Init_Map() ⇒ Map() ⇒ Finalize_Map()
- Possible outcomes:
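The Init_Map/Map/Finalize_Map pattern can be sketched as a toy, single-process word count (a model of the dataflow only, not a distributed implementation; all names are illustrative):

```python
from collections import Counter, defaultdict

def map_word_count(doc):
    # Early combining inside one Map task:
    counts = Counter()           # Init_Map(): per-task local state
    for word in doc.split():     # Map(): one call per input record
        counts[word] += 1        # fold k occurrences of w locally
    return list(counts.items())  # Finalize_Map(): one emit(<w, k>)

def group_on_key(pairs):
    grouped = defaultdict(list)  # the shuffle: group outputs on key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_word_count(word, partial_counts):
    return word, sum(partial_counts)  # reduce the value list to one value

docs = ["to be or not to be", "to be"]
grouped = group_on_key(p for d in docs for p in map_word_count(d))
totals = dict(reduce_word_count(w, vs) for w, vs in grouped.items())
# totals == {'to': 3, 'be': 3, 'or': 1, 'not': 1}
```

With combining, the first document emits <'to', 2> once instead of <'to', 1> twice, shrinking the intermediate data the shuffle must move.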