Summary

Summary 2-page Overview Data Analysis & Retrieval Midterm (INFOB3DAR)

24 views 1 purchase

Course
Data Analyse En Retrieval (INFOB3DAR)

Institution
Universiteit Utrecht (UU)

2-page overview (cheatsheet) of all the subjects discussed in the first part of the Data Analysis & Retrieval course, very concisely summarized. Summary of the summary. Includes several examples of MapReduce algorithms.

[Show more]

Preview 1 out of 2 pages

View example

Uploaded on June 1, 2022
Number of pages 2
Written in 2021/2022
Type Summary

indexing
list merging
positional index
permuterm index
mapreduce
boolean retrieval
scoring
ranking
term frequency
document frequency
idf
tfidf
top k searching
nra
freq
inverse document frequency
tf
ta

Institution
Universiteit Utrecht (UU)
Education
Informatica
Course
Data Analyse En Retrieval (INFOB3DAR)

Suniht

Member since 3 year 82 documents sold

$3.25

Added

Add to cart

Add to wishlist

100% satisfaction guarantee
Immediately available after payment
Both online and in PDF
No strings attached

Precision/selec vity: frac on of retrieved Scoring & Ranking No Random Access Algorithm (NRA)
docs that is relevant to user’s need Term frequency ( t,d): how many mes does - Can’t ﬁnd correspondin oid value other list
Recall/sensi vity: relevant frac on of all doc term occur in document - Store range of values of f in buﬀer, update
Log-frequency weigh ng (wt,d): when necessary, so if max-B decreased
Indexing wt,d = 0 if t,d = 0 since last round, update buﬀer entries for
Boolean retrieval wt,d = 1 + log( t,d) if t,d > 0 which value of B was not determined yet
- Term-document incidence matrix: shows - Upper limit of range is needed when lower
what docs a term appears in with 1 and 0 Document frequency (dft): nr of documents limit of a tuple exceeds upper limit of all
- ‘x AND y BUT NOT z’ ⇒ 1101 && 1100 in collec on that contain the term t entries in buﬀer not yet in top-k (will
&& (NOT 0101) = 1000 Inverse df (idft): idft = log(N / dft), where N is deﬁnitely have higher score than those)
- Sparse matrix: only store docIds that do the total nr of documents in collec on
contain the term in a pos ng list - Measure of rareness of a term
- Dic onary as hash table/B-tree/trie - log(1) = 0, so term that occurs in every
document does not contribute to score
List merging
1. Add current docID to result if in both lists -idf weigh ng: wt,d ⋅ idft
2. Otherwise, get next docID of list where - Can’t handle single term queries
current is lower than other - Favours long documents
- Skip pointer: when docID that can
be skipped to is lower than current Vector-space model
docID in other list, skip to it - Each doc represented as D-dimensional
vector, with D the nr of all known terms
Posi onal index: each term has a list of - Similarity is deﬁned as the angle between
tuples <docID, list of posi ons> query vector and doc vector, normalised
- “to be or not to be”, ‘be’ must appear - Use the cosine of the angle
twice, 4 posi ons apart - Inner product: x • y = ||x||⋅||y||⋅ cos(𝛉)
𝐷 Frequent item sets
Wild-card queries - ||𝑥|| = ||𝑥||2 = ∑ 𝑥𝑖
2
Table r(U) with items U, columns are the
- pre* ⇒ all terms between pre - prf 𝑖=1 items,rows are transac ons.
- *post ⇒ inverted B-tree, tsop - tsoq - s(X): support of set of items X; nr of
- pre*post ⇒ (1) intersect pre* and *post Top-k searching transac ons where all of X was bought
(2) permuterm index Monotone: func on whose return value will - s(XY): rela on X → Y, s(XY) = s(X ∪ Y)
not increase/decrease when the value of - conﬁdence: s(XY)/s(X)
Permuterm index one of the params is increased - Of all t’s that contain X, s(XY)/s(X) of
- For term hello, create entries hello$, them also contain Y
ello$h, llo$he, …, o$hell, connected to A ributes which are arguments of f are
same pos ng list as hello represented by separate lists. Algorithm
- Query he*o ⇒ add $, rotate un l * at end, For main-memory RDBMS with column 1. Find all frequent sets Z (exceeding some
e*o$h - o$he*, matches o$hello store, sorted access by eﬃcient internal threshold for their support)
sor ng techniques, random access by index - Exceeding support threshold ensures
MapReduce structure or duplicate list sorted by oid. that you can split Z up into subsets
Map :: <k,v> ⇒ [<k1’,v1’>,...,<km’,vm’>] that together form rela on that also
- Input split into disjoint chunks of key-value Threshold Algorithm (TA) supports threshold (union of subsets)
pairs, assigned to 1 Map worker - Max-A/B: max value of A or B that could 2. For all non-empty subsets X of Z, test if X
- Map worker performs calc on each pair be found in coming rounds → (Z - X) holds with suﬃcient conﬁdence
- Outputs from all workers grouped on key - Threshold T: max possible future value of f
- With f = A + B, T = max-A + max-B Gaussian elimina on
Reduce :: <k’,[v1’,...,vn’]> ⇒ [<k1’,v1’’>,...] - Buﬀer: tuples [oid,f], where f is calc using - Get all columns from le to right to
- Par on func on: hash(key) % R, where R the values for that oid in both lists represent the iden ty matrix, the
is nr of reduce tasks (>> nr of workers) - Par al top-k: set of tuples in buﬀer f ≥ T augmented column is then the solu on
- Sets of pairs distributed over Reduce - Update these values each round and move - Apply rules:
workers, which perform calc on each pair tuples with f ≥ T to the top-k - Interchanging two rows
- Reduce the list of values to just one value - Mul plying row by non-zero constant
- Adding mul ple of one row to other
Failure handling - First row is a, second b, third c
- Robustness: mask hardware failures by 1. b -= - 2a 5. c -= b
replica ng ﬁles in diﬀerent racks 2. c -= a 6. c *= -½
- Master pings workers periodically 3. b *= -1 7. b -= 3c
- Map worker fail: completed/in-progress 4. a -= b 8. a += c
tasks set to idle and rescheduled to others
- Reduce worker fail: in-progress tasks idle
- Master fails: all is lost, ripperoni

Early combining
- k mes emit(<w,1>), just one emit(<w,k>)
- Init_Map() ⇒ Map() ⇒ Finalize_Map()
- Possible outcomes:

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.

Quick and easy check-out

You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.

Focus on what matters

Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!

Frequently asked questions

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller Suniht. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy these notes for $3.25. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews)

64438 documents were sold in the last 30 days

Founded in 2010, the go-to place to buy study notes for 14 years now

Start selling

Popular Universities in the United States

Popular books

Find notes and summaries for these qualifications

Seller

Exam (elaborations) ·

Summary

Summary 2-page Overview Data Analysis & Retrieval Midterm (INFOB3DAR)

Document information

Subjects

Written for