Precision/selectivity: fraction of retrieved docs that is relevant to the user's need
Recall/sensitivity: fraction of all relevant docs that is retrieved

Indexing
Boolean retrieval
- Term-document incidence matrix: shows what docs a term appears in with 1 and 0
- 'x AND y BUT NOT z' ⇒ 1101 && 1100 && (NOT 0101) = 1000
- Sparse matrix: only store docIDs that do contain the term in a posting list
- Dictionary as hash table/B-tree/trie

List merging
1. Add current docID to result if in both lists
2. Otherwise, get next docID of the list whose current docID is lower than the other's
- Skip pointer: when the docID that can be skipped to is lower than the current docID in the other list, skip to it

Positional index: each term has a list of tuples <docID, list of positions>
- "to be or not to be": 'be' must appear twice, 4 positions apart

Wild-card queries
- pre* ⇒ all terms between pre - prf
- *post ⇒ inverted B-tree, tsop - tsoq
- pre*post ⇒ (1) intersect pre* and *post (2) permuterm index

Permuterm index
- For term hello, create entries hello$, ello$h, llo$he, …, o$hell, connected to the same posting list as hello
- Query he*o ⇒ add $, rotate until * at end: he*o$ - e*o$h - … - o$he*, matches o$hell

Scoring & Ranking
Term frequency (tf_t,d): how many times does the term occur in the document
Log-frequency weighting (w_t,d):
- w_t,d = 0 if tf_t,d = 0
- w_t,d = 1 + log(tf_t,d) if tf_t,d > 0
Document frequency (df_t): nr of documents in the collection that contain the term t
Inverse df (idf_t): idf_t = log(N / df_t), where N is the total nr of documents in the collection
- Measure of the rareness of a term
- log(1) = 0, so a term that occurs in every document does not contribute to the score
tf-idf weighting: w_t,d ⋅ idf_t
- Can't handle single-term queries
- Favours long documents

Vector-space model
- Each doc represented as a D-dimensional vector, with D the nr of all known terms
- Similarity is defined as the angle between query vector and doc vector, normalised
- Use the cosine of the angle
- Inner product: x • y = ||x|| ⋅ ||y|| ⋅ cos(θ)
- ||x|| = ||x||₂ = √(∑_{i=1}^{D} x_i²)

Top-k searching
Monotone: function whose return value does not decrease when the value of one of its parameters is increased
Attributes which are arguments of f are represented by separate lists. For a main-memory RDBMS with column store: sorted access by efficient internal sorting techniques, random access by an index structure or a duplicate list sorted by oid.

Threshold Algorithm (TA)
- Max-A/max-B: max value of A or B that could still be found in coming rounds
- Threshold T: max possible future value of f; with f = A + B, T = max-A + max-B
- Buffer: tuples [oid, f], where f is calculated using the values for that oid in both lists
- Partial top-k: set of tuples in the buffer with f ≥ T
- Update these values each round and move tuples with f ≥ T to the top-k

No Random Access Algorithm (NRA)
- Can't find the corresponding oid value in the other list
- Store a range of values of f in the buffer and update it when necessary: if max-B decreased since the last round, update the buffer entries for which the value of B was not determined yet
- The upper limit of the range is needed: a tuple moves to the top-k when its lower limit exceeds the upper limit of all entries in the buffer not yet in the top-k (it will definitely have a higher score than those)

Frequent item sets
Table r(U) with items U; columns are the items, rows are transactions.
- s(X): support of set of items X; nr of transactions where all of X was bought
- s(XY): for relation X → Y, s(XY) = s(X ∪ Y)
- Confidence: s(XY)/s(X)
  - Of all transactions that contain X, s(XY)/s(X) of them also contain Y

Algorithm
1. Find all frequent sets Z (exceeding some threshold for their support)
- Exceeding the support threshold ensures that you can split Z up into subsets X and Z − X that form a relation whose support (that of their union, i.e. s(Z)) also meets the threshold
2. For all non-empty subsets X of Z, test if X → (Z − X) holds with sufficient confidence

Gaussian elimination
- Get all columns from left to right to represent the identity matrix; the augmented column is then the solution
- Apply rules:
  - Interchanging two rows
  - Multiplying a row by a non-zero constant
  - Adding a multiple of one row to another
- First row is a, second b, third c:
  1. b -= -2a
  2. c -= a
  3. b *= -1
  4. a -= b
  5. c -= b
  6. c *= -½
  7. b -= 3c
  8. a += c

MapReduce
Map :: <k,v> ⇒ [<k1',v1'>, …, <km',vm'>]
- Input split into disjoint chunks of key-value pairs, each assigned to one Map worker
- Map worker performs the calculation on each pair
- Outputs from all workers are grouped on key
Reduce :: <k',[v1',…,vn']> ⇒ [<k1',v1''>, …]
- Partition function: hash(key) % R, where R is the nr of reduce tasks (>> nr of workers)
- Sets of pairs distributed over Reduce workers, which perform the calculation on each pair
- Reduce the list of values to just one value

Failure handling
- Robustness: mask hardware failures by replicating files in different racks
- Master pings workers periodically
- Map worker fails: completed and in-progress tasks set to idle and rescheduled to other workers
- Reduce worker fails: in-progress tasks set to idle
- Master fails: all is lost, ripperoni
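A minimal sketch of the Threshold Algorithm from the Top-k searching section, assuming f = A + B, that both lists score the same set of oids, and that sorted access is simulated by pre-sorting two Python dicts (all names here are illustrative, not from the notes):

```python
import heapq

def ta_top_k(A, B, k):
    """Top-k by the monotone function f(a, b) = a + b using TA.

    A, B: dicts oid -> score; they double as the random-access side.
    Sorted access is simulated by iterating each dict sorted on score.
    """
    sorted_a = sorted(A.items(), key=lambda kv: -kv[1])
    sorted_b = sorted(B.items(), key=lambda kv: -kv[1])
    buffer = {}  # oid -> f, filled by random access on first sighting
    for (oid_a, a), (oid_b, b) in zip(sorted_a, sorted_b):
        for oid in (oid_a, oid_b):
            buffer.setdefault(oid, A[oid] + B[oid])  # random access
        threshold = a + b  # T = max-A + max-B: best possible unseen f
        top = heapq.nlargest(k, buffer.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # k-th score already beats anything unseen
    return heapq.nlargest(k, buffer.items(), key=lambda kv: kv[1])

# ta_top_k({'x': 10, 'y': 8, 'z': 1}, {'x': 5, 'y': 9, 'z': 2}, 2)
# → [('y', 17), ('x', 15)], stopping before 'z' is ever read
```

The early stop is exactly the notes' rule: tuples with f ≥ T can never be overtaken by an unseen tuple.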
Early combining
- Instead of k times emit(<w,1>), just one emit(<w,k>)
- Init_Map() ⇒ Map() ⇒ Finalize_Map()
- Possible outcomes:
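The Init_Map/Map/Finalize_Map pattern can be sketched as a toy, single-process word count (a model of the dataflow only, not a distributed implementation; all names are illustrative):

```python
from collections import Counter, defaultdict

def map_word_count(doc):
    # Early combining inside one Map task:
    counts = Counter()           # Init_Map(): per-task local state
    for word in doc.split():     # Map(): one call per input record
        counts[word] += 1        # fold k occurrences of w locally
    return list(counts.items())  # Finalize_Map(): one emit(<w, k>)

def group_on_key(pairs):
    grouped = defaultdict(list)  # the shuffle: group outputs on key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_word_count(word, partial_counts):
    return word, sum(partial_counts)  # reduce the value list to one value

docs = ["to be or not to be", "to be"]
grouped = group_on_key(p for d in docs for p in map_word_count(d))
totals = dict(reduce_word_count(w, vs) for w, vs in grouped.items())
# totals == {'to': 3, 'be': 3, 'or': 1, 'not': 1}
```

With combining, the first document emits <'to', 2> once instead of <'to', 1> twice, shrinking the intermediate data the shuffle must move.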