100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Cheat sheet for Language and AI (JBC090) €3,49
In winkelwagen

Overig

Cheat sheet for Language and AI (JBC090)

 1 keer verkocht

A cheat sheet for the course Language and AI (JBC090) for the major Data science at the TU/e and TiU.

Voorbeeld 1 van de 2  pagina's

  • 14 januari 2024
  • 2
  • 2023/2024
  • Overig
  • Onbekend
Alle documenten voor dit vak (4)
avatar-seller
NienkeUr
Preliminaries RegEx System evaluation
Bag-of-words Is the word in the sentence: yes (1) or no (0) False positive Matches when it shouldn’t
. Single char | Or
Words are the features (vocabulary V) False negative Doesn’t match when it should
Documents are instances (𝐷 = {𝑑1 , … , 𝑑𝑁 }) [abc] Any of these chars [^ab] None of these chars
Mean reciprocal rank 𝑄: # 𝑜𝑓 𝑞𝑢𝑒𝑟𝑖𝑒𝑠
𝑋 represents the feature space (numbers) [a-z] Any char in range [^a-z] No char in this range 𝑟𝑎𝑛𝑘𝑖 : rank of first relevant result
Binary BOW: compact and straightforward, but uninformative ^ Starts with $ Ends with ($be →to be) Average precision rel(k) : 1 when item at rank k is relevant
Each 𝑑𝑖 ∈ 𝐷 can be represented as a term frequency vector ? 0 or 1 time (greedy) ?? 0 or 1 time (lazy) P(k) : Precision at the same rank
⃗⃗⃗
𝑑𝑖 = 〈𝑡𝑓(𝑡1 , 𝑑𝑖 ), … , 𝑡𝑓(𝑡𝑀 , 𝑑𝑖 )〉 * 0+ times (greedy) *? 0+ times (lazy) Cohen’s 𝜿 indicates how much annotators agree on the
𝒕𝒇(𝒕, 𝒅) The frequency of term 𝑡𝑚 ∈ 𝑇 for doc 𝑑 + 1+ times (greedy) + 1+ times (lazy) label. Many annotators are used to minimize bias resulting
These generally have a logarithmic distribution n to m times from cultural background etc.
{𝑛} n times {𝑛, 𝑚}
Inverse document frequency is better as rare items are more {𝑛, } n+ times Normalization
informative (𝑑𝑓𝑡 is the # of docs term 𝑡 occurs in)
Tf-idf Combines term and inverse document frequency \S Not a white space Common approaches for normalization:
\s White space
 Case folding The → the
\d Digit \D Not a digit
To account for document length: use L2-normalization  Lemmatization is → be, reading → read
\w Letter \W Not a letter  (Porter) Stemming cats → cat, copy → copi
Jaccard coeff Compute document similarity
Cosine similarity Document similarity (𝑝 ⋅ 𝑞 = ∑(𝑝𝑖 × 𝑞𝑖 )) \b Word boundary (a change from \w to \W or inverse)
Types are unique tokens. The larger the work, the more types.
Uses length normalization
Heap’s/Herdan’s law |𝑉| = 𝑘𝑁𝛽 with 0 < 𝛽 < 1
\ Escape character (yes\? → “yes?” Not “ye”)
N-grams N is document length
() Group ((do|get) that → “do that”)
Min edit distance Min # of editing operations needed to
n-grams sequences of 𝑛 words (unigram, bigram,…) (?:x) Passive group. transform one string into the other
bigram Count for 𝑥 and 𝑦 how often 𝑥 follows 𝑦 Match on 𝑥 but does not include 𝑥 in output
Bigram prob : ÷ by total # occurrences of 𝑦 (?=x) Look ahead (we (?=ignore) → “we”) Naïve bayes
𝐶( 𝐼 𝑙𝑖𝑘𝑒 𝑦𝑜𝑢 )
Probability Use frequency counts 𝑃( 𝑦𝑜𝑢 ∣ 𝐼 𝑙𝑖𝑘𝑒 ) = (?<=x) Look behind ((?<=ignore) this → “this”) Assumption: All features contribute independently to the
𝐶( 𝐼 𝑙𝑖𝑘𝑒 )
classification, regardless of correlations
Markov assumption 𝑃( 𝑤𝑛 ∣ 𝑤1 ∶ 𝑛−1 ) ≈ 𝑃( 𝑤𝑛 ∣ 𝑤𝑛−1 ) < em > Hello World </em > Assumption: Position does not matter, feature only
P (sequence) 𝑃(𝑤1 ∶ 𝑛 ) ≈ ∏𝑛𝑘=1 𝑃(𝑤𝑘 ∣ 𝑤𝑘−1 ) Lazy matching Match the shortest string possible encodes word identity
<. +> : < em > Hello World </em >
Use Bernoulli NB if the features are binary
Perplexity Evaluate LM on unseen test set. The higher 𝑃() Greedy matching Match the longest string possible
for test set sequences, the lower the perplexity <. +? > : < em > SVMs
Sampling Generating sentences by choosing each next
word according to its likelihood. Evaluation # correctly matched tokens / total # of tokens  Decision boundary should have an optimal distance to
the groups of classes → use support vectors
Smoothing Logistic regression  Using kernels, we can classify non-linear relations

 Out-of-voc words: words in test set not in 𝑉 Fit a line that separates two groups of data (Discriminative) Kernels transform the data, to fit a linear decision boundary
→ add <UNK> as token in the training set Train with stochastic gradient descent and Cross-Entropy loss.
 Unseen sequences (P=0): If a subsequence has P=0, Predict by assigning the label with the highest probability Generalization (- Overfitting)
P(sequence) will always equal 0.  Caused by data imbalance : Incorrect splits or large
→ Use smoothing on the probabilities KNN class differences. Use stratification or adjust the sample
Laplace 𝐶(𝑤𝑛−1 ) : Count how often each word occurs Store training matrix 𝑋𝑡𝑟𝑎𝑖𝑛 in memory  Caused by hyperparameters : must be properly tuned
𝐶(𝑤𝑛−1 𝑤𝑛 ) : Count how often 𝑤𝑛−1 𝑤𝑛 occurs Calculate distance between 𝑋𝑡𝑒𝑠𝑡 and all 𝑥𝑡𝑟𝑎𝑖𝑛 ∈ 𝑋𝑡𝑟𝑎𝑖𝑛 Data prep is important : better representation = less noise
Backoff Get probabilities from first lower order gram Choose 𝑘 vectors from 𝑋𝑡𝑟𝑎𝑖𝑛 with highest similarity to 𝑋𝑡𝑒𝑠𝑡 = better models
Interpolation Get probabilities from all grams, weighted Assign the majority label to 𝑋𝑡𝑒𝑠𝑡 Reproducibility : find (human) errors and strengthen findings
using 𝜆 Labels can be weighted by: K-fold CV Large std in validation sets may indicate
Kneser-Ney Weigh the grams by the amount of contexts sensitivity to what data the model receives
 Frequency → Majority is preferred
they have appeared in
 Inverse frequency → Rarer classes are preferred
Sparse vs Dense
 Distance → Closer is more important
Choosing the right metric  Not all algorithms deal well with sparse matrices
Accuracy Only works with balanced data Linear regression 𝒀 = 𝜷𝟎 + 𝜷𝟏 𝑿  Runtime is generally much higher for sparse matrices
 Memory consumption vs effectiveness trade-off is poor
Precision/recall Maybe tricky to interpret Evaluate using RMSE and R-squared
 Effect of additional features is often minor
High P – low R : all pos samples are correct, few predicted
High R – low P : many positive predictions, few correctly Meaning as counts
Embeddings
F1-score Must choose micro, macro, positive class ... Distributional similarity : For row word 𝑥, count for all column
Skipgram Embeds target word, predicts 𝑛 words around it.
Omission score Delete word and measure the impact words 𝑦 how often it occurs in the context of 𝑥. If two row
Find most related words, predict the context words
word counts are similarly distributed, they are likely similar.
CBOW Predicts ‘masked’ target words.
Decision trees i.e. Similar words often share similar contexts
Embeds multiple words, predicts one word
If-else statements which splits on the feature with highest IG PPMI Positive pointwise mutual information (continuous bag of words)
How often two events 𝑥 and 𝑦 occur together
Information Gain (IG) is computed with Entropy or Gini Index compared to independent occurrences.
Document embeddings
SVD Can be used to transform sparse vectors into denser
SGNS (Skip Gram with Negative Sampling) Take the average or sum of the word embeddings.
vectors – finds the hidden underlying semantic structure
SGNS A fixed embedding is learned for each word (Latent Semantic Analysis) TF*IDF and LSA are document*word spaces.
How likely is word w to occur near word v Vector semantics are short, dense vectors that embed words Dealing with noise
Word2Vec : binary classification with logistic regressions in a particular space
 Normalize, split it, or get rid of the noise
 Target word + neighboring context word are pos samples  Make models less sensitive
 𝑘 other random words in lexicon are neg samples Neural Networks (Representation learning) o Add more data! Bigger datasets →more robust models.
 Log reg is trained to distinguish between cases
NNs Stack LogReg models with added non-linearity ▪ Synonym replacement
 Learned weights are used as embeddings
Learns representations, no need to hand-craft features We can’t know if synonyms are correct
It learns 2 embeddings for each word e.g. change words, do Word2Vec on both
FFN Fully connected neurons without cycles. Has input,
sentences and check for similarity
1. The target embedding 𝑤𝑖 ∈ 𝑊 𝑑×|𝑉𝑤 | hidden and output neurons. (Feed Forward Network)
▪ Random word insertion
2. The context embedding 𝑤𝑖 ∈ 𝐶 |𝑉𝑤 |×𝑑 Pre-training Using existing word embeddings as input
They are often added together, or 𝐶 is discarded completely. ▪ Random word swaps
The context window size 𝐿 is a tunable parameter Neural networks vs N-grams: ▪ Randomly delete words
 NNs can encode longer histories These may however change sentence definition
Embeddings are evaluated using the parallelogram method  NNs can generalize over different meanings
𝑏̂ ∗ = arg min 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 (𝑥, 𝑎∗ − 𝑎 + 𝑏)  NNs are more accurate at word prediction Noise may cause both over and underfitting. Up- and down-
x
 N-grams are faster to train due to lower complexity sampling can be used to fix skewed distributions.
𝑒. 𝑔. arg min(𝑥, 𝑘𝑖𝑛𝑔 − 𝑚𝑎𝑛 + 𝑤𝑜𝑚𝑎𝑛) ≈ 𝑞𝑢𝑒𝑒𝑛
x  N-grams are more interpretable
 N-grams are less data hungry Temporal event extraction
Bias in the text is amplified and reproduced by embeddings:
For small tasks: N-grams are preferred
Allocational harm When systems allocate resources\ Finding when the events happened.
values unfairly to different groups The embedding of a particular word is created by Temporal expressions : absolute points in time, relative
multiplying the one-hot vector by embedding matrix 𝐸. time and durations.
Representational harm Harm caused by demeaning or
ignoring certain social groups 𝑒 = [𝐸𝑥𝑡−3 ; 𝐸𝑥𝑡−2 ; 𝐸𝑥𝑡−1 ] ℎ = 𝜎(𝑊𝑒 + 𝑏) Events are verbs, noun-phrases,…(e.g. said, the increase, …)

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper NienkeUr. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €3,49. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 64450 samenvattingen verkocht

Opgericht in 2010, al 15 jaar dé plek om samenvattingen te kopen

Start met verkopen
€3,49  1x  verkocht
  • (0)
In winkelwagen
Toegevoegd