Preliminaries
Bag-of-words: Is the word in the sentence: yes (1) or no (0)
Words are the features (vocabulary V)
Documents are instances (D = {d_1, ..., d_N})
X represents the feature space (numbers)
Binary BOW: compact and straightforward, but uninformative
Each d_i ∈ D can be represented as a term frequency vector:
d_i = ⟨tf(t_1, d_i), ..., tf(t_M, d_i)⟩
tf(t_m, d): The frequency of term t_m ∈ T for doc d
These generally have a logarithmic distribution
Inverse document frequency is better, as rare items are more informative (df_t is the # of docs term t occurs in)
Tf-idf: Combines term frequency and inverse document frequency
To account for document length: use L2-normalization
Jaccard coeff: Compute document similarity
Cosine similarity: Document similarity (p · q = Σ(p_i × q_i)); uses length normalization
Types are unique tokens. The larger the work, the more types.
Heaps'/Herdan's law: |V| = k N^β with 0 < β < 1, where N is the document length
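A minimal sketch of the tf-idf and cosine-similarity ideas above, assuming scikit-learn is available; the toy corpus is made up for illustration:

```python
# Sketch: tf-idf vectors with L2 normalization and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the dog sat on the log", "stocks fell sharply today"]

# TfidfVectorizer combines tf and idf and L2-normalizes each document vector.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)          # |D| x |V| sparse matrix

# Because rows are L2-normalized, cosine similarity equals the dot product p . q.
print(cosine_similarity(X[0], X[1]))        # related topic -> relatively high
print(cosine_similarity(X[0], X[2]))        # unrelated topic -> near 0
```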
System evaluation
False positive: Matches when it shouldn't
False negative: Doesn't match when it should
Mean reciprocal rank: MRR = (1/Q) Σ_i 1/rank_i, where Q is the # of queries and rank_i is the rank of the first relevant result for query i
Average precision: AP = Σ_k P(k) · rel(k) / (# of relevant items), where rel(k) is 1 when the item at rank k is relevant and P(k) is the precision at that rank
Cohen's κ indicates how much annotators agree on the label. Many annotators are used to minimize bias resulting from cultural background etc.

Normalization
Common approaches for normalization:
Case folding: The → the
Lemmatization: is → be, reading → read
(Porter) Stemming: cats → cat, copy → copi
Min edit distance: Min # of editing operations needed to transform one string into the other
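A minimal sketch of minimum edit distance via dynamic programming; unit costs for all three operations are an assumption, since the sheet does not fix them:

```python
# Sketch: minimum edit distance (insert/delete/substitute, unit costs assumed).
def min_edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # dp[i][j] = edits needed to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[m][n]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```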
RegEx
.        Single char
|        Or
[abc]    Any of these chars
[^ab]    None of these chars
[a-z]    Any char in range
[^a-z]   No char in this range
^        Starts with
$        Ends with (be$ → "to be")
?        0 or 1 time (greedy)
??       0 or 1 time (lazy)
*        0+ times (greedy)
*?       0+ times (lazy)
+        1+ times (greedy)
+?       1+ times (lazy)
{n}      n times
{n,m}    n to m times
{n,}     n+ times
\s       White space
\S       Not a white space
\d       Digit
\D       Not a digit
\w       Letter
\W       Not a letter
\b       Word boundary (a change from \w to \W or the inverse)
\        Escape character (yes\? → "yes?", not "ye")
()       Group ((do|get) that → "do that")
(?:x)    Passive group: matches on x but does not include x in the output
(?=x)    Look ahead (we (?=ignore) → "we")
(?<=x)   Look behind ((?<=ignore) this → "this")

Given the text "<em> Hello World </em>":
Greedy matching: Match the longest string possible. <.+> → "<em> Hello World </em>"
Lazy matching: Match the shortest string possible. <.+?> → "<em>"
Evaluation: # correctly matched tokens / total # of tokens
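A quick sketch of greedy vs. lazy matching and the look-around operators with Python's re module; the HTML snippet is the sheet's own example, the other strings are made up:

```python
import re

text = "<em> Hello World </em>"

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
print(re.search(r"<.+>", text).group())     # '<em> Hello World </em>'

# Lazy: .+? stops at the first '>' that lets the pattern succeed.
print(re.search(r"<.+?>", text).group())    # '<em>'

# Look-ahead / look-behind only assert context; the asserted part is not in the match.
print(re.search(r"we(?= ignore)", "we ignore this").group())     # 'we'
print(re.search(r"(?<=ignore )this", "we ignore this").group())  # 'this'
```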
N-grams
n-grams: sequences of n words (unigram, bigram, ...)
bigram: Count for x and y how often x follows y
Bigram prob: divide by the total # of occurrences of y
Probability: Use frequency counts, e.g. P(you | I like) = C(I like you) / C(I like)
Markov assumption: P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-1})
P(sequence): P(w_{1:n}) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
Perplexity: Evaluate an LM on an unseen test set. The higher P() for the test-set sequences, the lower the perplexity.
Sampling: Generate sentences by choosing each next word according to its likelihood.

Smoothing
Out-of-vocabulary words: words in the test set that are not in V → add <UNK> as a token in the training set
Unseen sequences (P=0): if a subsequence has P=0, P(sequence) will always equal 0 → use smoothing on the probabilities
Laplace: P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + |V|), where C(w_{n-1}) counts how often each word occurs and C(w_{n-1} w_n) counts how often the bigram w_{n-1} w_n occurs
Backoff: Get probabilities from the first lower-order gram
Interpolation: Get probabilities from all grams, weighted using λ
Kneser-Ney: Weigh the grams by the number of contexts they have appeared in
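A small sketch of bigram probabilities with add-one (Laplace) smoothing; the toy corpus and tokenization are made up for illustration:

```python
# Sketch: bigram probabilities with Laplace (add-one) smoothing on a toy corpus.
from collections import Counter

corpus = "i like you . i like tea . you like tea .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(prev: str, word: str) -> float:
    # P(word | prev) = (C(prev word) + 1) / (C(prev) + |V|)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("i", "like"))   # seen bigram  -> relatively high
print(p_laplace("tea", "i"))    # unseen bigram -> small but non-zero
```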
Naïve Bayes
Assumption: All features contribute independently to the classification, regardless of correlations
Assumption: Position does not matter, a feature only encodes word identity
Use Bernoulli NB if the features are binary

Logistic regression
Fit a line that separates two groups of data (discriminative)
Train with stochastic gradient descent and cross-entropy loss
Predict by assigning the label with the highest probability
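A sketch tying both classifiers to binary BOW features, using scikit-learn; the tiny dataset is made up, and SGDClassifier(loss="log_loss") is used here as one way to get SGD with cross-entropy in recent scikit-learn versions:

```python
# Sketch: Bernoulli Naive Bayes vs. logistic regression (SGD + cross-entropy)
# on binary bag-of-words features. Toy data, for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import SGDClassifier

texts = ["great movie", "loved it", "terrible film", "hated it"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = CountVectorizer(binary=True).fit_transform(texts)  # binary BOW

nb = BernoulliNB().fit(X, labels)                      # binary features -> Bernoulli NB
lr = SGDClassifier(loss="log_loss").fit(X, labels)     # logistic regression via SGD

print(nb.predict(X), lr.predict(X))
```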
SVMs
The decision boundary should have an optimal distance to the groups of classes → use support vectors
Using kernels, we can classify non-linear relations
Kernels transform the data to fit a linear decision boundary

Linear regression
Y = β_0 + β_1 X
Evaluate using RMSE and R-squared

Decision trees
If-else statements which split on the feature with the highest IG
Information Gain (IG) is computed with Entropy or Gini Index

KNN
Store the training matrix X_train in memory
Calculate the distance between X_test and all x_train ∈ X_train
Choose the k vectors from X_train with the highest similarity to X_test
Assign the majority label to X_test
Labels can be weighted by:
Frequency → the majority is preferred
Inverse frequency → rarer classes are preferred
Distance → closer is more important
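A minimal KNN sketch following the steps above, with distance weighting, using scikit-learn on made-up 2-D data:

```python
# Sketch: k-nearest neighbours with distance-weighted voting on toy 2-D data.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

# weights="distance": closer neighbours count more ("Distance -> closer is more important")
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```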
Generalization (vs. overfitting)
Caused by data imbalance: incorrect splits or large class differences → use stratification or adjust the sample
Caused by hyperparameters: they must be properly tuned
Data prep is important: a better representation = less noise = better models
Reproducibility: find (human) errors and strengthen findings
K-fold CV: a large std across the validation sets may indicate sensitivity to what data the model receives

Sparse vs Dense
Not all algorithms deal well with sparse matrices
Runtime is generally much higher for sparse matrices
The memory consumption vs effectiveness trade-off is poor
The effect of additional features is often minor

Choosing the right metric
Accuracy: Only works with balanced data
Precision/recall: May be tricky to interpret
High P – low R: the predicted positives are correct, but few of the actual positives are found
High R – low P: many positive predictions, but few of them are correct
F1-score: Must choose micro, macro, positive class, ...
Omission score: Delete a word and measure the impact
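A short sketch of the precision/recall trade-off and the F1 averaging choices, using scikit-learn on made-up predictions:

```python
# Sketch: precision, recall and F1 with different averaging, on made-up labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # conservative model: high P, low R

print(precision_score(y_true, y_pred))            # 1.0  (every predicted positive is correct)
print(recall_score(y_true, y_pred))               # 0.25 (few actual positives are found)
print(f1_score(y_true, y_pred))                   # 0.4  (positive class only)
print(f1_score(y_true, y_pred, average="macro"))  # 0.6  (unweighted mean over classes)
print(f1_score(y_true, y_pred, average="micro"))  # 0.7  (equals accuracy here)
```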
Meaning as counts
Distributional similarity: for a row word x, count for every column word y how often y occurs in the context of x. If two row words' counts are similarly distributed, the words are likely similar, i.e. similar words often share similar contexts.
PPMI (Positive pointwise mutual information): how often two events x and y occur together, compared to how often they would occur independently.
SVD: can be used to transform sparse vectors into denser vectors; it finds the hidden underlying semantic structure (Latent Semantic Analysis)
Vector semantics are short, dense vectors that embed words in a particular space
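A compact sketch of the PPMI weighting described above, computed from a made-up word-by-context count matrix:

```python
# Sketch: PPMI from a made-up word-by-context co-occurrence count matrix.
import numpy as np

counts = np.array([[10., 0., 2.],     # rows = target words, columns = context words
                   [ 1., 8., 1.],
                   [ 0., 1., 9.]])

total = counts.sum()
p_xy = counts / total                       # joint probabilities
p_x = p_xy.sum(axis=1, keepdims=True)       # row (word) marginals
p_y = p_xy.sum(axis=0, keepdims=True)       # column (context) marginals

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))       # how much more often than chance
ppmi = np.maximum(pmi, 0)                   # keep only positive associations
ppmi[np.isnan(ppmi)] = 0                    # guard against 0/0 cells

print(np.round(ppmi, 2))
```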
Embeddings
Skipgram: Embeds the target word and predicts the n words around it. Find the most related words, predict the context words.
CBOW (continuous bag of words): Predicts 'masked' target words. Embeds multiple words, predicts one word.

Document embeddings
Take the average or sum of the word embeddings.
TF-IDF and LSA are document×word spaces.

SGNS (Skip Gram with Negative Sampling)
A fixed embedding is learned for each word: how likely is word w to occur near word v?
Word2Vec: binary classification with logistic regression
The target word + a neighboring context word are positive samples
k other random words in the lexicon are negative samples
The logistic regression is trained to distinguish between the two cases
The learned weights are used as embeddings
It learns 2 embeddings for each word:
1. The target embedding w_i ∈ W, with W ∈ R^(d×|V_w|)
2. The context embedding c_i ∈ C, with C ∈ R^(|V_w|×d)
They are often added together, or C is discarded completely.
The context window size L is a tunable parameter.
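A sketch of how these SGNS hyperparameters map onto gensim's Word2Vec, assuming gensim is installed; the corpus and parameter values are illustrative:

```python
# Sketch: SGNS via gensim -- sg=1 selects skip-gram, negative=k sets the number
# of negative samples, window=L is the context window size.
from gensim.models import Word2Vec

sentences = [["i", "like", "tea"], ["you", "like", "coffee"], ["we", "drink", "tea"]]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

vec = model.wv["tea"]                        # the learned target embedding for "tea"
print(model.wv.most_similar("tea", topn=2))  # nearest words in the embedding space
```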
Embeddings are evaluated using the parallelogram method:
b̂* = argmin_x distance(x, a* − a + b)
e.g. argmin_x distance(x, king − man + woman) ≈ queen
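A small numpy sketch of the parallelogram (analogy) method over a made-up embedding table; with real pretrained vectors the nearest neighbour of king − man + woman is typically queen:

```python
# Sketch: parallelogram method b* = argmin_x distance(x, a* - a + b),
# using cosine similarity over a tiny made-up embedding table.
import numpy as np

emb = {                       # made-up 3-d vectors, for illustration only
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.1, 0.0]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "tea":   np.array([0.0, 0.9, 0.4]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = emb["king"] - emb["man"] + emb["woman"]   # a* - a + b

# argmin over cosine distance == argmax over cosine similarity; exclude the query words
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)   # 'queen' with these made-up vectors
```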
Bias in the text is amplified and reproduced by embeddings:
Allocational harm: when systems allocate resources/values unfairly to different groups
Representational harm: harm caused by demeaning or ignoring certain social groups

Neural Networks (Representation learning)
NNs: stack LogReg models with added non-linearity
They learn representations, so there is no need to hand-craft features
FFN (Feed Forward Network): fully connected neurons without cycles. Has input, hidden and output neurons.
Pre-training: using existing word embeddings as input
Neural networks vs N-grams:
NNs can encode longer histories
NNs can generalize over different meanings
NNs are more accurate at word prediction
N-grams are faster to train due to lower complexity
N-grams are more interpretable
N-grams are less data hungry
For small tasks: N-grams are preferred
The embedding of a particular word is created by multiplying its one-hot vector by the embedding matrix E:
e = [E x_{t−3}; E x_{t−2}; E x_{t−1}]    h = σ(W e + b)
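A numpy sketch of that embedding lookup and hidden layer, e = [E x_{t−3}; E x_{t−2}; E x_{t−1}], h = σ(We + b); all sizes and values here are made up:

```python
# Sketch: one-hot lookup into an embedding matrix E, concatenation of the
# previous 3 word embeddings, and one hidden layer h = sigma(W e + b).
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 10, 4, 8                      # vocab size, embedding dim, hidden dim (made up)

E = rng.normal(size=(d, V))             # embedding matrix: one d-dim column per word
W = rng.normal(size=(H, 3 * d))         # hidden-layer weights
b = np.zeros(H)

def one_hot(i, n=V):
    x = np.zeros(n)
    x[i] = 1.0
    return x

# previous three words as indices; E @ one_hot(i) just selects column i of E
context = [2, 7, 5]
e = np.concatenate([E @ one_hot(i) for i in context])   # e = [Ex_{t-3}; Ex_{t-2}; Ex_{t-1}]

h = 1 / (1 + np.exp(-(W @ e + b)))      # sigma = logistic sigmoid
print(h.shape)                          # (8,)
```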
Dealing with noise
Normalize, split it, or get rid of the noise
Make models less sensitive:
o Add more data! Bigger datasets → more robust models
  ▪ Synonym replacement: we can't know if the synonyms are correct; e.g. change words, run Word2Vec on both sentences and check for similarity
  ▪ Random word insertion
  ▪ Random word swaps
  ▪ Randomly delete words
  These may however change the meaning of the sentence
Noise may cause both over- and underfitting. Up- and down-sampling can be used to fix skewed distributions.
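A tiny sketch of the random-augmentation ideas above (swap, deletion); the helper names and probabilities are made up for illustration:

```python
# Sketch: simple word-level augmentations -- random swap and random deletion.
import random

def random_swap(tokens, n_swaps=1):
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.2):
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens          # never delete everything

sentence = "the model is less sensitive to noisy input".split()
print(random_swap(sentence))
print(random_deletion(sentence))   # note: swaps/deletions may change the meaning
```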
Temporal event extraction
Finding when the events happened.
Temporal expressions: absolute points in time, relative time, and durations.
Events are verbs, noun-phrases, ... (e.g. said, the increase, ...)
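A toy sketch linking this back to the RegEx section: a pattern that catches a few absolute dates and simple durations; the pattern is illustrative and far from complete:

```python
# Sketch: naive regex for a few temporal expressions (absolute dates, durations).
import re

pattern = re.compile(
    r"\b\d{1,2} (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}\b"
    r"|\b(yesterday|today|tomorrow)\b"
    r"|\b\d+ (days?|weeks?|months?|years?)\b"
)

text = "The company said on 3 March 2021 that profits rose; the increase lasted 2 years."
print([m.group() for m in pattern.finditer(text)])   # ['3 March 2021', '2 years']
```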