Week 1
NLP – algorithms to process, analyse and understand the structure and meaning of texts in a natural
language. Key aspects are: information extraction and retrieval, text classification, sentiment
extraction, et cetera.
However, interpreting language is hard because of:
- Ambiguity: sentences/words can have multiple meanings
- Compositionality: certain expressions must be composed in a fixed way (e.g. gin & tonic
instead of tonic & gin).
- Co-reference: to which object/person does the pronoun belong
- Non-standard language: symbols, abbreviations and acronyms
- Neologisms: newly coined words or expressions (eg bromance/retweet)
Morphology
Every NLP task needs to do text normalization:
- Tokenizing words (separating the individual words/symbols in sentences)
- Normalizing word formats
- Segmenting sentences
Lemma – words that share the same stem (the base of the word) and roughly the same word sense (e.g. cat – cats,
be – is)
Type – an element of the vocabulary (different words, different symbols et cetera). |V| is the size of
the vocabulary.
Token – an instance of a type in running text. N is the total number of tokens (word occurrences).
The number of tokens is always at least the number of types, and the type/token ratio is a measure of
complexity.
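A minimal Python sketch (my own illustration, not from the notes) of counting types and tokens and computing the type/token ratio on a toy sentence:

```python
# Toy illustration of types, tokens and the type/token ratio.
# Tokenization here is just lower-casing plus whitespace splitting;
# a real pipeline would use a proper tokenizer.
text = "the cat sat on the mat because the mat was warm"
tokens = text.lower().split()   # every occurrence counts
types = set(tokens)             # distinct vocabulary items

N = len(tokens)                 # number of tokens
V = len(types)                  # |V|, size of the vocabulary
print(N, V, V / N)              # 11 tokens, 8 types, ratio ≈ 0.73
```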
Tokenization
Compound splitter – splits a compound word into its component words (useful in German)
Often, before we can start tokenizing, we first need to pre-process the data:
- Normalization
• Text and query terms must have the same form (e.g. USA must match U.S.A.)
• Define equivalence classes (e.g. terms may not contain a '.', so U.S.A. and USA fall into the same class)
- Case folding: reduce all letters to lower case. Disadvantage: lose information
- Remove spaces for compound words or collocations (eg New York → NewYork)
- Lemmatization: change every word to its base form (e.g. am/is/are → be). Preferred to
stemming with small training sets.
- Morphology: reduce words to their morphemes (the smallest meaningful units of a word, e.g. walked =
walk + ed).
- Stemming: reduce terms to their stems (chop off the affixes). It is aggressive because it can
chop off so much that the word loses its meaning (a small sketch of case folding and naive stemming follows after this list).
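A small Python sketch of some of the normalization steps above (case folding plus a deliberately naive suffix-stripping "stemmer"); the suffix list is my own toy example, not a real stemming algorithm:

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: chop off a few common English suffixes.
    Illustrates how stemming works and why it can be too aggressive."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text: str) -> list[str]:
    tokens = text.split()                                # naive whitespace tokenization
    tokens = [t.lower().strip(".,!?") for t in tokens]   # case folding + strip punctuation
    return [naive_stem(t) for t in tokens]

print(normalize("He walked his dogs in New York."))
# ['he', 'walk', 'his', 'dog', 'in', 'new', 'york']
```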
Different ways to form a word:
- Inflection: modifying a word with an affix to change its grammatical function (book → books)
- Derivation: adding an affix to a stem to create a new word (great → greatly)
- Compounding: combining two stems (key+board = keyboard, lawsuit, bookcase)
Another problem with tokenization is deciding whether a punctuation mark ends a sentence or not. For
! or ? this is fairly easy, but for the . it can be difficult → it is often used in abbreviations. To solve this,
a decision tree or binary classifier can be implemented.
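A hedged sketch of the idea: instead of the decision tree / classifier mentioned above, this is a hand-written rule that uses the same kind of features (is the word before the period a known abbreviation, does an upper-case letter follow) to decide whether a '.' ends a sentence. The abbreviation list is a made-up example:

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc"}   # toy list, not exhaustive

def is_sentence_end(text: str, i: int) -> bool:
    """Decide whether the '.' at position i ends a sentence."""
    assert text[i] == "."
    match = re.findall(r"[\w.]+$", text[:i])               # word directly before the period
    prev_word = match[0].lower() if match else ""
    follows = text[i + 1:].lstrip()
    # '!' and '?' would always end a sentence; '.' only if the word before it
    # is not a known abbreviation and the next token starts a new sentence.
    return prev_word not in ABBREVIATIONS and (not follows or follows[0].isupper())

text = "Dr. Smith arrived. He was late."
ends = [i for i, ch in enumerate(text) if ch == "." and is_sentence_end(text, i)]
print(ends)   # positions of the two real sentence boundaries, not the one after 'Dr'
```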
Basic text processing
Regular expressions – a formal language for specifying text strings.
Special symbols in regular expressions:
Symbol – Meaning
[…] – any of the characters/digits inside the brackets can match, or a range […-…]
[^…] – negation of whatever comes after the ^
…? – whatever comes before the ? is optional
…* – 0 or more times of whatever comes before the *
…+ – 1 or more times of whatever comes before the +
. – anything can be placed on the position of the .
^… – start of string/line
…$ – end of string/line
\.$ – the string/line must end in whatever comes after the \ (here a literal .)
…|… – disjunction, either side can match
…[…]… – whatever is outside the brackets must appear literally in the string/line
\b…\b – word boundary: no other word characters directly before or after whatever is inside
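A few of the symbols from the table, tried out with Python's re module (the example strings are my own):

```python
import re

# Character class and range: any single vowel or digit
print(re.findall(r"[aeiou0-9]", "NLP in 2024"))        # ['i', '2', '0', '2', '4']

# Optionality: colou?r matches both spellings
print(re.findall(r"colou?r", "color and colour"))      # ['color', 'colour']

# Word boundaries: match 'the' as a word, not inside 'other' or 'theory'
print(re.findall(r"\bthe\b", "the other theory"))      # ['the']

# Disjunction and end-of-string anchor: a line ending in ! or ?
print(bool(re.search(r"(!|\?)$", "Really?")))          # True
```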
Regular expressions are often used as features in machine learning classifiers. They can also be used to
capture generalizations, and they are often the first model for any text processing task.
False positive error (Type I): matching strings that we should not have matched. This is reduced by
increasing accuracy or precision.
False negative error (Type II): not matching things that we should have matched. This is reduced by
increasing coverage or recall.
Similarity of strings
Minimum edit distance – the minimum number of editing operations (insertion, deletion, substitution) needed to
change one string into the other. Levenshtein distance – the variant where substitution has a cost of 2.
To find the minimum edit distance we search for a path (sequence of edits) from the start string to
the final string:
- Initial state: word we’re transforming
- Operators: insert, delete, substitute
- Goal state: word we’re trying to get
- Path cost: the number of edits (that we try to minimize)
For string x of length n and string y of length m, D(i, j) is the edit distance between X[1…i] and Y[1…j].
The edit distance between X and Y is thus D(n, m).
Dynamic programming – a tabular computation of D(n, m) that solves problems by combining
solutions to subproblems. It’s a bottom-up algorithm that computes D(i, j) for small i, j and computes
a larger D(i, j) based on previously computed smaller values.
DP for Levenshtein minimum edit distance:
- Initialization:
D(i, 0) = i
D(0, j) = j
- Recurrence relation:
For each i = 1…n
For each j = 1…m
D(i, j) = min of:
D(i − 1, j) + 1
D(i, j − 1) + 1
D(i − 1, j − 1) + (2 if X(i) ≠ Y(j), 0 if X(i) = Y(j))
ptr(i, j) = LEFT, DOWN or DIAGONAL, depending on which of the three options gave the minimum
- Termination:
D(n, m) is the distance
[Figure omitted: example edit distance table.]
The edit distance alone is not sufficient, since each character often also needs to be
aligned to the other string.
This is done by keeping a backtrace (the ptr part in the formula): every time we enter a cell, we remember
where we came from, and once we reach the end we trace the path back from the final cell D(n, m) (the upper
right corner of the table) to read off the alignment.
Performance minEdit: O(nm) for time and space. Performance backtrace: O(n+m).
An optimal alignment is composed of optimal subalignments.
Weighted edit distance – weights are added to computation. This is useful for spell correction (some
letters are more likely to be mistyped) and biology (certain kinds of deletions or insertions are more
likely). So instead of the + 1 and + 2/0, we have the weight of the insertion, deletion and substitution.
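A minimal Python sketch of the dynamic-programming recurrence above. The costs are parameters, so both plain Levenshtein distance (insertion/deletion 1, substitution 2) and the weighted variant fit the same code; the function name is my own choice:

```python
def min_edit_distance(x: str, y: str, ins=1, dele=1, sub=2) -> int:
    """Bottom-up DP where D[i][j] = edit distance between x[:i] and y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):        # initialization: D(i, 0) = i (deletions only)
        D[i][0] = i * dele
    for j in range(m + 1):        # initialization: D(0, j) = j (insertions only)
        D[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + dele,                                      # delete x[i]
                D[i][j - 1] + ins,                                       # insert y[j]
                D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub),  # (mis)match
            )
    return D[n][m]                # termination: D(n, m) is the distance

print(min_edit_distance("intention", "execution"))   # 8 with Levenshtein costs
```

Keeping a parallel ptr table that records which of the three options was chosen in each cell would give the backtrace and the alignment described above.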
Language Models
Probabilistic language models compute a probability of a sentence or a sequence of words. Usually
it’s P(w1, w2, …, wn-1, wn):
- P(w1 w2 … wn ) = ∏i P(wi |w1 w2 … wi−1 )
- P(B|A) = P(A, B) / P(A) ⟺ P(A, B) = P(A) · P(B|A)
Markov assumption – we can approximate the probability of a sentence by conditioning each word only on
the last few words: P(w1 w2 … wn ) ≈ ∏i P(wi | wi−k … wi−1 ). Example: P(the | water
is so transparent that) ≈ P(the | that). Multiple models:
- Unigram model: P(w1 w2 … wn ) ≈ ∏i P(wi )
- Bigram model: P(w1 w2 … wn ) ≈ ∏i P(wi |𝑤𝑖−1 )
- Trigram model: P(w1 w2 … wn ) ≈ ∏i P(wi |𝑤𝑖−2 , 𝑤𝑖−1 )
- Where P(wi | wi−1 ) = count(wi−1 , wi ) / count(wi−1 ) (using the maximum likelihood estimate; see the code sketch after this list)
- Transform probabilities to log space to avoid underflow: log(p1 × p2) = log(p1) + log(p2).
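A small sketch of the bigram maximum likelihood estimate from the list above (counting is done with plain Python Counters; the toy corpus and sentence markers are my own example):

```python
from collections import Counter
import math

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)                 # count(w)
    bigrams.update(zip(words, words[1:]))  # count(w_{i-1}, w_i)

def p_mle(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# Probability of a new sentence under the bigram model, computed in log space
sentence = "<s> I am Sam </s>".split()
logp = sum(math.log(p_mle(w, prev)) for prev, w in zip(sentence, sentence[1:]))
print(math.exp(logp))   # ≈ 0.11
```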
N-gram models are insufficient, because language has long-distance dependencies. Also, they only
work well for word prediction if the test corpus looks like the training corpus.
Smoothing – a method used when the test set is not (as) similar to the training set (i.e. bigrams with
probability 0 occur, which zeroes out the whole sentence probability, and unseen histories even cause division by 0)
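One simple smoothing method, add-one (Laplace) smoothing (named here as an example; the notes do not specify which method is used), pretends every bigram was seen one extra time. A sketch reusing the unigram/bigram Counters from the previous example:

```python
def p_laplace(w, prev, unigrams, bigrams, V):
    """Add-one smoothed bigram probability:
    (count(prev, w) + 1) / (count(prev) + V), with V the vocabulary size."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# Usage with the Counters from the bigram sketch above: V = len(unigrams)
```

Because of the + V in the denominator, even an unseen history no longer causes a division by zero, and bigrams that never occurred in training get a small non-zero probability instead of 0.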