100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Samenvatting Natural Language Processing (NLP) €7,49
In winkelwagen

Samenvatting

Samenvatting Natural Language Processing (NLP)

 0 keer verkocht

Samenvatting van het vak Natural Language Processing, gegeven aan de Universiteit Maastricht

Voorbeeld 3 van de 23  pagina's

  • 23 oktober 2021
  • 23
  • 2018/2019
  • Samenvatting
Alle documenten voor dit vak (1)
avatar-seller
FamkeNouwens
Natural Language Processing Famke Nouwens

Week 1
NLP – algorithms to process, analyse and understand structure and meaning texts in a natural
language. Key aspects are: information extraction and retrieval, text classification, extracting
sentiment et cetera.
However, interpreting language is hard because of:
- Ambiguity: sentences/words can have multiple meanings
- Compositionality: certain expressions need to be composed in a set way (eg gin & tonic
instead of tonic & gin).
- Co-reference: to which object/person does the pronoun belong
- Non-standard language: symbols, abbreviations and acronyms
- Neologisms: newly coined words or expressions (eg bromance/retweet)
Morphology
Every NLP task needs to do text normalization:
- Tokenizing words (separating the individual words/symbols in sentences)
- Normalizing word formats
- Segmenting sentences
Lemma – words that have the same stem (basis of the word) and a rough word sense (eg cat – cats,
be – is)
Type – an element of the vocabulary (different words, different symbols et cetera). |V| is the size of
the vocabulary.
Token – an instance of that type in running text (the number of occurrences of that type). N is the
number of tokens.
type
The number of tokens is always at least the number of types, and the token ratio is a measure of
complexity.
Tokenization
Compound splitter – splits a single word into main words (useful in German)
Often before we can start tokenizing, we first need pre-process the data:
- Normalization
• Text and query terms must have the same form (eg. USA must match U.S.A.)
• Define equivalence classes (eg. Terms may not have a . in them)
- Case folding: reduce all letters to lower case. Disadvantage: lose information
- Remove spaces for compound words or collocations (eg New York → NewYork)
- Lemmatization: change every word to its base form (eg. Am/is/are → be). Preferred to
stemming with small training sets.
- Morphology: reduce words to its morpheme (smallest meaningful unit of word eg. Walked =
walk + ed).
- Stemming: reduce terms to their stem (chop off the affixes). It is aggressive because it can
chop too much to which the word loses its meaning.
Different ways to form a word:
- Inflection: modifying a word with an affix to change its grammatical function (book → books)
- Derivation: adding an affix to a stem to create a new word (great → greatly)
- Compounding: combining two stems (key+board = keyboard, lawsuit, bookcase)

,Another problem with tokenization is deciding if punctuation means the end of a sentence or not. For
! or ? this is fairly easy, but for the . it can be difficult → it is often used in abbreviations. To solve this,
a decision tree or binary classifier can be implemented.
Basic text processing
Regular expressions – a formal language for specifying text strings.
Special symbols in regular expressions:
Symbol Meaning
[…] Any letter/digit inside can be chosen or in range […-…]
[^..] Negation of whatever comes after the ^
..? Whatever comes before ? is optional
..* 0 or more times or whatever comes before *
..+ 1 or more times or whatever comes before *
. Anything can be placed on the position of .
^… Start of string/line
…$ Denotes the end of string/line
\.$ End of string/line must be whatever comes after \ (so now .)
..|.. Disjunction, can be both
..[].. Whatever is outside the brackets must be in the string/line
\b…\b No other string before and after whatever is inside


Regular expressions are often used as features in machine learning classifiers. It can also be used in
capturing generalizations. They’re often the first model for any text processing text.
False positive error (Type I): matching strings that we should not have matched. This is reduced by
increasing accuracy or precision.
False negative error (Type II): not matching things that we should have matched. This is reduced by
increasing coverage or recall.
Similarity of strings
Minimum edit distance – minimum number of editing operations (insertion, deletion, substitution) to
change one string into the other. Levenshtein distance – substitution has a cost of 2.
To find the minimum edit distance we search for a path (sequence of edits) from the start string to
the final string:
- Initial state: word we’re transforming
- Operators: insert, delete, substitute
- Goal state: word we’re trying to get
- Path cost: the number of edits (that we try to minimize)
For string x of length n and string y of length m, D(i, j) is the edit distance between X[1…i] and Y[1…j].
The edit distance between X and Y is thus D(n, m).
Dynamic programming – a tabular computation of D(n, m) that solves problems by combining
solutions to subproblems. It’s a bottom-up algorithm that computes D(i, j) for small i, j and computes
a larger D(i, j) based on previously computed smaller values.
DP for Levenshtein minimum edit distance:
- Initialization:
D(i, 0) = i
D(0, j) = j

, - Recurrence relation:
For each i = 1…M
For each j = 1…N
D(i − 1, j) + 1
D(i, j − 1) + 1
D(i, j) = min
2 if X(i) ≠ Y(j)
D(i − 1, j − 1) + {
{ 0 if X(i) = Y(j)
LEFT
ptr(i, j) = { DOWN
DIAGONAL
- Termination:
D(N, M) is distance
Where the edit distance table is:
The edit distance is not sufficient, since often each
character needs to be aligned to the other string.




This is done by keeping a backtrace (ptr bit in formula): every time we enter a cell, we remember
where we came from and once we reach the end, we trace back the path from the upper right corner
to read off the alignment.
Performance minEdit: O(nm) for time and space. Performance backtrace: O(n+m).
An optimal alignment is composed of optimal subalignments.
Weighted edit distance – weights are added to computation. This is useful for spell correction (some
letters are more likely to be mistyped) and biology (certain kinds of deletions or insertions are more
likely). So instead of the + 1 and + 2/0, we have the weight of the insertion, deletion and substitution.
Language Models
Probabilistic language models compute a probability of a sentence or a sequence of words. Usually
it’s P(w1, w2, …, wn-1, wn):
- P(w1 w2 … wn ) = ∏i P(wi |w1 w2 … wi−1 )
P(A,B)
- P(B|A) = P(A) ⟺ P(A, B) = P(A)P(B|A)

Markov assumption – we can compute the probability of a sentence by computing the conditional
probability of the last few words: P(𝑤𝑖 | w1 w2 … wn ) ≈ ∏i P(wi |wi−k … wi−1 ). Example: P(the|water
is so transparent that) ≈ P(the|that). Multiple models:
- Unigram model: P(w1 w2 … wn ) ≈ ∏i P(wi )
- Bigram model: P(w1 w2 … wn ) ≈ ∏i P(wi |𝑤𝑖−1 )
- Trigram model: P(w1 w2 … wn ) ≈ ∏i P(wi |𝑤𝑖−2 , 𝑤𝑖−1 )
𝑐𝑜𝑢𝑛𝑡(wi−1 ,wi)
- Where P(wi |wi−1 ) = (using maximum likelihood estimate)
𝑐𝑜𝑢𝑛𝑡(wi−1 )
- Transform probabilities to log space to avoid underflow: Log(p1 x p2) = log(p1) + log(p2).
N-gram models are insufficient, because language has long-distance dependencies. Also, they only
work well for word prediction if the test corpus looks like the training corpus.
Smoothing – a method used for when the test set is not (as) similar to training set (i.e. bigrams with
probability 0 occur, which gives errors because we can’t divide by 0)

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper FamkeNouwens. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €7,49. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 65507 samenvattingen verkocht

Opgericht in 2010, al 15 jaar dé plek om samenvattingen te kopen

Start met verkopen
€7,49
  • (0)
In winkelwagen
Toegevoegd