Natural Language Processing 2021/2022 Summary of the lectures and reading material

  • 15 June 2022
  • 36 pages
  • 2021/2022
  • Summary
  • Seller: ninavangulik
Summary Natural Language Processing

Analyzing Language

Variation, Ambiguity, and Creativity
Variation → The same meaning can be expressed in different ways
o Lexical variation (linguistic variation) → different words for the same concept (e.g., 92 words
for walking)
o Syntactic variation (grammatical) → the subject and the direct/indirect object can change position
o Language use varies → the way a language is used differs:
§ Across domains (e.g., jargon)
§ Over time (e.g., semantic change)
§ Across regions (e.g., dialect, language varieties)
§ Across socioeconomic groups (e.g., slang)
§ Across speakers (e.g., in age, gender, native language)
§ With respect to situational context (e.g., formal/non-formal, tired, stressed)
Ambiguity → The same form can have different meanings
o Phonetic ambiguity → here/hear, two/too/to, write/right
o Lexical ambiguity (word-level ambiguity) → bat (animal or baseball bat?)
o Semantic ambiguity (sentence-level ambiguity) → John and Mary are married (to each other,
or each to someone else?)
o Syntactic ambiguity → I’ve been chasing the kid on the bicycle (who is riding the
bike?)
Creativity
o Language is compositional → we can produce sentences that have never been said before
o Language users are creative
§ Neologisms: hangry
§ Compounds (combining two or more free morphemes to create a new word): olive oil vs. baby oil
§ Sarcasm and irony: That’s just what I needed today

Definitions: token, word, sub-word, lemma, morpheme, POS-tag, named entity, content
word and function word

Token = A word or punctuation mark as it occurs in the text
o “a good wine is a wine that you like.” → 10 tokens
Type = A distinct token; the number of types is the number of different tokens
o “a good wine is a wine that you like.” → 8 types
Word = A token that is not punctuation
o wine, good, is, a
Sub-word = A smaller meaningful unit that is part of a longer word. Frequently used words
are not split, and rare words are split into smaller meaningful subwords.
o lowercase → lower ##case
Lemma = The canonical form, dictionary form or citation form of a word
o Nouns: citation form is singular
§ Mice → mouse
o Verbs: citation form is infinitive
§ Driving, drives, driven, drove → drive
o Adjectives: citation form is the positive
§ Happy, happier, happiest → happy
Morpheme = The smallest meaningful lexical item in a language (can be combined to derive
new words)
Free morpheme = Can appear in isolation
o woman in womanly
Bound morpheme = Cannot appear in isolation
o -ed, -s, -es, -sent
o sent in dissent
o Two functions:
§ Inflectional morphology → conveys grammatical information, such as
number, tense, agreement or case.
§ Derivational morphology → creates new words from existing ones (e.g., woman → womanly).
N-gram = A sequence of n consecutive tokens
POS-tag = The part-of-speech label of a word: noun, verb, adjective, adverb, etc.
Named entity = An entity referred to by name, such as a person, organization or date
Content word = An open-class word. New words can be added to these classes; a new word is
always an adj, adv, intj, noun, propn or verb.
Function word = A closed-class word. Function words mainly express grammatical relations
rather than content: adp, aux, cconj, det, num, part, pron, sconj
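
The token, type, word and n-gram definitions above can be checked on the example sentence with a minimal sketch. The regex tokenizer and the lowercasing step are assumptions made for illustration; they are not the tokenizer used in the course.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(tokens, n):
    # All sequences of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "a good wine is a wine that you like."
tokens = tokenize(sentence)                          # includes the final "."
types = set(tokens)                                  # distinct tokens
words = [t for t in tokens if re.match(r"\w", t)]    # tokens minus punctuation
bigrams = ngrams(tokens, 2)

print(len(tokens), len(types), len(words))  # 10 8 9
print(bigrams[:2])  # [('a', 'good'), ('good', 'wine')]
```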




Text normalization, sentence segmentation, tokenization, byte-pair encoding,
Lemmatization, POS-tagging and named entity recognition

Text normalization = Converting text to a standard, cleaned form (e.g., removing the layout)
o Disadvantage = information can be lost
o Ways to clean (which depends on the task)
§ Uppercase vs lowercase vs true case
§ Normalize punctuation
§ Remove/replace emojis and urls
§ Spelling correction
§ Replace numbers with NUM, links with URL
§ Anonymization
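
A few of these cleaning steps can be sketched as follows. The regular expressions and the function name `normalize` are illustrative assumptions; which steps to apply depends on the task.

```python
import re

# Assumed patterns: simple http(s) links and integers/decimals like 4,99 or 3.14.
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(text):
    text = text.lower()                # uppercase -> lowercase
    text = URL_RE.sub("URL", text)     # replace links with URL
    text = NUM_RE.sub("NUM", text)     # replace numbers with NUM
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace (layout)

print(normalize("Visit https://example.com NOW,\n only 4,99 euros!"))
# visit URL now, only NUM euros!
```

Links are replaced before numbers so that digits inside a URL are not turned into NUM.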
Sentence segmentation = splitting a text into sentences, for example at sentence-final
punctuation
Tokenization (word segmentation) = splitting a sentence into its parts (tokens)
o There are some tricky cases where you have to choose: [isn’t], [is] [not] etc.
o In other languages or alphabets it can be even more complicated, for example in
French with l’ensemble etc.
o Needs to be run before any other language processing → therefore needs to be very
fast
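
Both steps can be sketched with regular expressions. This is a naive illustration under stated assumptions: sentences end at ., ! or ? followed by whitespace (so abbreviations such as "Dr." would be split wrongly), only the ASCII apostrophe is handled, and splitting "isn't" into "is" + "n't" is just one of the possible choices.

```python
import re

def split_sentences(text):
    # Split after sentence-final punctuation that is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence, split_clitics=True):
    if split_clitics:
        # One possible choice: "isn't" -> "is" + "n't".
        sentence = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence)
    # A word (optionally with an apostrophe part) or one punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(split_sentences("It rains. Stay inside! Why not?"))
# ['It rains.', 'Stay inside!', 'Why not?']
print(tokenize("This isn't hard."))
# ['This', 'is', "n't", 'hard', '.']
print(tokenize("This isn't hard.", split_clitics=False))
# ['This', "isn't", 'hard', '.']
```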
Unknown word problem = words in the test data which did not occur in the training data
o Solution = simplification = splitting unknown words into subwords
o Idea = frequent words are kept as single tokens, less frequent ones are decomposed into parts
o Example: byte-pair encoding
Byte-pair encoding = Ensuring that the most common words are represented in the
vocabulary as a single token while the rare words are broken down into two or more
subword tokens
o Two parts: token learner and token segmenter
§ Token learner = takes raw training corpus and induces a vocabulary, a set of
tokens.
§ Token segmenter = takes raw test sentence and segments it into the tokens
in the vocabulary
o Token learner algorithm:
§ Initialize the vocabulary with the set of characters (symbols)
§ Count which pair of symbols occurs most frequently next to each other in the
dataset
§ Merge that pair into a single symbol in the dataset and add it to the
vocabulary
§ Repeat with the updated vocabulary
o Stop after k merges
§ Where k is the number of novel tokens to learn.
§ The vocabulary grows by k new symbols
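
The token learner and token segmenter can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the end-of-word marker "_", the function names and the toy corpus are assumptions, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` by one merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(corpus, k):
    # Token learner: start from characters and learn up to k merges.
    words = Counter(tuple(w) + ("_",) for w in corpus.split())
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def segment(word, merges):
    # Token segmenter: apply the learned merges in the order they were learned.
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = " ".join(["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6
                  + ["wider"] * 3 + ["new"] * 2)
merges = learn_bpe(corpus, k=4)
print(merges)                    # [('e', 'r'), ('er', '_'), ('n', 'e'), ('ne', 'w')]
print(segment("newer", merges))  # ['new', 'er_']
print(segment("lower", merges))  # ['l', 'o', 'w', 'er_']
```

The unseen word "lower" is still segmented: it falls back to characters plus the learned subword "er_", which is how subword tokenization addresses the unknown word problem.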




Lemmatization = mapping words to their lemma
o Disadvantage = information is lost (e.g., tense and number)
o 'walk', 'walked', 'walks' or 'walking' → walk
POS-tagging = assigning a POS-tag (noun, verb, etc.) to every word in a sentence
Named entity recognition = Identifying proper names (detecting names, dates and
organisations)
o Mention = introduction of a named entity
§ Abraham Lincoln was …
o References = Expressions referencing a previously introduced named entity
§ Lincoln, the president, he ..
o Co-reference resolution = identifying references to the same entity
o Entity linking: mapping mentions and references to entities in a database
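BIO tagging = a tagging scheme for NER that marks the span of each named entity: B marks the
beginning of an entity, I a token inside an entity and O a token outside any entity (the
starting point is marked explicitly, the ending point is not)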
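
As a minimal sketch, entity spans can be converted to BIO tags as follows. The span format (start, end, label) with an exclusive end, the labels PER and LOC, and the example sentence are assumptions chosen for illustration.

```python
def bio_tags(tokens, entities):
    # entities: list of (start, end, label) in token indices, end exclusive.
    tags = ["O"] * len(tokens)           # default: outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags

tokens = ["Abraham", "Lincoln", "was", "born", "in", "Kentucky", "."]
entities = [(0, 2, "PER"), (5, 6, "LOC")]
print(bio_tags(tokens, entities))
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O']
```

An entity ends where the next tag is not an I tag with the same label, so no separate end marker is needed.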

Most Frequent Class Baseline
Most Frequent Class Baseline: Always compare a classifier against a baseline at least as good
as the most frequent class baseline (assigning each token to the class it occurred in most
often in the training set).
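
A minimal sketch of such a baseline for tagging is shown below. The toy training data and the fallback for unseen tokens (the overall most frequent tag) are assumptions; the text above does not specify how unknown tokens are handled.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)   # token -> counts of its tags
    all_tags = Counter()            # overall tag counts
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
            all_tags[tag] += 1
    most_frequent = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    default = all_tags.most_common(1)[0][0]  # assumed fallback for unseen tokens
    return most_frequent, default

def predict(tokens, most_frequent, default):
    return [most_frequent.get(tok, default) for tok in tokens]

train = [
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("dogs", "NOUN"), ("run", "VERB")],
    [("cats", "NOUN"), ("chase", "VERB"), ("mice", "NOUN")],
]
most_frequent, default = train_baseline(train)
print(predict(["the", "run", "zebra"], most_frequent, default))
# ['DET', 'VERB', 'NOUN']
```

A real classifier should be compared against the accuracy of this baseline on the same test set.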

The 17 universal POS-tags
Function words → of, and, to, a, in, is, for, that, as, …
o Consist of:
§ Determiners
• Those, these, that, this …
§ Pronouns
§ Conjunctions
§ Prepositions
§ Complementizers (words that link two clauses)
• That, whether
o Meaning is grammatical
o Not productive
o Highly frequent
o There is a big overlap between frequent function words and stop words.
Content words → beautiful, Australian, fishing, children, buy, fast, umbrella…
o Nouns, verbs, adjectives, adverbs
o Meaning is informative, grounded in the world
o Productive
o The more specific, the less frequent
o Endocentric compounds = the compound is a kind of its head, so the meaning follows from the parts (yoga pants, dog food)
o Exocentric compounds = the meaning does not follow from the head (Facebook, redhead)

Open class words             Closed class words         Other
Adj (big, old, green)        Adp (in, to, under)        Punct (., , , ())
Adv (very, well, tomorrow)   Aux (has, was, should)     Sym ($, %, +)
Intj (oh, um, yes, hello)    Cconj (and, or, but)       X (xfgh, pdl, jklw)
Noun (girl, cat, tree)       Det (a, an, the, this)
Propn (Mary, HBO, London)    Num (0, one, second)
Verb (run, ate, eating)      Part (up, down, on, off)
                             Pron (I, you, he, she)
                             Sconj (that, which)
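
The grouping of the 17 tags can be written down as a small lookup, for example to filter the content words out of a tagged sentence. The names `UPOS_CLASSES`, `word_class` and `content_words` are hypothetical, and the tagged example is made up for illustration.

```python
UPOS_CLASSES = {
    "open": {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"},
    "closed": {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"},
    "other": {"PUNCT", "SYM", "X"},
}

def word_class(tag):
    # Return "open", "closed" or "other" for a universal POS-tag.
    for name, tags in UPOS_CLASSES.items():
        if tag.upper() in tags:
            return name
    raise ValueError(f"unknown POS-tag: {tag}")

def content_words(tagged_tokens):
    # Keep only the tokens whose tag belongs to an open class.
    return [tok for tok, tag in tagged_tokens if word_class(tag) == "open"]

tagged = [("the", "DET"), ("old", "ADJ"), ("cat", "NOUN"),
          ("was", "AUX"), ("eating", "VERB"), (".", "PUNCT")]
print(content_words(tagged))  # ['old', 'cat', 'eating']
```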
