Natural Language Processing 2021/2022: Summary of the lectures and reading material (June 15, 2022)
Summary Natural Language Processing

Analyzing Language

Variation, ambiguity, and creativity
Variation → The same meaning can be expressed in different ways
o Lexical variation → different words for the same concept (e.g., 92 words for walking)
o Syntactic variation (grammatical) → the subject and (in)direct object can change position
o Language use varies:
§ Across domains (e.g., jargon)
§ Over time (e.g., semantic change)
§ Across regions (e.g., dialects, language varieties)
§ Across socioeconomic groups (e.g., slang)
§ Across speakers (e.g., in age, gender, native language)
§ With respect to situational context (e.g., formal/informal, tired, stressed)
Ambiguity → The same form can have different meanings
o Phonetic ambiguity → here/hear, two/too/to, write/right
o Lexical ambiguity (word ambiguity) → bat (animal or baseball bat?)
o Semantic ambiguity (sentence ambiguity) → John and Mary are married (to each other,
or each married to someone else?)
o Syntactic ambiguity → I’ve been chasing the kid on the bicycle (who is riding the
bike?)
Creativity
o Language is compositional → speakers constantly produce sentences that have never
been said before
o Language users are creative
§ Neologisms: hangry
§ Compounds (combining two free morphemes into a new word): olive oil vs
baby oil
§ Sarcasm and irony: That’s just what I needed today

Definitions: token, word, sub-word, lemma, morpheme, POS-tag, named entity, content
word and function word

Token = an occurrence of a word or punctuation mark in the text
o “a good wine is a wine that you like.” → 10 tokens
Type = a distinct token; the number of types is the number of different tokens
o “a good wine is a wine that you like.” → 8 types
Word = a token excluding punctuation
o wine, good, is, a
Sub-word = a smaller meaningful unit within a word. Frequently used words are not split,
and rare words are split into smaller meaningful subwords.
o Lowercase → Lower ##case
Lemma = the canonical form, dictionary form or citation form of a word
o Nouns: the citation form is the singular
§ mice → mouse
o Verbs: the citation form is the infinitive
§ driving, drives, driven, drove → drive
o Adjectives: the citation form is the positive
§ happy, happier, happiest → happy
Morpheme = the smallest meaningful lexical item in a language (morphemes can be
combined to derive new words)
Free morpheme = can appear in isolation
o woman in womanly
Bound morpheme = cannot appear in isolation
o -ed, -s, -es
o sent in dissent
o Two functions:
§ Inflectional morphology → conveys grammatical information, such as
number, tense, agreement or case
§ Derivational morphology → derivation, the process that creates new words
N-gram = a sequence of n tokens
POS-tag = the word class of a word: noun, verb, adjective, adverb, etc.
Named entity = a proper name, such as a person, organization or date
Content word = an open class word (new words can be added to these classes); content
words convey most of the meaning of a sentence and are always an adj, adv, intj, noun,
propn or verb
Function word = a closed class word that mainly conveys grammatical information: adp,
aux, cconj, det, num, part, pron, sconj
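The token/type distinction and n-grams can be illustrated in a few lines of plain Python (no NLP library; the sentence is the example from the notes, with punctuation pre-separated so a whitespace split suffices):

```python
# Count tokens and types, and extract n-grams, for the example sentence.
sentence = "a good wine is a wine that you like ."
tokens = sentence.split()   # punctuation is already separated by a space here
types = set(tokens)         # the distinct tokens

print(len(tokens))          # 10 tokens
print(len(types))           # 8 types ("a" and "wine" each occur twice)

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 2)[:3])  # first three bigrams
```

In real text the punctuation is attached to words, so a tokenizer has to run before this counting step.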




Text normalization, sentence segmentation, tokenization, byte-pair encoding,
lemmatization, POS-tagging and named entity recognition

Text normalization = cleaning raw text into a consistent form (e.g., removing layout and
markup)
o Disadvantage = information can be lost
o Ways to clean (depending on the task):
§ Uppercase vs lowercase vs true case
§ Normalize punctuation
§ Remove/replace emojis and URLs
§ Spelling correction
§ Replace numbers with NUM, links with URL
§ Anonymization
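A few of the normalization steps above can be sketched with regular expressions. This is a minimal, illustrative pipeline, not a production normalizer; the exact steps and their order depend on the task:

```python
import re

def normalize(text: str) -> str:
    """Apply a small, task-dependent set of normalization steps."""
    text = text.lower()                          # lowercase
    text = re.sub(r"https?://\S+", "URL", text)  # replace links with URL
    text = re.sub(r"\d+", "NUM", text)           # replace numbers with NUM
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

print(normalize("Read the 2021 syllabus at https://example.com !"))
# "read the NUM syllabus at URL !"
```

Note the ordering: URLs are replaced before numbers, so digits inside a link do not leak a stray NUM into the placeholder.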
Sentence segmentation = splitting a full text into sentences, for example at sentence-final
punctuation
Tokenization (word segmentation) = splitting a sentence into its parts (tokens)
o There are tricky cases where you have to choose: [isn’t] vs [is] [not], etc.
o In other languages or alphabets it can be even more complicated, for example
French clitics such as l’ensemble
o Needs to run before any other language processing → therefore needs to be very
fast
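A minimal rule-based tokenizer can handle the clitic case mentioned above. This sketch only splits off sentence punctuation and the English "n't"; real tokenizers cover many more cases (abbreviations, hyphens, French l', etc.):

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Toy word tokenizer: separate n't and trailing punctuation."""
    sentence = re.sub(r"n't", " n't", sentence)         # isn't -> is n't
    sentence = re.sub(r"([.,!?;])", r" \1 ", sentence)  # split off punctuation
    return sentence.split()

print(tokenize("She isn't here."))
# ['She', 'is', "n't", 'here', '.']
```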
Unknown word problem = words in the test data that did not occur in the training data
o Solution = splitting unknown words into subwords
o Idea = frequent tokens stay intact, less frequent ones are decomposed into parts
o Byte-pair encoding
Byte-pair encoding (BPE) = ensures that the most common words are represented in the
vocabulary as a single token, while rare words are broken down into two or more
subword tokens
o Two parts: token learner and token segmenter
§ Token learner = takes a raw training corpus and induces a vocabulary, a set
of tokens
§ Token segmenter = takes a raw test sentence and segments it into the
tokens in the vocabulary
o Algorithm:
§ Initialize the vocabulary with the set of characters (symbols)
§ Count which pair of symbols occurs most frequently next to each other in
the dataset
§ Merge that pair into a single symbol in the dataset and add it to the
vocabulary
§ Repeat with the updated vocabulary
o Stop after k merges
§ k is the number of merge operations, i.e., the number of novel tokens
§ The vocabulary grows with k new symbols
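The token-learner loop above can be sketched as follows. The tiny corpus is a made-up toy example (real BPE operates on word frequencies over a large corpus), and ties between equally frequent pairs are broken by insertion order here:

```python
from collections import Counter

def bpe_learn(corpus: list[str], k: int) -> list[tuple[str, str]]:
    """Learn k BPE merges: start from characters, repeatedly merge
    the most frequent adjacent symbol pair in the corpus."""
    words = [list(w) for w in corpus]   # each word as a list of symbols
    merges = []
    for _ in range(k):
        pairs = Counter()
        for w in words:                 # count adjacent symbol pairs
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:                 # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

merges = bpe_learn(["low", "low", "lower", "newest", "newest"], 3)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w'), ...
```

Each learned merge becomes a new vocabulary symbol; the token segmenter would later replay these merges, in order, on unseen words.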




Lemmatization = mapping words to their lemma
o Disadvantage = inflectional information is lost
o 'walk', 'walked', 'walks' or 'walking' → walk
POS-tagging = giving every word in a sentence its POS-tag (noun, verb, etc.)
Named entity recognition (NER) = identifying proper names (detecting names, dates and
organisations)
o Mention = the introduction of a named entity
§ Abraham Lincoln was …
o References = expressions referring to a previously introduced named entity
§ Lincoln, the president, he …
o Co-reference resolution = identifying references to the same entity
o Entity linking = mapping mentions and references to entities in a database
BIO-tagging = a tagging scheme for NER that marks the span of each named entity: B
(beginning of an entity), I (inside an entity), O (outside any entity)
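The BIO scheme can be illustrated with a hand-labeled toy sentence (the tags below are made up for illustration, not the output of a real NER model):

```python
tokens = ["Abraham", "Lincoln", "was", "president", "of", "the", "USA"]
tags   = ["B-PER",   "I-PER",   "O",   "O",         "O",  "O",   "B-LOC"]

def extract_entities(tokens, tags):
    """Collect (entity text, type) spans from a BIO tag sequence."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)             # continue the current entity
        else:                               # O: close any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))
# [('Abraham Lincoln', 'PER'), ('USA', 'LOC')]
```

Because every entity starts with a B- tag, adjacent entities of the same type stay distinguishable even without an explicit end marker.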

Most Frequent Class Baseline
Always compare a classifier against a baseline that is at least as good as the most frequent
class baseline: assign each token to the class it occurred in most often in the training set.
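The baseline is a few lines of Python. The training pairs below are a made-up toy example; unseen tokens fall back to the overall majority class:

```python
from collections import Counter

# Hypothetical (token, class) training pairs for illustration.
train = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
         ("the", "DET"), ("run", "NOUN"), ("run", "VERB"), ("run", "VERB")]

per_token = {}                      # class counts per token
for tok, cls in train:
    per_token.setdefault(tok, Counter())[cls] += 1

# Overall majority class, used for tokens never seen in training.
majority = Counter(cls for _, cls in train).most_common(1)[0][0]

def baseline(token):
    """Predict the class this token had most often in training."""
    counts = per_token.get(token)
    return counts.most_common(1)[0][0] if counts else majority

print(baseline("run"))   # "VERB" (2x VERB vs 1x NOUN in training)
print(baseline("cat"))   # unseen token -> overall majority class
```

Despite its simplicity, this baseline is surprisingly strong for POS-tagging, since most word forms are unambiguous.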

The 17 universal POS-tags
Function words → a, of, and, to, in, is, for, that, as, …
o Consist of:
§ Determiners
• those, these, that, this, …
§ Pronouns
§ Conjunctions
§ Prepositions
§ Complementizers (combine two sentences)
• that, whether
o Their meaning is grammatical
o Not productive
o Highly frequent
o There is a big overlap between frequent function words and stop words
Content words → beautiful, Australian, fishing, children, buy, fast, umbrella, …
o Nouns, verbs, adjectives, adverbs
o Their meaning is informative, grounded in the world
o Productive
o The more specific, the less frequent
o Endocentric compounds = denote a subtype of their head (yogapants are pants,
dogfood is food)
o Exocentric compounds = their meaning does not follow from the head (a redhead is
not a head; Facebook is not a book)

Open class words: Adj (big, old, green), Adv (very, well, tomorrow), Intj (oh, um, yes,
hello), Noun (girl, cat, tree), Propn (Mary, HBO, London), Verb (run, ate, eating)

Closed class words: Adp (in, to, under), Aux (has, was, should), Cconj (and, or, but), Det (a,
an, the, this), Num (0, one, second), Part (up, down, on, off), Pron (I, you, he, she), Sconj
(that, which)

Other: Punct (., ,, ()), Sym ($, %, +), X (xfgh, pdl, jklw)
