NLP Tech 2021
Notes of learning goals
Name: VUnetID:
1 Lecture 1
1. Provide a definition and an example for the following terms: token, word, sub-word, lemma, morpheme,
POS-tag, named entity, content word, function word.
• Token: A token is an instance of a sequence of characters in some particular document that are
grouped together as a useful semantic unit for processing.
• Word: The smallest free-standing unit of a language that carries meaning (e.g. cat, running).
• Sub-word: a part of a longer word. Sub-words are often useful in NLP tasks because a long, rare
word can be decomposed into sub-words that are more common (e.g. superman decomposed into
super and man).
• Lemma: a set of lexical forms having the same stem, the same major part-of-speech, and the same
word sense (sing, sung, sang are forms of the verb sing).
• Morpheme: A morpheme is a short segment of language that meets three basic criteria: 1. It
is a word or a part of a word that has meaning. 2. It cannot be divided into smaller meaningful
segments without changing its meaning or leaving a meaningless remainder. 3. It has relatively the
same stable meaning in different verbal environments. For example, take the word "dancing". This
word can be divided into two separate morphemes: dance, plus the suffix -ing. You can tell that
these are morphemes because they cannot be divided any further: da and nce are meaningless in
English, as are i and ng. Choosing the right morphological segmentation can, however, be very
difficult in NLP; in Arabic, for instance, "waldiyn" can be segmented as "w aldiyn", meaning 'and
the religion', or left whole as "waldiyn", meaning 'parents'. Lemmatization or sub-word segmentation
is often simpler and more useful.
• POS-tag: a label such as NOUN or VERB that marks a word's part of speech; POS-tagging takes
a sequence of words and assigns each word such a tag.
• Named Entity: for anything that can be referred to with a proper name, such as a person, a location,
or an organization (roughly speaking).
• Content word: words that have meaning. Nouns, main verbs, adjectives and adverbs are usually
content words. ‘We flew over the mountains at dawn’.
• Function word: a word that expresses a grammatical or structural relationship with other words in
a sentence (the, over, and).
2. Explain the difference between two related terms (of the list above).
• Lemma vs Morpheme: a lemma is the dictionary (base) form of a word, while a morpheme is the
smallest meaningful unit of language. For example, the lemma of "dancing" is "dance", and its
morphemes are dance plus the suffix -ing. The morphemes cannot be divided any further: da and
nce are meaningless in English, as are i and ng. A lemma may consist of several morphemes, so the
two notions are related but not the same.
3. Explain the functionality of the following analysis steps: text normalization, sentence segmentation, tok-
enization, byte-pair encoding, lemmatization, POS-tagging, named entity recognition.
• Text normalization: Normalizing text makes it easier to run through a standard NLP pipeline. Think
of removing bold markup, titles, emojis and URLs, or mapping numbers to a NUM token, provided
those things do not significantly affect your research question (e.g., emojis can be useful when doing
sentiment analysis, so then we would want to keep them in our dataset).
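A minimal sketch of such a normalizer using Python's re module (the exact substitutions shown here are illustrative choices, not a fixed recipe; which ones to apply depends on the research question):

```python
import re

def normalize(text):
    """Toy text normalizer: lowercase, strip URLs, map numbers to NUM."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)        # remove URLs
    text = re.sub(r"\d+(?:[.,]\d+)?", "NUM", text)  # numbers -> NUM token
    return text.strip()

print(normalize("Fares rose by $6 on Friday, see https://example.com"))
# → fares rose by $NUM on friday, see
```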
• Sentence segmentation: Determining where each sentence starts and ends, so later steps get per-sentence
context. We can assume that each sentence in English expresses a separate thought or idea, and it
is a lot easier to write a program that understands a single sentence than one that understands a
whole paragraph.
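A naive segmenter can be sketched as a split after sentence-final punctuation (a toy illustration; real segmenters must also handle abbreviations such as "Dr." or "U.S."):

```python
import re

def split_sentences(text):
    """Naive segmenter: split after ., ! or ? followed by whitespace.
    Fails on abbreviations like 'Dr.' -- real segmenters handle those."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("It rained. We stayed inside! Did you?"))
# → ['It rained.', 'We stayed inside!', 'Did you?']
```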
• Tokenization: Splitting text into tokens so the computer can process it. Computers work with
numbers, so we convert tokens to numbers (indices in a vocabulary) that the computer can use.
Decisions on how to tokenize certain words can be difficult; often a frequency-based model is used.
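The tokens-to-numbers step can be sketched as a small vocabulary lookup (a toy illustration, not a full tokenizer):

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first occurrence."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)          # the=0, cat=1, sat=2, on=3, mat=4
ids = [vocab[t] for t in tokens]
print(ids)  # → [0, 1, 2, 3, 0, 4]
```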
• Byte-pair encoding: A statistical approach for decomposing words into sub-words: starting from
characters, the most frequent pair of adjacent symbols is repeatedly merged into a new symbol.
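One BPE merge step can be sketched as follows: count adjacent symbol pairs over a tiny hypothetical corpus and merge the most frequent pair (real BPE repeats this for a fixed number of merges learned from a large corpus):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words (word = tuple of symbols)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: word (pre-split into characters) -> frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair, list(words))  # ('l', 'o') merged: ('lo', 'w') etc.
```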
• Lemmatization: Convert different forms of a lemma to the lemma itself; sung, sang and sing will all
be converted to sing, and cars, car's and cars' will be converted to car. This is useful since these
forms have the same core meaning, which makes an NLP task simpler.
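A toy lookup-based sketch of lemmatization (real lemmatizers, such as spaCy's, use morphological rules and the word's POS tag rather than a fixed table):

```python
# Hypothetical lookup table for illustration only.
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing",
          "cars": "car", "car's": "car", "cars'": "car"}

def lemmatize(token):
    """Return the lemma if known, else the lowercased token itself."""
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(t) for t in "She sang near the cars".split()])
# → ['she', 'sing', 'near', 'the', 'car']
```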
• POS-tagging: Assigning each word in a sentence one of the parts of speech shown in figure 1
(for linguistic tasks, finer-grained word classes such as the Penn Treebank tagset are used).
• Named entity recognition: Recognize whether a word belongs to a Person, Organization, Location or
Geo-Political Entity. Think of Washington: it can name a person (George Washington), a location
(the state or city), or a geo-political entity (the U.S. government), so the class must be resolved from
context.
4. Explain the differences between the 17 word classes distinguished by the universal part-of-speech tags and
provide an example for each class.
• See figure 1.
Figure 1: The 17 universal part-of-speech tags (from section 8.1 of the book).
5. Manually determine the part-of-speech tag for a word in a given context.
• earnings growth took a back/JJ seat
• a small building in the back/NN
• a clear majority of senators back/VBP the bill
• Dave began to back/VB toward the door
• enable the country to buy back/RP debt
• I was twenty-one back/RB then
• Words can have a lot of tags, however many words are easy to disambiguate, because their different
tags aren’t equally likely. For example, a can be a determiner or the letter a, but the determiner
sense is much more likely.
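The observation that a word's tags aren't equally likely is the basis of a simple baseline tagger: label each word with the tag it received most often in a labeled corpus. A sketch over a tiny hypothetical corpus (the counts are made up for illustration):

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus; frequencies are hypothetical.
tagged = [("back", "RB"), ("back", "JJ"), ("back", "RB"),
          ("a", "DT"), ("a", "DT"), ("a", "NN")]

counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

def most_frequent_tag(word):
    """Baseline tagger: pick the tag seen most often for this word."""
    return counts[word].most_common(1)[0][0]

print(most_frequent_tag("a"))  # → DT (seen twice, beats NN seen once)
```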
6. Explain the concept of ambiguity in part-of-speech tagging.
• POS ambiguity means the same surface form can belong to several word classes, so the correct tag
depends on context. Take as example lemmatization of the word 'saw': as a verb, its lemma is see,
so we would convert the word to 'see', but the word could also be the noun 'saw' (the tool). Similar
information loss occurs in Spanish: quiero (I want) and quieres (you want) would both be converted
to the lemma querer, losing the information about the pronoun, I or you.
7. Provide examples for different named entity classes.
• Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines],
a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
[ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies
to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas]
and [LOC Denver] to [LOC San Francisco].
8. Explain the BIO tagging scheme.
• Figure 2 shows a sentence represented with BIO tagging, as well as variants called IO tagging and
BIOES tagging. In BIO tagging we label any token that begins a span of interest with the label B,
tokens that occur inside a span are tagged with an I, and any tokens outside of any span of interest
are labeled O. While there is only one O tag, we'll have distinct B and I tags for each named entity
class. The number of tags is thus 2n + 1, where n is the number of entity types. BIO tagging can
represent exactly the same information as the bracketed notation, but has the advantage that we can
represent the task in the same simple sequence modeling way as part-of-speech tagging: assigning a
single label y_i to each input word x_i.
Figure 2: BIO tagging together with IO and BIOES tagging. IO tagging loses some information by eliminating
the B tag, and BIOES tagging adds an end tag E for the end of a span, and a span tag S for a span consisting
of only one word.
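Converting bracketed spans into per-token BIO tags can be sketched as follows (the to_bio helper and the span format are illustrative, not a standard API):

```python
def to_bio(tokens, spans):
    """spans: list of (start, end, label) token ranges, end exclusive.
    The first token of a span gets B-label, the rest get I-label,
    and every token outside any span stays O."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["American", "Airlines", "matched", "the", "move"]
print(to_bio(tokens, [(0, 2, "ORG")]))
# → ['B-ORG', 'I-ORG', 'O', 'O', 'O']
```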
9. Explain the concept of an NLP shared task and provide examples.
• A shared task is a common benchmark problem, with a shared dataset and evaluation, that the
whole NLP community can work on. Think of speech recognition or text classification; see
http://nlpprogress.com/ for more examples.
10. Analyze a dataset using the NLP pipeline spaCy.
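A minimal sketch of such an analysis (this assumes spaCy is installed; spacy.blank("en") only tokenizes, while POS tags, lemmas and entities require a trained model such as en_core_web_sm, installed via `python -m spacy download en_core_web_sm`):

```python
# Minimal spaCy pass over one sentence of a "dataset".
try:
    import spacy
    nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download
    doc = nlp("United Airlines said Friday it increased fares by $6.")
    tokens = [t.text for t in doc]
except ImportError:
    # Fallback so the sketch still runs if spaCy is not installed.
    tokens = "United Airlines said Friday it increased fares by $6.".split()
print(tokens)
```

With a trained model loaded via spacy.load("en_core_web_sm"), the same doc object would also expose token.pos_, token.lemma_ and doc.ents for the POS-tagging, lemmatization and NER steps discussed above.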