Natural Language Processing Technology (L_AAMAALG005): Summary of learning goals


This document includes answers to all learning goals for each lecture of Natural Language Processing Technology, INCLUDING reading comprehension answers!


  • November 5, 2021
  • 22
  • 2020/2021
NLP Tech 2021

Notes of learning goals

Name: VUnetID:




1 Lecture 1
1. Provide a definition and an example for the following terms: token, word, sub-word, lemma, morpheme,
POS-tag, named entity, content word, function word.

• Token: A token is an instance of a sequence of characters in some particular document that are
grouped together as a useful semantic unit for processing.
• Word: a unit of language that carries meaning by itself; in written English, words are usually
separated by whitespace (e.g. mountains).
• Sub-word: a part of a longer word. Sub-words are useful in NLP because a long word is often rare,
but can be decomposed into sub-words that are more common (e.g. superman decomposed into super
and man).
• Lemma: a set of lexical forms having the same stem, the same major part-of-speech, and the same
word sense (sing, sung, sang are forms of the verb sing).
• Morpheme: a short segment of language that meets three basic criteria: 1. It is a word or a part
of a word that has meaning. 2. It cannot be divided into smaller meaningful segments without
changing its meaning or leaving a meaningless remainder. 3. It has relatively the same stable
meaning in different verbal environments. For example, the word “dancing” can be divided into two
separate morphemes: dance, plus the suffix -ing. These are morphemes because they cannot be
divided any further: da and nce are meaningless in English, as are i and ng. Choosing the right
morpheme segmentation for NLP tasks can be very difficult, however. For instance, the Arabic
“waldiyn” can be read either as “w aldiyn” (‘and the religion’) or as “waldiyn” (‘parents’).
Lemmatization or sub-word segmentation might be a bit simpler and more useful.
• POS-tag: a label such as NOUN or VERB that marks the part of speech of a word. POS tagging takes
a sequence of words and assigns each word such a tag.
• Named Entity: anything that can be referred to with a proper name, such as a person, a location,
or an organization (roughly speaking).
• Content word: a word that carries lexical meaning. Nouns, main verbs, adjectives and adverbs are
usually content words. In ‘We flew over the mountains at dawn’, the content words are flew,
mountains and dawn.
• Function word: a word that expresses a grammatical or structural relationship with other words in
a sentence (e.g. the, over, and).

2. Explain the difference between two related terms (of the list above).
• Lemma vs Morpheme: a lemma is the base (dictionary) form of a word, whereas morphemes are the
smallest meaningful parts that a word is built from. For example, the lemma of “dancing” is
“dance”, while its morphemes are dance plus the suffix -ing. These are morphemes because they
cannot be divided any further: da and nce are meaningless in English, as are i and ng. In essence,
a lemma stands for a whole word and all its inflected forms, while a morpheme is a building block
within a word.

3. Explain the functionality of the following analysis steps: text normalization, sentence segmentation,
tokenization, byte-pair encoding, lemmatization, POS-tagging, named entity recognition.
• Text normalization: Normalizing texts makes it easier to run them through a standard NLP pipeline.
Think of removing bold markup, titles, emojis and URLs, or replacing numbers with a placeholder such
as NUM, as long as these things do not significantly affect your research question (e.g. emojis can
be useful for sentiment analysis, in which case we would want to keep them in our dataset).






• Sentence segmentation: Determine the boundaries where each sentence starts and ends. We can
assume that each sentence in English is a separate thought or idea, and it is a lot easier to write
a program that understands a single sentence than one that understands a whole paragraph.
• Tokenization: Splitting the text into tokens so that the computer can process them. Computers work
with numbers, so the tokens are then converted to numbers that the computer can use. Decisions on
how to tokenize certain words can be difficult; often a frequency-based model is used.
• Byte-pair encoding: A statistical approach for decomposing words into subwords.
• Lemmatization: Convert the different forms of a lemma to the lemma itself; sung, sang and sing will
all be converted to sing, and cars, car’s and cars’ will be converted to car. This is useful since
the forms share the same core meaning, which makes an NLP task simpler.
• POS-tagging: Assigning each word in a sentence one of the parts of speech defined in Figure 1
(for linguistic tasks the word classes are further defined as in figure ??).
• Named entity recognition: Recognize whether a word or phrase refers to a Person, Organization,
Location or Geo-Political Entity. Think of Washington, which can refer to a person, a location or
an organization depending on the context.
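The byte-pair encoding step in the list above can be made concrete with a minimal sketch. It assumes a toy word-frequency dictionary (invented for illustration) and hypothetical function names; it follows the usual loop of counting adjacent symbol pairs, merging the most frequent pair, and repeating. It is not necessarily how the course or any particular library implements it.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a word -> frequency dict."""
    # Start from single characters plus an end-of-word marker "_".
    vocab = {tuple(word) + ("_",): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Toy counts, invented for illustration only.
toy_counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
# merges == [('e', 'r'), ('er', '_'), ('n', 'e')]
```

After three merges, the frequent ending "er_" has become a single symbol, while rare words such as "lowest" are still split into smaller pieces.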
4. Explain the differences between the 17 word classes distinguished by the universal part-of-speech tags and
provide an example for each class.
• See figure 1.




Figure 1: From section 8.1 of book.


5. Manually determine the part-of-speech tag for a word in a given context.
• earnings growth took a back/JJ seat
• a small building in the back/NN
• a clear majority of senators back/VBP the bill
• Dave began to back/VB toward the door
• enable the country to buy back/RP debt

• I was twenty-one back/RB then
• Words can have many possible tags, but many words are easy to disambiguate because their different
tags aren’t equally likely. For example, a can be a determiner or the letter a, but the determiner
sense is much more likely.
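The observation that tags are not equally likely suggests a simple baseline: give each word the tag it carries most often in a tagged corpus. Below is a minimal sketch of that idea, assuming a tiny hand-made tagged corpus (invented for illustration, not real corpus counts) and hypothetical function names.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_tokens):
    """Map each (lowercased) word to the tag it is seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(words, model, default="NN"):
    """Tag each word with its most frequent tag; unseen words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy training data, invented for illustration only.
toy_corpus = [
    ("a", "DT"), ("small", "JJ"), ("building", "NN"),
    ("a", "DT"), ("clear", "JJ"), ("majority", "NN"),
    ("the", "DT"), ("letter", "NN"), ("a", "NN"),
    ("the", "DT"), ("back", "NN"), ("door", "NN"),
    ("went", "VBD"), ("back", "RB"), ("came", "VBD"), ("back", "RB"),
]
model = train_most_frequent_tag(toy_corpus)
tagged = tag_sentence(["A", "clear", "door"], model)
```

Such a baseline always assigns the same tag to a word, so it cannot tell back/JJ from back/RB; resolving those cases requires looking at the surrounding words.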
6. Explain the concept of ambiguity in part-of-speech tagging.
• Take lemmatization of the word ’saw’ as an example. As a verb, its lemma is see, so we would
convert the word to ’see’. However, the word could also be the noun ’saw’ (the tool), whose lemma is
saw, so the right choice depends on the part of speech. In Spanish, quiero (I want) and quieres
(you want) are both converted to the lemma querer, so we lose the information about the subject,
I or you.
7. Provide examples for different named entity classes.
• Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY
$6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines],
a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
[ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies
to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas]
and [LOC Denver] to [LOC San Francisco].
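To list the entity classes in a text annotated with the bracketed notation above, the labels can be pulled out with a regular expression. This is a small sketch using only Python's `re` module; the function names are hypothetical, and the sample string is the opening of the example sentence above.

```python
import re

# Matches "[LABEL some text]" as used in the bracketed notation.
ENTITY = re.compile(r"\[([A-Z]+) ([^\]]+)\]")

def extract_entities(text):
    """Return (label, mention) pairs for every bracketed entity."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(text)]

def group_by_class(entities):
    """Collect the mentions of each entity class in order of appearance."""
    groups = {}
    for label, mention in entities:
        groups.setdefault(label, []).append(mention)
    return groups

sample = ("Citing high fuel prices, [ORG United Airlines] said [TIME Friday] "
          "it has increased fares by [MONEY $6] per round trip")
entities = extract_entities(sample)
by_class = group_by_class(entities)
```

Here `by_class` maps ORG, TIME and MONEY to one mention each; running it on the full passage would also yield the PER and LOC mentions.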
8. Explain the BIO tagging scheme.

• Figure 2 shows a sentence represented with BIO tagging, as well as variants called IO tagging and
BIOES tagging. In BIO tagging we label any token that begins a span of interest with the label B,
tokens that occur inside a span are tagged with an I, and any tokens outside of any span of interest
are labeled O. While there is only one O tag, we’ll have distinct B and I tags for each named entity
class. The number of tags is thus 2n+1, where n is the number of entity types. BIO tagging can
represent exactly the same information as the bracketed notation, but has the advantage that we can
represent the task in the same simple sequence modeling way as part-of-speech tagging: assigning a
single label y_i to each input word x_i.




Figure 2: BIO tagging together with IO and BIOES tagging. IO tagging loses some information by eliminating
the B tag, and BIOES tagging adds an end tag E for the end of a span, and a span tag S for a span consisting
of only one word.
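The BIO scheme can be illustrated with a short sketch that converts entity spans to one tag per token and builds the tag set of size 2n+1. The function names and the span format (token indices, end exclusive) are assumptions made for this example; the tokens come from the airline passage in item 7.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens inside the span
    return tags

def bio_tagset(labels):
    """One O tag plus a B- and an I- tag per entity class: 2n+1 tags."""
    tagset = ["O"]
    for label in labels:
        tagset += ["B-" + label, "I-" + label]
    return tagset

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp."]
spans = [(0, 2, "ORG"), (6, 8, "ORG")]
tags = spans_to_bio(tokens, spans)
# tags == ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG']

tagset = bio_tagset(["PER", "ORG", "LOC", "TIME", "MONEY"])
# len(tagset) == 11, i.e. 2 * 5 + 1
```

Because every token receives exactly one tag, the output has the same shape as a part-of-speech tag sequence.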

9. Explain the concept of an NLP shared task and provide examples.
• These are big, broad tasks that the whole NLP community can work on, such as speech
recognition or text classification; see http://nlpprogress.com/ for more.

10. Analyze a dataset using the NLP pipeline spaCy.
