1 Lecture 1 - Analyzing Language
In the first week, we will learn about linguistic pre-processing and natural language processing
pipelines. We will acquire the most relevant terminology for analyzing language and the practical
skills to analyze a dataset using the spaCy package.
1.1 Term definition
Provide a definition and an example for the following terms: token, word, sub-word,
lemma, morpheme, POS-tag, named entity, content word, function word.
1.1.1 Token
A token is a string of contiguous characters between two spaces, or between a space and a
punctuation mark. A token can also be an integer, a real number, or a number containing a colon
(a time, for example 2:00). All other symbols are tokens themselves, except apostrophes and
quotation marks inside a word (with no space), which in many cases mark acronyms or citations.
A token can represent a single word or a group of words (in morphologically rich languages such
as Hebrew).
Example: “They picnicked by the pool, then lay back on the grass and looked at the stars” has
16 tokens when punctuation is ignored (17 if the comma is counted as a token).
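The rules above can be approximated with a regular expression. This is a minimal sketch, not the tokenizer used by spaCy; the pattern and the names TOKEN_PATTERN and tokenize are assumptions made for illustration.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   1. times such as 2:00
#   2. integers and real numbers such as 3 or 3.14
#   3. words, optionally with internal apostrophes (don't)
#   4. any other single non-space symbol (punctuation)
TOKEN_PATTERN = re.compile(r"\d+:\d+|\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars")
tokens = tokenize(sentence)
# Keep only tokens that start with a letter or digit, i.e. drop punctuation.
words_only = [t for t in tokens if t[0].isalnum()]

print(len(tokens))      # 17 (the comma is a token of its own)
print(len(words_only))  # 16
```

On the example sentence this gives 17 tokens when the comma is counted and 16 when only words are kept.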
1.1.2 Word
A word is a single distinct meaningful element of speech or writing, used with others (or sometimes
alone) to form a sentence and typically shown with a space on either side when written or printed.
Some words can be treated as such even though they contain spaces.
Example: New York, rock ’n’ roll
1.1.3 Sub-word
Frequent words are kept as single tokens, while less frequent words are decomposed into
sub-words. A sub-word vocabulary is therefore a set of tokens that includes tokens smaller than
words.
Example: “I was supernervous and started stuttering” -> ['I', 'was', 'super', '##ner', '##vous',
'and', 'started', 's', '##tu', '##ttering']
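The decomposition in the example can be reproduced with a greedy longest-match-first lookup, in the style of WordPiece, where "##" marks a piece that continues a word. This is a minimal sketch: the vocabulary TOY_VOCAB is made up so that it yields the example above, and the function names are assumptions.

```python
# Hypothetical toy vocabulary; a real one is learned from a corpus.
TOY_VOCAB = {"I", "was", "super", "##ner", "##vous", "and", "started",
             "s", "##tu", "##ttering"}


def split_word(word, vocab, unknown="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece matches: the word is unknown
        pieces.append(piece)
        start = end
    return pieces


def subword_tokenize(text, vocab):
    """Split on whitespace, then split each word into sub-words."""
    result = []
    for word in text.split():
        result.extend(split_word(word, vocab))
    return result


print(subword_tokenize("I was supernervous and started stuttering", TOY_VOCAB))
```

Frequent words such as "started" are in the vocabulary and stay whole; "supernervous" and "stuttering" are not, so they are broken into smaller pieces.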
1.1.4 Lemma
A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the
same word sense. The set is usually represented by its canonical (dictionary) form.
Example: happier, happiest -> happy
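The idea that several word forms belong to one lemma can be sketched with a lookup table. This is only an illustration: LEMMA_TABLE is a hypothetical hand-written table, whereas real lemmatizers rely on large lexicons and rules.

```python
from collections import defaultdict

# Hypothetical lookup table from word form to lemma.
LEMMA_TABLE = {
    "happy": "happy",
    "happier": "happy",
    "happiest": "happy",
    "looked": "look",
    "stars": "star",
}


def lemmatize(word):
    """Return the lemma of a word, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def group_by_lemma(words):
    """Group word forms under their shared lemma."""
    groups = defaultdict(list)
    for word in words:
        groups[lemmatize(word)].append(word)
    return dict(groups)


print(group_by_lemma(["happy", "happier", "happiest", "stars"]))
# {'happy': ['happy', 'happier', 'happiest'], 'star': ['stars']}
```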
1.1.5 Morpheme
A morpheme is a single unit of meaning that cannot be further divided.
Example: “un-”, “break”, “-able” in the word “unbreakable”.
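A naive way to split a word into morphemes is to strip known affixes. The sketch below assumes two tiny, hypothetical affix lists (PREFIXES, SUFFIXES) and strips at most one prefix and one suffix. It handles the example but will wrongly split words such as "under", so it illustrates the idea rather than a usable analyzer.

```python
# Hypothetical affix lists, chosen only for this illustration.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("able", "ness", "ing")


def segment(word):
    """Split a word into an optional prefix, a stem, and an optional suffix."""
    morphemes = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            morphemes.append(prefix + "-")
            word = word[len(prefix):]
            break
    found_suffix = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            found_suffix = suffix
            word = word[:-len(suffix)]
            break
    morphemes.append(word)  # whatever remains is treated as the stem
    if found_suffix is not None:
        morphemes.append("-" + found_suffix)
    return morphemes


print(segment("unbreakable"))  # ['un-', 'break', '-able']
```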
1.1.6 POS-tag
A POS-tag is the label that marks the part-of-speech of a word. Part-of-speech tagging is the
process of assigning such a tag to each word in a text.
Example: “Janet(NOUN) will(AUX) back(VERB) the(DET) bill(NOUN)”
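For further processing, the word(TAG) notation of the example can be turned into (word, tag) pairs. This is a small sketch that only parses the notation; it does not predict tags. The names TAGGED_WORD and parse_tagged are assumptions.

```python
import re

# A word followed directly by an upper-case tag in parentheses.
TAGGED_WORD = re.compile(r"(\S+?)\(([A-Z]+)\)")


def parse_tagged(text):
    """Return a list of (word, tag) pairs from text in word(TAG) notation."""
    return TAGGED_WORD.findall(text)


pairs = parse_tagged("Janet(NOUN) will(AUX) back(VERB) the(DET) bill(NOUN)")
for word, tag in pairs:
    print(word, tag)
```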
1.1.7 Named entity
A named entity is anything that can be referred to with a proper name:
Type                  Tag  Sample categories         Example
People                PER  people, characters        Turing is a giant of computer science
Organization          ORG  companies, sports teams   The IPCC warned about the cyclone
Location              LOC  regions, mountains, seas  Mt. Sanitas is in Sunshine Canyon
Geo-Political Entity  GPE  countries, states         Palo Alto is raising the fees for parking
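The simplest way to find named entities is to look names up in a list of known names (a gazetteer). The sketch below assumes a hypothetical gazetteer built from the table and reports only the first occurrence of each name. Statistical recognizers, such as the one in spaCy, learn from annotated data instead.

```python
# Hypothetical gazetteer: known name -> entity tag.
GAZETTEER = {
    "Turing": "PER",
    "IPCC": "ORG",
    "Mt. Sanitas": "LOC",
    "Palo Alto": "GPE",
}


def find_entities(text):
    """Return (name, tag) pairs for gazetteer names found in text, in text order."""
    found = []
    for name, tag in GAZETTEER.items():
        position = text.find(name)  # first occurrence only
        if position != -1:
            found.append((position, name, tag))
    return [(name, tag) for position, name, tag in sorted(found)]


print(find_entities("Palo Alto is raising the fees for parking"))
# [('Palo Alto', 'GPE')]
```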
1.1.8 Content word
Content words (or open class words) are words that possess semantic content and contribute
to the meaning of the sentence in which they occur. They include:
Open class words Example
ADJ big, old, green
ADV up, down, tomorrow, very
INTJ ouch, bravo
NOUN girl, cat, tree
PROPN Mary, John, London
VERB run, eat, running
1.1.9 Function word
Function words (or closed class words) are words whose purpose is to contribute to the
syntax rather than the meaning of a sentence. Thus they form important elements in the structures
of sentences. They include:
Closed class words Example
ADP in, to, during
AUX should, must
CCONJ and, or, but
DET a, the
NUM 0, 1, one, seventy
PART [en] not, [de] nicht, [en] ’s
PRON mine, yours, myself
SCONJ if, while
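Given POS-tagged words, the two tables translate directly into a lookup that separates content words from function words. This is a minimal sketch; the tag sets are copied from the tables above, and the function name word_class is an assumption.

```python
# Tag sets taken from the open class and closed class tables.
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}


def word_class(pos_tag):
    """Map a POS-tag to 'content', 'function', or 'other'."""
    if pos_tag in OPEN_CLASS:
        return "content"
    if pos_tag in CLOSED_CLASS:
        return "function"
    return "other"  # e.g. punctuation or symbols


tagged = [("Janet", "NOUN"), ("will", "AUX"), ("back", "VERB"),
          ("the", "DET"), ("bill", "NOUN")]
content_words = [word for word, tag in tagged if word_class(tag) == "content"]
function_words = [word for word, tag in tagged if word_class(tag) == "function"]

print(content_words)   # ['Janet', 'back', 'bill']
print(function_words)  # ['will', 'the']
```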
Explain the difference between two related terms (of the list above).
1.2 Analysis steps
Explain the functionality of the following analysis steps: text normalization, sentence
segmentation, tokenization, byte-pair encoding, lemmatization, POS-tagging, named
entity recognition.
1.2.1 Text normalization
Normalizing text means converting it to a more convenient, standard form.
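What counts as a "standard form" depends on the task. The sketch below assumes three common choices (Unicode normalization, lowercasing, and collapsing whitespace); these steps are assumptions for illustration, not a fixed definition of text normalization.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    # NFKC maps compatibility characters (e.g. a no-break space) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace any run of whitespace with a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(normalize("  The  Stars\tARE\u00a0bright "))  # the stars are bright
```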