Lecture 1: Introduction
Computational linguistics: algorithms that model language data, e.g., similarity, information value
and sequence probabilities (mathematical view)
Natural Language Processing (NLP): engineering to address aspects of natural language, e.g.,
tokenization, lemmatization, compound splitting, syntactic splitting, entity detection, sentiment
analysis… (engineering view)
NLP Toolkits: software packages and resources that provide and/or combine collections of NLP
modules
Language applications: machine translation, summarization, chat bots, text mining
Text mining: from unstructured text to structured data (information or knowledge)
Lecture 2: Linguistics and Natural Language Processing
Subdiscipline | Medium or unit | Natural language model
Phonetics, phonology | Sounds | Automatic speech recognition
Morphology | Words, word formation | Part-of-speech taggers, lemmatizers, compound splitters
Syntax | Sentences, grammatical structure and function | Syntactic parsers, chunkers
Semantics | Meaning | Semantic parsers
Pragmatics | Language use in context | Context and domain models
Methods: introspection, behaviorism, empirical (experimental and stochastic), mathematical models
Resources: lexicons (dictionary as database), grammars, data collections and annotations, data models
We use minimal information to express a lot (e.g., on hearing "riots in Amsterdam", listeners know
exactly which riots are meant). Without context, data (spoken words etc.) is difficult to understand.
- Morphology: the study of the form and structure of words. Words are composed of morphemes.
A morpheme is the smallest meaning-bearing unit (e.g., talked consists of 2 morphemes: talk
(activity) and -ed (past)).
Different types of morphemes:
- Free morphemes: occur independently (e.g., boy, sing)
- Bound morphemes: attached to another morpheme, and cannot be used independently
(English -s: boys, Dutch -s/en: appels/appelen)
- Affixes: prefixes (e.g., gelopen), infixes (e.g., burgemeesterspost), suffixes (e.g., loopje)
Some other basic terms:
- Root or Base: an un-analysable morpheme, expressing the basic lexical content of a word.
Also defined as ‘what is left of a complex form when affixes are stripped’
- Stem: consists of at least a root. It can also contain derivational affixes (“aardigste” → “aardig”
/ “aard”)
- Lemma: an entry in a dictionary. The singular form for nouns (“stemmetje” → “stem”) and the
infinitive form for verbs (“stemde” → “stemmen”)
The difference between stem and lemma is that a stem does not have to be an actual word, whereas
a lemma is always an actual word of the language.
Words have a part-of-speech (PoS), which specifies the typical phrase structures in which they can be
the head.
- Open class (open to word formation and neologisms): Noun (N, boat), Verb (V, float),
Adjective (A, large/fast), Adverb (very/largely). New words are invented every day and other words
are forgotten. There are millions of open-class words if we include specialized language.
- Closed class (you cannot invent a new closed-class word): Pronoun (PRN, he/him/…), Preposition
(P, in, at, from…). Relatively fixed, changing only slowly over generations; a small set of fewer
than a hundred words.
Word modification: given a root, base or stem, derive different forms.
- Inflection: expresses syntactic properties such as person (1, 2, 3), number (singular/plural),
gender, tense…
- Derivation: changes semantic and grammatical properties, e.g., incapable.
- Compounding: “beach head”.
- Combinations: aircraft-carriers.
Word formation is very productive, so our lexicon is potentially infinite: the number of
unseen compounds detected in German and Dutch newspapers grows linearly with the number of
newspapers over time. The number of names for new chemical compounds and proteins grows rapidly
every year, and new products are launched every year.
Zipfian distribution (Zipf's law): the frequency of a word in a ranked list is approximately equal to
the frequency of the most frequent word, divided by the word's rank. The most frequent words also
tend to be short and to have many different meanings.
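The rank-frequency relation can be checked on a small corpus. The following is a minimal sketch, not a reference implementation: the helper name zipf_table is hypothetical, and the token list is made-up toy data constructed to fit the prediction exactly, which real text only does approximately.

```python
from collections import Counter


def zipf_table(tokens, top=5):
    """Compare observed frequencies with the Zipf prediction f(rank 1) / rank."""
    ranked = Counter(tokens).most_common()  # sorted by descending frequency
    top_freq = ranked[0][1]                 # frequency of the most frequent word
    rows = []
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        rows.append((rank, word, freq, top_freq / rank))
    return rows


# Toy data (assumption): frequencies chosen by hand, not taken from a real corpus.
tokens = ("the " * 12 + "of " * 6 + "cow " * 4 + "loop " * 3 + "stick " * 2).split()

for rank, word, observed, predicted in zipf_table(tokens):
    print(f"{rank}  {word:<6} observed={observed:<3} predicted={predicted:.1f}")
```

On a real corpus the observed and predicted columns will diverge, especially for the rarest words.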
Lexicon of forms: lists all common base forms with their part-of-speech, inflectional paradigm
(plural, singular, person, tense) and typical (conventional) derived forms. It covers inflectional
paradigms (-s, -ed) and derivational morphemes (-ation, -ity, -ly).
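One way to picture such a lexicon is as a table of entries, each with a part-of-speech, an inflectional paradigm and a list of derived forms. The sketch below is only an illustration: the class LexiconEntry, its fields and the two entries are hypothetical, and inflection is done by plain concatenation without any spelling rules.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    base: str                                      # base form
    pos: str                                       # part-of-speech
    paradigm: dict = field(default_factory=dict)   # feature label -> suffix
    derived: list = field(default_factory=list)    # conventional derived forms

    def inflect(self, feature):
        """Naive inflection: base form plus the suffix for this feature."""
        return self.base + self.paradigm[feature]


# Hypothetical toy entries, for illustration only.
LEXICON = {
    "talk": LexiconEntry("talk", "VERB", {"3sg.present": "s", "past": "ed"}, ["talker"]),
    "boy": LexiconEntry("boy", "NOUN", {"singular": "", "plural": "s"}, ["boyish"]),
}

print(LEXICON["talk"].inflect("past"))
print(LEXICON["boy"].inflect("plural"))
```

A realistic lexicon would also need irregular forms and spelling changes, which this sketch leaves out.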
Morphology in computational linguistics: analyzing complex words and identifying their component
parts (anti+dis+establishment+…). Analysis of the grammatical information encoded in words:
part-of-speech = VERB and inflectional information = [PERSON 3, NUMBER singular, TENSE present].
Obtaining the stem or root: to reduce the size of the data and to find the word in the lexicon.
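The analysis direction (from a surface word to a stem plus grammatical information) can be sketched with a few suffix rules. This is a toy illustration under stated assumptions: the rule list, the set of known stems and the function name analyze are all hypothetical, and real analyzers handle ambiguity and spelling changes that are ignored here.

```python
# Hypothetical suffix rules: (suffix, part-of-speech, inflectional features).
SUFFIX_RULES = [
    ("ed", "VERB", {"TENSE": "past"}),
    ("s", "VERB", {"PERSON": 3, "NUMBER": "singular", "TENSE": "present"}),
]

# Hypothetical mini-lexicon of verb stems.
KNOWN_STEMS = {"talk", "walk", "perform"}


def analyze(word):
    """Strip a known suffix and return stem, PoS and features, or None."""
    for suffix, pos, features in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in KNOWN_STEMS:  # accept only stems found in the lexicon
                return {"stem": stem, "pos": pos, "features": features}
    if word in KNOWN_STEMS:          # bare stem, no inflection recognised
        return {"stem": word, "pos": "VERB", "features": {}}
    return None                      # unknown word


print(analyze("walks"))
print(analyze("talked"))
print(analyze("cow"))
```

Checking the candidate stem against the lexicon is what keeps the rule for -s from analyzing arbitrary words that merely end in s.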
Part-of-speech tagging: the task is to assign a part-of-speech category to every token and add the
lemma. The main challenge is data sparseness for specific languages and domains. PoS tagging reaches
an accuracy of around 95-96% over all tokens when training and testing. Remaining issues:
long-distance dependencies/genuine ambiguities, annotation errors and unknown words. Because
accuracy is counted per token, a relatively high proportion of sentences has at least one error.
These errors can propagate: a wrong PoS may lead to a wrong word sense/named entity…
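A short calculation shows why a token accuracy of 95-96% still leaves many sentences with an error. The sketch assumes that tagging errors are independent of each other and that a sentence has 20 tokens; both are simplifying assumptions for illustration, not figures from the lecture.

```python
def sentence_accuracy(token_accuracy, sentence_length):
    """Probability that all tokens in a sentence are tagged correctly,
    assuming independent errors (a simplification)."""
    return token_accuracy ** sentence_length


SENTENCE_LENGTH = 20  # assumed sentence length in tokens

for acc in (0.95, 0.96):
    p = sentence_accuracy(acc, SENTENCE_LENGTH)
    print(f"token accuracy {acc:.2f}: {p:.0%} of sentences fully correct")
```

Under these assumptions only about 36% (at 95%) to 44% (at 96%) of 20-token sentences are tagged without any error.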
Multiword expression: fixed idioms (an apple a day keeps the doctor away), less fixed idioms
(shooting from the hip), slots (X, let alone Y), collocations (running engine, running a programme)
and selectional restrictions (a glass of …)
- Syntax: we experience a sentence as a complete grammatical structure. We can freely
combine words into phrases or constituents and we have a strong intuition about the
grammaticality of these structures within a sentence.
Phrase: a word or a group of words which functions as a single unit within a grammatical hierarchy.
A phrase is built around a head lexical item and has a certain syntactic behaviour. The head of a
phrase is the element that determines the syntactic function of the whole phrase. For example:
- She = Noun Phrase or NP (head is a pronoun)
- A very beautiful morning = NP (head is a noun)
- Chases the cat = Verb Phrase or VP (head is a verb)
Syntactic elements
Phrasal categories: Noun phrase (NP), Prepositional phrase (PP), Verb phrase (VP), Adverbial phrase
(AdvP), Adjectival phrase (AP)
Lexical categories: Noun (N), Pronoun (Pr), Adjective (A), Adverb (Adv), Verb (V), Preposition (P)
A phrase structure can be nested. The nesting is hierarchical and involves head-modifier relations.
For example:
- Very nice = Adjective Phrase or AP (head is an adjective (A))
- A very nice looping = NP (head is a noun (N))
- Performs a nice looping = VP (head is a verb (V))
- With a long stick = Prepositional Phrase or PP (head is a preposition (P))
- The cow performs a very nice looping with a long stick = Sentence (S)
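The nesting in these examples can be written down as a nested data structure. The sketch below is one possible analysis, not the lecture's own tree: the tuple layout, the helper names and the Det category for articles are assumptions, and attaching the PP to the VP is just one of the readings the sentence allows.

```python
# A node is (category, child, child, ...); a leaf is (lexical category, word).
TREE = (
    "S",
    ("NP", ("Det", "the"), ("N", "cow")),
    ("VP",
        ("V", "performs"),
        ("NP", ("Det", "a"), ("AP", ("Adv", "very"), ("A", "nice")), ("N", "looping")),
        ("PP", ("P", "with"),
            ("NP", ("Det", "a"), ("AP", ("A", "long")), ("N", "stick")))),
)


def is_leaf(node):
    """A leaf pairs a lexical category with a single word."""
    return len(node) == 2 and isinstance(node[1], str)


def words(node):
    """Collect the words of a (sub)tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]


def bracketed(node):
    """Render a (sub)tree in labelled bracket notation."""
    if is_leaf(node):
        return f"[{node[0]} {node[1]}]"
    return f"[{node[0]} " + " ".join(bracketed(child) for child in node[1:]) + "]"


print(" ".join(words(TREE)))
print(bracketed(TREE))
```

Each phrase from the list appears as a subtree, for example the AP "very nice" inside the NP "a very nice looping".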
Phrase functions: subject, object, main verb, modifier, adjunct…
Phrase functions and the different categories can be modelled inside
a syntactic tree:
Grammatical subject: agreement with the main verb
Grammatical objects: obligatory NPs or PPs to form a grammatical sentence
Syntax Tree with dependency labels:
Most important types of predicates in terms of obligatory arguments
(the complementation = what is needed to obtain a grammatical
structure):
Valency | Predicate | Complementation | Example
Intransitive | walk.v | NP.subject | The cow walks
Transitive | perform.v | NP.subject, NP.direct object | The cow performs a looping
Transitive | count.v | NP.subject, PP(on).pp-object | The cow is hoping for a big applause
Transitive | be.v | NP.subject, NP.object/AP.object | This cow is a phenomenon / this cow is phenomenal
Ditransitive | give.v | NP.subject, NP.direct object, NP.indirect object | The cow gives the spectators an unforgettable day
A lexicon provides a list of verbs with their complementation patterns
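Such a lexicon can be sketched as a mapping from verbs to their obligatory arguments. The complementation patterns below are copied from the table; the dictionary layout and the helper missing_arguments are hypothetical and only illustrate how a parser might check whether a candidate structure is complete.

```python
# Verb -> (valency, obligatory arguments), following the table of predicates.
VALENCY_LEXICON = {
    "walk": ("intransitive", ["NP.subject"]),
    "perform": ("transitive", ["NP.subject", "NP.direct object"]),
    "count": ("transitive", ["NP.subject", "PP(on).pp-object"]),
    "be": ("transitive", ["NP.subject", "NP.object/AP.object"]),
    "give": ("ditransitive",
             ["NP.subject", "NP.direct object", "NP.indirect object"]),
}


def missing_arguments(verb, found):
    """Return the obligatory arguments of `verb` that are not in `found`."""
    _, required = VALENCY_LEXICON[verb]
    return [arg for arg in required if arg not in found]


# "The cow gives an unforgettable day" lacks the indirect object.
print(missing_arguments("give", ["NP.subject", "NP.direct object"]))
print(missing_arguments("walk", ["NP.subject"]))
```

An empty result means that all obligatory arguments of the verb were found.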
Phrase structure parsers: look up the words of a sentence in a lexicon to find a candidate for a main
verb. Get the obligatory arguments of the verb. Match the structure of surrounding phrases with the