Julia Dobladez
Lecture Notes (2023/24)
Lecture 1: Introduction
Ambiguity of language - Many possible interpretations of a sentence:
● They saw a kid with a telescope
○ They had a telescope and saw the kid with it
○ The kid had a telescope
● Flying planes can be dangerous
○ Planes that fly can be dangerous
○ Flying the planes can be dangerous
● Time flies like an arrow
○ Could be an idiom: time passes very quickly
Practical Challenges - Sentiment Analysis:
● Negation: “This movie is definitely not bad”
● Dependencies: “Whoever thinks this film is incredibly bad is an idiot”
● Sarcasm: “This is the best movie ever, lol”
● Context: “This movie is only slightly better than Jaws”
○ “Only slightly better” sounds pessimistic
○ But Jaws is a very good movie, so the sentence actually means the movie was very good
The main issues are sarcasm detection and world knowledge. Sentiment analysis is therefore not
trivial: the algorithm lacks world knowledge and cannot simply search Google for it.
Ex: DeepMoji has learned to understand emotions and sarcasm based on emojis. It is still
constrained: it did well on “flight is cancelled”, but when the input was changed to “missed my
flight” it took it as positive.
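The negation example above can be made concrete. Below is a minimal sketch (not a system from the lecture) of a naive lexicon-based scorer; the word lists are made up, and the point is only that bag-of-words scoring ignores “not”:

```python
# A naive lexicon-based sentiment scorer (toy word lists, for illustration).
POSITIVE = {"good", "great", "best"}
NEGATIVE = {"bad", "awful", "boring"}

def naive_sentiment(text: str) -> int:
    """+1 per positive word, -1 per negative word."""
    score = 0
    for token in text.lower().split():
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

# "not" is ignored entirely, so "not bad" still scores as negative.
print(naive_sentiment("This movie is definitely not bad"))  # -1
```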
Language Acquisition Research - Naive Engineer Solution to Challenges: the naive
engineering solution is not feasible because there is a memory limit. An alternative is to treat the
machine the same way we raise kids to acquire language:
1. Kids need to learn how to segment speech: stress, rhythm, spotting words
2. They start mapping tokens to objects
3. Vocabulary spurt (around age 2; growth is slow at first)
4. Learn words in context of known words (fast mapping)
Grounded Learning - Solution to Challenges: ground language in other modalities (vision, speech, etc.)
Nascent fields:
● Linguistics: study of morphology, syntax, pragmatics, semantics, phonetics, etc. How
does language work? Largely unsolved
● Psycholinguistics: neurolinguistics, language acquisition. How does the brain process
language? Largely unsolved
Related Fields:
● Natural Language Processing (NLP): building systems for specific tasks (machine
translation, text-to-speech, summarization, language generation, question-answering,
etc.) Typically uses machine learning (nowadays)
● Computational Linguistics: combining language technologies to do
linguistically-motivated research
● Text Mining: combining language technologies to extract information from text data
History of NLP:
● Up until the 70s: everyone was really interested in Chomsky's Language Faculty.
Language is innate, so let's not learn from data; let's build grammars and ontologies.
We are born with certain parameters in our brains that make us very good language
learners.
● 70s-80s AI Winter: machine translation was more expensive than human translators. AI
was very popular, but since it couldn't reach its goals, the funding stopped.
● 80s-10s: linguists started to reject Chomsky's framework. The field started learning from
statistics, corpus analysis, and data; algorithms were designed for particular tasks.
● 10s-20s: deep learning started to take over NLP, but tweaking models for specific tasks
took too much time: hyperparameter tuning was required, and models could not be
reused across tasks, which made them very time-consuming.
● 20s-now: neural networks dominate NLP → the transformer era.
Psycholinguistics for NLP:
Many modern day systems are:
● Black boxes: interpreting them requires creativity and linguistic questions
○ Communication: from an industry perspective it is important to communicate why
a system says y
● Biased: data is biased. Risk: harmful content and inferences in demos
● Imperfect: debugging models is required; task-specific expert knowledge is required for
good test sets.
○ Deployment: NLP systems are often put in ‘heuristic prisons’
Models need interpretability, fairness, and consistency research.
Language is complex because it is not easy to represent, mathematically interpret, infer, or
understand. For both classification and retrieval (essential to Text Mining), we need good
language representations.
Stylometry: involves analysing the linguistic style of written text to identify patterns,
characteristics, or authorship. It's like a linguistic fingerprint, using various quantitative
techniques to study the unique elements of writing styles, including vocabulary, sentence
structure, punctuation usage, and more. Computers are very good at picking up on the patterns
that compose the stylometry of an author.
● Authorship Identification: the process of determining the likely author of a text when
the authorship is either unknown or in dispute. It involves analyzing various linguistic
2
, Julia Dobladez
Lecture Notes (2023/24)
features and patterns within the text to identify unique stylistic elements associated with
specific authors.
○ Feature Extraction
■ N-grams can serve as input; weighting is important
● An n-gram is a combination of n consecutive words
■ Distances between author profiles may be used for qualitative analysis
(see the sketch after these steps)
○ Feature Selection
○ Model Building
○ Validation
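A minimal sketch of the feature-extraction step above, assuming scikit-learn is available; the two example texts are made up. It extracts weighted word n-grams and computes a cosine distance between the resulting author profiles:

```python
# Word uni- and bigrams with tf-idf weighting, then a pairwise
# cosine distance between two author profiles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

author_a = "I really cannot say that I enjoyed the evening at all."
author_b = "The evening was great, I loved every minute of it!"

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform([author_a, author_b])  # one row per author

print(cosine_distances(X))  # small distance = similar profiles
```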
● Can identify the cultural background, age, and occupation of the author by the way
they write
● Dog whistling
● Features in Stylometry that help in Authorship Identification
○ Gender
○ Age
○ Occupation
○ Education
○ Income
○ Nationality → translating from the mother tongue into another language shows
differences
○ Politics
○ Religion
○ Mental health (?)
● The Big Five personality traits (openness, conscientiousness, extraversion,
agreeableness, neuroticism) have been studied by researchers for correlations with
writing style
● Content words are the main issue when it comes to analysing stylometry
○ “Shave” is considered male
○ “Color” is considered female
○ “Guns” is considered male
○ “Nails” is considered female
○ These associations depend on statistics, which come from the data (see the
sketch below)
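A minimal sketch with made-up toy data showing how such gender associations fall out of corpus counts alone; nothing here comes from the lecture's actual data:

```python
# Word-gender association as a smoothed log frequency ratio over
# two tiny, made-up samples of posts.
import math
from collections import Counter

male_posts = ["time to shave before work", "new guns and gear review"]
female_posts = ["new nail color ideas", "did my nails today"]

male_counts = Counter(w for p in male_posts for w in p.split())
female_counts = Counter(w for p in female_posts for w in p.split())

def association(word: str) -> float:
    """Positive = more 'male' in this sample (add-one smoothing)."""
    m = (male_counts[word] + 1) / (sum(male_counts.values()) + 1)
    f = (female_counts[word] + 1) / (sum(female_counts.values()) + 1)
    return math.log(m / f)

for w in ["shave", "guns", "color", "nails"]:
    print(w, round(association(w), 2))
```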
Applications of Stylometry:
● Forensics
● Author profiling → makes people wary of sharing information in public
○ Used in marketing
○ Ex: Cambridge Analytica case
● Political targeting
● Uncovering anonymous writers → through phrases that showed up in multiple books
○ Ex: Robert Galbraith's The Cuckoo's Calling and JK Rowling's Harry Potter →
both written by JK Rowling
Typical Prediction Pipeline:
● Author characteristics are identified and labelled
● Author profiling
● Plagiarism detection
● Genre or Topic classification
● Involves several steps:
○ Data Collection & Preprocessing: data gathering, text cleaning
○ Feature extraction: tokenization, feature selection, vectorization
○ Model building: Supervised learning, unsupervised learning
○ Model evaluation: cross-validation, metrics
○ Prediction & Interpretation
Assignment’s Data:
● Reddit Posts
● Distantly annotated
● Used flairs
○ Previous work: text reports
■ Extract labels from “I’m x”, “My faith…”, categorises authors
● Work under review → get sample
● Data sharing agreement
● Corpus contents contain noise and potential biases, as the data comes from the same
subreddit
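A minimal sketch of the distant-annotation idea from the “I'm x” bullet above; the pattern and example posts are hypothetical, not the assignment's actual extraction rules:

```python
# Extract candidate self-report labels from "I'm x" / "I am (a/an) x".
import re

SELF_REPORT = re.compile(r"\bI(?:['’]m|\s+am)\s+(?:an?\s+)?(\w+)")

posts = [
    "I'm Catholic and this question comes up a lot.",
    "As a kid I am always told to pray.",  # noisy: yields "always"
]
for post in posts:
    match = SELF_REPORT.search(post)
    if match:
        print(match.group(1))  # candidate label; needs filtering
```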
Psycholinguistics - debugging the task (to get the appropriate accuracy):
● Investigate bias in data and models
○ Look at data → there could be flaws
● Interpreting and explaining models
○ Transparency and accountability
● Create challenging test sets
○ Apply to our own characteristics, with the attributes we have (?)
Lecture 2: Collecting Data
Focused on text processing; a way to convert language to something more uniform.
Use the re library in Python for the processing.
FN and FP take on a task-specific meaning in text processing: a false positive is a string the
pattern matches but should not, and a false negative is a string the pattern should match but
misses.
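A small illustration with Python's re, using the classic “the” example (my own example, not from the slides):

```python
import re

text = "The other day the cat sat there."

# Naive pattern: misses "The" (false negative, case-sensitive) and
# matches inside "other" and "there" (false positives).
print(re.findall(r"the", text))

# Word boundaries plus a case-insensitive flag fix both error types.
print(re.findall(r"\bthe\b", text, flags=re.IGNORECASE))
```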
Pre-made data is unrealistic
● APIs and platforms (Kaggle, HuggingFace Data) are convenient
● Working with .csv (tabular) and otherwise structured data is a breeze
● Even structured data can be a mess
Real text data:
● Is obtained by scraping
● In essence is very simple