Class notes Language and AI

Course
Language and AI

Institution
Technische Universiteit Eindhoven (TUE)

Class notes from lecture 1 to lecture 6, includes important images or screenshots from slides.

Uploaded on August 19, 2024
Number of pages 31
Written in 2023/2024
Type Class notes
Professor(s) Chris Emmery
Contains 1 to 6

Education
Data Science

Julia Dobladez
Lecture Notes (2023/24)

Lecture 1: Introduction

Ambiguity of language - A sentence can have many interpretations:

● They saw a kid with a telescope
○ They had a telescope and saw the kid with it
○ The kid had a telescope
● Flying planes can be dangerous
○ Planes that fly can be dangerous
○ Flying the planes can be dangerous
● Time flies like an arrow
○ Could be an idiom: time passes very quickly

Practical Challenges - Sentiment Analysis:

● Negation: “This movie is definitely not bad”
● Dependencies: “Whoever thinks this film is incredibly bad is an idiot”
● Sarcasm: “This is the best movie ever, lol”
● Context: “This movie is only slightly better than Jaws”
○ Only slightly better seems pessimistic
○ But Jaws is a very good movie so then it means the movie was very good

The main issues are sarcasm detection and world knowledge. The task is not trivial because the
algorithm lacks world knowledge and cannot look it up (for example, by searching Google).
Ex: DeepMoji has learned to understand emotions and sarcasm based on emojis. It is still
constrained: it handled “flight is cancelled” well, but when the input was changed to “missed my
flight” it classified it as positive.
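To make these challenges concrete, here is a minimal sketch of a naive word-counting sentiment scorer. The word lists and the function name `naive_sentiment` are made up for this illustration and do not come from the lecture. The scorer ignores context, so it gets the negation, dependency, and sarcasm examples above wrong.

```python
# Toy word lists, invented for this illustration only.
POSITIVE = {"good", "best", "great"}
NEGATIVE = {"bad", "worst", "idiot"}

def naive_sentiment(text: str) -> int:
    """Return (#positive words - #negative words), ignoring all context."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

examples = [
    "This movie is definitely not bad",                        # negation
    "Whoever thinks this film is incredibly bad is an idiot",  # dependencies
    "This is the best movie ever, lol",                        # sarcasm
]
for sentence in examples:
    # A positive score means "positive review" according to the naive scorer.
    print(naive_sentiment(sentence), sentence)
```

The first two sentences are favourable to the movie but receive negative scores, and the sarcastic one receives a positive score.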

Language Acquisition Research - Naive Engineer Solution to Challenges: Teach the machine
language the same way we raise kids to acquire it. This is not possible because there is a
memory limit. Kids acquire language in these stages:
1. Kids need to learn how to segment speech: stress, rhythm, spotting words
2. They start mapping tokens to objects
3. Vocabulary spurt (around 2 years; growth is slow at first)
4. Learn words in context of known words (fast mapping)

Grounded Learning - Solution to Challenges: Includes vision, speech, etc.

Nascent fields:
● Linguistics: study of morphology, syntax, pragmatics, semantics, phonetics, etc. How
does language work? Largely unsolved
● Psycholinguistics: neurolinguistics, language acquisition. How does the brain process
language? Largely unsolved

Related Fields:
● Natural Language Processing (NLP): building systems for specific tasks (machine
translation, text-to-speech, summarization, language generation, question-answering,
etc.) Typically uses machine learning (nowadays)
● Computational Linguistics: combining language technologies to do
linguistically-motivated research
● Text Mining: combining language technologies to extract information from text data

History of NLP:
● Up until the 70s: everyone was really interested in Chomsky's Language Faculty.
Language is innate, so let's not learn from data, let's build grammars and ontologies.
We are born with certain parameters in our brains that make us very good language
learners.
● 70s-80s AI Winter: Machine Translation was more expensive than human translators. AI
was very popular, but since it couldn’t reach its goals, the funding stopped.
● 80s-10s: Linguistics started to reject Chomsky’s framework. The field started learning
from statistics, corpus analysis, and data; algorithms were designed for particular tasks.
● 10s-20s: Deep learning started to take over NLP, but tweaking models for specific tasks
took too much time and required hyperparameter tuning. Models could not be reused
across tasks because adapting them was very time-consuming.
● 20s-now: Neural Networks dominate NLP → transformers exist.

Psycholinguistics for NLP:
Many modern day systems are:
● Black boxes: they need creativity and linguistic questions to be interpreted
○ Communication: from an industry perspective it is important to communicate why a
system says y
● Biased: data is biased. Risk: harmful content and inferences in demos
● Imperfect: debugging models is required, task-specific expert knowledge required for
good test sets.
○ Deployment: NLP models are often put in ‘heuristic prisons’
Models therefore need research into interpretability, fairness, and consistency.

Language is complex because it is not easy to represent, mathematically interpret, infer, or
understand. For both classification and retrieval (essential to Text Mining), we need good
language representations.

Stylometry: involves analysing the linguistic style of written text to identify patterns,
characteristics, or authorship. It's like a linguistic fingerprint, using various quantitative
techniques to study the unique elements of writing styles, including vocabulary, sentence
structure, punctuation usage, and more. Computers are very good at picking up on the patterns
that compose the stylometry of an author.
● Authorship Identification: the process of determining the likely author of a text when
the authorship is either unknown or in dispute. It involves analyzing various linguistic
features and patterns within the text to identify unique stylistic elements associated with
specific authors.
○ Feature Extraction
■ N-grams can serve as input; weighting them is important
● An n-gram is a sequence of n consecutive words (or characters)
■ Distances may be used per author for qualitative analysis
○ Feature Selection
○ Model Building
○ Validation
● Can identify the cultural background, age, and occupation of the author by the way
they write
● Dog whistling
● Features in Stylometry that help in Authorship Identification
○ Gender
○ Age
○ Occupation
○ Education
○ Income
○ Nationality → translating from mother tongue to another language shows
difference
○ Politics
○ Religion
○ Mental health (?)
● Personality big 5 (names) found by researchers to give any correlation with anything
● Content words are the main issue when it comes to analysing Stylometry
○ Shave is considered male
○ Color considered female
○ Guns is considered male
○ Nails is considered female
○ It depends on statistics which come from the data
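The n-gram features mentioned under Feature Extraction can be sketched as follows. This is a minimal illustration using raw counts only; the function names `word_ngrams` and `char_ngrams` are chosen for this example, and the notes do not say which weighting scheme the course uses, so none is applied here.

```python
from collections import Counter

def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all sequences of n consecutive whitespace-separated words."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text: str, n: int) -> list[str]:
    """Return all substrings of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "they saw a kid with a telescope"

# Counting n-grams gives the statistics that a stylometric model learns from.
bigram_counts = Counter(word_ngrams(sentence, 2))
trigram_counts = Counter(char_ngrams(sentence, 3))
print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

Because the counts come straight from the data, any association such as "shave is considered male" reflects the corpus and not a rule built into the method.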

Applications of Stylometry:
● Forensics
● Author profiling → makes people wary of sharing information in public
○ Used in marketing
○ Ex: Cambridge Analytica case
● Political targeting
● Uncovering anonymous writers → through phrases that showed up in multiple books
○ Ex: The Cuckoo's Calling (published as Robert Galbraith) and Harry Potter
(J.K. Rowling) → both written by J.K. Rowling

Typical Prediction Pipeline:
● Author characteristics are identified and labelled
● Author profiling
● Plagiarism detection
● Genre or Topic classification
● Involves several steps:
○ Data Collection & Preprocessing: data gathering, text cleaning
○ Feature extraction: tokenization, feature selection, vectorization
○ Model building: Supervised learning, unsupervised learning
○ Model evaluation: cross-validation, metrics
○ Prediction & Interpretation
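The steps above can be sketched end to end with the standard library. This is only an illustration under stated assumptions: the training texts and author names are invented, the "model" is a simple per-author word-count profile compared by cosine similarity (a stand-in, not the method used in the course), and the evaluation step is left out for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Preprocessing + tokenization: lowercase, keep letter sequences."""
    return re.findall(r"[a-z']+", text.lower())

def vectorize(text: str) -> Counter:
    """Vectorization: bag-of-words counts."""
    return Counter(tokenize(text))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profiles(labelled_texts) -> dict[str, Counter]:
    """Model building: sum the count vectors of each author's texts."""
    profiles: dict[str, Counter] = {}
    for author, text in labelled_texts:
        profiles.setdefault(author, Counter()).update(vectorize(text))
    return profiles

def predict(profiles: dict[str, Counter], text: str) -> str:
    """Prediction: return the author whose profile is most similar."""
    vec = vectorize(text)
    return max(profiles, key=lambda author: cosine(profiles[author], vec))

# Hypothetical toy data, invented for this sketch.
train = [
    ("author_a", "honestly i reckon the weather is rather dull today"),
    ("author_a", "i reckon we shall see, honestly"),
    ("author_b", "lol the weather is so bad today lol"),
    ("author_b", "so bad lol, cannot even"),
]
profiles = build_profiles(train)
print(predict(profiles, "lol this is so dull"))
```

In a real setting the profiles would be built from much more text per author, and the predictions would be checked with cross-validation and metrics as listed in the pipeline.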

Assignment’s Data:
● Reddit Posts
● Distantly annotated
● Used flairs
○ Previous work: text reports
■ Extract labels from “I’m x”, “My faith…”, categorises authors
● Work under review → get sample
● Data sharing agreement
● The corpus contains noise and potential biases, as the data comes from the same
subreddit
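Distant annotation from self-reports such as “I’m x” can be sketched with a regular expression. The pattern below is a hypothetical example for an age label, not the one used to build the assignment data, and `distant_age_label` is a name chosen for this illustration.

```python
import re

# Hypothetical pattern: "I'm 23" or "I am 23" (one or two digits).
AGE_PATTERN = re.compile(r"\bI(?:'m| am) (\d{1,2})\b", re.IGNORECASE)

def distant_age_label(post: str):
    """Return the self-reported age found in a post, or None if there is none."""
    match = AGE_PATTERN.search(post.replace("’", "'"))  # normalize curly apostrophes
    return int(match.group(1)) if match else None

posts = [
    "I'm 23 and love hiking",
    "I am 31 years old",
    "I'm tired of this",
    "I'm 10 minutes late again",  # wrongly labelled: this is not an age
]
for post in posts:
    print(distant_age_label(post), post)
```

The last post shows why distantly annotated labels are noisy: the pattern matches text that is not actually a self-report.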

Psycholinguistics - debugging the task (gets the appropriate accuracy):
● Investigate bias in data and models
○ Look at data → there could be flaws
● Interpreting and explaining models
○ Transparency and accountability
● Create challenging test sets
○ Apply on self characteristics, with the attributes we have????

Lecture 2: Collecting Data

Focused on text processing; a way to convert language to something more uniform.

Use the re library in Python for the processing.
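As a minimal sketch of what such processing can look like, the function below uses `re.sub` to make text more uniform. The specific steps and the placeholder tokens `<url>` and `<num>` are assumptions made for this example, not steps prescribed in the lecture.

```python
import re

def normalize(text: str) -> str:
    """Convert raw text to a more uniform form with a few regex substitutions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)  # replace links with a placeholder
    text = re.sub(r"\d+", "<num>", text)           # replace runs of digits
    text = re.sub(r"\s+", " ", text)               # collapse repeated whitespace
    return text.strip()

print(normalize("Check  https://example.com NOW!!  It costs 20 euros."))
```

The order matters here: links are replaced before digits so that numbers inside a URL are not rewritten separately.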

New definition for false negatives (FN) and false positives (FP) in text processing.

Pre-made data is unrealistic
● APIs and platforms (Kaggle, HuggingFace Data) are convenient
● Working with .csv (tabular) and otherwise structured data is a breeze
● Even structured data can be a mess

Real text data:
● Is obtained by scraping
● In essence is very simple
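The core of scraping, turning a page's HTML into plain text, can be sketched with Python's built-in `html.parser`. This is an illustration only: the page is a hard-coded string (no request is made), and the class name `TextExtractor` and helper `html_to_text` are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Review</h1><p>This movie is <b>not</b> bad.</p></body></html>"
)
print(html_to_text(page))
```

Real pages are messier than this toy string, which is where the text cleaning step of the pipeline comes in.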
