Class notes Language and AI

Class notes from lecture 1 to lecture 6, includes important images or screenshots from slides.

  • 19 August 2024
  • 31 pages
  • 2023/2024
  • Class notes
  • Chris Emmery
  • Lectures 1 to 6
Julia Dobladez
Lecture Notes (2023/24)

Lecture 1: Introduction

Ambiguity of language - Many interpretations of one sentence:

● They saw a kid with a telescope
○ They had a telescope and saw the kid with it
○ The kid had a telescope
● Flying planes can be dangerous
○ Planes that fly can be dangerous
○ Flying the planes can be dangerous
● Time flies like an arrow
○ Could be an idiom: time passes very quickly

Practical Challenges - Sentiment Analysis:

● Negation: “This movie is definitely not bad”
● Dependencies: “Whoever thinks this film is incredibly bad is an idiot”
● Sarcasm: “This is the best movie ever, lol”
● Context: “This movie is only slightly better than Jaws”
○ Only slightly better seems pessimistic
○ But Jaws is a very good movie so then it means the movie was very good
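
A minimal sketch of why these cases are hard, assuming a naive word-counting approach. The lexicon and the function `naive_sentiment` are hypothetical and only for illustration; they are not from the lecture.

```python
# Hypothetical toy lexicon; real sentiment lexicons are far larger.
LEXICON = {"bad": -1, "idiot": -1, "best": 1, "better": 1}

def naive_sentiment(text: str) -> int:
    """Sum word polarities, ignoring negation, syntax, sarcasm, and context."""
    tokens = text.lower().replace(",", " ").split()
    return sum(LEXICON.get(tok, 0) for tok in tokens)

examples = [
    "This movie is definitely not bad",              # negation
    "This is the best movie ever, lol",              # sarcasm
    "This movie is only slightly better than Jaws",  # world knowledge
]
for sentence in examples:
    print(naive_sentiment(sentence), sentence)
```

The negated sentence scores -1 even though it is a positive review, and the sarcastic one scores +1. Word counts alone cannot see either effect.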

The main issues are sarcasm detection and world knowledge. The task is not trivial because the
algorithm lacks world knowledge and cannot look it up, for example by searching Google.
Ex: DeepMoji has learned to recognise emotions and sarcasm based on emojis. It is still
limited: it did well on “flight is cancelled”, but when the input was changed to “missed my
flight” it classified it as positive.

Language Acquisition Research - Naive Engineer Solution to Challenges: treat the machine the
same way we raise kids to acquire language. This is not possible because there is a memory
limit. Kids acquire language in stages:
1. Kids need to learn how to segment speech: stress, rhythm, spotting words
2. They start mapping tokens to objects
3. Vocabulary spurt (around 2 years; slow at first)
4. They learn words in the context of known words (fast mapping)

Grounded Learning - Solution to Challenges: Includes vision, speech, etc.

Nascent fields:
● Linguistics: study of morphology, syntax, pragmatics, semantics, phonetics, etc. How
does language work? Largely unsolved
● Psycholinguistics: neurolinguistics, language acquisition. How does the brain process
language? Largely unsolved




Related Fields:
● Natural Language Processing (NLP): building systems for specific tasks (machine
translation, text-to-speech, summarization, language generation, question-answering,
etc.) Typically uses machine learning (nowadays)
● Computational Linguistics: combining language technologies to do
linguistically-motivated research
● Text Mining: combining language technologies to extract information from text data

History of NLP:
● Up until the 70s: everyone was really interested in Chomsky's Language Faculty.
Language is innate, so let's not learn from data; let's build grammars and ontologies.
We are born with certain parameters in our brains that make us very good language
learners.
● 70s-80s, AI Winter: Machine Translation was more expensive than human translators. AI
was very popular, but since it couldn't reach its goals, the funding stopped.
● 80s-10s: Linguistics started to reject Chomsky's framework. The field started learning from
statistics, corpus analysis, and data; algorithms were designed for particular tasks.
● 10s-20s: Deep learning started to take over NLP, but tweaking models for specific tasks
took too much time and required hyperparameter tuning. Models could not be reused
across various tasks because that was very time-consuming.
● 20s - now: Neural Networks dominate NLP → transformers exist.

Psycholinguistics for NLP:
Many modern day systems are:
● Black boxes: they need creativity and linguistic questions to be interpreted
○ Communication: important from an industry perspective to communicate why a
system says y
● Biased: data is biased. Risk: harmful content and inferences in demos
● Imperfect: debugging models is required; task-specific expert knowledge is required for
good test sets.
○ Deployment: NLP models are often put in ‘heuristic prisons’
Models therefore need research into interpretability, fairness, and consistency.

Language is complex because it is not easy to represent, mathematically interpret, infer, or
understand. For both classification and retrieval (essential to Text Mining), we need good
language representations.

Stylometry: involves analysing the linguistic style of written text to identify patterns,
characteristics, or authorship. It's like a linguistic fingerprint, using various quantitative
techniques to study the unique elements of writing styles, including vocabulary, sentence
structure, punctuation usage, and more. Computers are very good at picking up on the patterns
that compose the stylometry of an author.
● Authorship Identification: the process of determining the likely author of a text when
the authorship is either unknown or in dispute. It involves analyzing various linguistic
features and patterns within the text to identify unique stylistic elements associated with
specific authors.
○ Feature Extraction
■ N-grams can serve as input. Weighting is important
● N-gram: a combination of n consecutive words
■ Distances may be used per author for qualitative analysis
○ Feature Selection
○ Model Building
○ Validation
● Can identify the cultural background, age, and occupation of the author by the way
they write
● Dog whistling
● Features in Stylometry that help in Authorship Identification
○ Gender
○ Age
○ Occupation
○ Education
○ Income
○ Nationality → translating from mother tongue to another language shows
difference
○ Politics
○ Religion
○ Mental health (?)
● Personality big 5 (names) found by researchers to give any correlation with anything
● Content words are the main issue when it comes to analysing Stylometry
○ “Shave” is considered male
○ “Color” is considered female
○ “Guns” is considered male
○ “Nails” is considered female
○ This depends on statistics, which come from the data
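
The n-gram features listed under Feature Extraction can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and relative frequency as the weighting; the function names `word_ngrams` and `ngram_profile` are hypothetical and the lecture may use a different weighting scheme.

```python
from collections import Counter

def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_profile(text: str, n: int = 2) -> dict[tuple[str, ...], float]:
    """Relative frequency of each n-gram: one simple way to weight features."""
    counts = Counter(word_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

profile = ngram_profile("the kid saw the kid with the telescope", n=2)
print(profile[("the", "kid")])  # share of bigrams that are ("the", "kid")
```

Comparing such profiles between a disputed text and each candidate author is one way to get a per-author distance.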

Applications of Stylometry:
● Forensics
● Author profiling → makes people wary of sharing information in public
○ Used in marketing
○ Ex: Cambridge Analytica case
● Political targeting
● Uncovering anonymous writers → through phrases that showed up in multiple books
○ Ex: Robert Galbraith's The Cuckoo's Calling and J.K. Rowling's Harry Potter → both
written by J.K. Rowling

Typical Prediction Pipeline:
● Author characteristics are identified and labelled
● Author profiling
● Plagiarism detection
● Genre or Topic classification
● Involves several steps:
○ Data Collection & Preprocessing: data gathering, text cleaning
○ Feature extraction: tokenization, feature selection, vectorization
○ Model building: Supervised learning, unsupervised learning
○ Model evaluation: cross-validation, metrics
○ Prediction & Interpretation
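
The pipeline steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the data is made up, the model is a small multinomial Naive Bayes with add-one smoothing (the notes do not name a model), and the evaluation is leave-one-out cross-validation. All function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Made-up toy corpus of (text, author label) pairs.
DATA = [
    ("i love my garden and my tea", "author_a"),
    ("tea in the garden is lovely", "author_a"),
    ("the garden needs more tea", "author_a"),
    ("the match was a great game", "author_b"),
    ("we won the game last night", "author_b"),
    ("great goal in the match", "author_b"),
]

def tokenize(text: str) -> list[str]:
    """Preprocessing and tokenization: lowercase, keep alphabetic runs."""
    return re.findall(r"[a-z']+", text.lower())

def train(docs):
    """Model building: per-label word counts, label counts, and vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict(model, text: str) -> str:
    """Prediction: label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so a missing word does not zero out a label.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

def leave_one_out_accuracy(docs) -> float:
    """Evaluation: train on all documents but one, test on the held-out one."""
    correct = 0
    for i, (text, label) in enumerate(docs):
        model = train(docs[:i] + docs[i + 1:])
        correct += predict(model, text) == label
    return correct / len(docs)

model = train(DATA)
print(predict(model, "tea in my garden"))
print(leave_one_out_accuracy(DATA))
```

For interpretation, the per-label word counts in `model` show which words drive a prediction, which connects to the point about content words dominating stylometry.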

Assignment’s Data:
● Reddit Posts
● Distantly annotated
● Used flairs
○ Previous work: text reports
■ Extract labels from “I’m x”, “My faith…”, categorises authors
● Work under review → get sample
● Data sharing agreement
● The corpus contains noise and potential biases, as the data comes from the same
subreddit
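
Distant annotation from self-reports of the form “I’m x” can be sketched with a regular expression. This is an assumption-heavy illustration: the pattern `SELF_REPORT`, the label set `KNOWN_LABELS`, and the function `distant_label` are hypothetical and not the patterns used for the assignment data.

```python
import re
from typing import Optional

# Hypothetical self-report pattern: "I'm (a/an) <word>" or "I am (a/an) <word>".
SELF_REPORT = re.compile(r"\bI(?:['’]m| am) (?:an? )?(\w+)", re.IGNORECASE)

# Hypothetical label set; matches outside it are ignored to limit noise.
KNOWN_LABELS = {"teacher", "nurse", "student", "engineer"}

def distant_label(post: str) -> Optional[str]:
    """Return the first self-reported label found in the post, if any."""
    for match in SELF_REPORT.finditer(post):
        candidate = match.group(1).lower()
        if candidate in KNOWN_LABELS:
            return candidate
    return None

print(distant_label("Honestly, I'm a nurse and I work nights."))
print(distant_label("I am tired. I'm an engineer."))
print(distant_label("Nothing to see here"))
```

Labels obtained this way are noisy: a post such as “I'm a nurse, said my sister” would still be labelled, which is one source of the noise mentioned above.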

Psycholinguistics - debugging the task (gets the appropriate accuracy):
● Investigate bias in data and models
○ Look at data → there could be flaws
● Interpreting and explaining models
○ Transparency and accountability
● Create challenging test sets
○ Apply on self characteristics, with the attributes we have????

Lecture 2: Collecting Data

Focused on text processing; a way to convert language to something more uniform.

Use the re library in Python for the processing.
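
A minimal sketch of such processing with `re`, assuming a few common normalization steps. The function name `normalize` and the placeholder tokens `<url>` and `<num>` are illustrative choices, not the lecture's own.

```python
import re

def normalize(text: str) -> str:
    """Map raw text to a more uniform form using regular expressions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)  # replace links with a placeholder
    text = re.sub(r"\d+", "<num>", text)            # collapse all numbers
    text = re.sub(r"[^\w\s<>]", " ", text)          # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()        # squeeze whitespace
    return text

print(normalize("Check https://example.com NOW!!!  It's 100% free..."))
# → check <url> now it s <num> free
```

Note the side effect on “It's”, which becomes “it s”: each rule makes text more uniform but can also destroy information, so the order and scope of the rules matter.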

New definition for FN and FP in text processing.

Pre-made data is unrealistic
● APIs and platforms (Kaggle, HuggingFace Data) are convenient
● Working with .csv (tabular) and otherwise structured data is a breeze
● Even structured data can be a mess

Real text data:
● Is obtained by scraping
● In essence is very simple

