1. Introduction
Natural Language Processing Approaches
● Rule-based (rationalism): hand-crafted rules, symbolic manipulation.
● Statistical (empiricism): data-driven (probabilistic or otherwise), shallow machine
learning.
● Massively parallel processing (deep learning): representation learning, human-like
performance.
▶ Natural language processing is about finding patterns in text and explaining them.
Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of rules, a computer can emulate natural language understanding by applying those rules to the data it encounters.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.
Structured
Labeled data in a (relational) database.
Unstructured
Free text.
Semi-Structured
A mixture of structured and unstructured data (e.g. a database + free-text notes).
Natural Language Processing Challenges
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.
Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engines, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.
Text Analysis Techniques
2. Text Analysis
Machine Learning
The use and development of computer systems that can learn and adapt without following
explicit instructions, using algorithms and statistical models to analyze and draw inferences
from patterns in data.
▶ Types of learning:
● Basic: supervised, unsupervised, reinforcement learning.
● Other: semi-supervised, transfer learning, active learning.
▶ Types of tasks:
● Classification.
● Regression.
● Similarity matching.
● Clustering.
● Anomaly detection.
● Co-occurrence grouping.
● Profiling.
● Link prediction.
● Data reduction.
● Causal modeling.
▶ Modeling methods:
● Linear regression.
● Logistic regression.
● Decision trees.
● K-nearest neighbors.
● Naive Bayes classification.
● Mixture models.
● Support vector machines.
● Neural networks.
● Fuzzy inference systems.
● Bayesian networks.
Measuring Classifier Performance
accuracy = (TP + TN) / (TP + TN + FP + FN)
error = 1 − accuracy
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · (precision · recall) / (precision + recall)
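A minimal sketch of these metrics in Python, computed directly from the four confusion-matrix counts (the function name is illustrative, not from the notes):

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, error, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "error": 1 - accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Example: 40 TP, 10 FP, 5 FN, 45 TN → accuracy 0.85, precision 0.80, recall ≈ 0.89.
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))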
Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.
k = (a − p) / (1 − p)
● a: accuracy.
● p: the probability of predicting the correct class due to chance.
▶ If k = 1 → perfect model.
▶ If k ≈ 0 → no better than random guessing.
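A minimal sketch of the computation; estimating the chance probability p from the class distribution is left to the caller here:

def kappa(accuracy, chance):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    return (accuracy - chance) / (1 - chance)

# Example: 80% accuracy where guessing the majority class is right 50% of the time.
print(kappa(accuracy=0.80, chance=0.50))  # 0.6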
Kappa Curves
Used to select the optimal prediction threshold.
▶ AUK: area under the kappa curve.
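An illustrative sketch (not from the notes): sweep the prediction threshold, compute kappa at each point with scikit-learn's cohen_kappa_score, pick the threshold that maximizes kappa, and approximate AUK with the trapezoidal rule. Kappa is plotted against the threshold here; exact AUK definitions may use a different x-axis.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_curve(y_true, y_prob, thresholds):
    """Kappa of the thresholded predictions, one value per threshold."""
    return [cohen_kappa_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]

# Toy data: predicted probabilities from some binary classifier.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.3, 0.7, 0.5])
ts = np.linspace(0.05, 0.95, 19)
ks = kappa_curve(y_true, y_prob, ts)
auk = sum((ks[i] + ks[i + 1]) / 2 * (ts[i + 1] - ts[i]) for i in range(len(ts) - 1))
print("best threshold:", ts[int(np.argmax(ks))], "AUK:", round(auk, 3))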
Experiment Design
Cross Validation
● Split the data into k equally sized groups (folds).
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups, so each fold serves as the test set once.
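A minimal sketch of k-fold cross validation with scikit-learn (the classifier and the synthetic data are illustrative):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)          # toy feature matrix
y = np.random.randint(0, 2, 100)    # toy binary labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # held-out accuracy

print("mean accuracy over 5 folds:", np.mean(scores))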
CRISP-DM Framework
An iterative process model for data mining projects: business understanding → data
understanding → data preparation → modeling → evaluation → deployment.
Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that form a useful semantic unit of
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.
▶ Fix ambiguity → domain application.
▶ Fix variation → text normalization.
Domain Application
● Text type or communication context (e.g. letters, tweets, chats, reports, news stories,
scientific articles).
● Application domain: area of application.
○ Topics & content.
○ Vocabulary use: terminology, jargon, general.
○ Writing style: formal, informal.
○ Languages.
● Corpus characteristics:
○ Text format: annotations, text, XML, HTML.
○ Text encoding: ASCII, UTF-8.
○ Text unit: documents, paragraphs, sentences, phrases.
○ Text unit length.
○ Vocabulary richness/variations.
○ Document structure (e.g. articles, Wikipedia).
○ Corpus homogeneity (e.g. Wikipedia, news).
Domain Considerations
● Data size.
● Private & sensitive data.
● Ethical issues.
Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.
▶ For a good understanding, read some documents → look for patterns.
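A rough sketch of the statistics above over a toy corpus; whitespace tokenization and splitting sentences on punctuation are simplifying assumptions:

import re
from collections import Counter

corpus = ["The patient was admitted. The patient recovered.",
          "Discharge note: no complications observed."]

tokens = [t.lower() for doc in corpus for t in doc.split()]
sentences = [s for doc in corpus for s in re.split(r"[.!?]+", doc) if s.strip()]

print("documents:", len(corpus))
print("words:", len(tokens))
print("word frequency:", Counter(tokens).most_common(3))
print("lexical variation:", len(set(tokens)) / len(tokens))
print("avg sentence length:", len(tokens) / len(sentences))
print("avg document length:", len(tokens) / len(corpus))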
Preprocessing Text
Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
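A minimal sketch, assuming tweets are plain strings and relevance means containing a given hashtag:

def filter_by_hashtag(docs, hashtag):
    """Keep only documents that mention the hashtag (case-insensitive)."""
    return [d for d in docs if hashtag.lower() in d.lower()]

tweets = ["Loving the new #NLP course!",
          "Nice weather today.",
          "Topic modeling demo #NLP #ML"]
print(filter_by_hashtag(tweets, "#NLP"))  # two matching tweets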
Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.
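A minimal sketch using the Tesseract engine through the pytesseract wrapper (assumes Tesseract, pytesseract and Pillow are installed; the file name is hypothetical):

from PIL import Image
import pytesseract

# Convert a scanned page image into raw text; expect OCR noise that
# later normalization steps must handle.
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)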