JADS Master - Natural Language Processing Summary


Summary for the Natural Language Processing course of the Master Data Science and Entrepreneurship.

Preview: 4 of 31 pages

  • 9 January 2023
  • 31 pages
  • 2021/2022
  • Summary
  • Seller: tomdewildt

1. Introduction
Natural Language Processing Approaches
● Rule-based (rationalism): hand-crafted rules, symbolic manipulation.
● Statistical (empiricism): data-driven (probabilistic or otherwise), shallow machine
learning.
● Massively parallel processing (deep learning): representation learning, human-like
performance.

▶ Natural language processing is about finding patterns in text and explaining them.

Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of rules, a computer emulates natural language
understanding by applying those rules to the data it is confronted with.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.

Structured
Labeled data in a (relational) database.

Unstructured
Free text.

Semi-Structured
A mixture of structured and unstructured data (e.g. a database with free-text notes).

Natural Language Processing Challenges
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.

Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engine, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.




Text Analysis Techniques

2. Text Analysis
Machine Learning
Use and develop computer systems that can learn and adapt without following explicit
instructions by using algorithms and statistical models to analyze and draw inferences from
patterns in data.

▶ Types of learning:
● Basic: supervised, unsupervised, reinforcement learning.
● Other: semi-supervised, transfer learning, active learning.
▶ Types of tasks:
● Classification.
● Regression.
● Similarity matching.
● Clustering.
● Anomaly detection.
● Co-occurrence grouping.
● Profiling.
● Link prediction.
● Data reduction.
● Causal modeling.
▶ Modeling methods:
● Linear regression.
● Logistic regression.
● Decision trees.
● K-nearest neighbors.
● Naive Bayes classification.
● Mixture models.
● Support vector machines.
● Neural networks.
● Fuzzy inference systems.
● Bayesian networks.

Measuring Classifier Performance
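$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{error} = 1 - \text{accuracy}$$
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$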
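
As a minimal sketch, the metrics above can be computed directly from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumption, not something the course notes specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, error, precision, recall and f1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: fall back to 0.0 when a denominator is zero
    # (no positive predictions, or no positive examples).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, accuracy is 70/100, precision is 40/50 and recall is 40/60, which shows how a model can be fairly precise while still missing a third of the positives.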




Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.

$$k = \frac{a - p}{1 - p}$$
● 𝑎: accuracy
● 𝑝: the probability of predicting the correct class due to chance.

▶ If 𝑘 = 1 → perfect model.
▶ If 𝑘 ≈ 0 → no better than random guessing.
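
The notes do not say how the chance probability 𝑝 is obtained. A minimal sketch, assuming the usual Cohen's kappa estimate for a binary classifier (the chance that prediction and truth agree if they were independent, computed from the row and column totals of the confusion matrix), could look like this. The function name `kappa` and the example counts are illustrative only.

```python
def kappa(tp, fp, fn, tn):
    """Kappa for a binary classifier: (a - p) / (1 - p)."""
    total = tp + fp + fn + tn
    a = (tp + tn) / total  # observed accuracy
    # Assumption: p is the expected agreement under independence,
    # summed over the positive and the negative class.
    p_positive = ((tp + fp) / total) * ((tp + fn) / total)
    p_negative = ((fn + tn) / total) * ((fp + tn) / total)
    p = p_positive + p_negative
    if p == 1:
        # Degenerate case (only one class present and predicted): kappa is
        # undefined; returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (a - p) / (1 - p)


print(kappa(tp=40, fp=10, fn=20, tn=30))  # accuracy 0.7, chance agreement 0.5
print(kappa(tp=50, fp=0, fn=0, tn=50))    # perfect model
```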

Kappa Curves
Used to select the optimal prediction threshold.

▶ AUK: area under the kappa curve.
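
One way to read "select the optimal prediction threshold" is to compute kappa at several candidate thresholds and keep the one with the highest value. The sketch below does that for a binary classifier that outputs scores. The helper names, the labels, the scores, and the candidate thresholds are all hypothetical, and the chance term again assumes the Cohen's kappa estimate from class proportions. It does not compute AUK.

```python
def kappa_from_labels(y_true, y_pred):
    """Kappa of predicted vs. true binary labels (0 or 1)."""
    n = len(y_true)
    a = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Assumption: chance agreement from the class proportions of both label lists.
    p_chance = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in (0, 1))
    return (a - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0


def best_threshold(y_true, scores, thresholds):
    """Return ((threshold, kappa) with the highest kappa, full kappa curve)."""
    curve = []
    for threshold in thresholds:
        y_pred = [1 if score >= threshold else 0 for score in scores]
        curve.append((threshold, kappa_from_labels(y_true, y_pred)))
    return max(curve, key=lambda pair: pair[1]), curve


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.3, 0.6, 0.8, 0.9, 0.2]
best, curve = best_threshold(y_true, scores, thresholds=[0.2, 0.4, 0.7, 0.95])
print(best)
print(curve)
```

In this toy data a threshold of 0.4 separates the classes perfectly, so it yields the highest kappa; a threshold that predicts only one class yields a kappa of 0.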

Experiment Design




Cross Validation
● Split data into groups of the same size.
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups.
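
The three steps above can be sketched in plain Python as a generator of train/test index splits. The name `k_fold_splits` is illustrative. When the number of samples is not divisible by the number of groups, this sketch lets the first groups take one extra sample, which is an assumption. Shuffling the data beforehand is left out.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k groups."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Assumption: the first `extra` groups are one sample larger.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]                  # held-aside group
        train = indices[:start] + indices[start + size:]    # remainder
        yield train, test
        start += size


splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, and the reported performance is typically the average over all groups.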

CRISP-DM Framework




Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that form a useful semantic unit of
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.


▶ Fix ambiguity → domain application.
▶ Fix variation → text normalization.
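
The notes do not list specific normalization steps at this point. As a rough sketch, assuming normalization means Unicode folding, lowercasing, and removing punctuation, a basic normalizer could look like the following. The function name `normalize` is hypothetical, and spelling variation and synonyms would need additional resources such as dictionaries.

```python
import re
import unicodedata


def normalize(text):
    """Reduce surface variation: Unicode form, case, punctuation, whitespace."""
    # NFKC folds compatibility characters (e.g. full-width letters) into standard ones.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("  The  COLOUR, the Colour!  "))
```

After this step, "COLOUR," and "Colour!" map to the same token, so they are counted as one word instead of two.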

Domain Application
● Text type or communication context (e.g. letters, tweets, chats, reports, news stories,
scientific articles).
● Application domain: area of application.
○ Topics & content.
○ Vocabulary use: terminology, jargon, general.
○ Writing style: formal, informal.
○ Languages.
● Corpus characteristics:
○ Text format: annotations, text, XML, HTML.
○ Text encoding: ASCII, UTF-8.
○ Text unit: documents, paragraphs, sentences, phrases.
○ Text unit length.
○ Vocabulary richness/variations.
○ Document structure (e.g. articles, Wikipedia, etc.).
○ Corpus homogeneity (e.g. Wikipedia, news, etc.).

Domain Considerations
● Data size.
● Private & sensitive data.
● Ethical issues.

Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.

▶ For a good understanding, read some documents → look for patterns.
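
The corpus statistics listed above can be sketched with the standard library. The tokenizer (lowercased `\w+` matches) and the sentence splitter (splitting on `.`, `!` and `?`) are deliberately naive assumptions, the function name `corpus_statistics` and the two-document corpus are made up, and the sketch assumes a non-empty corpus.

```python
import re
from collections import Counter


def corpus_statistics(corpus):
    """Basic statistics for a non-empty list of document strings."""
    tokens_per_doc = [re.findall(r"\w+", doc.lower()) for doc in corpus]
    all_tokens = [token for tokens in tokens_per_doc for token in tokens]
    # Naive sentence split; real text needs a proper sentence splitter.
    sentences = [s for doc in corpus for s in re.split(r"[.!?]+", doc) if s.strip()]
    return {
        "document_count": len(corpus),
        "word_count": len(all_tokens),
        "word_frequency": Counter(all_tokens),
        # Lexical variation: unique words / total words.
        "lexical_variation": len(set(all_tokens)) / len(all_tokens),
        "avg_sentence_length": len(all_tokens) / len(sentences),
        "avg_document_length": len(all_tokens) / len(corpus),
    }


corpus = ["The cat sat. The cat slept.", "A dog barked!"]
stats = corpus_statistics(corpus)
print(stats)
```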

Preprocessing Text




Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
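
The hashtag example can be sketched as a simple regular-expression filter. The function name `filter_by_hashtag` and the sample tweets are hypothetical. Matching is case-insensitive and stops at a word boundary, so `#nlproc` does not count as `#nlp`.

```python
import re


def filter_by_hashtag(documents, hashtag):
    """Keep only the documents that contain the given hashtag."""
    tag = hashtag.lstrip("#")
    # \b prevents matching a longer hashtag that merely starts with the tag.
    pattern = re.compile(rf"#{re.escape(tag)}\b", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


tweets = [
    "Loving #NLP today",
    "No tags here",
    "#nlproc is not the same tag",
    "Reading about #nlp.",
]
relevant = filter_by_hashtag(tweets, "#nlp")
print(relevant)
```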

Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.

