▶ Natural language processing is about finding patterns in text and explaining them.
Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of hand-written rules, a computer emulates natural language
understanding by applying those rules to the data it encounters.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.
Structured
Labeled data in a (relational) database.
Unstructured
Free text.
Semi-Structured
A mixture of structured and unstructured data (e.g. a database + free-text notes).
Natural Language Processing Challenges
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.
Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engines, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.
Text Analysis Techniques
2. Text Analysis
Machine Learning
Use and develop computer systems that can learn and adapt without following explicit
instructions by using algorithms and statistical models to analyze and draw inferences from
patterns in data.
Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.
$$k = \frac{a - p}{1 - p}$$
● 𝑎: accuracy
● 𝑝: the probability of predicting the correct class due to chance.
▶ If 𝑘 = 1 → perfect model.
▶ If 𝑘 ≈ 0 → no better than random guessing.
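A minimal Python sketch of the kappa formula above. The function names `kappa` and `cohen_kappa` are hypothetical, and estimating 𝑝 from the label frequencies of the true and predicted classes (as in Cohen's kappa) is an assumption; the notes do not say how 𝑝 is obtained.

```python
def kappa(accuracy, chance_agreement):
    """Kappa as defined above: k = (a - p) / (1 - p)."""
    if chance_agreement >= 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (accuracy - chance_agreement) / (1.0 - chance_agreement)


def cohen_kappa(y_true, y_pred):
    """Compute kappa from two equally long lists of class labels.

    Assumption: p is estimated as the probability that two independent
    raters with these label frequencies agree by chance.
    """
    n = len(y_true)
    # a: observed accuracy (fraction of matching labels)
    a = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # p: expected agreement by chance, summed over all classes
    labels = set(y_true) | set(y_pred)
    p = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return kappa(a, p)


perfect = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])      # all predictions correct
random_like = cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0])  # accuracy equals chance
```

Here `perfect` evaluates to 1 (perfect model) and `random_like` to 0 (no better than random guessing), matching the two cases listed above.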
Kappa Curves
Used to select the optimal prediction threshold.
▶ AUK: area under the kappa curve.
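One way to use kappa for threshold selection, sketched in Python under assumptions: a binary task with labels 0/1, a model that outputs a score per item, and a hand-picked list of candidate thresholds. The helper names are hypothetical, and chance agreement is estimated from label frequencies, which the notes do not specify.

```python
def kappa_at_threshold(y_true, scores, threshold):
    """Kappa of the binary predictions obtained by cutting scores at threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    n = len(y_true)
    # a: observed accuracy at this threshold
    a = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # p: chance agreement from the class frequencies (assumed estimator)
    p = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in (0, 1))
    if p == 1.0:
        return 0.0  # convention chosen here: degenerate case counts as chance level
    return (a - p) / (1.0 - p)


def best_threshold(y_true, scores, thresholds):
    """Return the candidate threshold with the highest kappa."""
    return max(thresholds, key=lambda t: kappa_at_threshold(y_true, scores, t))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.9]
chosen = best_threshold(y_true, scores, [0.25, 0.5, 0.75])
```

With these toy scores, the threshold 0.5 separates the two classes perfectly, so it has the highest kappa of the three candidates.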
Experiment Design
Cross Validation
● Split data into groups of the same size.
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups.
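The three steps above can be sketched in plain Python. The function name `k_fold_splits` is hypothetical; this sketch does not shuffle the data (shuffling first is usually advisable), and when the data does not divide evenly the first folds get one extra item.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, holding each of the k folds aside once."""
    items = list(items)
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Fold sizes differ by at most one when len(items) is not divisible by k.
    base, extra = divmod(len(items), k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = items[start:start + size]                 # held-aside group
        train = items[:start] + items[start + size:]     # remaining groups
        yield train, test
        start += size


splits = list(k_fold_splits(range(6), 3))
```

Each item appears in exactly one test group, so every data point is used for testing once and for training k - 1 times.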
CRISP-DM Framework
Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that forms a useful semantic unit for
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.
Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.
▶ For a good understanding, read some of the documents yourself and look for patterns.
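The statistics listed above can be computed with the Python standard library. This is a rough sketch: the function name `corpus_statistics`, the regex-based tokenizer, and the sentence splitter on `.`, `!` and `?` are simplifying assumptions, not a prescribed method.

```python
import re
from collections import Counter


def corpus_statistics(corpus):
    """Compute basic statistics for a corpus given as a list of document strings."""
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in corpus]
    tokens = [tok for doc in tokenized for tok in doc]
    # Assumed sentence splitter: break on ., ! and ? and drop empty pieces.
    sentences = [s for doc in corpus for s in re.split(r"[.!?]+", doc) if s.strip()]
    frequency = Counter(tokens)
    return {
        "document_count": len(corpus),
        "word_count": len(tokens),
        "word_frequency": frequency,
        # unique words / total words
        "lexical_variation": len(frequency) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "avg_document_length": len(tokens) / len(corpus) if corpus else 0.0,
    }


stats = corpus_statistics(["The cat sat. The cat ran.", "A dog barked."])
```

For this toy corpus of two documents, `stats["word_count"]` is 9 and `stats["word_frequency"].most_common(3)` shows the most frequent words, which is a quick first look at the patterns in the text.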
Preprocessing Text
Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
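A small sketch of hashtag-based filtering in Python. The function name `filter_by_hashtag` and the example tweets are made up for illustration; matching is case-insensitive and requires the whole hashtag, which is one possible design choice.

```python
import re


def filter_by_hashtag(documents, hashtag):
    """Keep only the documents that contain the given hashtag as a whole tag."""
    tag = re.escape(hashtag.lstrip("#"))
    # The lookarounds prevent partial matches such as #nlproc for the tag "nlp".
    pattern = re.compile(r"(?<!\w)#" + tag + r"(?!\w)", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


tweets = [
    "Loving #NLP today",
    "No tags here",
    "#nlproc is different",
    "Read this #nlp!",
]
relevant = filter_by_hashtag(tweets, "nlp")
```

Only the first and last tweet are kept: the second has no hashtag and the third uses a different, longer tag.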
Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.