What is an n-gram? - answer An n-gram model is a type of probabilistic language model that
predicts the next item in a sequence using an (n − 1)-order Markov model: the next item
depends only on the previous n − 1 items.
Two benefits of n-gram models (and algorithms that use them) are simplicity and
scalability: with a larger n, a model can store more context with a well-understood
space-time tradeoff, enabling small experiments to scale up efficiently.
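A minimal sketch of the idea, assuming whitespace-tokenized text and plain maximum-likelihood counts (no smoothing). The function names `train_ngram` and `next_token_probs` and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=2):
    """Count which token follows each (n - 1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        following = tokens[i + n - 1]
        model[context][following] += 1
    return model

def next_token_probs(model, context):
    """Maximum-likelihood estimate of P(next token | context)."""
    counts = model.get(tuple(context), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # context never seen in training
    return {tok: c / total for tok, c in counts.items()}

# Illustrative toy corpus (an assumption, not from the notes)
tokens = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(tokens, n=2)
probs = next_token_probs(bigram, ["the"])
print(probs)
```

With n = 2 the context is a single word. Raising n makes the context longer, which is the space-time tradeoff mentioned above: more context, but many more distinct contexts to store.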
approximate matching - answer Approximate string matching (or fuzzy string searching)
is the technique of finding strings that match a pattern approximately (rather than
exactly).
- The problem of approximate string matching is typically divided into two sub-problems:
finding approximate substring matches inside a given string and finding dictionary
strings that match the pattern approximately.
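The dictionary sub-problem can be sketched as follows. The notes do not say how "approximately" is measured; this sketch assumes Levenshtein edit distance, which is one common choice. The helper name `fuzzy_lookup` and the `max_dist` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

def fuzzy_lookup(pattern, dictionary, max_dist=1):
    """Return dictionary words within max_dist edits of pattern."""
    return [w for w in dictionary if edit_distance(pattern, w) <= max_dist]

print(fuzzy_lookup("cat", ["cat", "cart", "dog", "bat"]))
```

This scans the whole dictionary on every query, which is fine for a small word list. Large dictionaries usually call for an index rather than a linear scan.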
What is NLP? - answer - NLP is an automated way to understand or analyze natural
languages and extract required information from such data by applying machine
learning algorithms.
List some components of NLP. - answer Below are a few major components of NLP.
- Entity extraction: It involves segmenting a sentence to identify and extract entities,
such as a person (real or fictional), organization, geographies, events, etc.
- Syntactic analysis: It checks the ordering of words in a sentence against the rules of grammar.
- Pragmatic analysis: It interprets text in its context to recover the intended meaning,
as part of the process of extracting information from text.
List some areas of NLP. - answer Natural Language Processing can be used for:
Semantic Analysis
Automatic summarization
Text classification
Question Answering
Some real-life examples of NLP are Siri on iOS, Google Assistant, and Amazon Echo.
Define the NLP terminology. - answer NLP terminology can be grouped into the following
areas:
Weights and Vectors: TF-IDF, length(TF-IDF, doc), Word Vectors, Google Word Vectors
Text Structure: Part-Of-Speech Tagging, Head of sentence, Named entities
Sentiment Analysis: Sentiment Dictionary, Sentiment Entities, Sentiment Features
Text Classification: Supervised Learning, Train Set, Dev (= Validation) Set, Test Set, Text
Features, LDA
Machine Reading: Entity Extraction, Entity Linking, DBpedia, FRED (lib) / Pikes
What is the significance of TF-IDF? - answer - TF-IDF stands for term frequency-inverse
document frequency.
- TF-IDF is one of the most popular term-weighting schemes.
- TF-IDF reflects how important a word is to a document in a collection of documents
(a corpus).
- TF-IDF is used in recommender systems, search engines, stop-word filtering, text
summarization, and classification.
Why IDF:
- IDF (inverse document frequency) measures whether a word is common or rare across
all documents.
- Example: take the terms "the", "brown", and "cow". Because "the" is so common, term
frequency alone will tend to incorrectly emphasize documents which happen to use the
word "the" more frequently, without giving enough weight to the more meaningful terms
"brown" and "cow". The term "the" is not a good keyword to distinguish relevant and
non-relevant documents and terms, unlike the less common words "brown" and "cow".
Hence an inverse document frequency factor is incorporated, which diminishes the
weight of terms that occur very frequently in the document set and increases the
weight of terms that occur rarely.
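The effect can be checked on a toy corpus. The notes give no formula, so this sketch assumes one common variant: tf = (count of term in document) / (document length) and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Illustrative toy corpus (an assumption, not from the notes)
docs = [
    "the brown cow".split(),
    "the green field".split(),
    "the cow in the field".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Under this variant "the" appears in every document, so its idf is log(1) = 0 and its weight vanishes. "brown" appears in only one document, so it outweighs "cow", which appears in two.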
What is part-of-speech (POS) tagging? - answer A Part-Of-Speech Tagger (POS
Tagger) is a piece of software that reads text in some language and assigns a part of
speech, such as noun, verb, or adjective, to each word (and other tokens).
PoS taggers use an algorithm to label terms in text bodies.
- These taggers make more complex categories than those defined as basic PoS, with
tags such as "noun-plural" or even more complex labels. Part-of-speech categorization
is taught to school-age children in English grammar, where children perform basic PoS
tagging as part of their education.
What is Lemmatization in NLP? - answer Stemming and lemmatization are text
normalization (sometimes called word normalization) techniques in the field of
Natural Language Processing that are used to prepare text, words, and documents for
further processing.
- The goal of both stemming and lemmatization is to reduce the inflected forms of a
word to a common base form:
am, are, is --> be
car, cars, car's, cars' --> car
- Stemming or lemmatization? When should I use stemming and when should I use
lemmatization?
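One rough way to see the difference is a toy comparison. A stemmer typically strips suffixes with simple rules, while a lemmatizer looks words up in a vocabulary. Both functions below are deliberately simplified stand-ins, not any library's implementation; the `LEMMAS` table and the suffix list are hypothetical.

```python
# Hypothetical lookup table; a real lemmatizer relies on a full
# vocabulary and morphological analysis.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def toy_stem(word):
    """Crude rule-based suffix stripping (illustrative only)."""
    for suffix in ("s'", "'s", "s"):
        # Keep at least two characters so short words survive intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["cars", "car's", "cars'", "am", "are", "is"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Both approaches map the "car" variants to "car", but only the lookup maps the irregular forms "am", "are", and "is" to "be". Suffix rules cannot recover that.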