Types and tokens - Answer-• Types: The unique words or n-grams in the text (i.e. the vocabulary)
• Tokens: The words or n-grams in the text, counting every occurrence
• Relevancy: When designing NLP systems, we need to consider the relationship between types and
tokens e.g. for Information Retrieval systems, the size of the vocabulary will determine how much space it
takes to store the inverted index
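The distinction can be made concrete with a short sketch. It assumes the simplest possible tokenisation (lower-casing and splitting on whitespace); the function name `tokens_and_types` and the example sentence are illustrative only.

```python
def tokens_and_types(text):
    """Return (tokens, types) for a lower-cased, whitespace-split text."""
    tokens = text.lower().split()  # every running word is a token
    types = set(tokens)            # each distinct word is counted once
    return tokens, types

tokens, types = tokens_and_types("the cat sat on the mat")
print(len(tokens), len(types))  # 6 tokens, 5 types ("the" occurs twice)
```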
Zipf's Law - Answer-• If you rank words in decreasing order of frequency, then plot word rank (r) on the
x axis against frequency (f) on the y axis using log scales, you get an approximately straight line; this
relationship is called Zipf's Law
• i.e. f × r = k (for some constant k) (2 marks for correct formula)
• Relevancy: A few words will be very frequent (good for training purposes) but a large number of words
will occur only once or twice (not so good)
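A rough way to inspect the rank-frequency relationship is to tabulate rank, frequency and their product. This is only a sketch: the helper name `rank_frequency` and the toy sample are assumptions, and on a text this small the product f × r will not be close to constant; the pattern only shows up on a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return rows of (rank, word, frequency, rank * frequency)."""
    counts = Counter(tokens)
    table = []
    # most_common() sorts by decreasing frequency, so rank 1 is the top word
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        table.append((rank, word, freq, rank * freq))
    return table

sample = "a a a a b b c".split()  # toy data, for illustration only
for row in rank_frequency(sample):
    print(row)
```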
Heaps' Law - Answer-• The vocabulary (number of types) keeps growing as the number of tokens grows, but
more slowly than the text itself
• Typically V = kN^B where V is the number of types, N is the number of tokens, and k and B are
constants with 0 < B < 1 (2 marks for correct formula)
• Relevancy: There will always be new words no matter how large the training text
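Vocabulary growth can be observed directly by counting the number of types seen after each token. The sketch below does that and also evaluates the formula V = k × N^B for given constants; the function names and the values of k and B in the example are placeholders, not fitted to any corpus.

```python
def vocabulary_growth(tokens):
    """Return the number of types seen after reading each token in turn."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def heaps_estimate(n_tokens, k, b):
    """Evaluate V = k * N**B for assumed constants k and B."""
    return k * n_tokens ** b

corpus = "a man a man a plan a plan a canal a canal panama panama".split()
print(vocabulary_growth(corpus))
print(heaps_estimate(100, k=10, b=0.5))  # k and b are illustrative values
```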
Zero frequency problem - Answer-• High chance that many words will never occur in the training corpus
• mention of Zipf's Law and/or vocabulary growth (typically V = kN^B)
• Relevancy: There may not be examples in the training data, so we have to do something to cope with
OOV (Out of Vocabulary) words
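The notes do not say which coping strategy is intended. One common option, shown here purely as an assumption, is to map every word not seen in training to a single placeholder symbol; the symbol `<UNK>` and the helper names are hypothetical.

```python
UNK = "<UNK>"  # hypothetical placeholder symbol, not taken from the notes

def build_vocab(training_tokens):
    """The vocabulary is simply the set of types seen in training."""
    return set(training_tokens)

def replace_oov(tokens, vocab, unk=UNK):
    """Replace any token that is not in the vocabulary with the placeholder."""
    return [tok if tok in vocab else unk for tok in tokens]

vocab = build_vocab("a man a plan a canal panama".split())
print(replace_oov("a man a boat".split(), vocab))
```

Here "boat" never occurred in training, so it is the only word replaced.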
Sparse data problem - Answer-• Many words occur very infrequently
• i.e. high chance that they will never occur in the training corpus
• mention of Zipf's Law and/or vocabulary growth (typically V = kN^B)
• Relevancy: We need lots of training data to overcome this problem to ensure we have enough to train
our statistical models
N-grams - Answer-• A sequence of N words, characters or symbols
• Relevancy: Most statistical NLP systems use an N-gram approach
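As a minimal sketch of the definition, the helper below (the name `ngrams` is an assumption) slides a window of length N over any sequence, so it works for words as well as characters.

```python
def ngrams(sequence, n):
    """Return all contiguous n-item windows of the sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

words = "now is the time".split()
print(ngrams(words, 2))  # word bigrams
print(ngrams("nlp", 2))  # character bigrams
```

If the sequence is shorter than N, the result is an empty list.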
Discuss the adequacies and inadequacies of using N-grams for Natural Language Processing - Answer-
Common objections and strengths:
• Since we do not ourselves assign probabilities, why should machines? (But it is not so clear that this is the case)
• Models are crude word-counting affairs
• But one must distinguish between statistical models and statistical methods
• Problems with sparse data
• Inability to take account of burstiness of words
• Failure to capture unbounded dependencies
• But N-gram models are mathematically well-grounded
• They provide an empirical means of predicting the most likely interpretations
• They have a learning component
• They require little or no knowledge of the semantics of the domain
• Rule-based approaches are too brittle to deal with a variety of constructions
In the following sentence, identify the nouns, verbs, prepositions, and the noun phrases:
• Now is the time for all good men to come to the aid of their party. - Answer-Now/RB is/VBZ the/DT
time/NN for/IN all/DT good/JJ men/NNS to/TO come/VB to/TO the/DT aid/NN of/IN their/PRP$
party/NN
So, parts of speech are as follows:
Nouns: time, men, aid, party (2 marks)
Prepositions: for, of (1 mark)
Verbs: is, come (1 mark)
Noun phrases: the time, all good men, the aid, their party (2 marks)
Describe two different approaches to automatically annotating text with parts of speech. - Answer-
Approach 1: Rule-based tagging (1 mark)
- Uses a hand-written set of rules to assign the tags to words (1 mark).
- Typically more than 1000 hand-written rules, although the rules may also be machine-learned.
Approach 2: Stochastic or statistical / n-gram-based tagging (1 mark)
- Based on the probability of a certain tag occurring given the previous tags and words as context (0.5 marks)
- Requires a training corpus (0.5 marks)
The following text is being used as a training corpus:
• a man a man a plan a plan a canal a canal panama panama
Build a word-based bigram model from this training corpus. Show all the bigrams (the word and its
prediction), their frequency (token) counts and their type counts. - Answer-
prediction    count    types
a → man       2        3
a → plan      2        3
a → canal     2        3
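The counts can be checked mechanically. The sketch below counts every adjacent word pair in the training corpus and, for each context word, the number of distinct words that follow it. Reading the "types" column as the number of distinct successors of the context word is an assumption (it matches the value 3 given for "a"), and the corpus is treated as one unbroken sequence with no sentence-boundary markers.

```python
from collections import Counter, defaultdict

corpus = "a man a man a plan a plan a canal a canal panama panama"
tokens = corpus.split()

# Token counts: how often each (word, next word) pair occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Type counts: how many distinct words follow each context word.
successors = defaultdict(set)
for word, nxt in bigram_counts:
    successors[word].add(nxt)

for (word, nxt), count in bigram_counts.items():
    print(f"{word} -> {nxt}  count={count}  types={len(successors[word])}")
```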