Natural Language Processing (CS4990). Top Exam Questions and answers, 100% Accurate, verified

  • June 13, 2023
  • 46
  • 2022/2023
  • Exam (elaborations)
  • Questions & answers
PassPoint02

Why Python?

- Shallow learning curve
- Good string handling
- Combines OO, aspect-oriented and FP paradigms
- Extensive standard library and third-party packages (e.g. NLTK)
- Great support for Deep Learning

Human language

Ultimate interface for interaction and communication.
But it is hard to understand, because it is:
- highly ambiguous at all levels
- complex, with subtle use of context to convey meaning
- fuzzy and probabilistic

Understanding a language requires domain knowledge, discourse knowledge, world knowledge and
linguistic knowledge.

Word level ambiguity

- Spelling (e.g. colour vs color)
- Pronunciation
• 1 word can have multiple pronunciations (e.g. abstract, desert)
• Multiple words can share the same pronunciation (e.g. flower/flour)
- Meaning (1 word can have multiple meanings, i.e. homonyms; e.g. date, crane, leaves)

Natural Language Processing (NLP)

A subfield of linguistics, computer science, information engineering and AI concerned with the interactions
between computers and human (natural) languages, in particular how to program computers to process
and analyse large amounts of natural language data.

NLP tasks & applications

- Writing assistance (spell/grammar/style checking, auto completion).
- Text classification (spam detection, sentiment analysis, fake news/propaganda detection, news topic
classification, customer reviews category classification).
- Information retrieval (search engines).
- NL Understanding (argumentation mining, question-answering, NL inference,
humorous/ironic/metaphoric language analysis).
- NL generation (document summarisation, machine translation, sentence paraphrasing/simplification,
dialogue/exercise generation).




NLP limits & outlook

- Language problems are hard: for most of them there is still no fully accurate solution (as is also the
case in physics, history and psychology).

Data types (based on structures)

- Structured data
- Semi-structured data
- Unstructured data




Corpus (=body)

A large body of text.
It usually contains raw text and any metadata associated with the text (e.g. timestamp, source, index,
...).
It's also known as a dataset.




Text cleaning & normalisation

Remove useless information (e.g. email headers) and extract useful information (e.g. words, word
sequences, verbs, nouns, adjectives, names, locations, orgs, ...).

1. Tokenization (sentence, words)
2. Stemming / Lemmatization
3. Stop-words removal
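
As a rough illustration of steps 1 and 3 above, the sketch below lowercases a sentence, extracts word tokens with a regular expression and drops stop words. The `STOP_WORDS` set and the `clean` helper are assumptions made for this example (real stop-word lists are much longer), and stemming is left out here.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def clean(text):
    """Lowercase, tokenize into word tokens, then drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = clean("The corpus is a large body of text.")
print(cleaned)  # ['corpus', 'large', 'body', 'text']
```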

Tokenization

Process of splitting text into its constituents, i.e. tokens, which are meaningful segments (in English
this is generally done by separating on white-space or punctuation characters).

Type

Element in the vocabulary. Also known as the form or spelling of the token (including words and
punctuation) independently of its specific occurrences in a text.

Token

Instance of a type in a text, which is a sequence of characters that is treated as a single group (i.e. words
and punctuation).

E.g. "To be or not to be" contains 6 tokens but only 4 types (ignoring case):
- 2x to, be
- 1x or, not
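
The type/token distinction in the example above can be checked with a few lines of Python. This is a minimal sketch that assumes case is ignored and that splitting on white-space is good enough for this sentence.

```python
from collections import Counter

text = "To be or not to be"

# Tokens: every occurrence counts. Lowercasing makes "To" and "to" one type.
tokens = text.lower().split()

# Types: distinct forms, with the number of tokens of each type.
type_counts = Counter(tokens)

print(len(tokens))       # 6
print(len(type_counts))  # 4
print(type_counts["to"], type_counts["or"])  # 2 1
```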

Simple tokenization

Split with white-space (for English texts).

Pros: simple and natively supported by Python (`str.split()`).

Cons: punctuation stays attached to the neighbouring word, and hyphenated words (e.g. "state-of-the-art")
are never split into their parts.
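
A short sketch of white-space tokenization using Python's built-in `str.split()`. The sample sentence is made up for illustration.

```python
sentence = "This parser is state-of-the-art, really."

# str.split() with no arguments splits on any run of white-space.
tokens = sentence.split()

# The comma and the full stop stay glued to the words next to them,
# and the hyphenated word is kept as a single token.
print(tokens)  # ['This', 'parser', 'is', 'state-of-the-art,', 'really.']
```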




Natural Language Tool Kit (NLTK)

A free and open-source (FOSS) Python library for building programs that work with natural language.
It can perform different operations such as tokenization, stemming, classification, parsing, tagging and
semantic reasoning.

Word tokenizer (from NLTK)

NLTK's standard tokenizer.

Pros: successfully tokenizes punctuation and splits hashtags into separate tokens (e.g. #70thRepublic_Day
into "#" and "70thRepublic_Day").

Cons: it fails to identify widely used symbol combinations (e.g. ":)" is split into 2 symbols).




Tweet tokenizer (from NLTK)

Pros: correctly handles hashtags and mentions (`@someone`)

Cons: it fails at abbreviations (e.g. U.K)
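
The difference between the two tokenizers can be seen on a made-up tweet. This sketch assumes NLTK is installed. It uses `TreebankWordTokenizer` as a stand-in for the word tokenizer because it needs no extra data download, so its output may differ in small details from `word_tokenize`.

```python
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer

tweet = "Happy #70thRepublic_Day @someone :)"

# Rule-based word tokenizer (stand-in for the standard word tokenizer):
# hashtags, mentions and emoticons are broken into separate symbols.
treebank_tokens = TreebankWordTokenizer().tokenize(tweet)

# Tweet-aware tokenizer: hashtags, mentions and emoticons stay whole.
tweet_tokens = TweetTokenizer().tokenize(tweet)

print(treebank_tokens)
print(tweet_tokens)
```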




Sentence tokenization

For long documents, we may be interested in sentences rather than individual words, for example to:
- Check whether a sentence's sentiment is positive or negative.
- Check whether a sentence contains propaganda content.
- Check the grammatical correctness of a sentence.
- ...
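
A naive sentence splitter can be sketched with a regular expression that breaks after ".", "!" or "?" when white-space follows. The `split_sentences` helper is an assumption made for this example and not NLTK's method. It will wrongly split after abbreviations such as "U.K.", which is one reason a trained splitter such as NLTK's `sent_tokenize` is usually preferred.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by white-space (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences(
    "I loved the film. The ending was weak! Would you watch it again?"
)
for s in sentences:
    print(s)
```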




Stemming

Process of reducing inflected words to their root forms, so that a group of related words maps to the
same stem, even if the stem itself isn't a valid word in the language.

NLTK includes 2 widely used ones: Porter Stemmer and Lancaster Stemmer (younger and more
aggressive); both treat their input as a single word, so text has to be tokenized first.

Pros: quick to run (because it's based on simple rules) and suitable for processing a large amount of text.

Cons: the resulting stems may not carry any meaning (or be actual words).
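
The two stemmers can be compared side by side on a few words. This sketch assumes NLTK is installed; the word list is arbitrary and only meant to show that stems need not be real words.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Each stemmer expects one word at a time, so stem token by token.
words = ["running", "studies", "maximum"]
for word in words:
    print(word, porter.stem(word), lancaster.stem(word))

# "studi" is not an English word, but it is the stem Porter assigns.
print(porter.stem("studies"))  # studi
```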
