Text Retrieval & Mining
Week 1: Bag of Words
Bag-of-Words is a family of text representations, where text vectors are built by
observing and counting the words that appear in a text.
We study two types of BoW vectors:
• Raw Count: actually count the number of occurrences of each word in a text
• TF-IDF: adjust the raw count to favour words that appear a lot in a few
documents, as opposed to those that appear a lot in all documents
Definitions
Document and Corpus:
• Document is the smallest unit of text of your use case
• Corpus is your collection of documents
• Use case: think of the typical question you are looking for the answer to
• Query: the text you will use to search in your corpus
Vocabulary (or Dictionary): all unique terms appearing in a corpus, with size V := the
number of unique words
Token: a unit of text e.g. word, punctuation
Corpus Frequency: the number of times a word appears across all texts in the corpus
Term Frequency (in a document): the number of times a word appears in ONE
document
Document Frequency: the number of documents (texts) a word appears in
Term: a single word, a lemma or stem of a word, or an N-gram
Tokenizer: a program that takes in a text and splits it into smaller units. Once a text is
tokenized into sentences, you can tokenize sentences into words.
Examples of Python Tokenizers:
• NLTK: sentence & word tokenizer
• SpaCy: sentence & word tokenizer
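Example (a minimal sketch; it assumes the NLTK 'punkt' tokenizer data and the SpaCy
en_core_web_sm model have already been downloaded):
import nltk
import spacy

nltk.download('punkt')  # tokenizer models, needed once

text = "The cat sat on the hat. The dog ate the cat and the hat."

# NLTK: first split into sentences, then split each sentence into words
sentences = nltk.sent_tokenize(text)
print([nltk.word_tokenize(s) for s in sentences])

# SpaCy: one call returns a Doc that exposes both sentences and tokens
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([[token.text for token in sent] for sent in doc.sents])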
Bag of Words
For each document (text):
• Create a vector of dimension V
• One entry per token of the vocabulary, holding that token's count in the
document (the vector length is V, the number of unique words/tokens in the
corpus)
• Only store tokens with count > 0 (the vectors are sparse)
Example:
Sentence 1: “the cat sat on the hat”
Sentence 2: “the dog ate the cat and the hat”
Vocabulary: [and, ate, cat, dog, hat, on, sat, the] (8 unique words)
BoW 1: [0, 0, 1, 0, 1, 1, 1, 2]
BoW 2: [1, 1, 1, 1, 1, 0, 0, 3]
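The same vectors can be reproduced with scikit-learn's CountVectorizer (a sketch; the
default settings happen to give exactly the vocabulary above because every token has at
least two characters):
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

count = CountVectorizer()      # raw-count Bag of Words
X = count.fit_transform(docs)  # sparse matrix of shape (2, V)

print(count.get_feature_names_out())
# ['and' 'ate' 'cat' 'dog' 'hat' 'on' 'sat' 'the']
print(X.toarray())
# [[0 0 1 0 1 1 1 2]
#  [1 1 1 1 1 0 0 3]]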
TF-IDF
TF – Term Frequency
IDF – inverse of document frequency (DF)
TF-IDF(term, document, corpus) = TF(term, document) × IDF(term, corpus)
TF-IDF measures the specificity of a word:
High value: a word that appears in the document but not a lot in the overall corpus
Low value: a word that appears in the document, but also in many other documents in the corpus
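Example (a sketch with scikit-learn's TfidfVectorizer on the two sentences above; note that
its IDF formula uses smoothing, so the values differ slightly from the textbook definition):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # rows are L2-normalised TF-IDF vectors

# Words shared by both documents ('cat', 'hat', 'the') get a low IDF;
# words that appear in only one document get a higher IDF.
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{term:>4}  idf = {idf:.2f}")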
Text Processing
• Stopping: removing stopwords
o Stopwords: commonly used words in a language that carry very
little useful information e.g. personal pronouns, definite & indefinite
articles
o Removed based on pre-established lists
• Filter by Token Pattern
o Accept only words that correspond to a regular expression pattern.
• Filter by Frequency
o Retain only the top N tokens, based on the number of times they appear
in the complete corpus.
o Use the max_features argument of the vectorizer.
• Filter by Document Frequency
o Two corner cases to consider:
▪ a word appears in nearly all documents: it does not help to
distinguish between documents
▪ a word appears in only 1 or 2 documents: it is likely a typo, or a
one-off e.g. Review by John, Jane's opinion
o Use the min_df and max_df arguments:
▪ min_df = 3: only words that appear in at least 3 documents are
kept in the vocabulary
▪ min_df = 0.1: only words that appear in at least 10% of the
documents are kept in the vocabulary
▪ max_df works the same way, but drops words that appear in too
many documents
Example
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stops = stopwords.words('english')      # pre-established stopword list

count = CountVectorizer(
    stop_words=stops,                   # stopping
    token_pattern=r'[a-z]+\w*',         # filter by token pattern
    max_features=50000,                 # filter by corpus frequency (top N tokens)
    min_df=5,                           # drop words in fewer than 5 documents
    max_df=0.8                          # drop words in more than 80% of documents
)
• Stemming: removing plurals, conjugation endings
o 'cats' → 'cat', 'making' → 'mak'
o Stem ≠ Word (the stem need not be a real word)
o Stemmers: Porter, Snowball
• Lemmatizing: like stemming, but the result is always a real word
o 'cats' → 'cat', 'making' → 'make'
o Lemma = Word
o Slower than stemming
o Lemmatizers: WordNet, SpaCy (see the example below)
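Example (a minimal sketch with NLTK's PorterStemmer and WordNetLemmatizer; it
assumes the WordNet data has been downloaded; SpaCy instead exposes lemmas as
token.lemma_):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # data for the lemmatizer, needed once

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cats"), stemmer.stem("studies"))                  # cat studi  (a stem need not be a word)
print(lemmatizer.lemmatize("cats"), lemmatizer.lemmatize("studies"))  # cat study  (a lemma is always a word)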
• N-grams: groups of N consecutive words in the text
o Importance: Bag of Words vectors ignore word order in a sentence, yet
some information is carried by words appearing side by side rather than
merely appearing somewhere in the sentence. Knowing that 'new york' is
in a sentence carries more information than knowing that both 'new' and
'york' are in the sentence without knowing that they are adjacent. Since
document similarity is computed as the cosine similarity of BoW vectors,
it is worth having dimensions of the BoW vector that encode the fact that
certain words appear side by side (see the sketch after this list).
o 2-grams: 'New York', 'Greta Thunberg'
o 3-grams: 'New York City', 'Limited Liability Corporation'
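Example (a sketch of adding bigram dimensions with the ngram_range argument of
CountVectorizer):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love New York", "New York City never sleeps"]

# ngram_range=(1, 2): keep the unigrams and add 2-gram features such as 'new york'
count = CountVectorizer(ngram_range=(1, 2))
X = count.fit_transform(docs)

print(count.get_feature_names_out())
# ['city' 'city never' 'love' 'love new' 'never' 'never sleeps' 'new'
#  'new york' 'sleeps' 'york' 'york city']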