This document contains my notes and a summary of the lectures given by Chris Emmery in the course Cognitive Science 2 in Quartile 2. The course was renewed before the start of this year, so this is the first time this format is taught. The equations are mentioned in the do...
1. Introduction
Text Mining Preliminaries
The bare-minimum approach is to convert text to vectors: we need to convert language into numbers.
d = the cat sat on the mat

      cat  mat  on  sat  the
d  →  ( 1    1   1    1    1 )

Bag-of-words representation: if a word occurs in the document, 1; otherwise, 0. It is unordered.
We can represent these sentences as instances:

d0 = the cat sat on the mat
d1 = my cat sat on my cat

       cat  mat  my  on  sat  the
d0  →  ( 1    1   0   1    1    1 )
d1  →  ( 1    0   1   1    1    0 )

The representation can be expressed more easily as a documents × terms matrix.
• d = V × X with V: vocabulary and X: feature space
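As a rough sketch of how such a binary documents × terms matrix could be built in Python (the variable names docs, V, and X below are my own, not from the lecture):

# toy corpus from above
docs = ["the cat sat on the mat", "my cat sat on my cat"]

# V: vocabulary, sorted so every document uses the same column order
V = sorted({word for doc in docs for word in doc.split()})

# X: binary documents-by-terms matrix, 1 if the term occurs in the document
X = [[1 if term in doc.split() else 0 for term in V] for doc in docs]

print(V)  # ['cat', 'mat', 'my', 'on', 'sat', 'the']
print(X)  # [[1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]]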
For document similarity, we use the Jaccard coefficient:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Here |A ∩ B| is the number of terms for which both documents have a 1, and |A ∪ B| is the number of terms occurring in at least one of the two documents.
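A minimal sketch of the Jaccard coefficient over two binary vectors, assuming the rows of the matrix above (the helper name jaccard is my own):

def jaccard(a, b):
    # a and b are binary vectors over the same vocabulary
    intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return intersection / union

jaccard([1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0])  # 3 shared terms / 6 terms in the union = 0.5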
Binary vs. frequency
• Binary is a very compact representation
• Algorithms like Decision Trees have a straightforward and compact structure
• Binary says very little about the weight of each word
• We can’t use more advanced algorithms that work with Vector Spaces
Notation for Term Frequencies
• D = {d_1, d_2, …, d_N} is the set of documents
• T = {t_1, t_2, …, t_M} is a set of index terms for D
• Each document d_i ∈ D can be represented as a frequency vector:
  o d⃗_i = ⟨tf(t_1, d_i), …, tf(t_M, d_i)⟩, where tf(t_m, d_i) is the frequency of term t_m ∈ T in document d_i.
Term frequency does not capture importance very well; it should be on a log scale. We use ln(tf(t, d) + 1).
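A small sketch of a raw versus log-scaled frequency vector for d1 (variable names are my own):

import math

doc = "my cat sat on my cat".split()
V = ['cat', 'mat', 'my', 'on', 'sat', 'the']

tf = [doc.count(term) for term in V]     # raw counts: [2, 0, 2, 1, 1, 0]
log_tf = [math.log(f + 1) for f in tf]   # ln(tf + 1) dampens the large counts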
There are still problems: longer documents have more words, and rare terms do not occur much but are the most informative. If two documents have the same rare words in them, they are more similar.
This is solved via (inverse) document frequency.
$$\mathrm{idf}_t = \log_b \frac{N}{\mathrm{df}_t}$$

with N the number of documents and df_t the document frequency of t: in how many documents t occurs. The base b is typically 10.
Normalizing Vector Representations
Both terms can be put together via multiplication:

$$w_{t,d} = \ln(\mathrm{tf}(t, d) + 1) \cdot \log\frac{N}{\mathrm{df}_t}$$
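Putting both terms together, a sketch of computing the weights w_{t,d} over the toy corpus, assuming base 10 for the idf part (names are my own):

import math

docs = [d.split() for d in ["the cat sat on the mat", "my cat sat on my cat"]]
V = sorted({word for doc in docs for word in doc})
N = len(docs)

# df_t: in how many documents does term t occur?
df = {t: sum(1 for doc in docs if t in doc) for t in V}

# w_{t,d} = ln(tf(t, d) + 1) * log10(N / df_t)
W = [[math.log(doc.count(t) + 1) * math.log10(N / df[t]) for t in V] for doc in docs]

# note: terms occurring in every document ('cat', 'on', 'sat') get idf 0 in this tiny corpus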
One way of calculating the “length” of a document is the Euclidean distance: documents with many words are far away.

$$d(\vec{p}, \vec{q}) = \sqrt{\sum_{i=1}^{n} (\vec{p}_i - \vec{q}_i)^2}$$
Now, it will just compare documents that have similar lengths. It is possible to correct for this using the $\ell_2$ normalization: $\|\vec{p}\|_2 = \sqrt{\sum_i \vec{p}_i^2}$.
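A sketch of the Euclidean distance and the l2 normalization in plain Python (the helper names are my own):

import math

def euclidean(p, q):
    # grows with document length: long documents end up far away from short ones
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def l2_normalize(p):
    # scale the vector to unit length so document length no longer dominates
    norm = math.sqrt(sum(pi ** 2 for pi in p))
    return [pi / norm for pi in p]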
Cosine similarity: the dot product of two vectors under the assumption that they are $\ell_2$ normalized.

$$\mathrm{SIM} = \vec{p} \cdot \vec{q} = \sum_{i=1}^{n} \vec{p}_i \vec{q}_i$$

If the vectors are not already normalized, the normalization can be done inside the similarity:

$$\mathrm{SIM} = \frac{\vec{p} \cdot \vec{q}}{\sqrt{\vec{p} \cdot \vec{p}} \cdot \sqrt{\vec{q} \cdot \vec{q}}}$$
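A sketch of both variants (helper names are my own): the plain dot product for vectors that are already l2-normalized, and the version that normalizes inside the similarity:

import math

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def cosine(p, q):
    # normalizes inside the similarity, so p and q do not need to be l2-normalized first
    return dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))

cosine([1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0])  # ≈ 0.67 for the two toy documents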
2. Collecting Data
Noisy Text
Text can be noisy. If we don’t filter this out, our model reads things that are not in the intended format. Typos also have a tremendous effect on the size of the vocabulary and on the representation of your documents (and thus on the similarity quality).
Language variations: abbreviations, acronyms, capitalization, character flooding, concatenations,
emoticons, dialect, slang, typos.
Regular Expressions
Finding and reducing these noisy text errors is possible via regular expressions: a mini scripting language of logic that is one of the few things standardized across almost all programming languages. It is a way to define string patterns and can be used to find matches and also to replace the matches found.
import re

patt = re.compile("you")   # compile the pattern once
patt.finditer(text)        # iterator over all matches of "you" in text
You can also do disjunctions: either this or that; "[Yy]ou" finds both you and You.

regex_find("[Yy]ou", text)
Negation is also possible: "[^a-z]" finds everything except a-z.

regex_find("[^a-z]", text)
In logic, there are the Kleene operators:
• Kleene star (*): matches the preceding element zero or more times. "o[a-z]*" matches anything that begins with o, followed by zero or more letters.

regex_find("o[a-z]*", text)

• Kleene plus (+): matches the preceding element one or more times, so an empty match is not returned in this case.

regex_find("u[a-z]+", text)

• wild card (.): matches anything at the position of the dot. ".e." finds ‘her’ in ‘there’. This is greedy.

regex_find(".e.", text)
If we find something with this, we can use it for substitution. This expression finds every run of exclamation points and replaces it with ‘ !’.

re.sub('!+', ' !', "Amazing!!!!!!")
'Amazing !'
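The regex_find calls above rely on a helper that was presumably defined in the course notebook; a minimal sketch of what such a helper could look like (my own assumption, not the course’s definition):

import re

def regex_find(pattern, text):
    # hypothetical helper: return all non-overlapping matches of pattern in text
    return re.findall(pattern, text)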