Lecture notes Unsupervised learning
Course: Introduction to Data Science 2021/2022
Study: Bachelor Psychology Tilburg University
Lecture 6
Unsupervised learning
In supervised learning methods we observe a set of variables (X1, X2, …, Xp) for
each object, and a response/outcome variable Y. The goal is to predict Y using these
variables/characteristics. In unsupervised learning we only have
variables/characteristics and we do not have an associated response/outcome
variable Y. The goal is to learn something about the X-variables themselves: how do they correlate, for instance? Is there an informative way to visualize the data, and with what kind of plot? Can we discover subgroups among the variables or observations? We're
going to discuss two branches of methods:
- Principal components analysis (PCA): finds subgroups of variables and uses these subgroups to reduce a dataset with many variables to a dataset with only a few components.
- Clustering: finds subgroups of observations (the people in your sample for
instance).
Unsupervised learning is more subjective, as there is no simple goal for the analysis such as predicting a response. But techniques for unsupervised learning are of growing importance in a number of fields:
- subgroups of breast cancer patients grouped by their gene expression
measurements (clustering)
- groups of shoppers characterized by their browsing and purchase histories
(clustering)
- summarize genetically informative or neuroimaging data (PCA)
It is often easier to obtain unlabelled data (unsupervised) — from a lab instrument or
a computer — than labelled data (supervised), which can require human intervention.
For example, it is difficult to automatically assess the overall sentiment of a movie
review: is it favourable or not? It is also interesting to combine both learning methods:
- First summarize genetically informative data, then use regression analyses for
association analysis
- First group clients into subgroups based on their shopping behaviour, then, for
every subgroup, predict what variables determine whether they buy a product or not
- First summarize questionnaire data into a few components, then use the
components in a regression analysis
Principal component analysis
The aim is to reduce data dimensionality without throwing out the essential features of the data. The variables in your data are its dimensions: if you have scores on 100 questionnaire items, the dimensionality of your data is 100. When the dimensionality is high, the data are hard to work with, so PCA reduces the variables to a small number of components that capture the same information. We want to retain the most valuable information. Examples: DNA data contain millions of genes that cluster in different regions of the brain, or questionnaire data consisting of many variables, like the NEO-PI-R that contains 240 items (the shorter version contains only 60 items).
For example:
The research question is: Is a specific genetic variant associated with a certain
disease?
- Collect data of individuals with and without the disease
- Genotype these individuals
- Look for genetic variants or mutations (single nucleotide polymorphisms, SNPs; a SNP is a position in the DNA sequence where a single nucleotide differs between individuals)
- Then see if there is an association between any SNP and disease status
- However... there are millions(!) of SNPs!
- Use PCA to summarize the data!
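To make this concrete, here is a minimal sketch in Python (my own illustration, not course code) of the "summarize first, then predict" idea from the examples above. The genotype matrix, sample sizes, and the choice of 10 components are all hypothetical; PCA compresses the thousands of SNP columns into a few component scores, which are then related to disease status with a logistic regression.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_people, n_snps = 200, 5000                 # far more variables than observations
genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)  # 0/1/2 allele counts (made up)
disease = rng.integers(0, 2, size=n_people)                            # 0 = healthy, 1 = disease (made up)

# Standardize the SNP columns, then summarize them with a handful of components
snps_std = StandardScaler().fit_transform(genotypes)
pca = PCA(n_components=10)
components = pca.fit_transform(snps_std)     # component scores, shape (200, 10)

# Use the component scores instead of millions of raw SNPs in a supervised model
model = LogisticRegression().fit(components, disease)
print(model.score(components, disease))      # in-sample accuracy of the association model
```

With random data the accuracy stays around chance level; the point is only that the supervised step works with 10 components instead of 5,000 (or millions of) SNPs.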
PCA produces a low-dimensional representation of a dataset. It finds a sequence of
linear combinations of the variables that explain as much variance as possible, and
are mutually uncorrelated. Apart from producing derived variables for use in
supervised learning problems, PCA also serves as a tool for data visualisation.
The first principal component of a set of variables X1, X2, …, Xp is the normalized linear combination of the variables:
- C1 = w11X1 + w12X2 + … + w1pXp
This is comparable to a weighted sum score. The weights w11, w12, …, w1p are called component loadings. PCA finds the weights such that this weighted sum score explains as much variance as possible. PCA aims for the components to have a mean of zero and a variance of 1 (normalized). Component 2 is then the combination that explains, on top of C1, the most variance:
- C2 = w21X1 + w22X2 + … + w2pXp
The assumption is that ρ(C1, C2) = 0 (uncorrelated). In principle we can have as many components as variables; more on how we choose the ideal number of components later. So C2 explains the most of the variance that is left over (not explained by C1), and C3 explains the most of the variance that is left after that (not explained by C1 and C2). The final components will explain very little variance, so we do not need as many components as variables (using all of them would simply reproduce 100% of the variance). We want far fewer components than variables, so we look for the ideal number. There are two important concepts in PCA:
- Component loadings: these are the weights of the best possible sum score of your dataset, as mentioned earlier. A normalization is applied to obtain components with a mean of zero and a variance of 1; after this, the loadings are a rescaled version of the weights. Thanks to that normalization, the components have a variance of 1 and a component loading can be read as the correlation between a particular variable and the component. Component loadings also indicate which variables correlate highly with one another: variables that load strongly on the same component correlate more strongly than variables that load strongly on different components. This also gives us a way to interpret a component: by looking at which variables load strongly on it.
Example:
- Component 1 = genes that are responsible for cognitive functions: high loadings for these genes, low loadings for genes that are not relevant for cognitive functions (we could label this component "cognitive functions")
- Component 2 = genes that are responsible for non-cognitive functions: high loadings for these genes, low loadings for genes that are relevant for cognitive functions (we could label this component "non-cognitive functions")
- Component scores: This is the other important PCA concept. Based on the scores
on the variables (X1,X2,...Xp) and the weights, we can calculate a score on every
component for every unit of the analysis (e.g., a person), indicated by i. So, in case of
two components:
C1i = w11X1i + w12X2i + … + w1pXpi
C2i = w21X1i + w22X2i + … + w2pXpi
This is called a component score: it can be seen as the score of a unit on that particular component, a weighted sum score indicating where the unit stands on that component. The component scores have a mean of zero and a variance of 1 (normalized); thus, a negative component score means below the average of the particular sample, and a positive component score means above average.
Example:
Genetic data: Component scores are related to presence/absence of certain
mutations (e.g., mutations important for cognitive functions, mutations important for
non-cognitive functions…)
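A small sketch of both concepts with made-up data (my own illustration, not from the lecture): after standardizing the variables, PCA gives a component score per person, which can be rescaled to variance 1; the loadings can then be read off as the correlations between each variable and each component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))                  # hypothetical data: 100 people, 6 variables
X_std = StandardScaler().fit_transform(X)      # each variable: mean 0, variance 1

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)              # component scores per person (mean 0)
scores = scores / scores.std(axis=0, ddof=1)   # rescale so each component has variance 1

# Loadings read as correlations: correlate each variable with each component
loadings = np.array([[np.corrcoef(X_std[:, j], scores[:, k])[0, 1]
                      for k in range(2)]
                     for j in range(6)])
print(loadings)      # one row per variable, one column per component
print(scores[:5])    # component scores of the first five people (negative = below average)
```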
Eigenvalue rule
We can have as many components as variables, but not every component gives information (signal-noise trade-off); some people, for instance, fill out questionnaires in a random way. We want to explain as much variance as possible with as few components as possible and as little noise as possible. Eigenvalue rule: the eigenvalue of a component indicates how much of the variation in the data set is explained by that component, so each component has an eigenvalue. We use a scree plot (more on this later) to visualize this. We assume that the variables have been standardized. To understand the importance of each component, we are interested in knowing how much variance it explains (its eigenvalue). The total variance present in a data set is the sum of the variances of all variables (i.e., the number of variables, since each standardized variable has variance 1). The proportion of variance explained over all components sums to 100%. The eigenvalue rule: the eigenvalue should be > 1. If the eigenvalue is smaller than 1, a single variable explains more variance than the component does. The amount of variation of each variable is 1, so if the eigenvalue of a component is larger than 1, the component contains more information than a single variable; then it makes sense to retain that component in our summary.
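A minimal sketch of the eigenvalue rule and a scree plot (my own illustration with random standardized data; sizes are arbitrary). With scikit-learn, the eigenvalues correspond to explained_variance_ after fitting PCA on standardized variables.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))                  # hypothetical data: 200 people, 8 variables
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)                         # keep as many components as variables
eigenvalues = pca.explained_variance_          # variance explained by each component
print(eigenvalues.sum())                       # total variance ≈ number of variables (8)
print(pca.explained_variance_ratio_.sum())     # proportions of explained variance sum to 1 (100%)

# Eigenvalue rule: retain the components whose eigenvalue is larger than 1
print("Components with eigenvalue > 1:", int(np.sum(eigenvalues > 1)))

# Scree plot: eigenvalue per component, with the cut-off at 1
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1, linestyle="--")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```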
Example
USArrests data: for each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percentage of the population in each state living in urban areas). PCA was performed after standardizing each variable to have mean zero and standard deviation one.
(Table: component loadings of Murder, Assault, UrbanPop, and Rape on PC1 and PC2.)
These PCs (principal components) show the component loadings per variable: Murder, Assault, UrbanPop, and Rape. If you look at PC1 and Murder, the correlation between the first principal component and the variable Murder is positive (about .54). The correlations between PC1 and the other crime variables are about the same; the only exception is UrbanPop, which correlates less (about .28). But if you look at PC2 and UrbanPop, you see a much higher loading, because now the correlation is about .87. The other variables have much lower or even negative correlations with PC2; higher scores on Murder, for instance, go with lower scores on PC2. You can also say that Murder, Assault and Rape correlate more with each other than with UrbanPop, because UrbanPop shows a very different pattern of correlations across the two PCs. Because PC2 correlates very highly with UrbanPop, we could name that component "urban population"; PC1 has high correlations with the crime variables, so that component could be named "crime rates". We can visualize this in a PCA plot, also called a biplot.
- The blue state names represent the scores for the first two principal components.
- The orange arrows indicate the first two principal component loading vectors (with axes on the top and right of the plot).
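A sketch that reproduces this kind of biplot (my own code, not the lecture's). It assumes the USArrests data can be downloaded from the online Rdatasets collection via statsmodels; the sign and scaling of the components are arbitrary, so the orientation may differ from the figure shown in the lecture.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 50 states x 4 variables (Murder, Assault, UrbanPop, Rape), fetched from Rdatasets
usarrests = sm.datasets.get_rdataset("USArrests").data
variables = ["Murder", "Assault", "UrbanPop", "Rape"]
X_std = StandardScaler().fit_transform(usarrests[variables])   # mean 0, sd 1 per variable

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)              # component scores for the 50 states

# Biplot-style figure: state scores as points, loading vectors as arrows
fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10)
for name, (w1, w2) in zip(variables, pca.components_.T):
    ax.arrow(0, 0, 3 * w1, 3 * w2, color="orange", head_width=0.05)
    ax.annotate(name, (3 * w1, 3 * w2))
ax.set_xlabel("PC1 (crime rates)")
ax.set_ylabel("PC2 (urban population)")
plt.show()
```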