This summary covers all video lectures (excluding the hands-on examples done in SPSS and R). It includes all the slides, together with the lecturer's extensive explanations, written out clearly. It is 63 pages long.
AMDA FALL SUMMARY
2021 – 2022
Lysanne Groenewegen
Leiden University
TOPIC 1: PCA & CFA
Lecture 1: PCA
PCA is very widely applied: in the social sciences we use it for data reduction, questionnaires, design, etc.
Which questions predict the overall variable better (PCA):
- If you have 4 single variables that measure SES in some way, can’t you just use one weighted average?
o No, you have to look at how much each individual variable contributes to the overarching construct.
- How many sub concepts of intelligence can we distinguish?
o Here we use PCA to see whether there are groups among the items with which we test intelligence.
- What chemicals possess similar properties under heat / pressure / …?
- Quantify ethnic spread in (sub)populations
In a graph:
• The arrows are the variables
• The dots are observations
• The horizontal and vertical axes are principal components
For example:
With PCA (e.g. on genetic makeup) you can correct for certain variables. Based on genetic information, you can pinpoint where people are located on a map. It is very important to realize that there can be only one first component: the component that explains the largest percentage of the overarching construct. We will dive deeper into this later.
Scale construction:
Sometimes the natural groups that PCA finds do not match your theoretical groups (in that case: rephrase the questions)
• You have an idea on how to design a questionnaire for some concept
• You design questions that address several sub concepts
• But are the supposed items actually adequate in their subdomain?
• Which items do you choose for the subscales of your instrument, and how reliable are these subscales?
E.G. Intelligence:
- 1 concept? (general intelligence)
- 2 concepts? (verbal and performance)
- 3 concepts? (Verbal, performance, freedom from distractibility)
(This lecture: examples of scale construction, but also holds for dimension reduction)
PCA
PCA is a method where we can get an idea about any underlying, not pre-assumed, structure. This means
that we use PCA on datasets for which we have no theoretical background (or assumed structure yet).
PCA is therefore a bottom-up exploratory method, which visualizes our dataset nicely. With CFA, on the
other hand, we do have a pre-assumed structure, and we do have a theoretical background. CFA is
therefore called a top-down confirmatory method. We will get into this in lecture 2.
When you do PCA in SPSS, it computes a correlation matrix and starts to decompose this correlation matrix (the backbone of PCA). By doing so, you lose the information on your individuals, because your unit of analysis has become the correlation between variables instead of an observation of a subject on a variable. So the unit has changed to bivariate associations, and we want to find groups of high bivariate associations.
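To make the change of unit concrete, here is a minimal sketch in plain Python (not the SPSS procedure itself). The scores are made up for illustration, and the helper names `pearson` and `correlation_matrix` are my own. It shows that 5 subjects on 3 variables turn into a 3 x 3 matrix in which the subjects no longer appear.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns: list of variables, each a list with one score per subject."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical scores of 5 subjects on 3 variables (made up for illustration).
scores = [
    [1, 2, 3, 4, 5],   # variable 1
    [2, 4, 6, 8, 10],  # variable 2: rises together with variable 1
    [5, 3, 4, 1, 2],   # variable 3: tends to fall when variable 1 rises
]
R = correlation_matrix(scores)
# R is 3 x 3 (variables by variables): the 5 subjects no longer appear in it.
```

Each cell of `R` is one bivariate association, which is exactly the unit that PCA goes on to decompose.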
- We explore the data for a structure with PCA
- External (theoretical) knowledge is used afterwards for interpretation
- Analyses are performed in SPSS or R
- From data to model/theory
Explore with PCA, test with CFA
Test, in CFA, the structure that was suggested by PCA. What PCA will do is compute and analyze the correlation structure (correlations between variables). Subgroups can be found by searching for variables that correlate highly with each other, but not with the variables of other subgroups (= you find groups of highly correlating variables).
Questions:
- How strong should the variables correlate with each other?
- How many subgroups exist in my dataset?
- If I find groups, which variables belong to which group (component)?
Suppose this stage is successful and you find a structure that is reasonable. Then you can actually test (in
a slightly stricter way) with CFA whether this model is actually useful for predictions or real explanations
(instead of just a suggestion). PCA gives you the best suggestion it can give, but the best suggestion might
still be complete rubbish (it just cannot do any better). Sometimes (not often though) this can happen, but
then you usually have a not so sound dataset.
8 predictor variables
How do you combine these into a single score?
This is an example of already knowing the subscales (A & B), but it is an unweighted, very straightforward scale construction. It is the linear unweighted sum (linear because it is a sum, unweighted because all of the variables make an exactly equal contribution to the sum score, with weight 1). This means that all the A variables are on one scale and all the B variables are on another scale.
- A variable is either in or out of a scale
- Weights c are either 1 (in) or 0 (out)
- Variables determine scale interpretation
- Equal or no contribution to construct
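A small sketch of this unweighted sum score, assuming four A items and four B items. The item names, the answers and the helper `scale_score` are hypothetical and only meant to show what weights of 1 (in) and 0 (out) do.

```python
# Hypothetical item names: four A items and four B items (8 variables in total).
items = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"]

# Weights c are 1 (in the scale) or 0 (out of the scale).
weights_a = {name: 1 if name.startswith("A") else 0 for name in items}
weights_b = {name: 1 if name.startswith("B") else 0 for name in items}

def scale_score(answers, weights):
    """Linear sum of item scores; with 0/1 weights this is an unweighted sum."""
    return sum(weights[name] * answers[name] for name in weights)

# Made-up answers of one respondent on a 1-5 scale.
respondent = {"A1": 4, "A2": 5, "A3": 3, "A4": 4,
              "B1": 2, "B2": 1, "B3": 2, "B4": 3}
score_a = scale_score(respondent, weights_a)  # 4 + 5 + 3 + 4
score_b = scale_score(respondent, weights_b)  # 2 + 1 + 2 + 3
```

Every A item counts exactly as much as every other A item here, which is the assumption that the next question challenges.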
But: is it really true that all the A items contribute equally to the scale? How do we find out?
- More subtle inclusion (or exclusion)
- Weighted contribution to component
- c can be anything between 0 and 1
- some variables are more important, others less
- Values c < .30: exclusion
- Values c > .30: inclusion?
- Values c > .50: inclusion
- Values c > .80: for clinical instruments
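The rules of thumb above can be written as a small lookup. This is only a sketch: the function name `loading_verdict` and the example loadings are made up, taking the absolute value is my own assumption (the notes only mention c between 0 and 1), and a loading of exactly .30 is treated as exclusion because the notes do not say which side it falls on.

```python
def loading_verdict(c):
    """Map a component loading to the rule-of-thumb categories listed above."""
    size = abs(c)  # assumption: the sign is ignored, only the size counts
    if size > 0.80:
        return "include (strong enough for clinical instruments)"
    if size > 0.50:
        return "include"
    if size > 0.30:
        return "include? (judgment call)"
    return "exclude"

# Hypothetical loadings of five items on one component.
loadings = {"item1": 0.85, "item2": 0.62, "item3": 0.41,
            "item4": 0.12, "item5": -0.55}
verdicts = {item: loading_verdict(c) for item, c in loadings.items()}
```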
How do I get these numbers?
The c weights are component loadings. Finding these weights only works under certain conditions.
- A component is a linear combination of items
- PCA searches for these linear combinations such that Cronbach’s alpha (the reliability coefficient for each of the subgroups, i.e. for each of these components) is as high as possible (largest possible variance of all combinations). A high reliability coefficient means that you have a lot of joint/explained variance in this group of variables. There is only 1 principal component: the 1 component that is the most important one
- Second, find the next-highest combination, such that it is uncorrelated with all previous components
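The search described above can be sketched with power iteration on a correlation matrix. Note the assumptions: power iteration is a simple stand-in chosen for illustration (the notes do not say which algorithm SPSS or R uses), the correlation matrix is hypothetical, and the function names are my own. The sketch maximizes explained variance directly; loadings are then the weights multiplied by the square root of the explained variance.

```python
from math import sqrt

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    """Rescale a vector to length 1."""
    length = sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def strongest_component(r, iterations=500):
    """Power iteration: unit-length weights and explained variance (eigenvalue)."""
    w = normalize([float(i + 1) for i in range(len(r))])  # arbitrary start vector
    for _ in range(iterations):
        w = normalize(mat_vec(r, w))
    variance = sum(a * b for a, b in zip(w, mat_vec(r, w)))
    return w, variance

def remove_component(r, w, variance):
    """Deflation: subtract what the component explains from the matrix."""
    n = len(r)
    return [[r[i][j] - variance * w[i] * w[j] for j in range(n)] for i in range(n)]

# Hypothetical correlation matrix for 4 variables (two groups of two).
R = [
    [1.0, 0.8, 0.2, 0.2],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
w1, var1 = strongest_component(R)                              # first component
w2, var2 = strongest_component(remove_component(R, w1, var1))  # next highest
overlap = sum(a * b for a, b in zip(w1, w2))   # close to 0: uncorrelated components
loadings1 = [x * sqrt(var1) for x in w1]       # loadings on the first component
```

For this made-up matrix the first component takes about 2.2 of the 4 points of variance and the second about 1.4, so the second bite is smaller than the first.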
Visual explanation
Left box:
- Total amount of information that you have: imagine that we have 10 variables. If we analyze 10 variables using correlations, we are working on a standardized scale (because correlations are standardized). So for each variable the mean is 0 and the SD is 1. Therefore, the variance is also 1 (because a correlation is the standardized version of the covariance). Knowing that we are analyzing correlation coefficients, and knowing that we have 10 variables, we know that we have 10 standardized variables, each contributing 1 point of variance. The total surface of this box is 10 points.
- We want to explain 10 points of variance. PCA finds the weighted combination of variables that takes the largest possible bite out of the surface of 10 info points. It does not explain everything, so we can do the same trick again: we want to find the component that explains the largest amount of the remaining black part, which is always going to be lower than what component 1 explains.
The whole box is explained when we have found 10 components. If I use as many components as variables, I will have 100% explained variance, but that is not data reduction, so we do not want that.
This whole story only works if we center (so standardize) the variables that we put into the analysis. Picture this:
- 4 variables;
- 1st var is scored on a scale of 1 to 5;
- 2nd var is scored on a scale of 10 – 50;
- 3rd var has a range of 1 to 200,000;
Which of these variables has the largest variance? Obviously variable nr. 3. Thus, what will happen in a component analysis is that the largest variance will determine the component. If you don’t center (so standardize), you don’t use correlations but covariances. In this case, your first component will be determined by the variable that simply has the largest variance (which is nr. 3, not because it is more important, but simply because its scale is so different from the other ones).
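A quick sketch of why the scale matters, using only Python's `statistics` module. The five scores per variable are made up, and only the three variables whose ranges are given above are included. Before standardizing, variable 3 swamps the others; after standardizing, every variable contributes 1 point of variance.

```python
from statistics import mean, pstdev, pvariance

# Made-up scores of 5 subjects on three variables with very different ranges.
data = {
    "var1 (1-5)": [1, 2, 3, 4, 5],
    "var2 (10-50)": [10, 20, 30, 40, 50],
    "var3 (1-200,000)": [1, 50_000, 100_000, 150_000, 200_000],
}

def standardize(values):
    """z-scores: subtract the mean (centering) and divide by the SD (scaling)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

raw_variance = {name: pvariance(v) for name, v in data.items()}
z_variance = {name: pvariance(standardize(v)) for name, v in data.items()}
# raw_variance is dominated by var3; every entry of z_variance is 1 (up to rounding).
```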
Rules of thumb for final scale
We consider a scale ‘reasonable’ when its reliability (Cronbach’s alpha) is
- > .70 for new scales
- > .80 for standardized instruments
- > .90 preferably, before we use new scales in practice (clinical practice)
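For reference, a minimal sketch of how Cronbach's alpha can be computed by hand, using the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of the sum score). The answers are made up and the function name `cronbach_alpha` is my own; SPSS and R have their own routines for this.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: list of item columns, each a list with one score per respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # sum score per respondent
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Made-up answers of 5 respondents on a 3-item scale.
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
alpha = cronbach_alpha(items)  # compare this value with the thresholds above
```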
In practice: think, test, rethink, retest, repeat..
Choosing a number of components
Final decision is based on
- Explained variance: e.g. if you have 2 components (from 15 variables) that explain 60% of everything, that means that 9 of the 15 points of variance (so more than 8 points) are in the first 2 components = how strong is the strongest part of my model
- Eigenvalue > 1 (use with care): the eigenvalue is the amount of variance points covered. Tricky: an eigenvalue larger than 1 means that we have found a component (an aggregated score) that, as an aggregate of several variables, explains more than a single variable would. But if you have 200,000 variables, it is VERY easy to find components that explain more than 1 point of variance due to sheer random chance of correlation (so this rule only works with a small number of variables).
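Both criteria can be read off a list of eigenvalues. The sketch below uses hypothetical eigenvalues for 6 standardized variables (chosen so that they sum to 6). It illustrates the arithmetic only and is not a rule for making the final decision.

```python
# Hypothetical eigenvalues from a PCA on 6 standardized variables (they sum to 6).
eigenvalues = [2.7, 1.5, 0.8, 0.5, 0.3, 0.2]

total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
proportion = [ev / total for ev in eigenvalues]

# Cumulative proportion of explained variance after 1, 2, ... components.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

kept = [ev for ev in eigenvalues if ev > 1]  # "eigenvalue > 1" rule, use with care
```

In this made-up case the rule keeps 2 components, which together cover 4.2 of the 6 points, so 70% of the variance.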