Lecture notes topic 1 AMDA fall 2022 – PCA, CATPCA and CFA
Lecture 1 – Principal Component Analysis
Several distinctions in statistics:
- Descriptive vs Testing
- Exploration vs Confirmation
- Dependence vs Interdependence
With PCA, we are in the exploratory, interdependence situation.
PCA is widely applied:
- If you have 4 single variables that each measure SES in some way (e.g. income, educational
level), can't you just use one (weighted) average? → But how are you going to weight them? Is
educational level more important than income, for example? And if so, how much more
important? That is something PCA can tell us. PCA gives you the weights (c weights /
component loadings) of the 'ingredients' / variables. Why is this useful? → Here we
summarize 4 variables into one component, and this component becomes one new variable,
which you can in turn use in, for example, a regression analysis. So you are combining PCA
with regression here. This is called dimension reduction: going from 4 dimensions to 1.
- How many subconcepts of intelligence can we distinguish? → You can also do this reduction
the other way around: in my 24 items, are there subdomains?
- What chemicals possess similar properties under heat / pressure / …?
- Quantify ethnic spread in (sub)populations
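The SES example above can be sketched in a few lines. This is a minimal illustration with simulated data; the variable count (4 indicators, 100 respondents) follows the lecture example, but the data themselves are made up:

```python
# Sketch: reduce 4 hypothetical SES indicators to one component score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulate 100 respondents whose 4 indicators share one latent SES factor.
latent = rng.normal(size=(100, 1))
X = latent + 0.5 * rng.normal(size=(100, 4))

# Standardize, then extract a single principal component.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA(n_components=1)
scores = pca.fit_transform(Z)      # one new variable per respondent
weights = pca.components_[0]       # the c weights for the 4 indicators

print(weights.shape)   # (4,)  -- one weight per original variable
print(scores.shape)    # (100, 1) -- usable as a predictor in regression
```

The `scores` column is exactly the "one new variable" the notes describe: it can be plugged into a subsequent regression analysis.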
PCA in chemistry → which chemicals behave in a similar way, for example?
The picture on the right is called a biplot. It is the output of a component
analysis. Which variables are correlated? Those variables lie close together,
and (can) form one component. It is a visual representation of the dimension
reduction → from variables to 2 dimensions.
PCA for ethnicity → can we, based on
genetics, find subgroups of individuals with
similar genetic make-up? You have 10,000 or
more genetic variables. If you measure them all, you need to summarize them: going down from
a lot of genetic variables to something that you can use in a regression model.
In the picture on the right, people close to each other are closer in genetic make-
up. You can see that people in the same area are also more closely related
genetically. So the components make geographical sense: people from the
north of the Netherlands are genetically different from people from the
south of the Netherlands. This is a very practical example, and the computation
takes a lot of time. But it is an example of what you can do with PCA.
A dataset example (1): Dimension reduction
- Suppose we have a large collection (say 100) of variables (dimensions)
- Of which several (25) measure the same concepts
- Then working with 100 is too much (why?), but how do you summarize these? Or:
- How do you reduce the dimensionality of the items? → here you are not finding the number
of components; you just want to summarize the variables in 1 score. This way, PCA tells you
how important the different variables are.
- This is exploration!
A dataset example (2): Scale construction
- You have an idea on how to design a questionnaire for some concept
- You design questions that address several subconcepts
- But are the supposed items actually adequate in their subdomains?
- Which items form sub-scales of your instrument,
- And how reliable are these sub-scales?
E.g. Intelligence:
- 1 concept? (General Intelligence)
- 2 concepts? (Verbal and Performance)
- 3 concepts? (Verbal, Performance, Freedom from distractibility)
Extra note!: It is principal component analysis, not principal components analysis! → Because there is
only 1 first principal component (but there can be more components). We use PCA when we have no
idea about what is going on in our data: there is no significance testing, no confidence intervals etc.
We only explore, and use it for visualization. PCA can give us simple but informative plots of the data!
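Those "simple but informative plots" are usually the first two component scores plotted against each other. A minimal sketch with simulated data (the sample size and number of variables are arbitrary choices, not from the lecture):

```python
# Sketch: project data onto the first two principal components for plotting.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))              # 150 cases, 6 variables (simulated)
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first

scores = PCA(n_components=2).fit_transform(Z)
# scores[:, 0] and scores[:, 1] are the x and y coordinates of the plot;
# e.g. plt.scatter(scores[:, 0], scores[:, 1]) would draw the component plot.
print(scores.shape)  # (150, 2)
```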
Components versus Factors
Two main approaches:
Principal Component Analysis →
- We use this technique to get an idea about any underlying structure → no testing!
- No pre-assumed structure
- Exploratory method
- Visualization
Factor Analysis →
- To confirm or reject a suspected factor structure → testing!
- Structure derived from previous research or theory
- Confirmatory method → hypothesis testing
Similarities:
- Both methods deal with groups of variables.
Factor analysis: you assume knowledge of which items belong to which scales, based on previous
research or theory. With factor analysis, you make a precise model for the relationships between
items and scales. Based on fit: do results on new data match those from other work? Is the model
true for your (current) sample? This is not performed in SPSS, but specialized programs for Structural
Equation Modelling (SEM) exist (EQS, LISREL and others, like lavaan in R). You go from theory/model
to data here. So, factor analysis also deals with groups of variables, but in this case there is an initial
idea of what these groups of variables might be: a predefined structure. We will talk about this next week.
PCA is the other way around: it is theory generating, and you evaluate relationships in the data
(Pearson, 1901). But, without an a priori structure! We explore data for a structure of principal
components (PCA) → e.g. 20 variables, we have a correlation matrix of 20 x 20. The technique
searches for a structure of components: it finds groups of variables that show high correlations with
each other, and low correlations with others. External (theoretical) knowledge is used afterwards for
interpretation. Analyses are performed in SPSS (Factor) or R. So you go from data to model/theory!
For example: we have a set of variables, but this set has no distinction in terms of a predictor set vs an
outcome set. We are interested in how these variables inter-relate: correlations.
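The search for structure in a 20 x 20 correlation matrix described above can be sketched as an eigendecomposition. The data here are simulated; in practice `R` would be computed from your own 20 variables:

```python
# Sketch: PCA as an eigendecomposition of a correlation matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
R = np.corrcoef(X, rowvar=False)       # the 20 x 20 correlation matrix

# Eigenvalues = variance explained per component; eigenvectors give weights.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]      # sort components, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component loadings: eigenvector scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

print(R.shape)         # (20, 20)
print(eigvals.sum())   # 20.0: total variance of 20 standardized variables
```

The components with the largest eigenvalues are the groups of variables that correlate highly with each other; the loadings are then interpreted using external (theoretical) knowledge.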
You could first do PCA and play around, and next, based on the outcomes of your PCA, do a factor
analysis. In that case, you are not exploring but testing your first results! So you can use the
output of PCA as input for factor analysis.
The principle of Principal Components
Basic scale construction → You want to create a sum score. However, you cannot just
add all the scores; you need weights for the scores, because it is possible that this
set of 8 items is not one homogeneous set.
Scale construction using crisp weights, in an example where you have 8 items (4 items
called A, and 4 items called B):
CA = c1A1 + c2A2 + c3A3 + c4A4 + c5B1 + c6B2 + c7B3 + c8B4
- Variable either in or out of a scale
- Weights c either 1 (in) or 0 (out)
- Variables determine scale interpretation
- Equal or no contribution to construct
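A crisp-weight sum score is then just a 0/1-weighted sum. A tiny sketch (the item scores for one person are invented for illustration):

```python
# Sketch of 'crisp' scale construction: weights are 1 (in) or 0 (out).
import numpy as np

items = np.array([4, 3, 5, 4, 2, 1, 2, 3])   # A1..A4, B1..B4 for one person
c = np.array([1, 1, 1, 1, 0, 0, 0, 0])       # scale A: A-items in, B-items out

CA = np.dot(c, items)   # unweighted sum of the included items only
print(CA)  # 16
```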
What PCA does: look at the data structure, decompose the whole thing, and suggest numbers to put on
the c's (component loadings), depending on how much weight they carry. PCA finds out how many
subscales there might be, and which items should best be included in each subscale. So you can add
all the items to form one subscale, but you have to take into account how much each item contributes
to the total.
In this example: the 8 items (A1 to B4) are already in two groups: scale A and scale B. You could do it
as follows: CA = 1A1 + 1A2 + 1A3 + 1A4 + 0B1 + 0B2 + 0B3 + 0B4. This would be the 'easy,
straightforward' version. But we want to move away from 'in' or 'out' (0 and 1); we want to go here:
Advanced scale construction → Scale construction using actual weights: CA = c1A1 +
c2A2 + c3A3 + c4A4 + c5B1 + c6B2 + c7B3 + c8B4
- More subtle inclusion (or exclusion)
- Weighed contribution to component
- c is anything between 0 and 1 (in general, loadings can also be negative, between −1 and 1)
- Some variables are more important, others are less important
Values c < 0.30: exclusion
Values c > 0.30: inclusion?
Values c > 0.50: inclusion (if highest)
Values c > 0.80: for clinical instruments
CA = 0.8A1 + 0.7A2 + 0.9A3 + 0.9A4 + 0.0B1 + 0.2B2 + 0.3B3 + 0.1B4 → what this means is that items B1
to B4 belong a little bit to the first group, but not a lot. So probably they are only slightly correlated
with the A items. If the weights of B1 to B4 were exactly 0, this would mean that they are completely
uncorrelated (the 4 B items would in that case have nothing to do with the component). But this is hardly
ever true! Again: the weights are suggested by PCA, and are mathematically based. In this example
you can see that items A3 and A4 are more important (c = 0.9) to the subscale than item A2 (c = 0.7).
So again: PCA tells you how many subgroups there might be, which items go into which groups
best, and which contribution coefficient the items have. This is all exploration.
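Computing the weighted sum score with these loadings is the same dot product as before, just with the PCA-suggested weights from the example (the item scores for one person are again invented):

```python
# Sketch: the same sum score, now with the component loadings from the text.
import numpy as np

items = np.array([4, 3, 5, 4, 2, 1, 2, 3])               # A1..A4, B1..B4
c = np.array([0.8, 0.7, 0.9, 0.9, 0.0, 0.2, 0.3, 0.1])   # component loadings

CA = np.dot(c, items)
print(round(CA, 2))  # 14.5
```

Compared with the crisp 0/1 version, A3 and A4 now count for more than A2, and the B items contribute only marginally.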
Some details on PCA and components:
- Component is a weighted linear combination of items: A = c1A1 + c2A2 + c3A3 + c4A4 + .. + ..