Exam Amda
Topic 1 - PCA & CFA
Learning Goals
1) Understand how PCA and CFA are (un)related in terms of both solution structure and
model specification.
PCA is widely applied —> Working with many variables that measure the same concept.
- You have an idea of how to design a questionnaire for some “concept” —> you design
questions that address several subconcepts —> are the items actually adequate for their
subdomain?
—> Important for dimensionality reduction and scale construction; for example, when measuring
intelligence: 1 concept? 2 concepts? Or 3 concepts?
Two main approaches
1) Principal component analysis (components are ordered by explained variance; the first one is
the principal one).
- Use this technique to get an idea of any underlying structure —> a set of variables on which you
could compute a correlation matrix. This set has no distinction between predictors and
outcomes. You are interested in how these variables interrelate (correlate).
- Exploring (not testing!): no significance tests.
- No pre-assumed structure; all 20 variables could be either predictors or outcomes.
2) Factor analysis
- Also works with groups of variables, but you already have an initial idea of what these groups
of variables might be.
- Predefined structures and these structures will be tested/evaluated in factor analysis. This
predefining might come from previous research or your own theory.
- Confirmatory method
- Makes a precise model for the relationship between items and scales —> Does the model hold for
your current sample?
So in short;
Both methods deal with groups of variables. PCA tries to identify potential groups of variables
based on the correlation structure. In CFA, you start from an idea of the variable grouping and
test whether this grouping structure works for your data.
Confirmatory Factor Analysis
- Test a specific factor structure —> predefined structure, grouping items/variables. No particular
predictor or outcome in this analysis.
- Trying to come up with fit measures to tell us how well our predefined structures work for the
data that we have.
- How do we test this?
In linear regression —> explained variance (to check the fit) (residuals = observed minus predicted
scores). The closer the predicted scores are to the observed outcomes, the better the model works.
(Significant regression coefficients.)
In CFA —> matrix situation: from a set of variables, construct a covariance matrix (a correlation
matrix is the standardized version). —> This is the observed matrix, because we can compute the
covariances between the things that we have observed. We then define a factor model and use it to
predict a covariance matrix.
So we compare the observed covariance matrix with the predicted covariance matrix. —> residuals
in a matrix shape.
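In symbols, the comparison can be sketched as follows. The notation is an assumption (conventional, not from these notes): S is the observed covariance matrix, Λ the loadings, Φ the factor (co)variances, Θ the diagonal matrix of unique variances, and R the residual matrix.

```latex
\hat{\Sigma} = \hat{\Lambda}\,\hat{\Phi}\,\hat{\Lambda}^{\top} + \hat{\Theta},
\qquad
R = S - \hat{\Sigma}
```

The closer the entries of R are to zero, the better the predicted matrix reproduces the observed one.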
> Interdependence technique —> so no predictor vs outcome (CFA)
Some form of a predictor structure going on.
> Dependence technique —> set of predictors and assume that some score in y depends on the
predictor variables (Linear regression)
Confirm our theoretical construct division.
Technical aim:
—> reproduce correlation / predict covariance matrix.
—> Error —> misfit between observed and predicted matrix. Errors are not correlated.
—> Correlations based on the observed numbers should be explained by common factors (sounds
like regression).
—> Regression equation with manifest response variables and two latent predictors. Assume that
there is some underlying but not directly observable process going on (F1 and F2). —> This
leads to differences in scores on the variables that we do observe (variables X1-X6).
So each item has its own simple linear regression on a factor —> F1 —> X1, F1 —> X2, etc.
We can predict scores for the items and from those predict a covariance matrix. We assume that
something we cannot see leads to the scores that we observe in the variables themselves.
Factors can be correlated or not. Items can also be explained by more than 1 factor
(cross-loadings). —> If there are many, it could be that you are ignoring the correlations between
the factors.
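A minimal sketch of this idea in Python, assuming NumPy is available. All numbers (loadings, factor correlation, sample size) are made up for illustration: X1-X3 load on F1, X4-X6 load on F2, and there are no cross-loadings. It builds the covariance matrix predicted by the factor model and compares it with the covariance matrix of simulated item scores.

```python
import numpy as np

# Hypothetical loadings: rows are items X1-X6, columns are factors F1 and F2.
# A zero means "this item does not belong to this factor".
loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.9],
    [0.0, 0.7],
    [0.0, 0.5],
])

# Hypothetical factor covariance matrix: unit variances, correlation 0.3.
factor_cov = np.array([[1.0, 0.3],
                       [0.3, 1.0]])

# Unique (error) variances chosen so that every item has total variance 1.
common_var = np.sum((loadings @ factor_cov) * loadings, axis=1)
unique_var = 1.0 - common_var

# Covariance matrix predicted by the factor model.
implied = loadings @ factor_cov @ loadings.T + np.diag(unique_var)

# Simulate item scores from the same model (uncorrelated errors).
rng = np.random.default_rng(0)
n = 500
factors = rng.multivariate_normal(np.zeros(2), factor_cov, size=n)
errors = rng.normal(0.0, np.sqrt(unique_var), size=(n, 6))
items = factors @ loadings.T + errors

# Observed covariance matrix and residuals in matrix shape.
observed = np.cov(items, rowvar=False)
residuals = observed - implied

print(np.round(implied, 3))
print(np.round(residuals, 3))
```

Because the data are simulated from the model itself, the residuals should be small; with real data they show where the predefined structure misfits.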
Compared to component analysis;
For each component, there is an arrow to all of the items. Some items will have loadings close to
zero and some will have very high loadings.
CFA is stricter than PCA because an item that is not assigned to a factor must have a loading of
exactly zero on that factor.
Factors; theoretical constructs that we examine —> our CFA model can be derived from a PCA.
Components; empirically suggested combinations of variables —> may or may not have meaning. In
CFA you already assume that this structure has meaning.
PCA —> if you have a set of variables you have no idea what will happen in terms of structure.
EFA —> instruments that have never been tested before. Some ideas and some items may be
correlated in some way, so there is a theory, but you are not testing this theory.
CFA —> One single strong conceptual idea of the factor structure.
Example;
4 correlated factors and 15 items. Assume —> each of these items corresponds to one factor
only. —> Number of (co)variances; 0.5 x number of items x (number of items + 1) = 0.5 x 15 x 16 =
120 in this case. —> Units of observation (for fit evaluation).
What elements are estimated in the model;
15 unique variances.
4 factor variances.
Correlations between the factors —> so 6 covariances between the factors; formula: 0.5 x number of
factors x (number of factors - 1) = 0.5 x 4 x 3.
11 loadings (free factor loadings) —> the difference between 15 items and 4 factors. For each of
these factors, one of the arrows needs a fixed factor loading; all of the other factor loadings will
be relative to that number. Thus, 4 are fixed and 11 values remain to be estimated.
Counting everything together —> 11 + 15 + 4 + 6 = 36 model parts that are going to be estimated
(number of parameters estimated in the model). We have 120 (co)variances, so 120 - 36 = 84 degrees
of freedom.
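The counting above can be checked with a small Python sketch. The function name is hypothetical, and it assumes the same setup as the example: every item loads on exactly one factor, one loading per factor is fixed, and all factors are allowed to correlate.

```python
def cfa_degrees_of_freedom(n_items, n_factors):
    """Count observed (co)variances, estimated parameters and df.

    Assumes simple structure (one factor per item), one fixed loading
    per factor, and freely correlated factors.
    """
    observed = n_items * (n_items + 1) // 2            # 0.5 x p x (p + 1)
    unique_variances = n_items                         # one per item
    factor_variances = n_factors                       # one per factor
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_loadings = n_items - n_factors                # one loading fixed per factor
    estimated = (unique_variances + factor_variances
                 + factor_covariances + free_loadings)
    return observed, estimated, observed - estimated


print(cfa_degrees_of_freedom(15, 4))  # (120, 36, 84)
```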
Check in the output;
- Warning and errors
- Standardized residuals
- Residual distribution
- Model fit statistics
- Estimated parameters
- Suggestions for improvement.
Assumptions
- CFA is preferably performed on metric/numerical variables (scale/interval level).
- The sample needs to have more observations than variables.
- For stable covariances we need at least 100 observations —> but CFA wants more, at least 200.
- Minimum 5, but preferably 10 observations per variable.
- A strong conceptual idea of what you are going to test —> hypothesized model!
Rules of thumb
Look at the X2 (chi-square) statistic
CFI —> comparative fit index >.95 (how well does your model fit compared to the baseline model).
sRMR —> <.08
RMSEA —> <.06 (also check its 90% confidence interval)
—> Also apply equivalence tests.
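These cut-offs can be written down as a simple check. This is only a sketch: the function name and the example values are hypothetical, and the thresholds are the rules of thumb listed above, not hard tests.

```python
def check_fit(cfi, srmr, rmsea):
    """Compare fit indices with the rule-of-thumb cut-offs from these notes."""
    return {
        "CFI > .95": cfi > 0.95,
        "sRMR < .08": srmr < 0.08,
        "RMSEA < .06": rmsea < 0.06,
    }


# Made-up values, as they might be read from a CFA output.
result = check_fit(cfi=0.93, srmr=0.05, rmsea=0.07)
for rule, ok in result.items():
    print(rule, "ok" if ok else "not met")
```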
No need for rotation; we don’t need to identify the best possible view of subgroups. There is one
specific grouping, defined by ourselves.
You can have variables that have a high coefficient on one factor only. Not per se a problem, as
long as you can defend it.
Model specification;
- 4 factors = 4 latent variables
—> if 13 variables —> 0.5 x 13 x 14 = 91 variances and covariances, so 91 numbers in this dataset.
Residual distribution
—> Symmetric distribution, on average 0, equal amounts of over- and underestimation.
Interpreting model fit statistics.
Chisq + df + p value = fit statistics.
Baseline chisq —> the difference between the model that you have currently estimated and a
model without any factor structure. The larger the X2, the larger the difference between the
covariance matrix based on your model and the one based on a model without any factor
structure. You want to have specified a model that is better than no model specification —> should
be significant!
The other chi-square —> the difference between the observed and predicted covariance matrices,
using your data and your model —> you do not want this to be significant! They should be alike! A
large difference means that your prediction does not resemble the observed matrix —> the model
deviates significantly from our observations.
- CFI should be >.95
We want to have a small standardized root mean square residual —> sRMR <.08
RMSEA —> <.06; if 0 —> perfect match between prediction and observation.
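As a sketch of where CFI and RMSEA come from, the commonly used formulas can be computed from the two chi-square values. Function names and example numbers are hypothetical, and note the assumption on sample size: some software divides by N instead of N - 1, so small differences from your output are possible.

```python
import math


def rmsea(chisq, df, n):
    """RMSEA from the model chi-square, its df and the sample size n."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))


def cfi(chisq_model, df_model, chisq_base, df_base):
    """CFI: improvement of your model over the baseline (no structure) model."""
    d_model = max(chisq_model - df_model, 0.0)
    d_base = max(chisq_base - df_base, d_model, 0.0)
    if d_base == 0.0:
        return 1.0
    return 1.0 - d_model / d_base


# Made-up example: 15 items, model df = 84, baseline df = 105, n = 301.
print(round(rmsea(168.0, 84, 301), 3))
print(round(cfi(168.0, 84, 1500.0, 105), 3))
```

When the model chi-square is no larger than its degrees of freedom, both functions return their best value (RMSEA = 0, CFI = 1).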
You can have a very reasonable model that does not yet fit very well according to the fit
statistics and X2.
Suggestions for improvement
Maybe we are being too strict by not allowing some of the factors to predict items of another
factor; for example, should factor 2 also be allowed to load on an item from factor 1? —> This may
explain extra variance, leading to a sufficient fit.
Request modification indices, which suggest where you might want to add things to your model to
improve the fit. For example; adding a loading of vocabulary on factor 3 would lead to a
2.494 decrease in misfit (X2).
2) Understand how PCA goes from data to component structure.
Relies on interrelationships between variables. The technique searches for a structure of
components by finding groups of variables that show high correlations within the group, but lower
correlations between the groups.
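A minimal sketch of going from data to components, assuming NumPy and simulated data (two made-up groups of three variables each). It uses an eigendecomposition of the correlation matrix, which is one standard way to compute PCA; software may use a different but equivalent route.

```python
import numpy as np

# Simulated data: variables 1-3 share one source, variables 4-6 another.
rng = np.random.default_rng(1)
n = 400
g1 = rng.normal(size=n)
g2 = rng.normal(size=n)
data = np.column_stack(
    [g1 + 0.5 * rng.normal(size=n) for _ in range(3)]
    + [g2 + 0.5 * rng.normal(size=n) for _ in range(3)]
)

# Step 1: correlation matrix of all variables (no predictors or outcomes).
corr = np.corrcoef(data, rowvar=False)

# Step 2: eigendecomposition; eigh is meant for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Step 3: sort components from most to least explained variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = np.clip(eigenvalues[order], 0.0, None)
eigenvectors = eigenvectors[:, order]

# Step 4: component loadings = eigenvector x sqrt(eigenvalue).
loadings = eigenvectors * np.sqrt(eigenvalues)
explained = eigenvalues / eigenvalues.sum()

print(np.round(explained, 3))        # proportion of variance per component
print(np.round(loadings[:, :2], 2))  # loadings on the first two components
```

Every variable gets a loading on every component; the grouping shows up as sets of variables with high loadings on the same component.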
Interpretation comes afterward, since it is an exploratory technique. It is possible that the
structures do not make sense —> probably due to weak correlations.
So —> Going from data (correlation matrix) to potentially suggested models that support a theory