Chapter 1: Exploratory factor analysis
What is multivariate analysis?
Multivariate analysis is about converting data into knowledge: distilling large, complex datasets into information that supports decision making.
Multivariate analysis in statistical terms
Multivariate analysis: all statistical techniques that simultaneously analyse multiple measurements
on individuals or objects under investigation. Thus, any simultaneous analysis of more than two
variables can be loosely considered multivariate analysis.
Whereas:
Univariate analysis: analysis of single-variable distributions
Bivariate analysis: cross classification, correlation, analysis of variance.
Simple regression: used to analyse two variables.
To be considered truly multivariate, all the variables must be random and interrelated in such ways
that their different effects cannot meaningfully be interpreted separately.
Multivariate analysis techniques: create knowledge and improve decision making. Multivariate
analysis refers to all statistical techniques that simultaneously analyse multiple measurements on
individuals or objects.
Some basic concepts of multivariate analysis
The variate: The building block of multivariate analysis. A linear combination of variables with
empirically determined weights. The variables are specified by the researcher, whereas the weights
are determined by the multivariate technique to meet a specific objective.
In multiple regression, the variate is determined in a manner that maximizes the correlation
between the multiple independent variables and the single dependent variable.
In discriminant analysis, the variate is formed so as to create scores for each observation that
maximally differentiates between groups of observations.
In factor analysis, variates are formed to best represent the underlying structure or patterns
of the variables as represented by their intercorrelations.
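A minimal numeric sketch of a variate, with made-up data and illustrative weights (in practice the multivariate technique estimates the weights empirically):

    import numpy as np

    # Hypothetical data: 4 respondents measured on 3 variables.
    X = np.array([
        [3.0, 4.5, 2.0],
        [4.0, 4.0, 3.5],
        [2.5, 3.0, 1.5],
        [5.0, 4.8, 4.2],
    ])

    # Illustrative weights; a multivariate technique would determine
    # these to meet its objective (e.g., maximize fit with an outcome).
    w = np.array([0.6, 0.3, 0.1])

    # Each respondent's variate value is a weighted linear combination.
    variate = X @ w
    print(variate)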
Measurement scales
Data can be classified into one of two categories (1) nonmetric – qualitative and (2) metric –
quantitative.
Nonmetric data: describe differences in type or kind by indicating the presence or absence of a
characteristic or property. These properties are discrete in that by having a particular feature, all
other features are excluded (if a person is male, he cannot be female). Nonmetric measurements can
be made with either a nominal or an ordinal scale.
Nominal scales: assign numbers as a way to label or identify subjects or objects. The
numbers assigned to the objects (categories or classes) have no quantitative meaning
beyond indicating the presence or absence of the attribute or characteristic under
investigation. Also known as categorical scales. (for instance, 2 = female and 1 = male).
Usually demographic attributes.
Ordinal scales: the next higher level of measurement precision. Variables can be ordered or
ranked in relation to the amount of the attribute possessed. They are nonquantitative
because they indicate only relative positions in an ordered series (for instance the levels of
consumer satisfaction, level of education). With ordinal scales you can determine the order of
the values but not the amount of difference between them (thus you may know that
product A is better than product B, but not how much better).
Metric Measurement Scales: Used when subjects differ in amount or degree on a particular
attribute.
Interval & ratio scales: provide the highest level of measurement precision, permitting nearly
any mathematical operation to be performed.
o The difference between them is that interval scales (e.g., Celsius) use an arbitrary zero
point, whereas ratio scales (e.g., weight) include an absolute zero point.
The impact of choice of measurement scale
If the researcher incorrectly defines a nonmetric measure as metric, it may be used
inappropriately (for instance, computing the mean of a nominal variable).
The measurement scale is also critical in determining which multivariate techniques are the
most applicable to the data, with considerations made for both independent and dependent
variables. The metric or nonmetric properties of independent and dependent variables are
the determining factors in selecting the appropriate technique.
Measurement error and multivariate measurement
Measurement error: the degree to which the observed values are not representative of the true
values. Thus, all variables used in multivariate techniques must be assumed to have some degree of
measurement error.
The researcher may also choose to develop multivariate measurements, also known as summated
scales, for which several variables are joined in a composite measure to represent a concept.
The objective is to avoid the use of only a single variable to represent a concept and instead
to use several variables as indicators, all representing differing facets of the concept to obtain
a more well-rounded perspective.
Validity and Reliability
In assessing the degree of measurement error present in any measure, the researcher must address
two important characteristics of a measure:
Validity: the degree to which a measure accurately represents what it is supposed to
(measure what you want to measure).
Reliability: the degree to which the observed variable measures the true value and is error
free; thus, it is the opposite of measurement error (measure consistently, without random
error).
The impact of measurement error and poor reliability is not directly seen, because they are embedded
in the observed variables.
Statistical significance versus statistical power
Interpreting statistical inferences requires the researcher to specify the acceptable levels of statistical
error that results from using a sample.
Type I Error: also known as alpha (α) = the probability of rejecting the null hypothesis when it is
actually true, generally referred to as a false positive.
Type II Error: also known as beta (β) = the probability of not rejecting the null hypothesis
when it is actually false. Its complement (1 − β) is referred to as the power of the statistical
inference test = the probability that statistical significance will be indicated if it is present.
                                            Reality
                                 No difference        Difference
Statistical   H0: no difference   1 − α               β (Type II error)
decision      H1: difference      α (Type I error)    1 − β (Power)
Power is the probability of correctly rejecting the null hypothesis when it should be
rejected. It indicates the chance of success in finding differences where they actually exist.
Type I and Type II errors are inversely related: as the chance of a Type I error decreases,
the chance of a Type II error increases.
Impacts on statistical Power
High levels of power cannot always be achieved because power is not solely a function of alpha. It is
determined by three factors.
1. Effect size: the probability of achieving statistical significance is based not only on statistical
considerations, but also on the actual size of the effect. Thus, the effect size helps
researchers determine whether the observed relationship is meaningful. (If a firm claims its
program leads to an average weight loss of 25 pounds, the 25 pounds is the effect size).
When examining effect sizes, a larger effect is more likely to be found than a smaller effect
and is thus more likely to impact the power of the statistical test.
2. Alpha: as alpha becomes more restrictive, power decreases. Therefore, as the researcher
reduces the chance of incorrectly saying an effect is significant when it is not, the probability
of correctly finding an effect decreases.
3. Sample size: at any given alpha level, increased sample sizes always produce greater power of
the statistical test. As sample sizes increase, the researcher must decide if the power is too high.
By ‘too high’ we mean that by increasing sample size, smaller and smaller effects will be
found to be statistically significant, until at very large sample sizes almost any effect is
significant.
The relationship between alpha, sample size, effect size, and power is complicated. For most
statistical tests, the rule of thumb is to set alpha at .05 and aim for a power of 80 percent.
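To make the interplay of alpha, effect size, and sample size concrete, here is a rough sketch using the normal approximation to the power of a two-sided, two-sample test; the effect size (d = 0.4) and group sizes are made up for illustration:

    import numpy as np
    from scipy.stats import norm

    def approx_power(d, n_per_group, alpha=0.05):
        # Normal approximation to the power of a two-sided,
        # two-sample t-test with standardized effect size d.
        z_crit = norm.ppf(1 - alpha / 2)
        noncentrality = d * np.sqrt(n_per_group / 2)
        return norm.cdf(noncentrality - z_crit)

    # Larger samples raise power; a stricter alpha lowers it.
    for n in (20, 50, 100, 200):
        print(n, round(approx_power(0.4, n), 3),
              round(approx_power(0.4, n, alpha=0.01), 3))

With d = 0.4, power crosses roughly .80 around 100 observations per group at alpha = .05, matching the rule of thumb above.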
RULES OF THUMB 1-1
Researchers should design studies to achieve a power level of .80 at the desired
significance level.
More stringent significance levels require larger samples to achieve the desired power.
Conversely, power can be increased by choosing a less stringent alpha level.
Smaller effect sizes require larger sample sizes to achieve the desired power.
An increase in power is most likely achieved by increasing the sample size.
Classification of multivariate techniques
Dependence technique: a variable or set of variables is identified as the dependent variable to be
predicted or explained by other variables, known as independent variables.
The different dependence techniques can be categorized by two characteristics:
1. The number of dependent variables.
2. The type of measurement scale employed by the variables. If the several dependent
variables are nonmetric, then they can be transformed through dummy variable coding (see the sketch below).
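As referenced above, a minimal dummy-coding sketch; the variable name and its categories are hypothetical:

    import pandas as pd

    # Hypothetical nonmetric variable with three categories.
    df = pd.DataFrame({"region": ["north", "south", "west", "south"]})

    # 0/1 dummy coding; dropping one category as the reference level
    # avoids perfect multicollinearity among the dummies.
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    print(dummies)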
Interdependence technique: no single variable or group of variables is defined as being independent or
dependent.
With interdependence techniques the variables cannot be classified as either dependent or
independent. Instead, all the variables are analysed simultaneously in an effort to find an underlying
structure to the entire set of variables or subjects.
Classification is based on three judgements:
1. Can the variables be divided into independent and dependent classifications based on some
theory? This indicates whether a dependence or interdependence technique should be
utilized.
2. If they can, how many variables are treated as dependent in a single analysis?
3. How are the variables, both dependent and independent, measured?
The interdependence techniques for this exam:
Factor Analysis (the structure of the relationship among variables)
Confirmatory Factor Analysis (the structure of the relationship among variables)
Relationship       Technique                Measurement scale                Types
Interdependence    Factor analysis          Metric (variance)                Exploratory: PC, PFA
                                                                             Confirmatory: Mplus, Lisrel,
                                                                             AMOS, STATA
Dependence         Multiple regression      Metric y, metric x's             Linear regression
                   Logistic regression      Nonmetric y, metric x's
                   (M)AN(C)OVA              Metric y('s), (non)metric x's
                   Partial Least Squares    Multiple metric y's and x's      SmartPLS, MPlus
Chapter 3: Exploratory factor analysis
What is factor analysis?
Factor analysis: interdependence technique whose primary purpose is to define the underlying
structure among the variables in the analysis (multivariate and interdependence technique).
As one adds more and more variables, more and more overlap (i.e. correlation) is likely
among the variables.
Factor analysis provides the tools for analysing the structure of the interrelationships (correlations)
among a large number of variables by defining sets of variables that are highly interrelated, known as
factors. These factors are assumed to represent dimensions within the data.
Two approaches:
1. Confirmatory approach: assesses the degree to which the data meet the expected
(theoretical) structure.
2. Exploratory approach: searching for structure among a set of variables or as a data reduction
method.
In factor analysis, all variables are simultaneously considered with no distinction as to dependent or
independent variables.
Variate: within factor analysis, the variates (factors) are formed to maximize their explanation of the
entire variable set, not to predict a dependent variable. In other methods, the variate is a linear
composite of variables formed to predict a dependent variable.
Loadings: the contribution of each variable to a factor, i.e. the correlation between the variable and the factor.
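A small sketch of how loadings arise in a principal component solution, using simulated data (the data and the built-in correlation are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated data: 200 respondents, 6 variables.
    X = rng.standard_normal((200, 6))
    X[:, 1] += 0.8 * X[:, 0]                  # build in some correlation
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize

    R = np.corrcoef(X, rowvar=False)          # correlation matrix of variables
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # For standardized data, a variable's loading on a component is the
    # eigenvector entry scaled by the square root of the eigenvalue,
    # i.e. the correlation between the variable and the component.
    loadings = eigvecs * np.sqrt(eigvals)
    print(np.round(loadings[:, :2], 2))       # loadings on the first two factors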
Factor analysis decision process
Stage 1: Objectives of factor analysis?
General purpose of factor analytical techniques: to find a way to condense (summarize) the
information contained in a number of original variables into a smaller set of new, composite
dimensions or variates (factors) with a minimum loss of information (thus searching for the
fundamental constructs or dimensions assumed to underlie the original variables).
Then factor analysis is keyed to four issues:
1. Specifying the unit of analysis
2. Achieving data summarization and/or data reduction
3. Variable selection
4. Using factor analysis results with other multivariate techniques
1.1 Specifying the unit of analysis
Factor analysis can identify the structure of relationships among either:
1. Variables
2. Respondents
By examining either:
A. The correlations between the variables
B. The correlations between the respondents.
Focused on variables:
Objective: summarizing the characteristics.
Factor analysis would be applied to a correlation matrix of variables.
The most common approach, R factor analysis, analyses a set of variables to identify the
dimensions that are latent (not easily observed).
Unit of analysis is variables
Focused on respondents:
Objective: combine or condense large numbers of people into distinctly different groups within a
larger population. Applied to a correlation matrix of the individual respondents based on their
characteristics.
Referred to as Q factor analysis. Not utilized frequently due to computational difficulties.
More often cluster analysis is used, to group individual respondents.
Unit of analysis is respondents.
1.2 Achieving data summarization versus data reduction
Summarizing data: factor analysis derives underlying dimensions that, when interpreted and
understood, describe the data in a much smaller number of concepts than the original variables.
Data reduction: remove redundant (highly correlated) variables from the data file, perhaps replacing
the entire data file with a smaller number of uncorrelated variables.
Data summarization
The fundamental concept: definition of structure.
Purpose: defining a small number of factors that adequately represent the original set of variables.
Through structure, the researcher can view the set of variables at various levels of
generalization, ranging from individuals to concepts.
Data reduction
Purpose: retain the nature and characteristics of the initial variables but reduce their number to
simplify the subsequent multivariate analysis. How:
1. Identifying representative variables from a much larger set of variables for use in subsequent
multivariate analyses.
2. Creating an entirely new set of variables, much smaller in number, to partially or completely
replace the original set of variables.
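A minimal data-reduction sketch using principal components; the data are simulated and the 80 percent variance threshold is an arbitrary choice for illustration:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    # Simulated data: 100 respondents, 10 partly redundant variables
    # driven by only 2 underlying dimensions plus noise.
    base = rng.standard_normal((100, 2))
    X = base @ rng.standard_normal((2, 10)) + 0.3 * rng.standard_normal((100, 10))

    # Replace the 10 correlated variables with the few uncorrelated
    # component scores needed to retain 80% of the variance.
    pca = PCA(n_components=0.8)
    scores = pca.fit_transform(X)
    print(X.shape, "->", scores.shape)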
1.3 Variable selection
Specify the potential dimensions that can be identified through the character and nature of
the variables submitted to factor analyses.
Watch out for 'garbage in, garbage out'. The quality and meaning of the derived factors reflect
the conceptual underpinnings of the variables included in the analysis.
1.4 Using factor analysis with other multivariate techniques
Variables determined to be highly correlated and members of the same factor would be
expected to have similar profiles of differences across groups in multivariate analysis of
variance or in discriminant analysis.
Highly correlated variables affect the stepwise procedure of multiple regression and
discriminant analysis that sequentially enter variables based on their incremental predictive
power over variables already in the model. As one variable from a factor is entered, it
becomes less likely that additional variables from the same factor would also be included,
because their high correlations with variables already in the model leave them little
incremental predictive power.
o It does not mean that the other variables of the factor are less important or have less
impact; rather, their effect is already represented by the included variables from the
factor. Knowledge of the structure of the variables would by itself give the researcher
a better understanding of the reasoning behind the entry of variables in this technique.
The insight provided by data summarization can be directly incorporated into other multivariate
techniques through any of the data reduction techniques.
Stage 2: Designing a factor analysis
1. Calculation of the input data (a correlation matrix) to meet the specified objectives of
grouping variables or respondents.
2. Design of the study in terms of number of variables, measurement properties of variables,
and the types of allowable variables.
3. The sample size necessary, both in absolute terms and as a function of the number of
variables in the analysis.
2.1 Correlations among variables or respondents
R-type factor analysis: the researcher uses a traditional correlation matrix (one computed
across variables) as input.
Q-type factor analysis: uses the correlations among respondents. One could identify groups or
clusters of individuals that demonstrate a similar pattern on the variables included in the
analysis.
Difference between Q-type and cluster analysis: whereas Q-type is based on the intercorrelations
between respondents, cluster analysis forms groupings based on a distance-based similarity measure
between the respondents' scores on the variables being analysed. Based on the example on page 99:
Q-type: groups respondents with the same pattern across all the variables.
Cluster analysis: is sensitive to the actual distances among the respondents, so it groups the closest pair.
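The contrast can be made concrete with a small sketch (the respondent scores are made up): respondents 0 and 1 below follow the same pattern and correlate perfectly, yet they are far apart in distance terms, so Q factor analysis and cluster analysis would group them differently:

    import numpy as np

    # Hypothetical scores: 4 respondents (rows) on 5 variables (columns).
    X = np.array([
        [1, 2, 3, 4, 5],
        [2, 3, 4, 5, 6],   # same pattern as respondent 0, shifted up
        [5, 4, 3, 2, 1],   # opposite pattern
        [1, 2, 3, 4, 6],
    ], dtype=float)

    # Q-type input: correlations BETWEEN respondents (pattern similarity).
    print(np.round(np.corrcoef(X), 2))

    # Cluster-analysis input: distances between respondents (proximity).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(np.round(dists, 2))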
2.2 Variable selection and measurement issues
Central question: What type of variables can be used in factor analysis?
The primary requirement is that a correlation value can be calculated among all variables.
This is harder with nonmetric variables: avoid them, or recode them as dummy variables
(0/1 indicators representing the categories of a nonmetric variable). Boolean factor
analysis, a special form of factor analysis, uses only dummy variables.
Central question: How many variables should be included?
The researcher should attempt to minimize the number of variables, but still maintain a reasonable
number of variables per factor.
When the purpose of the study is to find a structure, include several variables (five or more) per expected factor.
The strength of FA lies in finding patterns among groups of variables, and it is of little use in
identifying factors composed of only a single variable.
2.3 Sample size
Researchers would not use factor analysis on a sample of fewer than 50 observations, and
preferably the sample size should be 100 or larger.
The rule of thumb is to have at least five times as many observations as the number of
variables to be analysed; a more acceptable sample has a 10:1 ratio (see the check below).
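A trivial check of these rules of thumb, with hypothetical study numbers:

    # Hypothetical design: 180 observations, 15 variables.
    n_obs, n_vars = 180, 15

    ratio = n_obs / n_vars
    print(f"ratio = {ratio:.0f}:1")
    print("meets absolute minimum of 50 observations:", n_obs >= 50)
    print("meets the 5:1 rule of thumb:", ratio >= 5)
    print("meets the preferred 10:1 ratio:", ratio >= 10)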
Stage 3: Assumptions in factor analysis
3.1 Conceptual issues
Basic assumption: Some underlying structure does exist in the set of selected variables.
The presence of correlated variables and the subsequent definition of factors do not
guarantee relevance, even if they meet the statistical requirements.
Related to the set of variables selected and the sample chosen.
Basic assumption: The sample must be homogeneous with respect to the underlying factor structure.
Example: it is inappropriate to apply FA to a sample of males and females for a set of items
known to differ because of gender. When the two subsamples are combined, the resulting
correlations and factor structure will be a poor representation of the unique structure of
each group.
Therefore, when you expect differing groups in the sample, separate factor analyses should
be performed, and the results should be compared to identify differences not reflected in the
results of the combined sample.
3.2 Statistical issues
Departures from normality, homoscedasticity, and linearity apply only to the extent that they
diminish the observed correlations.
Some degree of multicollinearity is desirable, because the objective is to identify interrelated
sets of variables.
3.3 Overall measures of intercorrelation (assumptions)
The data matrix should have sufficient correlations to justify the application of factor analysis. If all
of the correlations are low, or all correlations are equal (so that no structure exists to group
variables), the use of factor analysis should be questioned. To this end, there are several approaches:
1. Visual: if visual inspection reveals no substantial number of correlations greater than .30, then
factor analysis is probably inappropriate.
2. Partial correlation (computed by software): the correlation that remains unexplained when the
effects of the other variables are taken into account.
- If "true" factors exist, the partial correlations should be small, because each variable
can be explained by the variables loading on the factors. High partial correlations indicate
no underlying factors, so factor analysis is not appropriate (a significant partial value of .7
or above counts as high); a computational sketch follows below.
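As referenced above, a sketch of computing all pairwise partial correlations from the inverse of the correlation matrix; the input matrix is made up:

    import numpy as np

    def partial_correlations(R):
        # Partial correlation between each pair of variables,
        # controlling for all others, from the inverse (precision) matrix.
        P = np.linalg.inv(R)
        d = np.sqrt(np.diag(P))
        partial = -P / np.outer(d, d)
        np.fill_diagonal(partial, 1.0)
        return partial

    # Hypothetical correlation matrix for three variables.
    R = np.array([
        [1.0, 0.6, 0.5],
        [0.6, 1.0, 0.4],
        [0.5, 0.4, 1.0],
    ])
    print(np.round(partial_correlations(R), 2))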
3. Bartlett’s test of sphericity: examines the entire correlation matrix; it is a statistical test for
the presence of correlation among the variables. It provides the statistical significance that
the correlation matrix has significant correlations among at least some of the variables.
- However, the test is sensitive to an increasing sample size.
- It tests the null hypothesis that the variables are uncorrelated in the population. You
want to reject the null hypothesis here and you want to be sure that there are
enough correlations in the population.
- Tests the hypothesis that your correlation matrix is an identity matrix, which would
indicate that your variables are unrelated and therefore unsuitable for structure
detection. Small values (less than 0.05) of the significance level indicate that a factor
analysis may be useful with your data.
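A sketch of Bartlett's test of sphericity computed from its standard chi-square formula; the simulated data are only for illustration:

    import numpy as np
    from scipy.stats import chi2

    def bartlett_sphericity(X):
        # H0: the correlation matrix is an identity matrix
        # (the variables are uncorrelated in the population).
        n, p = X.shape
        R = np.corrcoef(X, rowvar=False)
        statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
        df = p * (p - 1) / 2
        return statistic, chi2.sf(statistic, df)

    rng = np.random.default_rng(2)
    X = rng.standard_normal((100, 5))
    X[:, 1] += X[:, 0]                # induce correlation among variables
    stat, p_value = bartlett_sphericity(X)
    print(f"chi2 = {stat:.1f}, p = {p_value:.4f}")  # small p: FA may be useful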