Applied Multivariate Data Analysis – Week 1
Ch 2: The Spine of Statistics
The acronym SPINE stands for:
(1) Standard Error
(2) Parameters
(3) Interval Estimates (confidence intervals)
(4) Null Hypothesis significance testing
(5) Estimation
Statistical Models
Scientists collect data from the real world to test predictions from hypotheses about a
phenomenon
- Testing these hypotheses involves building statistical models of the phenomenon
of interest
Scientists build statistical models of real-world processes to predict how these processes
operate under certain conditions
- Scientists do not have access to the real-world situation – and can only infer
things about processes based upon the models built
o The statistical model should represent the data collected – i.e., the
observed data – as closely as possible in order for the predictions to be
accurate
The degree to which the statistical model represents the data collected – called the fit of the
model
1. An excellent representation of the real-world situation => good fit
2. A model with some similarities – but also important differences – to real-world
situation => moderate fit
3. A model that is completely different from the real-world situation => poor fit
- If the model is a poor fit to the observed data – the predictions inferred from it will be
equally poor
Types of Statistical Models
Linear Models – Models based on a straight line
- Statistical systems based on the linear model include ANOVA and regression
- Linear models tend to get fitted to data – as they are less complex and non-linear
models are rarely taught
Non-Linear Models – Can be a good fit for some types of data/research
- Rarely taught – thus, rarely used
Data can be represented on a scatterplot – in which each dot represents a certain score
Consequences of Using Mainly Linear Models
1) Many published statistical models may not be the ones that fit best – bc authors did
not try non-linear models
2) Findings may have been missed because a linear model was a poor fit – and scientists
gave up rather than fitting non-linear models
It is best to plot the data first – if the plot seems to suggest a non-linear model, then do not
apply a linear model
Statistical Models – Main Equation
Everything in statistics boils down to one equation:
outcome_i = (model) + error_i
This equation means that the data we observe can be predicted from the model we choose +
some amount of error
- Where the i refers to the ith score => reflecting the fact that the value of the
outcome and the error will be different for each person
The ‘model’ in the equation will vary depending on:
(1) The design of the study
(2) The type of data
(3) The aim of using the model
We predict an outcome variable from some model – but we do so imperfectly – therefore,
there is some error in there
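The decomposition above can be sketched numerically. A minimal example – with made-up scores, using the mean as the simplest possible model:

```python
# Sketch of outcome_i = (model) + error_i, using the mean as the model.
# The scores are made up for illustration.
outcomes = [4, 6, 3, 7, 5]

model = sum(outcomes) / len(outcomes)   # the model's prediction: the mean (5.0)
errors = [y - model for y in outcomes]  # error_i = outcome_i - model

# Each observed score decomposes exactly into model + error.
for y, e in zip(outcomes, errors):
    assert abs(y - (model + e)) < 1e-9
print(model)  # 5.0
```

The point of the sketch is that the "model" part is the same for every entity here, while the error term differs per score.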
Populations and Samples
Scientists are interested in finding results that apply to an entire population of entities (=>
generalizable)
A population can be (1) very general – e.g., all human beings – or (2) very narrow – e.g., all
male ginger cats
Typically, scientists strive to infer things about general populations rather than narrow ones
- As such findings and conclusions have a much wider impact
There is rarely access to every member of a population – therefore, data is collected from a
smaller subset of the population – i.e., a sample
- The data is then used to infer things about the population as a whole
The bigger the sample => the more likely it is to reflect the whole population
- Different random samples will give slightly different results – but on
average, the results from large samples would be similar
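This can be checked with a small simulation. A sketch, assuming a hypothetical normally distributed population (all values are made up):

```python
import random
import statistics

random.seed(42)  # for reproducibility

# Hypothetical population: 100,000 values (mean ~100, SD ~15).
population = [random.gauss(100, 15) for _ in range(100_000)]

# Means of many small vs. many large random samples.
small_means = [statistics.mean(random.sample(population, 10)) for _ in range(200)]
large_means = [statistics.mean(random.sample(population, 500)) for _ in range(200)]

# Means of large samples cluster more tightly around the population mean.
spread_small = statistics.stdev(small_means)
spread_large = statistics.stdev(large_means)
assert spread_large < spread_small
```

The spread of the small-sample means is several times larger than that of the large-sample means, which is exactly the "bigger sample => more representative" point.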
P is for Parameters
Parameters are the P in the SPINE of statistics
Statistical models are made up of variables and parameters
1) Variables – i.e., measured constructs that vary across entities in the sample
2) Parameters – i.e., they are not measured and are constants believed to represent
some fundamental truth about the relations between variables in the model
- E.g., parameters include the mean and median – i.e., estimate the center of the
distribution – and the correlation and regression coefficients – i.e., which
estimate the relationship between two variables
Case (1) – In cases in which one is only summarizing the outcome (=> as we are when
computing the mean), there will be no variables in the model – only a parameter:
outcome_i = (b_0) + error_i
Case (2) – In cases in which we want to predict an outcome from a variable => expand the
model to include this variable (predictor variables are denoted with X):
outcome_i = (b_0 + b_1 X_i) + error_i
This equation predicts the value of the outcome for a particular entity (=> i) – not just from
the value of the outcome when there are no predictors (=> b_0)
- But from the entity's score on the predictor variable (=> X_i)
The predictor variable has a parameter (=> b_1) attached to it
- This parameter tells us something about the relationship between the predictor X_i
and the outcome
Case (3) – In cases when predicting an outcome from two predictors => add another
predictor to the model:
outcome_i = (b_0 + b_1 X1_i + b_2 X2_i) + error_i
This model predicts the value of the outcome for a particular entity i from the value of the
outcome when there are no predictors (b_0) and the entity's score on two predictor variables
(X1_i and X2_i)
Each predictor variable has a parameter (b_1, b_2) attached to it => tells us something about
the relationship between that predictor and the outcome
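A two-predictor model of this form can be sketched directly. The parameter values below are made up for illustration – they are not estimated from any real data:

```python
# Sketch of outcome_i = (b_0 + b_1*X1_i + b_2*X2_i) + error_i with
# made-up parameter values (not estimated from data).
b0, b1, b2 = 2.0, 0.5, -1.0

def predict(x1, x2):
    # The "model" part of the equation: the value expected for an entity
    # given its scores on the two predictors.
    return b0 + b1 * x1 + b2 * x2

observed = 3.2                 # hypothetical observed outcome for one entity
predicted = predict(4.0, 1.0)  # 2.0 + 0.5*4.0 - 1.0*1.0 = 3.0
error = observed - predicted   # the leftover error_i for this entity
print(predicted, round(error, 1))  # 3.0 0.2
```

Whatever values the parameters take, the observed score is always the model's prediction plus that entity's error.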
In Summary – values of an outcome variable can be predicted based on a model
The form of model changes – but there will always be some error in prediction
- And there will always be parameters that tell us about the shape or form of the
model
Working Out How the Model Looks
In order to work out what the model looks like => estimate the parameters (i.e., the values
of b)
- We want to know what our model may look like in the whole population =>
parameter estimates
The model is defined by parameters – as such, we are not interested in the parameter values
in the sample => interested in the parameter values in the population
The sample data can only be used to estimate the population parameter values – since we did
not measure the population, but only the sample
The Mean as a Statistical Model
The mean value is a hypothetical value – i.e., it is a model created to summarize the data and
there will be error in prediction
The model is:
outcome_i = (b_0) + error_i
In which the parameter b_0 => is the mean of the outcome
- The value of the mean/parameter computed in a sample – can be used to estimate
the value in the population
outcome_i = (b̂_0) + error_i
When referring to an estimate => add a hat on top, to express explicitly that the value is an
estimate rather than the true parameter value
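The estimate/parameter distinction can be simulated. A sketch with a hypothetical population (in practice the population, and hence the true b_0, is unknowable):

```python
import random
import statistics

random.seed(1)

# Hypothetical population whose mean is the parameter b_0 we want to know.
population = [random.gauss(50, 10) for _ in range(50_000)]
b0 = statistics.mean(population)   # true parameter (normally not observable)

# We only measure a sample, so we compute an estimate b̂_0 from it.
sample = random.sample(population, 100)
b0_hat = statistics.mean(sample)

# The estimate should land near the true value (SE ≈ 10 / sqrt(100) = 1).
assert abs(b0_hat - b0) < 5
```

The sample mean `b0_hat` plays the role of b̂_0: a guess at the population parameter, not the parameter itself.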
Assessing the Fit of a Model – Sum of Squares and Variance
With most statistical models – can determine whether the model represents the data well by
looking at how different the scores observed in the data are from the values that the model
predicts
Estimating Model Fit for a Particular Entity
Given that a model predicted a mean of 2.6 for the outcome of 1 entity => in order to calculate
the error, fill in and rearrange the equation: outcome_i = (b̂_0) + error_i
1 = 2.6 + error_1 => error_1 = 1 − 2.6 = −1.6
As such => we have just calculated the deviance – i.e., the error
deviance = outcome_i − model_i
The error/deviance for a particular entity => the score predicted by the model for that entity
subtracted from the corresponding observed score
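The worked example above, as code (model prediction 2.6, observed score 1):

```python
# Deviance for one entity: observed score minus the model's prediction.
observed = 1.0
predicted = 2.6   # b̂_0, the mean predicted by the model

deviance = round(observed - predicted, 1)  # error_1 = outcome_1 - model_1
print(deviance)  # -1.6
```

The negative sign shows the model overestimated this entity's actual score.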
The line representing the mean can be thought of as our model
- The dots are the observed data
The diagram has a series of vertical lines that connect each observed value to the mean value
- These represent the error/deviance of the model for each entity
A negative number (e.g., -1.6) => shows that the model overestimates the actual value
Estimating the Model Fit Overall
We cannot simply add deviances => some errors are positive and others negative, which would
result in a total of zero:
total error = sum of errors = Σ_{i=1}^{n} (outcome_i − model_i) = 0
The solution to this problem => square the errors:
sum of squared errors (SS) = Σ_{i=1}^{n} (outcome_i − model_i)²
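Both facts can be verified numerically. The scores below are hypothetical, chosen so the mean is 2.6 to match the earlier worked example:

```python
# Hypothetical scores with mean 2.6, matching the earlier worked example.
outcomes = [1, 3, 4, 3, 2]
model = sum(outcomes) / len(outcomes)   # 2.6

# The raw deviances cancel out to (essentially) zero...
total_error = sum(y - model for y in outcomes)

# ...so we square them before summing instead.
ss = sum((y - model) ** 2 for y in outcomes)

print(abs(total_error) < 1e-9, round(ss, 1))  # True 5.2
```

The squaring step is what turns a useless total (always zero for the mean) into a usable measure of total error.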
Specific Models
When thinking about a specific model – i.e., such as when the model is the mean:
Σ_{i=1}^{n} (outcome_i − model_i)² = Σ_{i=1}^{n} (x_i − x̄)²
General Models
Think of the total error in terms of this general equation:
total error = Σ_{i=1}^{n} (observed_i − model_i)²
This equation shows how the SS can be used to assess the total error in any model – not just
the mean
The SS is a good measure of the accuracy of the model – and it depends on the quantity of
data collected (the more data points => the higher the SS)
- This problem is overcome by using the average error rather than the total
Computing the average error => divide the SS (i.e., total error) by the number of values (i.e.,
N) that we used to compute the total
Estimating the Mean Error in Population
Estimated by:
1. Divide by the degrees of freedom (df) – i.e., the number of scores used to compute
the total, adjusted for the fact that we are trying to estimate the population value
mean squared error = SS / df = Σ_{i=1}^{n} (outcome_i − model_i)² / (N − 1)
This is a more general form of the equation for variance => the above equation can be easily
transformed into the one for variance:
variance = SS / df = Σ_{i=1}^{n} (x_i − x̄)² / (N − 1)
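Continuing the running example, the SS-over-df computation can be checked against the standard library's sample variance:

```python
import statistics

# Same hypothetical scores as before (mean 2.6, SS = 5.2).
scores = [1, 3, 4, 3, 2]
n = len(scores)
mean = sum(scores) / n

ss = sum((x - mean) ** 2 for x in scores)   # sum of squared errors
mean_squared_error = ss / (n - 1)           # divide by df = N - 1

# Agrees with the library's sample variance.
assert abs(mean_squared_error - statistics.variance(scores)) < 1e-9
print(round(mean_squared_error, 1))  # 1.3
```

Dividing by N − 1 rather than N is exactly the df adjustment described above.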
Summary
The sum of squared errors (SS) and the mean squared error (i.e., the variance)
=> can be used to assess the fit of a model
- Large values relative to the model => indicate a lack of fit
SS => used to assess the total error in any model; measure of accuracy of a model
- Depends on quantity of data collected
- The more data => the higher the SS
The mean squared error (MS) – or the variance – is the average error in the model in the
population
E is for Estimating Parameters
Equations for estimating parameters are based on the principle of minimizing error –
providing the parameter that has the least error given the data
The principle of minimizing the sum of squared errors (SS) – known as the method of least
squares or ordinary least squares (OLS)
The equation for the mean is designed to estimate the parameter to minimize the error – i.e.,
the value that has the least error
The equations obtain the lowest value of the SS of errors – the resulting parameter estimate
is the value that minimizes the SS
outcome_i = (b̂_0) + error_i
sum of squared errors (SS) = Σ_{i=1}^{n} (outcome_i − model_i)²
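The least-squares property of the mean can be checked by brute force. A sketch: compute SS for a grid of candidate b_0 values around the mean and confirm none of them does better (scores are the same hypothetical set as before):

```python
# Sketch: among candidate values for b_0, the sample mean gives the
# smallest sum of squared errors (the OLS / least squares principle).
scores = [1, 3, 4, 3, 2]
mean = sum(scores) / len(scores)   # 2.6

def ss(b0):
    # Sum of squared errors if we used b0 as the model's single parameter.
    return sum((x - b0) ** 2 for x in scores)

# Try the mean plus a grid of offsets; nothing beats the mean itself.
candidates = [mean + d / 10 for d in range(-20, 21)]
best = min(candidates, key=ss)
print(round(best, 1))  # 2.6
```

This is a numerical illustration only; the usual justification is the calculus result that SS(b_0) is minimized at the sample mean.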
S is for Standard Error
The SD allows us to see how well the mean represents the sample data
The standard error – allows us to look at how representative the samples are of the
population of interest
When using a sample we calculate the average rating – i.e., the sample mean – however,
different samples will have different sample means
- This difference illustrates sampling variation – i.e., samples vary because they
contain different members of the population
Plotting sample means as a frequency distribution – or histogram (i.e., a graph of possible
values of the sample mean plotted against the number of samples that have a mean of that
value) – we would see the frequency of a given mean in the samples
- The end result is a distribution => known as a sampling distribution
Sampling Distribution
A sampling distribution – i.e., the frequency distribution of sample means from the same
population
The sampling distribution of the mean – tells us about the behavior of samples from the
population
- It is centered at the same value as the mean of the population
If our observed data are sample means => the standard deviation of these sample means
would tell us how widely spread (i.e., how representative) sample means are around their
average
The average of the sample means = the population mean
- The standard deviation of the sample means => tells us how widely sample means
are spread around the population mean
Tells us whether sample means are typically representative of the population
mean
Standard Error of the Mean (SE)
The SD of sample means – i.e., the standard error (SE)
The central limit theorem – i.e., states that as samples get large (> 30) => the sampling
distribution has:
(1) A normal distribution with its mean equal to the population mean
(2) A standard deviation of: σ_X̄ = s / √N
Therefore – if the sample is large (> 30) this equation can be used to approximate the
standard error (SE)
- Because it is the SD of the sampling distribution
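The approximation can be checked by simulation. A sketch with a hypothetical standard-normal population: build an empirical sampling distribution of the mean and compare its SD to s / √N:

```python
import random
import statistics

random.seed(0)

# Hypothetical population (standard normal), and a sample size N > 30.
population = [random.gauss(0, 1) for _ in range(100_000)]
N = 50

# Approximate the sampling distribution of the mean by brute force:
# draw many samples and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, N))
                for _ in range(2_000)]
empirical_se = statistics.stdev(sample_means)

# Central limit theorem approximation: SE ≈ s / sqrt(N) ≈ 0.14 here.
approx_se = statistics.stdev(population) / N ** 0.5

assert abs(empirical_se - approx_se) < 0.05
```

The SD of the simulated sample means and the s / √N formula land on essentially the same value, which is why the formula can stand in for the standard error when the sample is large.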
When the sample is relatively small (< 30) – the sampling distribution is not normal
- It has a different shape – i.e., t-distribution
Summary
- The SE of the mean – i.e., the SD of sample means – is a measure of how
representative of the population a sample mean is likely to be
- A large SE => a lot of variability between the means of different samples => sample
mean may not be representative of population mean
- A small SE => indicates most sample means are similar to the population mean – i.e.,
sample mean is likely to accurately reflect population mean
I is for (Confidence) Interval