Quantitative Research Methodology
Radboud University – Discovering Statistics Using IBM SPSS Statistics – Andy Field
April 2020
Chapter 2: The SPINE of statistics
We focus on the similarities between statistical models rather than on their differences, and we use
statistics as a tool. Most statistical models are variations on the very simple idea of predicting an
outcome variable from one or more predictor variables. The mathematical form can change, but it is
all about the relations between variables. These ideas are also called the SPINE of statistics, which stands for:
• Standard Error
• Parameters
• Interval Estimates (Confidence interval)
• Null hypothesis significance testing
• Estimation
We build statistical models so that we can test hypotheses in simulations of reality. We collect data
from the real world to test predictions from our hypotheses about a phenomenon. Testing a
hypothesis therefore involves building a statistical model of the phenomenon of interest.
It is important that the model accurately represents the real world, otherwise any conclusions you
draw from it will be meaningless. There are three kinds of fit for a model:
1. Good fit – the model is so close to reality that you can be confident that your predictions will
be accurate.
2. Moderate fit – the model comes close to reality but also shows some big differences, so the
predictions could be accurate but could also be completely wrong; we have little confidence
in this model.
3. Poor fit – any predictions made from this model will be completely inaccurate. If our model
is a poor fit to the observed data, then the predictions we make from it will be equally poor.
Everything in the book boils down to this equation:
Outcome = Model + Error
The data we observe can be predicted from the model we choose to fit, plus some amount of error.
The model in the equation will vary depending on the design of your study, the type of data you have
and what it is you're trying to achieve with your model. Put simply: we predict an outcome variable
from some model, but we won't do it perfectly, so there will be some error in there too.
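To make the core equation concrete, here is a minimal sketch in Python; the scores and predictions are hypothetical, not from the book:

    # Outcome = Model + Error: the same observed score under two different models.
    observed = 3.0             # hypothetical observed score for one entity

    model_a = 2.6              # what model A predicts (e.g. a group mean)
    model_b = 3.4              # what model B predicts (e.g. a regression line)

    print(observed - model_a)  # error under model A: ~0.4
    print(observed - model_b)  # error under model B: ~-0.4
    # The model part varies with your design; whatever it fails
    # to capture ends up in the error.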
Populations and samples
Scientists are usually interested in finding results that apply to an entire population of entities.
Results have a much wider impact if we can draw conclusions about everyone. The problem is that
we rarely have access to the entire population, so we collect data from a smaller subset of the
population known as a sample. The bigger the sample, the more likely it is to reflect the whole
population. If we take several random samples from the population, each of these samples will give
us slightly different results, but on average the results from large samples will be similar.
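A minimal simulation sketch of this idea in Python (using numpy); the population values are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(42)
    # Hypothetical population: 100,000 scores with a true mean of 100.
    population = rng.normal(loc=100, scale=15, size=100_000)

    for n in (5, 50, 5000):
        # Draw three random samples of size n and record each sample mean.
        means = [rng.choice(population, size=n, replace=False).mean()
                 for _ in range(3)]
        print(n, [round(m, 1) for m in means])
    # Small samples scatter widely around 100; large samples cluster close to it.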
P is for Parameters
Statistical models are made up of variables and parameters. Parameters are not measured; they are
(usually) constants believed to represent some fundamental truth about the relations between the
variables in the model. Some examples are the mean, the median, and the correlation and regression
coefficients (which estimate the relationship between two variables). To avoid confusing you with all
the symbols, we will just use the letter b. If we are interested only in summarizing the outcome, then
we have just one parameter in our model, and we could write it as:

outcome_i = b0 + error_i
So the outcome for an entity is equal to a parameter plus some error. Often, however, we want to
predict an outcome from a variable, and if we do this we expand the model to include that variable.
Variables are usually denoted with the letter X, so you get this formula:

outcome_i = b0 + b1X1i + error_i
If we want to predict an outcome from two predictors, then we can add another predictor to the
model too:

outcome_i = b0 + b1X1i + b2X2i + error_i
This looks like a lot of abracadabra, but essentially it says that to get the outcome of the formula you
take a parameter (b0), add the entity's scores on the two predictor variables weighted by their
parameters (b1X1 and b2X2), and add some error too, because the model won't predict the outcome perfectly.
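A minimal sketch of such a two-predictor model in Python; the parameter values and scores below are hypothetical:

    # outcome_i = b0 + b1*X1_i + b2*X2_i + error_i
    b0, b1, b2 = 2.0, 0.5, -0.3     # hypothetical parameter estimates
    x1, x2 = 4.0, 2.0               # one entity's scores on the two predictors

    predicted = b0 + b1 * x1 + b2 * x2   # the model's prediction: 3.4
    observed = 4.0                       # the score we actually measured
    error = observed - predicted         # ~0.6: what the model doesn't explain
    print(predicted, error)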
So we can predict values of an outcome variable based on a model. The form of the model changes,
but there will always be some error in prediction and there will always be parameters to tell us about
the shape or form of the model.
You will see the phrases 'estimate the parameters' and 'parameter estimates' a lot in statistics. This
is because, as scientists, we are interested in the whole population, but we have not measured the
whole population, so we can only estimate what the population parameters are really like.
The mean as a statistical model
Example: if we took five lecturers and measured the number of friends they have, we might find the
following data: 1, 2, 3, 3 and 4. To find the average number of friends (the mean), we add all the
values and divide by the number of values measured: (1 + 2 + 3 + 3 + 4)/5 = 2.6. But you can't have
2.6 friends, so the mean is a hypothetical value: it is a model created to summarize the data, and
there will be error in the prediction. The model is:

outcome_i = b0 + error_i

in which b0 is the mean of the outcome. We can use the value of the sample mean to estimate the
value in the population. We give estimates little hats, like this:

outcome_i = b̂0 + error_i
All the hat does is make explicit that the value underneath it is an estimate.
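A quick check of this worked example in Python:

    friends = [1, 2, 3, 3, 4]             # friends of the five lecturers
    b0_hat = sum(friends) / len(friends)  # estimate of the mean: 2.6
    print(b0_hat)
    # 2.6 friends is hypothetical: the mean is a model that summarizes
    # the data, not a value any lecturer actually has.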
Assessing the fit of a model: sums of squares and variance revisited
We need to know how representative of reality the model we have built is. Let's look at what
happens when we use the model of the mean to predict how many friends the first lecturer in our
example has. We observed that lecturer 1 had one friend while the model predicted 2.6 (that is,
1 = 2.6 + error_1); rearranging the equation, we see that there is an error of -1.6.
So we want the outcome to be 1, with the mean, 2.6, in the place of b0. If we rearrange the equation
so that we can see this value, we have calculated the deviance: how far a score is from the mean.
1 is -1.6 from 2.6. Deviance is another word for error, and you can also write the equation like this:

deviance_i = outcome_i - model_i
In other words, the error or deviance for a particular
entity is the score predicted by the model for that
entity subtracted from the corresponding observed
score. You can see this in the following figure. The line
representing the mean can be thought of as our
model and the dots are observed data. The vertical
lines represent the error or deviance of the model for
each lecturer.
As you can see, the model overestimated the popularity of the first
lecturer. We want to know the fit (accuracy)
of the model overall. We saw in the previous chapter that we can't simply add the deviances, because
they would sum to 0. But we can square them and then add them, which gives the sum of squared
errors (SS). This looks difficult, but it is just squaring all the deviances and then adding them all up:

SS = Σ (outcome_i - model_i)^2
This is exactly the same equation as the sum of squares in chapter 1.6, except that some symbols
have been replaced to match our model: the model's prediction takes the place of the mean.
However, when we think about models more generally, this illustrates that we can express the total
error in terms of this general equation, which shows that we can use the sum of squares to assess
the total error in any model.
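Continuing the lecturer example, a short sketch in Python of the deviances and the sum of squared errors:

    friends = [1, 2, 3, 3, 4]
    mean = sum(friends) / len(friends)       # the model: 2.6

    deviances = [x - mean for x in friends]  # [-1.6, -0.6, 0.4, 0.4, 1.4]
    print(sum(deviances))                    # ~0: raw deviances cancel out

    ss = sum(d ** 2 for d in deviances)      # sum of squared errors: 5.2
    print(ss)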
We saw that although the sum of squared errors (SS) is a good measure of the accuracy of our
model, it depends on the quantity of data that has been collected: the more data points, the higher
the SS. We can overcome this problem by using the average error. To compute the average error we
divide the sum of squares by the number of values (N). But because we are estimating a population
value from a sample, we divide by the degrees of freedom (N - 1) instead; this gives the variance:

variance = SS / (N - 1)

We use this whenever we want to estimate the value in the population.
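A sketch of this final step in Python, checked against Python's statistics module:

    import statistics

    friends = [1, 2, 3, 3, 4]
    mean = sum(friends) / len(friends)
    ss = sum((x - mean) ** 2 for x in friends)  # 5.2

    variance = ss / (len(friends) - 1)          # SS / (N - 1) = 1.3
    print(variance)
    print(statistics.variance(friends))         # same: 1.3 (divides by N - 1)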