Statistics videolectures
Videolecture 1
Stephen Toulmin’s model of argumentation
Claim: choice for a technique
Ground: statistical output, type of research questions, measurement levels (the data your
decision is based on): the information on which you found your conclusion
Warrant: general rules, statistical principles
Purpose of data analysis: get information to answer research questions
Numerical methods for describing sets of data:
Frequency table (can be used for every variable, regardless of measurement level)
Measures of central tendency and variability (choice depends on measurement level)
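As a small illustration of these numerical methods, the sketch below builds a frequency table and
computes three measures of central tendency with Python's standard library. The scores are made up
for illustration and do not come from the lecture.

```python
from collections import Counter
import statistics

# Hypothetical ratio-level scores (made up for illustration)
scores = [2, 3, 3, 4, 4, 4, 5, 7]

# Frequency table: can be made for any variable, regardless of measurement level
frequency_table = Counter(scores)
for value in sorted(frequency_table):
    print(value, frequency_table[value])

# Measures of central tendency: the choice depends on the measurement level
mode = statistics.mode(scores)      # usable from nominal level upward
median = statistics.median(scores)  # needs at least ordinal level
mean = statistics.mean(scores)      # needs interval or ratio level
print(mode, median, mean)
```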
First, you collect the data you need; then you analyze the data, organize them and calculate the
order and relations between the variables. This way you can get the answers to research questions.
Most of the time we deal with descriptive static questions. Typically, there is no time indication
involved. Also, each question deals with one characteristic.
There is no median when you are collecting nominal data, because the median has to do with the
order of the values: nominal data doesn’t have any order.
Variability or dispersion is about how the scores are spread around the measure of central
tendency. It is not applicable to nominal variables, because there can’t be a dispersion around
the mode.
When we look at the interquartile range (IQR), the median is the middle of the distribution. The
IQR covers the 25% of the observations just below the median and the 25% just above it. Adding this
up gives us the middle 50% of the observations. IQR = the upper quartile minus the lower quartile
(the 75th percentile minus the 25th percentile).
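A minimal sketch of the IQR calculation, using made-up scores. It relies on `statistics.quantiles`
with its default method; other software (SPSS included) may use a slightly different quartile
convention, so small differences in the outcome are possible.

```python
import statistics

# Hypothetical scores (made up for illustration)
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into quartiles: lower quartile, median, upper quartile
lower_quartile, median, upper_quartile = statistics.quantiles(scores, n=4)

# IQR = upper quartile minus lower quartile: the range of the middle 50%
iqr = upper_quartile - lower_quartile
print(lower_quartile, median, upper_quartile, iqr)
```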
When you work with the mean, you can use variance or standard deviation.
There are two ways to interpret the standard deviation. The choice depends on the shape of your
distribution. When you have a normal distribution, you can use the empirical rule. When your
distribution is not symmetric and bell shaped, you use Chebyshev’s rule, which holds for any shape.
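The two rules can be compared numerically. The sketch below is an illustration, not part of the
lecture: it computes the share of observations within k standard deviations of the mean for a
normal distribution (the empirical rule) and the minimum share that Chebyshev’s rule guarantees for
any distribution, which is 1 - 1/k². The function names are chosen here for illustration.

```python
from statistics import NormalDist

def normal_share(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def chebyshev_minimum_share(k):
    """Minimum share within k standard deviations, for any distribution (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3):
    print(k, round(normal_share(k), 4), round(chebyshev_minimum_share(k), 4))
```

The empirical rule gives the familiar values of roughly 68%, 95% and 99.7% for k = 1, 2 and 3;
Chebyshev’s rule is more cautious because it makes no assumption about the shape.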
The last part of the videolecture is about how to determine the skewness and shape of a distribution.
For this we use the median and the mean, so it is only applicable for interval and ratio level. When
you have a perfectly bell shaped distribution, the mean and the median are the same (both in the middle).
If your mean is smaller than your median, you have a negatively skewed distribution with a tail on
the left side. When your mean is higher than your median, you have a positively skewed distribution
with a tail on the right side.
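The comparison of mean and median can be sketched as follows. The function name and the example
data are made up for illustration; equal mean and median only indicate that this comparison shows
no skew.

```python
import statistics

def skew_direction(values):
    """Compare mean and median to indicate the direction of skewness."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        return "negative"  # tail on the left side
    if mean > median:
        return "positive"  # tail on the right side
    return "none"          # mean equals median

print(skew_direction([1, 2, 3, 4, 20]))     # one high outlier pulls the mean up
print(skew_direction([1, 17, 18, 19, 20]))  # one low outlier pulls the mean down
print(skew_direction([1, 2, 3, 4, 5]))      # mean and median are both 3
```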
Videolecture 2
The process of estimation contains four steps:
1. Determine the population: it is presented in the research question or hypothesis
2. Draw the sample: we draw samples because populations are too large
3. Determine the sample value (X)
4. Estimates and tests by analyses
When we work with samples, we have to establish how confident we are about the estimates.
Determining the confidence interval is one way to do this. This is a range of scores we are confident
about (based on the empirical rule and the normal distribution). We are allowed to do that because
of the central limit theorem. We can draw a lot of different samples of 100 out of a population of
10,000. The elements in the samples can vary: for example, one sample can have older or younger
people than another. For each sample we can calculate a characteristic, for example the mean.
We can gather all means and put them in a database: make a new variable with the means of all
samples of that certain sample size. This is the sampling distribution. If our sampling distribution
is normally distributed, we know which values enclose 90, 95 and 99 percent of the sample means. We
can calculate confidence intervals and we can test whether or not our calculations are right.
+1.65 means ‘1.65 standard deviations above the mean’.
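The idea of a sampling distribution can be tried out with a small simulation. This is only a
sketch: the population of ages is randomly generated, and the number of samples (1,000) is an
arbitrary choice made here.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: ages of 10,000 people (randomly generated)
population = [random.randint(18, 90) for _ in range(10_000)]

# Draw 1,000 samples of 100 people and store the mean of each sample:
# this new variable approximates the sampling distribution of the mean
sample_means = [
    statistics.mean(random.sample(population, 100))
    for _ in range(1_000)
]

population_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)  # approximates the standard error
print(population_mean, mean_of_means, sd_of_means)
```

The mean of the sample means should lie close to the population mean, and the sample means should
vary much less than the individual ages do.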
To work with the features of a sampling distribution, we need to know its standard deviation, here
called the ‘standard error of the mean’: the standard deviation of all possible sample means. This
is almost always unknown, so we have to estimate it. We do this by calculating the standard
deviation of our own sample (S) and dividing it by the square root of N.
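A minimal sketch of this calculation, with a made-up sample of eight scores:

```python
import math
import statistics

# Hypothetical sample (made up for illustration)
sample = [4, 8, 6, 5, 3, 7, 9, 6]

n = len(sample)
s = statistics.stdev(sample)       # sample standard deviation S (divides by N - 1)
standard_error = s / math.sqrt(n)  # estimated standard error of the mean: S / sqrt(N)
print(s, standard_error)
```

Here S is 2 and N is 8, so the estimated standard error is 2 divided by the square root of 8,
which is about 0.71.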
The confidence level is the probability that the randomly selected interval encloses the
unknown parameter. You don’t know the real parameter (for example, the mean). We
need to know two things: the confidence level and alpha. Alpha (α) is the probability that
the randomly selected interval does not enclose the unknown parameter. Alpha is the uncertainty:
the chance that the parameter is not in the confidence interval. The confidence level is 1 – α.
Alpha usually has the value 0.01, 0.05 or 0.10, so the confidence level has the following values:
99% (α = 0.01), 95% (α = 0.05), 90% (α = 0.10).
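The link between alpha and the boundaries of a two-sided interval under the normal distribution can
be sketched as follows. The function name `critical_z` is chosen here for illustration; the
calculation places alpha/2 in each tail of the standard normal distribution.

```python
from statistics import NormalDist

def critical_z(alpha):
    """Two-sided critical value: alpha is split over both tails."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    confidence_level = 1 - alpha
    print(f"alpha = {alpha:.2f}  level = {confidence_level:.0%}  z = {critical_z(alpha):.3f}")
```

The results (about 1.645, 1.960 and 2.576) match the values from printed tables, which are often
rounded to 1.65, 1.96 and 2.58.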
To calculate confidence intervals, we use the normal distribution. When the sampling distribution is
a normal distribution (samples larger than 30), we can work with the z-value. When you work with a
small sample, you need to use the t-value. This is related to the degrees of freedom: N minus one.
Two-tailed means that the interval is equally spread on both sides of the distribution. When you have
determined the critical z-value or t-value, you multiply it by the SE (standard error of the mean).
The product is added to and subtracted from the mean, and the outcome is the confidence interval.
As the degrees of freedom approach infinity, the t-values become similar to the z-values.
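The steps for a confidence interval around the mean can be sketched as below for a large sample,
where the z-value applies. The function name and the input numbers are made up for illustration.
For a small sample the critical value would have to come from the t-distribution with N - 1 degrees
of freedom, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, s, n, confidence=0.95):
    """Two-sided confidence interval around the mean, using the z-value."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical z-value
    standard_error = s / math.sqrt(n)        # SE = S / sqrt(N)
    margin = z * standard_error              # critical value times SE
    return sample_mean - margin, sample_mean + margin

# Hypothetical large sample: mean 50, standard deviation 10, N = 100
lower, upper = mean_confidence_interval(50, 10, 100)
print(round(lower, 2), round(upper, 2))
```

With these numbers the SE is 1, so the 95% interval runs from about 48.04 to 51.96.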
You can also calculate confidence intervals around a proportion (instead of the mean). For example,
you have males (coded 1) and females (coded 0: not male). You can ask: what is the mean number of
times that 1 (male) is scored? The result is the proportion. When you have a sample of 100
respondents with 55 males, the number of times 1 is scored is 55 divided by 100 = 0.55. This is the
estimated proportion. Then we calculate the standard error of the proportion: the square root of
p(1 – p)/N. In this example: 0.55 × (1 – 0.55)/100 = 0.002475, and the square root of this number is
about 0.05. To check whether we may use the normal distribution, we multiply the standard error by 3
and subtract it from and add it to the proportion, so we get two values (here about 0.40 and 0.70).
If the interval between them does not contain zero or one, we can use the normal distribution.
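The example of 55 males in a sample of 100 can be worked out in a short sketch. The function name
is chosen here for illustration; the check with three standard errors follows the rule described
above, and the 95% level is an assumption for the final interval.

```python
import math
from statistics import NormalDist

def proportion_interval(count, n, confidence=0.95):
    """Estimated proportion, its SE, the normality check and the interval."""
    p = count / n
    standard_error = math.sqrt(p * (1 - p) / n)
    # Normal distribution allowed if p - 3*SE and p + 3*SE exclude 0 and 1
    normal_allowed = p - 3 * standard_error > 0 and p + 3 * standard_error < 1
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    interval = (p - z * standard_error, p + z * standard_error)
    return p, standard_error, normal_allowed, interval

p, standard_error, normal_allowed, (lower, upper) = proportion_interval(55, 100)
print(p, round(standard_error, 4), normal_allowed, round(lower, 3), round(upper, 3))
```

For this sample the standard error is about 0.0497, the check interval does not contain 0 or 1,
and the 95% confidence interval runs from roughly 0.45 to 0.65.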
In SPSS: Analyze – Descriptive Statistics – Explore. Find the variable you want to analyze, put it in
the Dependent List and click ‘OK’. You have to make sure that the confidence interval is set to 95
percent (Statistics – Descriptives).
When you calculate something, you also have to explain why this is correct: the warrant. The
conclusion ‘I am 95 percent sure that the proportion of X lies between X and X’ is called the claim.
The data you use is the ground. Below are some examples from the lecture.
If a variable has 4 categories, you have to reduce it to 2 categories. The category you want to say
something about becomes category 1 and the other 3 together become category 2 (with the option
‘Recode into Different Variables’).
Example:
Claim (conclusion): We are 99% confident that the proportion of Dutch people that live in the West
part of the country lies between 0.413 and 0.465 (that is between 41.3% and 46.5%).
Ground (data): on the next page.
Warrant (explanation): This conclusion is right, because a sample of 30 cases or more is considered
a large sample. We have a sample of 2384 Dutch people, therefore we can assume that the sampling
distribution is approximately normal. We may use the normal approach when the interval of the
proportion plus and minus 3 standard errors does not include 0 or 1. Here it is [.4087; .4697], so
the normal approach is allowed. The confidence level is 99%, so α = 1% = .01. This has to be divided
by two, because a confidence interval is two-sided. The
appropriate z-value is then Zα/2 = Z.005 = ±2.580. The SPSS output about the confidence interval shows