Statistics lectures
Lecture 1
We use statistics to:
- Describe/summarize data: descriptive statistics
- Draw inferences about populations: inferential statistics
- Study complex multivariate relationships: statistical modeling
Measurement levels
1. Nominal data: numbers express group membership (in 2 or more categories)
Ex. Marital status: 1 = single, 2 = married, 3 = in a serious relationship, 4 = not specified.
Categories must be exhaustive (all possibilities should be covered) and mutually
exclusive (i.e. every case fits into one category and one category only)
2. Ordinal data: numbers express an ordering (less/more)
Ex. Smoking intensity: 1 = never, 2 = occasionally, 3 = regularly, 4 = heavy.
Numbers express more or less of a quantity, but the difference between 1 and 2 is
not necessarily the same as the difference between 2 and 3, or between 3 and 4.
3. Interval and Ratio (Scale level): numbers express differences in quantity using a
common unit. Ex. The difference between 70 and 80 in IQ points is comparable to the
difference between 100-110. Both span a difference of 10 units. Likewise, if on
Monday the temperature is 30 degrees, on Tuesday 25 degrees and Wednesday 15
degrees, then we can say that the temperature drop between Tuesday and
Wednesday is twice as large as the drop between Monday and Tuesday.
- Interval: no natural zero point. The zero is arbitrarily chosen and can differ across
scales (0 degrees Fahrenheit is not 0 degrees Celsius), so a value of zero does not mean
that the quantity is absent.
- Ratio: has a natural zero point; as a result you can compare the relative magnitude of
things. You can say, for example, that one person is twice as tall as another (length and
income are examples).
All data that are not nominal or ordinal are Scale level.
Every analysis starts with data inspection: getting to know your data. In general, we want
to know more about:
- Central tendency: What are the most common values?
- Variability: How large are the differences between the subjects? Are there extreme
values in the sample?
- Bivariate association: for each pair of variables, do they associate/covary?
Graphs and statistics
- Bar charts (nominal and ordinal data)
- Histogram (scale data)
- Scatterplots
- Numerical summaries: Frequency tables
Central Tendency – Mode, Median and Mean
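- Mode: the score that is observed most frequently
- Median: the middle score when the scores are ordered. Odd number of scores:
4,5,6,7,8,9,9 gives 7. Even number of scores: take the average of the two middle
scores, so 4,5,6,8,9,9 gives (6+8)/2 = 7
- Mean: $M = \frac{\sum X}{N}$ (sum of all scores divided by the total number of scores) = the average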
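As a quick check of these three definitions, the measures can be computed with Python's standard `statistics` module. This is only a sketch: the scores are the ones from the median example, and the variable names are illustrative.

```python
import statistics

scores_odd = [4, 5, 6, 7, 8, 9, 9]   # 7 scores: the median is the middle score
scores_even = [4, 5, 6, 8, 9, 9]     # 6 scores: the median averages the two middle scores

mode_odd = statistics.mode(scores_odd)        # most frequently observed score
median_odd = statistics.median(scores_odd)    # middle score of the ordered list
median_even = statistics.median(scores_even)  # (6 + 8) / 2
mean_odd = statistics.mean(scores_odd)        # sum of scores / number of scores

print(mode_odd, median_odd, median_even, mean_odd)
```

Note that the mean (48/7, roughly 6.86) differs from the median (7) and the mode (9) even in this small data set.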
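Deviation scores
= the difference between a score X and the mean score: $X - M$.
What is variability? Differences in scores. The sum of the deviation scores from the mean
is always 0, $\sum (X - M) = 0$, so the deviations are squared before they are summed:
$SS = \sum (X - M)^2$ (Sum of squares)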
Measures of Variability
Variance: $S^2 = \frac{\sum (X - M)^2}{N - 1} = \frac{SS}{df}$
Standard deviation (SD = S): $S = \sqrt{S^2}$
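The variance and standard deviation can be worked through step by step in Python. This is a minimal sketch: the scores are made up (chosen so that the mean is a whole number), and the result is compared against the standard `statistics` module, which also divides by N - 1.

```python
import math
import statistics

scores = [2, 4, 4, 6, 9]       # hypothetical scores
n = len(scores)
mean = sum(scores) / n         # M = 5.0

deviations = [x - mean for x in scores]   # deviation scores X - M
ss = sum(d ** 2 for d in deviations)      # sum of squares (SS)
variance = ss / (n - 1)                   # S^2 = SS / df, with df = N - 1
sd = math.sqrt(variance)                  # S = square root of S^2

print(sum(deviations))     # the deviation scores sum to 0
print(ss, variance, sd)
print(statistics.variance(scores), statistics.stdev(scores))  # same results
```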
Minimum, maximum, range and interquartile range
Minimum: lowest observed value
Maximum: highest observed value
Range: maximum – minimum
IQR: the range of scores that encompasses the middle 50% of the observations, thus excluding
the 25% lowest and 25% highest observations
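A small Python sketch of these four measures follows. The scores are made up, and the quartiles come from `statistics.quantiles` with its default method. Software packages use slightly different conventions for quartiles, so the IQR for the same data can differ a little between tools.

```python
import statistics

scores = [4, 5, 6, 7, 8, 9, 9]   # hypothetical scores

minimum = min(scores)
maximum = max(scores)
data_range = maximum - minimum

# Cut the ordered scores into 4 equal parts: returns the three quartile boundaries
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1   # spread of the middle 50% of the observations

print(minimum, maximum, data_range)
print(q1, q3, iqr)
```

Unlike the range, the IQR ignores the lowest and highest quarter of the scores, so it is less affected by extreme values.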
Z-score = a measure of relative distance
It is almost the same as a deviation score. However, a Z-score also tells us how extreme or
normal a certain score is. A deviation score of 3.75 can be very large or just normal,
depending on how much the scores vary.
Sample: $Z = \frac{X - M}{S}$    Population: $Z = \frac{X - \mu}{\sigma}$
Deviation = $X - M$, so Z is the deviation score divided by S.
The Z-score is the distance between a score and the mean score, relative to the variability
of the scores.
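To illustrate why the same deviation score can be extreme or ordinary, here is a short Python sketch. The function name `z_score` and all numbers are hypothetical. Both cases have a deviation score of 3.75 but a different standard deviation.

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, expressed in standard deviations."""
    return (x - mean) / sd

# Same deviation score (13.75 - 10 = 3.75) in two hypothetical groups
z_small_spread = z_score(13.75, mean=10, sd=1.5)   # scores vary little
z_large_spread = z_score(13.75, mean=10, sd=15)    # scores vary a lot

print(z_small_spread)   # 2.5  -> far from the mean, relatively extreme
print(z_large_spread)   # 0.25 -> close to the mean, quite normal
```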
The normal distribution
- Mathematical distribution (Gauss curve)
- Total area under the curve is equal to 1
- Symmetrical distribution
Example: the probability that a person is taller than 212 cm, given $\mu = 180$ cm and
standard deviation $\sigma = 20$ cm.
This involves 2 steps:
1. Compute the Z-score: $Z = \frac{X - \mu}{\sigma} = \frac{212 - 180}{20} = 1.6$
Hence, we need to know the area to the right of 1.6 under the standard normal
distribution, that is $P(Z > 1.6)$.
2. Use the table in the book or a tool to compute the proportion.
$P(Z > 1.6) = 0.0548$, hence 5.48% of the population is taller than 212 cm.
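As an alternative to a printed table, the two steps can be reproduced with `NormalDist` from Python's standard `statistics` module. This is a sketch using the numbers from the height example; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 180, 20    # population mean and SD of height in cm (from the example)
x = 212

# Step 1: convert the height to a z-score
z = (x - mu) / sigma

# Step 2: area to the right of z under the standard normal curve
p_taller = 1 - NormalDist().cdf(z)

print(z, round(p_taller, 4))
```

The same proportion follows directly from `1 - NormalDist(mu, sigma).cdf(x)`, without standardizing first.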
Empirical rules for the normal distribution
- About 68% of the cases can be found within one SD (1S) from the mean
- About 95% of the cases can be found within two SDs (2S) from the mean
Example: if you know that weekly salary is normally distributed with a mean of 300 and an SD of 15,
you know that about 68% have an income between 285 and 315, and about 95% between 270 and 330.
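The two percentages in the salary example can be checked against the exact normal distribution. Here is a short sketch with `NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

salary = NormalDist(mu=300, sigma=15)   # weekly salary from the example

# Proportion of cases within one and within two SDs of the mean
within_1sd = salary.cdf(315) - salary.cdf(285)
within_2sd = salary.cdf(330) - salary.cdf(270)

print(round(within_1sd, 4), round(within_2sd, 4))
```

The exact proportions are roughly 0.683 and 0.954, so 68% and 95% are convenient roundings.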
Lecture 2: Introduction to inferential statistics
When we want to know something about a population, we use a sample to observe.
Sampling fluctuations = The variability across samples (they are always a little different)
Also, the sample values will always be a little different from the population value.
Differences between sample values and population values are known as sampling errors.
We use inferential statistics to take sampling errors into account when drawing
conclusions about populations from sample results.
Sampling distribution of the mean: We use sample means to draw conclusions about
population means.
- However, each random sample has a different composition; by chance you may have
heavy social media users in the sample, but if you draw a new sample, you may by
chance have persons who use social media less intensively.
- Thus, each time we draw a new sample, the sample mean will vary.
- Using statistics we can predict the sampling fluctuations; that is, we can describe the
variation in means across all possible samples!
Theoretical sampling distribution of the mean = describes how sample means would vary
across samples if we repeated the sampling many, many times (with the same N, from
the same population).
- Sample means fluctuate around the population mean, hence, the mean of all sample
means IS the population mean!
- The variation in sample means depends on sample size; the larger the sample, the
smaller the variation in sample means
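- Statisticians have shown that if X is normally distributed with mean $\mu$ and standard
deviation $\sigma$, the sample mean (M) is normally distributed with mean equal to the
population mean $\mu$, and standard deviation equal to $\sigma_M = \frac{\sigma}{\sqrt{N}}$
($\sigma$ known, used with z scores) OR, when $\sigma$ is estimated by the sample SD,
$SE_M = \frac{S}{\sqrt{N}}$ (used with t scores)
The latter is called the standard error.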
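The claim that sample means vary with a standard deviation of sigma divided by the square root of N can be checked with a small simulation. This is only a sketch under assumed values: the population mean and SD, the sample size, and the number of repetitions are all hypothetical.

```python
import math
import random
import statistics

random.seed(1)   # fixed seed so the sketch is reproducible

mu, sigma = 100, 15    # hypothetical population mean and SD
n = 25                 # sample size N
n_samples = 5000       # how often the sampling is repeated

# Theoretical standard deviation of the sample mean: sigma / sqrt(N)
se_theory = sigma / math.sqrt(n)

# Draw many samples of size n from a normal population; keep each sample mean
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(n_samples)
]
mean_of_means = statistics.mean(sample_means)   # should be close to mu
sd_of_means = statistics.stdev(sample_means)    # should be close to se_theory

# In practice there is one sample only: estimate the standard error as S / sqrt(N)
one_sample = [random.gauss(mu, sigma) for _ in range(n)]
se_estimated = statistics.stdev(one_sample) / math.sqrt(n)

print(se_theory, mean_of_means, sd_of_means, se_estimated)
```

Increasing `n` in this sketch shrinks both `se_theory` and the observed spread of the sample means.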
Standard Errors = the standard deviation of the sampling distribution
They show the variation in sample values (means) across samples.
The larger the standard error, the more sensitive the results are to sampling fluctuations.
- In general; the larger the sample size, the smaller the standard error, the less
sensitive sample results are to sampling fluctuations.
- The more heterogeneous the population (the larger $\sigma$), the more sensitive sample
results are to sampling fluctuations
- The standard error is related to the errors we make if we use the sample as an
estimate of the population value
Central Limit Theorem
With samples of 30 or more, it is generally safe to conclude that the sample means are
approximately normally distributed, even if X itself is not normally distributed in the
population.
In small samples, you can only use the normal distribution for the means if you are willing to
assume that X is normally distributed in the population.
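The central limit theorem can be illustrated with a simulation on a clearly non-normal population. The sketch below assumes an exponential population (strongly skewed, mean 1 and SD 1), which is not part of the lecture, and checks whether the means of samples of size 30 roughly follow the 68% and 95% rules.

```python
import math
import random

random.seed(2)   # fixed seed so the sketch is reproducible

n = 30             # sample size
n_samples = 5000   # number of repeated samples

def sample_mean(size):
    # expovariate(1.0) draws from an exponential distribution with mean 1 and SD 1
    return sum(random.expovariate(1.0) for _ in range(size)) / size

means = [sample_mean(n) for _ in range(n_samples)]

se = 1 / math.sqrt(n)   # standard error: population SD (1) divided by sqrt(N)

# Proportion of sample means within one and within two standard errors of the mean
within_1se = sum(abs(m - 1) <= se for m in means) / n_samples
within_2se = sum(abs(m - 1) <= 2 * se for m in means) / n_samples

print(round(within_1se, 2), round(within_2se, 2))
```

If the sample means are approximately normal, the two proportions should land near 0.68 and 0.95, even though the individual scores are far from normal.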
Lecture 1 = individual scores, so we talk about the standard deviation
Lecture 2 = sample means, so we talk about the standard error