INTRODUCTION TO STATISTICAL ANALYSIS
LECTURE WEEK 1
Statistics
Univariate – one variable
e.g. What was the average grade of the ISA last year? (grade)
Bivariate – two variables
e.g. Did males and females differ in their grades?
Multivariate – more than 2 variables
e.g. Was the grade dependent on initial motivation, the time spent on reading and gender?
Statistics – the study of how we describe and make inferences from data
- An inference is “a conclusion reached on the basis of evidence and reasoning”
you make a statement based on the data
- Descriptive statistics – taking direct measurements of your data; describing your data
- Inferential statistics – applying special techniques to a sample in order to make
statements/conclusions about the entire population
you take a sample (a smaller bit of data from your overall population) and make a larger
statement from it (e.g. today and yesterday were cold (data), so the inference is that
tomorrow and the day after are going to be cold too)
- Distinction between descriptive & inferential
you have to give a descriptive answer and an inferential answer on the exam
Population size = N
Sample size = n
Units of analysis – the what or who that is being studied (one thing)
- the unit that you will be able to draw conclusions about
- typically, all units are the same type of “thing” in a single data set
- e.g. individuals, families, countries, corporations
- every row can be a different person/country
the units of analysis are the rows of the data set
Variable – a measured property of each of the units of analysis
- the variables are directly what your data will be
- e.g. age/household income
the variables are the columns of the data set
“what is my unit of analysis and what kind of variables would be applicable for this kind of
analysis?”
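The rows/columns idea can be sketched in a few lines of Python. This is only an illustration: the layout as a list of dictionaries is an assumption made here, and the names and values are the ones used in the example tables of these notes.

```python
# Each row (one dict) is one unit of analysis: here, one person.
# Each key other than "name" is a variable measured on that unit.
rows = [
    {"name": "Claire", "age": 28, "religion": 1},
    {"name": "Sebastian", "age": 22, "religion": 2},
    {"name": "Geert", "age": 34, "religion": 3},
]

n = len(rows)  # sample size: the number of units (rows)
variables = [key for key in rows[0] if key != "name"]  # the columns
ages = [row["age"] for row in rows]  # one column: the values of one variable

print(n, variables, ages)
```

Reading down one key gives a variable; reading across one dict gives everything measured on a single unit.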
Recap: level of measurement (ISSR)
Nominal level of measurement:
- group classifications
- no meaningful ranking possible (i.e. 3 is not ‘more’ than 2)
- Numerical coding is arbitrary (but necessary in SPSS)
e.g. what religion do you belong to?
- Atheist (1)
- Protestant (2)
- Catholic (3)
- Muslim (4)
- Other (5)
Name        Religion
Claire      1
Sebastian   2
Geert       3
Ordinal level of measurement (order)
- Meaningful ranking/ordering (3 is ‘more’ than 2)
- BUT: distance between categories unknown/not equal (e.g. difference between 1 and
2 not equal to difference between 2 and 3)
“a horserace without a stopwatch”: you know the winner, but you don’t know the
exact time difference
E.g. during an average week, how often do you watch television?
- never (1)
- once a week (2)
- a few days a week (3)
- daily (4)
Name        TV
Claire      2
Sebastian   1
Geert       3
Interval level of measurement
- meaningful ranking (17 is more than 16)
- Distances are equal (e.g. difference between 15 and 17 is equal to difference
between 20 and 22)
e.g. temperature in degrees Celsius
Day Degrees Celsius (at noon)
Monday 17
Tuesday 16
Wednesday 14
Ratio level of measurement
- All properties of interval (ranking & equal distances)
- Absolute & meaningful zero point
e.g. a person’s age
Name Age
Claire 28
Sebastian 22
Geert 34
Qualitative – nominal and ordinal
Quantitative – interval and ratio
we always first need to know the level of measurement in order to know which statistical
techniques we may use for the given variable(s)
Continuous variable – measured along a continuum; the number can have a decimal point
(e.g. 5.457), such as a person’s height
Discrete variable – measured in whole units or categories; a whole number without a
decimal point, such as a person’s number of children
Measures of central tendency – describe the distribution of a variable univariately, i.e.
with a single measure (a distribution can be shown as a bar chart, in which the differences
between values are visible). They apply to variables at several levels of measurement, but
not every measure can be used at every level
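As a quick reference, the pairing of measures and levels used in the section headings of these notes can be written as a small lookup table. The dictionary and function names below are hypothetical, chosen only for this sketch.

```python
# Measures of central tendency allowed at each level of measurement,
# following the headings in these notes (mean: interval/ratio;
# median: ordinal and interval/ratio; mode: all levels).
ALLOWED_MEASURES = {
    "nominal": ["mode"],
    "ordinal": ["median", "mode"],
    "interval": ["mean", "median", "mode"],
    "ratio": ["mean", "median", "mode"],
}


def allowed_measures(level):
    """Return the measures that may be used for a variable at this level."""
    return ALLOWED_MEASURES[level.lower()]


print(allowed_measures("ordinal"))
```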
MEASURE OF CENTRAL TENDENCY: THE MEAN (INTERVAL/RATIO)
Measuring trust in the news media (on an 11-point scale, 0 = no trust, 10 = complete trust)
10 respondents in our sample (n=10)
what is the average (mean) trust in the news media in this sample?
Mean = M; formula for the mean: M = Σx / n
Σ = sigma (sum)
x = variable (the different values a variable can have)
Σx = sum of all values of the variable
n = the number of observations in the sample
Thus, e.g. M = (2 + 3 + 5) / 3 = 3 1/3: all values are added up and divided by n, i.e. the
number of observations in the sample
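The formula can be checked with a minimal Python sketch. The function name `mean` is chosen here for illustration; the scores 2, 3 and 5 are the ones from the worked example.

```python
def mean(values):
    """Return M = (sum of x) / n for a non-empty list of numbers."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)


scores = [2, 3, 5]
m = mean(scores)  # (2 + 3 + 5) / 3, i.e. 3 1/3
print(m)
```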
Characteristics of the mean:
- Changing any score will change the mean
- Adding or removing a score will change the mean (unless that score is already equal
to the mean)
- Adding, subtracting, multiplying, dividing each score by a given value causes the
mean to change accordingly
- Sum of differences from the mean is zero: Σ(x − M) = 0
subtract the mean from all values and then add the differences up; the result is 0
- Sum of squared differences from the mean is minimal: Σ(x − M)² = minimum
subtract the mean from each value, square each difference and add the squares up
this value is also called the sum of squares (SS)
a larger SS means that scores deviate more from the mean
WHY ‘MINIMAL’? – if we had used any other value than the mean to calculate the SS,
it would have been larger than 42 (the SS obtained with the mean in the lecture example);
the mean gives the lowest possible value
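Both properties can be verified numerically. The sketch below uses the small sample 2, 3, 5 rather than the lecture data (which is not reproduced in these notes), and the helper names are hypothetical.

```python
def mean(values):
    """Return the sum of the values divided by their count."""
    return sum(values) / len(values)


def sum_of_squares(values, center):
    """Return the sum of squared differences between each value and `center`."""
    return sum((x - center) ** 2 for x in values)


scores = [2, 3, 5]
m = mean(scores)

# Property 1: the differences from the mean add up to zero
# (up to floating-point rounding).
deviations_sum = sum(x - m for x in scores)

# Property 2: SS is smallest when the differences are taken from the mean.
ss_at_mean = sum_of_squares(scores, m)  # 42/9, about 4.67
ss_at_3 = sum_of_squares(scores, 3)     # 1 + 0 + 4 = 5
ss_at_4 = sum_of_squares(scores, 4)     # 4 + 1 + 1 = 6

print(deviations_sum, ss_at_mean, ss_at_3, ss_at_4)
```

Any center other than the mean gives a larger SS, and the SS grows the further the center moves away from the mean.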
Population mean: μ = Σx / N
MEASURE OF CENTRAL TENDENCY: THE MEDIAN (ORDINAL & INTERVAL/RATIO)
To find the median:
1. Sort all cases based on their value on x
2. The value of the “middle case” equals the median (equal amount of cases below and
above)
- Whenever n is an even number, the median is the mean of the values of the two middle
cases (e.g. middle values 3 and 4: (3 + 4) / 2 = 7 / 2 = 3.5)
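The steps above can be sketched in Python as follows. The function name is chosen for this illustration, and the input lists are made-up examples.

```python
def median(values):
    """Return the median: the middle value of the sorted cases."""
    if not values:
        raise ValueError("the median of an empty sample is undefined")
    ordered = sorted(values)          # step 1: sort all cases on x
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: a single middle case
        return ordered[middle]
    # even n: the mean of the two middle cases
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 2, 3]))       # odd n: middle case after sorting
print(median([1, 3, 4, 9]))    # even n: (3 + 4) / 2
```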
Note that the median is not as sensitive to outliers as the mean
Outlier – a value that is far away from the rest of the distribution; it is a lot higher or
lower, so it sticks out
- Can ruin the interpretive value of the mean
- When there are outliers, the median is more useful (it is not sensitive to outliers)
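A small experiment shows the difference in sensitivity. The scores below are invented for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

scores = [4, 5, 5, 6, 7]          # hypothetical scores without an outlier
with_outlier = scores + [100]     # the same scores plus one extreme value

# The mean jumps when the outlier is added; the median barely moves.
print(mean(scores), median(scores))
print(mean(with_outlier), median(with_outlier))
```

One extreme case pulls the mean from 5.4 to above 21, while the median only shifts from 5 to 5.5.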
Frequency – how many cases fall into each category
To determine the median from a frequency table, we need to identify the first category that
exceeds 50% in the ‘cumulative percent’ column (ORDINAL)
Percent = frequency : total × 100
Cumulative percent = the percent of a category plus the percents of all lower categories
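The frequency-table rule can be sketched as a short function. The category labels are the TV-watching answers from these notes, but the frequencies are hypothetical, and the function name is chosen only for this example.

```python
def median_category(freq_table):
    """Return the first category whose cumulative percent exceeds 50%.

    `freq_table` is a list of (category, frequency) pairs in rank order.
    Note: a cumulative percent of exactly 50% means the middle falls between
    two categories; this sketch follows the rule in the notes and moves on
    to the next category in that case.
    """
    total = sum(freq for _, freq in freq_table)
    if total == 0:
        raise ValueError("the frequency table contains no cases")
    running = 0
    for category, freq in freq_table:
        running += freq
        cumulative_percent = 100 * running / total
        if cumulative_percent > 50:
            return category


# Hypothetical frequencies (n = 20): cumulative percents are 10, 35, 75, 100.
tv = [
    ("never", 2),
    ("once a week", 5),
    ("a few days a week", 8),
    ("daily", 5),
]
print(median_category(tv))
```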
MEASURE OF CENTRAL TENDENCY: THE MODE (NOMINAL, ORDINAL, INTERVAL/RATIO)
Mode – the category with the largest number of cases, i.e. the one with the highest
frequency; look for the highest percentage in the frequency table
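Finding the mode amounts to counting cases per category. The sketch below uses `collections.Counter` from the standard library; the function name and the data are hypothetical. It returns a list because two categories can tie for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return the value(s) with the highest frequency, in order of first appearance."""
    if not values:
        raise ValueError("the mode of an empty sample is undefined")
    counts = Counter(values)              # frequency of each category
    highest = max(counts.values())        # the largest frequency
    return [value for value, count in counts.items() if count == highest]


religion_codes = [1, 2, 3, 3, 4]          # hypothetical nominal data
print(modes(religion_codes))
```

Because only counting is involved, this works for nominal codes as well as for ordinal and interval/ratio values.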
Distribution – shows you the counts of units of analysis within each of the values that they
can take on
Normal distribution:
- Symmetric: looks the same on both sides
- Mean, median and mode are the same value
Skewed distribution
- Left: the mean is shifted to the left in a negatively skewed distribution
- Right: the mean is shifted to the right in a positively skewed distribution
look at the direction in which the tail stretches out (left/right); that direction names the skew
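Since the mean is pulled toward the long tail, comparing it with the median gives a rough indication of the direction of the skew. This is only a heuristic sketch with made-up data, not a formal skewness statistic; the function name is hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Rough indication of skew based on where the mean lies relative to the median."""
    m, md = mean(values), median(values)
    if m > md:
        return "right (positive)"   # mean pulled toward a tail on the right
    if m < md:
        return "left (negative)"    # mean pulled toward a tail on the left
    return "no skew indicated"


print(skew_direction([1, 2, 2, 3, 20]))     # one very high value
print(skew_direction([1, 18, 19, 19, 20]))  # one very low value
print(skew_direction([1, 2, 3]))            # symmetric
```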