Contents
Lecture 1 – Introduction, data exploration and visualization ......................................................................... 2
Lecture 2 - ANOVA........................................................................................................................................... 3
Lecture 3 – Linear Regression ......................................................................................................................... 5
Lecture 4 – Logistic Regression ....................................................................................................................... 6
Lecture 5 – Factor Analysis .............................................................................................................................. 9
Lecture 6 – Conjoint analysis......................................................................................................................... 11
Lecture 7 - Cluster Analysis ........................................................................................................................... 14
R..................................................................................................................................................................... 17
Tutorial 1 – Introduction, Data exploration and visualization .................................................................. 18
Tutorial 2 - ANOVA .................................................................................................................................... 18
Tutorial 3 – Linear regression.................................................................................................................... 20
Tutorial 4 – Logistic regression.................................................................................................................. 21
Tutorial 5 – Factor analysis ....................................................................................................................... 23
Tutorial 6 – Conjoint analysis .................................................................................................................... 25
Tutorial 7 – Cluster analysis ...................................................................................................................... 27
Lecture 1 – Introduction, data exploration and visualization
Total error framework: There are three types of errors that cause the true value to differ from the
observed value:
1) Sampling error: The sample does not represent the population. This can itself be caused by
three errors:
a. Coverage error: Not all potential respondents had the opportunity to respond (e.g. only
people with a telephone were being interviewed)
b. Sample error: The sample does not represent the frame population because the people
selected were not random (e.g. calling people whose phone number ends in a ‘9’. An error
occurs if people with a 9 differ from people with an 8)
c. Non-response error: People who respond differ from people who don’t (e.g.
people who respond tend to be social and you’re researching social behaviour)
2) Measurement error: Not being able to capture all facets of a concept. That’s why measuring some
(mostly abstract) variables, e.g. attitudes, feelings, or beliefs, requires multiple questions
with summated scales. It can be split into two aspects:
a. Validity: Does it measure what it’s supposed to measure? You can verify this by asking
whether the effect sizes and signs give plausible model results
b. Reliability: Is it stable? This can be verified by checking how much the results change
when:
i. We add additional control variables to the model
ii. We take away some observations (e.g. outliers)
iii. We estimate the same model on a new dataset
3) Statistical error: Two types of errors:
a. Type 1 (alpha): You accept the alternative hypothesis, whereas the null hypothesis was true
b. Type 2 (beta): You could not reject the null hypothesis, whereas the alternative hypothesis
was correct. 1 - beta is called the power and shows the chance that you reject the
null hypothesis when the alternative is true
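The link between alpha, beta and power can be illustrated with a short calculation. The sketch below assumes a hypothetical one-sided z-test with a made-up sample size and true effect; the numbers are for illustration only:

```python
from scipy.stats import norm

alpha = 0.05            # Type 1 error rate: reject H0 although H0 is true
n, effect_sd = 25, 0.5  # assumed sample size and true effect (in SD units)

# Critical value: reject H0 when the test statistic exceeds this
z_crit = norm.ppf(1 - alpha)

# Under the alternative, the test statistic is centred at effect * sqrt(n)
shift = effect_sd * n ** 0.5

beta = norm.cdf(z_crit - shift)  # Type 2 error: fail to reject H0 under H1
power = 1 - beta                 # chance of rejecting H0 when H1 is true

print(round(beta, 3), round(power, 3))
```

With these assumed numbers, beta comes out around 0.2 and power around 0.8, matching the conventional 0.8 power target mentioned later in the notes.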
Post-stratification weights: Adjusting the weights of different subgroups within the sample so as to get
closer to the population (e.g. if the share of males in your sample is a third of their share in the
population, triple the weight of the male observations in your weighted average)
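As a sketch with hypothetical numbers: each group gets weight = population share / sample share, so an underrepresented group counts more in the weighted average:

```python
# Post-stratification: weight each group by population share / sample share
# (hypothetical group sizes and means, for illustration only)
sample = {"male":   {"n": 10, "mean": 6.0},
          "female": {"n": 30, "mean": 4.0}}
population_share = {"male": 0.5, "female": 0.5}

n_total = sum(g["n"] for g in sample.values())
weights = {k: population_share[k] / (g["n"] / n_total)
           for k, g in sample.items()}  # male: 0.5/0.25 = 2, female: 0.5/0.75

unweighted = sum(g["n"] * g["mean"] for g in sample.values()) / n_total
weighted = (sum(weights[k] * g["n"] * g["mean"] for k, g in sample.items())
            / sum(weights[k] * g["n"] for k, g in sample.items()))

print(unweighted, weighted)  # 4.5 vs 5.0
```

Because the population is 50/50, the weighted mean is simply the average of the two group means (5.0), whereas the raw sample mean (4.5) is pulled toward the overrepresented group.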
Measurement scales:
• Non-metric: They can only measure the direction of the response
o Nominal (categorical): Numbers or tags used for identifying or classifying objects in
mutually exclusive (if one, then not the other) and collectively exhaustive (at least one)
categories, e.g. SNR or gender
o Ordinal: Numbers are assigned to objects to indicate the relative positions of some
characteristics of objects, but not the magnitude of difference between them
• Metric (continuous): Not only do they measure the direction, but also the intensity
o Interval: Numbers are assigned to objects to indicate the relative positions of some
characteristic of objects with differences between objects being comparable. The zero is
arbitrary, e.g. Likert scale (if it went from 2 to 8, instead of 1-7, it would work too.
Position is relative), satisfaction scale, perceptual constructs, temperature
(Fahrenheit/Celsius)
o Ratio: The most precise scale. Absolute zero point. Has all the advantages of other scales.
The sum of different observations can make sense. E.g. weight, height, age, income,
temperature (Kelvin)
p-value: The probability of the observed data or statistic (or more extreme) given that the null hypothesis
is true. In other words: the probability of obtaining a certain statistic (e.g. -3), or one more extreme
(e.g. even lower than -3), when the null hypothesis is true (e.g. the null hypothesis predicts no effect, 0).
When your p-value is 0.02, there’s a 2% chance of obtaining these results when in reality nothing is going
on. The threshold is typically 0.05, and the null hypothesis can be rejected when the p-value is lower.
The p-value is NOT the probability that the hypothesis being tested is true, and you should not base your
business decision solely on it. You should also interpret the power, measurement, study design and
numerical and graphical summaries of the data
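As a sketch with made-up data: a one-sample t-test of whether a mean differs from 0. SciPy’s `ttest_1samp` returns the test statistic and the two-sided p-value:

```python
from scipy import stats

# Hypothetical ratings; H0: the true mean is 0
x = [0.8, 1.2, -0.3, 0.9, 1.5, 0.4, 1.1, 0.7]

t_stat, p_value = stats.ttest_1samp(x, popmean=0)

# p_value = P(statistic this extreme or more | H0 true),
# NOT the probability that H0 is true
if p_value < 0.05:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: reject H0")
```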
Data preparation: Always explore your data before running any model. Check the data for any errors, for
example with visualization
Visualization: The chart should show the composition or distribution of one variable, or compare data
points or variables across multiple subunits (e.g. male vs female). Its purpose is to:
• Explore the data
• Understand and make sense of the data
• Communicate the results
Lecture 2 - ANOVA
ANOVA: Test if there are differences in the mean of a metric (interval/ratio) dependent variable across
different levels of one or more non-metric (nominal/ordinal) independent variables (factors). It’s a special
case of linear regression. Types of ANOVAs:
• T-test: A special case of the ANOVA, since it only compares the means of a metric variable across a
two-level factor. Nevertheless, an ANOVA can be used even in this case
• One-way ANOVA: An ANOVA with only one factor
• Two-way ANOVA: An ANOVA with two factors. A 2x3 ANOVA refers to an ANOVA with two factors,
where one factor has two levels and the other three
• ANCOVA: A normal ANOVA, except you also account for variables outside your factors (control
variables / covariates)
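A minimal one-way ANOVA sketch, using SciPy’s `f_oneway` on made-up satisfaction scores across three hypothetical factor levels:

```python
from scipy.stats import f_oneway

# Hypothetical metric DV (satisfaction) across three levels of one factor
group_a = [5, 6, 7, 5, 6]
group_b = [3, 4, 3, 5, 4]
group_c = [6, 7, 8, 7, 6]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
# H0: all group means are equal; a small p suggests at least one mean differs
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Note that running a t-test on just `group_a` and `group_b` would be the two-level special case described above.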
Power: 1 - P(accept null | alternative is true), i.e. the probability of rejecting the null hypothesis when the alternative is true. A good power level is 0.8, and an excellent one is 0.95. It depends on the following:
• Effect size: We have little influence on that. One way to predict the effect size is via Cohen’s f
• Sample size: This is the only thing we can tweak. If you expect a large effect, a small sample
will be sufficient to find it, but if you expect a small effect, you’ll need a lot of participants
to find it. Many factors determine the required sample size, and any rules about a
minimum number (e.g. 75 participants) are only rules of thumb
• Alpha: Typically fixed at 0.05
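The interplay of effect size, alpha and power can be sketched with the standard normal approximation for a two-sided two-sample t-test: n per group ≈ 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, where d is Cohen’s effect size. The function and numbers below are illustrative, not part of the lecture:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Smaller expected effects need far more participants
print(n_per_group(0.8))  # large effect  -> ~25 per group
print(n_per_group(0.5))  # medium effect -> ~63 per group
print(n_per_group(0.2))  # small effect  -> ~393 per group
```

This makes the sample-size bullet concrete: halving the expected effect size roughly quadruples the required sample.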