Summary - IRM
WEEK 1
Introduction, Data & Visualization
Total error framework
There are errors in communication that can be
categorized into various types. Understanding
these types of errors can provide insights into the
dynamics of communication processes.
Sampling error
You will always get a sample and not the whole
universe.
“When a sample does not truly represent the entire
population”.
Example: Suppose a government agency conducts a survey to gather information about healthcare
access among residents in a particular city. They decide to use telephone surveys, assuming that most
households have landline phones or mobile phones.
→ Problem: not everyone has a telephone.
➢ Coverage error: The sampling frame is not representative of the target population (sampling
frame ≠ population). Coverage errors result from gaps between the sampling frame and the
total population.
In example:
o Undercoverage: The survey misses segments of the population, such as low-income
households, homeless individuals, or elderly people living alone, who may not have access
to phones or are less likely to be included in the sampling frame.
o Overcoverage: Conversely, the sampling frame may include non-residential or non-eligible
phone numbers, such as businesses, offices, or disconnected numbers, leading to an
overrepresentation of certain groups and distorting the survey results.
➢ Sample error: The sample has low representativeness, meaning that properties of the population are
over- or under-represented in the sample, which leads to a higher sampling error.
➢ Non-response error: inability to obtain a useful response to all survey items from the entire
sample.
In example: Nonresponse error occurs when selected individuals in the sample do not
participate in the survey, potentially leading to biased estimates if nonrespondents differ
systematically from respondents in terms of healthcare access.
Solution: post-stratification weights → reweighting the responses brings your sample closer to the
population.
In example: if we want to correct for undercoverage: By assigning higher weights to responses from
underrepresented groups, the weighted sample becomes more representative of the population as a
whole.
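The weighting idea above can be sketched as follows. This is a minimal illustration, not a full weighting procedure: the age groups, population shares, and survey answers are all made-up assumptions, and each weight is simply the population share divided by the sample share of the group.

```python
from collections import Counter

# Hypothetical population shares per age group (assumed known, e.g. from a census).
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

# Hypothetical sample of 100 respondents: the 65+ group is underrepresented.
sample_groups = ["18-34"] * 40 + ["35-64"] * 50 + ["65+"] * 10
# 1 = reports good healthcare access, 0 = does not (made-up answers, same order).
sample_access = [1] * 36 + [0] * 4 + [1] * 40 + [0] * 10 + [1] * 5 + [0] * 5

n = len(sample_groups)
sample_share = {group: count / n for group, count in Counter(sample_groups).items()}

# Post-stratification weight per group = population share / sample share.
weights = {group: population_share[group] / sample_share[group] for group in population_share}

unweighted_mean = sum(sample_access) / n
weighted_mean = (
    sum(weights[g] * y for g, y in zip(sample_groups, sample_access))
    / sum(weights[g] for g in sample_groups)
)

print(weights)
print(f"unweighted: {unweighted_mean:.2f}, weighted: {weighted_mean:.2f}")
```

In this made-up sample the 65+ group makes up 10% of the respondents but 20% of the population, so each of its responses counts about twice; the overrepresented 18-34 group is weighted down.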
Measurement error
Refers to the inaccuracy between the true value of a variable or attribute and the value obtained
through measurement or observation.
o Validity: refers to the degree to which the scores on a measure
represent the variable they are intended to represent.
→ does it measure what it’s supposed to measure?
▪ In practice: do these coefficients make sense? (i.e., do
the effect sizes and signs give plausible model results?)
Example: a 10% increase in price leads to a 200% increase in sales →
does this make sense? No.
o Reliability: refers to the degree to which multiple measurements
give the same result.
→ is it stable?
▪ In practice: how much do these results change if:
o we add additional control variables to the model
o we take away some observations (e.g., outliers)
o we estimate the same model on a new dataset
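One of these reliability checks, removing an outlier and re-estimating, can be sketched as below. The data are invented, and the model is assumed to be a simple one-variable least-squares regression; the helper `ols_slope` is a hypothetical name introduced only for this illustration.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

# Made-up data: the last observation is an obvious outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 40]

slope_all = ols_slope(x, y)
slope_without_outlier = ols_slope(x[:-1], y[:-1])

print(f"slope with all observations: {slope_all:.2f}")
print(f"slope without the outlier:   {slope_without_outlier:.2f}")
```

If the estimate changes a lot when a single observation is dropped, as it does here, the result is not very stable.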
Example: you’re in a coffee shop and you’re ordering a coffee
Measurement scales:
• Non-metric scales
Can measure only the direction of the response (e.g., yes/no).
o Nominal (categorical)
▪ Nominal scales categorize items into distinct groups or classes without
establishing any order or ranking among them.
▪ Note: the categories must be mutually exclusive (an item cannot belong to two
categories at the same time) and collectively exhaustive (every item belongs to
at least one category)
▪ Examples: gender, country, student numbers, marital status
▪ In example: type of coffee (latte, cappuccino, mocha) → nominal variable.
Latte can’t be greater than cappuccino, so NO ranking.
o Ordinal
▪ Ordinal scales establish an order or ranking among the items being measured,
but they do not indicate the magnitude of differences between the
categories.
▪ Important: the differences between ranks are not necessarily equal
▪ Examples: gold/silver/bronze medal, whether you finished in first, second or
third place
▪ In example: The size (small, medium, large) → ordinal variable. Small is
smaller than medium, large is bigger than medium, etc. → there is a clear rank
order
• Metric scales
They not only measure direction or classification, but intensity as well (e.g., strongly agree or
somewhat agree).
o Interval
▪ Interval scales measure variables where the difference between two
points is meaningful, but there is no true zero point.
▪ On an interval scale, equal differences between points represent equal
differences in the measured attribute.
▪ Example: temperature in Celsius → the difference between 20°C
and 30°C is the same as the difference between 30°C and 40°C, but zero
does not represent the complete absence of temperature.
o Ratio
▪ Ratio scales are similar to interval scales but have a true zero point,
meaning that zero represents the complete absence of the attribute
being measured.
▪ On a ratio scale, equal differences between points represent equal
differences in the measured attribute, and ratios between points are
meaningful.
▪ Examples: height, weight, time, and income
→ Why is this important to know?
Get the units right: the right statistical technique depends on which scale is used, e.g., metric vs.
non-metric. For example, it makes no sense to calculate the mean of a nominal or ordinal scaled variable.
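A small sketch of what "the right technique per scale" can look like for the coffee shop example. The orders and prices are invented, and the choice of mode, median, and mean as the summary per scale is a common convention assumed here, not something prescribed by the lecture.

```python
import statistics

# Made-up coffee orders.
coffee_type = ["latte", "cappuccino", "latte", "mocha", "latte"]  # nominal
size = ["small", "large", "medium", "medium", "large"]            # ordinal
price_eur = [3.20, 4.10, 3.60, 3.80, 4.30]                        # ratio

# Nominal: categories can only be counted, so report the most frequent one.
most_common_type = statistics.mode(coffee_type)

# Ordinal: the categories have an order, so a median is meaningful (a mean is not).
size_rank = {"small": 1, "medium": 2, "large": 3}
rank_to_size = {rank: label for label, rank in size_rank.items()}
median_size = rank_to_size[statistics.median_low(size_rank[s] for s in size)]

# Metric (ratio): differences and ratios are meaningful, so the mean can be used.
mean_price = statistics.mean(price_eur)

print(most_common_type, median_size, round(mean_price, 2))
```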
Some variables are easier to measure than others → e.g., attitudes, feelings, or frugality are harder
to measure than income or age.
→ Let’s say you want to measure frugality (Dutch: zuinigheid). You can use eight seven-point Likert-type
statements (1 = completely disagree, 7 = completely agree).
Statistical error (hypothesis testing)
There are two possible outcomes of hypothesis testing:
1. Fail to reject the null hypothesis (we continue to treat the null as true)
2. Reject the null hypothesis (we conclude in favour of the alternative)
Two types of error when hypothesis testing:
o Type 1 error (false positive): concluding that something is
happening while nothing is happening → E.g., a doctor
telling a man that he is pregnant.
o Type 2 error (false negative): concluding that there is no
effect when there actually is one → E.g., a doctor telling a
pregnant woman that she is not pregnant.
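The two error types can be summarised as a small decision table. The function below is only an illustrative sketch; `classify_outcome` and its parameter names are hypothetical and introduced here for the example.

```python
def classify_outcome(null_is_true: bool, reject_null: bool) -> str:
    """Label the result of a test given the (normally unknown) truth and the decision."""
    if null_is_true and reject_null:
        return "Type 1 error (false positive)"
    if not null_is_true and not reject_null:
        return "Type 2 error (false negative)"
    return "correct decision"

# The doctor examples: the null hypothesis is "not pregnant".
print(classify_outcome(null_is_true=True, reject_null=True))    # man told he is pregnant
print(classify_outcome(null_is_true=False, reject_null=False))  # pregnant woman told she is not
```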
Statistical testing
o What is a p-value? The probability of the observed data or statistic (or something more extreme)
given that the null hypothesis is true.
If it’s “low”, then the data are unlikely under the null, and you can reject the null (with a low
chance of a type 1 error).
o Typically we set the threshold (α) at 0.05 (that is: reject the null if p-value is < α)
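To make the definition concrete, here is a minimal sketch of an exact two-sided p-value for a coin-flip experiment. The scenario (16 heads in 20 flips, null hypothesis "the coin is fair") and the function name `two_sided_p_value` are assumptions chosen for illustration.

```python
from math import comb

def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact p-value for H0: the coin is fair (P(heads) = 0.5)."""
    expected = flips / 2
    observed_deviation = abs(heads - expected)
    # Under H0 every sequence of flips is equally likely, so count the outcomes
    # that are at least as far from the expected number of heads as the observed one.
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_deviation
    )
    return extreme_outcomes / 2 ** flips

ALPHA = 0.05  # threshold from the notes
p = two_sided_p_value(heads=16, flips=20)
reject_null = p < ALPHA

print(f"p-value = {p:.4f}, reject the null: {reject_null}")
```

The sum covers both tails (very many and very few heads), which is what "or more extreme" means for a two-sided test.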
Exploratory data analysis
Data preparation
o Always explore your data before running any model → this is important for the real world, in this
class we get data that have already been prepared.
o A few examples of problems found in many datasets: missing observations that need to be recoded
(e.g., 9999 = missing) and values that are not mutually consistent (e.g., age = 18, birthday = 4/30/1901)
o Data has to make sense
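The two checks mentioned above can be sketched as follows. The records, the field names, the survey year, and the one-year tolerance are all hypothetical; only the 9999 missing-value code comes from the notes.

```python
MISSING_CODE = 9999   # code used for "missing" in the raw data
SURVEY_YEAR = 2024    # assumed year in which the data were collected

# Made-up raw records.
raw = [
    {"id": 1, "age": 18, "birth_year": 1901, "income": 2500},
    {"id": 2, "age": 34, "birth_year": 1990, "income": 9999},
    {"id": 3, "age": 9999, "birth_year": 1985, "income": 4100},
]

def recode_missing(record):
    """Replace the missing-value code with None (the id field is left alone)."""
    return {
        key: (None if key != "id" and value == MISSING_CODE else value)
        for key, value in record.items()
    }

def age_is_consistent(record, tolerance=1):
    """Check that the reported age matches the age implied by the birth year."""
    if record["age"] is None or record["birth_year"] is None:
        return True  # nothing to compare
    implied_age = SURVEY_YEAR - record["birth_year"]
    return abs(implied_age - record["age"]) <= tolerance

clean = [recode_missing(r) for r in raw]
inconsistent_ids = [r["id"] for r in clean if not age_is_consistent(r)]

print(clean)
print("records to inspect:", inconsistent_ids)
```

Without the recoding step, a value such as 9999 would be treated as a real income or age and distort any average computed from the data.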
Visualization
The purpose of visualization is to
o Explore the data
o Understand and make sense of the data (“model-
free evidence”)
o Communicate the results
Choosing the right chart type
1. Showing the composition or distribution of one variable
2. Showing relationships between multiple (two) variables
3. Comparing data points or variables across multiple subunits
→ see the lecture for an example