Statistics II - IB
Lecture 1 – Introduction and hypothesis testing
Objectives of the course:
- Knowledge about multivariate statistics: hypothesis testing,
multivariate regression analysis, analysis of variance, time-series
analysis
- Skills for performing multivariate statistical analysis: use of SPSS
Introduction to multivariate statistical analysis
Types of data
Nonmetric or qualitative data
- Presence of a feature, male/female, vegetarian yes/no?
Metric or quantitative
- Quantifying an attribute, how tall is the individual/how satisfied?
Measurement scales
- Nominal scale: numbers in place of labels, male/female
- Ordinal scale: ranking
- Interval scale: with no ‘zero’ reference point: Celsius
- Ratio scale: with ‘zero’ reference point: Height
Missing value analysis
What are missing data? For an individual we have only partial information
(we know the values of only some of its characteristics).
The goal of the analysis is to identify the true patterns and relationships
among variables even when some data are missing. Impact:
- Reduces the sample size
- Can distort results: is it a systematic or random data deficiency?
Types of missing data:
Missing completely at random MCAR: for any respondent, the probability
that the value of a variable is missing does not depend on any variable.
Unsystematic missingness.
Missing at random MAR: for any respondent, the probability that the value
of a variable is missing depends on other variables. E.g., probability of
missing data is related to age.
https://iriseekhout.shinyapps.io/MissingMechanisms/
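The lecture itself uses SPSS; as a rough Python/pandas sketch of the two mechanisms (the toy columns age and income are assumptions made only for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(30000, 8000, n),
})

# MCAR: every income value has the same 10% chance of being missing,
# independent of any variable in the data.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: the chance that income is missing depends on another observed
# variable (here age), but not on the missing income value itself.
mar = df.copy()
p_miss = np.where(mar["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(n) < p_miss, "income"] = np.nan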
How to analyse the missing values?
- Check in each variable the percentage of missing values and the
number of extremes and outliers.
- Check in each observation the percentage of missing values and
how often it is an extreme or outlier (also, to what extent)
- Check how often the missing patterns occur: frequent patterns might
point to a systematic cause. Which cases present these missing patterns?
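A minimal pandas sketch of these checks (df is an assumed DataFrame; the |z| > 3 cut-off for outliers is just an illustrative choice):

import pandas as pd  # df: the dataset under study, assumed to exist

# Percentage of missing values per variable and per observation
missing_per_variable = df.isna().mean() * 100
missing_per_observation = df.isna().mean(axis=1) * 100

# Crude outlier count per numeric variable: |z-score| > 3
numeric = df.select_dtypes("number")
z = (numeric - numeric.mean()) / numeric.std()
outliers_per_variable = (z.abs() > 3).sum()

# How often does each missing-data pattern occur, and which cases show it?
pattern_counts = df.isna().value_counts()
print(missing_per_variable, outliers_per_variable, pattern_counts, sep="\n")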
How to handle the missing values?
- Ignore: if it is less than 10% of cases/variables
- Deletion: pairwise or listwise
- Imputation: mean, hot deck imputation, cold deck imputation
Deletion
Listwise: delete the entire observation. The advantage is that the remaining
dataset is complete. Disadvantages: a reduced sample size due to the loss of
the incomplete cases, and a biased dataset if the data are not MCAR.
Pairwise: delete incomplete cases on an analysis-by-analysis basis (they are
excluded only from the calculations that need the missing variable). The
sample size remains the same for some analyses and is reduced for others.
The disadvantage is the inconsistency of the sample size across analyses.
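A short pandas sketch contrasting the two strategies (df assumed):

import pandas as pd

# Listwise deletion: drop every observation with at least one missing value;
# the remaining dataset is complete but smaller (and biased if not MCAR).
listwise = df.dropna()

# Pairwise deletion: each analysis uses all cases that are complete for the
# variables it involves, so the effective sample size differs per analysis.
# pandas' corr() already works pairwise on the available values.
pairwise_corr = df.corr(numeric_only=True)

print(len(df), len(listwise))  # sample size before/after listwise deletion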
Imputation
Mean imputation: use the mean of the entire dataset or of the group. This
reduces variability.
Hot deck imputation: use an observation from the sample that is
considered similar.
Cold deck imputation: use an observation from an external data source
that is considered similar.
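A minimal sketch of mean imputation and one simple variant of hot deck imputation in pandas (df assumed); cold deck imputation would work the same way, but with donors drawn from an external data source:

import numpy as np
import pandas as pd

# Mean imputation: replace missing values by the variable's mean
# (simple, but it shrinks the variability of the imputed variable).
mean_imputed = df.fillna(df.mean(numeric_only=True))

# Hot deck imputation (simple random variant): fill each missing value with
# a value taken from a complete case in the same sample.
rng = np.random.default_rng(0)
hot_deck = df.copy()
for col in hot_deck.columns:
    donors = hot_deck[col].dropna().to_numpy()
    mask = hot_deck[col].isna()
    if mask.any() and len(donors) > 0:
        hot_deck.loc[mask, col] = rng.choice(donors, size=mask.sum())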
Rules of thumb to handle missing data:
< 10%: ignore or use any imputation method
10-20%: hot deck imputation (assuming MCAR)
>20%: delete
Examining data
Why should we examine data carefully? To prevent jumping to wrong
conclusions. Understand the type of data to answer the following
questions:
- What are the characteristics of the data?
- Is there a common behaviour to all the data?
- Is there any missing data?
- Is there any outlier?
- Which analysis methods can we use?
We should detect the major features of the probability distribution of the
variables. But, first of all: identify the type of data.
Examining qualitative data
What could make sense to calculate: Frequency table, minimum,
maximum, range, mode.
What graphical techniques can be applied: bar chart, pie chart.
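For instance, with pandas (the column name diet is an assumption for illustration):

import matplotlib.pyplot as plt
import pandas as pd

# Frequency table and mode of a qualitative variable
freq = df["diet"].value_counts()
print(freq)
print("mode:", df["diet"].mode().iloc[0])

# Graphical summary: bar chart (freq.plot.pie() would give a pie chart)
freq.plot.bar()
plt.show()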
Examining quantitative data
What could make sense to calculate: mean, mode, median, range,
interquartile range, standard deviation, variance, skewness, kurtosis.
What graphical techniques can be applied: scatterplot, histogram, boxplot.
The normal distribution is always the reference for comparison. We should
detect the major features of the probability distribution of the variables.
The shape of the probability distribution is important for the measure of
centrality and dispersion of the data.
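For instance, with pandas (the column name height is an assumption for illustration):

import matplotlib.pyplot as plt
import pandas as pd

x = df["height"]
summary = {
    "mean": x.mean(), "median": x.median(), "mode": x.mode().iloc[0],
    "range": x.max() - x.min(), "IQR": x.quantile(0.75) - x.quantile(0.25),
    "std": x.std(), "variance": x.var(),
    "skewness": x.skew(), "kurtosis": x.kurt(),
}
print(summary)

# Graphical summaries: histogram and boxplot
x.plot.hist()
plt.show()
x.plot.box()
plt.show()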
What can we do with the characteristics of the data?
Design a correct model reproducing the features of the data. Choose an
adequate technique for the analysis:
- Is the sample size large enough?
- Are the assumptions required by the chosen analysis technique
satisfied by the data?
- Do we have all the necessary data to apply the chosen analysis
technique correctly?
Transform the data before studying them, if necessary.
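For example, a strongly right-skewed variable is often log-transformed before analysis (a sketch, assuming a column income):

import numpy as np

# log1p handles zeros; the transformed variable is usually closer to symmetric
df["log_income"] = np.log1p(df["income"])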
Types of samples
Independent samples: the groups in the data do not correspond to each
other. The number of observations in each group can be different.
Matched pairs: the groups in the data correspond to each other. The
number of observations in each group is always the same.
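A hedged SciPy sketch of the corresponding two-sample tests (group_a, group_b, before and after are assumed arrays of observations):

from scipy import stats

# Independent samples: group sizes may differ, observations are unrelated
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Matched pairs: each observation in one group corresponds to one in the
# other, so both arrays must have the same length
t_rel, p_rel = stats.ttest_rel(before, after)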
Lecture 2 – Hypothesis Testing
Statistical inference and testing
Statistical inference means drawing conclusions from a sample. When we analyse
statistical data, we try to infer some characteristics of the process that has
generated the data.
Statistical inference based on an observed sample does not provide ‘definitive’
conclusions; it just sizes up the different ‘maybes’.
Using a sample, we can:
- Construct a confidence interval
- Perform a hypothesis test
- Estimate model parameters
Expected results come from probability theory. Observed results come
from experiments. Statistics links these two.
We can test if the unknown value of a parameter is equal to a chosen
value (or set of values): this is a hypothesis. Example:
We roll a die 10 times, write down the result and see that the sample
mean is 4.6. The standard deviation of the sample is s=1.35. Can we
infer that the die is a fair die?
A statistical test is a function of the observed data which gives just two
answers: reject / do not reject the null hypothesis. Often: the population
mean equals / does not equal the theoretical mean. Example:
H0: the population mean = 3.5
H1: the population mean does not equal 3.5
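One way to address the die example numerically is a one-sample t-test computed from the summary statistics above; a minimal sketch:

import math
from scipy import stats

n, xbar, s, mu0 = 10, 4.6, 1.35, 3.5             # values from the die example
t_stat = (xbar - mu0) / (s / math.sqrt(n))       # about 2.58
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p, roughly 0.03
print(t_stat, p_value)
# With a p-value below 0.05 we would reject H0 at the 5% level: a sample mean
# of 4.6 would be unusual if the population mean were really 3.5.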