The file contains materials to prepare for the course exam of Statistics and Methodology (880259-M-6), the core course for the M.Sc. Data Science & Society. It includes theoretical concepts from all 10 lectures. Knowing and being able to reproduce the materials of this summary should allow for a su...
Statistics and Methodology concepts and theory
Exam preparation M.Sc. Data Science & Society (year 2023/24)
1. Basics
Wald test
Sampling distribution = the probability distribution of a statistic.
p-value = the probability of observing the given test statistic, or one more extreme, if the null
hypothesis were true.
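A minimal sketch of how a Wald test turns an estimate into a p-value, assuming the usual form of the statistic (estimate minus null value, divided by its standard error) and a standard normal reference distribution. The function name and the example numbers are hypothetical.

```python
import math

def wald_test(estimate, null_value, std_error):
    """Return the Wald statistic and its two-sided normal-theory p-value."""
    z = (estimate - null_value) / std_error
    # P(|Z| >= |z|) for a standard normal Z, via the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical estimate and standard error, tested against a null value of 0
z, p = wald_test(estimate=0.42, null_value=0.0, std_error=0.15)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The p-value is the probability of a statistic at least this extreme under the null hypothesis, which matches the definition above.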
Statistical modeling = a method of knowledge generation through the application of statistical
analysis to real-world data. It allows us to control for confounding factors, as opposed to
statistical testing, which uses experimental data with random assignment.
2. Design
Data science cycle: Define -> Collect -> Process -> Clean
EDA can be used:
• to generate hypotheses for confirmatory data analysis;
• to sanity-check hypotheses.
3a. Data imputation
P items (columns) -> 2^P possible response patterns (counting patterns with and without missing data)
Covariance coverage = the proportion of cases/observations/rows available to estimate a given
pairwise relationship (e.g., a covariance between two variables)
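A small sketch of both ideas above: counting response patterns and computing covariance coverage for each pair of items. The toy data, the item names x, y, z, and the helper name are hypothetical; None marks a missing response.

```python
from itertools import combinations

# Hypothetical toy data: 5 cases, 3 items; None marks a missing response
rows = [
    {"x": 1.0, "y": 2.0, "z": None},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 1.0, "z": 4.0},
    {"x": None, "y": 3.0, "z": 2.0},
    {"x": 5.0, "y": 4.0, "z": 3.0},
]
items = ["x", "y", "z"]

# Response pattern per case: 1 = observed, 0 = missing
patterns = {tuple(int(r[i] is not None) for i in items) for r in rows}
max_patterns = 2 ** len(items)  # upper bound on distinct patterns for P items

def covariance_coverage(rows, a, b):
    """Proportion of cases with both a and b observed."""
    both = sum(1 for r in rows if r[a] is not None and r[b] is not None)
    return both / len(rows)

coverage = {(a, b): covariance_coverage(rows, a, b)
            for a, b in combinations(items, 2)}

print(f"{len(patterns)} of {max_patterns} possible patterns occur")
for pair, value in coverage.items():
    print(pair, value)
```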
Missing data mechanisms
• MCAR -> non-response does not depend on the data, observed or missing
o P(R|Ymis, Yobs) = P(R)
• MAR -> non-response depends only on the data observed (e.g., it differs by a certain cohort)
o P(R|Ymis, Yobs) = P(R|Yobs)
• MNAR -> non-response depends on the missing values themselves, even given the data observed
o P(R|Ymis, Yobs) ≠ P(R|Yobs)
• Indirect MNAR -> a variable correlated with non-response is not in the dataset
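The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: X is fully observed, Y = X + noise has a true mean of 0, and the missingness rules (30% at random, X > 0.5, Y > 0.5) are hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(1)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]  # true mean of Y is 0

# Each mechanism returns True when Y is missing for that case
def mcar(xi, yi):
    return random.random() < 0.3   # P(R) is the same for every case

def mar(xi, yi):
    return xi > 0.5                # depends only on the observed X

def mnar(xi, yi):
    return yi > 0.5                # depends on the value of Y itself

def observed_mean(mechanism):
    """Mean of Y among the cases that remain after deleting missing ones."""
    kept = [yi for xi, yi in zip(x, y) if not mechanism(xi, yi)]
    return statistics.fmean(kept)

means = {f.__name__: observed_mean(f) for f in (mcar, mar, mnar)}
for name, value in means.items():
    print(f"{name}: mean of observed Y = {value:.3f}")
```

Under MCAR the mean of the observed Y stays close to 0, while under MAR and MNAR the complete cases alone give a biased mean.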
Missing data treatments
• Listwise deletion
o Biased parameter estimates for MAR and MNAR
o Biased (downwards) SEs
• Pairwise deletion
o Biased parameter estimates for MAR and MNAR
o Biased (downwards) SEs
• Unconditional mean substitution
o Biased parameter estimates in all scenarios
o Weakens measures of linear association
o Biased (downwards) SEs
• Deterministic regression imputation (conditional mean subs.)
o Biased parameter estimates in all scenarios
o Inflates measures of linear association
o Biased (downwards) SEs
• Averaging available items (person-mean imputation)
o Biased parameter estimates for MAR and MNAR
o Biased parameter estimates if items do not contribute equally to the aggregate
score
• Last Observation Carried Forward
o Weakens estimates of growth
• Stochastic regression imputation
o Adds a random residual error to the imputed values to eliminate parameter bias
• Yimp = Ŷmis + ε
o Biased (downwards) SEs
• Multiple imputation
o Models random residual error AND uncertainty in the regression coefficients used to
create imputations
• A different set of coefficients is randomly sampled to create each of the M
imputations
• Yimp = Ŷmis + ε -> where Ŷmis = 𝛽0 + 𝛽1Xmis uses a new set of coefficients for each of the M imputations
o Eliminates parameter bias and SE bias (= accurate type I error rate)
o Biased parameter estimates for MNAR
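The effect of three of the single-imputation treatments above on a measure of linear association can be checked with a simulation. This is a sketch under assumed settings (Y = 0.5X + noise, 40% of Y missing completely at random); the variable names are hypothetical, and multiple imputation itself is not implemented here.

```python
import math
import random
import statistics

random.seed(7)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.4 for _ in range(n)]  # MCAR non-response on Y

def corr(a, b):
    """Pearson correlation of two equally long lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Fit the imputation model (simple regression of Y on X) on the observed cases
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]
mx, my = statistics.fmean(x_obs), statistics.fmean(y_obs)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs))
      / sum((xi - mx) ** 2 for xi in x_obs))
b0 = my - b1 * mx
resid_sd = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x_obs, y_obs))
                     / (len(x_obs) - 2))

# Three ways of filling in the missing Y values
y_mean = [my if m else yi for yi, m in zip(y, missing)]
y_det = [b0 + b1 * xi if m else yi for xi, yi, m in zip(x, y, missing)]
y_sto = [b0 + b1 * xi + random.gauss(0, resid_sd) if m else yi
         for xi, yi, m in zip(x, y, missing)]

r = {
    "complete": corr(x, y),        # benchmark, known only because data are simulated
    "mean": corr(x, y_mean),       # unconditional mean substitution
    "deterministic": corr(x, y_det),
    "stochastic": corr(x, y_sto),
}
for name, value in r.items():
    print(f"{name}: r = {value:.3f}")
```

Mean substitution weakens the correlation, deterministic regression imputation inflates it, and adding a random residual brings it back close to the complete-data value.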
3b. Outliers
• Int. student. residuals: an observation Xn is an outlier if Tn > c
o where: Tn = |Xn − X̄| / SD
• Ext. student. residuals: an observation Xn is an outlier if T(n) > c
o where: T(n) = |Xn − X̄(n)| / SD(n),
o deletion mean X̄(n) and deletion SD, SD(n), are used,
o and (n) includes all observations bar the observation n itself.
• MAD: same logic but use TMAD = |Xn − median(X)| / MAD(X)
• Tukey's boxplot method:
o A value outside of the inner fence (c = 1.5) is a possible outlier.
o A value outside of the outer fence (c = 3) is a probable outlier.
• By breakdown points (lowest first):
o Mean (int. stud. res.)
o Deletion mean (ext. stud. res)
o Tukey's boxplot
o MAD
• Robust Mahalanobis (MCD estimation):
o "Multivariate generalization of the ext. stud. res."
o Fraction of the sample used to define center of the data determines robustness
• Fraction ↑ -> identified outliers ↓
o Cut-off determined as the square root of some quantile of the chi-squared distribution
• Cutoff ↑ -> identified outliers ↓
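The univariate rules above can be compared on a toy sample. This is a sketch, not a definitive implementation: the data and the cutoff c = 3 are hypothetical, the statistics are taken as absolute distances from the center divided by the spread, the MAD is rescaled by 1.4826 (a common convention, assumed here), and the quartiles use the default method of statistics.quantiles. The robust Mahalanobis distance is not covered.

```python
import statistics

data = [10, 11, 12, 11, 10, 12, 11, 50]  # hypothetical sample with one extreme value
c = 3                                     # hypothetical cutoff

def internal_t(values):
    """Distance from the mean in SD units, using all observations."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [abs(v - m) / s for v in values]

def external_t(values):
    """Same, but with the deletion mean and deletion SD (observation left out)."""
    scores = []
    for i, v in enumerate(values):
        rest = values[:i] + values[i + 1:]
        scores.append(abs(v - statistics.fmean(rest)) / statistics.stdev(rest))
    return scores

def mad_t(values):
    """Distance from the median in (rescaled) MAD units."""
    med = statistics.median(values)
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    return [abs(v - med) / mad for v in values]

def tukey_labels(values):
    """Label each value using the inner (1.5 IQR) and outer (3 IQR) fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    labels = []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            labels.append("probable")
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            labels.append("possible")
        else:
            labels.append("ok")
    return labels

def flagged(scores):
    """Indices of observations whose statistic exceeds the cutoff."""
    return [i for i, t in enumerate(scores) if t > c]

print("internal:", flagged(internal_t(data)))
print("external:", flagged(external_t(data)))
print("MAD:     ", flagged(mad_t(data)))
print("Tukey:   ", tukey_labels(data))
```

In this sample the extreme value inflates the mean and SD so much that the internally studentized rule misses it, while the other three rules catch it. This is the practical meaning of the breakdown-point ordering above.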
4. Simple linear regression
• The full population model
o 𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀
• The estimated, sample model
o Y = β̂0 + β̂1X + ε̂
• The estimated best-fit line(s)
o Ŷ = β̂0 + β̂1X
• The true best-fit line
o Ŷ = 𝛽0 + 𝛽1X
Sum of the squared residuals (RSS): RSS = Σn (Yn − Ŷn)² = Σn ε̂n²
where Ŷn = β̂0 + β̂1Xn and ε̂n = Yn − Ŷn
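A minimal sketch of the OLS fit for simple linear regression and the resulting RSS, using the closed-form slope (sum of cross-products over sum of squares of X). The function name and the toy data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope, rss) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    rss = sum(e ** 2 for e in residuals)
    return b0, b1, rss

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, rss = fit_simple_ols(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, RSS = {rss:.3f}")
```

Any other line through these points gives a larger RSS, which is what makes this one the best-fit line.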
Statistical inference is needed to quantify the precision with which we've estimated the OLS
coefficients
Confidence intervals: CI = β̂ ± tcrit × SE(β̂)
where tcrit = z1–α/2 for CI1–α (a large-sample approximation; in small samples tcrit is the 1–α/2
quantile of the t distribution with the residual degrees of freedom)
• => e.g., a CI95 for the slope (β̂1) suggests that we can be 95% confident that the true value of 𝛽1
is between [_ ; _]
• => if we repeated the analysis an infinite number of times, 95% of the CIs that we calculated
would surround the true value of 𝛽1
• => CIs give us a plausible range for the population value of 𝛽
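The repeated-sampling interpretation can be checked with a simulation. This is a sketch under assumed settings: the true coefficients, sample size, and number of replications are hypothetical, the slope's standard error is taken as sqrt((RSS / (n − 2)) / Sxx), and the critical value is the normal quantile, which is only an approximation to the t quantile.

```python
import math
import random
from statistics import NormalDist

def slope_ci(x, y, alpha=0.05):
    """Normal-approximation confidence interval for the OLS slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # z quantile used as critical value
    return b1 - crit * se_b1, b1 + crit * se_b1

random.seed(42)
true_b0, true_b1 = 1.0, 2.0   # hypothetical population coefficients
reps, n = 1000, 200
hits = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [true_b0 + true_b1 * xi + random.gauss(0, 1) for xi in x]
    lower, upper = slope_ci(x, y)
    hits += lower <= true_b1 <= upper   # True counts as 1

coverage = hits / reps
print(f"share of CIs containing the true slope: {coverage:.3f}")
```

The share of intervals that contain the true slope should land near 0.95, which is the sense in which we are "95% confident".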
5. Multiple linear regression
Model fit:
where: