Statistics and Methodology concepts and theory
Exam preparation M.Sc. Data Science & Society (year 2023/24)
1. Basics
Wald test = tests whether an estimated parameter differs from a hypothesized value by dividing the difference by the estimate's standard error: W = (θ̂ − θ0) / SE(θ̂); W is then compared to a standard normal (or t) reference distribution.
Sampling distribution = the probability distribution of a statistic.
p-value = the probability of observing the given test statistic, or one more extreme, if the null
hypothesis were true.
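A minimal sketch of the Wald-test logic above; the coefficient estimate, standard error, and null value are made-up numbers purely for illustration (Python/SciPy assumed).

from scipy import stats

beta_hat = 0.42   # estimated coefficient (made-up)
se_beta = 0.15    # its standard error (made-up)
beta_null = 0.0   # value under the null hypothesis

# Wald statistic: how many standard errors the estimate lies from the null value
w = (beta_hat - beta_null) / se_beta

# p-value: probability of a statistic at least this extreme if the null were true
p_value = 2 * stats.norm.sf(abs(w))

print(f"W = {w:.2f}, p = {p_value:.4f}")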
Statistical modeling = a method of knowledge generation that applies statistical analysis to real-world data; it allows controlling for confounding factors statistically. This is in contrast to statistical testing, which uses experimental data with random assignment.
2. Design
Data science cycle: Define -> Collect -> Process -> Clean
EDA can be used:
• to generate hypotheses for confirmatory data analysis;
• to sanity-check hypotheses.
3a. Data imputation
P items (columns) -> 2^P possible response patterns (each item can be either observed or missing)
Covariance coverage = the proportion of cases/observations/rows available to estimate a given
pairwise relationship (e.g., a covariance between two variables)
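As a small illustration, covariance coverage for every pair of variables could be computed like this; the DataFrame df and its values are made-up (Python/pandas assumed).

import numpy as np
import pandas as pd

# Made-up data with missing values
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, np.nan, 3.0, 1.0],
    "z": [5.0, 4.0, 3.0, np.nan, 2.0],
})

# Indicator matrix: 1 where a value is observed, 0 where it is missing
observed = df.notna().astype(int)

# Covariance coverage: proportion of rows where BOTH variables of a pair are observed
coverage = observed.T.dot(observed) / len(df)
print(coverage)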
Missing data mechanisms
• MCAR -> non-response does not depend on the data at all (neither on observed nor on missing values)
o P(R|Ymis, Yobs) = P(R)
• MAR -> non-response depends only on the data observed (e.g., on membership of a certain cohort)
o P(R|Ymis, Yobs) = P(R|Yobs)
• MNAR -> non-response depends on the missing values themselves, even after accounting for the observed data
o P(R|Ymis, Yobs) ≠ P(R|Yobs)
• Indirect MNAR -> a variable that is correlated with non-response is not in the dataset (a small simulation of the three mechanisms follows below)
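The following is a minimal simulation sketch of the three mechanisms, not part of the course materials; the variable names, missingness probabilities, and rules are made-up for illustration (Python/NumPy assumed).

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                # fully observed covariate
y = 0.5 * x + rng.normal(size=n)      # variable that will receive missing values
df = pd.DataFrame({"x": x, "y": y})

# MCAR: the response indicator R is unrelated to x or y
r_mcar = rng.random(n) < 0.2

# MAR: R depends only on the observed x
r_mar = rng.random(n) < np.where(x > 0, 0.4, 0.1)

# MNAR: R depends on the (to-be-)missing y itself
r_mnar = rng.random(n) < np.where(y > 0, 0.4, 0.1)

for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    y_obs = df["y"].where(~r)         # NaN wherever R flags non-response
    print(name, "mean of observed y:", round(y_obs.mean(), 3))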
Missing data treatments
• Listwise deletion
o Biased parameter estimates for MAR and MNAR
o Biased (downwards) SEs
• Pairwise deletion
o Biased parameter estimates for MAR and MNAR
o Biased (downwards) SEs
• Unconditional mean substitution
o Biased parameter estimates in all scenarios
o Weakens measures of linear association
o Biased (downwards) SEs
• Deterministic regression imputation (conditional mean substitution)
o Biased parameter estimates in all scenarios
o Inflates measures of linear association
o Biased (downwards) SEs
• Averaging available items (person-mean imputation)
o Biased parameter estimates for MAR and MNAR
o Biased parameter estimates if items do not contribute equally to the aggregate
score
• Last Observation Carried Forward
o Weakens estimates of growth
• Stochastic regression imputation
o Adds a random residual error to the imputed values to eliminate parameter bias
• Y_imp = Ŷ_mis + ε
o Biased (downwards) SEs
• Multiple imputation (see the sketch after this list)
o Models random residual error AND uncertainty in the regression coefficients used to create the imputations
• A different set of coefficients is randomly sampled to create each of the M imputations
• Y_imp = Ŷ_mis + ε, where Ŷ_mis = β̂0 + β̂1·X_mis is based on a newly drawn set of coefficients for each of the M imputations
o Eliminates parameter bias and SE bias (= accurate type I error rate)
o Biased parameter estimates for MNAR
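This is not the course's own code, but a minimal sketch of the multiple-imputation idea using scikit-learn's IterativeImputer with sample_posterior=True, so that a different set of coefficients and residual draws is used for each of the M imputations; the data, M = 5, and the 30% missingness rate are made-up. Pooling here only averages the point estimates, whereas full Rubin's rules would also combine within- and between-imputation variance.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
y[rng.random(n) < 0.3] = np.nan            # make roughly 30% of y missing
df = pd.DataFrame({"x": x, "y": y})

M = 5
slopes = []
for m in range(M):
    # sample_posterior=True draws imputations from a predictive distribution,
    # so coefficients and residual noise differ across the M completed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    # slope of y on x in the completed data (simple OLS via polyfit)
    slopes.append(np.polyfit(completed["x"], completed["y"], deg=1)[0])

print("pooled slope estimate:", round(float(np.mean(slopes)), 3))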
3b. Outliers
• Int. student. residuals: an observation Xn is an outlier if Tn > c,
o where Tn = |Xn − X̄| / SD(X), with the mean and SD computed on the full sample (including Xn itself).
• Ext. student. residuals: an observation Xn is an outlier if T(n) > c
o where:
o T(n) = |Xn − X̄(n)| / SD(n), i.e. the deletion mean and deletion SD are used,
o and (n) includes all observations bar the observation n itself.
• MAD: same logic but use TMAD = |Xn − median(X)| / MAD(X).
• Tukey's boxplot method:
o A value outside of the inner fence (c = 1.5) is a possible outlier.
o A value outside of the outer fence (c = 3) is a probable outlier.
• By breakdown points (lowest first):
o Mean (int. stud. res.)
o Deletion mean (ext. stud. res)
o Tukey's boxplot
o MAD
• Robust Mahalanobis (MCD estimation):
o "Multivariate generalization of the ext. stud. res."
o The fraction of the sample used to define the center of the data determines robustness
• Fraction ↑ -> identified outliers ↓
o Cut-off determined as the square root of some quantile of the chi-square distribution
• Cut-off ↑ -> identified outliers ↓
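Below is a sketch of two of the detectors above using SciPy and scikit-learn rather than the course's own code; the data, the c = 3 MAD cut-off, the 75% support fraction, and the 0.975 chi-square quantile are made-up example choices.

import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
X[:5] += 6                                   # plant a few obvious outliers

# --- Univariate MAD rule on the first column ---
x = X[:, 0]
t_mad = np.abs(x - np.median(x)) / stats.median_abs_deviation(x)
print("MAD flags:", int(np.sum(t_mad > 3)))  # c = 3 chosen as an example

# --- Robust Mahalanobis distances via MCD ---
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
d = np.sqrt(mcd.mahalanobis(X))              # robust Mahalanobis distances
cutoff = np.sqrt(stats.chi2.ppf(0.975, df=X.shape[1]))
print("MCD flags:", int(np.sum(d > cutoff)))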
4. Simple linear regression
• The full population model
o 𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜀
• The estimated, sample model
o Y = β̂0 + β̂1X + ε̂
• The estimated best-fit line(s)
o Ŷ = β̂0 + β̂1X
• The true best-fit line
o Ŷ = β0 + β1X
Sum of the squared residuals (RSS):
RSS = Σn ε̂n² = Σn (Yn − Ŷn)², where ε̂n = Yn − Ŷn is the residual for observation n.
Statistical inference is needed to quantify the precision with which we have estimated the OLS coefficients.
Confidence intervals:
CI(1−α) = β̂ ± tcrit · SE(β̂), where tcrit = z(1−α/2) for CI(1−α)
• => i.e. CI95 for the slope (β̂1) suggests that we can be 95% confident that the true value of β1 is between [_ ; _]
• => if we repeat the analysis an infinite number of times, 95% of the CIs that we calculate will
surround the true value of 𝛽1
• => CIs give us a plausible range for the population value of 𝛽
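A quick sketch with statsmodels on made-up data, showing the estimated coefficients, the RSS, and the 95% confidence intervals discussed above (Python/statsmodels assumed).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)   # true β0 = 1.0, β1 = 0.5

X = sm.add_constant(x)                      # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)                           # β̂0, β̂1
print(fit.ssr)                              # RSS (sum of squared residuals)
print(fit.conf_int(alpha=0.05))             # 95% CIs for β0 and β1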
5. Multiple linear regression
Model fit:
o R² = 1 − RSS / TSS
o where TSS = Σn (Yn − Ȳ)² is the total sum of squares and RSS is the residual sum of squares defined above.
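A matching sketch for multiple regression on made-up predictors, computing the R² model-fit measure both by hand and via statsmodels.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 0.8 * x1 - 0.3 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# R² = 1 − RSS/TSS, also available directly as fit.rsquared
rss = fit.ssr
tss = np.sum((y - y.mean()) ** 2)
print("R² (manual):", 1 - rss / tss)
print("R² (statsmodels):", fit.rsquared)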