MSc. Data Science and Society Tilburg University 2019-2020
Lecture 1: Statistical Inference, Modeling, & Prediction
Statistical Reasoning: The purpose of statistics is to systematize the way that we account for
uncertainty when making data-based decisions.
Probability distributions: They quantify how likely it is to observe each possible value of some
probabilistic entity. Probability distributions are basically re-scaled frequency distributions. With an
infinite number of bins, a histogram smooths into a continuous curve. In a loose sense, each point on
the curve gives the relative likelihood (density) of observing the corresponding X value in any given
sample. The area under the curve must integrate to 1.0.
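Written out, for a continuous density f(x) the total area under the curve is one, and probabilities correspond to areas rather than to individual points:

```latex
\int_{-\infty}^{\infty} f(x)\,dx = 1,
\qquad
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx,
\qquad
P(X = a) = 0 .
```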
Statistical Testing: We often want to distil the information in the data into a single test statistic so
we can make judgments. When we conduct statistical tests, we weight the estimated effect by the
precision of the estimate. For example: the Wald test (a minimal sketch follows the list below).
- A test statistic, by itself, is just an arbitrary number.
- Thus, we need to compare the test statistic to some objective reference that tells us how
exceptional our test statistic is. This reference is known as a sampling distribution. We
compare the estimated value to a sampling distribution of t-statistics assuming no effect
(the distribution thus quantifies the null hypothesis).
o The special case of a null hypothesis of no effect is called the nil-null.
- If our estimated statistic would be very unusual in a population where the null hypothesis is
true, we reject the null and claim a statistically significant effect.
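A minimal sketch of this idea in Python (the effect size, standard error, and degrees of freedom below are illustrative assumptions, not values from the lecture):

```python
import scipy.stats as st

# Hypothetical values: an estimated effect and the standard error of that estimate.
beta_hat = 0.50   # estimated effect
se_beta = 0.27    # precision (standard error) of the estimate
df = 98           # residual degrees of freedom for the null t reference distribution

# Wald-type test statistic: the estimated effect weighted by its precision.
t_stat = beta_hat / se_beta

# Reject the nil-null at the 5% level if the statistic is more extreme than the
# critical value of the null sampling distribution.
t_crit = st.t.ppf(0.975, df=df)
print(f"t = {t_stat:.2f}, critical value = {t_crit:.2f}, reject: {abs(t_stat) > t_crit}")
```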
Sampling Distribution: The sampling distribution quantifies the possible values of the test statistic
over infinite repeated sampling (The population is defined by an infinite sequence of repeated tests).
A sampling distribution is a slightly different concept than the distribution of a random variable.
- The sampling distribution quantifies the possible values of a statistic (e.g., F-statistic, t-
statistic, correlation coefficient, mean, etc.); see the simulation sketch after this list.
- The distribution of a random variable quantifies the possible values of a variable (e.g., sex,
age, attitude, salary, music preferences, etc.).
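To make the distinction concrete, a small simulation (the population values are purely illustrative) approximates the sampling distribution of one statistic, the sample mean, by repeated sampling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Distribution of a random variable: e.g., a salary variable in the population (hypothetical).
# Sampling distribution of a statistic: the sample mean over many repeated samples of n = 50.
sample_means = [rng.normal(loc=3000, scale=500, size=50).mean() for _ in range(5_000)]

print(np.mean(sample_means))  # close to the population mean (3000)
print(np.std(sample_means))   # the standard error of the mean, about 500 / sqrt(50)
```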
P-value: We can compute the probability of having sampled the data we observed, or more unusual
data, from a population wherein there is no true mean difference in ratings (by calculating the area
in the null distribution that exceeds our estimated test statistic).
Example: suppose t = 1.86 (test statistic) and the p-value is 0.032. All that we can say is that there is
a 0.032 probability of observing a test statistic at least as large as t = 1.86, if the null hypothesis is
true (reasoning akin to proof by contradiction). We cannot say that there is a 0.032 probability of
observing exactly t = 1.86 if the null hypothesis is true, because the probability of observing any
individual point on a continuous distribution is exactly zero.
One-tailed versus two-tailed: We only use a one-tailed test when we have a directional hypothesis;
with a two-tailed test, extreme values in either direction count against the null. A minimal sketch of
both follows.
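A sketch of computing both tail areas for the example statistic above (the degrees of freedom are an illustrative assumption, not from the notes):

```python
import scipy.stats as st

t_stat = 1.86
df = 30  # illustrative degrees of freedom

# One-tailed p-value: area in the null t-distribution beyond the observed statistic.
p_one_tailed = st.t.sf(t_stat, df=df)

# Two-tailed p-value: area beyond |t| in both tails.
p_two_tailed = 2 * st.t.sf(abs(t_stat), df=df)

print(f"one-tailed: {p_one_tailed:.3f}, two-tailed: {p_two_tailed:.3f}")
```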
Statistical testing versus statistical modelling: Statistical testing is a very useful tool, but it quickly
reaches its limits because it works best in experimental contexts, where real-world “messiness” is
controlled. Data scientists, however, are rarely able to conduct experiments and must instead deal
with messy observational data. That is why data scientists need statistical modeling.
Statistical Modeling:
- Modelers attempt to build a mathematical representation of the interesting aspects of a
data distribution.
- Modelling the distribution = estimating β̂0 and β̂1
o Explaining the variation in the distribution by fitting a model to a sample.
- After we estimate β̂0 and β̂1, we can plug in new predictor data and get a predicted
outcome value for new cases (see the sketch after this list).
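A minimal sketch of this workflow in Python (the data are simulated and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated sample: one predictor X and an outcome Y with a linear relationship plus noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.8 * x + rng.normal(scale=1.5, size=200)

# Estimate beta0-hat and beta1-hat by least squares (np.polyfit returns [slope, intercept]).
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

# Plug in new predictor data to get predicted outcome values for new cases.
x_new = np.array([2.5, 7.0])
y_pred = beta0_hat + beta1_hat * x_new
print(beta0_hat, beta1_hat, y_pred)
```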
Inference versus Prediction:
- When doing statistical inference, we focus on how certain variables relate to the outcome
(Example: Do men have higher job-satisfaction than women?)
- When doing prediction, we want to build a tool that can accurately guess future values.
(Example: Given a student’s number of contact hours, what grade do we expect them to get?)
Lecture 2: Simple Linear Regression
Regression problem:
- Regression problems involve modeling a quantitative response.
- The regression problem begins with a random outcome variable, Y
- We hypothesize that the mean of Y is dependent on some set of fixed covariates, X.
Flavors of Probability Distribution:
- Marginal or unconditional: Every observation has the same expected value of Y, regardless
of its individual characteristics. There is a single, constant mean.
- Conditional: The value of Y that we expect for each observation depends on that
observation’s individual characteristics. The distributions we consider in regression problems
have conditional means (see the note after this list).
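In symbols (assuming, purely for illustration, normally distributed outcomes):

```latex
\text{Marginal: } Y \sim \mathcal{N}(\mu,\ \sigma^2)
\qquad \text{vs.} \qquad
\text{Conditional: } Y \mid X \sim \mathcal{N}(\beta_0 + \beta_1 X,\ \sigma^2)
```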
Projecting a Distribution onto the Plane: On the Y-axis, we plot our outcome variable. The X-axis
represents the predictor variable upon which we condition the mean of Y.
Modeling the X-Y Relationship in the Plane: We want to explain the relationship between Y and X by
finding the line that traverses the scatterplot as “closely” as possible to each point. This line is called
the “best fit line”. For any value of X, the corresponding point on the best fit line is the model’s best
guess for the value of Y.
Best fit line equation: Y = β0 + β1X. We still need to account for the estimation error, so the full
model is Y = β0 + β1X + ε.
- The ε term represents a vector of errors: the differences between Y and the true
regression line, β0 + β1X.
- The errors, ε, are unknown parameters, so we must estimate them.
Regression models: In the estimated regression model, Y = β̂0 + β̂1X + ε̂, the ε̂ term represents a
vector of residuals: the differences between Y and the estimated best fit line, β̂0 + β̂1X. The
residuals, ε̂, are sample estimates of the errors, ε.
E(Y | X) = β0 + β1X → the left-hand side is the expected mean of Y within the population, conditional on X.
Estimating the Regression Coefficients: The purpose of regression analysis is to use a sample of N
observed {Yn, Xn} pairs to find the best fit line defined by β̂0 and β̂1.
- The most popular method to do this involves minimizing the sum of the squared residuals
(i.e., estimated errors).
Residuals as the Basis of Estimation: The residuals ε̂n are defined in terms of the deviations between
each observed Yn value and the corresponding fitted value Ŷn. Each ε̂n is squared before summing,
which removes negative values and produces a quadratic objective function.
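Concretely, the objective function being minimized is the residual sum of squares (RSS):

```latex
\mathrm{RSS} = \sum_{n=1}^{N} \hat{\varepsilon}_n^2
             = \sum_{n=1}^{N} \left( Y_n - \hat{\beta}_0 - \hat{\beta}_1 X_n \right)^2
```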
The ordinary least squares (OLS) estimates of β1 and β0: The RSS is a very well-behaved objective
function that admits closed-form solutions for the minimizing values of β̂0 and β̂1. In the model
equation, the betas (βs) are the parameters that OLS estimates; epsilon (ε) is the random error.
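For reference, the standard closed-form solutions are (this is the textbook result the notes allude to):

```latex
\hat{\beta}_1 = \frac{\sum_{n=1}^{N} (X_n - \bar{X})(Y_n - \bar{Y})}
                     {\sum_{n=1}^{N} (X_n - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```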
Mean centering: to improve interpretation
- The intercept is defined as the expected value of Y when X = 0. We can use mean centering
so that X = 0 is a meaningful point.
- We mean-center X by subtracting the mean from each Xn.
- Now, suppose the estimated intercept is 143.83. This means that, for an observation at the
average X-value, the expected value of Y is 143.83.
- Centering only translates the scale of the X-axis and does not change the linear relationship.
Thus, the slope won’t change, only the intercept (see the sketch after this list).
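A minimal sketch of mean centering on simulated data (the variable names and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 5.0 + 2.0 * x + rng.normal(scale=1.0, size=200)

# Fit on the raw predictor, then on the mean-centered predictor.
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)
x_centered = x - x.mean()
slope_c, intercept_c = np.polyfit(x_centered, y, deg=1)

# The slope is unchanged; the intercept becomes the expected Y at the average X
# (which equals the mean of Y).
print(slope_raw, slope_c)      # essentially identical
print(intercept_c, y.mean())   # essentially identical
```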
Thinking about Inference: We need to use statistical inference to account for the precision with
which we’ve estimated β̂0 and β̂1. We cannot be sure that the linear relationship will be the same
if we examine a new sample.
- Both of our regression coefficients have normally distributed sampling distributions that we
can use to judge the precision of our estimates.
Standard Errors: The standard deviations of the preceding sampling distributions quantify the
precision of our estimated β̂0 and β̂1.
- The sampling distributions are theoretical entities because the standard error is still an
estimate.
- A large SE means an imprecise estimate; a small SE means a precise estimate (the formulas
below make this concrete).
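For simple linear regression, the corresponding textbook formulas (stated here for reference; the notes above do not spell them out) are:

```latex
\widehat{SE}(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2}},
\qquad
\widehat{SE}(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left[ \frac{1}{N}
  + \frac{\bar{X}^2}{\sum_{n=1}^{N} (X_n - \bar{X})^2} \right]},
\qquad
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{N - 2}
```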
Interpreting Confidence Intervals: Say we estimate a regression slope of β̂1 = 0.5 with an associated
95% confidence interval of CI = [0.25; 0.75]. We don’t talk about 95% probabilities when interpreting
CIs → instead, we talk about 95% confidence.
- The true value of β1 is fixed. β1 is either in our estimated interval or not. Thus, the
probability that β1 is within our estimated interval is either exactly 1 or exactly 0.
- If we collected a new sample (of the same size), re-estimated our model, and re-computed
the 95% CI for β̂1, we would get a different interval. Repeating this process an infinite
number of times results in a distribution of CIs. 95% of those CIs would surround the true
value of β1.
- Thus: We are 95% certain that if we repeated the analysis an infinite number of times, 95% of
the CIs that we would find would surround the true value of β1. → For the example above, this
suggests that we can be 95% confident that the true value of β1 is somewhere between 0.25 and 0.75.
- CIs give us a plausible range for the population value of β → CIs support inferences (the
simulation sketch below illustrates the repeated-sampling interpretation).
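A small simulation of the repeated-sampling interpretation (the true coefficients, sample size, and number of repetitions are illustrative assumptions): the fraction of 95% CIs that surround the true slope should be close to 0.95.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
beta0_true, beta1_true, n = 2.0, 0.5, 50

covered = 0
n_reps = 2000
for _ in range(n_reps):
    x = rng.uniform(0, 10, size=n)
    y = beta0_true + beta1_true * x + rng.normal(scale=1.0, size=n)

    # OLS estimates and the standard error of the slope.
    beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
    resid = y - (beta0_hat + beta1_hat * x)
    sigma2_hat = np.sum(resid**2) / (n - 2)
    se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x.mean())**2))

    # 95% CI for the slope; check whether it surrounds the true value.
    t_crit = st.t.ppf(0.975, df=n - 2)
    lower, upper = beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1
    covered += (lower <= beta1_true <= upper)

print(covered / n_reps)  # should be close to 0.95
```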
Model Fit for Inference: How well does our model describe/represent the real world? It will never
be perfect. Our model explains some proportion of the outcome’s variability.
- The residual variance will be less than Var(Y).
- We reduce the residuals, by adding new variables to the model, until they are meaningless noise.
- We quantify the proportion of the outcome’s variance that is explained by our model using
the R² statistic: R² = (TSS − RSS) / TSS = 1 − RSS / TSS
- TSS = total sum of squares: TSS = Σ (Yn − Ȳ)²
- RSS = residual sum of squares: RSS = Σ (Yn − Ŷn)²
- If R² is 0.62, it means that our predictor(s) explain 62% of the variability in the outcome.
Model Fit for Prediction: When assessing predictive performance, we will most often use the mean
squared error (MSE) as our criterion.
- The MSE quantifies the average squared prediction error. Taking the square root improves
interpretation: the RMSE estimates the typical magnitude of the prediction error, in the units of Y.
- Example: RMSE = 32.06 → we expect prediction errors with a magnitude of about 32.06, on the
scale of Y (see the formulas below).
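For reference, the standard definitions are:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left( Y_n - \hat{Y}_n \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```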