Chapter 2 — Statistical Learning
Y = f(X) + ε. Estimate using Ŷ = f̂(X). The ideal/true f(x) = E(Y | X = x) is called the regression function.
E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² (reducible) + Var(ε) (irreducible). Equivalently, at a point x0: E[(Y − f̂(x0))²] = E[(Y − E[Y | x0])²] (irreducible error) + (f(x0) − f̂(x0))² (reducible error).
Reducible error: E[(f(x0) − f̂(x0))²] = (f(x0) − E[f̂(x0)])² (squared bias of f̂(x0)) + E[(E[f̂(x0)] − f̂(x0))²] (variance of f̂(x0)), where f̂(x0) is a random variable (repeatedly draw training samples from the population).
When estimating f, there are typically too few data points at any given x, so E(Y | X = x) can't be computed exactly. Instead, average over a neighborhood N(x) of x: f̂(x) = Ave(Y | X ∈ N(x)). This nearest-neighbor averaging works well for small p (p ≤ 4) and large N, but can be lousy when p is large because of the curse of dimensionality.
Parametric methods for estimating f(X): 1. Assume a form, e.g. linear: f(X) = β0 + β1 X1 + ... + βp Xp. 2. Fit or train the model, e.g. using least squares if linear. + Simplifies the problem to merely estimating parameters instead of a whole function; very interpretable. − The chosen model likely won't match the true form of f(X) (more flexible parametric models tend to overfit).
Non-parametric methods for estimating f(X): don't explicitly assume a form for f, but seek an estimate of f that gets as close to the data points as possible without being too rough. + Can accurately fit a wider range of possible shapes for f. − Far more observations are required for an accurate estimate.
Trade-off between flexibility (useful mainly for prediction) and interpretability (mainly for inference).
• For regression:
Training MSE (mean squared error): (1/n) Σᵢ (yi − f̂(xi))². Test MSE: Ave(y0 − f̂(x0))². Expected test MSE: E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² (reducible) + Var(ε) (irreducible).
A fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used: as model flexibility increases, training MSE decreases monotonically, while test MSE is typically U-shaped, rising again beyond some flexibility level. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. The flexibility level corresponding to the model with minimal test MSE can vary among datasets. More flexible methods have a higher Var(f̂(x0)) but a lower Bias(f̂(x0)), and vice versa (bias-variance trade-off).
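A small simulation (my own sketch, not from the text) of this behaviour: training MSE keeps dropping as the polynomial degree grows, while test MSE is U-shaped.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                    # assumed true f
x_tr = rng.uniform(0, 3, 60);  y_tr = f(x_tr) + rng.normal(0, 0.4, 60)
x_te = rng.uniform(0, 3, 500); y_te = f(x_te) + rng.normal(0, 0.4, 500)

for degree in [1, 2, 3, 5, 10, 15]:            # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    test_mse  = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```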
• For classification:
Training error rate: (1/n) Σᵢ I(yi ≠ ŷi), where I(yi ≠ ŷi) = 1 if yi ≠ ŷi and 0 otherwise. Test error rate: Ave(I(y0 ≠ ŷ0)).
A Bayes classifier simply picks the class with highest Pr(Y = j | X = x0 ) per observation. Bayes error rate (irreducible): 1 − E(maxj Pr(Y = j | X)) averaging over X . But normally,
distributions are unknown, so KNN (K-nearest Neighbor estimates the conditional probability for class j as the fraction of points of the K observations nearest to x0 whose response values equal j :
Pr(Y = j | X = x0 ) = K 1 P
i∈N I(yi = j), where N0 are K nearest data points. Lower K = lower bias, higher variance.
0
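A bare-bones sketch of this KNN rule (illustrative only; function and variable names are mine): estimate Pr(Y = j | X = x0) from the K nearest training points and pick the argmax, mimicking the Bayes classifier with estimated probabilities.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, K=5):
    # Euclidean distances from x0 to every training point
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:K]            # indices of the K nearest neighbours (N0)
    # estimated Pr(Y = j | X = x0): fraction of the K neighbours with label j
    counts = Counter(y_train[nearest])
    probs = {j: c / K for j, c in counts.items()}
    return max(probs, key=probs.get), probs    # class with the highest estimated probability

# toy usage
X_train = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), K=3))
```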
Chapter 3 — Linear Regression
Simple linear regression: Y ≈ β0 + β1 X, with intercept β0 and slope β1. Estimate using Ŷ = β̂0 + β̂1 X (the least squares line). Least squares minimizes the residual sum of squares (RSS): e1² + ... + en² = Σᵢ (yi − ŷi)², where ei = yi − ŷi = yi − β̂0 − β̂1 xi is the i-th residual. The minimizers are β̂1 = Σᵢ(xi − x̄)(yi − ȳ) / Σᵢ(xi − x̄)² and β̂0 = ȳ − β̂1 x̄, where ȳ = (1/n) Σᵢ yi and x̄ = (1/n) Σᵢ xi are the sample means.
The full form Y = β0 + β1 X + ε is the population regression line. The standard error of an estimator reflects how it varies under repeated sampling: SE(β̂j) tells us the average amount that the estimate β̂j differs from the actual value of βj. For a sample mean, Var(µ̂) = SE(µ̂)² = σ²/n, where σ is the standard deviation of each of the realizations yi of Y. For the coefficients, SE(β̂0)² = σ²[1/n + x̄²/Σᵢ(xi − x̄)²] and SE(β̂1)² = σ²/Σᵢ(xi − x̄)², where σ² = Var(ε). Generally, σ isn't known; its estimate from the data is called the residual standard error (RSE) = √(RSS/(n − 2)). 95% confidence interval: β̂k ± 2·SE(β̂k).
Multiple linear regression: Y = β0 + β1 X1 + ... + βp Xp + ε. We can't use a separate simple regression per predictor, as that ignores the other predictors. In contrast, multiple linear regression estimates the relationship between each Xj and Y while holding the other predictors fixed. Estimate using ŷ = β̂0 + β̂1 x1 + ... + β̂p xp; correlations among the predictors increase the variance of the estimates. Again, use least squares to minimize RSS = Σᵢ (yi − ŷi)² = Σᵢ (yi − β̂0 − β̂1 xi1 − ... − β̂p xip)².
Is at least one of the predictors related to the response?
Use a hypothesis test again: • H0: β1 = ... = βp = 0 • Ha: at least one βj is non-zero. Calculated using the F-statistic: F = [(TSS − RSS)/p] / [RSS/(n − p − 1)] ∼ F_{p,n−p−1}. If the linear model assumptions are true, then E[RSS/(n − p − 1)] = σ², and assuming H0, E[(TSS − RSS)/p] = σ², in which case F ≈ 1. However, if Ha is true, then E[(TSS − RSS)/p] > σ² and F > 1. A relatively large F-statistic is needed to reject H0 if n is small. Assuming H0 and normally distributed errors εi, F follows an F-distribution; for any n and p the p-value can be calculated, which decides whether H0 gets rejected.
A subset of q of the p coefficients can also be tested for zero: • H0: β_{p−q+1} = β_{p−q+2} = ... = βp = 0 • Ha: at least one of those q βj is non-zero. In this case we also fit the model that uses only the first p − q variables, with residual sum of squares RSS₀: F = [(RSS₀ − RSS)/q] / [RSS/(n − p − 1)]. The t-statistic of a coefficient in multiple regression is equivalent to the partial F-test that omits just that coefficient (q = 1). However, the t-statistic does not take the number of predictors into account, while F does. The F-statistic works best if p is relatively small compared to n.
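A quick numpy sketch (my own, with made-up data) of the overall F-statistic computed straight from TSS and RSS as above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 1.0, n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)          # upper-tail probability of F_{p, n-p-1}
print(F, p_value)
```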
Standard errors can be used for hypothesis tests on the coefficients:
• H0: β1 = 0 — there is no relationship between X and Y
• Ha: β1 ≠ 0 — there is some relationship between X and Y
The t-statistic t = β̂1/SE(β̂1) measures the number of standard errors that β̂1 is away from zero. If there really is no relationship between X and Y, then we expect t to follow a t-distribution with n − 2 degrees of freedom. The t-distribution has a bell shape, and for values of n greater than approximately 30 it is quite similar to the normal distribution. The p-value is the probability of observing a number equal to |t| or larger in absolute value, assuming β1 = 0. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, were there no association between X and Y. Hence, a small p-value indicates an association between the predictor and the response.
Now, assuming Ha, we can assess the accuracy of the model: the residual standard error RSE = √(RSS/(n − 2)) = √((1/(n − 2)) Σᵢ (yi − ŷi)²) is the average amount that the response deviates from the true regression line, and is considered a measure of the lack of fit of the model Y = β0 + β1 X + ε to the data.
R² is another measure of fit: the proportion of variance in Y explained by X. R² = (TSS − RSS)/TSS = 1 − RSS/TSS, where TSS = Σᵢ (yi − ȳ)² is the total sum of squares. TSS measures the total variance of Y (inherent in the response before the regression), while RSS measures the amount of variability left unexplained after the regression. Like R², correlation also measures the linear relationship between X and Y: Cor(X, Y) = Σᵢ(xi − x̄)(yi − ȳ) / [√Σᵢ(xi − x̄)² · √Σᵢ(yi − ȳ)²] = r, and R² = r² (simple linear regression only).
Which subset of the predictors is relevant?
Since assessing the fit of all 2^p models is infeasible, we instead use one of three approaches: 1) Forward selection: a greedy algorithm that repeatedly adds the predictor giving the lowest RSS (it might include variables early that later become redundant). 2) Backward selection: starting with all variables, greedily remove the one with the largest p-value (cannot be used if p > n). 3) Mixed selection: forward selection, but if at any point one of the p-values becomes too large, remove that variable.
How well does the model fit the data?
R² and RSE are also common measures of fit for multiple linear regression. R² = Cor(Y, Ŷ)² (in fact, the fitted linear model maximizes this correlation between the response and the fitted values among all possible linear models). However, R² will always be higher for more variables when evaluated on the training set. In multiple linear regression, RSE = √(RSS/(n − p − 1)), which does penalize redundant variables (those giving too little decrease in RSS).
What response value should we predict and how accurate is that prediction?
Three sorts of uncertainty are associated with a prediction ŷ = β̂0 + β̂1 X1 + ... + β̂p Xp:
1. The least squares plane Ŷ = β̂0 + β̂1 X1 + ... + β̂p Xp is only an estimate for the true population regression plane f(X) = β0 + β1 X1 + ... + βp Xp (related to the reducible error; confidence intervals can be calculated to assess accuracy).
2. Since linear models are often an approximation, there is an additional source of potentially reducible error called model bias. The book ignores this, though.
3. Random, irreducible error ε. Prediction intervals are confidence intervals that also include the irreducible error. So a confidence interval concerns the average value f(X), while a prediction interval concerns a particular value Y = f(X) + ε. Both intervals have the same center/mean.
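To make the confidence-vs-prediction-interval distinction concrete, here is a hand-rolled numpy sketch (illustrative; the formulas are the standard OLS ones, the data and names are made up): both intervals share the center x0ᵀβ̂, but the prediction interval adds the irreducible Var(ε) term.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 80, 2
X = rng.normal(size=(n, p))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, n)

Xd = np.column_stack([np.ones(n), X])                  # add intercept column
XtX_inv = np.linalg.inv(Xd.T @ Xd)
beta_hat = XtX_inv @ Xd.T @ y
rss = np.sum((y - Xd @ beta_hat) ** 2)
rse = np.sqrt(rss / (n - p - 1))                       # estimate of sigma

x0 = np.array([1.0, 0.5, -0.5])                        # new point (with leading 1)
y0_hat = x0 @ beta_hat                                 # shared center of both intervals
t = stats.t.ppf(0.975, n - p - 1)
se_fit = rse * np.sqrt(x0 @ XtX_inv @ x0)              # uncertainty in f(x0) only
se_pred = rse * np.sqrt(1 + x0 @ XtX_inv @ x0)         # plus irreducible error

print("confidence:", (y0_hat - t * se_fit,  y0_hat + t * se_fit))
print("prediction:", (y0_hat - t * se_pred, y0_hat + t * se_pred))
```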
For a binary qualitative predictor (factor), create an indicator/dummy variable: xi = 1 if first value, 0 if second value. In regression: yi = β0 + β1 xi + εi = β0 + β1 + εi if first value; β0 + εi if second value.
We could also choose xi = 1 if first value, −1 if second value, in which case yi = β0 + β1 xi + εi = β0 + β1 + εi if first value; β0 − β1 + εi if second value. Now, β0 is the average over both values.
For three or more values, create a dummy variable per value j (for all values except the last): xij = 1 if the i-th observation takes the j-th value, 0 otherwise. Then yi = β0 + β1 xi1 + β2 xi2 + εi = β0 + β1 + εi if first value; β0 + β2 + εi if second value; β0 + εi if third value.
Now, β0 is the coefficient for the baseline value, and βj is the difference between the baseline and the corresponding value (level). The final predictions do not depend on the coding of the values, but the coefficients and their p-values do. Therefore, rather than rely on individual coefficients, we can use an F-test to test H0: β1 = β2 = 0.
Relaxing two highly restrictive assumptions of the linear model:
• Additive assumption, which assumes that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors.
Taking synergy into account, we add an interaction term: Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε = β0 + (β1 + β3 X2) X1 + β2 X2 + ε = β0 + β̃1 X1 + β2 X2 + ε, where β̃1 = β1 + β3 X2. The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, since interactions are hard to interpret in a model without main effects: their meaning changes.
Mix of a quantitative and a qualitative binary variable, without interaction (parallel lines): f(X) ≈ β0 + β1 X1 + (β2 if first value; 0 if second value) = (β0 + β2) + β1 X1 if first value; β0 + β1 X1 if second value.
Mix of a quantitative and a qualitative binary variable, with interaction: f(X) ≈ β0 + β1 X1 + (β2 + β3 X1 if first value; 0 if second value) = (β0 + β2) + (β1 + β3) X1 if first value; β0 + β1 X1 if second value.
• Linear assumption, which assumes that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj .
Use polynomial regression: include transformed versions of the predictors, e.g. for a quadratic shape: Y = β0 + β1 X1 + β2 X1² + ε (still a linear model, with X2 = X1²; see the design-matrix sketch below).
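A sketch (toy data and names are mine) combining both relaxations in one least-squares fit: an interaction column X1·dummy and a squared column X1², both of which keep the model linear in its coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
x1 = rng.uniform(0, 10, n)                  # quantitative predictor X1
d = rng.integers(0, 2, n)                   # 0/1 dummy for a binary factor
y = 1 + 2*x1 + 3*d + 0.5*x1*d - 0.1*x1**2 + rng.normal(0, 1, n)

# columns: intercept, X1, dummy, interaction X1*dummy, quadratic X1^2
Xd = np.column_stack([np.ones(n), x1, d, x1 * d, x1**2])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(np.round(beta_hat, 2))                # approx [1, 2, 3, 0.5, -0.1]
```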
Potential problems:
1. Non-linearity of the data: use a residual plot (of residuals ei = yi − ŷi versus the predictors xi, or the fitted values ŷi). Patterns suggest something wrong with the model.
2. Correlation of the errors εi. Causes underestimated standard errors and confidence intervals that are narrower than they should be. Time series data often shows tracking: adjacent residuals have similar values.
3. Non-constant variance of the errors, called heteroscedasticity (Var(εi) ≠ σ²). Shows up as a funnel shape in the residual plot. You could transform the response with a concave function like √Y or log(Y).
4. Outliers (observations with a large residual ei = yi − ŷi) can deteriorate the RSE and related statistics. A studentized residual (ei divided by its estimated standard error) above 3 flags a possible outlier.
5. High-leverage observations have unusual predictor values xi. Leverage statistic (simple regression): hi = 1/n + (xi − x̄)² / Σ_{i'}(xi' − x̄)², which lies between 1/n and 1; the average leverage is always (p + 1)/n.
6. Collinearity: two or more predictors are closely (cor)related. Reduces certainty in the coefficient estimates (larger standard errors) and the power (probability of correctly detecting a non-zero coefficient) of hypothesis tests. Detect collinearity by inspecting the correlation matrix for high values. This doesn't work for multicollinearity (collinearity among three or more variables), so compute the variance inflation factor (VIF), the ratio of the variance of β̂j when fitting the full model divided by the variance of β̂j if fit on its own: VIF(β̂j) = 1 / (1 − R²_{Xj|X−j}), where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other predictors. If R²_{Xj|X−j} is close to 1, collinearity is present. VIF ≥ 1 always; a VIF above 5 or 10 indicates collinearity (see the diagnostics sketch after this list).
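A sketch (my own helper names, toy data) of two of these diagnostics: leverage values from the diagonal of the hat matrix, and VIF for each predictor from an auxiliary regression.

```python
import numpy as np

def leverage(Xd):
    # diagonal of the hat matrix H = X (X'X)^{-1} X'; averages to (p+1)/n
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    return np.diag(H)

def vif(X, j):
    # regress X_j on all other predictors and return 1 / (1 - R^2)
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)   # make predictor 3 collinear
Xd = np.column_stack([np.ones(100), X])
print(leverage(Xd).mean())                  # ~ (p+1)/n = 4/100
print([round(vif(X, j), 2) for j in range(3)])
```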
K-nearest neighbor regression (KNN regression) is a non-parametric method (it assumes no form for f(X)): f̂(x0) = (1/K) Σ_{xi∈N0} yi, where N0 contains the K training observations closest to x0. Higher K ⇒ smoother fit. The optimal K depends on the bias-variance trade-off. The parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f(X), in which case the non-parametric approach incurs a cost in variance that is not offset by a reduction in bias. KNN also tends to perform worse in higher dimensions (curse of dimensionality: with more predictors, the "nearest" K observations are farther away).
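A minimal sketch of KNN regression (names and data are mine): average the responses of the K nearest training points.

```python
import numpy as np

def knn_regress(X_train, y_train, x0, K=5):
    # f-hat(x0) = average of y over the K training points nearest to x0
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:K]
    return y_train[nearest].mean()

rng = np.random.default_rng(6)
X_train = rng.uniform(0, 3, (100, 1))
y_train = np.sin(2 * X_train[:, 0]) + rng.normal(0, 0.3, 100)
print(knn_regress(X_train, y_train, np.array([1.5]), K=10))
```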
Chapter 4 — Classification
For a binary response, linear regression is an option (and is equivalent to linear discriminant analysis (LDA)), though predicted probabilities can fall outside [0, 1]. Only when the qualitative response's values have a natural ordering with equal gaps between them (or the response is binary) can we encode the categories as a quantitative response and use linear regression. In other cases we have to use logistic regression, which models the probability of a value of the response using the logistic function: Pr(Y = 1 | X) = p(X) = e^(β0+β1X) / (1 + e^(β0+β1X)) and Pr(Y = 0 | X) = 1 / (1 + e^(β0+β1X)).
The odds p(X)/(1 − p(X)) = e^(β0+β1X) can take on any value between 0 and ∞. The log-odds/logit: ln(p(X)/(1 − p(X))) = β0 + β1 X.
Where in linear regression we used least squares to estimate β0 and β1, in logistic regression we use maximum likelihood, which chooses β̂0 and β̂1 to maximize the likelihood function: ℓ(β0, β1) = Π_{i: yi=1} p(xi) · Π_{j: yj=0} (1 − p(xj)). Significance of β1 can again be assessed using the z-statistic z = β̂1/SE(β̂1) (the same construction as the t-statistic in linear regression). The intercept β̂0 is generally not of interest.
Qualitative predictors are again encoded in dummy variables.
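A compact sketch (toy data; the gradient-ascent fitting details are my own choice, not the book's) of estimating β0, β1 by maximizing the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))            # true beta0 = -1, beta1 = 2
y = rng.binomial(1, p_true)

beta0, beta1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))          # logistic function p(X)
    # gradient ascent on the log-likelihood sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    beta0 += lr * np.mean(y - p)
    beta1 += lr * np.mean((y - p) * x)

print(beta0, beta1)   # should be near the true values
```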