Lecture 1 — Introduction Deductive reasoning is truth-preserving, while inductive reasoning adds information (generalizes). Imputation: filling in missing values (danger of influencing model).
Missing data mechanisms (the mechanism cannot be inferred from the observed data alone; it must be known from knowledge about how the data were collected. If not, assume MAR and pray):
• MCAR (missing completely at random): i.e. Pr(X =?) = p. Removing sample: won’t cause bias. Imputation: won’t cause bias if value is taken randomly from sample with same class.
• MAR (missing at random): probability of missing depends on class i.e. Pr(X =? | Y = 0) = p and Pr(X =? | Y = 1) = q . Removing sample: will cause bias. Imputation: works as before.
• MNAR (missing not at random): probability that value is missing depends on the value itself, i.e. Pr(X =? | X < c) = p and Pr(X =? | X ≥ c) = q . Can’t be salvaged.
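A quick Python sketch (hypothetical data; the class means and the 0.8/0.2 missingness rates are invented for illustration) of why removing incomplete samples is harmless under MCAR but biased under MAR:

```python
import random

random.seed(0)
n = 100_000
rows = []
for _ in range(n):
    y = 1 if random.random() < 0.5 else 0        # class label, Bernoulli(0.5)
    x = random.gauss(1.0 if y else 0.0, 1.0)     # X depends on the class
    # MAR: Pr(X = ? | Y = 1) = 0.8 and Pr(X = ? | Y = 0) = 0.2
    missing = random.random() < (0.8 if y else 0.2)
    rows.append((x, y, missing))

full_mean = sum(x for x, _, _ in rows) / n
complete = [x for x, _, m in rows if not m]       # listwise deletion
cc_mean = sum(complete) / len(complete)
# full-sample mean of X is about 0.5; after deletion it drops to about 0.2,
# because samples with y = 1 (high X) are removed far more often
print(round(full_mean, 2), round(cc_mean, 2))
```

Under MCAR the same deletion would leave the mean unchanged, since missingness would be independent of both X and Y.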
Lecture 2 — Classification Trees (1) Classification trees essentially partition the attribute space. An impurity measure/function i(t) = Φ(p1, ..., pJ), where the pj are the relative frequencies of
the J classes, indicates how “pure” node t is, i.e. how close it comes to containing only samples from a single class. The quality of a binary split s in node t is the reduction of impurity that it achieves:
Δi(s, t) = i(t) − [π(l)i(l) + π(r)i(r)], where l and r are the left and right child nodes of t, and π(l) and π(r) are the proportions of samples sent to l and r.
Impurity functions:
• Resubstitution error; i(t) = 1 − maxj p(j | t) where p(j | t) is the relative frequency of class j in node t. Fraction of cases that is classified incorrectly if we assign all samples to the majority
class. It is not concave/doesn’t belong to F ; it only decreases at a constant rate as the node becomes purer, meaning it doesn’t give priority to purer nodes.
• Gini-index (CART, Rpart); i(t) = p(0 | t)p(1 | t) = p(0 | t)(1 − p(0 | t)); essentially attempts to minimize the variance p(1 − p) of a Bernoulli random variable with the class distribution as success probability
• Entropy (C4.5, Rpart); i(t) = −p(0 | t) log p(0 | t) − p(1 | t) log p(1 | t) = −p(0 | t) log p(0 | t) − (1 − p(0 | t)) log(1 − p(0 | t))
Concave impurity functions (for two-class problems) of the class F have the properties: • Φ(0) = Φ(1) = 0 • Φ(p(0)) = Φ(1 − p(0)) (symmetry) • Φ′′(p(0)) < 0 for 0 < p(0) < 1 (strictly concave)
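The impurity functions above can be sketched in Python (function names are mine), together with the split-quality formula Δi(s, t):

```python
import math

def gini(p0):            # Gini index for a two-class node
    return p0 * (1 - p0)

def entropy(p0):         # entropy (in bits) for a two-class node
    if p0 in (0.0, 1.0):
        return 0.0
    return -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)

def resub_error(p0):     # resubstitution error: misclassified fraction
    return 1 - max(p0, 1 - p0)

def split_quality(imp, p0_t, p0_l, p0_r, pi_l):
    """Impurity reduction Δi(s, t) = i(t) − [π(l)·i(l) + π(r)·i(r)]."""
    return imp(p0_t) - (pi_l * imp(p0_l) + (1 - pi_l) * imp(p0_r))

# A perfectly separating split removes all Gini impurity of a 50/50 node:
print(split_quality(gini, 0.5, 1.0, 0.0, 0.5))  # → 0.25
```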
- Splits on categorical attribute with L values: (2^L − 2)/2 = 2^(L−1) − 1 distinct splits to consider. For two-class problems and Φ ∈ F, we only have to check L − 1 possible splits. Sort the values bl on their value of
p(0 | x = bl) like so: p(0 | x = bl1) ≤ ... ≤ p(0 | x = blL), where x is a categorical attribute with values bl1 through blL. Then one of the subsets {bl1, ..., blh}, h = 1, ..., L − 1, is the optimal split.
- Splits on numerical attribute: we only consider splits (halfway) between two consecutive values of x, and there are at most n distinct values of a numeric attribute in the training sample. However,
optimal splits can only occur at segment borders, where a segment is a block of consecutive values of the split attribute for which the class distribution is identical.
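A minimal sketch (the helper name is mine) of the candidate thresholds for a numeric split, taken halfway between consecutive distinct values:

```python
def candidate_splits(values):
    """Candidate thresholds: midpoints between consecutive distinct values."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

print(candidate_splits([3, 1, 2, 2, 4]))  # → [1.5, 2.5, 3.5]
```

In a real implementation one would additionally skip thresholds inside a segment (a run of consecutive values with identical class distribution), since the optimum can only lie on a segment border.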
Lecture 3 — Classification Trees (2) Unless the problem is truly deterministic, we will likely overfit if we continue splitting nodes until all are pure. Solutions:
• Stopping rules; which decide when to stop splitting. Disadvantage: good split may come after a weak split. We only look one step ahead and may miss the good follow-up split.
• Pruning; first grow a very large tree on the training sample, then prune.
Cost-complexity pruning reduces the number of pruned subtrees we have to consider by selecting the ones that are the “best of their kind”. Total cost of tree T: Cα(T) = R(T) + α|T̃| = Σ_{t∈T̃} (R(t) + α),
where R(T) = (# wrong classifications made by T)/(# training samples) is the resubstitution error of the whole tree T, R(t) = (# wrong classifications made by node t)/(# training samples) is the number of errors we
make in node t if we predict the majority class, divided by the total number of observations (not the number of observations in the node!) in the training sample, T̃ is the set of leaf nodes of T, and
α|T̃| a penalty for the complexity of the tree. For any α, there exists a smallest minimizing subtree T(α) of Tmax that satisfies 1) Cα(T(α)) = min_{T≤Tmax} Cα(T) (T(α) minimizes total
cost for that value of α), and 2) if Cα(T′) = Cα(T(α)) then T(α) ≤ T′ (T(α) is a pruned subtree of all trees that minimize total cost). Note: T′ ≤ T means T′ is a pruned subtree of T, i.e. it
can be obtained by pruning T in 0 or more nodes. Now, construct a decreasing sequence of pruned subtrees of Tmax , like so: Tmax > T1 > ... > {root}, such that Tk is the smallest minimizing
subtree for α ∈ [αk , αk+1 ). (It is important that Tk+1 is a pruned subtree of Tk , so that no backtracking is required.) After pruning in t, t’s contribution to total cost is Cα ({t}) = R(t) + α.
The contribution of Tt to the total cost is Cα(Tt) = R(Tt) + α|T̃t|, where Tt is the subtree of T rooted in node t, and R(Tt) = Σ_{t′∈T̃t} R(t′). (For clarity: when pruned in t, node t itself remains
in T − Tt.) Now, T − Tt becomes better than T when Cα({t}) = Cα(Tt). This happens at α = (R(t) − R(Tt))/(|T̃t| − 1) (because R(t) + α = R(Tt) + α|T̃t|). For each non-terminal node t we compute its
“critical” alpha value: g(t) = (R(t) − R(Tt))/(|T̃t| − 1) = (increase in error due to pruning in t)/(decrease in # leaf nodes due to pruning in t). Then we prune (top-down) in the nodes for which g(t) is the smallest (the “weakest links”). Repeat
until the root node is reached. Recalculate the critical alpha values gk(t) in every iteration! This yields a sequence of intervals [αk, αk+1), with a smallest minimizing subtree Tk per interval.
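The critical alpha value can be computed directly from its definition (a toy sketch; the function name and numbers are invented):

```python
def critical_alpha(r_node, r_subtree, n_leaves):
    """g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1): error increase per leaf removed."""
    return (r_node - r_subtree) / (n_leaves - 1)

# Pruning in t would raise the error from 0.04 to 0.10 and remove 3 leaves:
print(critical_alpha(0.10, 0.04, 4))  # ≈ 0.02
```

The node with the smallest g(t) is the weakest link: it buys the least error reduction per extra leaf.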
Finally, pick the T from the sequence with the lowest error rate Rts(T) on the test set (an estimate of the true error rate R∗(T)). The standard error of this estimate: SE(Rts) = √(Rts(1 − Rts)/ntest).
1-SE rule: select the smallest tree with Rts within one standard error of the minimum.
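A minimal sketch of the 1-SE rule (the function name and the subtree list are invented; subtrees are (number of leaves, test error) pairs):

```python
import math

def one_se_select(subtrees, n_test):
    """Pick the smallest subtree whose test error is within one standard
    error of the minimum test error (the 1-SE rule)."""
    best_err = min(err for _, err in subtrees)
    se = math.sqrt(best_err * (1 - best_err) / n_test)
    eligible = [t for t in subtrees if t[1] <= best_err + se]
    return min(eligible, key=lambda t: t[0])   # smallest tree among eligible

# min error 0.20, SE = √(0.2·0.8/100) = 0.04, so 0.22 is still eligible:
print(one_se_select([(10, 0.20), (5, 0.22), (2, 0.30)], 100))  # → (5, 0.22)
```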
Can also use cross-validation: Construct a tree on all data, compute the sequence of subtrees/α-intervals, and set a β-value that represents each interval: β1 = 0, β2 = √(α2α3), ..., βK−1 = √(αK−1αK), βK = ∞.
Then divide the data into v folds. For each fold Gj: build a tree on all data except Gj, determine the smallest minimizing subtrees T(j)(β1), ..., T(j)(βK), and
compute the error for all those subtrees separately on Gj. Then, for each βk: sum the error of T(j)(βk) over all folds j = 1, ..., v. Pick the β with the lowest overall error (could also use the 1-SE
rule).
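The β-representatives can be sketched as follows (a sketch under the assumption that the α-sequence starts at α1 = 0 and has K entries; the function name is mine):

```python
import math

def beta_values(alphas):
    """Geometric-mean representatives of the α-intervals [αk, αk+1):
    β1 = 0, βk = √(αk·αk+1) for 2 ≤ k ≤ K−1, βK = ∞, with K = len(alphas)."""
    betas = [0.0]
    betas += [math.sqrt(a * b) for a, b in zip(alphas[1:], alphas[2:])]
    betas.append(float("inf"))
    return betas

print(beta_values([0.0, 0.01, 0.04, 0.09]))
```

The geometric mean is used because the α-scale is multiplicative: each βk should sit “in the middle” of its interval on a log scale.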
Lecture 4 — Bagging and Random Forests Mean squared prediction error of a model at some fixed point x: E_{P(Y|x)}[(Y − f̂(x))²] = (f(x) − f̂(x))² + E_{P(Y|x)}[(Y − f(x))²], where f(x) ≡ E[Y |
x] is the true (population) regression function, and the expectation is taken with respect to P(Y | x). The irreducible error is the error of the best possible prediction (which is f(x) ≡ E[Y | x]), and
is equal to the variance of Y around the regression line at the point x. The reducible error is decomposable into bias and variance: E_D[(f(x) − f̂(x))²] = (f(x) − E_D[f̂(x)])² + E_D[(f̂(x) − E_D[f̂(x)])²].
As sample size increases, variance goes down, but bias remains the same. Hence, we can afford to fit more complex models if the data set is large. In practice, we have to find the right trade-off
between bias and variance in order to get small prediction error.
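A Monte-Carlo sketch of the bias–variance decomposition at a fixed point x (the deliberately biased constant-shift estimator and all numbers are invented for illustration):

```python
import random

random.seed(1)
f_x = 2.0                       # true f(x) at the point of interest
est = []
for _ in range(20000):          # many training sets D
    ys = [f_x + random.gauss(0, 1) for _ in range(10)]   # 10 noisy draws of Y
    est.append(sum(ys) / len(ys) + 0.1)                  # biased estimator f̂(x)

mean_est = sum(est) / len(est)
bias2 = (f_x - mean_est) ** 2                            # (f(x) − E_D[f̂(x)])²
var = sum((e - mean_est) ** 2 for e in est) / len(est)   # E_D[(f̂(x) − E_D[f̂(x)])²]
mse = sum((f_x - e) ** 2 for e in est) / len(est)        # E_D[(f(x) − f̂(x))²]
# bias² ≈ 0.01, variance ≈ 0.1, and MSE = bias² + variance
print(round(bias2, 3), round(var, 3), round(mse, 3))
```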
Single trees on their own are poor predictors because of their high variance. Averaging the predictions using bootstrap aggregating/bagging: 1) Draw a sample with replacement from the training set,
of the same size as the training set. 2) Grow a tree on this sample (no pruning). 3) Repeat these steps M times to create M different trees. Prediction: 1) Predict a test sample using each of
the M trees in turn. 2) Take the majority vote of the M predictions as the final prediction. (For regression, just take the average: f̂BAG(x) = (1/M) Σ_{m=1}^{M} f̂m(x).) The output of each model can
be written as the true value plus an error term: f̂m(x) = f(x) + em(x). The expected squared error of the m-th model then becomes E_{P(x)}[(f̂m(x) − f(x))²] = E_{P(x)}[em(x)²],
where the expectation is taken with respect to the distribution of x. The average error made by the models acting individually is therefore ErrorAV = (1/M) Σ_{m=1}^{M} E_{P(x)}[em(x)²]. Similarly, the
expected error from the committee is given by ErrorBAG = E_{P(x)}[((1/M) Σ_{m=1}^{M} f̂m(x) − f(x))²] = E_{P(x)}[((1/M) Σ_{m=1}^{M} em(x))²]. Assume that E_{P(x)}[em(x)] = 0 for all m, and that the errors
of different models are uncorrelated, so that E_{P(x)}[em(x)en(x)] = 0 for all m ≠ n; then we obtain ErrorBAG = (1/M²) Σ_{m=1}^{M} E_{P(x)}[em(x)²] = (1/M) ErrorAV (a *sensational* reduction!). In order to
make the E_{P(x)}[em(x)en(x)] = 0 assumption more plausible (if we don't, the reduction in overall error tends to be much smaller than suggested by this formula, because errors of individual models are
(sometimes highly) correlated), we use random forests as an attempt to “de-correlate” the predictions of the individual trees: each time we have to determine the best split in a node, we first randomly
select a subset of the features, and then determine the best split on this random subset of features, and perform that split (otherwise the procedure is identical to bagging).
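The bagging procedure above can be sketched minimally (helper names are mine; a real implementation grows a tree on each bootstrap replicate instead of just drawing it):

```python
import random

def bootstrap(sample, rng):
    """Draw a bootstrap replicate: same size, sampled with replacement."""
    return [rng.choice(sample) for _ in sample]

def bagged_predict(predictions):
    """Majority vote over the M individual tree predictions."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(42)
data = list(range(10))
print(len(bootstrap(data, rng)))        # → 10 (same size as the original)
print(bagged_predict([1, 0, 1, 1, 0]))  # → 1 (three votes against two)
```

A random forest would additionally restrict each tree's split search to a fresh random feature subset, which is what de-correlates the per-tree errors.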
Lecture 5 and 6 — Undirected Graphical Models for Discrete Data
Saturated model: If n(x, y) is the number of observations which have particular values for x and y, then P̂(x, y) = n(x, y)/n is an estimate of the joint probability distribution of x and y. In the
saturated model, the fitted counts n̂(x, y) = nP̂(x, y) are the same as the observed counts themselves. This model suffers from the curse of dimensionality and doesn't scale very well: the number
of probabilities that need to be estimated grows exponentially in the number of variables. For k categorical variables with m possible values each, there are m^k − 1 probabilities to estimate (the
−1 is because of the sum-to-one constraint on probability distributions). Look for appropriate independence assumptions to find a simpler model that still gives a good fit in order to avoid the curse.
Independence model: assume x and y are independent, P̂(x, y) = P̂(x)P̂(y) = (n(x)/n)·(n(y)/n) = n(x)n(y)/n². Now we only have to estimate (m − 1)k values. The fitted counts of the independence model
are given by n̂(x, y) = nP̂(x, y) = n·n(x)n(y)/n² = n(x)n(y)/n. They are not the same as the observed counts anymore, but quite close (a satisfactory fit of the data). However, the independence
assumption usually doesn't hold. Interesting models are somewhere in between saturated and mutual independence: this requires the notion of conditional independence. X and Y are independent if
and only if P(x, y) = P(x)P(y) for all values (x, y). As a consequence, P(x | y) = P(x), and P(y | x) = P(y). In words, X and Y do not provide information about each other: X ⊥⊥ Y.
P(X) = Σ_Y P(X, Y) (sum rule); P(X, Y) = P(X)P(Y | X) (product rule); P(X, Y) = P(X)P(Y) (product rule iff X ⊥⊥ Y)
We can prove that X and Y are independent by using the factorization criterion: X ⊥⊥ Y iff there are functions g(x) and h(y) (not necessarily the marginal distributions of X and Y) such that
P(x, y) = g(x)h(y) (in logarithmic form: log P(x, y) = log g(x) + log h(y)). Proof: if P(x, y) = g(x)h(y), then P(x) = Σ_y P(x, y) = Σ_y g(x)h(y) = g(x) Σ_y h(y) = c1·g(x), so g(x) is
proportional to P(x). Likewise, h(y) is proportional to P(y). Therefore, P(x, y) = g(x)h(y) = (1/c1)P(x)·(1/c2)P(y) = c3·P(x)P(y). Summing over both x and y establishes that c3 = 1.
X and Y are conditionally independent given Z iff P(x, y | z) = P(x | z)P(y | z) for all values (x, y) and for all values z for which P(z) > 0. Equivalently, P(x | y, z) = P(x | z). In words, if
I know the value of Z, then Y doesn't provide any additional information about X. We also write X ⊥⊥ Y | Z. An equivalent formulation is P(x, y, z) = P(x, z)P(y, z)/P(z). Factorisation criterion:
X ⊥⊥ Y | Z iff there exist functions g and h such that P(x, y, z) = g(x, z)h(y, z) for all (x, y) and for all z for which P(z) > 0.
Random vector X = (X1, ..., Xk) with probability distribution P(X) has the undirected conditional independence graph G = (K, E) with K = {1, 2, ..., k}, where {i, j} is not in the edge set E iff
Xi ⊥⊥ Xj | rest. The set a separates node i from node j iff every path from node i to node j contains one or more nodes in a. a separates b from c (with a, b and c disjoint) iff a separates i from j
for every i ∈ b and j ∈ c (this can greatly simplify independence assumptions, since a can be a small subset of rest). Equivalent Markov properties: saying “Xi ⊥⊥ Xj | rest for all non-adjacent vertices
i and j” (pairwise property) is equivalent to saying “Xb ⊥⊥ Xc | Xa if a separates b from c (with a, b and c disjoint)” (global property), which is equivalent to saying “Xi ⊥⊥ rest | blanket(i)” (local
property).
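Separation can be checked with a simple graph search (a sketch; function and variable names are mine):

```python
def separates(adj, a, b, c):
    """True iff vertex set a separates b from c in the undirected graph adj:
    search from b while avoiding a; a separates iff no vertex of c is reached."""
    blocked, seen = set(a), set()
    frontier = [v for v in b if v not in blocked]
    seen.update(frontier)
    while frontier:
        v = frontier.pop()
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                frontier.append(w)
    return not (seen & set(c))

chain = {1: {2}, 2: {1, 3}, 3: {2}}      # graph 1 — 2 — 3
print(separates(chain, {2}, {1}, {3}))   # → True, so X1 ⊥⊥ X3 | X2
```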
The cross-product ratio quantifies the association between two binary variables: CPR(X1, X2) = P(0, p)P(1, q) / (P(0, q)P(1, p)), with 0 and 1 being values of X1, and p and q being values of X2 (divide the product
of the main diagonal of the probability table by the product of the other diagonal). CPR = 1: X1 ⊥⊥ X2; CPR > 1: positive association; CPR < 1: negative association.
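A direct transcription of the CPR definition (the function name and example tables are mine):

```python
def cpr(p):
    """Cross-product ratio of a 2×2 joint probability table p[x1][x2]:
    CPR = P(0,0)·P(1,1) / (P(0,1)·P(1,0))."""
    return (p[0][0] * p[1][1]) / (p[0][1] * p[1][0])

print(cpr([[0.25, 0.25], [0.25, 0.25]]))  # → 1.0, i.e. independence
```

A table concentrated on the main diagonal, e.g. [[0.4, 0.1], [0.1, 0.4]], gives CPR = 16 (strong positive association); swapping the diagonals gives CPR = 1/16.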
General log-linear expansion: The log-linear expansion of the probability distribution PK is log PK(x) = Σ_{a⊆K} ua(xa), where the sum is taken over all possible subsets a of K = {1, 2, ..., k}. We
get a (set of) u-term(s) for each subset of the variables. Code the values of Xi as {0, 1, ..., di − 1}, where di is the size of the domain of Xi. Set ua(xa) = 0 whenever xi = 0 for any Xi with i ∈ a.
This is analogous to the case where X is binary. There are as many u-terms in the full log-linear expansion as there are cells in the contingency table. If (Xa, Xb, Xc) is a partitioned random vector,
then Xb ⊥⊥ Xc | Xa iff all u-terms in the log-linear expansion with coordinates in both b and c are 0. In most applications, it does not make sense to include the three-way association u123 unless
the two-way associations u12, u13 and u23 are all present as well. A log-linear model is said to be hierarchical if the presence of a term implies that all lower-order terms are also present (a hierarchical
model is uniquely identified by listing its highest-order interaction terms). A log-linear model is a graphical model for X if all its constraints can be read from the independence graph. A graphical
model is a hierarchical model in which the highest-order interaction terms correspond to the cliques in the graph.
Maximum Likelihood estimator of graphical log-linear model M returns estimates of the cell probabilities (u-terms) that maximize the probability of the observed data, subject to the constraint that
the conditional independencies of M are satisfied by the estimates. Nice property: whenever the subset of vertices a in the graph forms a clique (a maximal complete subgraph), the ML estimator
of graphical model M satisfies the likelihood equations n̂a^M = N·P̂a^M = na, i.e. observed counts = fitted counts for every marginal table corresponding to a clique. (Remember this holds for the
saturated model, which has no constraints.) The same likelihood equations hold for all hierarchical models, where the margins a correspond to the highest-order interaction terms in the model.
If a forms a clique and the random vector X can be partitioned into (Xa, Xb), with b containing all variables not in a, then we can write P(X) = P(Xa)P(Xb | Xa), and since P(Xa) is not
constrained by the model (constraint = independency), all model constraints apply only to P(Xb | Xa) and the ML estimates will yield n̂a = na. In other words, if we can rewrite a distribution
(using independencies from the graph) to an expression containing only terms concerning (subsets of) the cliques of the graph (these can't be conditional distributions, of course), we have a closed-form solution
for the ML fitted counts.
Not all models have a nice closed-form solution (are decomposable). Iterative Proportional Fitting is an algorithm to compute the maximum likelihood fitted counts for hierarchical log-linear models. Say
we have m margins {a1, ..., am} to be fitted (∪i ai = K). We have to find a table n̂(x) that agrees with the observed table n(x) on the m margins corresponding to the subsets ai. Starting
with a table n̂^(0) of uniform counts (all cells are 1), the algorithm cycles through the list of subsets a = ai, i = 1, ..., m, fitting n̂(x) to each margin in turn. This is repeated until convergence
is reached, i.e. all margin constraints are (approximately) satisfied simultaneously. To fit to the margin a, the observed count na(xa) on xa is distributed over n̂ab(xa, xb)^(t+1) according to
n̂ab(xa, xb)^(t+1) = na(xa)·P̂(xb | xa)^(t), where b is the complement of a, and P̂(xb | xa)^(t) = n̂ab(xa, xb)^(t)/n̂a(xa)^(t), i.e. the current estimate of P(Xb = xb | Xa = xa). Decomposable
models have closed-form solutions for MLEs and have triangulated independence graphs, i.e. graphs with no chordless cycles (cycles in which only the successive pairs of vertices are connected by an edge/no
shortcut) of length greater than three. An ordering C1, ..., Cm of the cliques of the graph has the running intersection property (RIP) iff Cj ∩ (C1 ∪ ... ∪ Cj−1) ⊆ Ci for some i < j, and for
j = 2, ..., m. We define the corresponding separator sets Sj = Cj ∩ (C1 ∪ ... ∪ Cj−1), with S1 = ∅. The ML fitted counts are then given by n̂(x) = Π_{j=1}^{m} n(xCj) / Π_{j=2}^{m} n(xSj), where
n(x∅) = N. If the cliques of a decomposable model are presented in RIP order to IPF, then the algorithm will converge in one iteration/cycle; otherwise IPF will converge in two iterations.
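A minimal IPF sketch for a binary table (all names and the example table are invented; margins are given as tuples of variable indices):

```python
from itertools import product

def ipf(n_obs, margins, k, cycles=20):
    """Iterative Proportional Fitting on a k-way binary table.
    n_obs: dict mapping a 0/1 cell tuple to its observed count;
    margins: list of variable-index tuples (the margins a1, ..., am)."""
    cells = list(product((0, 1), repeat=k))

    def marg(table, a, x):  # margin count of `table` matching x on coords a
        return sum(v for y, v in table.items() if all(y[i] == x[i] for i in a))

    n_hat = {x: 1.0 for x in cells}              # n̂^(0): all cells set to 1
    for _ in range(cycles):
        for a in margins:                        # fit n̂ to each margin in turn
            n_hat = {x: n_hat[x] * marg(n_obs, a, x) / marg(n_hat, a, x)
                     for x in cells}
    return n_hat

# Fit margins {1,2} and {2,3} (0-based: (0,1) and (1,2)) — a decomposable model:
obs = {x: 1 + x[0] + 2 * x[1] + 4 * x[2] for x in product((0, 1), repeat=3)}
fit = ipf(obs, [(0, 1), (1, 2)], k=3)
```

After convergence the fitted table agrees with the observed table on both requested margins, while satisfying X1 ⊥⊥ X3 | X2.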
We test model M using the likelihood score L^M = Π_x P̂^M(x)^n(x) (the product of the fitted probabilities over all observations), where P̂^M(x) is the fitted probability of cell x, i.e. the likelihood score
of model M is the probability of the observed data using the fitted cell probabilities according to model M. Log-likelihood score: ℓ^M = Σ_x n(x) log P̂^M(x) (for convenience). The deviance of a
fitted model compares the log-likelihood score of the fitted model to that of the saturated model, ℓ^sat = Σ_x n(x) log(n(x)/N). The larger the model deviance, the poorer the fit. The deviance
of M is 2(log-likelihood of the saturated model − log-likelihood of M) = 2 Σ_x n(x) log(n(x)/(N·P̂^M(x))) = 2 Σ_cells observed × log(observed/fitted).
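The deviance formula can be transcribed directly (names and the 2×2 example are mine; the example fits an independence model to a table whose margins are all 50):

```python
import math

def deviance(observed, fitted):
    """dev(M) = 2 Σ_cells observed · log(observed / fitted);
    cells with observed count 0 contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / fitted[x])
                   for x, o in observed.items() if o > 0)

obs = {(0, 0): 30, (0, 1): 20, (1, 0): 20, (1, 1): 30}
# Independence-model fitted counts n(x)·n(y)/N: every margin is 50, N = 100
fit = {(i, j): (50 * 50) / 100 for i in (0, 1) for j in (0, 1)}
print(round(deviance(obs, fit), 3))  # → 4.027
```

The saturated fit (fitted = observed) has deviance 0, as expected.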
When comparing two models M0 and M1, with M0 being the simpler model (more u-terms being 0), M0 ⊆ M1 (subset in terms of u-terms), the deviance difference is dev(M0) − dev(M1) =
−2ℓ^M0 + 2ℓ^M1 = 2(ℓ^M1 − ℓ^M0). For large N, under M0: 2(ℓ^M1 − ℓ^M0) ≈ χ²_υ, where υ (degrees of freedom) is the number of additional restrictions (zero u-terms) of M0 compared to M1. We reject the
null hypothesis that M0 is the true model when 2(ℓ^M1 − ℓ^M0) > χ²_{υ;α}, where α is the significance level of the test, and P(X² > χ²_{υ;α}) = α. The test is called a likelihood ratio test because
log(L^M1/L^M0) = log L^M1 − log L^M0 = ℓ^M1 − ℓ^M0. Use hill-climbing for finding the optimal model: • AIC: AIC(M) = dev(M) + 2·dim(M) • BIC: BIC(M) = dev(M) + log(N)·dim(M)
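AIC and BIC can be transcribed directly (the function names and the two models' scores are invented for illustration):

```python
import math

def aic(dev, dim):
    return dev + 2 * dim

def bic(dev, dim, n):
    return dev + math.log(n) * dim

# Hypothetical deviances/dimensions for two candidate models on N = 100:
m0, m1 = {"dev": 12.0, "dim": 3}, {"dev": 5.0, "dim": 6}
best_aic = min((m0, m1), key=lambda m: aic(m["dev"], m["dim"]))
best_bic = min((m0, m1), key=lambda m: bic(m["dev"], m["dim"], 100))
# BIC's log(N)·dim penalty (log 100 ≈ 4.6 > 2) favors the smaller model here
print(best_aic["dim"], best_bic["dim"])  # → 6 3
```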