Statistical Learning Framework
We first look at a motivating example: the support vector machine (SVM).
Suppose we have
Target variable: $Y_i \in \{-1, 1\}$ (hence $Y_i$ takes the value 1 or -1)
Features: $X_i = (X_{i1}, \dots, X_{id})^\top$
These can be drawn in the following figures
The blue dots denote $Y_i = 1$.
The orange dots denote $Y_i = -1$.
In the left figure we draw different lines that separate the points.
What is the optimal choice of this separating line?
This is done in the right figure, where we try to maximize the margin (= the signed distance between the separating line and the points closest to it).
We get a linear discriminant function
$f(x) = w^\top x + w_0$
and a decision rule
$c(x) = \operatorname{sign}(f(x))$.
In this example we get, with $\|w\| = 1$:
decision boundary: $\{x : w^\top x + w_0 = 0\}$
decision rule:
$w^\top X_i + w_0 > 0 \Rightarrow c(X_i) = 1$
$w^\top X_i + w_0 = 0 \Rightarrow X_i$ lies on the boundary
$w^\top X_i + w_0 < 0 \Rightarrow c(X_i) = -1$
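The decision rule above can be sketched in a few lines of Python. This is only an illustration: the weight vector `w`, the intercept `w0`, and the two data points are hypothetical values chosen for the example, not taken from the figures.

```python
import numpy as np

def discriminant(X, w, w0):
    """Linear discriminant f(x) = w^T x + w0, evaluated for each row x of X."""
    return X @ w + w0

def classify(X, w, w0):
    """Decision rule c(x) = sign(f(x)); a point exactly on the boundary gets 0."""
    return np.sign(discriminant(X, w, w0))

# Hypothetical example values (assumed for illustration only).
w = np.array([0.6, 0.8])      # unit norm, so f(x) is the signed distance to the boundary
w0 = -1.0
X = np.array([[3.0, 1.0],     # lies on the positive side
              [0.0, 0.0]])    # lies on the negative side

distances = discriminant(X, w, w0)
labels = classify(X, w, w0)
```

Because `w` has norm 1 here, `distances` can be read directly as signed distances to the decision boundary.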
We now treat this example in general and distinguish two cases:
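Hard margins: assume that the points are perfectly separable. Maximizing the margin $M$ then gives
$\max_{w, w_0} M$ s.t. $Y_i (w^\top X_i + w_0) \ge M$ for all $i$, $\|w\| = 1$,
which is the same as
$\min_{\beta, \beta_0} \tfrac{1}{2} \beta^\top \beta$ s.t. $Y_i (\beta^\top X_i + \beta_0) \ge 1$ for all $i$.
The first problem gives the boundary $\{x : w^\top x + w_0 = 0\}$ and the second gives $\{x : \beta^\top x + \beta_0 = 0\}$; these are equal to each other.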
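The equivalence of the two hard-margin formulations can be sketched as follows. This is the standard rescaling argument, written with $\beta$ for the direction vector in both problems; the intermediate steps are not spelled out in the notes and are added here as a reading aid.

```latex
\begin{align*}
&\max_{\beta,\beta_0,\ \|\beta\|=1} M
  \quad\text{s.t. } Y_i(\beta^\top X_i+\beta_0)\ge M \ \ \forall i \\
\iff\ &\max_{\beta,\beta_0} M
  \quad\text{s.t. } \tfrac{1}{\|\beta\|}\,Y_i(\beta^\top X_i+\beta_0)\ge M \ \ \forall i
  && \text{(drop the norm constraint)} \\
\iff\ &\max_{\beta,\beta_0} \tfrac{1}{\|\beta\|}
  \quad\text{s.t. } Y_i(\beta^\top X_i+\beta_0)\ge 1 \ \ \forall i
  && \text{(fix the scale: } \|\beta\| = 1/M\text{)} \\
\iff\ &\min_{\beta,\beta_0} \tfrac12\,\beta^\top\beta
  \quad\text{s.t. } Y_i(\beta^\top X_i+\beta_0)\ge 1 \ \ \forall i .
\end{align*}
```

Under this normalization the achieved margin is $M = 1/\|\beta\|$, so a small norm corresponds to a wide margin.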
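Soft margins: assume the dots are not perfectly separable. We introduce slack variables $\xi_i \ge 0$ that allow data points to violate the margin constraint and even to end up on the other side of the boundary:
$\min_{\beta, \beta_0, \xi} \tfrac{1}{2} \beta^\top \beta + C \sum_{i=1}^n \xi_i$ s.t. $Y_i (\beta^\top X_i + \beta_0) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $i$.
The term $C \sum_{i=1}^n \xi_i$ penalizes the points violating the margin. Note that $C \to \infty$ gives the hard-margin version.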
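However, this is hard to interpret, so we rewrite it into a form we can understand.
Note that at the optimum the constraints give
$\xi_i = \max\{0, 1 - Y_i (\beta^\top X_i + \beta_0)\}$,
so the problem becomes
$\min_{\beta, \beta_0} \tfrac{1}{2} \beta^\top \beta + C \sum_{i=1}^n \max\{0, 1 - Y_i (\beta^\top X_i + \beta_0)\}$.
Dividing by $nC$, we rewrite this as
$\min_{\beta, \beta_0} \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - Y_i (\beta^\top X_i + \beta_0)\} + \lambda \beta^\top \beta$, with $\lambda = \frac{1}{2nC} > 0$.
As we only change the scale, the solution is the same.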
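The rescaling step can be checked numerically. The sketch below evaluates the soft-margin objective (with slack $\xi_i = \max\{0, 1 - Y_i(\beta^\top X_i + \beta_0)\}$ plugged in) and the regularized objective at the same point, and confirms that they differ only by the factor $nC$ when $\lambda = 1/(2nC)$. The data, the value of `C`, and the evaluation point are made-up assumptions for illustration.

```python
import numpy as np

def soft_margin_objective(beta, beta0, X, y, C):
    """(1/2) beta^T beta + C * sum_i xi_i, with xi_i set to its optimal value."""
    slack = np.maximum(0.0, 1.0 - y * (X @ beta + beta0))
    return 0.5 * (beta @ beta) + C * slack.sum()

def regularized_objective(beta, beta0, X, y, lam):
    """(1/n) * sum_i max{0, 1 - y_i f(x_i)} + lam * beta^T beta."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ beta + beta0))
    return hinge.mean() + lam * (beta @ beta)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=20) > 0, 1.0, -1.0)

C = 2.0
n = len(y)
lam = 1.0 / (2 * n * C)

beta, beta0 = np.array([1.0, -0.5]), 0.2   # arbitrary evaluation point
lhs = soft_margin_objective(beta, beta0, X, y, C) / (n * C)
rhs = regularized_objective(beta, beta0, X, y, lam)
```

Since the two objectives are proportional for every `(beta, beta0)`, they share the same minimizer.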
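We can rewrite the objective function again:
$\min_{\beta, \beta_0} \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - Y_i (\beta^\top X_i + \beta_0)\} + \lambda \beta^\top \beta = \min_{\beta, \beta_0} \frac{1}{n} \sum_{i=1}^n \ell_h(f(X_i; \beta, \beta_0), Y_i) + \lambda \beta^\top \beta$,
where
Hinge loss function: $\ell_h(f(x), y) = \max\{0, 1 - y f(x)\}$
Candidate prediction rule: $f(x; \beta, \beta_0) = \beta^\top x + \beta_0$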
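Hence we now have
$\min_{\beta, \beta_0} \frac{1}{n} \sum_{i=1}^n \ell_h(f(X_i; \beta, \beta_0), Y_i)$ s.t. $\beta^\top \beta \le b$,
where the objective is the empirical risk and the constraint is the regularization.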
Concluding from this derivation we define:
Empirical hinge loss (= average hinge loss): $L_S(f) = \frac{1}{n} \sum_{i=1}^n \ell_h(f(X_i; \beta, \beta_0), Y_i)$
Population hinge loss (= expected hinge loss): $L_D(f) = E[L_S(f)] = E[\ell_h(f(X; \beta, \beta_0), Y)]$
Until now we have only looked at linear functions. We now look at the case where we do not necessarily need linearity:
$L_S(f) = \frac{1}{n} \sum_{i=1}^n \ell_h(f(X_i), Y_i) = \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - Y_i f(X_i)\}$ (note that $f(X)$ is not linear anymore)
$L_D(f) = E[L_S(f)] = E[\ell_h(f(X), Y)] = E[\max\{0, 1 - Y f(X)\}]$
What is the optimal $f(x)$?
$f^*(x) = \operatorname{sign}(E[Y \mid X = x]) = c_{\text{Bayes}}(x)$, the Bayes classifier.
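The claim that the population hinge loss is minimized by $\operatorname{sign}(E[Y \mid X = x])$ can be checked pointwise. For a fixed $x$ with $\eta = P(Y = 1 \mid X = x)$, the conditional hinge risk of predicting the value $f$ is $\eta \max\{0, 1 - f\} + (1 - \eta) \max\{0, 1 + f\}$. The sketch below minimizes this over a grid of candidate values; the grid and the two values of $\eta$ are assumptions made for the illustration.

```python
import numpy as np

def conditional_hinge_risk(f, eta):
    """E[max{0, 1 - Y f} | X = x] when P(Y = 1 | X = x) = eta and Y is in {-1, 1}."""
    return eta * np.maximum(0.0, 1.0 - f) + (1.0 - eta) * np.maximum(0.0, 1.0 + f)

def best_prediction(eta, grid):
    """Grid value of f with the smallest conditional hinge risk."""
    return grid[np.argmin(conditional_hinge_risk(grid, eta))]

grid = np.linspace(-2.0, 2.0, 401)         # candidate values for f(x)

f_star_high = best_prediction(0.7, grid)   # E[Y | X = x] = 2 * 0.7 - 1 = 0.4 > 0
f_star_low = best_prediction(0.2, grid)    # E[Y | X = x] = 2 * 0.2 - 1 = -0.6 < 0
```

In both cases the minimizer sits at $+1$ or $-1$, matching the sign of $E[Y \mid X = x] = 2\eta - 1$.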
Hence the soft-margin SVM directly estimates the Bayes classifier, which minimizes the probability of making classification mistakes:
$\min_g P[g(X) \ne Y] = \min_g E[\ell_{0\text{-}1}(g(X), Y)]$,
where $\ell_{0\text{-}1}(\hat{y}, y) = 1\{\hat{y} \ne y\}$ is the zero-one loss function.
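To see how the hinge loss relates to the zero-one loss, the following sketch evaluates both on a handful of predictions. The scores `f` and labels `y` are hypothetical. It also illustrates a standard fact not stated in the notes: the hinge loss is never smaller than the zero-one loss of the classifier $\operatorname{sign}(f)$.

```python
import numpy as np

def hinge_loss(f, y):
    """Hinge loss max{0, 1 - y f} for scores f and labels y in {-1, 1}."""
    return np.maximum(0.0, 1.0 - y * f)

def zero_one_loss(f, y):
    """Zero-one loss 1{sign(f) != y} of the classifier sign(f)."""
    return (np.sign(f) != y).astype(float)

# Hypothetical scores and labels (assumed for illustration only).
f = np.array([2.0, 0.5, -0.3, -2.0])
y = np.array([1.0, 1.0, 1.0, -1.0])

hinge = hinge_loss(f, y)          # confident correct, correct inside margin, wrong, confident correct
mistakes = zero_one_loss(f, y)

misclassification_rate = mistakes.mean()   # empirical estimate of P[g(X) != Y]
average_hinge = hinge.mean()
```

The second point is classified correctly but still incurs a positive hinge loss because it lies inside the margin.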
So far we have only looked at an example. We now turn to the general framework, in which many of these choices can be changed.
The ideal prediction rule
We start again with
Target variable: $Y \in \mathcal{Y}$
Features: $X \in \mathcal{X}$
Before observing $Y$: make a prediction $f(X)$ with some prediction rule $f$.
After observing $Y$: quantify the predictive loss $\ell(f(X), Y)$.
Note: if $f(X) = Y$, then $\ell(f(X), Y) = \ell(Y, Y) = 0$.
The goal is to find the ideal prediction rule $f^*$ that minimizes the population risk function $L_D(f) = E[\ell(f(X), Y)]$.
Hence find $f^*$ such that $L_D(f^*) \le L_D(f)$ for every prediction rule $f$.
Examples
When the loss function is the squared loss $\ell(\hat{y}, y) = \|\hat{y} - y\|^2$:
the optimal predictor is the regression function $\mu(x) = E[Y \mid X = x]$.
When the loss function is the zero-one loss $\ell_{0\text{-}1}(\hat{y}, y) = 1\{\hat{y} \ne y\}$:
the optimal predictor is the Bayes classifier (as in the example) $c_{\text{Bayes}}(x) = \arg\max_{c_m} P(Y = c_m \mid X = x)$.
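Both optimal predictors can be computed explicitly when the joint distribution of $(X, Y)$ is known. The sketch below uses a small, made-up joint probability table with $X \in \{0, 1\}$ and $Y \in \{-1, 1\}$; the numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical joint distribution P(X = x, Y = c); rows: x in {0, 1}, columns: classes.
classes = np.array([-1, 1])
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])

# Conditional class probabilities P(Y = c | X = x), one row per value of x.
cond = joint / joint.sum(axis=1, keepdims=True)

# Regression function mu(x) = E[Y | X = x], optimal under squared loss.
regression_function = cond @ classes

# Bayes classifier: the class with the largest conditional probability,
# optimal under zero-one loss.
bayes_classifier = classes[np.argmax(cond, axis=1)]
```

With labels in $\{-1, 1\}$ the two are linked: the Bayes classifier equals the sign of the regression function, which is the rule found in the SVM example.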
Empirical risk minimization
We start with a random sample $S = \{(Y_i, X_i) : i = 1, \dots, n\}$.
We define
empirical risk function: $L_S(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i)$
population risk function: $L_D(f) = E[L_S(f)] = E[\ell(f(X_i), Y_i)]$
empirical risk minimizer: $\hat{f} = \arg\min_f L_S(f)$
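Empirical risk minimization can be illustrated on a tiny finite class of prediction rules. The sketch below assumes one-dimensional features, the zero-one loss, and a hypothetical class of three threshold classifiers $f_t(x) = \operatorname{sign}(x - t)$; the sample and thresholds are invented for the example.

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """Zero-one loss 1{y_hat != y}, elementwise."""
    return (y_hat != y).astype(float)

def empirical_risk(f, loss, X, Y):
    """L_S(f) = (1/n) * sum_i loss(f(X_i), Y_i)."""
    return loss(f(X), Y).mean()

# Hypothetical sample S = {(Y_i, X_i)} (assumed for illustration only).
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Candidate prediction rules f_t(x) = sign(x - t), indexed by the threshold t.
thresholds = [-1.5, 0.0, 1.5]
candidates = {t: (lambda x, t=t: np.sign(x - t)) for t in thresholds}

# Empirical risk of each candidate; the minimizer is the ERM within this class.
risks = {t: empirical_risk(f, zero_one_loss, X, Y) for t, f in candidates.items()}
best_t = min(risks, key=risks.get)
```

Only the middle threshold separates the sample without error, so it is selected as the empirical risk minimizer.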
The idea is to approximate the population risk by the empirical risk, $L_D \approx L_S$, only in an appropriate subspace $\mathcal{F}_b \subset \mathcal{F}$, and to use the restricted estimator $\hat{f}_b = \arg\min_{f \in \mathcal{F}_b} L_S(f)$.
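How to choose this?
Choose $\mathcal{F}_b$ such that $\mathcal{F}_b = \{f \in \mathcal{F} : J(f) \le b\}$ with $b < \infty$, where $J(f)$ denotes a penalty on the complexity of $f$ (in the SVM example, $J(f) = \beta^\top \beta$).
Equivalent to $\min_f L_S(f) + \lambda J(f)$.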
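The penalized form of empirical risk minimization can be sketched for the linear hinge-loss case from the SVM example, with penalty $\beta^\top \beta$. The code below uses plain subgradient descent, which is one simple way to minimize this objective and not necessarily the method intended in the notes. The function names, step size schedule, value of `lam`, and synthetic data are all assumptions.

```python
import numpy as np

def objective(beta, beta0, X, y, lam):
    """Penalized empirical risk: average hinge loss + lam * beta^T beta."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ beta + beta0))
    return hinge.mean() + lam * (beta @ beta)

def penalized_erm_hinge(X, y, lam=0.01, steps=500, lr=0.1):
    """Subgradient descent on the penalized empirical hinge risk (illustrative sketch)."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for t in range(1, steps + 1):
        margins = y * (X @ beta + beta0)
        active = margins < 1.0                      # points with positive hinge loss
        # Subgradient of the average hinge loss plus gradient of the penalty.
        grad_beta = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2.0 * lam * beta
        grad_beta0 = -y[active].sum() / n
        step = lr / np.sqrt(t)                      # decaying step size
        beta = beta - step * grad_beta
        beta0 = beta0 - step * grad_beta0
    return beta, beta0

# Synthetic two-class data (assumed for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(50, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

beta_hat, beta0_hat = penalized_erm_hinge(X, y)
accuracy = (np.sign(X @ beta_hat + beta0_hat) == y).mean()
final_objective = objective(beta_hat, beta0_hat, X, y, lam=0.01)
```

Larger values of `lam` correspond to a smaller bound $b$ in the constrained form, that is, to a smaller class of candidate rules.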