Lecture 9 (10/23) - Dimensionality Reduction

PCA maximises the overall variance of the data along a small set of directions; it uses no info on class labels (unsupervised).

The following notation will be used:
* X = n×d matrix (instances × features)
Reasons to reduce dimensionality:
-Reduces time complexity: less computation
-Reduces space complexity: fewer parameters
-Saves the cost of observing/measuring features
-Simpler models are more robust on small datasets
-More interpretable; simpler explanation
-Data visualisation (structure, groups, outliers) if plotted in 2 or 3 dimensions
Feature selection (subset selection algorithms): choosing k < d important features and ignoring the remaining d - k
-> preferred when features are individually powerful/meaningful
Feature extraction: project the original x_i, i = 1,...,d dimensions to new k < d dimensions z_j, j = 1,...,k
-> preferred when features are individually weak and have similar variance
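To make the difference concrete, here is a minimal sketch (made-up data; the chosen column indices and the projection matrix are arbitrary, not from the lecture): selection keeps a subset of the original columns, extraction builds k new features as combinations of all d of them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n=100 instances, d=5 features

# Feature selection: keep a subset of k=2 of the original columns.
selected = [0, 3]                    # indices chosen for illustration only
X_sel = X[:, selected]               # still the original features, just fewer of them

# Feature extraction: project onto k=2 new directions (columns of W).
W = rng.normal(size=(5, 2))          # in PCA, W would hold eigenvectors of the covariance
W, _ = np.linalg.qr(W)               # orthonormalise the directions
X_ext = X @ W                        # each new feature is a mix of all original ones

print(X_sel.shape, X_ext.shape)      # (100, 2) (100, 2)
```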
* PCA (Principal Component Analysis): find a low-dimensional space s.t. when x is projected there, information loss is minimised (kept low). By leaving out a column (direction), we don't lose a lot of info; we want to maximise the info density of what we keep. The 1st axis that PCA creates accounts for most of the variation in the data.
-The projection of x on the direction of w is: z = wᵀx
-Find w s.t. Var(z) is maximised (subject to |w| = 1; the constraint says w is a unit vector). This is a constrained optimisation of an objective function.
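A quick numeric check of these two bullet points (a sketch with arbitrary data and an arbitrary unit direction w): the sample variance of z = wᵀx equals wᵀΣw, where Σ is the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))            # 500 instances, 3 features
w = np.array([1.0, 2.0, -1.0])
w = w / np.linalg.norm(w)                # enforce the unit-length constraint |w| = 1

z = X @ w                                # z_i = w^T x_i for every instance
Sigma = np.cov(X, rowvar=False)          # sample covariance of the data

print(np.var(z, ddof=1))                 # Var(z) computed directly
print(w @ Sigma @ w)                     # w^T Sigma w -- same value
```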
Consider a constrained optimisation problem: min xᵀw subject to Aw = b, w ∈ S.
The Lagrangian relaxation method relaxes the explicit linear (equality) constraints by introducing a Lagrange multiplier vector λ and brings them into the objective function: min xᵀw + λᵀ(Aw - b) subject to w ∈ S.
The Lagrangian function of the original problem can be expressed as: L(λ) = min{xᵀw + λᵀ(Aw - b) | w ∈ S}.
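Applying the same pattern to the PCA problem above (a sketch; Σ is the sample covariance of x and α is the single Lagrange multiplier used in the derivation that follows):

\[
\max_{w}\; w^{\top}\Sigma w \;\;\text{s.t.}\;\; w^{\top}w = 1
\quad\Longrightarrow\quad
L(w,\alpha) = w^{\top}\Sigma w - \alpha\,(w^{\top}w - 1)
\]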
Var(z) = Var(wᵀx) = wᵀΣw, and we maximise it subject to wᵀw = 1 via the Lagrangian wᵀΣw - α(wᵀw - 1). (Note: wᵀw is a scalar, wwᵀ is a matrix.)
Setting the derivative w.r.t. w to zero: 2Σw - 2αw = 0 ⟹ Σw = αw, so w must be an eigenvector of Σ.
The variance of the projected data is then Var(z) = wᵀΣw = αwᵀw = α, so pick the eigenvector with the largest eigenvalue to get the biggest value of variance.
PCA therefore computes z = Wᵀ(x - m), where the columns of W are the eigenvectors of Σ and m is the sample mean; this centers the data at the origin and rotates the axes.
PoV (Proportion of Variance) explained by the first k principal components of the projected data:
PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)
When the λi are sorted in descending order, typically you can stop at PoV > 0.9, or at the "elbow", for data visualisation/dimensionality reduction.
PCA can be applied to clean outliers out of the data, to de-noise, and to learn/explore common patterns (eigenvectors).
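Putting the PCA recipe together in a short numpy sketch (synthetic data; in practice a library implementation would be used): center with the sample mean m, eigendecompose Σ, sort the eigenvalues, pick k from the PoV, and project with z = Wᵀ(x - m).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated toy data

m = X.mean(axis=0)                        # sample mean
Sigma = np.cov(X, rowvar=False)           # sample covariance
lam, W = np.linalg.eigh(Sigma)            # eigenvalues (ascending), columns = eigenvectors
order = np.argsort(lam)[::-1]             # sort eigenvalues in descending order
lam, W = lam[order], W[:, order]

pov = np.cumsum(lam) / np.sum(lam)        # PoV for k = 1..d
k = int(np.searchsorted(pov, 0.9)) + 1    # smallest k with PoV >= 0.9

Z = (X - m) @ W[:, :k]                    # z = W^T (x - m), keeping k components
print(pov, k, Z.shape)
```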
Singular Value Decomposition: X = VAWᵀ (elsewhere written X = USWᵀ) is a dimensionality/data reduction method
-V = N×N; contains the eigenvectors of XXᵀ
-W = d×d; contains the eigenvectors of XᵀX
-A = N×d; contains the singular values on its first k diagonal entries (so A = VᵀXW)
The singular values represent the variances of the principal components: the singular values in the SVD are tied to the eigenvalues of the covariance matrix and represent the amount of variance along each vector.
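A sketch of SVD used for data reduction (numpy names the factors U, s, Vᵀ rather than the lecture's V, A, Wᵀ): keeping only the k largest singular values gives the best rank-k approximation of X, which is the "reduced" data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))    # toy N x d matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)          # X = U diag(s) Vt
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]                      # rank-k reconstruction

err = np.linalg.norm(X - X_k)                             # information lost by truncation
print(s, err)
```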
* LDA (Linear Discriminant Analysis), k=2 classes: the focus is on separability between classes (it uses the class labels, unlike PCA).
-Find a low-dimensional space s.t. when x is projected, classes are well-separated
-Find w (the new axis) that maximises the separation between the means of the projected classes
We come to deal with between-class scatter: (m1 - m2)² = (wᵀμ1 - wᵀμ2)², where mi = wᵀμi is the projected mean of class i
And within-class scatter: s1² + s2², the scatter of the projected samples of each class (k=2, binary classification case)
Fisher's Linear Discriminant (k=2 classes): maximise J(w) = (m1 - m2)² / (s1² + s2²) ~ between-class scatter / within-class scatter.
This reduces the dimensionality to 1 (the solution is again an eigenvector-based problem).
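A sketch of Fisher's discriminant for two classes on made-up Gaussian data. The closed-form direction w ∝ S_W⁻¹(m1 - m2) is the standard solution of the criterion above; the notes only state the criterion, so treat that formula as filled in from the standard derivation.

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))    # class 1
X2 = rng.normal(loc=[3, 1], scale=1.0, size=(100, 2))    # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter: sum of the two class scatter matrices
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m1 - m2)        # Fisher direction, up to scale
w = w / np.linalg.norm(w)

z1, z2 = X1 @ w, X2 @ w                  # project both classes onto 1 dimension
# Fisher criterion on the projected data (variances used for s1^2, s2^2)
J = (z1.mean() - z2.mean()) ** 2 / (z1.var(ddof=1) + z2.var(ddof=1))
print(w, J)                              # large J = well-separated projected classes
```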
SVD: X = VAWᵀ. Assuming X is mean-normalised and C = XᵀX/(N-1): how are the eigenvalues of C and the singular values of the SVD related?
The singular values in the SVD of mean-normalised X are directly related to the eigenvalues of the cov-matrix C (λi = si²/(N-1)), so singular values provide info about the amount of variance explained by each principal component, just like the eigenvalues of C. The largest singular value corresponds to the largest eigenvalue.
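A quick numeric check of this relation (a sketch with random data): for mean-normalised X, the eigenvalues of C equal si²/(N-1).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)                      # mean-normalise

s = np.linalg.svd(Xc, compute_uv=False)      # singular values of centered X (descending)
C = Xc.T @ Xc / (len(Xc) - 1)                # covariance matrix
lam = np.sort(np.linalg.eigvalsh(C))[::-1]   # its eigenvalues, descending

print(s**2 / (len(Xc) - 1))                  # matches lam (up to round-off)
print(lam)
```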
For k > 2 classes: find a point central to all classes (the overall mean); the class means should be far from this central point while the within-class scatter stays small, i.e. maximise (d1² + d2² + d3²) / (s1² + s2² + s3²).
The covariance matrix can be written C = WΛWᵀ, where W is the matrix of eigenvectors of C, Λ is the diagonal matrix of eigenvalues, and WᵀW = I (the eigenvectors in W are orthogonal to each other).
Projection intuition: imagine a light from above; the data points x1,...,xN are projected onto the target vector where their shadows (the red arrows) fall. u is a unit vector (has length/magnitude = 1); only its direction matters, not its magnitude.
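A small check of this diagonalisation (a sketch, using an arbitrary symmetric matrix in place of C):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
C = A @ A.T                                   # a symmetric PSD matrix, like a covariance

lam, W = np.linalg.eigh(C)                    # eigenvalues and orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(W.T @ W, np.eye(4)))        # True: eigenvectors are orthogonal unit vectors
print(np.allclose(W @ Lam @ W.T, C))          # True: C = W Lambda W^T
```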
Eigenvalues and eigenvectors: Ax = λx, where x ≠ 0 and λ ∈ ℝ.
Ax = λx ⟹ Ax - λIx = 0 ⟹ (A - λI)x = 0.
For a non-zero x this can only happen if (A - λI) is not invertible, so det(A - λI) = 0 (for a 2×2 matrix, det = ad - bc).
Example (2×2 matrix A): det(A - λI) = λ² + 3λ + 2 = (λ + 2)(λ + 1) = 0 ⟹ λ = -2, λ = -1.
Eigenvectors: substitute each eigenvalue back into (A - λI)x = 0 and solve; this gives a relation between the components, e.g. -2x1 = 2x2 ⟹ x1 = -x2. Any value of x2 works, so only the direction of the eigenvector is fixed, not its magnitude.
If you know something is an eigenvector for a given matrix/linear transformation, you know that that linear transformation will map the eigenvector onto a (scaled) vector which maintains the same ratios between its components (ex. the ratio of x1 (length) to x2 (weight)).
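The same procedure in numpy. The matrix below is only an illustrative choice whose characteristic polynomial is λ² + 3λ + 2, matching the example above; the entries of the original matrix are assumed here, not taken from the lecture.

```python
import numpy as np

# Illustrative 2x2 matrix with characteristic polynomial l^2 + 3l + 2
A = np.array([[-3.0, -2.0],
              [ 1.0,  0.0]])

lam, vecs = np.linalg.eig(A)            # eigenvalues and (column) eigenvectors
print(lam)                              # the eigenvalues -1 and -2 (in some order)
print(vecs)                             # each column v satisfies A v = lam * v

v = vecs[:, 0]                          # check the first eigenpair
print(np.allclose(A @ v, lam[0] * v))   # True
```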
Lagrange multipliers applied to PCA: maximise uᵀSu subject to uᵀu = 1, i.e. maximise uᵀSu + λ(1 - uᵀu).
(The direction of u is what matters, not its magnitude; ||u|| = 1.)
Taking the derivative and setting it to zero, we get Su = λu, and then uᵀSu = λuᵀu = λ.
So take all the eigenvalues of S and, to maximise, find the biggest one, λ.
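A numeric sanity check of this conclusion (a sketch with random data): among unit vectors u, the quantity uᵀSu is largest for the eigenvector of S with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
S = np.cov(X, rowvar=False)                  # covariance matrix S

lam, U = np.linalg.eigh(S)                   # eigenvalues ascending
u_best = U[:, -1]                            # eigenvector with the largest eigenvalue
best = u_best @ S @ u_best                   # equals lam[-1]

# random unit vectors never beat the top eigenvector
R = rng.normal(size=(1000, 5))
R /= np.linalg.norm(R, axis=1, keepdims=True)
print(best, lam[-1])                                               # same value
print((np.einsum('ij,jk,ik->i', R, S, R) <= best + 1e-9).all())    # True
```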