technische universiteit eindhoven 2IIG0 Data Mining and Machine Learning, Q2 2021-2022
Data Mining and Machine Learning
(2021-2022)
January 29, 2022
0.1 Introduction and Foundations
0.1.1 Linear Algebra
Mathematical rules that apply to vector spaces V :
• Scalar multiplication: α( βv) = (αβ)v for α, β ∈ R and v ∈ V
• Distributivity: (α + β)v = αv + βv and (v + w)α = αv + αw
• Well-defined: v/α = (1/α)v (for α ≠ 0) and v − w = v + (−1)w.
• Not defined: v · w and α/v.
Matrices:
• A· j is the column-vector j
• Ai· is the row-vector i
• A⊤ is the transpose: swap the row- and column-vectors: (A⊤)⊤ = A
• For symmetric matrices, it holds that A⊤ = A
• A diagonal matrix has non-zero entries only on the diagonal
The inner product of two vectors is v⊤w = ∑_{i=1}^{d} v_i w_i and the outer product is vw⊤.
In the matrix product C = AB, element (i, j) is the inner product of row i of A and column j of B:
C_{ij} = A_{i·} B_{·j} = ∑_{s=1}^{r} A_{is} B_{sj}. The full matrix product is the sum of the outer products of the
corresponding columns of A and rows of B: C = ∑_{s=1}^{r} A_{·s} B_{s·}.
The multiplication of an n × d matrix A with an identity matrix is: I_n A = A = A I_d.
For C = AB it holds that C ⊤ = B⊤ A⊤ . For the inverse it holds that AA−1 = A−1 A = I.
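A small numpy sketch (own example, not from the notes; the matrix sizes are arbitrary) illustrating the sum-of-outer-products view of the matrix product and the transpose rule (AB)⊤ = B⊤A⊤:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # 3 x 4
B = rng.standard_normal((4, 2))   # 4 x 2, so r = 4 is the shared dimension

# Matrix product as a sum of outer products: C = sum_s A[:, s] B[s, :]
C_outer = sum(np.outer(A[:, s], B[s, :]) for s in range(A.shape[1]))
print(np.allclose(A @ B, C_outer))        # True

# Transpose of a product: (AB)^T = B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))  # True
```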
Vector Norms: measure the length of vectors.
• ||v + w|| ≤ ||v|| + ||w|| (triangle inequality)
• ||αv|| = |α|||v|| (homogeneity)
• ||v|| = 0 ↔ v = 0
• Euclidean norm: ||v||_2 = √(∑_{i=1}^{d} v_i²)
• v⊤w = cos(∠(v, w)) ||v|| ||w||
• Manhattan norm: ||v||_1 = ∑_{i=1}^{d} |v_i|
For orthogonal vectors it holds that cos(∠(v, w)) = 0 and thus the inner product is zero.
Orthonormal vectors are orthogonal and have unit length: ||v|| = ||w|| = 1.
You can normalize a vector w by dividing it by its norm: w/||w||.
The length of the projection p_v of v onto w is ||p_v|| = cos(θ)||v|| = v⊤w/||w||. The projection itself is p_v = (ww⊤/||w||²) v.
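A minimal numpy sketch (own example) of the projection of v onto w, checking both expressions for its length:

```python
import numpy as np

v = np.array([3.0, 1.0])
w = np.array([2.0, 0.0])

print(np.linalg.norm(w / np.linalg.norm(w)))          # 1.0: normalized vector
p_v = (np.outer(w, w) / np.linalg.norm(w)**2) @ v     # projection of v onto w

cos_theta = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
print(np.allclose(np.linalg.norm(p_v), cos_theta * np.linalg.norm(v)))  # True
print(np.allclose(np.linalg.norm(p_v), v @ w / np.linalg.norm(w)))      # True
```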
Matrix Norms:
• Element-wise L_p matrix norms: ||A||_p = (∑_{i=1}^{n} ∑_{j=1}^{m} |A_{ij}|^p)^{1/p}
• Operator norm: || A||op = max||v||=1 || Av||
• Orthogonal columns: A⊤A = diag(||A_{·1}||², ..., ||A_{·d}||²)
• Orthonormal columns: A⊤ A = diag(1, ..., 1)
• Orthonormal square matrix: A⊤ A = AA⊤ = I
A vector norm is orthogonally invariant if ||Xv|| = ||v|| for every orthogonal matrix X.
A matrix norm is orthogonally invariant if ||XV|| = ||V|| where X is an orthogonal matrix.
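A short numpy check (own example) that a matrix with orthonormal columns satisfies Q⊤Q = I and that multiplying by it leaves the Euclidean norm unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
# Q: square matrix with orthonormal columns, obtained from a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

print(np.allclose(Q.T @ Q, np.eye(4)))    # True: orthonormal columns

v = rng.standard_normal(4)
print(np.allclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True: norm unchanged
```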
Trace: tr(A) = ∑_{i=1}^{n} A_{ii}, the sum of the diagonal entries.
• tr(cA + B) = c tr(A) + tr(B) (linearity)
• tr(A⊤) = tr(A)
• tr( ABCD ) = tr( BCDA) = tr(CDAB) = tr( DABC )
• ||v||² = v⊤v = tr(vv⊤)
• ||A||² = tr(A⊤A)
Binomial formulas:
• ||x − y||² = (x − y)⊤(x − y) = ||x||² − 2⟨x, y⟩ + ||y||²
• ||X − Y||² = ||X||² − 2⟨X, Y⟩ + ||Y||²
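A numpy sketch (own example) verifying the trace identity ||A||² = tr(A⊤A) and the binomial formula above:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Squared (element-wise L2 / Frobenius) norm equals the trace of A^T A
print(np.allclose(np.linalg.norm(A, "fro")**2, np.trace(A.T @ A)))   # True

# Binomial formula: ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2
lhs = np.linalg.norm(x - y)**2
rhs = np.linalg.norm(x)**2 - 2 * (x @ y) + np.linalg.norm(y)**2
print(np.allclose(lhs, rhs))                                          # True
```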
0.1.2 Optimization
Given an objective function f : R^n → R, the unconstrained optimization problem is:
min_{x∈R^n} f(x)
• x* ∈ arg min_{x∈R^n} f(x) is a minimizer
• min_{x∈R^n} f(x) is the minimum
A global minimizer x*: f(x*) ≤ f(x) for all x in the domain.
A local minimizer x₀: f(x₀) ≤ f(x) for all x in a neighbourhood N_ε(x₀) of x₀ (within the domain).
Every local minimizer has to be a stationary point (no slope): d/dx f(x₀) = 0. A stationary point can only be a minimizer if d²/dx² f(x₀) ≥ 0.
For a function f : R^d → R, the partial derivatives can be collected in two ways:
Jacobian: ∂f(x)/∂x = (∂f(x)/∂x_1 ... ∂f(x)/∂x_d) ∈ R^{1×d} (a row vector)
Gradient: ∇_x f(x) = (∂f(x)/∂x_1, ..., ∂f(x)/∂x_d)⊤ ∈ R^d (a column vector)
First Order Necessary Condition (FONC): if x is a local minimizer of f : R^d → R and f is continuously differentiable in an open neighbourhood of x, then ∇f(x) = 0 (x is a stationary point).
Second Order Necessary Condition (SONC): if x is a local minimizer of f : R^d → R and ∇²f is continuous in an open neighbourhood of x, then ∇f(x) = 0 and ∇²f(x) is positive semidefinite.
A matrix A is positive semidefinite if x⊤Ax ≥ 0 for all x ∈ R^d.
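A small numpy check (own example): a matrix of the form B⊤B is positive semidefinite, which can be verified via the quadratic form for random x or via the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 3))
A = B.T @ B                            # A = B^T B is positive semidefinite

# Quadratic form x^T A x is non-negative for (many random) x
xs = rng.standard_normal((1000, 3))
quad = np.sum((xs @ A) * xs, axis=1)   # row-wise x^T A x
print(np.all(quad >= -1e-12))          # True

# Equivalent check for a symmetric matrix: all eigenvalues >= 0
print(np.all(np.linalg.eigvalsh(A) >= -1e-12))   # True
```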
The constrained optimization problem consists of two parts:
• objective function: f : R^d → R
• constraint functions: c_i, g_k : R^d → R
min_{x∈R^n} f(x)  s.t.  c_i(x) = 0 for 1 ≤ i ≤ m  and  g_k(x) ≥ 0 for 1 ≤ k ≤ l
The set of points satisfying all constraints is the feasible set C. FONC and SONC are no longer sufficient on their own; the boundary of the feasible set also has to be checked.
It is possible to transform constrained problems into unconstrained ones via the Lagrangian:
L(x, λ, µ) = f(x) − ∑_{i=1}^{m} λ_i c_i(x) − ∑_{k=1}^{l} µ_k g_k(x)
This introduces the dual objective function L_dual; for µ_k ≥ 0 it holds that:
min_{x∈C} f(x) ≥ inf_{x∈C} L(x, λ, µ) ≥ inf_{x∈R^d} L(x, λ, µ) = L_dual(λ, µ)
The dual problem is max_{λ, µ≥0} L_dual(λ, µ). The solution of the primal problem is always bounded below by the solution to the dual problem (weak duality): f* ≥ L*_dual.
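A small worked example (own illustration, not from the lecture notes): minimize f(x) = ||x||² subject to the single equality constraint c(x) = 1⊤x − 1 = 0. The Lagrangian is L(x, λ) = ||x||² − λ(1⊤x − 1). Setting ∇_x L = 2x − λ1 = 0 gives x = (λ/2)1; substituting into the constraint gives (λ/2)d = 1, so λ = 2/d and x* = (1/d)1. The dual objective is L_dual(λ) = inf_x L(x, λ) = −λ²d/4 + λ, maximized at λ = 2/d with value 1/d = f(x*), so in this example the bound f* ≥ L*_dual holds with equality.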
Another option is numerical optimization, which iteratively updates the solution. There are two main numerical optimization methods:
Coordinate Descent: all coordinates are fixed except one, which is optimized exactly:
x_i^{(t+1)} ← arg min_{x_i} f(x_1^{(t)}, ..., x_i, ..., x_d^{(t)}).
The objective value never increases from one iteration t to the next.
Gradient Descent: used if the gradient is known: x^{(t+1)} ← x^{(t)} − η∇f(x^{(t)}), where the step size η needs to be small enough.
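A minimal gradient descent sketch in numpy (own example on the quadratic f(x) = ||x − b||², whose gradient is 2(x − b)); the step size η is assumed small enough for convergence:

```python
import numpy as np

b = np.array([1.0, -2.0, 0.5])

def f(x):
    return np.sum((x - b)**2)      # objective f(x) = ||x - b||^2

def grad_f(x):
    return 2 * (x - b)             # gradient of f

eta = 0.1                          # step size, must be small enough
x = np.zeros(3)                    # starting point x^(0)
for t in range(100):
    x = x - eta * grad_f(x)        # x^(t+1) <- x^(t) - eta * grad f(x^(t))

print(x)     # close to the minimizer b
print(f(x))  # close to the minimum 0
```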
Convex optimization: every local minimum is a global minimum.
A set X is convex if and only if the line segment between every pair of points in the set lies in the set: for all x, y ∈ X and α ∈ [0, 1]: αx + (1 − α)y ∈ X.
A function f is convex if and only if for all α ∈ [0, 1] and x, y ∈ R^d:
f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y).
The objective of the convex optimization problem is min_{x∈R^n} f(x) s.t. x ∈ C.
! If f ( x ) is convex, every local minimizer x ∗ is a global minimizer !