Week 1: Probability recap
A random variable X is a function from a sample space S to the real numbers R.
P(X ∈ A) = P_X(A) = P({s ∈ S | X(s) ∈ A})
The cdf: F_X(x) = P(X ≤ x), ∀x ∈ R
pdf/pmf: f(x) = P(X = x) if X is discrete, f(x) = F′(x) if X is continuous.
Any function Y = g(X) of X is again a random variable. To find the distribution of Y we invert g and compute the cdf of Y.
E(g(X)) = Σ_{x∈X} g(x) f_X(x) if X is discrete, and E(g(X)) = ∫_{x∈X} g(x) f_X(x) dx if X is continuous.
Var(X) = E(X²) − (E(X))²
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)
E(aX + bY) = aE(X) + bE(Y)
Joint and conditional distributions:
Discrete:
p_{X,Y}(k, j) = P(X = k, Y = j)
p_{X|Y}(k|j) = P(X = k | Y = j) = P(X = k, Y = j) / P(Y = j)
Continuous:
f_{X,Y}(x, y) (joint density)
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
E(X) = E[E(X|Y )]
Var(X) = Var(E[X|Y ]) + E[Var(X|Y )]
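As a quick sanity check (not part of the notes), these two identities can be verified by simulation; the model below, Y ~ Poisson(4) and X | Y = y ~ N(y, 2²), is a hypothetical choice of mine:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.poisson(4.0, size=n)
x = rng.normal(loc=y, scale=2.0)

# Direct moments of X
print(x.mean(), x.var())            # roughly 4 and 4 + 4 = 8

# Via conditioning: E[X|Y] = Y and Var(X|Y) = 4
print(y.mean(), np.var(y) + 4.0)    # E[E(X|Y)] and Var(E[X|Y]) + E[Var(X|Y)]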
Law of Large Numbers (LLN)
Suppose {X_n}_{n=1}^∞ is a sequence of iid random variables. The sequence converges almost surely to X̃ iff
P(lim_{n→∞} |X_n − X̃| < ε) = 1, ∀ε > 0
Strong LLN: lim_{n→∞} (1/n) Σ_{i=1}^n X_i = lim_{n→∞} X̄_n = E[X_1] almost surely.
Convergence in probability: lim_{n→∞} P(|X_n − X̃| < ε) = 1, ∀ε > 0
Central limit theorem (CLT)
For finite expectation µ and variance σ² we have √n(X̄_n − µ)/σ → N(0, 1) in distribution.
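A minimal simulation sketch of the CLT, assuming numpy and an arbitrary Exp(1) sample (neither appears in the notes):

import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                           # Exp(1) has mean 1 and variance 1
samples = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

print(z.mean(), z.std())                       # roughly 0 and 1
print(np.mean(z <= 1.96))                      # roughly 0.975 = P(N(0,1) <= 1.96)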
Week 2: Statistical models
For independent X_1, ..., X_n: f_X(x_1, ..., x_n) = ∏_{k=1}^n f_{X_k}(x_k)
A histogram gives a first indication of whether we have chosen a plausible probability distribution for our dataset.
Let {a_j}_{j=1}^m be a partition of the range of the x_i with equal bin width a_j − a_{j−1} = c.
Choose y ∈ (a_{j−1}, a_j]. Then
h_n(y) = #{1 ≤ i ≤ n | a_{j−1} < x_i ≤ a_j} = Σ_{i=1}^n 1{x_i ∈ (a_{j−1}, a_j]}
The scaled histogram is then:
h̃_n(y) = #{1 ≤ i ≤ n | a_{j−1} < x_i ≤ a_j} / (cn)
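A small sketch (the N(0, 1) data, numpy, and bin choices are assumptions of mine, not from the notes) showing that the scaled histogram approximates the underlying density:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
edges = np.linspace(-4, 4, 33)                 # partition {a_j}, 32 bins of width c = 0.25
c, n = edges[1] - edges[0], len(x)

counts, _ = np.histogram(x, bins=edges)
h_scaled = counts / (c * n)                    # the scaled histogram from the notes

mids = 0.5 * (edges[:-1] + edges[1:])
pdf = np.exp(-mids**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(h_scaled - pdf)))          # small -> histogram tracks the density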
Transformations
How do we get the distribution of Y = h(X) from X?
F_Y(y) = P(Y ≤ y) = P(h(X) ≤ y) = P(X ≤ h^{-1}(y)) = F_X(h^{-1}(y)) (for strictly increasing h)
f_Y(y) = ∂/∂y F_Y(y)
f_Y(y) = f_X(h^{-1}(y)) · ∂/∂y h^{-1}(y)
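To illustrate the change-of-variables formula, here is a rough check under choices of my own (not taken from the notes): X ~ Exp(1) and Y = h(X) = √X, so h^{-1}(y) = y² and f_Y(y) = e^{-y²} · 2y:

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, size=200_000)
y = np.sqrt(x)

grid = np.array([0.5, 1.0, 1.5])
f_formula = np.exp(-grid**2) * 2 * grid        # f_X(h^{-1}(y)) * d/dy h^{-1}(y)

# Crude empirical density: fraction of draws in a narrow window around each point
eps = 0.02
f_empirical = [np.mean(np.abs(y - g) < eps) / (2 * eps) for g in grid]
print(f_formula, f_empirical)                  # the two should be close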
Location-scale family Let µ ∈ R, σ > 0, and define
H_{µ,σ}(x) = H((x − µ)/σ)
Let Y be a random variable with cdf H and define Z_{µ,σ} = µ + σY. Then Z_{µ,σ} has cdf H_{µ,σ}:
P(Z_{µ,σ} ≤ y) = P(µ + σY ≤ y) = P(Y ≤ (y − µ)/σ) = H((y − µ)/σ)
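For example (a standard instance, not spelled out in the notes): if Y has the standard normal cdf Φ, then Z_{µ,σ} = µ + σY has cdf Φ((y − µ)/σ), i.e. Z_{µ,σ} ~ N(µ, σ²); the normal family is thus a location-scale family generated by N(0, 1).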
Week 3: Maximum Likelihood
Definition An estimate of θ_0 is any function W(x⃗) of the data. The corresponding estimator is the random variable W(X⃗) obtained by plugging in the random vector.
Method of moments
lim_{n→∞} (1/n) Σ_{i=1}^n X_i = E(X_1)  →  X̄ ≈ E(X_1)
lim_{n→∞} (1/n) Σ_{i=1}^n X_i² = E(X_1²)  →  (1/n) Σ X_i² ≈ E(X_1²)
⋮
lim_{n→∞} (1/n) Σ_{i=1}^n X_i^k = E(X_1^k)  →  (1/n) Σ X_i^k ≈ E(X_1^k)
Solving these approximate equations for the parameters gives the method-of-moments estimator.
Sample mean: X̄ = (1/n) Σ_{i=1}^n X_i
Sample variance: S² = 1/(n−1) Σ_{i=1}^n (X_i − X̄)²
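As an illustration of the method of moments (the Gamma model and numpy are my choice, not the notes'): for Gamma(shape k, scale s), E(X) = ks and Var(X) = ks², so matching the first two moments gives s_hat = var/mean and k_hat = mean²/var:

import numpy as np

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)

m1 = x.mean()                                  # first sample moment  ~ E(X_1)
m2 = np.mean(x**2)                             # second sample moment ~ E(X_1^2)
var_hat = m2 - m1**2

s_hat = var_hat / m1
k_hat = m1**2 / var_hat
print(k_hat, s_hat)                            # roughly (3, 2)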
Definitions on maximum likelihood
Likelihood function: θ → L(θ | x⃗) = f_θ(x⃗)
Maximum likelihood estimate: W(x⃗) = argmax_{θ∈Θ} L(θ | x⃗), the parameter value in the parameter space at which the likelihood function attains its maximum.
For iid observations: L(θ | x⃗) = f_θ(x⃗) = ∏_{i=1}^n g_θ(x_i)
Log likelihood: θ → log(L(θ | x⃗))
Suppose that the log likelihood is differentiable on Θ ⊆ R^k. Then the maximum can be attained at two different kinds of points:
i) boundary points
ii) stationary points: points θ̃ that satisfy ∂/∂θ_j log(L(θ | x⃗))|_{θ=θ̃} = 0, ∀j ∈ {1, ..., k}
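A sketch of computing an MLE numerically by minimizing the negative log likelihood, assuming a Gamma model and SciPy (neither is prescribed by the notes):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=2.0, size=2_000)

def neg_log_lik(theta):
    k, s = theta
    if k <= 0 or s <= 0:                       # stay inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(x, a=k, scale=s))

res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                                   # MLE (k_hat, s_hat), roughly (3, 2)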
Week 4: Evaluating estimators
Definition Bias_θ(W) = E_θ(W(X⃗)) − τ(θ). We say that an estimator is unbiased if E_θ(W(X⃗)) = τ(θ).
MAE(θ, W) = E_θ ||W(X⃗) − τ(θ)||
MSE(θ, W) = E_θ ||W(X⃗) − τ(θ)||² = Var_θ(W(X⃗)) + Bias_θ(W)²
Here the variance is called the precision and the squared bias the accuracy.
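The bias-variance decomposition of the MSE can be checked by simulation; the setup below (the biased variance estimator under a normal model, using numpy) is an assumed example, not from the notes:

import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 100_000
x = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
w = x.var(axis=1)                              # np.var divides by n -> biased estimator of sigma2

mse = np.mean((w - sigma2) ** 2)               # MSE for tau(theta) = sigma2
bias = w.mean() - sigma2                       # close to -sigma2/n = -0.4
print(mse, w.var() + bias**2)                  # the two numbers should agree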
Definition An estimator W* is a UMVU estimator if it is unbiased and, for any other unbiased estimator W, we have Var_θ(W*) ≤ Var_θ(W) for all θ ∈ Θ.
Cauchy-Schwarz Lemma: E(YZ)² ≤ E(Y²)E(Z²)
Fisher information: I_θ = E_θ[(∂/∂θ log f_θ(X⃗))²]
For an individual observation: i_θ = E_θ[(∂/∂θ log g_θ(X_1))²] = Var_θ(∂/∂θ log g_θ(X_1))
Cramér-Rao:
Var_θ(W(X⃗)) ≥ (τ′(θ))² / I_θ
Var_θ(W(X⃗)) ≥ (τ′(θ))² / (n · i_θ)
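As an illustration of the Cramér-Rao bound (the Bernoulli model is my example, not the notes'): for Bernoulli(p) one gets i_p = 1/(p(1 − p)), so any unbiased estimator of τ(p) = p satisfies Var ≥ p(1 − p)/n, and the sample mean attains this bound:

import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 50, 100_000
x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)                         # sample mean = MLE for p, unbiased

i_p = 1.0 / (p * (1 - p))                      # Fisher information per observation
print(p_hat.var(), 1.0 / (n * i_p))            # simulated variance vs. CR lower bound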
Week 5: Exponential families
Definition A set of univariate distributions {gθ |θ ∈ Θ} is called an exponential
family if we can rewrite it as:
g_θ(x) = h(x) c(θ) exp(Σ_{j=1}^m w_j(θ) t_j(x))
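For example (a standard check, not worked out in the notes): the Bernoulli(p) pmf g_p(x) = p^x (1 − p)^{1−x} = (1 − p) exp(x log(p/(1 − p))), x ∈ {0, 1}, is of this form with h(x) = 1, c(p) = 1 − p, w_1(p) = log(p/(1 − p)) and t_1(x) = x.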