Summary EE4560 Information Theory - Cheatsheet

This PDF is a (legal) cheatsheet and can be used during the exam, containing all relevant information very densely presented.

  INFORMATION MEASURES
Hartley's measure: $H_H(X) = \log k^l = l\log k$ for a random variable $X$ with $k$-element alphabet $\mathcal{X}$ and a message of length $l$. Assumes equal probabilities for $\mathcal{X}$'s elements and independence.
Shannon's measure (entropy): $H(X) = H(p) = -\sum_{x\in\mathcal{X}} p(x)\log p(x) = -E\log p(X)$ for a random variable $X$ with $k$-element alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr(X = x)$, $x\in\mathcal{X}$.
Shannon's first coding theorem: entropy is the minimum average number of bits needed to represent an outcome, or, equivalently, the minimum number of yes/no questions needed to determine which symbol $x\in\mathcal{X}$ occurred.
$0 \le H(X) \le \log k$, with $H(X) = 0$ iff $\exists i$ such that $p(x_i) = 1$, and $H(X) = \log k$ iff $p(x_i) = \frac{1}{k}$ for all $i$.
Conditional entropy: $H(Y\mid X=x) = -\sum_{y\in\mathcal{Y}} p(y\mid x)\log p(y\mid x) \;\Rightarrow\; H(Y\mid X) = \sum_{x\in\mathcal{X}} p(x)H(Y\mid X=x) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log p(y\mid x)$
Joint entropy: $H(X,Y) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log p(x,y) = H(X) + H(Y\mid X) = H(Y) + H(X\mid Y) \le H(X) + H(Y)$, with equality iff $X$ and $Y$ are statistically independent, i.e. $p(x,y) = p(x)p(y)$
Chain rule for entropy: $H(X_1,\dots,X_n) = \sum_{i=1}^n H(X_i\mid X_{i-1},\dots,X_1)$
Mutual information: $I(X;Y) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = H(X) - H(X\mid Y) = H(X) + H(Y) - H(X,Y) \ge 0$, with equality iff $p(x,y) = p(x)p(y)$
Chain rule for mutual information: $I(X_1,\dots,X_n; Y) = \sum_{i=1}^n I(X_i; Y\mid X_{i-1},\dots,X_1)$
Jensen's inequality: for a random variable $X$ and convex function $f$: $Ef(X) \ge f(EX)$ ($Ef(X) \le f(EX)$ if $f$ is concave). Because of this, $H(X) \ge H(X\mid Y)$ and $H(X_1,\dots,X_n) \le \sum_{i=1}^n H(X_i)$
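These identities are easy to check numerically. A minimal Python sketch; the joint pmf p_xy below is an arbitrary example, not from the course:

```python
import numpy as np

# Hypothetical 2x3 joint pmf p(x, y); any non-negative matrix summing to 1 works.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

def entropy(p):
    """H(p) = -sum p log2 p, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)            # marginal p(x)
p_y = p_xy.sum(axis=0)            # marginal p(y)

H_X  = entropy(p_x)
H_Y  = entropy(p_y)
H_XY = entropy(p_xy.ravel())      # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X          # chain rule: H(X,Y) = H(X) + H(Y|X)
I_XY = H_X + H_Y - H_XY           # mutual information

print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}")
print(f"H(Y|X)={H_Y_given_X:.4f}  I(X;Y)={I_XY:.4f}")
assert I_XY >= -1e-12             # I(X;Y) >= 0, equality iff independent
```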
If the random variables that model an information source are mutually independent ($p(x_1,\dots,x_n) = \prod_i p(x_i)$), we say that the process is memoryless.
A stochastic process is stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e. $\Pr(X_1 = x_1,\dots,X_n = x_n) = \Pr(X_{l+1} = x_1,\dots,X_{l+n} = x_n)$ for every shift $l$ and all $x_i\in\mathcal{X}$
A discrete stochastic process is a Markov process of order $m$ if $\Pr(X_n\mid X_{n-1},\dots,X_1) = \Pr(X_n\mid X_{n-1},\dots,X_{n-m})$ (it depends only on the chain state: the $m$ preceding variables).

Besides the conditional probability distribution of $X_n$, a Markov process can also be defined by a probability transition matrix $P$: $p_{j|i} = p(x_j\mid x_i)$, the probability of going to state $j$ from state $i$.
Two measures for the information of a stochastic Markov process (with memory):
• entropy rate/per-symbol entropy of $\{X_k, k\in\mathbb{Z}\}$, defined by $H_\infty(X) = \lim_{n\to\infty}\frac{1}{n}H(X_1,\dots,X_n)$; the "rate of growth" of the entropy of a sequence of $n$ random variables
• conditional entropy of the last random variable given the past: $H'_\infty(X) = \lim_{n\to\infty} H(X_n\mid X_{n-1},\dots,X_1)$
For stationary processes, both limits exist and are the same. For a stationary Markov process: $H_\infty(X) = H'_\infty(X) = H(X_n\mid X_{n-1},\dots,X_{n-m})$
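For a first-order stationary chain this reduces to $H_\infty(X) = H(X_n\mid X_{n-1}) = \sum_i \mu_i H(\text{row } i \text{ of } P)$, with $\mu$ the stationary distribution. A small sketch, assuming a made-up two-state transition matrix:

```python
import numpy as np

# Hypothetical 2-state transition matrix P[i, j] = Pr(next = j | current = i).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Stationary distribution mu solves mu P = mu (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = mu / mu.sum()

# For a stationary first-order Markov process:
# H_inf(X) = H(X_n | X_{n-1}) = sum_i mu_i * H(row i of P)
H_rate = sum(mu[i] * entropy(P[i]) for i in range(len(mu)))
print(f"stationary distribution: {mu}")
print(f"entropy rate H_inf(X) = {H_rate:.4f} bits/symbol")
```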
ASYMPTOTIC EQUIPARTITION PROPERTY
For i.i.d. random variables $X_1,\dots,X_n$, the asymptotic equipartition property (AEP) states that $-\frac{1}{n}\log p(X_1,\dots,X_n) \to -E\log p(X) = H(X)$.
As a consequence, the probability $p(x_1,\dots,x_n)$ of almost all sequences will be close to $2^{-nH(X)}$ when $n$ is large. "Almost all events are almost equally surprising."
The analog to the AEP is the weak law of large numbers, which states that $\frac{1}{n}\sum_{i=1}^n x_i \to EX$ in probability, as $n\to\infty$. Alternatively: $\exists n_0$ s.t. for all $n\ge n_0$: $\Pr\left(\left|\frac{1}{n}\sum_{i=1}^n x_i - EX\right| < \epsilon\right) \ge 1-\delta$
The latter can be proven using Chebyshev's inequality: $\Pr(|x-\mu| < \epsilon) \ge 1 - \frac{\sigma^2}{\epsilon^2}$
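A quick Monte Carlo sanity check of this convergence; the three-symbol pmf below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-symbol source; probabilities are an arbitrary example.
p = np.array([0.5, 0.3, 0.2])
H = -np.sum(p * np.log2(p))

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)            # i.i.d. draws
    sample_entropy = -np.mean(np.log2(p[x]))       # -(1/n) log2 p(x_1,...,x_n)
    print(f"n={n:6d}  -(1/n) log p = {sample_entropy:.4f}  (H = {H:.4f})")
```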
  
Shannon-McMillan's theorem: for $\{X_n, n\in\mathbb{Z}\}$ an i.i.d. source, $\exists n_0$ s.t. for all $n\ge n_0$: $1 \ge \Pr\left(\left|-\frac{1}{n}\log p(x_1,\dots,x_n) - H(X)\right| < \epsilon\right) \ge 1-\delta$. The set of all sequences (of length $n$) can be divided into:
• Typical set: $A_\epsilon^{(n)} = \left\{(x_1,\dots,x_n) : \left|-\frac{1}{n}\log p(x_1,\dots,x_n) - H(X)\right| < \epsilon\right\}$. The proportion of any symbol in a typical sequence is close to its true probability: $N(x_i)/n \approx p(x_i)$.
The set has probability close to 1: $1 \ge \Pr\left(A_\epsilon^{(n)}\right) = \sum_{x^n\in A_\epsilon^{(n)}} p(x^n) \ge 1-\delta$. All sequences in it have probability close to $2^{-nH(X)}$, namely $2^{-n(H(X)-\epsilon)} > p(x_1,\dots,x_n) > 2^{-n(H(X)+\epsilon)}$.
Its number of sequences is nearly $2^{nH(X)}$: $(1-\delta)2^{n(H(X)-\epsilon)} < \left|A_\epsilon^{(n)}\right| < 2^{n(H(X)+\epsilon)}$.
• Non-typical set: the complement $\overline{A_\epsilon^{(n)}}$, where the sample entropy is not close to the true entropy. Typical sequence $\ne$ most probable sequence.
If $Q_q^{(n)}\subset\mathcal{X}^n$ is the set of most probable sequences with $\Pr\left(Q_q^{(n)}\right) > 1-q$, then $\Pr\left(A_\epsilon^{(n)}\cap Q_q^{(n)}\right) > 1-\delta-q$
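A brute-force check of these statements for a small binary source; the Bernoulli(0.3) pmf, n = 12 and eps = 0.1 are arbitrary illustration choices:

```python
import itertools
import numpy as np

# Hypothetical binary source; Bernoulli(0.3) is an arbitrary example.
p = np.array([0.7, 0.3])
H = -np.sum(p * np.log2(p))
n, eps = 12, 0.1

typical_prob, typical_count = 0.0, 0
for seq in itertools.product(range(2), repeat=n):
    prob = np.prod(p[list(seq)])
    sample_entropy = -np.log2(prob) / n
    if abs(sample_entropy - H) < eps:              # membership test for A_eps^(n)
        typical_prob += prob
        typical_count += 1

print(f"H = {H:.4f} bits, |A| = {typical_count} of {2**n} sequences")
print(f"Pr(A) = {typical_prob:.4f}, 2^(n(H+eps)) = {2**(n*(H+eps)):.1f}")
```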
DISCRETE CHANNEL CAPACITY
A discrete channel is given by an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a probability transition matrix $p(y\mid x)$. It is memoryless if the output depends only on the current input.
Informational channel capacity ($C = \max_{p(x)} I(X;Y) \le \min\{\log|\mathcal{X}|, \log|\mathcal{Y}|\}$) is the same as operational channel capacity (the highest rate in bits per channel use with arbitrarily low probability of error).
It is the difference between the uncertainty about $X$ before and after transmission, i.e. the amount of information communicated. Recall $I(X;Y) = H(Y) - H(Y\mid X) = H(Y) - \sum_x p(x)H(Y\mid X=x)$
A channel is symmetric if the rows of the channel transition matrix are permutations of each other, and the columns of the channel transition matrix are permutations of each other (identical rows/columns also count as permutations).
A channel is weakly symmetric if the rows of the transition matrix are permutations of each other and all column sums are equal. Then $C = \log|\mathcal{Y}| - H(\text{row of matrix})$, achieved by a uniform input distribution.
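As an illustration, a short sketch computing the capacity of a binary symmetric channel (the crossover probability 0.1 is an arbitrary example), both via the weakly symmetric formula and via a brute-force maximization of $I(X;Y)$ over the input distribution:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Binary symmetric channel with crossover probability 0.1 (arbitrary example).
Pyx = np.array([[0.9, 0.1],
                [0.1, 0.9]])          # rows: p(y | x)

# Weakly symmetric formula: C = log|Y| - H(row of transition matrix).
C_formula = np.log2(Pyx.shape[1]) - entropy(Pyx[0])

# Cross-check: C = max over p(x) of I(X;Y), by a coarse grid search.
def mutual_information(px, Pyx):
    pxy = px[:, None] * Pyx           # p(x, y) = p(x) p(y|x)
    py = pxy.sum(axis=0)
    return entropy(py) + entropy(px) - entropy(pxy.ravel())

C_search = max(mutual_information(np.array([a, 1 - a]), Pyx)
               for a in np.linspace(0.01, 0.99, 99))

print(f"C (formula)     = {C_formula:.4f} bits/use")
print(f"C (grid search) = {C_search:.4f} bits/use")   # maximum at uniform input
```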
An $(M, n)$ code is given by an index set $\{1,\dots,M\}$, an encoding function $\{1,\dots,M\}\to\mathcal{X}^n$, and a decoding function $g:\mathcal{Y}^n\to\{1,\dots,M\}$.
Conditional probability of error given that $i$ was sent: $\lambda_i = \Pr(g(Y^n)\ne i\mid X^n = x^n(i)) = \sum_{y^n} p(y^n\mid x^n(i))\, I(g(y^n)\ne i)$, where $I(\cdot)$ is the indicator function.
Maximal probability of error: $\lambda^{(n)} = \max_{i\in\{1,\dots,M\}}\lambda_i$. Average probability of error: $P_e^{(n)} = \frac{1}{M}\sum_{i=1}^M \lambda_i$.
Rate $R$ of an $(M, n)$ code: $R = \frac{\log M}{n}$ bits per transmission. A rate $R$ is achievable on a channel if there exists a sequence of $(\lceil 2^{nR}\rceil = M, n)$ codes such that the maximum probability of error $\lambda^{(n)}$ tends to 0 as $n\to\infty$. The operational capacity of a channel is the supremum of all achievable rates.
The set $A_\epsilon^{(n)}$ of jointly typical sequences consists of pairs $(x^n, y^n)\in\mathcal{X}^n\times\mathcal{Y}^n$ drawn $\sim p(x^n, y^n) = \prod_{i=1}^n p(x_i, y_i)$ s.t. $\left|-\frac{1}{n}\log p(x^n \text{ or } y^n) - H(X \text{ or } Y)\right| < \epsilon$ and $\left|-\frac{1}{n}\log p(x^n, y^n) - H(X,Y)\right| < \epsilon$.
$\Pr\left((X^n, Y^n)\in A_\epsilon^{(n)}\right)\to 1$ if drawn from $p(x^n, y^n)$. $\Pr\left((X^n, Y^n)\in A_\epsilon^{(n)}\right) \le 2^{-n(I(X;Y)-3\epsilon)} \approx \frac{2^{nH(X,Y)}}{2^{nH(X)}2^{nH(Y)}}$ (small) if drawn from $p(x^n)p(y^n)$. $\left|A_\epsilon^{(n)}\right| \le 2^{n(H(X,Y)+\epsilon)}$.
Since we decode according to joint typicality, the latter statement suggests there are about $2^{nI(X;Y)} = \frac{2^{nH(X)}}{2^{nH(X\mid Y)}}$ distinguishable signals $X^n$.
Shannon's channel coding theorem/2nd theorem: for a memoryless channel, all rates below $C$ are achievable. Conversely, any sequence of $(2^{nR}, n)$ codes with $\lambda^{(n)}\to 0$ must have $R\le C$. Proven by randomly generating $2^{nR}$ codewords according to $p(x^n) = \prod_{i=1}^n p(x_i)$, and showing that if $R < I(X;Y)$, then the average probability of error is small for large $n$. From this, it can be derived that for any rate $R < C$ and any $\epsilon > 0$, a code exists with maximal probability of error $\lambda^{(n)} \le 4\epsilon$. In short, $P_e^{(n)}\to 0$ if $R < C$, while $P_e^{(n)}\to 1$ if $R > C$.
$C_{FB} = C$: feedback does not increase the capacity of a memoryless channel, but channels with feedback can simplify encoding/decoding.
CONTINUOUS CHANNEL CAPACITY
The differential entropy $h(X)$ of a continuous random variable $X$ with density $f(x)$ is $h(X) = -\int_S f(x)\log f(x)\,dx$, where $S$ is the support set of $X$ (where $f(x) > 0$). It can be negative, but $2^{h(X)} > 0$.
For a Gaussian distribution: $h(f) = \frac{1}{2}\log\left(2\pi e\sigma^2\right)$ (the larger the variance, the larger the differential entropy).
Gaussian channel: $Y_i = X_i + Z_i$ at discrete time $i$, where all are real numbers and $Z_i\sim\mathcal{N}(0, N)$ is independent of $X_i$. Input power constraint for any codeword $(x_1,\dots,x_n)$: $\frac{1}{n}\sum_{i=1}^n x_i^2 \le P$.
$I(X;Y) = h(Y) - h(Z) \le \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$, because $h(Y)\le\frac{1}{2}\log(2\pi e(P+N))$, because $E(Y^2) = P + N$.
The information capacity of the Gaussian channel with power constraint $P$ is defined as $C = \max_{f(x):\,EX^2\le P} I(X;Y) = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$, where the max is achieved with $X\sim\mathcal{N}(0, P)$.
For a continuous-time channel: $Y(t) = (X(t) + Z(t)) * h(t)$, where $X(t)$ etc. are waveforms and $h(t)$ is the impulse response of an ideal bandpass filter (it removes frequencies larger than the bandwidth $W$).
Nyquist-Shannon sampling theorem: a function $f(t)$, which is band-limited to $W$ (in (k)Hz), is completely determined by samples of the function spaced $\frac{1}{2W}$ (milli)seconds apart.
Noise power $N = N_0 W$, where $N_0$ is the noise power spectral density (given by nature). If the channel is used over the time interval $[0, T]$ ($T$ in seconds), then the energy per sample is the total energy consumed divided by the number of samples, $\frac{PT}{2WT} = \frac{P}{2W}$, and the noise variance per sample is $\frac{N_0 WT}{2WT} = \frac{N_0}{2}$. Capacity per sample: $C = \frac{1}{2}\log\left(1 + \frac{P/(2W)}{N_0/2}\right) = \frac{1}{2}\log\left(1 + \frac{P}{N_0 W}\right)$. Since there are $2W$ samples per second, $C = W\log\left(1 + \frac{P}{N_0 W}\right)$ bits per second. Bandwidth efficiency $\eta = \frac{R}{W}$ bit/s/Hz. The slides have more complicated stuff that I didn't put here.
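A numerical illustration of $C = W\log\left(1 + \frac{P}{N_0 W}\right)$; the bandwidth, power, and noise density below are made-up values:

```python
import math

# Hypothetical link budget: all numbers below are made-up examples.
W  = 1.0e6        # bandwidth in Hz
P  = 1.0e-3       # received signal power in W
N0 = 1.0e-9       # one-sided noise power spectral density in W/Hz

N = N0 * W                                   # noise power in the band
C = W * math.log2(1 + P / N)                 # capacity in bit/s
snr_db = 10 * math.log10(P / N)

print(f"SNR = {snr_db:.1f} dB")
print(f"C = {C/1e6:.2f} Mbit/s, spectral efficiency C/W = {C/W:.2f} bit/s/Hz")
```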
For $k$ independent parallel Gaussian channels with a common power constraint $E\left[\sum_{j=1}^k X_j^2\right] = \sum_j P_j \le P$, we use a water-filling algorithm to find the optimal power allocation that maximizes $C = \sum_j C_j$.
It means making $P_j + N_j$ equal across all channels. If this is impossible because the power budget $P$ is too small, then equalize over as many channels as possible (starting with the best ones, i.e. lowest noise).
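A minimal water-filling sketch, assuming $k$ parallel Gaussian channels with the arbitrary noise powers below; the water level $\nu$ is found by bisection so that $\sum_j \max(\nu - N_j, 0) = P$:

```python
import numpy as np

# Hypothetical noise levels of k parallel Gaussian channels and a total budget P.
N = np.array([1.0, 2.0, 5.0, 9.0])   # per-channel noise powers (arbitrary example)
P_total = 8.0

# Water-filling: choose a water level nu and put P_j = max(nu - N_j, 0)
# in each channel so that sum_j P_j = P_total.
def waterfill(N, P_total, iters=100):
    lo, hi = N.min(), N.max() + P_total      # bisection bounds for the level nu
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - N, 0.0).sum() > P_total:
            hi = nu
        else:
            lo = nu
    return np.maximum(nu - N, 0.0)

P = waterfill(N, P_total)
C = 0.5 * np.log2(1 + P / N).sum()           # bits per (vector) channel use
print("power allocation:", np.round(P, 3))
print(f"total capacity C = {C:.4f} bits")
```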
Feedback increases capacity for channels with memory, but not by more than half a bit ($C_{\text{with FB}} \le C_{\text{without FB}} + \frac{1}{2}$) and not by more than a factor of 2 ($C_{\text{with FB}} \le 2\,C_{\text{without FB}}$).
DATA COMPRESSION
We can code $A_\epsilon^{(n)}$ with no more than $n(H(X)+\epsilon) + 2$ bits and $\overline{A_\epsilon^{(n)}}$ with no more than $n\log|\mathcal{X}| + 2$ bits, where one bit is needed in case the length is not an integer, and the other for distinguishing the two sets.
The average code length $E\,l(X^n) = (n(H(X)+\epsilon)+2)\Pr\left(A_\epsilon^{(n)}\right) + (n\log|\mathcal{X}|+2)\Pr\left(\overline{A_\epsilon^{(n)}}\right) \le n(H(X)+\epsilon')$, where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + 2/n$ can be made arbitrarily small with $n$.
Shannon's source coding theorem/1st theorem: for a discrete memoryless i.i.d. source $\{X_n, n\in\mathbb{Z}\}$, we can encode messages of length $n$ into codewords of length $l$ from a code alphabet of size $r$ with arbitrarily small probability of error $P_e\le\delta$ iff $r^l \ge 2^{n(H(X)+\epsilon)}$, because the number of elements in $A_\epsilon^{(n)}$ is $\left|A_\epsilon^{(n)}\right| \le 2^{n(H(X)+\epsilon)} \le r^l$, so that the number of codewords is larger than the number of typical source words, and $P_e \le \Pr\left(\overline{A_\epsilon^{(n)}}\right)\le\delta$. However, code construction based on typical sets requires long source sequences ($n\to\infty$), which is inefficient for short sequences.
A source code $C$ for a random variable $X$ is a mapping $C:\mathcal{X}\to\mathcal{C}$. $C(x)$ denotes the codeword corresponding to $x$, of length $l(x)$, over an $r$-ary alphabet. There are different nested classes of codes:
• A non-singular code is a code that uniquely maps each of the source symbols $x\in\mathcal{X}$ into a codeword $C(x)$, i.e. $x_i\ne x_j \Rightarrow C(x_i)\ne C(x_j)$; there can't be two equal codewords.
• A uniquely decodable code is one whose $n$-extension is non-singular as well, for all $n$, i.e. any sequence of $n$ symbols can be unambiguously decoded as well.
• A prefix/instantaneous code means no codeword is a prefix of any other codeword, i.e. you don’t have to wait for the entire sequence to appear in order to decode it; you can decode on-the-fly.
Kraft inequality: for any prefix code over an alphabet of size $r$, the codeword lengths $l(x_1),\dots,l(x_k)$ must satisfy $\sum_{i=1}^k r^{-l(x_i)} \le 1$. It is also a sufficient condition for the existence of a codeword set with the specified lengths. Finding a prefix code with minimum expected length then becomes an optimization problem: $\min_{l(x_i)} \sum_{i=1}^k p(x_i)\,l(x_i)$ subject to $\sum_{i=1}^k r^{-l(x_i)} \le 1$
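A tiny numerical check of the Kraft inequality and of the entropy bounds on expected length discussed next; the dyadic source and codeword lengths below are an arbitrary example:

```python
import numpy as np

# Hypothetical binary prefix code for a 4-symbol source (arbitrary example).
p       = np.array([0.5, 0.25, 0.125, 0.125])   # source probabilities
lengths = np.array([1, 2, 3, 3])                # codeword lengths, e.g. 0, 10, 110, 111
r = 2                                           # code alphabet size

kraft_sum = np.sum(r ** (-lengths.astype(float)))
H = -np.sum(p * np.log2(p))
EL = np.sum(p * lengths)

print(f"Kraft sum = {kraft_sum:.3f} (<= 1, so a prefix code with these lengths exists)")
print(f"H(X) = {H:.3f} bits <= E[l(X)] = {EL:.3f} < H(X) + 1")
```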

For any prefix code: $E\,l(X) \ge H(X)$, with equality iff $r^{-l(x_i)} = p(x_i)$ for all $i$. If some $-\log p(x_i)$ is not an integer, we round it up and additionally have $E\,l(X) \le H(X) + 1$.
The overhead of the additional bit disappears as $n\to\infty$. Assuming symbols are drawn i.i.d. according to $p(x)$, we have $E\,l(X_1,\dots,X_n) < H(X_1,\dots,X_n) + 1 \Rightarrow E\,l(X) < H(X) + \frac{1}{n}$
For non-i.i.d. sequences, we have the bound $\frac{H(X_1,\dots,X_n)}{n} \le E\,l(X) < \frac{H(X_1,\dots,X_n)}{n} + \frac{1}{n}$. For stationary processes, we have that $H(X_1,\dots,X_n)/n \to H_\infty(X)$, the entropy rate of the process.
Kraft-McMillan inequality: exactly the same as the Kraft inequality, but McMillan proved that it also holds for uniquely decodable codes.
• Shannon code: arrange messages in order of decreasing $p(x_i)$. Obtain the code of $x_i$ by calculating the cumulative probability $F(x_i) = \sum_{j=1}^{i-1} p(x_j)$, translating this decimal value to binary, and taking the first $l(x_i) = \lceil -\log p(x_i)\rceil$ bits. This code is asymptotically optimal (as $n\to\infty$), but may be much worse than the optimal code for finite $n$ for some particular symbol. (A sketch of this construction is given after this list.)
• Huffman code: arrange messages in order of decreasing $p(x_i)$, join the two least probable, and repeat until two messages are left. Working backward, append a 1 to one of the two joined symbols and a 0 to the other (or vice versa). This code is optimal ($E\,l(C) \le E\,l(C')$ for any other code $C'$), but for small alphabets it is only efficient (meaning no redundancy in the compressed data) for long blocks, the algorithm is slow, and all calculations have to be redone if the block length increases. Concatenating Huffman codes leads to inefficiency, which is only mitigated by redoing the code for the whole block. (A heap-based sketch is given after this list.)
• Arithmetic code for sequences: create nested $|b, l\rangle$ intervals (base $b$, length $l$) according to $I_0(x^n) = |0, 1\rangle$ and $I_k(x^n) = |b_{k-1} + F(s_k)\,l_{k-1},\; p(s_k)\,l_{k-1}\rangle$, $k = 1,\dots,n$. The sequence $x^n = (s_1,\dots,s_n)$ is then represented by the final interval $I_n(x^n)$, and encoded by the binary representation of any number $\hat{v}$ whose $\lceil -\log l_n\rceil$-bit representation lies in the interval. For decoding, we process the symbols FIFO: starting with $\hat{v}_1 = \hat{v}$, we repeatedly obtain the corresponding symbol via $s_k = x_i$ if $F(x_i)\le\hat{v}_k < F(x_{i+1})$ and compute the next $\hat{v}_{k+1} = \frac{\hat{v}_k - F(s_k)}{p(s_k)}$. Because intervals represent sets of infinite sequences, the decoding can go on forever, and we need to let the decoder know when to stop somehow.
This code can be done on-the-fly (if you have cumulative probabilities, for which a histogram can be built on-the-fly as well, but then earlier symbols will be encoded with a premature distribution, resulting in suboptimal encodings). Arithmetic coding is optimized for whole sequences, because it uses probabilities over the whole sequence. However, arithmetic coding is not optimal: a Huffman encoding of the whole sequence will always have expected length smaller than or equal to that of the arithmetic encoding. But Huffman is not efficient for blocks smaller than the whole sequence.
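Illustrative sketches of the three constructions above. All probabilities, symbols, and messages are made-up examples; these are minimal Python sketches, not the course's reference implementations.

Shannon code (lengths $l_i = \lceil -\log p_i\rceil$, taken from the binary expansion of the cumulative probability):

```python
import math

# Hypothetical source; probabilities are an arbitrary example, sorted decreasingly.
probs = [0.4, 0.3, 0.2, 0.1]

def shannon_code(probs):
    """Shannon code: l_i = ceil(-log2 p_i) bits of the cumulative probability F_i."""
    codes, F = [], 0.0
    for p in probs:
        l = math.ceil(-math.log2(p))
        # take the first l bits of the binary expansion of F
        bits, frac = "", F
        for _ in range(l):
            frac *= 2
            bit, frac = divmod(frac, 1)
            bits += str(int(bit))
        codes.append(bits)
        F += p
    return codes

for p, c in zip(probs, shannon_code(probs)):
    print(f"p={p:.2f}  l={len(c)}  code={c}")
```

Huffman code (repeatedly merge the two least probable nodes, using a heap):

```python
import heapq

# Hypothetical source (arbitrary example).
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def huffman(probs):
    """Binary Huffman code: repeatedly merge the two least probable nodes."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)     # least probable
        p2, _, c2 = heapq.heappop(heap)     # second least probable
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman(probs)
EL = sum(probs[s] * len(code[s]) for s in probs)
for s in probs:
    print(f"{s}: {code[s]}")
print(f"expected length = {EL:.2f} bits/symbol")
```

Arithmetic code (floating-point nested intervals, fine for short sequences only; one extra bit is used beyond $\lceil -\log l_n\rceil$ as a safety margin, and the decoder is simply told the message length, since the stopping problem noted above is not addressed here):

```python
import math

# Hypothetical 3-symbol source with cumulative distribution F (arbitrary example).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
symbols = list(p)
F, acc = {}, 0.0                 # F[s] = cumulative probability of symbols before s
for s in symbols:
    F[s] = acc
    acc += p[s]

def encode(seq):
    """Nested-interval construction: I_k = |b + F(s_k)*l, p(s_k)*l>."""
    b, l = 0.0, 1.0
    for s in seq:
        b, l = b + F[s] * l, p[s] * l
    nbits = math.ceil(-math.log2(l)) + 1         # enough bits to land inside I_n
    v = math.ceil(b * 2 ** nbits) / 2 ** nbits   # a dyadic number inside the interval
    return v, nbits

def decode(v, n):
    """Recover n symbols FIFO: find s with F(s) <= v < F(s)+p(s), then rescale v."""
    out = []
    for _ in range(n):
        s = max((s for s in symbols if F[s] <= v), key=lambda s: F[s])
        out.append(s)
        v = (v - F[s]) / p[s]
    return "".join(out)

msg = "abacab"
v, nbits = encode(msg)
print(f"encoded value = {v} using {nbits} bits")
print(f"decoded       = {decode(v, len(msg))}")
```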
