Summary EE4560 Information Theory - Cheatsheet

This PDF is a (legal) cheatsheet and can be used during the exam, containing all relevant information very densely presented.

  INFORMATION MEASURES
Hartley's measure: H_H(X) = log k^l = l log k for a random variable X with k-element alphabet X and a message of length l. Assumes equal probabilities for X's elements and independence.
Shannon's measure (entropy): H(X) = H(p) = −Σ_{x∈X} p(x) log p(x) = −E[log p(X)] for a random variable X with k-element alphabet X and probability mass function p(x) = Pr(X = x), x ∈ X.
Shannon's first coding theorem: entropy is the minimum average number of bits needed to represent an outcome, or, equivalently, the minimum expected number of yes/no questions needed to determine which symbol x ∈ X occurred.
0 ≤ H(X) ≤ log k, where H(X) = 0 iff ∃i such that p(x_i) = 1, and H(X) = log k iff p(x_i) = 1/k for all i.
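As a quick sanity check of these definitions and bounds, a minimal Python sketch (NumPy assumed; the example pmfs are hypothetical) that evaluates H(X) for a given pmf:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x) of a pmf given as an array."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.25, 0.25]))       # 1.5 bits
print(entropy([1.0, 0.0, 0.0]))         # 0: a deterministic outcome carries no information
print(entropy([0.25] * 4))              # log2(4) = 2: the uniform pmf maximizes H(X)
```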
Conditional entropy: H(Y | X = x) = −Σ_{y∈Y} p(y | x) log p(y | x)  ⇒  H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y | x)
Joint entropy: H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y) = H(X) + H(Y | X) = H(Y) + H(X | Y) ≤ H(X) + H(Y), with equality iff X and Y are statistically independent, i.e. p(x, y) = p(x)p(y)
Chain rule for entropy: H(X_1, ..., X_n) = Σ_{i=1}^n H(X_i | X_{i−1}, ..., X_1)
Mutual information: I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x)p(y))] = H(X) − H(X | Y) = H(X) + H(Y) − H(X, Y) ≥ 0, with equality iff p(x, y) = p(x)p(y)
Chain rule for mutual information: I(X_1, ..., X_n; Y) = Σ_{i=1}^n I(X_i; Y | X_{i−1}, ..., X_1)
Jensen's inequality: for a random variable X and a convex function f: E[f(X)] ≥ f(E[X]) (E[f(X)] ≤ f(E[X]) if f is concave). Because of this, H(X) ≥ H(X | Y) and H(X_1, ..., X_n) ≤ Σ_{i=1}^n H(X_i)
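A minimal sketch, assuming a hypothetical 2×2 joint pmf, that evaluates I(X;Y) both from the definition and via H(X) + H(Y) − H(X, Y) (the two agree, and I ≥ 0):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint pmf p(x, y) for binary X and Y
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)        # marginals p(x), p(y)

# Definition: sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]
I_def = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2) if pxy[i, j] > 0)

# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_id = entropy(px) + entropy(py) - entropy(pxy)
print(I_def, I_id)                               # both ≈ 0.278 bits
```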
If the random variables that model an information source are mutually independent (p(x_1, ..., x_n) = ∏_i p(x_i)), we say that the process is memoryless.
A stochastic process is stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e. Pr(X_1 = x_1, ..., X_n = x_n) = Pr(X_{l+1} = x_1, ..., X_{l+n} = x_n) for every shift l and all x_i ∈ X.
A discrete stochastic process is a Markov process of order m if Pr(Xn | Xn−1 , . . . , X1 ) = Pr(Xn | Xn−1 , . . . , Xn−m ) (depends only on the chain state; the m preceding variables).

Besides the conditional probability distribution of X_n, a Markov process can also be defined by a probability transition matrix P: p_{j|i} = p(x_j | x_i), the probability of going to state j from state i.
Two measures for the information of a stochastic Markov process (with memory):
• entropy rate/per-symbol entropy of {X_k, k ∈ Z} is defined by H_∞(X) = lim_{n→∞} (1/n) H(X_1, ..., X_n); the "rate of growth" of the entropy of a sequence of n random variables.
• conditional entropy of the last random variable given the past: H′_∞(X) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_1).
For stationary processes, both limits exist and are the same. For a stationary Markov process: H_∞(X) = H′_∞(X) = H(X_n | X_{n−1}, ..., X_{n−m}).
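A minimal sketch, assuming a hypothetical two-state first-order Markov chain, that computes the entropy rate of the stationary process as H(X_n | X_{n−1}) = Σ_i μ_i H(row i of P), with μ the stationary distribution:

```python
import numpy as np

# Hypothetical transition matrix: P[i, j] = Pr(next state = j | current state = i)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution mu solves mu P = mu, sum(mu) = 1 (left eigenvector for eigenvalue 1)
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu = mu / mu.sum()

def row_entropy(row):
    row = row[row > 0]
    return -np.sum(row * np.log2(row))

H_rate = sum(mu[i] * row_entropy(P[i]) for i in range(len(mu)))
print(mu, H_rate)     # mu = [0.8, 0.2], entropy rate ≈ 0.57 bits per symbol
```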
ASYMPTOTIC EQUIPARTITION PROPERTY
For i.i.d. random variables X_1, ..., X_n, the asymptotic equipartition property (AEP) states that −(1/n) log p(X_1, ..., X_n) → −E[log p(X)] = H(X).
As a consequence, the probability p(x_1, ..., x_n) of almost all sequences will be close to 2^{−nH(X)} when n is large. "Almost all events are almost equally surprising."
 
The analog to the AEP is the weak law of large numbers, which states that (1/n) Σ_{i=1}^n X_i → E[X] in probability as n → ∞. Alternatively: for every ε, δ > 0 there exists n_0 such that for all n ≥ n_0, Pr(|(1/n) Σ_{i=1}^n X_i − E[X]| < ε) ≥ 1 − δ.
The latter can be proven using Chebyshev's inequality: Pr(|X − μ| < ε) ≥ 1 − σ²/ε².
  
Shannon-McMillan's theorem: for an i.i.d. source {X_n, n ∈ Z}, ∃n_0 s.t. for all n ≥ n_0: 1 ≥ Pr(|−(1/n) log p(X_1, ..., X_n) − H(X)| < ε) ≥ 1 − δ. The set of all sequences (of length n) can be divided into:
• Typical set: A_ε^(n) = {(x_1, ..., x_n) : |−(1/n) log p(x_1, ..., x_n) − H(X)| < ε}, i.e. the sequences whose sample entropy is close to the true entropy. The proportion of any symbol in a typical sequence is close to its true probability: N(x_i)/n ≈ p(x_i).
The set has probability close to 1: 1 ≥ Pr(A_ε^(n)) = Σ_{x^n ∈ A_ε^(n)} p(x^n) ≥ 1 − δ. All sequences in it have probability close to 2^{−nH(X)}, namely 2^{−n(H(X)−ε)} > p(x_1, ..., x_n) > 2^{−n(H(X)+ε)}.
Its number of sequences is nearly 2^{nH(X)}: (1 − δ) 2^{n(H(X)−ε)} < |A_ε^(n)| < 2^{n(H(X)+ε)}.
• Non-typical set: the complement Ā_ε^(n). Note that a typical sequence ≠ the most probable sequence.
If Q_q^(n) ⊂ X^n is the set of the most probable sequences with Pr(Q_q^(n)) > 1 − q, then Pr(A_ε^(n) ∩ Q_q^(n)) > 1 − δ − q.
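A minimal simulation, assuming a hypothetical Bernoulli(0.3) i.i.d. source, showing the AEP in action: the sample entropy −(1/n) log2 p(X_1, ..., X_n) concentrates around H(X) as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                                   # Pr(X = 1), hypothetical
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))          # true entropy ≈ 0.881 bits

for n in [10, 100, 1000, 100000]:
    x = rng.random(n) < p                                 # one source sequence of length n
    k = x.sum()                                           # number of ones in the sequence
    # sample entropy -(1/n) log2 p(x_1, ..., x_n) for an i.i.d. Bernoulli sequence
    sample_H = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    print(n, round(sample_H, 3), round(H, 3))             # sample entropy -> H(X)
```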
DISCRETE CHANNEL CAPACITY
A discrete channel is given by an input alphabet X, an output alphabet Y, and a probability transition matrix p(y | x). It is memoryless if the output depends only on the current input.
Informational channel capacity (C = max_{p(x)} I(X; Y) ≤ min{log |X|, log |Y|}) is the same as operational channel capacity (the highest rate in bits per channel use achievable with arbitrarily low probability of error).
It is the difference between the uncertainty about X before and after transmission, i.e. the amount of information communicated. Recall I(X; Y) = H(Y) − H(Y | X) = H(Y) − Σ_{x∈X} p(x) H(Y | X = x)
A channel is symmetric if the rows of the channel transition matrix are permutations of each other, and the columns of the channel transition matrix are permutations of each other (rows/columns that are identical also count as permutations).
A channel is weakly symmetric if the rows of the transition matrix are permutations of each other and all column sums are equal. Then C = log |Y| − H(row of transition matrix), achieved by a uniform input distribution.
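A minimal sketch, using a binary symmetric channel with a hypothetical crossover probability of 0.1 as the example, that evaluates C = log|Y| − H(row) for a (weakly) symmetric channel:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetric_capacity(T):
    """C = log2 |Y| - H(row) for a (weakly) symmetric channel with transition matrix T[x, y]."""
    return np.log2(T.shape[1]) - entropy(T[0])

eps = 0.1                              # BSC crossover probability (hypothetical)
bsc = np.array([[1 - eps, eps],
                [eps, 1 - eps]])
print(symmetric_capacity(bsc))         # 1 - H(0.1) ≈ 0.531 bits per channel use
```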
An (M, n) code is given by an index set {1, ..., M}, an encoding function x^n: {1, ..., M} → X^n, and a decoding function g: Y^n → {1, ..., M}.
Conditional probability of error given i was sent: λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i), where I(·) is the indicator function.
Maximal probability of error: λ^(n) = max_{i∈{1,...,M}} λ_i. Average probability of error: P_e^(n) = (1/M) Σ_{i=1}^M λ_i.
Rate R of an (M, n) code: R = (log M)/n bits per transmission. A rate R is achievable on a channel if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximum probability of error λ^(n) tends to 0 as n → ∞. The operational capacity of a channel is the supremum of all achievable rates.
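A minimal sketch, assuming the (M, n) = (2, 3) binary repetition code over a BSC with crossover probability 0.1 and majority-vote decoding (all values hypothetical), that evaluates the rate R = (log2 M)/n and the error probabilities λ_i and P_e^(n) exactly:

```python
from itertools import product
import numpy as np

eps = 0.1                                    # BSC crossover probability (hypothetical)
codewords = {1: (0, 0, 0), 2: (1, 1, 1)}     # (M, n) = (2, 3) repetition code
M, n = 2, 3
R = np.log2(M) / n                           # rate: 1/3 bit per transmission

def decode(y):
    return 2 if sum(y) >= 2 else 1           # majority vote

lam = {}
for i, cw in codewords.items():
    # lambda_i = sum over outputs y^n of p(y^n | x^n(i)) * I(g(y^n) != i)
    lam[i] = sum(
        np.prod([eps if y != x else 1 - eps for x, y in zip(cw, yn)])
        for yn in product([0, 1], repeat=n)
        if decode(yn) != i
    )

Pe = sum(lam.values()) / M                   # average probability of error
print(R, lam, Pe)                            # lambda_i = 3*eps^2*(1-eps) + eps^3 = 0.028
```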
The set A_ε^(n) of jointly typical sequences contains the pairs (x^n, y^n) ∈ X^n × Y^n, drawn ∼ p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i), s.t. |−(1/n) log p(x^n or y^n) − H(X or Y)| < ε and |−(1/n) log p(x^n, y^n) − H(X, Y)| < ε.
Pr((X^n, Y^n) ∈ A_ε^(n)) → 1 if drawn from p(x^n, y^n). Pr((X^n, Y^n) ∈ A_ε^(n)) ≤ 2^{−n(I(X;Y)−3ε)} ≈ 2^{nH(X,Y)} / (2^{nH(X)} 2^{nH(Y)}) (small) if drawn from p(x^n)p(y^n). |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}.
Since we decode according to joint typicality, the latter statement suggests there are about 2^{nI(X;Y)} = 2^{nH(X)} / 2^{nH(X|Y)} distinguishable signals X^n.
Shannon's channel coding theorem/2nd theorem: for a memoryless channel, all rates below C are achievable. Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C. Proven by randomly generating 2^{nR} codewords according to p(x^n) = ∏_{i=1}^n p(x_i), and showing that if R < I(X; Y), then the average probability of error is small for large n. From this, it can be derived that for any rate R < C and any ε > 0, a code exists with maximal probability of error λ^(n) ≤ 4ε. In short, P_e^(n) → 0 if R < C, while P_e^(n) → 1 if R > C.
C_FB = C, but channels with feedback can simplify encoding/decoding.
CONTINUOUS CHANNEL CAPACITY
The differential entropy h(X) of a continuous random variable X with density f(x) is h(X) = −∫_S f(x) log f(x) dx, where S is the support set of X (where f(x) > 0). It can be < 0, but 2^{h(X)} > 0.
For a Gaussian distribution: h(f) = (1/2) log(2πeσ²) (the larger the variance, the larger the differential entropy).
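A minimal sketch evaluating h(f) = (1/2) log2(2πeσ²) for a few hypothetical variances; it shows that, unlike discrete entropy, differential entropy can be negative:

```python
import numpy as np

def gaussian_diff_entropy(var, base=2):
    """Differential entropy (1/2) log(2 pi e var) of a N(mu, var) density, in bits by default."""
    return 0.5 * np.log(2 * np.pi * np.e * var) / np.log(base)

for var in [0.01, 1.0, 100.0]:
    print(var, gaussian_diff_entropy(var))   # negative for small variance, grows with variance
```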
Discrete-time Gaussian channel: Y_i = X_i + Z_i at discrete time i, where all are real numbers and Z_i ∼ N(0, N) is independent of X_i. Input power constraint for any codeword (x_1, ..., x_n): (1/n) Σ_{i=1}^n x_i² ≤ P.
I(X; Y) = h(Y) − h(Z) ≤ (1/2) log(1 + P/N), because h(Y) ≤ (1/2) log(2πe(P + N)), since E(Y²) = P + N.
 
The information capacity of the Gaussian channel with power constraint P is defined as C = max_{f(x): E[X²] ≤ P} I(X; Y) = (1/2) log(1 + P/N), where the max is achieved with X ∼ N(0, P).
For a continuous-time channel: Y(t) = (X(t) + Z(t)) ∗ h(t), where X(t) etc. are waveforms and h(t) is the impulse response of an ideal bandpass filter (removes frequencies larger than the bandwidth W).
Nyquist-Shannon sampling theorem: a function f(t) which is band-limited to W (in (k)Hz) is completely determined by samples of the function spaced 1/(2W) (milli)seconds apart.
Noise power N = N_0 W, where N_0 is the noise power spectral density (given by nature). If the channel is used over the time interval [0, T] (T in seconds), then the energy per sample is the total energy consumed divided by the number of samples, P T / (2W T) = P / (2W), and the noise variance per sample is N_0 W T / (2W T) = N_0 / 2. Capacity per sample: C = (1/2) log(1 + (P/(2W)) / (N_0/2)) = (1/2) log(1 + P/(N_0 W)). Since there are 2W samples per second, C = W log(1 + P/(N_0 W)) bits per second. Bandwidth efficiency η = R/W bit/s/Hz. The slides have more complicated stuff that I didn't put here.
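A minimal sketch, assuming hypothetical values W = 1 MHz, N0 = 1e-9 W/Hz and P = 1e-3 W, that evaluates the band-limited AWGN capacity C = W log2(1 + P/(N0 W)) and the corresponding spectral efficiency:

```python
import numpy as np

def awgn_capacity(P, N0, W):
    """Capacity in bit/s of a band-limited AWGN channel: W * log2(1 + P / (N0 * W))."""
    return W * np.log2(1 + P / (N0 * W))

W = 1e6       # bandwidth in Hz (hypothetical)
N0 = 1e-9     # noise power spectral density in W/Hz (hypothetical)
P = 1e-3      # transmit power in W (hypothetical)

C = awgn_capacity(P, N0, W)
print(C, "bit/s")            # P/(N0 W) = 1, so C = W * log2(2) = 1e6 bit/s
print(C / W, "bit/s/Hz")     # spectral efficiency at capacity
```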
For k independent parallel Gaussian channels with a common power constraint E[Σ_{j=1}^k X_j²] = Σ_j P_j ≤ P, we use a water-filling algorithm to find the optimal power allocation that maximizes C = Σ_j C_j.
It means making P_j + N_j equal across all channels. If this is impossible because the budget P is too small, then equalize as many channels as possible (starting with the best ones = lowest noise).
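A minimal water-filling sketch, assuming hypothetical noise variances and power budget, that finds the water level ν with Σ_j max(ν − N_j, 0) = P and reports the resulting powers and total capacity:

```python
import numpy as np

def water_filling(noise, P_total):
    """Per-channel powers P_j = max(nu - N_j, 0) whose sum equals P_total (water-filling)."""
    noise = np.asarray(noise, dtype=float)
    order = np.sort(noise)                       # lowest-noise channels first
    for m in range(len(noise), 0, -1):           # try using the m best channels
        nu = (P_total + order[:m].sum()) / m     # water level if exactly m channels are active
        if nu >= order[m - 1]:                   # all m channels then get non-negative power
            break
    return nu, np.maximum(nu - noise, 0.0)       # powers in the original channel order

noise = np.array([1.0, 2.0, 4.0])                # hypothetical noise variances N_j
P_total = 2.0                                    # hypothetical common power budget
nu, powers = water_filling(noise, P_total)
C = 0.5 * np.log2(1 + powers / noise).sum()      # sum of per-channel Gaussian capacities
print(nu, powers, C)                             # nu = 2.5, powers = [1.5, 0.5, 0.0], C ≈ 0.82
```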
Feedback increases capacity for channels with memory, but not by more than half a bit (C_withFB ≤ C_withoutFB + 1/2) and not by more than a factor of 2 (C_withFB ≤ 2 C_withoutFB).
DATA COMPRESSION
We can code A_ε^(n) with no more than n(H(X) + ε) + 2 bits and Ā_ε^(n) with no more than n log|X| + 2 bits, where one bit is needed in case the length is not an integer, and the other for distinguishing the two sets.
The average code length E l(X^n) = (n(H(X) + ε) + 2) Pr(A_ε^(n)) + (n log|X| + 2) Pr(Ā_ε^(n)) ≤ n(H(X) + ε′), where ε′ = ε + δ log|X| + (2/n) can be made arbitrarily small with n.
Shannon's source coding theorem/1st theorem: for a discrete memoryless i.i.d. source {X_n, n ∈ Z}, we can encode messages of length n into codewords of length l from a code alphabet of size r with arbitrarily small probability of error P_e^(n) ≤ δ iff r^l ≥ 2^{n(H(X)+ε)}, because the number of elements in A_ε^(n) is |A_ε^(n)| ≤ 2^{n(H(X)+ε)} ≤ r^l, so that the number of codewords is larger than the number of typical source words, and P_e^(n) ≤ Pr(Ā_ε^(n)) ≤ δ. However, code construction based on typical sets requires long source sequences (n → ∞), which is inefficient for short sequences.
A source code C for a random variable X is a mapping C: X → C. C(x) denotes the codeword corresponding to x, of length l(x), over an r-ary code alphabet. There are different nested classes of codes:
• A non-singular code is a code that uniquely maps each of the source symbols x ∈ X into a codeword C(x), i.e. x_i ≠ x_j ⇒ C(x_i) ≠ C(x_j); there can't be two equal codewords.
• A uniquely decodable code means that its n-extension is non-singular as well, for all n, i.e. any sequence of n symbols should be unambiguously decodable as well.
• A prefix/instantaneous code means no codeword is a prefix of any other codeword, i.e. you don’t have to wait for the entire sequence to appear in order to decode it; you can decode on-the-fly.
Kraft inequality: for any prefix code over an alphabet of size r, the codeword lengths l(x_1), ..., l(x_k) must satisfy the inequality Σ_{i=1}^k r^{−l(x_i)} ≤ 1. It is also a sufficient condition for the existence of a codeword set with the specified lengths. Finding a prefix code with minimum expected length then becomes an optimization problem: min_{l(x_i)} Σ_{i=1}^k p(x_i) l(x_i) subject to Σ_{i=1}^k r^{−l(x_i)} ≤ 1
For any prefix code: E l(X) ≥ H(X), with equality iff r^{−l(x_i)} = p(x_i) for all i. In case some −log p(x_i) is not an integer, we round it up and additionally have E l(X) ≤ H(X) + 1.
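A minimal sketch, assuming a hypothetical pmf and a binary code alphabet (r = 2), that checks the Kraft inequality for the rounded-up lengths l(x_i) = ⌈−log2 p(x_i)⌉ and verifies H(X) ≤ E l(X) ≤ H(X) + 1:

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical source pmf
lengths = np.ceil(-np.log2(p))          # Shannon code lengths ceil(-log2 p(x_i))

kraft_sum = np.sum(2.0 ** (-lengths))   # must be <= 1 for a binary prefix code to exist
H = -np.sum(p * np.log2(p))             # source entropy
El = np.sum(p * lengths)                # expected codeword length

print(lengths, kraft_sum)               # lengths [2. 2. 3. 4.], Kraft sum 0.6875 <= 1
print(H, El)                            # H ≈ 1.85 <= El = 2.4 <= H + 1
```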
The overhead of the additional bit disappears as n → ∞. Assuming symbols are drawn i.i.d. according to p(x), we have E l(X_1, ..., X_n) < H(X_1, ..., X_n) + 1 ⇒ (1/n) E l(X^n) < H(X) + 1/n.
For non-i.i.d. sequences, we have the bound H(X_1, ..., X_n)/n ≤ (1/n) E l(X^n) < H(X_1, ..., X_n)/n + 1/n. For stationary processes, we have that H(X_1, ..., X_n)/n → H_∞(X), the entropy rate of the process.
Kraft-McMillan inequality: the exact same as the Kraft inequality, but McMillan proved that it also holds for uniquely decodable codes.
• Shannon code: arrange messages in order of decreasing p(x_i). Obtain the code of x_i by calculating the cumulative probability F(x_i) = Σ_{j=1}^{i−1} p(x_j), translating this decimal value to binary, and taking the first l(x_i) = ⌈−log p(x_i)⌉ bits. This code is asymptotically optimal (as n → ∞), but may be much worse than the optimal code for finite n for some particular symbol.
• Huffman code: arrange messages in order of decreasing p(x_i), join the two least probable, and repeat until two messages are left. Working backward, assign a 1 to the combined symbols and a 0 to the non-combined symbols (or vice versa). This code is optimal (E l(C) ≤ E l(C′) for any other code C′), but for small alphabets it is only efficient (meaning no redundancy in the compressed data) for long blocks, the algorithm is slow, and all calculations have to be redone if the block length increases. Concatenating Huffman codes leads to inefficiency, which is only mitigated by redoing the code for the whole block (see the sketch after this list).
• Arithmetic code for sequences: create nested |b, l⟩ intervals (lower end b, length l) according to I_0(x^n) = |0, 1⟩ and I_k(x^n) = |b_{k−1} + F(s_k) l_{k−1}, p(s_k) l_{k−1}⟩, k = 1, ..., n. The sequence x^n = (s_1, ..., s_n) is then represented by the final interval I_n(x^n), and encoded by the binary representation of any number v̂ whose ⌈−log l_n⌉-bit representation lies in the interval. For decoding, we process the symbols FIFO: starting with v̂_1 = v̂, we repeatedly obtain the corresponding symbol via s_k = x_i if F(x_i) ≤ v̂_k < F(x_{i+1}), and obtain the next v̂_{k+1} = (v̂_k − F(s_k)) / p(s_k). Because intervals represent sets of infinite sequences, the decoding could go on forever, and we need to let the decoder know when to stop somehow.
This code can be done on-the-fly (if you have cumulative probabilities, for which a histogram can be built on-the-fly as well, but then earlier symbols will be encoded with a premature distribution, resulting in some erroneous encodings). Arithmetic coding is optimized for whole sequences, because it uses probabilities over the whole sequence. However, arithmetic coding is not optimal, as a Huffman encoding of the whole sequence will always have an expected length smaller than or equal to that of the arithmetic encoding. But Huffman is not efficient for blocks smaller than the sequence.
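A minimal Huffman-coding sketch, assuming a hypothetical four-symbol pmf and a binary code alphabet, that builds the code with a priority queue and checks H(X) ≤ E l(X) < H(X) + 1:

```python
import heapq
import numpy as np

def huffman_code(probs):
    """Binary Huffman code: returns a dict mapping each symbol to its codeword string."""
    # Heap entries: (group probability, tie-breaker, {symbol: codeword-so-far})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                      # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}         # prefix one group with 0 ...
        merged.update({s: "1" + w for s, w in c2.items()})   # ... and the other with 1
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}             # hypothetical pmf
code = huffman_code(probs)
H = -sum(p * np.log2(p) for p in probs.values())
El = sum(p * len(code[s]) for s, p in probs.items())
print(code)    # codeword lengths 1, 2, 3, 3 (the exact bit labels may differ)
print(H, El)   # H ≈ 1.85 <= El = 1.9 < H + 1
```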
