1. Perceptron update rule
y (expected output) - o (output) = error
Update rule: w ← w + α · (y - o) · x, with α: learning rate
Stacking perceptrons: multiple perceptrons with the same input

2. Activation function in NN
Step activation function: o(z) = 1 if z > 0, else 0
z = weighted sum of the outputs o of the previous layer
a^L = output of each hidden layer
MLP: stacked layers of such perceptrons
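A minimal numpy sketch of this update rule; the AND-gate toy data, the names w, b, alpha and the 10-epoch loop are illustrative choices, not part of the notes:

import numpy as np

# Toy data (AND gate) just to exercise the rule; targets y are 0/1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(2)          # weights
b = 0.0                  # bias
alpha = 0.1              # learning rate

def step(z):
    return 1.0 if z > 0 else 0.0   # step activation

for epoch in range(10):
    for x_i, y_i in zip(X, y):
        o = step(w @ x_i + b)      # perceptron output
        error = y_i - o            # error = expected output - output
        w += alpha * error * x_i   # update rule: w <- w + alpha*(y - o)*x
        b += alpha * error

print(w, b)  # learned weights and bias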
3. Back-Propagation
Updating parameters:
Weights: W ← W - α · ∂J/∂W
Bias: b ← b - α · ∂J/∂b

4. Activation Functions
Linear: o(z) = z → Regression (output layer), Range (-inf, inf)
Sigmoid: o(z) = 1/(1 + e^(-z)) → Binary classif. (output), Range (0, 1)
Tanh: o(z) = tanh(z) → Binary classif. (output), Range (-1, 1)
Softmax: o(z_i) = e^(z_i) / Σ_j e^(z_j) → Multiclass classif. (output), Range (0, 1)
ReLU: o(z) = max(0, z) → Hidden layer, Range [0, inf)
Dying ReLU: the gradient on the negative side is zero, causing some neurons to remain inactive and never update → dead neurons that do not contribute to learning
Leaky ReLU: o(z) = max(0.1z, z), Range (-inf, inf)
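The activation functions above, sketched in numpy (function names are my own):

import numpy as np

def linear(z):   return z                          # range (-inf, inf)
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)
def tanh(z):     return np.tanh(z)                 # range (-1, 1)
def relu(z):     return np.maximum(0.0, z)         # range [0, inf)
def leaky_relu(z, slope=0.1):
    return np.maximum(slope * z, z)                # o(z) = max(0.1z, z)
def softmax(z):
    e = np.exp(z - np.max(z))                      # shift for numerical stability
    return e / e.sum()                             # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), softmax(z))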
5. Loss
Mean Squared Error (MSE/L2): L = (1/N) Σ (y - p)^2 → Regression
Mean Absolute Error (MAE): L = (1/N) Σ |y - p| → Regression
Binary Cross-Entropy: L = -[y log p + (1 - y) log(1 - p)] → Binary classification; derivative becomes linear
p: output of the network, y: target output
Hinge loss: L = max(0, 1 - y·z) → Binary classification; here, margin = 1
y: target output (-1 or 1), z: output of the network without any activation function
Categorical Cross-Entropy: L = -Σ_i y_i log p_i → Multiclass classification; compares the output of the NN to the one-hot-encoded (OHE) target
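A numpy sketch of the losses above; function names and the toy inputs are illustrative:

import numpy as np

def mse(y, p):   return np.mean((y - p) ** 2)          # regression
def mae(y, p):   return np.mean(np.abs(y - p))         # regression
def bce(y, p):   # binary cross-entropy, y in {0,1}, p = network output in (0,1)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
def hinge(y, z): # y in {-1,1}, z = output without activation, margin = 1
    return np.mean(np.maximum(0.0, 1.0 - y * z))
def cce(y_ohe, p):  # categorical cross-entropy, y_ohe = one-hot targets
    return -np.mean(np.sum(y_ohe * np.log(p), axis=1))

print(mse(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
print(hinge(np.array([1, -1]), np.array([0.8, -0.3])))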
6. Weights Initialization
Variance given by:
Xavier: tanh, sigmoid → Var(w) = 1/n
He: ReLU → Var(w) = 2/n
n: number of incoming neurons
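A small numpy sketch of both initialization schemes, assuming normal-distributed weights (a uniform variant with the same variance is also common):

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Var(w) = 1/n_in, suited to tanh/sigmoid layers
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Var(w) = 2/n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W1 = xavier_init(256, 128)
W2 = he_init(128, 64)
print(W1.var(), W2.var())   # roughly 1/256 and 2/128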
7. Gradient Descent
Update rule: θ ← θ - α · ∇_θ J(θ)
θ: parameter to update (e.g. weights), α: learning rate, ∇_θ J: gradient of J, J: loss function
1. Batch/Vanilla GD: updates θ by calculating the gradients using the whole dataset
2. Mini-Batch GD: updates θ by calculating the gradient using randomly selected examples (a mini-batch)
3. Stochastic GD: updates the parameters θ by calculating the gradients using every single example individually (one update per example)
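A minimal sketch of mini-batch gradient descent on a toy linear-regression problem; the data, batch size and learning rate are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Toy data: y = 3x + noise.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

theta, alpha, batch_size = np.zeros(1), 0.1, 32

for epoch in range(50):
    idx = rng.permutation(len(X))                    # shuffle, then take random batches
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        pred = X[b] @ theta
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)   # gradient of MSE on the batch
        theta -= alpha * grad                        # theta <- theta - alpha * grad J
print(theta)   # close to [3.0]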
8. Optimization
Exponentially weighted average (EWA): v_t = β · v_(t-1) + (1 - β) · x_t
v: filtered version of the gradients x, β: filtering parameter, x: gradients at time t
Larger β → includes a longer history of data, smoother curve
Smaller β → more weight to recent data, shorter history, less smoothing

Momentum-based optimizers: v ← β·v + (1 - β)·∇J(θ), θ ← θ - α·v
α: learning rate, β: usually 0.9
With SGD: to keep the gradient step equivalent to the one in SGD, the learning rate is scaled by 1/(1 - β)

Nesterov Accelerated Momentum: the gradient term is computed from the look-ahead point θ + uv. The gradient always points in the right direction, the momentum may not; if not → the gradient can still 'go back'

Adaptive learning rate optimizers: adapt the learning rate to individual parameters
Adagrad: uses the sum of squared gradients. Problem: decreasing learning rates because of the square root of the accumulated squared gradients in the denominator, leading to slow/impossible learning
RMSprop: overcomes the Adagrad problem by using a moving average: only recent squared gradients matter; those from long ago are forgotten. Difference with Adagrad: g_t is measured by an exponentially decaying average and not the sum of gradients. Default: β (EWA factor) = 0.9, α = 0.001, v = weighted average. Problem: lacks momentum, initial values can be biased
Adam: combines momentum (first moment) and RMSprop-style scaling (second moment) with bias correction. Default: β1 = 0.9, β2 = 0.999, ε = 1e-8
Adadelta: extends Adagrad
Adamax: extends Adam
Nadam: combines Adam and Nesterov

9. Hyperparameters:
Learning rate
Network size
Regularization parameters
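Hedged single-step sketches of the momentum, RMSprop and Adam updates described above; the function names and the exact conventions (EWA form of momentum) are my choices:

import numpy as np

def momentum_step(theta, grad, v, alpha=0.001, beta=0.9):
    # EWA of gradients: v <- beta*v + (1-beta)*grad, then theta <- theta - alpha*v
    v = beta * v + (1 - beta) * grad
    return theta - alpha * v, v

def rmsprop_step(theta, grad, s, alpha=0.001, beta=0.9, eps=1e-8):
    # EWA of squared gradients in the denominator (instead of Adagrad's sum)
    s = beta * s + (1 - beta) * grad ** 2
    return theta - alpha * grad / (np.sqrt(s) + eps), s

def adam_step(theta, grad, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # momentum (m) + RMSprop (v), with bias correction for the early steps
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, v = np.array([1.0]), np.zeros(1)
theta, v = momentum_step(theta, np.array([0.5]), v)
print(theta)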
10. Regularization: adding penalties against complexity
Constraints on weights
Additional terms added to the objective (loss) function: J~(θ) = J(θ) + λ · Ω(w), with λ weighting the penalty
J: objective function (loss), θ: parameters = weights, Ω: penalty
L1: absolute value of magnitude, Ω(w) = Σ |w_i|; can lead to sparsity in the weights → for feature selection
L2: weight decay, squared magnitude, Ω(w) = Σ w_i^2 → weights get close to but never reach 0
Dropout: randomly remove neurons with a certain probability to prevent co-adaptation
Early stopping: keep track of the validation error and stop training when it increases (here the bias-variance trade-off is reflected)
Data augmentation: increase diversity and size of the training set (translation, crop, scaling, rotation, flipping, adding noise)
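A small sketch of an L2 penalty term and inverted dropout; lambda, the dropout rate p and the helper names are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def l2_penalized_loss(loss, weights, lam=1e-4):
    # J~ = J + lambda * Omega(w), with Omega(w) = sum of squared weights (L2)
    return loss + lam * sum(np.sum(w ** 2) for w in weights)

def dropout(a, p=0.5, training=True):
    # Randomly zero activations with probability p; rescale (inverted dropout)
    if not training:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

a = rng.normal(size=(4, 8))
print(dropout(a, p=0.5).round(2))
print(l2_penalized_loss(1.0, [rng.normal(size=(8, 4))]))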
11. Batch Normalization: stabilizes learning by normalizing layer inputs, reducing the Covariate Shift problem (the distribution of a layer's input changes during training → slows training down) → Solution: normalize the mini-batch.
1. Calculate the mean and variance of the inputs per mini-batch
2. Normalize the inputs by subtracting the mean and dividing by the square root of the variance: x̂ = (x - μ) / sqrt(σ^2 + ε)
3. Scale and shift the normalized inputs using learnable parameters (gamma and beta): y = γ · x̂ + β
4. Apply the activation function to the normalized and transformed inputs
μ: mean, σ^2: variance, γ & β: learnable parameters
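A numpy sketch of the batch-norm forward pass (steps 1-3 above); epsilon and the function name are my own:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Steps 1-3; the activation is applied afterwards.
    mu = x.mean(axis=0)                    # 1. mini-batch mean
    var = x.var(axis=0)                    #    mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2. normalize
    return gamma * x_hat + beta            # 3. scale and shift (learnable gamma, beta)

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature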
12. CNN
Calculate output size (activation map dimensions): W_out = (W - F + 2P)/S + 1 (and likewise for H)
W: width, H: height, F: width/height of the filter, P: padding, S: stride

CNN properties:
Sparse connectivity: the conv. kernel is much smaller than the input → improves computational efficiency and allows for localized feature detection
Parameter sharing: kernel coefficients are identical for each input location → reduces the number of parameters and enables translation invariance
Equivariant representation: the convolution value covaries with the input value → provides robustness to transformations and improves data efficiency

CNN elements:
Downsampling: 1. Stride: defines the amount of movement over the input. 2. Pooling: replaces the output of the NN at a certain location with a summary statistic of the nearby outputs. 2.1: Max Pooling → outputs the maximum value from the input window. 2.2: Average Pooling → outputs the average value from the input window. 2.3: L2 norm → reduces spatial dimensions while retaining important features
Padding: amount of pixels added around the input (e.g. an image) → maintains spatial dimensions, prevents shrinking of the output size
Dilation: expands the receptive field by inserting spaces between filter elements → captures larger context without increasing filter size

CNN Building Blocks:
1. Convolution Layer → kernel/filter is passed over the image
2. Activation Layer → introduces non-linearity (to allow backpropagation)
3. Downsampling → reducing input size
4. Fully Connected Layer → traditional MLP structure
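A small sketch of the output-size formula; the example input and filter sizes are illustrative:

def conv_output_size(w, f, p, s):
    # (W - F + 2P) / S + 1, applied per spatial dimension
    return (w - f + 2 * p) // s + 1

# 32x32 input, 5x5 filter, padding 2, stride 1 -> 32 (dimensions preserved)
print(conv_output_size(32, 5, 2, 1))
# 32x32 input, 3x3 filter, no padding, stride 2 -> 15
print(conv_output_size(32, 3, 0, 2))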
13. Recurrent Networks (RNN): process sequential data by maintaining hidden states across time steps to remember previous information (loops in the network)
Update rule: h_t = tanh(W_h · h_(t-1) + W_x · x_t + b)
Training: uses Backpropagation Through Time (BPTT), which extends backpropagation to sequences
Limitations: vanishing/exploding gradients, difficulty in capturing long-term dependencies, slow training
Solution: gradient clipping (force gradients to a specific min/max) or LSTM/GRU
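A minimal numpy sketch of one vanilla-RNN step using the update rule above; dimensions and weight scales are illustrative:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

# Parameters shared across time steps.
W_x = rng.normal(0, 0.1, (hidden_dim, input_dim))
W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b): the hidden state carries the loop
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 inputs
    h = rnn_step(x_t, h)
print(h.shape)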
14. Bidirectional RNNs: process data in both forward and backward directions for more context. 2 RNNs: one processes the sequence from start to end (forward), and the other from end to start (backward)
Limitations: computational cost, longer training time, higher memory consumption

15. RNN Architectures
Many-to-Many: map a sequence input x to a corresponding sequence output o → video classification where each frame is an input and the output is a label for each frame
Many-to-One: map a sequence input to a single output → sentiment analysis where a sequence of words (a sentence) is classified into a sentiment label
One-to-Many: single input to sequence output → image captioning where a single image is described with a sequence of words
Seq2Seq: encoder-decoder to transform an input sequence into an output sequence → machine translation where a sentence in one language is translated into another language

15.1 Seq2Seq: encoder-decoder
Transforms an input sequence into a vector, and the vector into an output sequence
Encoder: 1. Generates a vector per time step (h_t). 2. The last vector can be assumed to summarize the sequence
Decoder: 1. The first hidden state of the decoder is set to the last hidden state of the encoder (the first input is normally a special character). 2. At each time step, generates a hidden vector (h_t) and creates an output

16. Attention: allows the model to focus on relevant parts of the input sequence when predicting each part of the output sequence → the encoder passes all hidden states to the decoder (as a matrix)
1. Create a representation using the encoder states. 2. Attend to the states in the encoder that are most similar to the state in the decoder. 3. Create a final representation (context vector) using all relevant information (see the sketch below)
Limitations in RNNs: 1. Relevance of information might depend on future inputs. 2. Recurrence prevents parallelization of computations
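A hedged sketch of steps 1-3 above as simple dot-product attention over the encoder states; the scoring function and dimensions are my own assumptions (other score functions exist):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    # 1. score each encoder state against the current decoder state (dot product)
    scores = encoder_states @ decoder_state
    # 2. attend: softmax turns scores into weights over the input positions
    weights = softmax(scores)
    # 3. context vector: weighted sum of all encoder hidden states
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 16))     # 6 time steps, hidden size 16
dec = rng.normal(size=16)
context, w = attention_context(dec, enc)
print(context.shape, w.round(2))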
17. Gated units: improve RNNs by managing long-term dependencies & mitigating vanishing gradient problems
Closed gate: multiply data by 0 → erase content
Open gate: multiply data by 1 → preserve content
Examples: LSTM, GRU

18. LSTM: the cell state is updated with linear (additive) algebra → prevents gradient issues
Cell state: long-term memory, not modified directly
Hidden state: short-term memory, modified by weights
1. Forget gate: determines the relevant parts of the cell state (% to remember). Input: the previous hidden state and the new input data
2. Input gate: decides what new information should be stored in the cell. Input: current input and previous hidden state. Output: value between 0 & 1 (= how much of the input to let into the cell state), which will be merged with the 'old' memory from the forget gate, creating a new cell state (updating the cell state)
3. Output gate: controls whether to output information from the current cell state. Input: current input and previous hidden state. Output: value between 0 & 1 (= how much of the current cell state to output)
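A minimal numpy sketch of one LSTM step with the three gates above; the weight layout (one matrix per gate acting on [h_prev, x_t]) and the dimensions are illustrative:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: % of old cell state to remember
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new info to let in
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell content
    c = f * c_prev + i * c_tilde      # additive (linear) cell-state update
    o = sigmoid(W_o @ z + b_o)        # output gate: how much of the cell to expose
    h = o * np.tanh(c)                # new hidden state (short-term memory)
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c)
print(h.shape, c.shape)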
19. GRU:
Reset gate (r): how much of the previous hidden state to forget in the current state. Input: current input & previous hidden state. Output: between 0 and 1 (= how much of the previous hidden state to forget)
Update gate (z): how much information to pass through from previous time steps. Input: current input and previous hidden state. Output: between 0 and 1 (= how much of the previous hidden state to keep & how much of the new input to let through) → combination of the forget gate and input gate in the LSTM
LSTM: preferred for tasks with very long sequences and complex dependencies
GRU: preferred for tasks requiring faster training and efficiency
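A minimal numpy sketch of one GRU step; the blending convention h = (1-z)*h_prev + z*h_tilde is one common choice, and the dimensions are illustrative:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_r, W_z, W_h = (rng.normal(0, 0.1, (hidden_dim, hidden_dim + input_dim))
                 for _ in range(3))
b_r = b_z = b_h = np.zeros(hidden_dim)

def gru_step(x_t, h_prev):
    zx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ zx + b_r)    # reset gate: forget part of h_prev
    z = sigmoid(W_z @ zx + b_z)    # update gate: keep old state vs take new input
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate state
    return (1 - z) * h_prev + z * h_tilde   # blend old state and candidate

h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim))
print(h.shape)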
20. NLP
Embeddings: words as vectors, used as input to an RNN
One-Hot Embeddings: sparse, high-dimensional, hard-coded
Word Embeddings: dense vectors capturing semantic meaning; lower-dimensional, learned from data

21. Self-Attention: attention over a single sequence; it allows the model to let a sequence learn information about itself → transformers
Calculating self-attention: query vector (e.g. the input) * key vector, followed by a softmax function to obtain the attention weights
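A numpy sketch of scaled dot-product self-attention over a single sequence; the projection matrices and dimensions are illustrative:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project the same sequence X into queries, keys and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # multiply queries and keys, scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # softmax -> attention weights
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 8)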
22. Transformers: a model that uses attention to boost training speed → handles sequential data without recurrence, allowing for parallel processing
Encoder: Self-Attention Layer, Feed-Forward Layer
Decoder: Self-Attention Layer, Encoder-Decoder Attention Layer, Feed-Forward Layer
Input Embedding: words mapped to continuous vectors to represent meaning
Positional Encoding: adds positional information to the embeddings
Encoder Layer: maps the input sequence (embeddings with positional encodings) into an abstract representation (layers repeated N times)
  Multi-Headed Attention: associates words in the input with each other
    Queries (Q): target tokens for which attention weights are computed
    Keys (K): tokens used to compute attention scores for the queries
    Values (V): content associated with each token in the input sequence
    Computation: multiply queries and keys, divide by the square root of the dimension, apply softmax, then multiply by the values: Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
  Residual Connection: adds the output of attention to the input
  Layer Normalization: stabilizes and normalizes the output
  Pointwise Feed-Forward Network: adds non-linearity and processing
Decoder Layer: generates text sequences using previous outputs and encoder inputs
  Embedding Layer: converts words to vectors
  Positional Encoding Layer: adds positional information
  Multi-Headed Attention 1: considers only past tokens (masking)
  Multi-Headed Attention 2: matches the encoder input with the decoder input
  Residual Connection: adds the output of attention to the input
  Layer Normalization: stabilizes and normalizes the output
  Feed-Forward Network: adds non-linearity and processing
Output Layer:
  Linear Classifier: projects the decoder output to the vocabulary size
  Softmax Layer: produces output probabilities
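The notes only say that positional encoding adds position information to the embeddings; the sketch below assumes the sinusoidal form used in the original Transformer paper:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 16))   # 10 tokens, d_model = 16
x = embeddings + positional_encoding(10, 16)    # positions added to the embeddings
print(x.shape)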
23. Vision Transformers (ViT): apply the transformer architecture to image data, leveraging self-attention mechanisms to capture global image features
1. Image Division: divide the image into fixed-size patches (e.g., 16x16)
2. Patch Embedding: flatten and project each patch to a vector, add positional encoding
3. Transformer Encoder: Multi-Headed Self-Attention captures patch relationships; Residual Connections and Layer Normalization stabilize the output; a Feed-Forward Network adds non-linearity
4. Classification: use a special CLS token for the final classification with an MLP head
Pros: captures global context, flexible, state-of-the-art performance with sufficient data
Applications: image classification, object detection, segmentation

Exercises:
To do: 1. Multiply the kernel with the input, starting at the upper-left corner, and move with stride steps (you get a 3x3 matrix). 2. Max pooling layer (2,2) → take the maximum value of each window in the matrix above: Max(3, 0, 0, 2) = 3, Max(0, 1, 2, 0) = 2, Max(0, 2, 1, 0) = 2, Max(2, 0, 0, 3) = 3. So: [[3, 2], [2, 3]] (see the pooling sketch below).
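A small sketch of step 2 of the exercise: 2x2 max pooling with stride 1 over a 3x3 feature map gives a 2x2 output. The matrix below is a made-up example; the exercise's own convolution result is not reproduced here.

import numpy as np

def max_pool2d(x, size=2, stride=1):
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()     # summary statistic: maximum of the window
    return out

# Hypothetical 3x3 convolution output (not the matrix from the exercise).
feature_map = np.array([[3., 0., 2.],
                        [0., 2., 0.],
                        [2., 0., 3.]])
print(max_pool2d(feature_map))   # 2x2 result: [[3., 2.], [2., 3.]]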