Summary Deep Learning cheat sheet

Cheat sheet for Deep Learning 2019/2020

Introduction:
The recent popularity of deep learning is due to: 1. the availability of large-scale datasets, 2. the availability of powerful hardware.

Motivations for Deep Learning:
Curse of dimensionality: many AI tasks have extremely high dimensionality in the raw input, where classical ML does not work well. You then need feature engineering: computing meaningful features by hand-coding them. Learning features: in DL, features are learned, driven by the learning objective. Pro: the important information only lives in a few dimensions of the space; you figure those out and throw the rest away.

HC 1: Multilayer Perceptron
Perceptron: the perceptron is a simple linear model.
MLP: a generalization of perceptrons into a model with layers. A single perceptron-like unit is a node, connected to other units via weights. Neurons arranged in layers compute function compositions. Inputs / hidden units are vectors, weights are matrices. What is applied to the input x: matrix multiplication with W, elementwise activation function f, matrix multiplication with U, elementwise activation function g (see the sketch below).
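A minimal sketch of this forward computation in NumPy (not from the original sheet): a two-layer MLP with a tanh hidden activation as f and a linear g; the sizes, names and biases are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes: 4 inputs, 8 hidden units, 3 outputs.
    x = rng.normal(size=4)        # input vector
    W = rng.normal(size=(8, 4))   # first weight matrix
    b = np.zeros(8)               # first bias
    U = rng.normal(size=(3, 8))   # second weight matrix
    c = np.zeros(3)               # second bias

    h = np.tanh(W @ x + b)        # matrix multiplication with W, elementwise f
    y = U @ h + c                 # matrix multiplication with U, linear g here
    print(y)                      # network output, a length-3 vector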
Terminology:
Activation of a unit: weighted sum followed by an activation function.
Weight matrix: cell (i, j) holds the weight of the connection from the j-th input to the i-th output.
Parameters: the set of all network weights.
Elementwise function: applied independently to each unit in a layer.
Feedforward network: a network where outputs are not fed back in as inputs.
Cost/loss function: function used to drive the network towards a good solution; it expresses how wrong the current network is.
Backpropagation: computation of the gradient of the cost function with respect to the network parameters. It goes back into the model and makes adaptations.

Linearity: f(x) = Wx + b. Linear models are limited in the type of functions they can represent, e.g. XOR. A non-linear activation function is crucial for an MLP to be able to represent non-linear functions.

Learning of the model:
The parameters of a neural network are learned via gradient descent, given a cost function, the functional form of the model, and an optimization procedure.

Cost functions:
Loss functions for neural networks are non-convex. Use the same principles as with linear models. Common choices are MSE, categorical cross-entropy and binary cross-entropy, carefully matched to the output unit activation functions.

Output units:
Linear: models normally-distributed targets. Affine transformation: Wh + b. Used combined with the MSE loss, where y_n is the true target and z_n the predicted value:
l(y, z) = Σ_{n=1}^{N} (y_n - z_n)^2

Sigmoid: models binary / Bernoulli-distributed targets. Affine transformation followed by the logistic (sigmoid) function: sigmoid(Wh + b). The affine function produces the logit value z = Wh + b; the logistic function then translates it into a probability:
sigmoid(z) = 1 / (1 + exp(-z))
Used with Binary Cross-Entropy (standard definition), where y_n is the binary target of example n and p_n the predicted probability of class 1 for example n:
l(p, y) = -Σ_{n=1}^{N} [ y_n * log(p_n) + (1 - y_n) * log(1 - p_n) ]
The log and the sigmoid can be combined into a single, numerically more stable computation: apply the sigmoid directly inside the loss function instead of first computing the sigmoid and then the binary cross-entropy (see the sketch below):
l(z, y) = -Σ_{n=1}^{N} [ y_n * log(σ(z_n)) + (1 - y_n) * log(1 - σ(z_n)) ]
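As an illustration of that trick (a sketch, not part of the original sheet): binary cross-entropy on logits can be rewritten with the softplus function, softplus(z) = log(1 + exp(z)), since -[y * log σ(z) + (1 - y) * log(1 - σ(z))] = softplus(z) - y * z.

    import numpy as np

    def softplus(z):
        # log(1 + exp(z)), computed without overflow for large |z|
        return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

    def bce_with_logits(z, y):
        # Numerically stable binary cross-entropy, summed over examples:
        # -sum_n [ y_n*log(sigmoid(z_n)) + (1 - y_n)*log(1 - sigmoid(z_n)) ]
        # = sum_n [ softplus(z_n) - y_n*z_n ]
        return np.sum(softplus(z) - y * z)

    z = np.array([-20.0, 0.0, 20.0])   # illustrative logits
    y = np.array([0.0, 1.0, 1.0])      # binary targets
    print(bce_with_logits(z, y))       # stays finite even for extreme logits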
Softmax: models categorical / Multinoulli targets. Affine transformation followed by a softmax: softmax(Wh + b). Converts logits to a categorical probability distribution, i.e. it makes the outputs add up to 1:
softmax(z_i) = exp(z_i) / Σ_{k=1}^{N} exp(z_k)
Used combined with Categorical Cross-Entropy, which generalizes binary cross-entropy to categorical targets, where y_n is the index of the true class for example n and p_{n,m} is the predicted probability of class m for example n:
l(p, y) = -Σ_{n=1}^{N} log(p_{n,y_n})
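A small sketch of this softmax / categorical cross-entropy pair (sizes and values are illustrative, not from the sheet), using the usual max-subtraction trick for numerical stability.

    import numpy as np

    def softmax(z):
        # Subtracting the row-wise max changes nothing mathematically but avoids overflow.
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def categorical_cross_entropy(p, y):
        # p: (N, M) predicted probabilities, y: (N,) integer class indices
        # l(p, y) = -sum_n log p[n, y_n]
        return -np.sum(np.log(p[np.arange(len(y)), y]))

    logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # 2 examples, 3 classes
    y = np.array([0, 2])                   # true class indices
    p = softmax(logits)
    print(p.sum(axis=1))                   # each row sums to 1
    print(categorical_cross_entropy(p, y))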
Hidden units:
Hidden units encode internally computed features, based on the features in the previous layer. Computed using an affine transformation followed by a non-linear activation function:
h = g(Wx + b)

Activation functions for the hidden layer:
Rectified linear (relu): a good default choice. Horizontal below 0, diagonal after that. Useful gradient as long as h is positive. Initialize the bias b to small positive numbers. Various generalizations and variants of relu exist.
relu(h) = max(0, h)
Sigmoid and hyperbolic tangent (tanh): tanh is a rescaled sigmoid. Preferred over the standard sigmoid for hidden layers. Often used in RNNs. Not ideal for MLPs due to saturation issues.
tanh(z) = 2 * sigmoid(2z) - 1

Width vs. depth:
A good trade-off between width and depth can be found empirically, on validation data. In theory you could do the same as a deep NN in one layer.
Universal approximation theorem: a feedforward network with a single hidden layer, a finite number of units and a non-linear activation function can approximate continuous functions on a closed and bounded subset of R^n to any desired degree of accuracy. Caveats: 1. It does not mean those parameters are learnable from data. 2. It says nothing about the size of the network.
DL relies on the intuition that the function we want to learn consists of multiple smaller functions; deep networks tend to work better. We put a bias into our models when creating deep layers: it is an inductive bias, because we are suggesting to the model that it is best represented as sequential affine transformations.
HC 2: Convolutional Neural Network
Uses convolution instead of matrix multiplication (which is what MLPs use). Specialized for processing data with a grid-like topology, like images or time series.
Convolution: a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other.
s(t) = (x * w)(t)
s = the feature map (output), x = the input, w = the kernel with weights.
Kernel: the filter that moves over the input and creates the output. Multiply the values of the input that are under the filter with the values in the kernel.
Discrete convolution:
s(t) = (x * w)(t) = Σ_{a=-∞}^{∞} x(a) w(t - a)
2D discrete convolution:
s(i, j) = Σ_m Σ_n X(m, n) w(i - m, j - n)
Convolution:
S(i, j) = (K * I)(i, j) = Σ_m Σ_n I(i - m, j - n) K(m, n)
Cross-correlation:
S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)
With convolution, f * g is the same as g * f; with cross-correlation this is not the case, and you mirror the formula.
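A naive "valid" 2D cross-correlation in NumPy (a sketch; array sizes are illustrative and not from the sheet). This is the operation most DL libraries implement under the name convolution.

    import numpy as np

    def cross_correlate2d(image, kernel):
        # Valid cross-correlation: S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n)
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
    kernel = np.array([[1.0, 0.0],
                       [0.0, -1.0]])                   # toy 2x2 filter
    print(cross_correlate2d(image, kernel).shape)      # (4, 4) feature map

Flipping the kernel first (kernel[::-1, ::-1]) would give the true convolution, which is why convolution commutes and cross-correlation does not.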
Properties of convolution:
Sparse connectivity: the kernel is much smaller than the input. In a fully connected layer each input is connected to each node in the hidden layer; here a single filter is applied to different subsets of the input, hence fewer connections.
Parameter sharing: the kernel coefficients are identical for each input location. In a fully connected layer each parameter is independent; in a convolutional layer they are all the same. A convolutional model performs the same as a fully connected one but with fewer parameters; make the convolutional model deeper and it outperforms both, still with fewer parameters.
Equivariant representations: the convolution value covaries with the input value. When the input changes, the output changes in the same way. Convolution is equivariant to translation, but not to rotation or scale.

Convolutional Neural Networks (CNNs): have multiple layers of convolution: input layer → filter → output layer → filter → output layer → filter, etc. Inspired by neuroscience. The number of filters equals the number of feature maps in the output. The more layers, the more complex the patterns you can detect. The size of the filter depends on the size of the feature map.

Pooling: a pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. Max pooling: taking the maximum activation. Other options: average, L2 norm, weighted average. Typically a convolution operation is followed by an activation function and then a pooling function.
Stride: the jump you make when the filter passes over the input. You can skip a couple of positions and make the stride bigger than 1, which is less computationally expensive.
Padding: add 0's on the borders so there is more room; useful if you want the output to keep the same size.
W_out = (W_in - f_w + 2P)/S + 1
H_out = (H_in - f_h + 2P)/S + 1
W_in, H_in: input width and height; f_w, f_h: width and height of the filter; P: padding; S: stride.
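A quick worked example of the output-size formula (the numbers are illustrative, not from the sheet): a 32x32 input with a 5x5 filter, padding 2 and stride 1 keeps the spatial size at 32, since (32 - 5 + 2*2)/1 + 1 = 32.

    def conv_output_size(w_in, f_w, padding, stride):
        # W_out = (W_in - f_w + 2P) / S + 1
        return (w_in - f_w + 2 * padding) // stride + 1

    print(conv_output_size(32, 5, padding=2, stride=1))   # 32 ("same" size)
    print(conv_output_size(32, 3, padding=0, stride=2))   # 15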
Architectural variations:
Skip connections: skipping some layers and feeding the output of one layer directly to another layer, skipping a few in between. More information, because it passes on lower-level features that no longer exist in the higher levels.
Concatenating layers: two inputs go through different parts of the network and are then combined into a single layer, which is used from there on. E.g. matching spoken numbers with written numbers.
Multiple tasks: one input eventually gets split into multiple outputs, to predict multiple things from the same input.

Forward propagation / pass: computing the scalar cost of the network based on the input, target and weights. Compare the prediction with the target; the difference is the loss.
Backpropagation: computing the gradient of this computation with respect to the network parameters, starting from the value of the loss. Once this gradient is computed, it is used to update the model parameters. Backpropagation is based on the chain rule of calculus.
Chain rule: computes the derivative of a function composition (see the worked sketch below):
(f ∘ g)' = (f' ∘ g) * g'
f' = the derivative of f; f ∘ g = function composition.
Function composition: takes functions f and g and produces a function h such that h(x) = g(f(x)), i.e. g is applied to the result of applying f to x. Backpropagation applies this rule recursively to the computations encoded in the network.
Automatic differentiation: a generalization of backpropagation.
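A small worked sketch of the chain rule on one sigmoid unit with a squared loss (values and names are mine, not from the sheet): the gradient flows backwards through each elementary step.

    import numpy as np

    # Forward pass: L = (sigmoid(w*x + b) - y)^2
    x, y = 2.0, 1.0                 # input and target (illustrative)
    w, b = 0.5, -1.0                # parameters

    z = w * x + b                   # affine step
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid step
    L = (p - y) ** 2                # loss step

    # Backward pass: chain rule applied step by step, from the loss back to w and b.
    dL_dp = 2.0 * (p - y)
    dp_dz = p * (1.0 - p)           # derivative of the sigmoid
    dL_dz = dL_dp * dp_dz
    dL_dw = dL_dz * x               # dz/dw = x
    dL_db = dL_dz * 1.0             # dz/db = 1

    print(L, dL_dw, dL_db)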
HC3: Regularization
Regularization: any strategy whose purpose is to reduce test error (e.g. overfitting), possibly at the expense of training error.
Norm penalties:
L2 penalty: shrinks the weights towards zero with the same factor. Adds a penalty equal to the square of the magnitude of the coefficients. Will not produce sparse models.
Ĵ(w; X, y) = J(w; X, y) + λ Σ_{i=1}^{M} w_i^2
L1 penalty: creates sparsity among the weights. Adds a penalty equal to the absolute value of the magnitude of the coefficients, which limits their size. Can create sparsity because it sets some of the weights to exactly zero; it is therefore also a feature selection algorithm.
Ĵ(w; X, y) = J(w; X, y) + λ Σ_{i=1}^{M} |w_i|
Early stopping: a form of regularization. The validation error first goes down and later goes up again; stop at the lowest point. Don't actually stop: save the model weights regularly, then afterwards choose the weights with the lowest validation error (see the sketch below).
Final evaluation metric: often we care less about the loss function and more about the final evaluation metric (accuracy, F1, etc.). Lowest loss ≠ lowest metric of interest.
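A schematic "save the best weights" loop in that spirit; the weights dict and the U-shaped validation curve are toy placeholders, not real training code.

    import copy

    weights = {"w": 0.0}                                  # stand-in for model parameters
    val_curve = [0.9, 0.6, 0.45, 0.40, 0.42, 0.5, 0.65]   # illustrative validation errors

    best_error = float("inf")
    best_weights = None
    for epoch, val_error in enumerate(val_curve):
        weights["w"] += 1.0                               # pretend one epoch of training happened
        if val_error < best_error:                        # snapshot whenever validation improves
            best_error = val_error
            best_weights = copy.deepcopy(weights)

    print(best_error, best_weights)                       # weights from the lowest-validation-error epoch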
Data augmentation: adding fake data using the data you already have. E.g. for images: translation, cropping, rescaling, rotation, flipping, adding noise (not all are always appropriate, e.g. d / p). E.g. for speech recognition: manipulating pitch or speed, adding environmental or random noise, frequency / time masking, vocal tract length perturbation (VTLP).
Multitask learning: when part of a model is shared across tasks, that part of the model is more constrained toward good values.
Ensembling:
Model averaging: combine multiple models by averaging their outputs.
Bagging: randomization: k models trained by sampling the data with replacement.
Why does it work: 1. Different models make different errors. 2. Less correlated errors → a better-working ensemble: it is unlikely that the models make the same mistake on the same example. Errors are lower for average and majority-vote models. If classifiers m1, m2 and m3 each have a probability p of making a mistake, the probability p_maj that the majority ensemble makes a mistake is
p_maj = Σ_{k=2}^{3} Binom(k; 3, p)
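A tiny check of that majority-vote formula (p = 0.3 is an illustrative value): with three independent classifiers, the majority errs only when at least two of them err, which gives about 0.216 < 0.3.

    from math import comb

    def p_majority_error(p, n=3):
        # P(majority of n independent classifiers is wrong), each wrong with probability p:
        # sum over k > n/2 of Binom(k; n, p)
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

    print(p_majority_error(0.3))   # ~0.216, lower than the individual error rate of 0.3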
Dropout: randomly dropping nodes during training; effectively ensembling all networks with a subset of units missing. Makes bagging practical for ensembles of very many large neural networks, since larger ensembles are computationally expensive.
Dropout at training:
- For each input unit, sample a binary variable from a Bernoulli distribution with probability p_input.
- Multiply the unit activation with the sampled variable.
- For each hidden unit, sample a binary variable from a Bernoulli distribution with probability p_hidden.
- Multiply the unit activation with the sampled variable.
Dropout mask: usually you don't do this one unit at a time, but with a mask: a vector that is applied to a layer by elementwise multiplication.
Dropout at testing: applying dropout at test time multiple times and averaging the outcomes is very computationally expensive. This can be fixed:
- Multiply the outgoing weights of each unit by the probability of keeping it (don't actually drop out; just use p_input and p_hidden).
- This maintains the expected total input to a unit at test time similar to the expected total input to that unit at train time.
You could also do the inverse scaling during training and no scaling at testing (see the sketch below). This approximates taking the geometric mean of the output of all sub-networks:
p_ensemble(y|x) ∝ ( Π_μ p(y|x, μ) )^(1/2^d)
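A sketch of that "inverse scaling during training" variant, often called inverted dropout (layer size and keep-probability are illustrative).

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(h, p_keep):
        # Sample a binary mask from Bernoulli(p_keep) and rescale by 1/p_keep
        # during training, so no scaling is needed at test time.
        mask = rng.binomial(1, p_keep, size=h.shape)
        return h * mask / p_keep

    def dropout_test(h):
        # With inverted dropout, the test-time forward pass is just the identity.
        return h

    h = rng.normal(size=8)               # activations of one hidden layer
    print(dropout_train(h, p_keep=0.5))  # roughly half the units zeroed, the rest scaled up
    print(dropout_test(h))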
Differences between dropout and ensembling randomly masked networks: 1. Dropout is less computationally expensive. 2. Keeping all masked networks in memory is difficult.
Dropout and sexual reproduction: chromosomal crossover (randomly swapping genes between two parent organisms) prevents co-adaptation: genes are selected that work well with different versions of other genes. Similarly, dropout encourages modular representations that work well in the absence of some parts of the network.
Applicability: standard dropout works well for feedforward networks, but for recurrent networks it needs to be adapted. Dropout values are usually a little below 0.5.
Adversarial training: using examples crafted to 'fool' your model, e.g. by adding noise, to make the model less sensitive to being fooled. With many input dimensions, the output y can change by a lot when the input x changes only a little; adversarial training prevents this.
HC4: Practical Methodology
Steps in DL: 1. Determine the goal. 2. Establish a model. 3. Determine the bottlenecks in performance. 4. Make incremental changes, e.g. gathering new data, adjusting hyperparameters or changing algorithms.
Goals and Performance Metrics:
Every model needs a clear metric to be improved.
Accuracy: the ratio of correct predictions to all predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Not always informative, e.g. with unbalanced classes.
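A one-line check of the accuracy formula with illustrative confusion-matrix counts (not from the sheet); it also shows why accuracy can be misleading with unbalanced classes.

    def accuracy(tp, tn, fp, fn):
        # Accuracy = (TP + TN) / (TP + TN + FP + FN)
        return (tp + tn) / (tp + tn + fp + fn)

    # A classifier that always predicts the negative class on a 95/5 split
    # never finds a positive example, yet still scores 0.95.
    print(accuracy(tp=0, tn=95, fp=0, fn=5))   # 0.95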
Precision:
