Cheat Sheet for Machine Learning (880083-M-6)

Cheat sheet with all necessary information to pass the exam of "Machine Learning (880083-M-6)" at Tilburg University. The course is part of the Master Data Science and Society and was taught in the academic year 2021/2022, semester 2, block 4 (April to June 2022), by Ç. Güven.

Machine Learning Basics
•ML involves becoming automatically better at a task T based on some experience E, with respect to some performance measure P
•ML process: 1. Find examples of spam and non-spam (training set). 2. Come up with a learning algorithm. 3. The learning algorithm infers rules from the examples. 4. The rules inferred from the training set can be applied to new and unseen data (emails) → understand how the model generalizes
•Purpose of ML: solve the problem when a new, unseen case comes
•Types of learning problems:
o regression (supervised, continuous variable) – response is a real number
o binary classification (supervised, discrete variable) – response is yes/no
o multilabel classification: multiple labels of a finite set per sample
o multiclass classification: decide for one label of a finite set per sample
o ranking, sequence labeling, autonomous behavior (self-driving cars)
Evaluation Metrics
•MAE = (1/N) Σ_{i=1..N} |y_i − ŷ_i|;  MSE = (1/N) Σ_{i=1..N} (y_i − ŷ_i)²
•Confusion matrix:
           p pred   n pred
  p true     TP       FN
  n true     FP       TN
•accuracy = (TP + TN) / (TP + FN + FP + TN)
•precision / hit rate = TP / (TP + FP) → what fraction of the flagged emails is real spam?
•recall / true positive rate = TP / (TP + FN) → what fraction of the spam was flagged?
•F1 score = (2 * P * R) / (P + R)
•Error rate / misclassification rate = (FP + FN) / (TP + FN + FP + TN)
•Type I error: false positive; Type II error: false negative
•Accuracy and error rate are only useful if the dataset is balanced
•F_β gives more weight to recall or precision (β > 1: recall, β < 1: precision): F_β = (1 + β²) * P * R / (β² * P + R)
•Macro average: rare classes have the same impact as frequent classes (use for balanced classes only); compute precision & recall per class, add them up and divide by the number of classes, e.g. (P_A + P_B + P_C) / 3
•Micro average ("harmonic mean of Macro Precision and Macro Recall"): treats the dataset as one aggregate result and calculates 1 metric rather than averaging k metrics; each sample (not each class) weighs equally, e.g. (TP_A + TP_B + TP_C) / (TP_A + TP_B + TP_C + FP_A + FP_B + FP_C)
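Sketch of these metrics in Python (the counts tp, fp, fn, tn are made up for illustration, not from the course material):

# illustrative confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45
accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
def f_beta(p, r, beta):
    # beta > 1 weighs recall more, beta < 1 weighs precision more
    return (1 + beta**2) * p * r / (beta**2 * p + r)
print(accuracy, precision, recall, f1, f_beta(precision, recall, 2))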
Decision Trees learn nested if-then-else rules
•more than two possible outcomes per node are possible (in contrast to a binary decision tree)
•decision / internal nodes test for a specific attribute
•branches / edges are the outcomes of tests (e.g., true / false)
•leaf nodes represent the classification or decision (outcome)
•root node is the topmost decision node
Impurity Measures as Splitting Criteria
•Purity is about being homogeneous (all balls in the bag have the same color)
•Impurity is lowest when all labels are the same
•Impurity is highest when all labels are equally likely (uniform distribution)
•Splitting strategy: use a quality function to determine the goodness of candidate splits (information gain, misclassification error, gini index)
(Expected / weighted) misclassification error (ME): not used in practice
•When we draw a ball from the bag, it is most likely from the majority class. If the majority class i has probability P_i, we are likely to make a mistake with probability 1 − P_i (= ME) when we predict the color to be i
Information gain / entropy
•Entropy measures the amount of information attached to each attribute
•Pure node: E = 0; maximal impurity (50/50): E = max
•E(outlook=sunny) = E(sunny yes (2) / total sunny (5), sunny no (3) / total sunny (5))
•E(outlook=sunny) = [-(2/5) * log2(2/5)] + [-(3/5) * log2(3/5)] = 0.971
•Do this for all possible outcomes of outlook (sunny, overcast, rainy)
•Entropy weighted sum per subset / feature (= outlook):
  E(v in outlook) = E(sunny)*(sunny/outlook) + … + E(rainy)*(rainy/outlook)
  5x sunny, 4x overcast, 5x rainy, 14 in total, times their entropy:
  (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693
•Information gain per feature (= outlook), compare the single features:
  gain(outlook) = E(9,5) − weighted sum(outlook) → 0.94 − 0.693 = 0.247
  E(9,5) because 9x yes and 5x no, 14 observations in total:
  [-(9/14) * log2(9/14)] + [-(5/14) * log2(5/14)] = 0.94
•The split which maximizes that difference is ideal (lowest E, highest IG)
•The information gain measure is biased towards choosing attributes with many values (extreme case: ID numbers), which may result in overfitting.
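Sketch of the outlook calculation in Python (the per-subset class counts follow the standard play-tennis example assumed above):

import math
def entropy(counts):
    # counts of each class within one subset, e.g. [2, 3] for 2 yes / 3 no
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
# outlook subsets: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no
subsets = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(s) for s in subsets)                          # 14 observations
weighted = sum(sum(s) / n * entropy(s) for s in subsets)  # ~ 0.693
gain = entropy([9, 5]) - weighted                         # ~ 0.94 - 0.693 = 0.247
print(round(weighted, 3), round(gain, 3))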
(Expected / weighted) gini impurity
•All possible ways of picking a ball but having it from another class; the gini impurity of a pure node is zero
•Like entropy, it can lead to misclassification error
•Imagine you have 2 red, 4 blue, 6 yellow balls (12 in total):
  you can select a red ball when it is blue: 2/12 * 4/12
  you can select a red ball when it is yellow: 2/12 * 6/12
  you can select a yellow ball when it is red: 6/12 * 2/12
  you can select a yellow ball when it is blue: 6/12 * 4/12
  you can select a blue ball when it is red: 4/12 * 2/12
  you can select a blue ball when it is yellow: 4/12 * 6/12
  add them up (88/144), or subtract the "correct guesses" from one:
  1 − (6/12*6/12 + 4/12*4/12 + 2/12*2/12) = 88/144
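The same ball example as a quick Python check:

def gini(counts):
    # 1 minus the probability of a correct guess = sum over all wrong-class picks
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
print(gini([2, 4, 6]))   # 88/144, about 0.611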
Construction of a Decision Tree
•Locally optimizing algorithm: select one attribute after the other based on the splitting strategy and partition the data using the split attribute. No globally optimal result is guaranteed
•The number of possible trees grows exponentially with the number of features
•The tree highly depends on / is sensitive to the training data, but is robust to outliers
•Greedy algorithm: starts at the top level and searches for an optimal split at each level. Not guaranteed to be optimal, but often reasonably good
•Trees are recursively defined data structures (base case = leaf node, recursive case = branch nodes): partially composed of smaller instances of the same data structure. A tree is composed of smaller trees (subtrees) plus leaf nodes
•Binarize categorical values, discretize numerical values using numerical thresholds to assign them to a class
Efficiency, speed, depth of a DT
•Complexity of the induced model = depth of the tree (number of decision boundaries / number of questions)
•Prediction speed depends on the depth of the tree
•Binary balanced tree: each question eliminates around half of the tree
•Balanced binary trees are very efficient and fast; repeated halving = few questions
•Speed depends on the number of questions needed to get to a leaf node (= depth)
Pruning: reduces the complexity of the final classifier and improves accuracy on the test set by reducing overfitting (while reducing accuracy on the training set)
•Pruning reduces the depth and creates shallow trees ("inverse of splitting")
Advantages of Decision Trees
•Interpretability: they generate understandable rules; fast classification: they don't require much computation once the tree is built; they show which features are most important for classification; they can handle continuous and categorical features; they require less data preparation (no feature scaling or centering); they can work with a relatively small amount of data
Disadvantages of Decision Trees
•They tend to overfit: the tree can become too complex and detailed; less appropriate for continuous features (use regression for numerical values to avoid overfitting); can be computationally expensive to train; don't work well on non-rectangular regions: they examine a single feature at a time, so the classification boxes are rectangular
Types of Decision Trees
•Dichotomiser 3 ("ID3"): uses IG as splitting criterion, biased towards choosing a feature with many values; C4.5 uses an improved version of IG
•Classification and Regression Trees (CART): uses the Gini index, used in Sklearn, can only generate binary trees
Random forests: combined decision trees ("ensembles of decision trees")
•Deep trees → overfitting, but much pruning → less predictive performance
•Used to overcome the overfitting problem of a normal decision tree by increasing accuracy on the test set
•Sample examples n times (with replacement), build n trees, each tree uses a random subset of features; prediction: average the answers (majority vote)
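Minimal sklearn sketch of both model types (the iris data and parameter values are illustrative, not from the course):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
# CART-style tree: gini criterion; max_depth limits complexity, similar in effect to pruning
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
# random forest: n trees on bootstrap samples with random feature subsets, majority vote
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(tree.predict(X[:2]), forest.predict(X[:2]))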
Perceptron finds simple decision boundaries (hyperplanes) in space
•Linear model for classification: the decision itself is not linear because it has a threshold, but everything before the decision (the weighted sum) is linear
Structure of a Perceptron
•Input vector with (e.g.) three features x, each having a weight w, plus one bias; the output classes are -1 / +1
•Weights determine how important a single feature is for the prediction
•Sum up the products of x and w
•Apply the classification rule to predict y
Classification Rule: used by the perceptron to classify objects
•It computes the weighted sum of the input features plus the bias
•If the sum >= 0, it outputs the positive class +1, otherwise the negative class -1
•Discriminant function as expression of the classification rule: f(x) = (Σ_{i=1..N} w_i * x_i) + b
•Role of the bias: if w*x ≈ 0, the bias decides which class to predict (makes the default decision); it biases the classifier towards the positive or negative class
Update Rule: basic idea of the perceptron algorithm
•Go through the examples one by one and try classifying the current example with the current (w, b); if correct: keep going, if not correct: adjust (w, b)
•Error-based learning: the perceptron only learns when making mistakes, because only then is the decision boundary updated by adapting the parameters (weights, bias). Repeat until a good result is found. Change to a more complicated classifier only when its results are better.
•Use the update rule to adjust w and b:
  1. ypred = predict((w, b), x)
  2. if y = +1 and ypred = -1 then
  3.   w ← w + x (new weight vector)
  4.   b ← b + 1
  5. else if y = -1 and ypred = +1 then
  6.   w ← w − x (new weight vector)
  7.   b ← b − 1
•Example: x = (1, 0.5), w = (2, 5), b = 1
  ypred = (1*2) + (0.5*5) + 1 = 5.5 > 0 → ypred = +1, but y = -1
  → w_new = (2 − 1, 5 − 0.5) = (1, 4.5), b_new = 1 − 1 = 0
•With this method the decision boundary is basically rotating to find the best fitting line. In 3D space it is a plane, in higher dimensions a hyperplane.
•The learning rate determines how much weight and bias are updated. A higher learning rate leads to faster convergence but also overshoots more often. The learning rate is multiplied into the update when using the update rule.
•Weight averaging: remember the weights from each iteration and average them to generalize better in practice
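Sketch of the classification and update rules in Python (it reproduces the worked example above; w ← w + y*x combines the two branches of the pseudocode, with learning rate 1):

def predict(w, b, x):
    # weighted sum of the features plus bias, thresholded at 0
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1
def update(w, b, x, y):
    # error-based learning: only adjust (w, b) when the prediction is wrong
    if predict(w, b, x) != y:
        w = [wi + y * xi for wi, xi in zip(w, x)]
        b = b + y
    return w, b
# worked example: x = (1, 0.5), w = (2, 5), b = 1, true label y = -1
w, b = update([2, 5], 1, [1, 0.5], -1)
print(w, b)   # [1, 4.5] and 0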
Application of the Perceptron
•Perceptrons are used e.g. in Support Vector Machines, where the best possible separation is found
•If the data is linearly separable, the perceptron will find a solution by adapting weights and bias
Online Learning vs Batch Learning
•A batch algorithm must remember the whole dataset; the number of iterations is a hyperparameter for batch perceptrons; standard method for ML where you have your entire data with labels
•An online algorithm only remembers the current example; data is presented one by one in a sequence: good for streaming data
•If you want multiple iterations, you must split the dataset to not repeat data points (separate development set)
•Pure online learning: make a prediction for the current example, record whether it was correct (update the model), go to the next example. At each point in time: error rate = proportion of mistakes made so far to total examples seen so far
Advantages of the Perceptron
•Quick way to train simple models
•Always finds a solution for linearly separable data if there is one
•Online learner: can work with streaming data (constantly arriving data) by simply updating its decision rule (decision trees can also work with streaming data, but the tree must be rebuilt, which requires more resources)
Disadvantages of the Perceptron
•Works well only with linearly separable data, otherwise the solution will not be optimal; therefore it does not perform well on more complicated problems
•If you have a circle as decision boundary, rearrange the data using polar coordinates to then find a linear decision boundary
Early stopping: to avoid overfitting, stop at the point with the lowest error on the validation set (after that, overfitting starts); comparable to pruning for DTs
Sparse Representation: avoid too many zeros in your dataset (a problem in NLP)
•Dense representation: mention all the zeros
•Sparse representation: leave out all the zeros; all absent values are implicitly zero (see the sketch after this section)
Multiclass Classification
•Not ideal: take one perceptron for every class
•Better solution: stack perceptrons ("multilayer perceptron") to get a vector as output instead of a single class (yes/no) = the approach of deep learning / neural networks
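Sketch of dense vs sparse feature vectors (a bag-of-words-style example; the values are illustrative):

# dense: every position is written out, including the zeros
dense = [0, 0, 3, 0, 1, 0, 0, 0]
# sparse: store only non-zero entries; absent indices are implicitly zero
sparse = {2: 3, 4: 1}
def dot_sparse(w, x_sparse):
    # weighted sum w*x using only the non-zero features
    return sum(w[i] * v for i, v in x_sparse.items())
w = [0.5] * len(dense)
print(dot_sparse(w, sparse))   # 2.0, same result as the dense dot product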
Gradient Descent
•Aspects of a learner: 1) How it uses inputs to predict outputs (DT: traverse the tree asking questions until you arrive at a leaf with the target; Perceptron: predict +1 if w*x + b > 0). 2) How it learns (DT: split the data using the best question and recurse on the splits, greedy regarding optimization, i.e. it looks for the local optimum at every step; Perceptron: update w and b if it made a mistake)
•Gradient descent is an optimization algorithm. It involves incrementally moving in the domain of an error function to find the values minimizing the error. Gradient descent is often used to find the weights or coefficients of ML algorithms (e.g., in linear and logistic regression, artificial NNs)
Error and optimization for a linear model
•Linear regression: y = w * x + b (weighted sum of features)
•x is a vector of features; the weight vector w contains a weight per feature
•Goal: find the best w to minimize the error
•Capture the relationship between x = (x_1, x_2, …, x_m) and the target y; m is the number of features
•N is the number of data points, {(x^i, y^i)}_{i=1..N} is the set of data points
•One data point (x^j, y^j) has the form ((x^j_1, x^j_2, …, x^j_m), y^j)
•w * x means w_1*x_1 + w_2*x_2 + … + w_m*x_m
•Use SSE (sum of squared errors) as error measure: SSE = Σ_{i=1..N} (y^i_pred − y^i)²
•Error(w, b) = Σ_{i=1..N} (w * x^i + b − y^i)²
•Fix b = 0 to simplify the formula: Error(w) = Σ_{i=1..N} (w * x^i − y^i)² → search for the w which minimizes Error(w)
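Sketch of Error(w) for a single feature with b = 0 (toy data, true w = 2; names are illustrative):

def error(w, xs, ys):
    # SSE for a one-feature linear model with prediction w * x (b fixed to 0)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(error(1.5, xs, ys), error(2.0, xs, ys))   # 3.5 and 0.0: the error is minimal at w = 2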
Slope, derivative, and gradient
•Use the derivative to find the minimum of Error(w)
•Usually the error functions have a convex form (e.g., no cosine-like shape), so it is possible to find a global minimum where the slope is 0
•The slope is positive if the function is increasing, negative if it is decreasing, and 0 at a local minimum or maximum
•For functions with vectors as arguments, e.g. Error(w), there is a multidimensional analogue called the gradient: the gradient is the collection of (partial) derivatives, one for each dimension (feature)
•Initialize w to a value, then update: w_new = w_old − η * f'(w_old)
  Example: f(w) = w², w_init = 1, η = 0.2 → w_new = 1 − 0.2 * (2 * 1) = 0.6
•Use the derivative of Error(w) for the updates:
  w_new = w_old − η * 2 * Σ_{i=1..N} (y^i_pred − y^i) * x^i
  b_new = b_old − η * 2 * Σ_{i=1..N} (y^i_pred − y^i)
•η is the learning rate, controlling the speed of descent (e.g. 0.01 or 0.2). If the learning rate is too large, we might miss the minimum because we just jump over it. If it is too small, the steps are too small and it takes too long to reach the minimum (computationally not efficient). With a good learning rate you slowly approach the minimum and end up with a good approximation of w (e.g., stop after the result does not change by more than a specific value anymore)
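Sketch of the update rule on the f(w) = w² example (η = 0.2, w_init = 1, as above):

def gradient_descent(f_prime, w, lr=0.2, steps=10):
    # repeatedly step against the derivative: w_new = w_old - lr * f'(w_old)
    for _ in range(steps):
        w = w - lr * f_prime(w)
    return w
# f(w) = w**2 has derivative f'(w) = 2*w; the first step gives 1 - 0.2*2*1 = 0.6
print(gradient_descent(lambda w: 2 * w, w=1.0))   # about 0.006, approaching the minimum at 0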
Batch and Epoch
•The batch size is the number of samples processed before the model is updated. The number of epochs is the number of complete passes through the training dataset. Both are hyperparameters.
•Example: assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs. The dataset will be divided into 40 batches, each with five samples, and the model weights are updated after each batch of five samples. One epoch therefore involves 40 batches, i.e. 40 updates to the model. With 1,000 epochs, the model passes through the whole dataset 1,000 times, i.e. a total of 40,000 batches during the entire training process.
Stochastic Gradient Descent and Batch Gradient Descent
•(Batch) gradient descent: go over the entire dataset at every iteration (compute predictions for all the examples before making an update) → computationally expensive, not feasible for large datasets
•Stochastic gradient descent (SGD): consider only one randomly chosen data point per update → trade-off: computation (efficiency) vs accuracy. Popular for training current ML models and deep neural networks; suitable for online learning scenarios like the perceptron update rule
•Mini-batch gradient descent: consider a subset of the dataset; the size of the subset is a hyperparameter
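The batch/epoch arithmetic from the example as a two-line check:

n_samples, batch_size, epochs = 200, 5, 1000
batches_per_epoch = n_samples // batch_size    # 40 weight updates per epoch
total_updates = batches_per_epoch * epochs     # 40,000 batches over the whole training
print(batches_per_epoch, total_updates)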
Cost Function
•The cost function is the loss function plus a regularization term
•Empirical risk minimization: try to minimize the average error on the training data:
  R(f) = (1/N) Σ_{i=1..N} (f(x_i) − f̂(x_i))² = (1/N) Σ_{i=1..N} (y_i − (θ*x_i + c))²
•Find the polynomial degree (example: linear) of the loss function and θ
•Add λ as a regularization term to the cost function: a large value of λ reduces the size of the θ term and overfitting, since a simpler model is assumed:
  R(f) = (1/N) Σ_{i=1..N} (f(x_i) − f̂(x_i))² + λ Σ_i θ_i²
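Sketch of the regularized cost for a one-feature linear model (data, λ and θ values are illustrative):

def cost(theta, c, lam, xs, ys):
    # mean squared error plus an L2 penalty on theta (the regularization term)
    n = len(xs)
    mse = sum((y - (theta * x + c)) ** 2 for x, y in zip(xs, ys)) / n
    return mse + lam * theta ** 2
xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
print(cost(2.0, 0.0, 0.0, xs, ys))   # plain empirical risk
print(cost(2.0, 0.0, 0.1, xs, ys))   # a larger lambda penalizes a large theta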
