Machine Learning •First calculate entropy per sub-feature. The fracture indicates the •Linear classifiers (such as perceptrons) employ a linear combination of
• Automation of problem solving, involves algorithms becoming better proportion of each decision class for that specific sub-feature. features to separate classes, while non-linear classifiers, such as k-
at a task based on some experience with respect to some performance •Note: logbase2 = logbase10 (value / logbase10 (2) nearest neighbors, decision trees, and neural networks, utilize non-
measure. ML should be generalizable – perform on unseen data. linear functions to accomplish class separation.
•Supervised learning: model learns on a labeled dataset, e.g. Usage of perceptrons
classification and regression. Unsupervised learning unlabeled dataset •Next, sum the entropy of the diff. sub-features together using the •Compute weighted sum of the input features + bias. If the sum greater
e.g. clustering and association mining. Semi supervised learning: only a proportional weights of each sub-feature. than or equal to zero -> positive class (+1), otherwise negative class (-1).
portion of the data is labeled, e.g. text classification. Reinforcement •In the example you have three features (x1, x2, x3) with their
learning: learns from rewarding desired behavior and punishing corresponding weight (w1,
undesired behavior (feedback loop), e.g. self-driving car. •Last, calculate the information gain. Info(root) is the entropy of the w2, w3), while 1 indicates the
Learning process root node). Info(fi) is the entropy of the feature. bias with weight b. The sigma
• Find an appropriate dataset with features and a target label -> decide •Repeat analysis for every feature and chose the feature with the best represents the weighted sum
on a model -> train the model (rules) -> test the model on new data . (lowest value for entropy) information gain. Repeat the whole process of the features and the bias.
Learning problems until the decision classes are homogenous. This weighted sum is input to
• Regression: response is a real number. Binary classification: response Gini impurity an activation function.
is yes/no (0 or 1). Multiclass classification: response is one of a finite • Indicates how often a random element •The (example) activation function is a step function, a function which
set of options. Multilabel classification: response one or more of a would be labeled incorrectly if labels were maps positive sum values to one class and the negative sum values to
finite set of options. Autonomous behavior: input are measurements assigned at random from the given distribution. the other (binary classification).
from sensor, response are instruction for actuators. Zero Gini Impurity for a pure node. Again •The goal of the perceptron algorithm is to find the weights that classify
Performance measures calculate for every feature and chose the best one. all data points correctly given the discriminant (activation) function.
•Evaluation, how well is the algorithm performing. Choose a baseline, • Misclassification impurity: Similar concept as Gini • The Perceptron Learning Rule Convergence Theorem states that if a
choose a metric, and compare. Different tasks, different metrics. impurity, slightly different formula. weight vector exists for accurately classifying all data points, the model
Regression (mean absolute error and mean squared error) • In general same performance for Gini and entropy. will eventually converge to such a weight vector regardless of the initial
• MAE: the average absolute difference between true value (yn) and Worse performance for misclassification impurity. weight choice, within a finite number of steps.
predicted value(𝑦̂n) - fails to punish large How to build a tree How to find the weights?
errors in prediction as all errors are treated •A tree is build using a greedy search. First, the best feature at the top • For each combination of weights a hyperplane
equally. node based on a splitting criteria is chosen, then the same is done for can be computed that, which separates the space
• MSE: the average square of the difference between true value and subsequent levels. This produces a local optimum -> best decision at (two dimensions for binary). This can be done by
predicted value – more sensitive to outliers each point, but not necessary for the tree overall. When the stopping setting f(x) = 0.
as the square amplifies the impact of large criteria is reached (or there is a pure node) a leaf node can be created. • The second function is the activation function,
deviations. •Not most optimal solution, but reasonably good. Reason for greedy which maps the outcome of f(x) to binary values.
Classification search is that the more features, the more computationally expensive it •A different way of representing the formula is a dot
• Confusion matrix: Contains info on is, not checking all the trees reduces the computation time. product notation. Here w and x are vectors.
the ground truth and the predictions • Tree is build using a recursive function. A recursive function solves a • Role of bias: biases the classifier towards positive or
of the model. Can be appended. problem by repeatedly applying the function to simpler instances of the negative class. . When w · x ≈ 0 bias decides which class
• Accuracy: Overoptimistic results same problem until a specific condition (base case) is reached. The base to predict (defines the default).
when the problem is imbalanced as it favors the case (simplest form) is used as a termination condition for the •Decision boundary: the hyperplane computed by setting f(x)=0. Under
most frequent instance. recursion. Trees are recursively defined data structures where the base the line represents one class and above the line another.
• Recall: Ability of the model to find all the case is a condition that determines when the recursion should • Goal: find the weights and bias minimize the loss function with an
positive samples. We do not want to miss out on terminate (e.g. Pure node). The recursive cases involve creating branch optimizer algorithm (e.g. SGD).
potential positive classes regardless of the errors we nodes based on a splitting criteria. Manually adjusting (or updating) (w, b)
make. The goal is to minimize FN. • A binary tree is a tree with two decision classes. In a binary tree every • Initialize a starting weight and bias and then classify examples one by
• Precision: ability of the model not to label a negative node has two children. A binary tree is created in the following way. If one. Correct -> keep going. Incorrect -> update (w, b), so that it predicts
sample as positive. Misclassification comes at a high cost. all examples have same label -> leaf node with respective label. the right class. General rules to update are given below:
The goal is to minimize FP. Otherwise, choose most important question -> split date into two parts • if y = +1 and ypred = −1 then, new w = w + x and new b = b + 1
• Specificity (TNR): Ability of model to find all (yes and no) according to question -> remove question from question • if y = −1 and ypred = +1 then new w = w – x and new b = b − 1
negative samples. Opposite of recall, not common. set. After this apply the whole process again, until there are only leaf Streaming data, online learning and batch learning
•Fβ-measure: Computed from the precision and recall. Recommended nodes at the end. •Streaming data refers to continuous and ongoing data sources such as
for imbalanced problems or when • Using a decision tree: start at root -> ask question about example -> sensor recordings, social media posts, and news articles.
establishing a trade-off between follow the edge corresponding to outcome -> repeat, until leaf node •Online learning is the ability to learn from streaming data by taking
precision and recall is needed. (prediction is label of leaf node). one data point at a time (e.g. perceptrons).
•Parameter β quantifies how much more we care about recall than •Two algorithms that can be used: CART (uses Gini Impurity and can •Batch learning involves learning from a whole dataset and requires the
precision, when it is greater than 1, that means, recall is weighted only be used for binary trees) and ID3 (uses information gain). algorithm to remember the entire dataset (e.g. decision trees).
more, when it is smaller than 1, that means precision is weighted more. • Questions in binary decision trees are generated by binarizing •In pure online learning, evaluation involves making predictions for
• Precision, recall and Fβ-measure are computed for each decision class. categorical values into yes/no (0,1) and discretizing numerical values current examples and recording correctness. Error rate is the
Can be averaged over more than one decision class. Accuracy is a global into threshold-based binary questions (xi ≤ thresholdj). proportion of mistakes made. In evaluation with multiple iterations,
measure (computed for all decision classes). Note Fβ-measure is derived Measures of performance in decision trees previously seen examples are included, and a separate development
from these computed averages. • Prediction speed: depth of the tree (length of the longest path from a and validation set needs to be used to ensure the model's performance
• Macro-average: compute measure per-class, and average. Note: rare root to a tree) or in other words the max number of questions needed is measured on unseen date (set should be different per training set).
classes have the same impact as frequent classed (sub-optimal). to get to a leaf node. Additional notes perceptron
• Micro average: treats the entire • Balanced tree: each time a question is asked, half of the tree is • Early stopping: The number of iterations serves as a hyperparameter
dataset as an aggregate results eliminated (in case of binary). Repeated halving can be used to balance for Perceptron, training should be stopped when the error on validation
and calculate 1 metric. the three as it the process of evenly dividing the dataset at each level. A data no longer decreases, as an increase in validation error indicates
Learning workflow – setting the scene balanced tree reduces the depth and accelerates prediction speed. overfitting despite a decrease in training error.
• Ideal outcome is that 𝑓̂(𝑥) = f (x). In other words the goal of a model •Ideal tree: low depth -> less questions -> faster to label •Weight averaging: remember the weights for each iteration and
is to minimize the difference (loss), between the two. An example of a Decision tree advantages vs disadvantages average them (generalizes better in practice).
loss function in regression is MSE and MAE. •Advantages: A white box approach (Interpretable and understandable •Most important features: the bottom and the top 10 features as they
• Problem: no full info on the whole population and distribution of f(x). rules). Fast classification once the three is there. Shows which features affect the outcome the most.
•Compute loss on the data we have are most important for classification (splitting criteria). Can handle both • Representing text: Text is often represented with word counts, which
(empirical risk minimization). continuous and categorical features. Require less data preparation (no can take lots of memory. Therefore, use a sparse representation of the
How to find 𝒇̂(𝒙)? feature scaling or centering needed). Robust for outliers. vector (zero’s omitted) instead of a dense representation (word count is
• If f (̂ x) = θx + c we assume a linear •Drawbacks: tend to overfit as they are complex models. Can be too also shown for words that do not occur in the given text)
relationship. The loss function then complex and detailed. Less appropriate for continuous features. Can be Gradient Descent
computes to this equation. computationally expensive to train. Not suited for regression problems. • Gradient descent is an optimization algorithm, which is an algorithm
• More complex relationship require adding a polynomial function. A Rectangular regions limit their ability to handle non-rectangular (more that seeks to find the best solution or set of parameters for a given
power p can be chosen: 𝑓̂(𝑥) = c + θ1x + θ2x2 + . . . θpxp. complex) regions, as they typically consider one feature at a time during problem e.g. the weights in a linear or logistic regression, or perceptron.
• A higher value of p -> more complex model -> chance of poor the splitting process. • Gradient descent and related models can be treated as black box.
generalization and overfitting. A low value of p can lead to the opposite. • Pruning: reduces the complexity and increases the generalizability -> • Gradient descent involves incrementally moving in the domain of a
•Solution: cross validation. The data is split into three parts a training, a cure for overfitting and increases interpretability. Remove subtrees function to find the value that minimizes it.
development and a test set. The training data is used to minimize that do not contribute much to the predictive capabilities of the How to calculate gradient descent
empirical risk and thus to train the model. The development set is used classifier. Pruned subtrees are replaced by the most popular decision • Goal: find w, b for which error on training data is smallest.
to find the best hyperparameters, different hp are tested on unseen class. • Slope: refers to the steepness of a line in a two-dimensional (x, y)
data, which improves generalizability. The test set is used to test the full Random forest plane, representing the rate of change in y with respect to x. Positive if
models performance on unseen data (generalization capabilities). •The idea consists of building several decision trees each using a function is increases, negative if function is decreasing, 0 if function is in
•Other solution: regularization: a penalty for model random selection (with replacement) of features and instances (also a local minimum or maximum. The slope can be calculated by taking the
complexity is included in the loss function. Penalty called bagging). After enough trees are built the output can be first derivative of the function.
reduces the value of θ -> simpler model, less prone to aggregated using majority voting (most popular decision class is • Gradient: a vector composed of partial derivatives, where each partial
overfitting assigned to the new instance). More random trees -> more fit to the derivative represents the rate of change of the function with respect to
Decision trees - components training data -> risk for overfitting. a specific dimension or parameter. It provides information about the
• Decision nodes checks the value of a feature. A • Advantages: increases accuracy and implies less overfitting. direction and magnitude of the steepest ascent or descent in the
decision tree starts with a root node. The lines • Drawback: loss of interpretability. function for each dimension.
(edges) originating from the nodes correspond to Perceptron – defined • Optimizing: By analyzing the slope or gradient, we can determine the
a test of a specific feature, each edge with a • Perceptrons are the building blocks in a neural network, representing direction of steepest descent in the loss function. Utilizing this
specific feature value, and connect the node to the next nodes. Leaves simplified models of biological neurons that perform computations on information, the algorithm adjusts the model's parameters in the
are terminal nodes where the predicted decision class is shown. input data and pass the output to the next layer of the network. appropriate direction: opposite if the derivative is positive and in the
Splitting criteria – Entropy and Information gain [0,1] • Perceptrons assign weights and a bias term to features, and by same direction if the derivative is negative. This iterative process aims
•Entropy is a measure of unorderdness or uncertainty. Lowest when all utilizing these parameters, they determine a hyperplane that can to find the local or global minimum, minimizing the loss function and
labels are the same, highest when they are equally likely. Goal is to effectively divide the input space, enabling classification or optimizing the model's parameters.
lower the entropy as the decrease indicates information gain. categorization of data points based on their relative positions. • Learning rate (η): the learning rate is used to change the value of w
and b with small steps in order to find the optimal values of (w,b). The