Introduction. Machine Learning: learning to solve problems from examples; after learning, the system comes up with an algorithm and applies it to new data. Binary classification: either x vs y (positive vs negative) or x vs non-x (spam vs non-spam). ROC curve: plots the true positive rate (sensitivity) against the false positive rate (1 − specificity). Cross validation: break the training data into 10 parts, train on 9 and test on 1, rotating the held-out part. LOO: cross validation with K = N folds, i.e. train on N − 1 examples and test on the remaining one; good for KNN, otherwise expensive to run. Confidence: a 95% confidence interval means that if you repeated the experiment 100 times, roughly 95 of the resulting intervals would contain the true value. Debugging: collect data, choose features, choose model family, choose training data, train model, evaluate on test data. Canonical Learning Problems: regression, binary classification, multiclass classification, multi-label classification, ranking, sequence labelling, sequence-to-sequence labelling, autonomous behaviour.
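A minimal sketch of the K-fold cross-validation loop described above (here K = 10), in Python; train_fn and eval_fn are hypothetical placeholders for the model-specific training and scoring functions:

    import numpy as np

    def k_fold_cv(X, y, train_fn, eval_fn, k=10, seed=0):
        # Break the data into k parts, train on k-1 and test on the held-out part.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)
        scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_fn(X[train_idx], y[train_idx])
            scores.append(eval_fn(model, X[test_idx], y[test_idx]))
        return float(np.mean(scores))  # average score over the k held-out parts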
MSE: average of the squared differences between the true and predicted values. MAE: average of the absolute differences between the true and predicted values. FP: not spam, but marked as spam. FN: not marked as spam, but is spam. TP: marked as spam, is spam. TN: not marked as spam, is not spam. Accuracy: proportion of correct predictions, (TP + TN) / (P + N) = 1 − error rate. Error: proportion of mistakes. Precision: of everything marked as x, how much was actually x? P = TP / (TP + FP). Recall: of all x out there, how much did we find? R = TP / (TP + FN). F-score: harmonic mean of precision and recall, F1 = 2 * ((P * R) / (P + R)). Macro Average: compute precision and recall per class, then take the average. Micro Average: pool the counts over all classes (each correct prediction is a TP, each missed label an FN, each incorrect prediction an FP), then compute the metrics from the pooled counts.
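A minimal sketch of these formulas and of the macro/micro distinction; the per-class count format is an assumption made for illustration:

    def prf(tp, fp, fn):
        # Precision, recall and F1 from raw counts, following the formulas above.
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def macro_micro_f1(counts):
        # counts: dict mapping class -> (tp, fp, fn)  [hypothetical input format]
        macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
        tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
        micro = prf(tp, fp, fn)[2]
        return macro, micro

    print(macro_micro_f1({"spam": (8, 2, 1), "ham": (5, 1, 4)}))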
Perceptron. Computes a weighted sum of the input features (plus a bias); if the sum >= 0 it outputs +1, otherwise it outputs -1. Linear Classifier: the simplest linear model, finding simple boundaries separating the +1 and -1 examples. Discriminant: f(x) = w · x + b. Bias: decides which class the prediction should be pushed towards and does not depend on the input values. When w · x = 0, the bias decides which class to predict: it makes the default decision and biases the classifier towards the positive or the negative class. In the beginning everything is random; after iterating, the weights and bias are gradually shifted so that the next result is closer to the desired output. Error-driven: the perceptron is online and looks at one example at a time; if it is doing well it does not update its parameters (it only updates when an error occurs). Finding (w, b): go through all examples, predict with the current (w, b); if the prediction is correct continue, otherwise adjust (w, b). Streaming data: data which does not stop coming (recordings from sensors, social media posts, news articles). Online: online learners like the perceptron are good for streaming data. An online algorithm only remembers the current example, but it can imitate batch learning by iterating over the data several times in order to extract more information from it. Evaluating Online Learning: predict the current example, record whether it was correct, update the model if necessary, move to the next example. Always track the error rate, and never evaluate/test on examples that were used for training. Early stopping: stop training when the error on validation data stops dropping; when training error keeps going down but validation error goes up, the model is overfitting. Sparsity: a sparse representation omits zero values.
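A minimal sketch of this error-driven loop using the classic perceptron update (w ← w + y·x, b ← b + y on a mistake), assuming NumPy arrays and labels in {-1, +1}:

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        # X: (n, d) feature matrix, y: labels in {-1, +1}.
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):              # iterate over the data several times
            for xi, yi in zip(X, y):
                pred = 1 if w @ xi + b >= 0 else -1
                if pred != yi:               # only update when an error occurs
                    w += yi * xi
                    b += yi
        return w, b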
Gradient Descent. A model uses inputs to predict outputs; gradient descent finds the model parameters with the lowest error and is not limited to linear models. Optimization algorithm: how the model learns: learning = model + optimization. Optimization means finding a minimum or maximum of a function. However, optimizing the zero/one loss is hard. One option is to concoct an S-shaped function that is smooth and potentially easier to optimize, but it is not convex. Convex function: looks like a happy face and is easy to minimize; the convex surrogate losses used here are also always non-negative. Concave function: looks like a sad face. Surrogate Loss Functions: hinge loss, logistic loss, exponential loss, squared loss. SSE: Sum of Squared Errors, used for measuring error. Finding the value of w: start with a random value for w, check the slope of the function, descend the slope, and adjust w to decrease f(w). First Derivative: if we define f(w) = w², the first derivative is f'(w) = 2w. Slope: describes the steepness along a single dimension; the gradient is the collection of slopes, one for each dimension. To compute it, take the first derivative of the function f: the first derivative can be written f', and f'(a) is the slope of f at point a. Basic Gradient Descent: for f(w) = w², initialize w to some value (e.g. 10), repeatedly update w ← w − η · f'(w), where η is the learning rate controlling the speed of descent, and stop when w no longer changes. If the learning rate is too big, we move further away from the solution instead of closer.
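A minimal sketch of this update loop for f(w) = w², whose derivative is f'(w) = 2w; the starting value 10 and learning rate 0.1 are just example settings:

    def gradient_descent(f_prime, w=10.0, lr=0.1, tol=1e-8, max_steps=1000):
        # Follow the negative slope: w <- w - eta * f'(w), stop when w stops changing.
        for _ in range(max_steps):
            step = lr * f_prime(w)
            w -= step
            if abs(step) < tol:
                break
        return w

    print(gradient_descent(lambda w: 2 * w))   # converges towards 0, the minimum of w**2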
Stochastic Gradient Descent: a randomized version of gradient descent that works better with large datasets. Momentum: a modification to SGD which smooths the gradient estimates with memory; a large momentum makes it difficult to change direction, and it does not modify the learning rate. Finding Derivatives: in the general case, use symbolic or automatic differentiation to get gradients for complicated functions composed of differentiable operations, i.e. automatic application of the chain rule (TensorFlow, PyTorch). Local Minima: can get your optimizer trapped; a potential problem for non-linear models (such as neural networks), but not really a problem with high-dimensional data, and in most cases we do not care about local minima. The simplest way to avoid them is to restart from a different starting point. While searching for the global minimum, the model can encounter many 'valleys', whose bottoms we call local minima. Depending on the model, if a valley is deep enough the process might get stuck there and we end up in a local minimum instead of the global one, which means we end up with a less than optimal cost. This is not necessarily a big problem in high-dimensional data: when the parameter space is high-dimensional, it is unlikely that no direction decreases the error function, so there should be fewer local minima.
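A minimal sketch of SGD with momentum, under one common formulation (v ← μ·v − η·g, w ← w + v); grad_fn and the data format are hypothetical placeholders, not the course's exact recipe:

    import numpy as np

    def sgd_momentum(grad_fn, data, w, lr=0.01, mu=0.9, epochs=5, seed=0):
        # grad_fn(w, example) returns the gradient for one example (assumed signature).
        rng = np.random.default_rng(seed)
        v = np.zeros_like(w)
        for _ in range(epochs):
            for i in rng.permutation(len(data)):   # visit examples in random order
                g = grad_fn(w, data[i])
                v = mu * v - lr * g                # memory smooths the gradient estimate
                w = w + v
        return w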
Decision Trees/Forest. Generalization: the ability to view something new in a related way. Goal of induction: take the training data, use it to induce a function 'f', and evaluate 'f' on the test data; induction succeeds if performance on the test data is high. Advantages of DT: transparent, easily understandable, fast (no revision needed). Disadvantages of DT: intricate tree shape that depends on minor details, overfitting (try limiting the depth). Building a DT: the number of possible trees grows exponentially with the number of features, so the tree needs to be built incrementally. Ask the most important questions first, i.e. the ones which help us classify. On the left branch, apply the algorithm to the NO examples; on the right branch, apply it to the YES examples. Recursion: a function that calls itself until some base case is reached (otherwise it would continue infinitely); the base case is a leaf node, the recursive call builds the left/right subtree. (Un)balanced Trees: balanced trees are 'better' (faster), depending on the depth of the tree; prediction time does not depend on the number of questions but on the number of unique combinations. Discretization: use quantiles as thresholds or choose thresholds present in the data. Measuring Impurity: used to find the best split condition (the quality of a question); splitting stops when no improvement is possible. Entropy IH(P): a measure of the uniformity of a distribution; the more uniform, the more uncertainty (and thus the data is not divided enough), so the tree tries to minimize uniformity. Gini Impurity IG(P): measures how often a random element would be labelled incorrectly if labels were assigned randomly.
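A minimal sketch of the two impurity measures, using the standard definitions H(P) = −Σ p·log2 p and IG(P) = 1 − Σ p² over the class proportions (the formulas themselves are assumed here, since the notes only name them):

    import math
    from collections import Counter

    def class_probs(labels):
        counts = Counter(labels)
        return [c / len(labels) for c in counts.values()]

    def entropy(labels):
        # High when the labels are uniformly mixed, 0 when the node is pure.
        return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

    def gini(labels):
        # Chance that a random element would be labelled incorrectly at random.
        return 1.0 - sum(p * p for p in class_probs(labels))

    print(entropy(["spam", "spam", "ham", "ham"]), gini(["spam", "spam", "ham", "ham"]))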
Random Forest: many DTs, with the features randomly distributed over the different trees; generalizability increases and variance is lower, but interpretability is worse.
Feature Engineering. The process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data; it gets the most out of your data. Algorithms are generic, features are specific, and feature engineering is often a major part of machine learning. Categorical features: some algorithms (decision trees/random forests) can easily use categorical features such as occupation or nationality; otherwise, convert them to numerical values. Feature engineering covers extracting features, transforming features and selecting features; it is domain specific, and domain expertise is needed. Common Feature Sources: text, visual, audio, sensors, surveys. Feature transformations: standardizing (z-scoring), log-transform, polynomial features (combining features). Text Features: word counts, word n-gram counts, character n-gram counts, word vectors. MEG: signal amplitude at a number of locations (channels) on the scalp surface, evolving in time.
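A minimal sketch of two of the transformations listed above (z-scoring a numeric feature and bag-of-words counts for text); the toy inputs are made up for illustration:

    import numpy as np
    from collections import Counter

    def z_score(x):
        # Standardize a numeric feature to zero mean and unit variance.
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    def word_counts(text):
        # Bag-of-words text feature: word -> count.
        return Counter(text.lower().split())

    print(z_score([1, 2, 3, 4]))
    print(word_counts("Spam spam ham"))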
Feature Ablation Analysis: remove one feature at a time and measure the drop in accuracy; this quantifies the contribution of a feature, given all other features. Feature Learning: unsupervised learning of features, e.g. word vectors (LSA, word2vec, GloVe); neural networks can extract features from 'raw' inputs while learning (speech: audio wave, images: pixels, text: byte sequences). Pairwise interactions: linear classifiers need explicit information about the joint occurrence of features. Always consider the expressiveness of your model when engineering features.
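A minimal sketch of the ablation loop described above; train_fn and accuracy_fn are hypothetical placeholders for whatever model and metric are in use:

    import numpy as np

    def feature_ablation(X_train, y_train, X_test, y_test, train_fn, accuracy_fn):
        # Drop one feature (column) at a time and record the drop in test accuracy.
        baseline = accuracy_fn(train_fn(X_train, y_train), X_test, y_test)
        drops = {}
        for j in range(X_train.shape[1]):
            keep = [c for c in range(X_train.shape[1]) if c != j]
            model_j = train_fn(X_train[:, keep], y_train)
            drops[j] = baseline - accuracy_fn(model_j, X_test[:, keep], y_test)
        return drops   # a large drop means a large contribution from that feature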