Pattern Recognition Class Notes
PR CLASS 1 - LINEAR MODELS FOR REGRESSION
– what’s statistical pattern recognition --> field of PR/ML concerned
with the automatic discovery of regularities in data through the use of
computer algorithms, and with the use of these regularities to take
actions such as classifying data into categories
– ML approach to do this --> use a set of training data of n labeled
examples and fit a model to this data; the model can subsequently be
used to predict the class of new input vectors (new data points); the
ability to categorize new data correctly is called generalization
– supervised: numeric target (regression), discrete unordered target
(classification), discrete ordered target (ordinal classification, ranking)
– unsupervised learning: clustering, density estimation
– always in data analysis --> DATA = STRUCTURE + NOISE --> we want
to capture the structure but not the noise
– coefficients (weights) of a linear or nonlinear function are learned or
estimated from the data
– maximum likelihood estimation --> e.g. the probability of observing x
successes in y trials; given N independent observations from the same
probability distribution, find the parameter value under which the
observed data are most likely to have been generated, i.e. the value
that makes the assumed distribution most likely to have produced the
sample we actually observed (a minimal sketch follows below)
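A minimal sketch of maximum likelihood estimation for the successes-in-trials example above, assuming a binomial model; the counts and the grid search are purely illustrative (the analytical MLE here is simply successes/trials).

```python
import numpy as np

def binomial_log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p, up to a constant not depending on p."""
    return successes * np.log(p) + (trials - successes) * np.log(1 - p)

# Illustrative data: 7 successes observed in 10 independent trials.
successes, trials = 7, 10

# Evaluate the log-likelihood on a grid of candidate values of p and pick
# the value under which the observed data are most likely.
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax(binomial_log_likelihood(grid, successes, trials))]

print(p_hat)  # 0.7, matching the analytical MLE successes / trials
```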
– score = quality of fit + complexity (model selection criteria trade
these two off)
– decision theory --> when wanting to make a decision in a situation
involving uncertainty, then two steps:
– 1) inference: learn p(x,t) from data (main subject of this course)
– 2) decision: given the estimate of p(x,t), determine optimal
decision (not focus on this in this course)
– suppose we know the joint probability distribution p(x,t):
– the task is --> given value for x, predict class labels t
– Lkj is the loss of predicting class j when the true class is k, K is the
number of classes
– then, to minimize the expected loss, predict the class j that
minimizes sum_k Lkj p(Ck|x) (the function at page 37 of lecture 1
slides; see the sketch after this list)
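A minimal sketch of the decision step, assuming the standard rule of predicting the class j that minimizes the expected loss sum_k Lkj p(Ck|x); the loss matrix and posterior values here are made up for illustration.

```python
import numpy as np

# Illustrative loss matrix: L[k, j] is the loss of predicting class j
# when the true class is k (zero loss on the diagonal).
L = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])

# Assumed posterior p(Ck | x) for a single input x, taken from the inference step.
posterior = np.array([0.2, 0.5, 0.3])

# Expected loss of predicting class j: sum_k L[k, j] * p(Ck | x).
expected_loss = posterior @ L

# Decision step: predict the class with the minimum expected loss.
print(expected_loss, np.argmin(expected_loss))
```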
– useful properties of expectation and variance:
– E[c] = c for constant c.
– E[cx] = cE[x].
– E[x ± y ] = E[x] ± E[y ].
– var[c] = 0 for constant c.
– var[cx] = c^2 var[x].
– var[x ± y ] = var[x] + var[y ] if x and y independent
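A quick empirical check of these rules, assuming two independent normal samples; the sample means and variances only approximate the exact identities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent samples, large enough for the sample moments to be close
# to the population values.
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
y = rng.normal(loc=-1.0, scale=2.0, size=1_000_000)
c = 5.0

print(np.mean(c * x), c * np.mean(x))            # E[cx] = c E[x]
print(np.mean(x + y), np.mean(x) + np.mean(y))   # E[x + y] = E[x] + E[y]
print(np.var(c * x), c**2 * np.var(x))           # var[cx] = c^2 var[x]
print(np.var(x - y), np.var(x) + np.var(y))      # var[x - y] = var[x] + var[y] (independent)
```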
– to minimize the expected squared error we should predict the
conditional mean of t at the input point x0
– generally y(x) = E[t|x] --> the population regression function (a
short derivation is sketched below)
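A short derivation sketch of why the conditional mean minimizes the expected squared error (the standard decomposition, not reproduced verbatim from the slides):

```latex
% For any prediction y(x), expanding the square and conditioning on x gives
\[
  \mathbb{E}\big[(t - y(x))^2 \mid x\big]
  = \operatorname{var}[t \mid x] + \big(\mathbb{E}[t \mid x] - y(x)\big)^2 .
\]
% The first term does not depend on y(x), so the expected squared error is
% minimized by choosing y(x) = E[t | x], the population regression function.
```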
– simple approach to regression:
– predict the mean of the target values of all training observations
with x = x0
– problem here is that, given the numerical (continuous) nature of x,
there will rarely be any training observations with exactly x = x0 in
practice
– possible solution --> look at the training points that are close to
where we want to make the prediction, then take the mean of their
target values
– the idea here is KNN for regression, e.g. with input variable x and
output variable t:
– 1) for each x define a neighborhood Nk(x) containing the
indices n of the k nearest points (xn, tn)
– with k = a small value (e.g. 2), the line is very wiggly, but with
k = a larger value (e.g. 20), the line is much smoother because we
are averaging over more nearest neighbors
– k is a complexity hyperparameter for KNN
– for a regression problem, the best predictor of the output variable is
its conditional mean given the input x, but this cannot be computed
exactly from a finite data sample, so:
– NN function approximates the mean by averaging over training data
– then NN function relaxes the conditioning to a neighborhood of the
specific input (see the sketch below)
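A minimal sketch of KNN regression as described above; the 1-D sine data, the query point, and the k values are illustrative.

```python
import numpy as np

def knn_regression(x_train, t_train, x_query, k):
    """Predict t at x_query as the mean target of the k nearest training points."""
    # Distances from the query point to every training input (1-D inputs here).
    distances = np.abs(x_train - x_query)
    # Indices of the k closest training points: the neighborhood N_k(x_query).
    neighbors = np.argsort(distances)[:k]
    # Approximate E[t | x] by averaging the targets over the neighborhood.
    return t_train[neighbors].mean()

# Illustrative noisy data: t = sin(x) + noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 50))
t_train = np.sin(x_train) + rng.normal(scale=0.2, size=50)

print(knn_regression(x_train, t_train, x_query=1.0, k=2))   # small k: wiggly fit
print(knn_regression(x_train, t_train, x_query=1.0, k=20))  # larger k: smoother fit
```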
PR CLASS 2 - LINEAR MODELS FOR CLASSIFICATION 1
– assumption in linear regression model is that the expected value of t is
a linear function of x
– have to estimate slope and intercept from data
– the observed values of t are equal to the linear function of x plus
some random error
– given a set of train points (N), we have N pairs of input variables x and
target variable t
– find values for w0 and w1 that minimize the sum of squared errors
(closed-form solution sketched below)
– the average value is denoted with a bar above the variable
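A minimal sketch of the closed-form least-squares solution for simple linear regression, assuming the usual estimators w1 = sum((x - mean(x)) * (t - mean(t))) / sum((x - mean(x))^2) and w0 = mean(t) - w1 * mean(x); the generated data are illustrative.

```python
import numpy as np

def fit_simple_linear_regression(x, t):
    """Least-squares estimates of intercept w0 and slope w1 for t ~ w0 + w1 * x."""
    x_bar, t_bar = x.mean(), t.mean()
    # Closed-form solution that minimizes the sum of squared errors.
    w1 = np.sum((x - x_bar) * (t - t_bar)) / np.sum((x - x_bar) ** 2)
    w0 = t_bar - w1 * x_bar
    return w0, w1

# Illustrative data generated as t = 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
t = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)

w0, w1 = fit_simple_linear_regression(x, t)
print(w0, w1)  # should be close to the true intercept 1 and slope 2
```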
– r2 (coefficient of determination):
– metrics for regression prediction evaluation
– define three terms:
– 1) total sum of squares —> is total variation of t in the sample
data
– 2) explained sum of squares —> is the part explained by the
regression, the variation in t explained by regression function
– 3) error sum of squares —> the part not explained by the
regression, the sum over all observations of the squared
difference between the actual value of t and the prediction
– therefore we have SST = SSR (part explained by regression) + SSE
– Rsquared = proportion of variation in t explained by x:
– r2 = SSR/SST = 1 - SSE / SST
– the output is a number between 0 and 1
– if = 1, all data points lie exactly on the regression line
– if = 0, the regression explains none of the variation in t (the
predictions are no better than the mean of t)
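A minimal sketch of computing SST, SSE, SSR and r2 for a fitted line; the data are illustrative and the line is fitted with numpy's polyfit rather than the course's own derivation.

```python
import numpy as np

def r_squared(t, t_hat):
    """Coefficient of determination: proportion of the variation in t explained by the model."""
    sst = np.sum((t - t.mean()) ** 2)   # total sum of squares (total variation of t)
    sse = np.sum((t - t_hat) ** 2)      # error sum of squares (unexplained variation)
    ssr = sst - sse                     # explained sum of squares (SST = SSR + SSE)
    return ssr / sst                    # equivalently 1 - SSE / SST

# Illustrative data and a least-squares line fitted with numpy.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
t = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
w1, w0 = np.polyfit(x, t, deg=1)        # slope, intercept

print(r_squared(t, w0 + w1 * x))        # close to 1 for a nearly linear relationship
```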
– the above (finding w0 and w1 by setting the derivatives of the sum of
squared errors to zero) is the calculus-based solution
– but let's look at the geometric approach:
– suppose the population regression line goes through the origin,
meaning the intercept w0 = 0
– then follow the same approach; the result is the same, only the
regression error is different
– the error vector should be perpendicular to the vector x, meaning
that the dot product of x and the error vector should be 0
– in least squares solution:
– y is a linear combination of the columns of X
– y = Xw
– want the error vector to be as small as possible
– the observed target values are not a linear combination of the
predictor variables --> if that were the case, we'd get a perfect
regression line (all the points on it), which is essentially never the
case with real data (see the normal-equations sketch below)
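A minimal sketch of the least-squares solution in matrix form, assuming the normal equations X^T X w = X^T t; it also checks numerically that the error vector is orthogonal to the columns of X, which is the geometric point above.

```python
import numpy as np

# Illustrative design matrix X (column of ones for the intercept) and targets t.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
t = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Least-squares solution of the normal equations X^T X w = X^T t.
w = np.linalg.solve(X.T @ X, X.T @ t)

# The residual (error) vector t - Xw is perpendicular to the columns of X.
error = t - X @ w
print(w)            # estimated [w0, w1]
print(X.T @ error)  # numerically close to zero
```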
– general linear model:
– the term linear means that it has to be a linear function of the
coefficients/weights, but not necessarily of the input variables
– is linear in the features, but not necessarily in the input variables
that generate them
– an interaction between two input variables means that the effect of
one on the target variable depends on the value of the other (e.g. age
and income on pizza expenditure)
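A minimal sketch of a general linear model with an interaction feature, using made-up age/income/pizza-expenditure data; the model stays linear in the weights even though the age * income feature is not linear in the inputs.

```python
import numpy as np

# Made-up inputs (age, income) and target (pizza expenditure) for illustration.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, 200)
income = rng.uniform(10, 100, 200)
t = 5.0 + 0.1 * age + 0.2 * income - 0.003 * age * income + rng.normal(scale=1.0, size=200)

# Feature matrix: intercept, age, income, and the interaction feature age * income.
Phi = np.column_stack([np.ones_like(age), age, income, age * income])

# Ordinary least squares on the expanded features (still linear in the weights).
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)  # roughly the generating coefficients [5.0, 0.1, 0.2, -0.003]
```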
– regularized least squares:
– add a regularization term to control for overfitting, in those cases
where the number of predictor variables is very large
– here use ridge regression (weight decay in NN)
– penalize the size of the weights
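A minimal sketch of regularized least squares (ridge regression), assuming the closed form w = (lambda*I + X^T X)^{-1} X^T t; the data and the lambda value are illustrative, and for simplicity the intercept is not treated separately.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Regularized least squares: w = (lam * I + X^T X)^{-1} X^T t."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)

# Illustrative data with many predictors, where plain least squares tends to overfit.
rng = np.random.default_rng(0)
n, d = 50, 30
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
t = X @ true_w + rng.normal(scale=0.5, size=n)

w_ols = ridge_fit(X, t, lam=0.0)     # unregularized least squares
w_ridge = ridge_fit(X, t, lam=10.0)  # penalizing the size of the weights shrinks them

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # ridge weights have the smaller norm
```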