Machine Learning 2025/2026 EDITION GUARANTEED GRADE A+
What is the definition of learning from experience for a computer program?
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Which problems can be solved with unsupervised learning? #2
• Problems where we have little or no idea what our results should look like.
• Problems where we want to derive structure from data without necessarily knowing the effect of the variables.

What is the definition of a supervised learning problem?
Given a training set, learn a function h such that h(x) is a "good" predictor for the corresponding output variable y.

What is the definition of a hypothesis?
The predicting function h.

Give the pictorial process for a supervised learning problem.
[Figure: Supervised Learning Problem.]

What do we call a learning problem, if the target variable is continuous?
When the target variable that we're trying to predict is continuous, the learning problem is also called a regression problem.

What do we call a learning problem, if the target variable can take on only a small number of values?
When y can take on only a small number of discrete values, the learning problem is also called a classification problem.

How do we measure the accuracy of a hypothesis function?
By using a cost function, usually denoted by J.

What is the definition of a cost function of a supervised learning problem?
It takes an average of the squared differences between the hypothesis applied to the inputs x and the actual outputs y: J(θ0, θ1) = (1/(2m)) · Σ_{i=1..m} (hθ(x^(i)) − y^(i))².

Give a pictorial representation of what the cost function of a supervised learning problem does.
[Figure: Cost function of a supervised learning problem.]

What are alternative terms for a Cost Function? #2
• Squared error function.
• Mean squared error.

What is a visual interpretation of the cost function? #2
• The training data set is scattered on the X-Y plane.
• We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.

What is the contour line of a two-variable function?
A contour line of a two-variable function has a constant value at all points of the same line.

How do we implement an iteration step when calculating Gradient Descent in code? #2
• At each iteration j, one should simultaneously update the parameters.
• Updating a specific parameter prior to calculating another one on the j-th iteration would yield a wrong implementation.

State the algorithm for gradient descent.
Repeat until convergence: θj := θj − α · ∂/∂θj J(θ0, θ1), where j = 0, 1 represents the feature index number (both parameters updated simultaneously).

Depict the graphical implementation of minimizing the cost function using gradient descent. #2
• We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
• The points on our graph will be the result of the cost function using our hypothesis with those specific θ parameters.

Why does gradient descent, regardless of the slope's sign, eventually converge to its minimum value? #2
Because the update subtracts α times the slope, the graph shows that:
• when the slope is negative, the value of θ1 increases.
• when the slope is positive, the value of θ1 decreases.

Why should we adjust the parameter alpha when using gradient descent? #2
• To ensure that the gradient descent algorithm converges in a reasonable time.
• Failure to converge, or taking too much time to reach the minimum, implies that our step size is wrong.

How does gradient descent converge with a fixed step size alpha? #2
• As we approach a local minimum, the gradient shrinks, so gradient descent automatically takes smaller steps.
• Thus there is no need to decrease α over time.

What is the algorithm for implementing gradient descent for linear regression? #2
• We can substitute our actual cost function and our actual hypothesis function, giving: repeat until convergence: θ0 := θ0 − (α/m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i)); θ1 := θ1 − (α/m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i)) · x^(i).
• m is the size of the training set, θ0 is a constant that will be changing simultaneously with θ1, and x, y are values of the given training set (data).

Give a derivation for a single example in batch gradient descent! (Gradient Descent For Linear Regression)
For a single training example: ∂/∂θj (1/2)(hθ(x) − y)² = (hθ(x) − y) · ∂/∂θj hθ(x) = (hθ(x) − y) · xj.

What is batch gradient descent? #2 (Gradient Descent For Linear Regression)
• Gradient descent on the original cost function J.
• This method looks at every example in the entire training set on every step.

How does batch gradient descent differ from gradient descent? (Gradient Descent For Linear Regression)
While gradient descent can be susceptible to local minima in general, the cost function for linear regression is convex, so batch gradient descent here has only one global optimum and no other local optima.

Depict an example of gradient descent as it is run to minimize a quadratic function. #2
• Shown is the trajectory taken by gradient descent, which was initialized at (48, 30).
• The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.

What is multivariate linear regression?
Linear regression with multiple variables.

What is the notation for equations where we can have any number of input variables? (Multivariate Linear Regression)
x^(i)_j = the value of feature j in the i-th training example; x^(i) = the input features of the i-th training example; m = the number of training examples; n = the number of features.

What is the multivariate form of a hypothesis function?
hθ(x) = θ0·x0 + θ1·x1 + θ2·x2 + … + θn·xn (with x0 = 1).

What is the intuition of the multivariable form of a hypothesis function in the example of estimating housing prices? #2
• We can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc.
• x1 will be the number of square meters in the house, x2 the number of floors, etc.

Give the vectorization of the multivariable form of a hypothesis function.
Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as hθ(x) = θᵀx.

Why do we assume that x0 = 1 in multivariate linear regression?
Convention; it lets the bias term θ0 be written as θ0·x0, so the hypothesis becomes a single vector product.

What is the Gradient Descent for Multiple Variables? #2
• The gradient descent equation itself is generally the same form: θj := θj − (α/m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i)) · x^(i)_j.
• We just have to repeat it for all of our n features (j = 0, …, n).

How can we speed up gradient descent?
We can speed up gradient descent by having each of our input values in roughly the same range.

Why does feature scaling speed up gradient descent? #2
• θ will descend quickly on small ranges and slowly on large ranges.
• Thus it will oscillate inefficiently down to the optimum when the variables are very uneven.

What are the ideal ranges of our input variables in gradient descent? #2
• For example, a range between −1 and 1.
• These aren't exact requirements; we are only trying to speed things up.

What is feature scaling? #2
• It involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable.
• This results in a new range of just 1.

What is mean normalization? #2
• It involves subtracting the average value of an input variable from the values of that input variable.
• This results in a new average value of zero for the input variable.

How do you implement both feature scaling and mean normalization? #2
• Adjust each feature as xi := (xi − μi) / si.
• Here μi is the average of all the values for feature i, and si is the range of values (max − min) or the standard deviation.

How can you debug gradient descent? #3
• Make a plot with the number of iterations on the x-axis.
• Plot the cost function J(θ) over the number of iterations of gradient descent.
• If J(θ) ever increases, then you probably need to decrease α.

How can the step parameter alpha in gradient descent cause bugs? #2
• If α is too small: slow convergence.
• If α is too large: J(θ) may not decrease on every iteration and thus may not converge.

What is the Automatic convergence test in gradient descent?
#2
• Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 0.001.
• However, in practice it is difficult to choose this threshold value.

How can we improve our features? (Multivariate Linear Regression) #2
• We can combine multiple features into one.
• For example, we can combine x1 and x2 into a new feature x3 by taking x1 · x2.

How can we improve the form of our hypothesis function? (Multivariate Linear Regression)
By making it a quadratic, cubic or square root function (or any other form).

What important thing should one keep in mind if one changes the form of a hypothesis function? (Multivariate Linear Regression) #2
• If you create new features when doing polynomial regression, then feature scaling becomes very important.
• For example, if x has range 1 - 1000, then the range of x² becomes 1 - 1,000,000.

State the normal equation formula!
θ = (XᵀX)⁻¹ Xᵀ y.

Compare gradient descent and the normal equation!
The following is a comparison of gradient descent and the normal equation:
• Gradient descent: needs to choose α; needs many iterations; works well even when n is large.
• Normal equation: no need to choose α; no iterations; O(n³) to compute (XᵀX)⁻¹; slow if n is very large.

Does feature scaling speed up the implementation of the normal equation?
There is no need to do feature scaling with the normal equation.

What is the complexity of computing the inversion with the normal equation?
With the normal equation, computing the inversion has complexity O(n³).

When might it be a good time to go from a normal solution to an iterative process?
When the number of features n exceeds 10,000, due to the O(n³) complexity of the normal equation.

Which function do we want to use in Octave when implementing the normal equation? #2
• Use the 'pinv' function rather than 'inv'.
• The 'pinv' function will give you a value of θ even if XᵀX is not invertible.

What are common causes for XᵀX to be noninvertible? #2
• Redundant features, where two features are very closely related (i.e. they are linearly dependent).
• Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".
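As a rough illustration of the linear-regression material above, the following NumPy sketch runs batch gradient descent (with simultaneous parameter updates) and the normal equation (using pinv) on a tiny hand-made training set. The data and all variable names here are illustrative, not taken from the notes.

```python
import numpy as np

# Toy training set: m = 4 examples, n = 1 feature, plus the x0 = 1 bias column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])   # exactly y = 2x, so theta should approach [0, 2]
m = len(y)

# Batch gradient descent: every step uses the whole training set,
# and all parameters are updated simultaneously (vectorized).
theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    gradient = (X.T @ (X @ theta - y)) / m   # (1/m) * X^T (X theta - y)
    theta = theta - alpha * gradient          # update every theta_j at once

# Normal equation: theta = pinv(X^T X) X^T y; no alpha, no iterations.
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

print("gradient descent:", theta)
print("normal equation: ", theta_ne)
```

On this toy set both methods recover θ close to (0, 2); the normal equation does it in one shot, while gradient descent needs a step size α and many iterations, which matches the comparison in the notes.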
How do we change the form of our binary hypothesis function to be continuous in the range between 0 and 1?
By using the Sigmoid Function, also called the Logistic Function: g(z) = 1 / (1 + e^(−z)), with hθ(x) = g(θᵀx).

How can we interpret the output of our logistic function?
hθ(x) for a given input x gives us the probability that our output is 1: hθ(x) = P(y = 1 | x; θ).

How can we get our discrete 0 or 1 classification from a logistic function?
We can translate the output of the hypothesis function as follows: hθ(x) ≥ 0.5 → y = 1; hθ(x) < 0.5 → y = 0.

What is the decision boundary given a logistic function? #2
• The decision boundary is the line that separates the area where y = 0 and where y = 1.
• It is created by our hypothesis function.

What does the cost function for logistic regression look like? #2
• We cannot use the same cost function that we use for linear regression, because the Logistic Function would cause the output to be wavy, causing many local optima.
• In other words, it would not be a convex function. Instead: Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0.

Plot the cost function J, if the correct answer for y is 1. #2
• If our correct answer y is 1, then the cost function will be 0 if our hypothesis function outputs 1.
• If our hypothesis approaches 0, then the cost function will approach infinity.

Plot the cost function, if the correct answer for y is 0. #2
• If our correct answer y is 0, then the cost function will be 0 if our hypothesis function also outputs 0.
• If our hypothesis approaches 1, then the cost function will approach infinity.

How can we simplify our cost function? (Logistic Regression Model)
We can compress our cost function's two conditional cases into one case: Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x)).

Give the vectorized implementation of our simplified cost function! (Logistic Regression Model)
A vectorized implementation is: J(θ) = (1/m) · (−yᵀ·log(g(Xθ)) − (1 − y)ᵀ·log(1 − g(Xθ))).

Give the vectorized implementation for Gradient Descent! (Logistic Regression Model)
A vectorized implementation is: θ := θ − (α/m) · Xᵀ·(g(Xθ) − y).

What is gradient descent for our simplified cost function? (Logistic Regression Model) #2
• Notice that this algorithm is identical to the one we used in linear regression.
• We still have to simultaneously update all values in θ.

Depict an example of One-versus-all to classify 3 classes! (Multiclass Classification)
[Figure: one-vs-all classification of 3 classes, with one binary classifier per class.]

What is the implementation of One-versus-all in Multiclass Classification? #2
• Train a logistic regression classifier hθ(x) for each class i to predict the probability that y = i.
• To make a prediction on a new x, pick the class i that maximizes hθ(x).

What is underfitting?
Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

What usually causes underfitting?
It is usually caused by a function that is too simple or uses too few features.

What is overfitting?
Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

What usually causes overfitting?
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

What are the two main options to address the issue of overfitting? #4
Reduce the number of features:
• Manually select which features to keep.
• Use a model selection algorithm.
Regularization:
• Keep all the features, but reduce the magnitude of the parameters θj.
• Regularization works well when we have a lot of slightly useful features.

In a basic sense, what are neurons?
Neurons are basically computational units that take inputs (dendrites) as electrical signals (called "spikes") that are channeled to outputs (axons).

What are the dendrites in the model of neural networks?
In our model, the dendrites are like the input features.

What are the axons in the model of neural networks?
In our model, the axons are the results of our hypothesis function.

What is the bias unit of a neural network? #2
• The input node x0 is sometimes called the "bias unit."
• It is always equal to 1.

What are the weights of a neural network?
Using the logistic function, our "theta" parameters are sometimes called "weights".

What is the activation function of a neural network?
The logistic function (as in classification) is also called the sigmoid (logistic) activation function.

How do we label the hidden layers of a neural network? #2
• We label these intermediate nodes "hidden layer" nodes.
• The nodes are also called activation units.

How do we determine the dimension of the matrices of weights? (Neural Network) #2
• If a network has sj units in layer j and s(j+1) units in layer j+1, then Θ^(j) has dimension s(j+1) × (sj + 1).
• The +1 comes from the addition of the "bias nodes"; in other words, the outputs of a layer do not include the bias node while the inputs do.

How do we obtain the values for each of the activation nodes, given a single-layer neural network with 3 activation nodes and a 4-dimensional input? #2
• We apply each row of the parameters to our inputs to obtain the value for one activation node.
• Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by the parameter matrix Θ^(2).

Give an example of the implementation of the OR-function as a neural network!
The following is an example of the logical operator 'OR', meaning either x1 is true or x2 is true, or both: with weights Θ = [−10, 20, 20], hθ(x) = g(−10 + 20·x1 + 20·x2) outputs approximately 1 whenever x1 = 1 or x2 = 1.

Give an example of the implementation of the AND-function as a neural network!
The following is an example of the logical operator AND, meaning it is only true if both x1 and x2 are 1: with weights Θ = [−30, 20, 20], hθ(x) = g(−30 + 20·x1 + 20·x2) outputs approximately 1 only when x1 = x2 = 1.

CONTINUED...
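The logistic-unit material above, including the OR and AND examples, can be checked numerically. The sketch below is a minimal illustration assuming NumPy; the helper name unit and the weight-vector names are invented here, while the weight values [−30, 20, 20] (AND) and [−10, 20, 20] (OR) are the classic choices from the notes.

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid unit with a bias input x0 = 1 can act as a logic gate.
AND_weights = np.array([-30.0, 20.0, 20.0])  # fires only when x1 = x2 = 1
OR_weights  = np.array([-10.0, 20.0, 20.0])  # fires when x1 = 1 or x2 = 1

def unit(weights, x1, x2):
    x = np.array([1.0, x1, x2])              # prepend the bias unit x0 = 1
    return 1 if sigmoid(weights @ x) >= 0.5 else 0  # threshold at 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", unit(AND_weights, x1, x2),
                       "OR:", unit(OR_weights, x1, x2))
```

With these weights the pre-activation is strongly negative unless enough inputs are on, so the sigmoid output sits very close to 0 or 1 and the 0.5 threshold recovers the exact truth tables.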
Written for
- Institution: Machine learning
- Course: Machine learning
Document information
- Uploaded on: January 16, 2025
- Number of pages: 34
- Written in: 2024/2025
- Type: Class notes
- Contains: All classes