ISYE 6501 Midterm 1 Review (2023/24 Edition)
Rows: data points, the records in a data table.
Columns: attributes of each data point; the response (outcome) column holds the "answer" for each data point.
Structured data: quantitative, categorical, binary, unrelated, time series.
Unstructured data: text.
Support Vector Machine (SVM): a supervised machine learning algorithm used for both classification and regression, though mostly for classification. Each data item is plotted as a point in n-dimensional space (n is the number of attributes), with the value of each attribute being the value of a particular coordinate. Classification is done by finding the hyperplane that best separates the two classes. Support vectors are the individual observations that lie closest to the separating hyperplane (a line, in two dimensions) and define where it sits.
What do you want to find with an SVM model? Values of the coefficients a0, a1, ..., am that classify the points correctly and give the maximum gap (margin) between the parallel separating lines.
What should the sum for the green points in an SVM model be? For each green point, a0 + a1*x1 + ... + am*xm should be greater than or equal to 1.
What should the sum for the red points in an SVM model be? For each red point, a0 + a1*x1 + ... + am*xm should be less than or equal to -1.
What is the combined constraint for green and red points? Because yj is 1 for green points and -1 for red points, multiplying by yj turns both conditions into the single constraint yj*(a0 + a1*x1 + ... + am*xm) >= 1 for every point.
First principal component (PCA): a linear combination of the original predictor variables that captures the maximum variance in the data set; it points in the direction of highest variability. The larger the variability captured by the first component, the more information it carries, and no other component can capture more. Equivalently, it minimizes the sum of squared distances between the data points and the line.
Second principal component (PCA): also a linear combination of the original predictors; it captures the largest share of the remaining variance and is uncorrelated with the first component Z1 (their correlation is zero).
What if it's not possible to separate green and red points in an SVM model? Use a soft classifier. In a soft classification context, we can add an extra multiplier for each type of error, with a larger penalty the less we want to accept misclassifying that type of point.
Soft classifier: accounts for errors in SVM classification by trading off minimizing the errors we make against maximizing the margin. To trade off between them, we pick a value lambda and minimize a combination of error and margin. As lambda gets large, the margin term dominates: the importance of a large margin outweighs avoiding mistakes in classifying known data points.
Should you scale your data in an SVM model? Yes, so the orders of magnitude of the attributes are approximately the same. Data must be in a bounded range; a common scaling linearly rescales each factor, one at a time, to lie between 0 and 1.
How should you find which coefficients hold value in an SVM model? If a coefficient's value is very close to 0, the corresponding attribute is probably not relevant for the classification.
Does SVM work the same for multiple dimensions? Yes.
Does an SVM classifier need to be a straight line? No. SVM can be generalized using kernel methods, which allow for nonlinear classifiers; software kernel SVM functions can solve for both linear and nonlinear classifiers.
Can classification questions be answered as probabilities in SVM? Yes.
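To make the SVM cards above concrete, here is a minimal sketch using scikit-learn (a library choice not named in these notes; the toy data and the C value are illustrative assumptions). It scales the attributes to comparable magnitudes, fits a linear soft-margin SVM, and inspects the coefficients; near-zero coefficients flag attributes that are probably irrelevant.

```python
# Minimal sketch: linear soft-margin SVM with scaling and coefficient inspection.
# Assumes scikit-learn is available; the data and C value are illustrative.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy data: six points with two attributes, labeled +1 (green) or -1 (red).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Linearly rescale each attribute to [0, 1] so magnitudes are comparable.
X_scaled = MinMaxScaler().fit_transform(X)

# C is the error/margin trade-off (analogous to lambda in the soft-classifier
# card): large C penalizes misclassification heavily, small C favors a wide margin.
model = SVC(kernel="linear", C=100.0)
model.fit(X_scaled, y)

# a_1..a_m and a_0 of the separating hyperplane; coefficients near 0 suggest
# the corresponding attribute is not relevant for the classification.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```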
K-nearest-neighbor algorithm: to find the class of a new point, pick the k closest points to it; the new point's class is the most common class among those k neighbors.
What should you do about varying levels of importance across attributes with k-nearest-neighbors? Some attributes might matter more than others to the classification; deal with this by weighting each dimension's distance differently. Unimportant attributes can be removed entirely, since they contribute little to the classification.
What is the difference between real and random effects in validation? Real effects are the same in all data sets; random effects are different in every data set.
How should one generally split a data set? Training (building models) / validation (picking a model) / test (estimating performance).
Rotation versus randomness when splitting data? Rotation ensures each part of the data is equally represented; randomness leaves no chance of systematic bias.
K-fold cross-validation: splits the data into k sections and rotates which section is held out for validation, so you don't have to worry about what is being left out. Gives a better estimate of model quality.
Clustering: takes a set of data points and divides them into groups so that each group contains points that are close to each other, or similar.
Distance norms: given two points x and y with coordinates (x1, x2) and (y1, y2), the straight-line (Euclidean) distance between them is sqrt((x1-y1)^2 + (x2-y2)^2).
Rectilinear distance: the sum of the absolute coordinate differences, |x1-y1| + |x2-y2|.
p-norm distance: a generalized version of both distance equations, (|x1-y1|^p + |x2-y2|^p)^(1/p); p = 2 gives straight-line distance and p = 1 gives rectilinear distance. The third most common value of p is infinity.
Infinity norm: the largest of a set of numbers in absolute value; for a point difference, it is the largest coordinate difference (the infinity norm of a square matrix is the maximum of the absolute row sums).
k-means clustering notation: x denotes the data, with n data points and m attributes; xij is the value of data point i's attribute j. y denotes cluster membership; yik is 1 if data point i is in cluster k and 0 if not. zkj denotes the coordinate of cluster center k in dimension j.
k-means clustering: find a set of k cluster centers, and an assignment of each data point to a cluster center, that minimizes the total distance from each data point to its cluster center.
How does k-means clustering start? Begin by picking k points inside the range of the data; k is the number of clusters we want, and the points we pick are called cluster centers.
Process of k-means clustering: 1) choose the number of clusters; 2) temporarily assign each data point to the cluster center it is closest to; 3) recalculate the cluster centers (centroids); 4) go back to the previous step and reassign each data point to its closest cluster center; 5) keep repeating this loop until no data point changes clusters.
What kinds of models is k-means clustering an example of? Machine learning; a heuristic model (an algorithm not guaranteed to find the absolute best solution, but one that in many cases gets pretty close); and an expectation-maximization algorithm (minimizing the distance to a cluster center, or equivalently maximizing the negative of that distance).
Should you remove outliers before k-means clustering? Only if doing so does not create inherent bias in the data.
Should you run k-means clustering once or several times? Several times, using different initial cluster centers, and keep the best solution (also try different values of k).
How do you spot the optimal number of clusters? Plot total distance (y-axis) against number of clusters (x-axis) and look for a kink in the curve; the kink is where the marginal benefit of adding another cluster starts to be small. (See the sketch below.)
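The clustering loop and the elbow method above can be sketched in a few lines. Here scikit-learn's KMeans is an illustrative library choice, and the blob data and candidate k values are made-up assumptions:

```python
# Minimal sketch: choosing k for k-means with the elbow method.
# Assumes scikit-learn; the toy data and candidate k values are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Toy data: three loose blobs of points in two dimensions.
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(30, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    # n_init=10 reruns k-means from 10 different initial cluster centers and
    # keeps the best result, matching the advice to run it several times.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total squared distance from each point to its cluster
    # center; plot it against k and look for the kink.
    print(k, round(km.inertia_, 1))
```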
Supervised learning: classification. We know each data point's attributes and we already know the right classification (the response) for the data points we learn from. (More common in analytics.)
Unsupervised learning: clustering. We don't know the right grouping of the data points up front; we know their attributes but not which group any point belongs to, so the model must decide how to cluster based only on the attributes of the data.
Box-and-whisker plots: the top and bottom of the box are the 75th and 25th percentiles of the values; the horizontal line through the box is the median; the vertical lines stretching up and down are whiskers covering a reasonable range of values (e.g., the 10th and 90th, or 5th and 95th, percentiles). Points beyond the whiskers are outliers.
How to find outliers? Box-and-whisker plots, or an exponential smoothing model (points with a very large error might be outliers).
How to deal with outliers? If they are bad data: omit or impute. If they are good data, they are expected; don't remove them, but build two models to handle them appropriately. Example: a first model (logistic regression) estimates the probability of an outlier, such as a late delivery, happening under different conditions; a second model estimates the length of delivery under normal conditions, using the data without outliers.
Change detection: determining whether something has changed, using time series data.
CUSUM (cumulative sum): can detect when a process has moved to a higher level than before, to a lower level than before, or both. xt is the observed value at time t, and mu is the mean of x if no change has occurred. At each time t we observe xt, see how far above (or below) expectation it is, and add that amount to the previous period's metric St-1 to get a running total: St = max(0, St-1 + (xt - mu - C)) when detecting an increase. If the total would drop below 0, it resets to 0, and a change is detected once St crosses a threshold T. (See the sketch below.)
What does the C value do in a CUSUM model? It pulls xt down a bit, since we expect xt to be above expectation at random about half the time. The bigger C is, the harder it is for St to get large and the less sensitive the method is; the smaller C is, the more sensitive it is.
How do you choose C (which pulls xt down) and T (the threshold for declaring a change) in a CUSUM model? Weigh how costly it is if the model takes a long time to notice a change against how costly it is if the model thinks it has found a change that doesn't exist.
Control chart for CUSUM: plots St over time; if St ever crosses the threshold line T, CUSUM has detected a change.
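The CUSUM recursion is short enough to compute directly. Below is a minimal sketch in plain Python and NumPy for detecting an increase; the observations and the mu, C, and T values are illustrative assumptions, not values from these notes:

```python
# Minimal sketch: one-sided CUSUM, S_t = max(0, S_{t-1} + (x_t - mu - C)).
# The data and the mu, C, T values below are illustrative assumptions.
import numpy as np

x = np.array([10, 9, 11, 10, 10, 13, 14, 15, 14, 16], dtype=float)
mu = 10.0  # mean of x if no change
C = 1.0    # pulls x_t down so random noise doesn't accumulate
T = 5.0    # threshold: a change is detected once S_t crosses it

S = 0.0
for t, xt in enumerate(x):
    # Add how far today's observation is above expectation (less the slack C);
    # if the running total would go negative, reset it to 0.
    S = max(0.0, S + (xt - mu - C))
    if S >= T:
        print(f"change detected at t={t} (S_t = {S:.1f})")
        break
```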
Exponential smoothing: used for analyzing time series data where the response is known for many time periods. It models an assortment of variations (trend, cycles) while also dealing with randomness, which otherwise makes it difficult to find a baseline.
Exponential smoothing notation: St is the expected baseline response at time period t; xt is the observed response.
How exponential smoothing balances change and randomness: it merges two ways of thinking, St = xt (trust the observation) and St = St-1 (trust the baseline), into St = alpha*xt + (1-alpha)*St-1, where 0 < alpha < 1. Choose alpha close to 0 when there is a lot of randomness in the system and close to 1 when there is not much randomness.
How do you start and continue with exponential smoothing? The initial condition is S1 = x1. The basic model doesn't deal with trends or cyclical variations; those are added as extra components.
How do trends affect exponential smoothing? Tt is the trend at time period t. Add the previous trend to yesterday's baseline St-1 inside the (1 - alpha) part of the formula: St = alpha*xt + (1-alpha)*(St-1 + Tt-1). The trend estimate is smoothed the same way with its own constant, and its initial condition is T1 = 0.
Cyclic patterns: similar to trend in that another smoothed component enters the exponential smoothing formula, but the cyclic factor is multiplicative. When we use a cyclic factor to inflate or deflate an observed value, we use the factor from L time periods ago (L is the length of the cycle), because that is the most recent factor we have from the same part of the cycle. If C is 1.1 on Sundays, sales are 10% higher just because it's Sunday: of 550 units sold on a Sunday, 500 are baseline and 50 are the 10% extra.
What is triple exponential smoothing? The Holt-Winters method: exponential smoothing with both a trend component and a cyclic (seasonal) component.
What is single exponential smoothing? Smoothing of the baseline only, with no trend or cyclic components. When trend and seasonality are added, their updates work exactly the same way: each is a weighted average of its newest estimate and its previous value.
How does xt (the observed response) vary St in exponential smoothing? Through alpha: the baseline moves toward each new observation by the fraction alpha, so larger alpha values let observed responses shift St more quickly. (See the sketch below.)
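To ground the smoothing formula, here is a minimal sketch of single exponential smoothing in plain Python (the sales figures and the alpha value are illustrative assumptions); trend and cyclic components would each be smoothed the same way with their own constants:

```python
# Minimal sketch: single exponential smoothing, S_t = alpha*x_t + (1-alpha)*S_{t-1}.
# The observations and alpha value are illustrative assumptions.
x = [500, 520, 510, 560, 540, 530, 590, 570]
alpha = 0.3  # closer to 0: trust the baseline (lots of randomness);
             # closer to 1: trust the newest observation

S = x[0]     # initial condition: S_1 = x_1
for xt in x[1:]:
    S = alpha * xt + (1 - alpha) * S   # each observation moves the baseline by alpha
    print(round(S, 1))
```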