Lecture Notes ISYE 6501 Midterm 1

Title: Lecture Notes ISYE 6501 Midterm 1
Author: Sergio Santoyo
Course: Analytic Models (ISYE 6501)
Institution: Georgia Institute of Technology

Summary

Lecture notes for the first midterm exam...


Description

# Week 1 Introduction To Analytics Modeling - GTX ISYE 6501 - Introduction to Analytics Modeling

analytics answers important types of questions:
- what happened? = descriptive
- what is going to happen? = predictive
- what actions are best? = prescriptive
- how do we create value with data? when can analytics answer these questions?

Modeling: taking a real-life situation and expressing it in math, analyzing it in math, and turning it into a solution
best ways to learn: ask questions, discuss answers

Course Structure:
- knowledge building
- experience building, based on the knowledge built in part 1

Knowledge Building:
- Models - learn all the models
- Cross-Cutting - data prep, output quality, missing data
- will include mathematical intuition but keep it agile
- all developed with situations and examples, with basic mathematical detail

Experience Building:
- case studies
- practice using models
- practice using models with the commonly used analytics software
- make sure you learn key basic concepts
- link material with real analytics questions
- develop learning beyond the videos
- learn to use software without being told exactly what to do

Summary: knowledge building and then experience building

What is Modeling?
- a real-life situation described in math
- analyze the math
- turn the math analysis back into a real-life solution
- the mathematical description of the problem is the model; all the detail involved in modeling is 'the model'

Introduction to Classification
- classification = putting things into categories

- put items into groups of 'yes' and 'no' - many analytics questions need to bin answers into a group
- based on past examples we can use classification models to sort new items into these groups
- we can also have multiple classification groups - not just 'yes' or 'no'
- we need data to get these answers! we can infer and model from the data to classify a new point into the correct group!

credit score and income example:
- scatterplot: if repaid - green, if defaulted - red
- each point could have an entire set of features associated with it
- we can draw a decision line between the points and sort them based on our decision line
- there are many lines! how do we know the 'right' line? they could all separate the groups equally well!

Choosing a Classifier
- what are the trade-offs in building classification models?
- we want to put things into categories! should we give someone a loan?
- we draw a line to sort points into classification groups... what is the right line to draw?
- which one should we choose? the line that is furthest from making mistakes!
- we might not have all the data - we want to find the line that is not close to making misclassifications
- what if it is impossible to avoid making classification mistakes, i.e. there is no line that separates the points?
- we need a 'soft' classifier rather than a 'hard' classifier
- we want as good a separation as possible - minimize the number of misclassified points
- we want to trade off between actual mistakes and 'near' mistakes
- not all mistakes are equal! the more costly one type of mistake is, the further we shift our line away from that group!

- we can set the classifier high in order to limit the cost of classification errors; we can use the same idea for 'soft' classification too!
- we can tell from our decision line which variable is important to the classifier, based on the scatterplot between the two variables:
  - horizontal line = the classifier only takes the vertical axis into account
  - vertical line = the classifier only takes the horizontal axis into account

Data Definitions
- what data comes up in analytics? what terminology do we use for different types of data?
- important to understand the analytics vernacular

Data Tables:
- rows are data points
- columns are variables - information about each data point - features, predictors
- response - the outcome we want to predict - this is also a column

Data Types:
- Structured - can be described and stored in a structured way
- Unstructured - cannot be stored easily - ex. written text

Structured Data:
- quantitative - numbers with meaning
- categorical - numbers without meaning - categories of data - numbers denote groupings
- binary - takes on only two values, 1 or 0
- unrelated data - no relationships between data points
- related data - data points linked together - e.g. time series data, recorded at regular time intervals

Support Vector Machines
- a basic mathematical model for classification
- we want to put things into categories: should we give loans to people, based on who they are?
- think of the scatterplot - green is repaid, red is default
- some lines are better than others - be far away from mistakes, and further away from the more costly mistakes

Support Vector Machine notation:
- n = number of data points
- m = number of attributes
- xij = ith attribute of the jth data point (i indexes attributes, j indexes data points)
- x1j = credit score of person j
- x2j = income of person j
- yj = the response for data point j
- yj = 1 if data point j is green (repaid)

- yj = -1 if data point j is red (default)

a line through our classification space (scatterplot) is defined by a set of coefficients:
a1*x1 + a2*x2 + ... + am*xm + a0 = 0
where a1 through am are the coefficients on the m attributes (features) and a0 is the intercept
we can also write this as: Σ ai*xi + a0 = 0
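As a tiny illustration (not from the lecture, with made-up coefficient and data values), classifying a point amounts to checking which side of the line it falls on, i.e. the sign of a1*x1 + ... + am*xm + a0:

```python
import numpy as np

# Made-up coefficients for a two-attribute classifier line a1*x1 + a2*x2 + a0 = 0
a = np.array([1.0, 2.0])   # a1, a2 (hypothetical values)
a0 = -1.5                  # intercept (hypothetical value)

def side_of_line(x):
    """Return +1 if the point is on the positive side of the line, -1 otherwise."""
    return 1 if np.dot(a, x) + a0 >= 0 else -1

print(side_of_line(np.array([0.9, 0.8])))   # -> 1  (0.9 + 1.6 - 1.5 = 1.0 >= 0)
print(side_of_line(np.array([0.2, 0.1])))   # -> -1 (0.2 + 0.2 - 1.5 = -1.1 < 0)
```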

we can draw two parallel lines through our classification space:
- parallel lines have the same coefficients a1...am but different intercepts!
- we want to draw two parallel lines that separate our red and green points...
- such that the line with intercept a0 lies exactly in the middle of the two groups (splitting them)
- this middle line will be our classifier
- we want to find values of a0, a1...am that classify the points correctly and have the MAXIMUM MARGIN between the two groups
- we need the maximum gap between the parallel lines
- we draw the two parallel lines as close as possible to each group of points: one line hugging the green points and one (with a different intercept) hugging the red points
- we use the midpoint of these two lines as our classifier
- the support vector machine aims to find the lines with the largest distance from the classifier (midpoint) to the margin lines (the individual lines hugging each group)

distance between the two solid (margin) lines:
= 2 / √(Σ ai^2)
this is 2 divided by the square root of the sum of the squared coefficients
so maximizing this distance is the same as minimizing Σ ai^2 (the sum of the squared coefficients)
if we can minimize this sum, we maximize the margin between the two groups of data!
this is our objective function - we aim to find lines that minimize this sum and therefore maximize the margin!

Hard separation problem:
- minimize, over all a's, the sum of the squares of the a's
- subject to (Σ ai*xij + a0) * yj >= 1 for all data points j
- we minimize the sum of squares of the a's, but only over lines that classify every point correctly!
- our solution is bounded by the original separation lines
- we want to find two separation lines that correctly classify all points and have the largest distance between them!!
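Written out in one place, the hard-separation problem described above is (using the notation from these notes):

```latex
\min_{a_0, a_1, \dots, a_m} \; \sum_{i=1}^{m} a_i^2
\qquad \text{subject to} \qquad
\Big( \sum_{i=1}^{m} a_i x_{ij} + a_0 \Big)\, y_j \ge 1 \quad \text{for every data point } j .
```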

what if there is no way to separate the two groups? we need a 'soft' classifier!
this means we account for errors in classification while trading off the most costly errors

we trade off errors vs. maximizing the margin
the error for a data point j of our soft classifier will be:
max{0, 1 - (Σ ai*xij + a0) * yj}
if we are on the correct side of the line: (Σ ai*xij + a0) * yj >= 1, so the error is 0
if we are on the wrong side of the line (rewriting the condition above): (Σ ai*xij + a0) * yj - 1 < 0, so the error is positive

Total Error - we want to minimize:
Σj max{0, 1 - (Σ ai*xij + a0) * yj}

Margin Denominator - we want to minimize this to maximize the margin:
Σ ai^2

we want a trade-off between the total error of classifying points and the margin - the distance between the decision lines
to do this we introduce a tuning parameter lambda, which turns our problem into minimizing one equation:
minimize Σj max{0, 1 - (Σ ai*xij + a0) * yj} + λ * Σ ai^2
the first piece is the total error and the second piece is the margin term
as λ gets large, the margin term dominates the equation... the need for a large margin overtakes the need to be completely accurate!
- this is the key trade-off between total error and maximizing the margin!!!
as λ drops toward 0, the margin term goes to 0... the need for accuracy (minimizing total error) overtakes the need for a large-margin decision boundary!
THIS IS THE SUPPORT VECTOR MACHINE!! it is the trade-off between total error and maximizing the margin distance!

THIS IS THE SUPPORT VECTOR MACHINE EQUATION!!
minimize Σj max{0, 1 - (Σ ai*xij + a0) * yj} + λ * Σ ai^2
notice the two pieces of this equation!
- total error = Σj max{0, 1 - (Σ ai*xij + a0) * yj}
- largest margin = λ * Σ ai^2
our tuning parameter λ determines how important each piece is in minimizing the entire equation!
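A minimal numeric sketch of this objective (the data points and coefficient values are made up, with yj coded as +1 for green and -1 for red):

```python
import numpy as np

# Made-up data: 4 points, 2 attributes; y coded as +1 (green/repaid) or -1 (red/default)
X = np.array([[0.9, 0.8], [0.7, 0.9], [0.2, 0.1], [0.3, 0.2]])
y = np.array([1, 1, -1, -1])

def svm_objective(a, a0, lam):
    """Soft-margin SVM objective: total hinge error + lambda * sum of squared coefficients."""
    margins = (X @ a + a0) * y                            # (sum_i a_i*x_ij + a0) * y_j for each j
    total_error = np.sum(np.maximum(0.0, 1.0 - margins))  # sum_j max{0, 1 - margin_j}
    return total_error + lam * np.sum(a ** 2)             # add the margin (regularization) term

print(svm_objective(a=np.array([2.0, 2.0]), a0=-1.0, lam=0.1))
```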

SVM: what does the name mean?
- if we have a set of points and connect the outer dots, we get the convex hull
- this looks like a shape floating in space that could fall down
- a line that 'holds up' this convex hull is called a support vector
- a support vector can support the hull from any side or from the top
- we can have multiple lines supporting our geometric shape
- in classification we want two lines that maximize the margin between the two groups - two parallel lines that are as far apart as possible
- these two lines are support vectors!
- the support vector machine is a machine that automatically determines the best 'support vectors' (parallel lines) that support our classification space!
- these lines correctly classify the data points while maximizing the distance between the two parallel lines!
- the machine does this automatically, and that is why we give it the name: Support Vector Machine
- the classifier we are looking for is between the two vectors - the classifier is not one of the lines touching the support vectors!!

Advanced SVM
this lesson looks at extensions of the basic support vector machine model:
- how do we account for more costly classification errors?
- how do we prepare the data before running an SVM model?
- can we use other types of classifiers to do the same job?

SVM = a classifier that finds the maximum separation, or margin, between two sets of data points
if it is impossible to avoid classification errors, we can use SVM to find a classifier that trades off reducing errors against enlarging the margin
enlarging the margin reduces variance - large-margin classifiers are less likely to misclassify new data because the space between the two groups is large!
this means small changes in the data will still be classified correctly!
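For experimenting with this trade-off, one option (not part of the lecture) is scikit-learn's SVC with a linear kernel; its C parameter plays roughly the role of 1/λ, so a small C tolerates more misclassification in exchange for a larger margin. A sketch with made-up data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up scaled data: [credit score, income] in [0, 1]; +1 = repaid, -1 = defaulted
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.6],
              [0.2, 0.1], [0.3, 0.2], [0.1, 0.4]])
y = np.array([1, 1, 1, -1, -1, -1])

# Small C -> wider margin, more tolerance for errors (roughly like a large lambda)
clf = SVC(kernel="linear", C=0.1).fit(X, y)

print(clf.coef_, clf.intercept_)   # fitted coefficients a_1..a_m and intercept a_0
print(clf.predict([[0.5, 0.5]]))   # classify a new applicant
```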

costly errors - we can shift our classifier to be more conservative
the classifier is based on an intercept that, in our example, can range between the two margin lines (from -1 to 1)
we can modify our intercept based on cost to shift the classifier up or down, all within the space defined by the margin
we do this by weighting or reducing the intercept based on cost!
in a soft classification context, we could add an extra penalty parameter to the total error side of the SVM equation
the larger the penalty, the less willing we are to accept misclassifying that data
our penalty can differ depending on which costly errors we most want to avoid
example:
- mj > 1 for more costly errors
- mj < 1 for less costly errors
mj is our penalty, multiplied into the total error side of our SVM equation
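One way to approximate the mj idea in practice (a sketch, not the lecture's exact formulation) is scikit-learn's class_weight argument, which scales the error penalty per class rather than per individual data point:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up scaled data: [credit score, income]; +1 = repaid, -1 = defaulted
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.6],
              [0.2, 0.1], [0.3, 0.2], [0.1, 0.4]])
y = np.array([1, 1, 1, -1, -1, -1])

# Penalize errors on defaulters (class -1) five times as heavily as errors on repayers
clf = SVC(kernel="linear", C=1.0, class_weight={-1: 5.0, 1: 1.0}).fit(X, y)
print(clf.coef_, clf.intercept_)   # the boundary shifts away from the costlier class
```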

remember: we are trying to minimize the sum of squared coefficients, which controls the margin - the gap between the classifier line and the data points
if our data is not scaled we could run into problems
- ex. credit score: the range is 350 - 850
- ex. HHI (household income): the range could be in the millions
the coefficient values could differ by three orders of magnitude; when we add their squares, a small change in one could swamp a huge change in the other!!
we need to scale our data before running SVM

once the data is scaled we can pick out attributes that are not needed for classification
- remember the vertical-line classifier; with many attributes, how do we see which ones matter?
- if a coefficient is close to zero, that attribute is not important for classification
- we may not need that variable to classify our data points!

summary: we know what classification is, what it looks like in two dimensions, and what it looks like in pictures and in math
- does SVM work the same way in more dimensions? yes!
- does a classifier have to be a straight line? no!
- we can modify our straight-line classifier into more flexible boundaries using kernels
- kernels map the data through specific functions to create more flexible boundaries that classify the results better
- this can produce a 'curvy line' SVM to classify data (a small sketch follows below)
- there are many other methods! like probability: is there a 37% chance of default? - logistic regression
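A sketch of the 'curvy line' idea using a kernel (here scikit-learn's RBF kernel on made-up data that no straight line can separate):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: class depends on distance from the origin (inner vs. outer points)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.4, 1, -1)

# An RBF (radial basis function) kernel lets the decision boundary curve around the inner group
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))   # training accuracy, for illustration only (see the validation notes later)
```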

are there other approaches to classification? yes, and many handle multiple classes!

Scaling and Standardization
two common data preparation tasks - scaling and standardization of our data
consider this example:
- an SVM model classifying loan applicants into credit risk categories
- based on credit score and household income
- these variables are scaled very differently
- we aim to draw two parallel lines that correctly classify each point while...
- maximizing the margin between the two parallel lines
- the actual classifier is the line exactly between these parallel lines

the classifier is the line 0 = a0 + Σ aj * xj
we maximize the gap by minimizing Σ aj^2

the sum of squared coefficients: Σ aj^2 = 5^2 + 700^2 = 490,025
what if we make a small change to the credit score coefficient (700 → 701)?
new sum of squared coefficients: Σ aj^2 = 5^2 + 701^2 = 491,426 (+1,401)
how much would we need to change the income coefficient to get this same increase?
new sum of squared coefficients: Σ aj^2 = 37.8^2 + 700^2 ≈ 491,426
- this requires a large change in the income coefficient (5 → 37.8)
- our resulting line is very different from the original line
- the data is not on the same scale
- small changes to the large-magnitude coefficient: huge swings in our plotted line
- large changes to the small-magnitude coefficient: little swing in the plotted line
the sum of squared coefficients will be overly sensitive to changes in some coefficients, and our SVM model will not perform well

Scaling Data:
we scale our data down to the same interval; a common choice is to scale data to be between 0 and 1
scale factor by factor:
- let xminj be the minimum of xij over the data points
- let xmaxj be the maximum of xij
- for each data point i: xij scaled = (xij - xminj) / (xmaxj - xminj)
general scaling to an interval [B, A]:
- xij scaled = xij scaled[0,1] * (A - B) + B
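A direct implementation of the scaling formulas above, column by column with numpy (the data values are made up):

```python
import numpy as np

# Made-up applicant data: column 0 = credit score, column 1 = household income
X = np.array([[620.0,  45000.0],
              [760.0, 120000.0],
              [410.0,  38000.0]])

# Scale each column to [0, 1]: (x - min) / (max - min)
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_01 = (X - x_min) / (x_max - x_min)

# General scaling to an interval [B, A]
A, B = 10.0, -10.0
X_AB = X_01 * (A - B) + B

print(X_01)
print(X_AB)
```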

Standardization: scaling to mean 0 and standard deviation 1 (as in a standard normal distribution)
- factor j has mean μj = (Σi xij) / n
- factor j has standard deviation σj
- for each data point i: xij standardized = (xij - μj) / σj
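And the corresponding standardization, again column by column (same made-up data as in the scaling sketch):

```python
import numpy as np

X = np.array([[620.0,  45000.0],
              [760.0, 120000.0],
              [410.0,  38000.0]])

# Standardize each column: subtract the mean, divide by the standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 per column
print(X_std.std(axis=0))   # approximately 1 per column
```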

when do we want to use these?
- scaling - important when data needs to be within a bounded range
  - neural networks
  - optimization - we need values restricted between certain bounds
- standardization - important to use in PCA and clustering
- in some cases it's not clear which works better!
- it is often necessary to scale or standardize the input data, and it should be done throughout this course
- 'it's not just a good idea - it's the law' = not quite, but it is a good idea

K Nearest Neighbor Classification
- a simple model for solving classification problems that can deal with multiple classes
- this is called the K Nearest Neighbor (KNN) model
suppose we have our bank example: loans to applicants based on credit score and HHI
- each previous applicant is green or red: green for full repayment, red for default
- instead of trying to draw a line or decision boundary between the two classes...
- we can predict the class of a new data point based on how close it is to other neighbors in the data
- based on the classes of its 'nearest neighbors', we slot the new observation into a specific class
- we typically select the 5 nearest data points, but there is nothing magical about how many we pick - this is the K in K Nearest Neighbors
- look at the K nearest points and pick the class that appears the most
- for the k closest points, there is more than one way to measure distance
distance metric: √(Σi (xi - yi)^2) is the straight-line (Euclidean) distance
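A minimal KNN sketch using the straight-line (Euclidean) distance above, with k = 5 and made-up data:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # most common class among them

# Made-up scaled data: [credit score, income]; 1 = repaid, 0 = defaulted
X_train = np.array([[0.9, 0.8], [0.8, 0.7], [0.7, 0.9],
                    [0.2, 0.1], [0.3, 0.2], [0.1, 0.3]])
y_train = np.array([1, 1, 1, 0, 0, 0])

print(knn_predict(X_train, y_train, np.array([0.75, 0.8]), k=5))
```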

we can also weight different neighbors by importance: √(Σi wi * (xi - yi)^2)
- different attributes might be more important than others
- we weight the distance more heavily to account for that feature in the distance calculation
- likewise, unimportant attributes can be removed - their weight is zero
- wi = 0 for unimportant attributes
- wi = large for important attributes
how do we choose a good value of K? we try different values and see what works - this is called validation

Classification Summary:
- what classification is - dividing data points into groups
- graphical intuition
- basic solution models: support vector machine (SVM) and K Nearest Neighbors (KNN) - both are machine learning algorithms
- additional subjects: data types, data definitions, distance metrics, confusion matrices - these are cross-cutting topics that apply to all types of models

Model Validation
what is validation? how good are our models? are they accurate? do they predict? do they classify?
example: we could look at the data points used to build the classifier to check accuracy
- using the training data, we see our classifier misses only three out of 24 points - about 87% accuracy!
- do we have a good model? no!
- we cannot build a model on the training set and then measure it by its accuracy on that same training data!
- we cannot measure accuracy on the data we used to train the model!!
why? data has two types of patterns:
- real effect: real relationships between attributes and the response

- random effect: randomness or noise - it can look real in our training data!
we don't know which patterns in the training data are real and which are random
we are modeling a mix of real and random effects
when we apply the model to new data we should only see the real effects - the random noise associated with the training data will not be present in the new data!
- real: the same in all datasets
- random: different in every dataset
so our model's performance on new data will be lower than its training performance
we want to model the true effect!! the random effect is why we cannot measure accuracy on the training set - we will not generalize to new data!
with small samples we can see lots of noise that is not the true effect; we can always find patterns that are only noise
only the real effects are likely to show up in new data!

Test Sets - Validation
validation - measuring the effectiveness of the model
don't judge a model by how well it fits the training data
- that overfits the random effects of the training data and will not generalize to new datasets
- our performance will look too good on the training data
how do we then measure model effectiveness? we split the data into two groups:
- a larger training set to fit our model
- a smaller set to measure the model's effectiveness
in our loan application example:
- create the classifier on the training set
- measure effectiveness on the validation or test set
- we could get 90% right on the training set and then only 80% right on the test set!
- this is a variance problem - our model has a hard time generalizing to new data
there is one more level - what if we are comparing more than one model?
- say we have 5 SVM models and 5 KNN models - which is the best of the 10 models?
- for each one we can measure performance on the validation set and choose the best one!!
even on the validation set, randomness will still be there - the measured performance will vary with the randomness of the validation set
the model with the best measured performance may be the most likely to be the best model...

but it is also the model most likely to have been given a boost by the random effects of the validation set - part of the performance we measure on the validation set could be random!
how do we account for this? we measure on a third set of data - a test set!
- another held-out data set to measure the true effectiveness!
- this makes sure the model we chose is not just boosted by randomness in the validation set!
- we can use the measurement on the test set to accurately gauge performance!
summary:
- we use the training set to build our models
- we use the validation set to choose between models
- we use the test set to estimate the final performance of the chosen model
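A sketch of the three-way split described here, using scikit-learn (the data, candidate models, and split proportions are all illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Made-up dataset: 300 applicants, 2 scaled attributes, binary response
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Split: 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Fit candidate models on the training set, compare them on the validation set
candidates = {"svm": SVC(kernel="linear"), "knn": KNeighborsClassifier(n_neighbors=5)}
scores = {name: m.fit(X_train, y_train).score(X_val, y_val) for name, m in candidates.items()}
best = max(scores, key=scores.get)

# Report the chosen model's performance on the held-out test set
print(best, candidates[best].score(X_test, y_test))
```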

