Lessons 17-21 Notes ISYE 6501 PDF

Title	Lessons 17-21 Notes ISYE 6501
Author	Matthew Holland
Course	Analytic models
Institution	Georgia Institute of Technology


Lesson 17

● 17.1 A Format for Discussion
  ○ Given the amount of crude oil available each day at supply ports, the daily demand at each refinery, the availability, speed, and cost of the company’s fleet of oil tankers, the port capacity at supply ports and refineries, and the range of weather effects on travel time, use an optimization model to determine the best pickup, delivery, and routing schedule for all of the tankers.
  ○ Given minute-by-minute traffic data on 12 popular websites, use ARIMA and GARCH models to predict the mean and variance of the traffic on each site five minutes into the future. Then, given those predictions, past data on the fraction of visitors to each website who click on your company’s banner ad at each time of day, and the fraction of banner ad clickers who purchase your product, use a simulation model to suggest the break-even price for advertising on each of the websites over the next five minutes.

Lesson 18

● 18.1 Introduction to Power Company Case
  ○ For all of these cases, what data do we need and what models will we use?
  ○ It can be important to prioritize and be efficient.
  ○ This problem has capacity constraints, and some of those constraints come from the workers’ limited time.

● 18.2 Models for Customer Identification
  ○ For this case, we need to rely on payment history, credit score, and other factors that are lawful to use, rather than zip code, race, etc.
  ○ Possible Models to Use
    ■ Classification (SVM or kNN) into three classes: will pay, can pay but does not pay, not able to pay
    ■ Clustering
    ■ Logistic regression to determine the probability of payment
      ● This would require the second step of deciding what threshold to use to convert the probabilities into yes/no decisions.
      ● It would let us consider the expected value of turning power off, while classification or clustering would not.
    ■ Single-model approach
    ■ Tree-based approach
    ■ Hybrid approaches
      ● Example: cluster, then analyze each cluster separately with linear or logistic regression
  ○ Pros and Cons
    ■ Unsupervised
      ● Clustering
        ○ Quick
        ○ May not produce the clusters that you would expect
    ■ Supervised
      ● Classification
        ○ Clear decision
      ● Logistic Regression
        ○ Requires a threshold
  ○ There are clear wrong answers.
    ■ GARCH would not be used to estimate probability, since it estimates variance and is used for time-series data.
    ■ A queuing model to classify customers would also be wrong.
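The two-step use of logistic regression in 18.2 (estimate a probability, then apply a threshold or an expected-value rule) can be sketched as follows. This is only an illustration: the coefficients, the feature layout, and the function names are hypothetical, and a real model would be fitted to payment-history data.

```python
import math

def payment_probability(features, weights, intercept):
    """Logistic model: P(pay) = 1 / (1 + exp(-(intercept + w . x))).
    The weights and intercept are hypothetical, not fitted values."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(p_pay, threshold=0.5):
    """Second step: convert the probability into a yes/no decision."""
    return "likely to pay" if p_pay >= threshold else "unlikely to pay"

def expected_loss_if_left_on(p_pay, next_bill):
    """Expected unpaid amount if power stays on for another month."""
    return (1.0 - p_pay) * next_bill

def should_shut_off(p_pay, next_bill, shutoff_cost):
    """Expected-value rule: shut off only if the expected loss from
    leaving power on exceeds the (assumed constant) cost of a shutoff."""
    return expected_loss_if_left_on(p_pay, next_bill) > shutoff_cost
```

The expected-value rule is what a plain classifier cannot provide: two customers with the same label can have very different expected losses, depending on the probability and the size of the bill.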

● 18.3 (X): Models for Cost Estimation
  ○ Costs
    ■ Leaving power on vs. shutting power off
    ■ The cost of turning the power back on
    ■ Personnel and legal costs
    ■ Societal cost of shutting off power and how it would affect the power company’s reputation
    ■ The cost of driving around to shut the power off
    ■ Turn-off/turn-on costs could be treated as constants.
  ○ Approach #1: Time Series (if there’s a lot of past data)
    ■ Exponential smoothing with trend and seasonal effects to estimate the amount of power used next month
    ■ ARIMA
    ■ GARCH if we need to consider variability
  ○ Approach #2: Simple Linear Regression
    ■ Tree-based, clustering
    ■ First separate the data points into groups, and then use a separate regression model for each group to determine which customers should be shut off.
  ○ Pros and Cons
    ■ Time Series
      ● Pro: Good if there’s enough past customer data
      ● Con: Effective only for short-term forecasts
    ■ Factor-based Regression
      ● Pro: Can be effective even when there’s not much specific customer data
      ● Normalize to account for seasonality.
  ○ Approach #3: Hybrid Approach
    ■ Model the amount of money someone will owe at the end of the next month.
    ■ With this approach, there will be a lot of zero values.
    ■ In that situation, it’s usually more effective to first try to identify which data points will have the one common value (here, zero) and then analyze them separately from the rest.

● 18.4 (X): Models for Shutoff Selection
  ○ We can envision having a “priority list” of shutoffs based on the difference between the expected cost of leaving the power on vs. turning it off.
  ○ The difficulty in modeling is how long it takes to travel between pairs of points.
  ○ This is a vehicle routing problem.
  ○ Approach
    ■ Data
      ● We don’t have anything like the multimillion-dollar Orion system that UPS has, but we can use generic drive time estimates.
      ● We need generic drive time estimates, the time it takes to shut off power at each location, and the results of the previous models that identified whose power should be shut off and estimated the amount of power that they might use next month.
    ■ Models
      ● Optimization
        ○ What is the highest-value set of customers that we can choose for power shutoffs?
        ○ A binary variable denotes yes or no for each customer.
        ○ The objective function could be the sum over all customers of the expected difference between shutting off power or not, times the binary variable.
        ○ Writing the constraints is the hard part.
      ● Clustering
        ○ Cluster physical locations
      ● Simulation
        ○ Possible distribution fitting
        ○ We could use simulation to figure out the value of adding new workers.
    ■ Student Approach
      ● 1. Logistic regression to estimate the probability of each customer not eventually paying their bill.
      ● 2. Linear regression with Box-Cox to predict the amount of power used next month.
      ● 3. Special vehicle routing optimization model to maximize the total value of shutoffs.
      ● *** 4. Ideally, they would have also used simulation.
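As a rough illustration of the optimization model in 18.4, the sketch below keeps the binary yes/no variable per customer and the "sum of value times binary variable" objective, but replaces the hard routing constraints with a single time budget. That simplification is an assumption made here for brevity (it turns the problem into a 0/1 knapsack and ignores travel time between customers); the function name and the data are hypothetical.

```python
def select_shutoffs(values, minutes, budget):
    """Maximize sum(values[i] * x[i]) subject to
    sum(minutes[i] * x[i]) <= budget, with x[i] in {0, 1}.

    values[i]  : expected benefit of shutting off customer i
    minutes[i] : whole minutes of worker time needed (assumed known)
    budget     : total whole minutes of worker time available
    Returns (best total value, sorted list of chosen customer indices).
    """
    n = len(values)
    # best[i][b] = best value using the first i customers within b minutes
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, m = values[i - 1], minutes[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # x[i-1] = 0
            if m <= b and best[i - 1][b - m] + v > best[i][b]:
                best[i][b] = best[i - 1][b - m] + v  # x[i-1] = 1
    # Walk back through the table to recover which x[i] equal 1.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= minutes[i - 1]
    chosen.reverse()
    return best[n][budget], chosen

# Hypothetical data: three customers, 60 minutes of worker time.
total, picked = select_shutoffs([120, 80, 60], [50, 30, 30], 60)
```

With these made-up numbers, the single most valuable customer is not part of the best set: the two cheaper visits together are worth more. A real vehicle routing model would add travel-time constraints between pairs of locations, which is the part the notes describe as hard to write.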

Lesson 19

● 19.1 (X): Introduction to Retailer Case
  ○ Hypotheses
    ■ 1. More shelf space will lead to more sales.
    ■ 2. The more of a product that gets sold, the more of its complementary products will be sold.
    ■ 3. If two complementary products are next to each other on the shelf, then the complementary effects will be greater.
  ○ Model to be used: optimize shelf space
    ■ Constraints: minimum and maximum amount of shelf space, physical size of the store
  ○ Are the retailer’s hypotheses true?
  ○ In some cases, finding appropriate data is harder than setting up appropriate models.

● 19.2 (X): Testing the Hypotheses
  ○ Doing a hypothesis test is pretty straightforward, but the difficulty can be in collecting the data to do it.
  ○ Hypothesis 1 (trickiest of the three)
    ■ Approach #1: Vary shelf space day by day
      ● A/B testing: difference between two shelf space amounts
        ○ Problems: trends, seasonalities, other factors
    ■ Approach #2: Increase shelf space (once or gradually)
      ● Change detection: change in sales?
        ○ Problems: trends, seasonalities, other factors
    ■ Moving objects around daily in order to do A/B tests was too time consuming, so that wasn’t an option.
    ■ We don’t have years of data or time to wait to collect time series data.
    ■ Approach #3 (if years of data are available)
      ● Exponential smoothing to account for trend and seasonality effects
      ● Hypothesis test on baseline values
    ■ Approach #4 (if there are many different shelf space amounts)
      ● Regression
    ■ What was used:
      ● Test across stores: Different stores had different amounts of shelf space for each product type, so we could use a factor-based model (regression, SVM, etc.) to determine the impact of shelf space on sales.
      ● Problem: Easy to show correlation, but hard to show causation.
        ○ Example: At the store level, the store managers believed they were setting shelf space based on demand. At the corporate level, the belief was that shelf space was causing demand.
  ○ Hypothesis 2
    ■ POS data was used in a hypothesis test to check if this hypothesis was true. After that, a simple regression model was able to estimate the magnitude of the complementary effects.
  ○ Hypothesis 3
    ■ Statistical significance was shown in most cases. The magnitude was also estimated, but with higher error.

● 19.3 (X): Using a New Data Source
  ○ New data source: cameras tracking customers throughout the store
    ■ This allowed us to check how much was purchased by customers who walked by a product.
    ■ Also, how much was purchased by customers who stopped at a product type.
    ■ This helps eliminate seasonality, region, etc.
  ○ The new system wasn’t good at figuring out who was whom.
    ■ If it worked better, it could use logistic regression to determine the probability that the same person appears in different camera shots.
    ■ Also, an optimization model could be used to combine the probabilities and find the overall maximum-probability way of matching people.
  ○ All of this information is great for analytics, but bad for privacy.
  ○ Sometimes more analytics isn’t necessarily better.
    ■ Case of a father finding out his daughter was pregnant
    ■ Social security numbers
    ■ Finding information about you with only 12 pieces of data

● 19.4 (X): Making Recommendations
  ○ Solution
    ■ Start with clustering.
    ■ Define the distance between a pair of products to be inversely related to the fraction of times the pair was purchased together. (The more often they’re purchased together, the smaller the distance between them.)
    ■ Then find groups of products that should be near each other using a clustering model.
    ■ Could also use the Louvain algorithm or an optimization model.
    ■ Once clusters are defined, use an optimization model to assign shelf space in order to maximize total sales or profit.
    ■ *Could also add a third level that locates clusters around the store. It would try to maximize the sales value of product types that customers need to walk past in the store.
      ● Example: Candy at the checkout aisle. Sell items that weren’t necessarily on customers’ shopping lists.
  ○ If the camera tracking data had been available, it would have really helped, but it wasn’t ready yet.
  ○ This is a good example of why it’s important to consider not just the modeling aspect of analytics, but also how it meshes with the data side of analytics.
  ○ Without the necessary data, fancy models are pretty useless.
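The product distance described in 19.4 can be sketched as below. The notes only say the distance should be inversely related to the co-purchase fraction; the specific choice here (one minus the fraction of baskets containing either product that contain both) is an assumption, and the basket data is made up.

```python
from itertools import combinations

def copurchase_distances(baskets):
    """Return {(a, b): distance} for every pair of products.

    distance = 1 - (baskets with both a and b) / (baskets with a or b)
    so products that are always bought together get distance 0 and
    products never bought together get distance 1.
    """
    products = sorted({p for basket in baskets for p in basket})
    distances = {}
    for a, b in combinations(products, 2):
        both = sum(1 for basket in baskets if a in basket and b in basket)
        either = sum(1 for basket in baskets if a in basket or b in basket)
        distances[(a, b)] = 1.0 - both / either
    return distances

# Hypothetical point-of-sale baskets.
baskets = [
    {"chips", "salsa"},
    {"chips", "salsa"},
    {"chips", "soap"},
    {"soap"},
]
distances = copurchase_distances(baskets)
```

A distance table like this could then be handed to any clustering method that accepts pairwise distances, to find groups of products that should be shelved near each other.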

Lesson 20

● 20.1 (X): Introduction to Monetization Case
  ○ How can the company use datasets to generate revenue? (monetize the data)
    ■ Example: A person with high net financial worth and an interest in travel who donates money to archaeological museums might be a prime target for marketing high-priced archaeological trips to Israel.
    ■ Predictive relationships could be found too.
      ● Example: People who play chess and donate money to religious organizations are less likely to default on their loans.
  ○ How can the company monetize the data? Which analytics models are needed in order to do it?

● 20.2 (X): Sample Models
  ○ Single Data Set 1 (optimize the products that get shown to the customer by using browsing pattern data)
    ■ Design of Experiments
      ● Show a variety of products and track the length of the customer’s gaze
      ● Multi-armed bandit
        ○ Exploration vs. exploitation
        ○ Decide which images to show based on this.
    ■ Clustering/Learning Model
      ● What does “similar” mean for a customer?
      ● What about these products makes them attractive to the customer?
  ○ Single Data Set 2 (use shipping data and repeated shipping patterns)
    ■ A person who gets diapers shipped to their house every two weeks probably has a baby. Based on the size of the diapers, the age of the baby could be estimated with a regression model. Then useful products for parents with a baby of that age could be shown to the customer. As the baby gets older, products can still be suggested.
    ■ If a gift is sent to an address once a year, reminders could be sent that suggest gifts for that occasion.
  ○ Multiple Data Sets 1
    ■ Credit offers
      ● Determine what level of credit should be offered based on credit score, income, etc.
      ● Classification, regression, or several other models could be used.
      ● The results could be sold to credit providers.
  ○ Multiple Data Sets 2
    ■ All three data sets
    ■ Clustering or regression model
      ● What types of products should be shown on the website?
      ● Based on interests from the magazine
    ■ Regression model
      ● What price level of products should be shown to the user?
      ● Based on purchasing and credit data
    ■ Real-time updates
      ● Based on real-time gaze, click, and purchase data
  ○ It’s great if something works from an analytics point of view, but that’s meaningless if it has no business viability.
  ○ Often, we won’t know if an idea will be profitable until we try it.
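The exploration vs. exploitation trade-off behind the multi-armed bandit idea in 20.2 can be illustrated with an epsilon-greedy rule. This is one simple bandit strategy chosen for the sketch, not necessarily the one used in the case; the click rates and parameter values are hypothetical.

```python
import random

def epsilon_greedy(click_rates, rounds, epsilon=0.1, seed=0):
    """Simulate choosing which product image to show.

    click_rates : true (unknown to the algorithm) click probability per image
    epsilon     : fraction of rounds spent exploring a random image
    Returns (shows, clicks): how often each image was shown and clicked.
    """
    rng = random.Random(seed)
    n = len(click_rates)
    shows = [0] * n
    clicks = [0] * n

    def estimate(i):
        # Images never shown get an optimistic estimate so each is tried once.
        return clicks[i] / shows[i] if shows[i] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)            # explore: random image
        else:
            arm = max(range(n), key=estimate)  # exploit: best estimate so far
        shows[arm] += 1
        if rng.random() < click_rates[arm]:    # simulated customer response
            clicks[arm] += 1
    return shows, clicks

# Hypothetical example: three images with different true click rates.
shows, clicks = epsilon_greedy([0.02, 0.05, 0.10], rounds=1000)
```

Unlike a fixed A/B test, the allocation shifts toward the better-performing images while the experiment is still running, at the cost of a small ongoing share of exploratory showings.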

● 20.3 (X): Matching Across Data Sets
  ○ Matching names is a hard analytics problem.
  ○ George P. Burdell!
  ○ How do we match names across data sets?
  ○ We use a factor-based approach.
    ■ 1. Measure how good a match is for each field.
      ● Sometimes it’s just a binary factor.
        ○ 1 for an exact match
        ○ 0 for a non-match
      ● The value could also be between 0 and 1 for a near match.
        ○ Smith - Smuth
        ○ Smith - Smiht
      ● Year of birth could be matched with year of graduation if you add 21-23 years to it.
    ■ 2. Base it on the field’s value itself.
      ● A match of Smith or Jones is worth less than a match of Sokol or Holland.
  ○ What kind of models can we use?
    ■ Supervised learning models
      ● Deep learning
      ● Bayesian learning
      ● Neural networks
    ■ To determine if they are the same person once a match has been made:
      ● Classification with a nonlinear kernel-based SVM approach (either zero or one)
      ● Probability
        ○ Logistic regression
        ○ Bayesian learning model
        ○ Neural network or deep learning
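The per-field match factors in 20.3 could be computed along these lines. This is a sketch under stated assumptions: the similarity measure (`difflib.SequenceMatcher` from the Python standard library), the record field names, and the name-frequency table are all illustrative choices, not the method used in the case.

```python
import math
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Score between 0 and 1: 1.0 for an exact match (ignoring case),
    lower values for near matches such as Smith vs. Smuth."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rarity_weight(name, frequency, default=1e-4):
    """Rare names carry more evidence than common ones.
    frequency maps lowercase names to a (hypothetical) population share."""
    return -math.log(frequency.get(name.lower(), default))

def match_features(rec_a, rec_b, frequency):
    """Build the factors a classifier or logistic regression could use
    to decide whether two records describe the same person."""
    year_gap = rec_b["grad_year"] - rec_a["birth_year"]
    return {
        "name_similarity": field_similarity(rec_a["last_name"], rec_b["last_name"]),
        "name_weight": rarity_weight(rec_a["last_name"], frequency),
        "year_plausible": 1.0 if 21 <= year_gap <= 23 else 0.0,
    }

# Made-up frequencies and records, for illustration only.
frequency = {"smith": 0.01, "sokol": 0.0001}
a = {"last_name": "Smith", "birth_year": 1990}
b = {"last_name": "Smuth", "grad_year": 2012}
features = match_features(a, b, frequency)
```

The resulting factors are inputs, not a decision: a supervised model trained on known matches and non-matches would turn them into a yes/no answer or a probability.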

Lesson 21

● 21.1 (C): Many Analysts, One Dataset
  ○ Study of soccer with a dataset of 12 attributes. Based on one particular attribute, is a player more likely to get a red card?
    ■ ⅔ of researchers said yes, ⅓ said no.
    ■ Different data, models, attributes, etc. were used by the researchers.
    ■ What’s going on here?
      ● 1. Modeling is an art.
        ○ Individual artistry is both important and hard to measure.
      ● 2. Everyone comes to analytics with their own biases.
        ○ People’s biases can easily affect their modeling choices, even when they’re trying to avoid being biased.
    ■ Is crowdsourcing a good idea?
      ● Not necessarily. An excellent answer can be swamped by a bunch of mediocre ones.
      ● It can help to avoid the worst results, but it won’t necessarily give the best results.
  ○ Takeaways
    ■ 1. Modeling really is an art.
    ■ 2. We need to be careful about our own biases.
      ● Think about what your gut feeling is about an answer.
      ● Build a model that you think is right.
      ● Also, build a test model that has the best chance of proving your gut feeling wrong.
    ■ 3. We need to keep an open mind to other approaches.
      ● Don’t let yourself be marginalized by someone with a different opinion.
    ■ 4. What you need to be successful in your analytics career is your own individual intuition, insight, and artistry.

● 21.2 (C): Course Summary
  ○ Basic Machine Learning
    ■ Classification Models
      ● SVM
      ● kNN
    ■ Clustering Models
      ● K-means clustering
  ○ Time Series Models
    ■ CUSUM
    ■ Exponential Smoothing
    ■ ARIMA
    ■ GARCH
  ○ Regression Models
    ■ Linear Regression
    ■ Logistic Regression
    ■ Advanced Regression
      ● Regression Splines
      ● Bayesian Regression
      ● kNN Regression
  ○ Tree-based Models
    ■ CART
    ■ Random Forests
  ○ Design of Experiments
    ■ A/B Testing
    ■ Factorial Design
    ■ Multi-Armed Bandit Models
  ○ Probability-based Models
    ■ Distribution Fitting
    ■ Queuing
    ■ Markov Chains
    ■ Simulation
  ○ Optimization Models
    ■ Prescriptive Analytics
    ■ Statistical Modeling
  ○ Advanced Topics
    ■ Non-parametric Models
    ■ Graph Analysis
    ■ Competitive Models
    ■ Deep Learning
  ○ Parametric Methods
    ■ Lessons 1-10
    ■ We choose the form of the predictor.
  ○ Non-parametric Methods
    ■ Beyond Lesson 10
    ■ We don’t force any specific form onto the predictor.

  ○ Preparing data
    ■ Detecting and dealing with outliers
    ■ Box-Cox transformations
    ■ De-trending
    ■ Scaling
    ■ Standardization
    ■ Principal Component Analysis
  ○ Dealing with missing data
    ■ Imputation
  ○ Variable selection
    ■ Evaluation of the quality of the model
      ● Greedy methods
      ● Global approaches
  ○ Validation
    ■ Training, validation, and test sets
    ■ Cross-validation
    ■ LOOCV
  ○ Methods for solving Case Studies
    ■ 1. Most solutions for the case studies involved linking several analytics models together.
    ■ 2. Several models were used, but finding proper data turned out to be the biggest challenge.
    ■ 3. Several models were used again; the toughest part was matching records from different data sets.

My intuition is just as important as the analytics models that I’ve learned this summer....