Quiz 1 Study Guide
Course: Business Analytics and Modeling
Institution: Indiana University

Summary: Study guide with key terms and concepts that appear on the quiz...


Description

Quiz 1 Review

Guideline Requirement: [K] – "Know": know the concept. [C] – "Comprehend": know the concept, and the "why" behind the concept.

Day 1. Introduction - Business Understanding (about 15%)

● [K] Applications of data analytics for businesses
  ○ Manufacturing → use predictive analytics to
    ■ Develop failure predictions based on warranty transactional systems and vehicle diagnostics
    ■ Enable quick action on emerging problems and feedback to design teams for quality improvement
  ○ Retailing → use predictive analytics for
    ■ Price optimization
    ■ Market basket analysis for upsell and cross-sell
    ■ Campaign management and targeting
    ■ Loyalty programs and increasing lifetime customer value
  ○ Financial institutions → use predictive analytics to
    ■ Understand patterns of product selection as related to the life stage of customers
    ■ Predict loan defaults and credit risks
    ■ Identify fraudulent transactions
● [K] The six steps in the predictive data analytics process
  ○ 1. Business Understanding → Knowing:
    ■ The state of the business and the trends that affect it
    ■ Key business processes and how decisions are made with them
    ■ How analytically solving a decision-making problem will influence the business
    ■ Variables of interest
    ■ Causal structure among variables
  ○ 2. Data Understanding
    ■ Data Sources: internal, external, and data quality
    ■ Data Exploration: structure, variable types, summary statistics, distribution plots
  ○ 3. Data Preparation
    ■ Data Cleaning & Pre-Processing: outliers, missing values, data transformation, dimension reduction
  ○ 4. Modeling
    ■ Numerical Prediction: linear predictions (e.g., predicted vs. actual sales)
    ■ Classification: decision trees, logistic regression, k-nearest neighbors, neural networks, naive Bayes
    ■ Association Analysis: cluster analysis, association rules
    ■ Natural Language: text analytics
  ○ 5. Model Evaluation
    ■ Evaluating predictive power

    ■ Evaluating the overfitting problem (the biggest danger to predictive analysis)
  ○ 6. Deployment
    ■ Visual perception
    ■ Quantitative reasoning
    ■ Decision making
● [K] Supervised and unsupervised models
  ○ Supervised: the process of providing an algorithm with records in which an output variable of interest is known; the algorithm "learns" how to predict this value for new records where the output is unknown. Learning happens from training data.
    ■ Numerical Predictions
    ■ Classifications
  ○ Unsupervised: an analysis in which one attempts to learn patterns in the data other than predicting an output value of interest. No "learning" from cases with a known outcome variable.
    ■ Association Analysis
    ■ Natural Language

|        | Supervised | Unsupervised |
| Data   | A target variable and predictor variable(s) to estimate the target variable | A collection of variables (features) with no distinction between predictor and target variables |
| Goal   | Predict the value / classify the label (class) of the target variable | Discover useful patterns from the data |
| Models | Linear regression (predict a numerical value); decision trees (classification) | Cluster analysis; association rules |

● [C] Predictive analytics and explanatory analytics
  ○ Predictive = output focus
    ■ Goal-driven; enables predictions, classifications, and detection of relationships
    ■ Identifies associations, discovers novel relationships and trends
    ■ Prospective (forward looking)
  ○ Explanatory = "why and how" focus
    ■ Aims to develop and test theories by way of causal explanation and description
    ■ Tests causal hypotheses based on theory
    ■ Retrospective
  ○ Examples:

| Predictive | Explanatory |
| Build a model to predict next period's sales | Examine factors which cause an increase in sales |
| Build a model to predict whether customers respond to an ad campaign | Examine factors which make customers more likely to respond to an ad campaign |

Day 2-3. Data Understanding (about 25%)

● [K] Data quality inspections
  ○ Criteria for good quality data:
    ■ Accuracy: data entry errors, measurement errors, repeat entries, outliers
    ■ Completeness: missing values
    ■ Consistency: values for each variable consistently measured using the same scale (e.g., inches vs. centimeters) and stored the same way (e.g., number vs. text)
● [K] Dealing with missing data
  ○ First, determine if values are randomly or non-randomly missing
    ■ Non-random → poses more serious problems
● [C] Plots: boxplot, histogram, scatterplot
  ○ Histogram

    ■ Bell-shaped: ideal shape; indicates a normal distribution
    ■ Bimodal: 2 peaks, often a result of data coming from 2 different sources. Must examine to see if you can separate the data
    ■ Left skewed (negatively skewed): the peak is on the right; most of the data is clustered around the larger values, with fewer observations at the smaller values
    ■ Right skewed (positively skewed): the peak is on the left; most of the data is clustered around the smaller values, with fewer observations at the larger values
  ○ Scatterplot
    ■ Used to visualize relationships between pairs of variables
    ■ Scatterplot matrix → used to examine relationships between multiple pairs of variables together
    ■ Correlation coefficient R → between -1 and 1
      ● -1 = strong negative linear relationship
      ● 1 = strong positive linear relationship
      ● 0 = no/weak relationship
  ○ Boxplot
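The correlation coefficient R above can be computed directly; a minimal sketch in plain Python (the data points are made up for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0: strong positive linear relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0: strong negative linear relationship
```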

● [C] Reading the charts

Day 4. Data Preparation (about 25%)

● [K] Transformations on variable types – Encoding and Binning
  ○ Encoding
    ■ Takes categorical variables and produces a series of dummy variables (flag variables in SPSS)
    ■ Example:

| Neighborhood    | House Price Trend | T_Up | T_Down | T_Flat |
| Hyde Park       | Up                | 1    | 0      | 0      |
| Kensington Park | Up                | 1    | 0      | 0      |
| University      | Down              | 0    | 1      | 0      |
| Canada Park     | Flat              | 0    | 0      | 1      |

  ○ Binning
    ■ Transforms numerical variables into categorical counterparts
    ■ Improves accuracy of the predictive models by reducing the noise in the data
    ■ Predicts the value range an observation falls into, rather than focusing on exact values
    ■ Works with variables with multimodal (multiple peaks) distributions
    ■ Fixed-width binning: divide variable values into equally spaced intervals based on the difference between the max and min values. Bucket width = (max - min)/k, where k is the number of buckets

● [K] Monotonic transformation
  ○ General rule of variable transformations: data transformations should be monotonic:
    ■ 1-1 correspondence
    ■ Preserves the rank order of data points
● [C] Linear Transformations
  ○ Change the scale of measurement
  ○ Examples:
    ■ Transforming height in centimeters to inches
    ■ Fahrenheit to Celsius
    ■ Housing price to housing price in thousands of USD (divide by 1,000)
● [C] Nonlinear Transformations
  ○ Change the shape of distributions
  ○ Goal: transform variables so their distributions more closely resemble normal distributions
  ○ Meet the statistical-properties requirements of certain models
  ○ Better represent inherent relationships in the data
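A small illustration of the points above, using the guide's own examples (heights are made-up values): linear transformations change the scale but are monotonic, so the rank order of the data is preserved.

```python
def cm_to_inches(cm):
    return cm / 2.54          # change of scale only

def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9   # linear: shift then rescale

heights_cm = [180.0, 165.0, 172.0]
heights_in = [cm_to_inches(h) for h in heights_cm]

# Monotonic: the ordering of people by height is identical on both scales.
def rank_order(xs):
    return sorted(range(len(xs)), key=lambda i: xs[i])

assert rank_order(heights_cm) == rank_order(heights_in)
```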



● [C] Identifying skewness in distributions
  ○ Left skewed (negatively skewed): the peak is on the right; most of the data is clustered around the larger values, with fewer observations at the smaller values
  ○ Right skewed (positively skewed): the peak is on the left; most of the data is clustered around the smaller values, with fewer observations at the larger values
  ○ Bell-shaped: ideal shape; indicates a normal distribution
  ○ Bimodal: 2 peaks, often a result of data coming from 2 different sources. Must examine to see if you can separate the data
● [C] Transforming skewed distributions
  ○ Transformations for positive skews:



| Skewness | Transformation |
| Moderate: positive values of X | New X = SQRT(X) |
| Substantial: positive values of X | New X = LOG(X) |
| Substantial: with 0s | New X = LOG(X + C), where C is a constant |
| Severe: L-shaped | New X = 1/X |
| Severe: L-shaped with zeros | New X = 1/(X + C), where C is a constant |

  ○ Transformations for negative skews:

| Skewness | Transformation |
| Moderate: negative values of X | New X = SQRT(K - X), where K is a constant, usually K = max(X) + 1 |
| Substantial: negative values of X | New X = LOG(K - X), where K is a constant, usually K = max(X) + 1 |
| Severe: J-shaped | New X = 1/(K - X), where K is a constant, usually K = max(X) + 1 |
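The transformation tables above can be sketched as two small helpers; the constants follow the guide's conventions (C a chosen constant, K = max(X) + 1), and the income figures are made up:

```python
import math

def transform_positive_skew(x, severity="moderate", c=1):
    """Apply the guide's positive-skew transformations to a list of values."""
    if severity == "moderate":
        return [math.sqrt(v) for v in x]        # New X = SQRT(X)
    if severity == "substantial":
        return [math.log(v + c) for v in x]     # New X = LOG(X + C), handles zeros
    return [1 / (v + c) for v in x]             # New X = 1/(X + C), severe/L-shaped

def transform_negative_skew(x, severity="moderate"):
    """Apply the guide's negative-skew transformations, with K = max(X) + 1."""
    k = max(x) + 1
    if severity == "moderate":
        return [math.sqrt(k - v) for v in x]    # New X = SQRT(K - X)
    if severity == "substantial":
        return [math.log(k - v) for v in x]     # New X = LOG(K - X)
    return [1 / (k - v) for v in x]             # New X = 1/(K - X), severe/J-shaped

incomes = [0, 10, 100, 1000]  # right (positively) skewed, contains a zero
print(transform_positive_skew(incomes, "substantial"))
```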

● [K] Outlier detection
  ○ Visualize using a boxplot
  ○ Tabulate Z scores → any value > 3 (more than 3 standard deviations from the mean) is considered an outlier
  ○ Z score = (Y - Ȳ)/s
  ○ Multivariate outliers: could potentially bias the model that we fit to the data and the prediction results. "Anomaly Mode" in SPSS uses clustering to detect anomalies
  ○ How to handle outliers:
    ■ Diagnose first: what happened, is this legitimate?
    ■ Usual suspects: data entry errors, bad records from the extract process
    ■ Standard options: filter out records by coding them; do a separate analysis for the outliers

Day 5-6. Linear Predictions (about 35%)

● [K] What we use linear predictions for
  ○ Goal: predicting a numerical target variable using predictor variables, e.g., sales, revenue, number of credit card applicants
● [K] How to fit the linear model
  ○ Y = a + b1*X1 + b2*X2 + … + bn*Xn + Ɛ
    ■ Ɛ = actual - predicted value (Y - Ŷ)
  ○ Simple linear prediction model: 1 dependent variable, 1 independent variable
  ○ Multivariate linear prediction model: 1 dependent variable, 2+ independent variables
  ○ The line is fitted by minimizing the sum of the squared distances of the points from the regression line (sum of squares) → minimize Σ(Yi - Ŷi)²
  ○ Y = a + bX + Ɛ
    ■ a = intercept, b = slope, Ɛ = random error
    ■ Fitting the line = determining the values of a & b
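The Z-score rule above can be sketched as follows (plain Python, using the sample standard deviation s; the data values are made up, with one extreme entry):

```python
import math

def z_scores(y):
    """Z = (Y - Ybar) / s for each value, with s the sample std deviation."""
    n = len(y)
    ybar = sum(y) / n
    s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return [(v - ybar) / s for v in y]

def outliers(y, threshold=3):
    """Values more than `threshold` standard deviations from the mean."""
    return [v for v, z in zip(y, z_scores(y)) if abs(z) > threshold]

data = [10] * 19 + [500]
print(outliers(data))  # only the 500 entry is flagged
```

Note that a single extreme value also inflates s itself, so with very small samples an obvious outlier can fail the |Z| > 3 test; a boxplot is a useful complementary check.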

● [C] Evaluate predictive accuracy using MAE
  ○ MAE = Mean Absolute Error = (1/n) Σ|Yi - Ŷi|
  ○ Want to minimize
  ○ Shows the magnitude of error, independent of direction
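A minimal sketch of fitting Y = a + bX by least squares and scoring it with MAE; the advertising/sales numbers are invented and exactly linear, so the fit is perfect:

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b for Y = a + bX."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

def mae(actual, predicted):
    """Mean Absolute Error: average of |Y - Yhat|, ignoring error direction."""
    return sum(abs(y, ) if False else abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

adv = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]        # exactly linear: sales = 1 + 2*adv
a, b = fit_line(adv, sales)     # a = 1.0, b = 2.0
preds = [a + b * x for x in adv]
print(mae(sales, preds))        # 0.0 for a perfect fit
```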

● [C] Overfitting and data partitioning
  ○ Overfitting
    ■ Creating an overly complicated function which fits the current data points very well but will not predict future observations (example below)
    ■ Often caused by including too many predictors and building too complicated a model on the available data
    ■ May fit the data used to train the model very well, but fails to generalize to additional data or to predict future observations reliably
    ■ Detect overfitting with data partitioning

  ○ Data Partitioning
    ■ Used to detect overfitting
    ■ Split the data into:
      ● Training data: used to build the linear regression model
      ● Validation data: used to compare models and pick the best one
        ○ When comparing models, look at the MAE on the validation data, not the training data. If a model is overfitting the training data, its MAE will be lower on the training data but higher on the validation data
      ● Optional: can also split out test data to evaluate the performance of the chosen model on new data
    ■ Procedure:
      1. Randomly choose 50% of the data to be training data, with the remaining 50% as validation data
      2. Build the linear regression model with the training data
      3. Estimate predictive accuracy with the validation data
    ■ Model 2 is more accurate, because it has a lower MAE on the validation data
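The 50/50 partitioning step can be sketched as below (plain Python; the fixed seed, which makes the split repeatable, is my own addition for illustration):

```python
import random

def partition(records, train_frac=0.5, seed=42):
    """Randomly split records into (training, validation) sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))
train, valid = partition(records)
print(len(train), len(valid))  # 50 50
# Build each candidate model on `train`, then compare their MAE on `valid`
# and pick the model with the lowest validation MAE.
```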

    ■ How do we compare 2 models if 1 has been transformed?
● [C] Using the linear models to make predictions
  ○ Use the models to determine what impact a 1-unit change in a predictor would have on the target variable
  ○ Ex: Sales = a + b*adv + Ɛ
    ■ When b = 0, a 1-unit change in the predictor variable leads to no change in the value of the target variable; therefore the predictor does not provide information to predict the value of the target variable
    ■ Look at the p-value to determine whether the effect is significant (in the sig. column) → if p < .05, the effect is significant
    ■ Values > 1 have high influence
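As a sketch of using a fitted model for prediction (the coefficient values a and b are made up for illustration): a 1-unit change in the predictor changes the predicted target by exactly b.

```python
# Hypothetical fitted coefficients for Sales = a + b*adv
a, b = 50.0, 3.2

def predict_sales(adv):
    return a + b * adv

# A one-unit increase in adv changes predicted sales by b units;
# if b were 0, adv would carry no information about sales.
delta = predict_sales(11) - predict_sales(10)
print(round(delta, 2))  # 3.2
```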

The quiz is closed-book, done in class on Canvas, with 25 questions and 30 minutes allowed. The quiz questions focus on understanding concepts and interpreting results/graphs, rather than on the hands-on part of performing a specific analysis.

Sample Questions (for [K] level):

1. Which of the following charts could you use to most easily visualize the mode in the distribution of a continuous variable?
   A. Scatterplot
   B. Boxplot
   C. Histogram
   D. Bar chart

2. The following scatterplot shows the correlation between GDP and internet speeds. From the graph, you can see that there is likely to be _____________ between the two variables.
   A. Positive relationship
   B. Negative relationship
   C. No relationship

3. To evaluate the significance of individual predictors, we look at the p-value from the __________ for each predictor.
   A. F-test
   B. T-test
   C. Chi-squared test
   D. Rank test
...

