UGBA 104 Final Study Guide
Author: Rea Savla
Course: Analytic Decision Modeling Using Spreadsheets
Institution: University of California, Berkeley

Summary

Final exam material, from the midterm to the final.


Description

Week 8

● Big Data's 7 V's:
  ○ Volume - 1 yottabyte = 10^24 bytes
  ○ Velocity - the speed at which data is transmitted, and the speed at which the transmitted data is analyzed to generate an action
  ○ Variety - text, audio, images, and video
  ○ Veracity - missing data, ambiguous data, conflicting data, falsified data, and biased data
  ○ Visualization - understanding the best ways to display data
  ○ Vulnerability - security of data
  ○ Value - analysis of data must create value, such as a new product line, a cross-sell opportunity, or a cost-cutting measure
● Types of Cloud
  ○ Public Cloud - the vendor hosts hardware and software shared among tenants; the tenants pay only for what they use, and the vendor maintains the systems
  ○ Private Cloud - the organization hosts hardware and software, owned or leased, solely for its internal use; the organization's data stays at the organization's data center
● Data set
  ○ Each row is an observation
  ○ Each column is a variable
  ○ Variables are either continuous or categorical
  ○ Categorical data can be converted into binary dummy variables
  ○ Dimension reduction - cutting out variables without losing information
● Excel commands
  ○ Modify: sort, filter, conditional formatting
  ○ Summarize a categorical variable: frequency distribution, COUNTIF, AVERAGEIF, SUMIF, relative frequency distribution, and percent frequency distribution
  ○ Summarize a quantitative variable: frequency, relative frequency, and percent frequency distributions; histogram; mean, range, variance, and standard deviation; percentiles, median, quartiles, IQR, outliers; box-and-whisker chart
● Pth percentile:
  ○ Approximately p percent of the observations fall below the pth percentile
  ○ The position of the pth percentile in the sorted data is (p/100)(n + 1); the 50th percentile is the median
● IQR, z-scores, and outliers (see the sketch below):
  ○ IQR = Q3 - Q1
  ○ An observation above Q3 + 1.5*IQR (or below Q1 - 1.5*IQR) is an outlier
  ○ Standardized value (z-score): z = (x - mean)/(standard deviation)
  ○ Normal distribution: 68-95-99.7 rule
  ○ An observation with |z| >= 3 is an outlier
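A minimal Python sketch of the percentile/IQR/z-score outlier rules above (the course works in Excel; Python and the data values here are illustrative only):

```python
# Outlier detection via the IQR fences and via z-scores.
# "data" is a made-up sample; the "inclusive" quantile method
# matches Excel's QUARTILE.INC / PERCENTILE.INC.
import statistics

data = [12, 15, 14, 10, 8, 22, 15, 16, 14, 45]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

mean = statistics.mean(data)
stdev = statistics.stdev(data)
z_outliers = [x for x in data if abs((x - mean) / stdev) >= 3]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("IQR-rule outliers:", iqr_outliers)
print("z-score outliers (|z| >= 3):", z_outliers)
```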

Week 9

● Scatter Chart - plots each point's coordinates
  ○ Insert a scatter chart by highlighting the data you want
  ○ Add chart elements - chart title, axis titles
  ○ Format axes - under Axis Options, change the minimum and maximum; under Number, change decimal places from 2 to 0
  ○ Add a linear trendline - right-click on any data point and select Add Trendline
    ■ Under Trendline Options, select Linear
    ■ Under Forecast, select "Display equation on chart" and "Display R^2 value"
  ○ Darken the horizontal and vertical axes
    ■ Right-click anywhere on the horizontal axis and select Format Axis
    ■ Under "Line," open the drop-down menu to the right of Color and select black
  ○ When regressing Stock A against the S&P 500, the slope of the stock's trendline is the stock's beta: if the slope is greater than 1, the stock is more volatile than the market; if less than 1, less volatile (see the sketch below)
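A short Python sketch of what the trendline slope computes when a stock is regressed against the S&P 500; both return series below are made up:

```python
# The least-squares trendline slope = Cov(market, stock) / Var(market),
# which is the stock's beta. Return series are hypothetical.
sp500 = [0.010, -0.020, 0.015, 0.030, -0.010]
stock = [0.020, -0.035, 0.020, 0.045, -0.015]

mean_x = sum(sp500) / len(sp500)
mean_y = sum(stock) / len(stock)

beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sp500, stock))
        / sum((x - mean_x) ** 2 for x in sp500))

print(f"Beta = {beta:.2f}")  # > 1: more volatile than the market
```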
● Column Chart or Bar Chart
  ○ Compares the magnitude of a quantitative variable associated with a categorical variable
  ○ Select the cells - exclude the labels
  ○ Insert tab → Insert Column or Bar Chart → Clustered Column button
  ○ Select the chart, click the Chart Elements button, and check Axes, Axis Titles, and Chart Title
  ○ Sort the table based on % above/below quota of $75 (smallest to largest)
  ○ Select the chart, click Chart Elements, and check the box for "Data Labels"
● Line Chart, Line Sparkline, Win/Loss Sparkline, and Heat Map
  ○ Line Chart - connects a sequence of points to display how a quantitative variable evolves over time (time on the horizontal axis)
  ○ Line Sparkline - a line chart squeezed into a single cell
  ○ Win/Loss Sparkline - "squeezes" markers into a single cell to indicate the signs of a sequence of cells, where an upward-facing block indicates a positive number and a downward-facing block indicates a negative number
  ○ Heat Map - uses different shades of colors to indicate the relative magnitudes of a selection of cells
● Recipe for Line Chart
  ○ Insert chart - select the cells, then Insert Line or Area Chart
  ○ Change labels and axes - right-click on any label along the horizontal axis and choose Select Data
    ■ When the Select Data Source box appears, click Edit in the Horizontal Axis Labels box; when the Axis Labels box appears, select cells B2:K2 and click OK; when the Select Data Source box reappears, click OK
  ○ Add chart elements - select the chart, click the Chart Elements button, and check the boxes for "Axes," "Axis Titles," "Chart Title," and "Legend"
● Recipe for Line Sparkline
  ○ Insert tab; within the Sparklines group, select Line
  ○ Select the data range
  ○ Select the cells; this reveals the "Design" tab
  ○ In the Show group, check High Point and Low Point
● Recipe for Heat Map
  ○ Common heat map
    ■ Select the cells, excluding the labels
    ■ Go to Conditional Formatting
    ■ Under Color Scales, select the Red-White-Blue color scale
  ○ Individual heat map
    ■ Select the cells
    ■ Conditional Formatting → Color Scales
● Recipe for Win/Loss Sparkline
  ○ Insert tab → Sparklines → Win/Loss; select the data range and location range
● Longitude and Latitude - GPS coordinates
  ○ Longitude: between 180 and -180; positive is east, negative is west
  ○ Latitude: between 90 and -90; positive is north, negative is south
  ○ SF: -122 (W), 37.775 (N)
● Bubble Chart
  ○ A two-dimensional plot of three variables, with the magnitudes of the 1st and 2nd represented by the axes of a typical two-dimensional graph and the magnitude of the 3rd variable represented by proportionally sized "bubbles"
  ○ PRACTICE MAKING THIS!!
● PivotTable and PivotChart
  ○ Allow you to slice and dice a data set containing a mixture of categorical variables and quantitative variables
  ○ A PivotChart conveys the same info as a PivotTable, but as a chart (clustered column chart) instead of a table
  ○ PivotTable and PivotChart are complements, not substitutes
  ○ To sort, right-click on a cell, then Sort
  ○ Grouping (see the sketch below)
    ■ Drag Average Salary from the Sum Values zone to the Columns zone
    ■ Right-click any individual salary in the row
    ■ Click Group
    ■ In the Grouping dialog box that opens, enter 2000 for "Starting at," 5999 for "Ending at," and 500 for "By"
    ■ Then click OK
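The grouping step can be mimicked outside Excel; here is an illustrative pandas sketch, where pandas stands in for Excel's PivotTable and the Region/Salary data is made up:

```python
# Bin salaries from 2000 to 5999 in steps of 500 (the same parameters
# entered in the Grouping dialog), then count observations per cell.
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "West", "East", "West", "East", "West"],
    "Salary": [2100, 3500, 4200, 5800, 2900, 4700],
})

df["Salary bin"] = pd.cut(df["Salary"], bins=list(range(2000, 6500, 500)),
                          right=False)

pivot = pd.pivot_table(df, index="Salary bin", columns="Region",
                       values="Salary", aggfunc="count", observed=False)
print(pivot)
```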
Week 10

● Descriptive Data Mining - extracts from the data set patterns and relationships that are too complex to describe using charts and tables
  ○ Also known as Unsupervised Learning
  ○ Predictive Data Mining = Supervised Learning
● Cluster Analysis
  ○ Given info for cluster analysis:
    ■ Data set with rows and columns; each row is an observation
    ■ Each column is 1 quantitative variable (any categorical variables have already been transformed to binary)
    ■ Goal: use the values of the data set's variables as a basis for dividing the observations into distinct clusters, where observations within a cluster are similar to each other
  ○ Start by making a scatter chart
  ○ Applications of cluster analysis:
    ■ Divide a set of customers into k clusters, with a different marketing strategy for each cluster; divide a set of stocks into k clusters and diversify your portfolio by investing in representative stocks from each cluster; etc.
● Two Alternative Algorithms for Performing Cluster Analysis
  ○ k-Means Clustering - an iterative process that begins and ends with k clusters
    ■ The first iteration begins with k randomly created clusters
    ■ During each successive iteration, the current k clusters are regrouped into the next k clusters by considering each observation and, if necessary, reassigning it from its current cluster to a more similar cluster
    ■ The iterations continue either for a pre-specified number of iterations or until each observation is in a cluster of "similar" observations, whichever happens first
  ○ Hierarchical Clustering - an iterative process that begins with n clusters and ends with 1 cluster
    ■ The first iteration has n clusters, with each of the n observations being its own distinct cluster
    ■ Each successive iteration merges the 2 clusters that are most similar to each other, thereby creating a series of clusterings
    ■ In the final iteration, the last 2 clusters are merged into a single cluster
● Euclidean Distance Between Two Points (x1, x2, ..., xn) and (y1, y2, ..., yn)
  ○ distance = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
  ○ Depending on the units used to measure the variables, the Euclidean distance between two observations can be small or large
● Standardizing (Normalizing) a Data Set (see the sketch below)
  ○ Should we measure Age in years, months, days, or hours? Distance depends on how you measure the units
  ○ z-score = (value - mean)/(standard deviation)
  ○ Use Excel to standardize the data: =STANDARDIZE(cell, mean cell (e.g., B$35), standard deviation cell (e.g., B$36))
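A short Python sketch of =STANDARDIZE followed by Euclidean distance on the standardized values (illustrative; the Age/Income columns are made up):

```python
# z-scores as in =STANDARDIZE, then Euclidean distance on the
# standardized scale, so units of measurement no longer matter.
from math import dist
from statistics import mean, stdev

def standardize(column):
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

age    = [25, 32, 47, 51, 38]
income = [48_000, 61_000, 92_000, 88_000, 70_000]
z_age, z_income = standardize(age), standardize(income)

# Distance between observations 0 and 2 on the standardized scale
p = (z_age[0], z_income[0])
q = (z_age[2], z_income[2])
print(f"standardized Euclidean distance = {dist(p, q):.3f}")
```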
● Centroid of a data set - the point whose coordinates are the averages of the corresponding coordinates over the data set
  ○ Within-cluster distance - the average distance between the cluster's points and the cluster's centroid
  ○ Between-cluster distance - the distance between the centroids of two clusters
● k-Means Clustering Algorithm (see the sketch below)
  ○ Initialization steps:
    ■ Specify k (the desired number of clusters)
    ■ Convert the data into standardized (normalized) data
    ■ Because the iterative steps below might not converge, specify the maximum allowable number of iterations
    ■ Randomly partition the observations into k clusters by randomly assigning each observation a number between 1 and k (a "random start")
  ○ Iterative steps (using the standardized data and the Euclidean distance metric):
    ■ Compute each cluster's centroid
    ■ Using the cluster centroids, reassign each observation to the cluster with the closest centroid
    ■ If the clusters changed and the number of iterations is less than the maximum allowed, return to iterative step 1; otherwise proceed
  ○ Final steps:
    ■ For each pair of clusters i and j, compute the between-cluster distance; the larger the between-cluster distance, the more distinct the 2 clusters are from each other
    ■ For each pair of clusters i and j, compute the following 2 ratios; the larger the average and minimum of these ratios, the better the clustering:
      ● (between-cluster distance for clusters i and j)/(within-cluster distance for cluster i)
      ● (between-cluster distance for clusters i and j)/(within-cluster distance for cluster j)
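A minimal Python sketch of the k-means steps above: random start, compute centroids, reassign to the closest centroid, stop when the clusters stop changing or the iteration cap is hit. The points are made up and assumed already standardized; re-seeding an empty cluster at a random data point is an ad-hoc choice of mine, not something the course specifies:

```python
import random
from math import dist

def centroid(points):
    # Coordinate-wise average of a list of points
    return tuple(sum(c) / len(c) for c in zip(*points))

def k_means(data, k, max_iters=50, seed=12345):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in data]  # "random start"
    centroids = []
    for _ in range(max_iters):
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            centroids.append(centroid(members) if members
                             else data[rng.randrange(len(data))])
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in data]
        if new_labels == labels:  # clusters unchanged -> converged
            break
        labels = new_labels
    return labels, centroids

data = [(-1.2, -0.9), (-1.0, -1.1), (-0.8, -1.0),
        (1.1, 0.9), (0.9, 1.2), (1.0, 1.0)]
labels, centers = k_means(data, k=2)
print("labels:", labels)
print("centroids:", centers)
```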

● How to do it in XLMiner: k-Means Clustering
  ○ Launch k-means clustering
    ■ Select any cell in the range of the data
    ■ Click the XLMiner Platform tab
    ■ In the Data Analysis group, click "Cluster"
    ■ Click k-Means Clustering
  ○ In Data Source, confirm each array
  ○ In the Variables area, in the Variables In Input Data list box, select the variables on which to base the clustering, move them to the Selected Variables list box, and click Next
  ○ Select the checkbox to normalize the input data
  ○ In the Parameters area, proceed as follows:
    ■ In # Clusters, enter the desired number of clusters
    ■ In # Iterations, enter the desired number of iterations
    ■ Select Random Starts and enter 10
    ■ In the Centroid Initialization area, check the box for "Set Seed," enter 12345, and click Next
  ○ Check the boxes for Show Data Summary and Show Distance From Centroid, then click Finish
    ■ The best clustering is the one with the smallest sum of squared distances within the clusters
● How to choose the best value of k
  ○ Budget, policy, practicality; evaluate multiple values of k
  ○ Two ways of choosing k using distance-based graphs:
    ■ Choose the knee of the graph of overall average distance from the cluster centroids, plotted against k
    ■ Choose the number of clusters after which the ratio skyrockets

Week 12: Association Rules

● Given info for association rules
  ○ Data set with rows and columns - each row is an observation we call a transaction, and each column is a variable we call an item
  ○ Association rules - if-then rules that describe associations among the items in the transactions
  ○ I.e., "if a transaction includes item Y, it also probably includes item Z"
● Applications
  ○ Suggested items, grocery store product placement
● 3 Formats of the Data Set
  ○ Text - only one column, with comma-separated values for each transaction
  ○ Item List (terminology used by XLMiner)
  ○ Binary Matrix (terminology used by XLMiner)
  ○ ORDER DOESN'T MATTER
  ○ Convert text format to item list format with Text to Columns:
    ■ Excel contains, under "Data," the Text to Columns tool (click Delimited, then check Comma and Space)
    ■ Check the box for "Treat consecutive delimiters as one"
    ■ Enter the destination's upper-left corner in the Destination box
  ○ Optional: convert item list format to binary matrix format
    ■ Use =COUNTIF to see whether an item is contained in the transaction: COUNTIF(range, item)
  ○ XLMiner accepts both item list format and binary matrix format
● Format of an Association Rule
  ○ If-then → antecedent and consequent, where the two parts are disjoint sets of items
  ○ {N, O, P} → {R, S}: if a transaction includes N, O, and P, it probably also includes R and S
  ○ Usually the antecedent has at most 2 items and the consequent has at most 1 item
● Support for a Set of Items
  ○ The number of transactions that include the set of items
  ○ Measures how frequently the items appear together
  ○ I.e., support for set Q = the number of transactions containing Q
  ○ As items are added to the set, the set's support decreases
● 2 Metrics for Measuring the Strength of an Association Rule (see the sketch below)
  ○ Confidence = (support for antecedent and consequent)/(support for antecedent)
  ○ Lift ratio = confidence/((support for consequent)/(total # of transactions))
  ○ Confidence is always between 0 and 1
  ○ Lift ratio > 1 indicates a positive association
● Rules of thumb for a strong rule
  ○ Support >= 20% of total transactions
  ○ Confidence >= 0.80 (the minimum requirement is 0.5)
  ○ Lift ratio > 1
● Primary challenge of association rules - the total number of association rules grows exponentially with the total number of items
● XLMiner Association Rules
  ○ Launch Association Rules
    ■ Select the range of the data, click Associate, then click Association Rules
    ■ In Input Data Format, always select "data in binary matrix format" for this course
    ■ For minimum support, enter 20% of total transactions as a rule of thumb
    ■ For minimum confidence, enter 80% as a rule of thumb
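A small Python sketch of support, confidence, and lift for a hypothetical rule {N, O} → {R} over made-up transactions in item-list (set) form:

```python
# Support counts the transactions containing a set of items;
# confidence and lift follow the formulas above.
transactions = [
    {"N", "O", "R"}, {"N", "O", "P", "R"}, {"N", "O"},
    {"R", "S"}, {"N", "O", "R", "S"}, {"P", "S"},
]

def support(item_set):
    # Number of transactions that include every item in item_set
    return sum(1 for t in transactions if item_set <= t)

antecedent, consequent = {"N", "O"}, {"R"}
n_transactions = len(transactions)

confidence = support(antecedent | consequent) / support(antecedent)
lift = confidence / (support(consequent) / n_transactions)

print(f"support(rule) = {support(antecedent | consequent)}")  # 3
print(f"confidence    = {confidence:.2f}")                    # 0.75
print(f"lift ratio    = {lift:.2f}")                          # 1.12 (> 1)
```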

Week 13: Data Partitioning and Accuracy Measures

● Predictive Data Mining = Supervised Machine Learning

● Given info for predictive data mining:
  ○ Each row is 1 observation (record)
  ○ Each column is 1 quantitative variable
  ○ A subset of the variables are input variables
  ○ 1 variable is the output (dependent) variable
● Goal of predictive data mining: accurately predict (forecast) the value of the output variable given the values of the input variables
● Descriptive data mining (cluster analysis and association rules) = unsupervised machine learning
● Predictive data mining = supervised machine learning, because the data set's known values of the output variable supervise the algorithm as it learns how to predict the unknown values

● Examples of predictive data mining
  ○ Credit scores
● Binary classifier
  ○ Output variable - the variable to be predicted can be one of 3 types:
    ■ A quantitative variable
    ■ A non-binary categorical variable
    ■ A binary categorical variable with exactly 2 categories - the only type we consider
  ○ Answers yes-or-no questions
● Types of Binary Classifiers
  ○ k-Nearest Neighbors, classification tree
● Partitioning a data set to avoid model overfitting
  ○ Model overfitting - occurs when the model accurately classifies the data on which the model is based but poorly classifies never-before-seen data
  ○ Training set - contains the observations used to construct one or more models
  ○ Validation set - contains the observations used to fine-tune a single model and/or select the most accurate model from a set of alternative models
  ○ Test set - contains observations NOT used to construct, fine-tune, or select the final model; therefore, applying the final model to the test set provides an unbiased estimate of the model's accuracy on never-before-seen data
● Use XLMiner to partition a data set
  ○ Standard partitioning (see the sketch below)
    ■ Picks rows randomly
    ■ Automatic percentages - randomly allocates observations according to fixed percentages: 60% to the training set, 40% to the validation set, 0% to the test set
    ■ Specify percentages - randomly allocates observations according to percentages you determine
    ■ Equal percentages - 33.33% to each of the 3 sets
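A sketch of what a standard 60/40 random partition does, using made-up row indices (XLMiner handles this internally; the seed is fixed only so the split is repeatable):

```python
# Shuffle the rows, then cut at 60%: first part trains, rest validates.
import random

rows = list(range(100))        # stand-in for the data set's row numbers
rng = random.Random(12345)
rng.shuffle(rows)

cut = int(0.6 * len(rows))
training, validation = rows[:cut], rows[cut:]
print(len(training), len(validation))   # 60 40
```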
  ○ Partitioning with oversampling
    ■ To improve the accuracy of the model, the training set has 50% class 1 observations and 50% class 0 observations
    ■ To reflect the overall data set, the validation set and the test set have the same (very low) percentage of class 1 observations as the overall data set
    ■ If the percentage of class 1 observations is very low, use partitioning with oversampling
    ■ Go to the XLMiner Platform tab; click Standard Partition or Partition With Oversampling
    ■ % success in training set = 50%; % validation data to be taken away = 40%
    ■ Otherwise: 50% training, 30% validation, 20% test set
● Cutoff Value
  ○ If the probability is above the cutoff (e.g., 0.5), predict 1; otherwise predict 0
  ○ True positive: actual = 1, predicted = 1
  ○ False positive: actual = 0, predicted = 1
  ○ False negative: actual = 1, predicted = 0
  ○ True negative: actual = 0, predicted = 0
● Confusion Matrix (see the sketch below)
  ○ Predicted class (columns) vs. actual class (rows)
  ○ Class 1 error rate = FN/(TP + FN)
  ○ Sensitivity (recall) = 1 - class 1 error rate - the ability to recognize class 1
  ○ Precision = TP/(TP + FP) - the accuracy of class 1 predictions
  ○ F1 score = 2*(sensitivity*precision)/(sensitivity + precision)
  ○ Class 0 error rate = FP/(FP + TN)
  ○ Specificity = 1 - class 0 error rate - the ability to recognize class 0
  ○ Overall error rate = (FP + FN)/(TP + FN + FP + TN)
  ○ The class 1 percentage error and the class 0 percentage error move in opposite directions as the cutoff varies
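The accuracy measures above, computed in Python from hypothetical confusion-matrix counts:

```python
# The Week 13 error measures; TP, FP, FN, TN are made-up counts.
TP, FP, FN, TN = 40, 10, 20, 130

class1_error = FN / (TP + FN)
sensitivity  = 1 - class1_error              # recall: recognizes class 1
precision    = TP / (TP + FP)                # accuracy of class 1 predictions
f1           = 2 * sensitivity * precision / (sensitivity + precision)
class0_error = FP / (FP + TN)
specificity  = 1 - class0_error              # recognizes class 0
overall_error = (FP + FN) / (TP + FN + FP + TN)

print(f"sensitivity={sensitivity:.3f}  precision={precision:.3f}  F1={f1:.3f}")
print(f"specificity={specificity:.3f}  overall error={overall_error:.3f}")
```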

Week 14: Binary Classification via k-Nearest Neighbors

● Overview of the k-Nearest Neighbors Binary Classifier (see the sketch below)
  ○ A binary classifier estimates the probability that an observation's output variable is class 1; this estimate is converted into a prediction of class 1 or class 0 via the analyst-specified cutoff value
  ○ The k-nearest neighbors binary classifier estimates the probability that the observation's output variable is class 1 as follows:
    ■ k-NN identifies the observation's k nearest neighbors, where "nearest" is determined by the Euclidean distance
    ■ k-NN's estimate of the probability that the observation is class 1 equals the fraction of class 1 observations among the k nearest neighbors
    ■ E.g., if 3 of the 5 nearest neighbors are class 1, then the estimated probability of class 1 is 0.6
  ○ The values of the input variables are standardized before the Euclidean distances are computed
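A minimal Python sketch of the k-NN probability estimate and cutoff (the training points and new observation are made up and assumed already standardized; with k = 5 it reproduces the 3-of-5 → 0.6 example above):

```python
# P(class 1) = fraction of class 1 labels among the k nearest
# (Euclidean) neighbors; the cutoff turns it into a prediction.
from math import dist

training = [((0.2, 0.1), 1), ((0.3, 0.4), 1), ((1.5, 1.2), 0),
            ((1.7, 1.4), 0), ((0.1, 0.5), 1)]

def knn_prob_class1(x, k):
    nearest = sorted(training, key=lambda obs: dist(x, obs[0]))[:k]
    return sum(label for _, label in nearest) / k

x_new = (0.4, 0.3)
cutoff = 0.5
p = knn_prob_class1(x_new, k=5)          # 3 of 5 neighbors are class 1
prediction = 1 if p > cutoff else 0      # probability above cutoff -> 1
print(f"P(class 1) = {p}, prediction = {prediction}")   # 0.6, 1
```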
● Determining the "Best" k for a Given Cutoff Value
  ○ Tell XLMiner to identify the best value of k by running k-NN for k ranging from 1 to 20
  ○ For an observation in the training set, the observation is ALWAYS its own nearest neighbor; ties are broken randomly
  ○ XLMiner determines the best k as whatever minimizes the overall percentage error on the validation set
  ○ To explore the tradeoff between the class 1 percentage error and the class 0 percentage error, the analyst varies the cutoff value between 0 and 1
  ○ The lower the cutoff value, the lower the class 1 percentage error and the higher the class 0 percentage error
● XLMiner: k-NN
  ○ Launch k-NN classification
    ■ Select the variables; on the XLMiner tab, click Classify, then click k-Nearest Neighbors
  ○ Enter the cutoff probability
  ○ Check the checkbox for normalizing the data
  ○ In Number of Nearest Neighbors, enter 1-20
  ○ Score on specified value of k - if you know which k you want
  ○ Score on best k between 1 and specified value - to test all of them
  ○ If not oversampling, select "According to relative occurrences in training data" for the prior class probabilities
    ■ If oversampling, select "User specified prior probabilities"
  ○ Get a summary report of everything
● How to classify new data
  ○ In the Variables portion, click Match By Name
  ○ Click k-NN Classification; click "In worksheet" in step 3 of 3
● Questions to ask
  ○ How did you decide which variables to add?
  ○ How did you partition the test and training sets?
  ○ Did you standardize the variables?
  ○ Did you try multiple values of k?
  ○ What is the tradeoff between class 1 and class 0?
  ○ Did you try other binary classifiers?
  ○ PRACTICE THIS

Week 15: Binary Classification via Classification Tree

● Overview of Binary Classification Tree
  ○ A divide-and-conquer iterative process that successively divides the data's training set into increasingly smaller and increasingly more similar subsets... (see the sketch below)
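A sketch of one "divide" step under stated assumptions: the notes don't specify how "similar" is measured, so Gini impurity is used here as a common stand-in, and the data is made up:

```python
# Try every threshold of a single input variable and keep the split
# whose two subsets are most internally similar (lowest weighted Gini).
def gini(labels):
    # 2p(1-p): 0 when a subset is all one class (perfectly "similar")
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    best_score, best_threshold = None, None
    for t in sorted(set(values)):
        left  = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best_score is None or score < best_score:
            best_score, best_threshold = score, t
    return best_threshold, best_score

values = [22, 35, 41, 50, 28, 60]   # hypothetical input variable
labels = [1, 1, 0, 0, 1, 0]         # hypothetical binary output
print(best_split(values, labels))   # -> (35, 0.0): a perfect split
```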

