1DA3 Data Analytics Notes 1

Course: Business Data Analytics
Institution: McMaster University



My formula sheets: https://photos.app.goo.gl/JtC5p9wRKerggiW3A

Chapter 2

Data Terms
● Data: data values/observations are info collected regarding some subject. They are useless without context. Often organized into a data table
● Data mining: aka predictive analytics; the process of using transactional data to make other decisions and predictions
● Analytics: any use of statistical analysis to drive business decisions from data
● Respondent: person who answers a survey
● Subject/participant: person being experimented on
● Experimental unit: company or inanimate object being experimented on
● Record: a row in a database; it corresponds to a case
● Case: who you have data on, ex. individual customers (the who)
● Variable: characteristic recorded about each individual or case (the what)
● Relational database: 2 or more separate data tables are linked so info can be merged. Usually has 3 relations: customer data, item data, transaction data
● Relation: each data table in the relational database
● Counting: very important, and is the core of statistics

Variables
● Categorical variable: a qualitative variable that isn't measured but is a category (ex. location)
● Quantitative variable: a variable measured with units (ex. dollars spent)
● Identifier variable: identifies cases in databases and is unique. It's categorical, doesn't have units, helps combine different datasets, makes relational databases possible, and is not analyzed (ex. student ID number)
● Nominal variable: a categorical variable only used to name a category (ex. school name)
● Ordinal variable: a categorical variable whose values have an order (ex. ranking satisfaction level)
● Time series data: a single variable measured at regular intervals over time, usually months, quarters, or years. It's over time (ex. temperature for days in May)
● Cross-sectional data: several variables measured at the same time; a snapshot in time (ex. revenue, customers, and expenses for May)
● Primary data: data collected yourself
● Secondary data: data collected by someone else
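The relational-database idea above (three relations linked by identifier variables) can be sketched in plain Python. All the customer, item, and transaction values here are made up for illustration:

```python
# A minimal sketch of the three relations described above (customer data,
# item data, transaction data), with hypothetical values. Joining them on
# identifier variables is what makes a relational database work.
customers = {1: {"city": "Hamilton"}, 2: {"city": "Toronto"}}   # keyed by customer ID
items = {10: {"price": 5.0}, 20: {"price": 8.0}}                # keyed by item ID
transactions = [
    {"cust_id": 1, "item_id": 10},
    {"cust_id": 1, "item_id": 20},
    {"cust_id": 2, "item_id": 10},
]

# Merge: each transaction record pulls in the matching customer and item info
merged = [
    {**t, **customers[t["cust_id"]], **items[t["item_id"]]}
    for t in transactions
]
print(merged[0])  # {'cust_id': 1, 'item_id': 10, 'city': 'Hamilton', 'price': 5.0}
```

Note how the identifier variables (`cust_id`, `item_id`) are never analyzed themselves; they only connect the tables.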

Chapter 3

3 Principles of Sampling
● Principle 1: Examine part of the whole
  ○ Settle for sampling a smaller group of individuals to represent the whole population (sample survey)
  ○ The Angus Reid Daily Omnibus is an example of a sample survey
  ○ Biased samples over- or under-emphasize some characteristics of the population
● Principle 2: Randomize
  ○ Selecting individuals at random can decrease biases
  ○ Randomizing protects us from the influences of all the features of our population by making sure that, on average, the sample looks like the rest of the population
  ○ It's fair because nobody can guess the outcome before it happens, and when we want things to be fair, the outcomes are usually equally likely
  ○ Sample variability: samples naturally differ from each other
  ○ Random sampling is better than matching the sample to the population, because you won't be able to match all the characteristics (ex. age, income, etc.)
● Principle 3: Sample size matters
  ○ The size of the sample determines what we can conclude from the data, regardless of the size of the population
  ○ The fraction of the population doesn't matter (ex. 100 people can be a sample for a university or for a country, and it wouldn't matter)

Census
● A census samples the entire population, but has many disadvantages
● Difficult to conduct, because you can't get everyone
● Impractical (ex. the whole population taste-testing a product)
● Populations change
● Cumbersome, due to hiring experimenters, locating people without double counting, etc.

Terms
● Structured data: well-defined length and format
● Unstructured data: no pre-defined format (ex. doctors' notes, reports, video data)

Populations
● Parameter: key numbers in models that represent reality
● Population parameter: a parameter used in a model for a population
● A representative sample statistic estimates the corresponding population parameter accurately
● A sample is used to calculate a statistic, which is used to estimate the parameter, which tells us about the population
● Ex. survey 100 customers (sample). 60% of the sample prefer chocolate (statistic). That's used to estimate that 60% of Canada likes chocolate (parameter). This end result tells us about the whole country (population).
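The sample → statistic → parameter chain above can be simulated. This is a minimal sketch assuming a made-up population of 100,000 customers where the true parameter really is 0.60:

```python
# Simulate the chocolate example: draw a simple random sample of 100
# "customers" from a population where 60% prefer chocolate, then use the
# sample proportion (statistic) to estimate the parameter.
import random

random.seed(1)
population = [1] * 60_000 + [0] * 40_000   # 1 = prefers chocolate; true parameter = 0.60
sample = random.sample(population, 100)    # simple random sample of size 100
statistic = sum(sample) / len(sample)      # sample proportion
print(statistic)  # lands near 0.60, but varies from sample to sample (sample variability)
```

Rerunning with different seeds shows sample variability: each sample gives a slightly different statistic, all clustered around the parameter.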

1. Simple Random Sampling (SRS)
● SRS: the standard sample. Ensures that every possible sample of the size you plan to draw has an equal chance of being selected
● Sampling frame: the list of individuals from which the sample will be drawn. Best is to choose from it at random

2. Stratified Sampling
● Stratified sampling: divide the population into homogeneous groups (strata), then use random sampling within each stratum, combining the results at the end
● Reduces sampling variability
● Important to not ignore some strata because we think they are small and insignificant (ex. don't ignore small companies when counting a country's imports and exports)
● Ex. the population is 60% female, 40% male. Instead of simple random sampling 10 people, which may result in 9 males and 1 female, you make sure to randomly pick 6 females and 4 males to accurately represent the population

3. Cluster Sampling
● Randomly picking clusters out of the whole population, to make the task manageable
● Clusters are similar to each other; select some clusters at random and then choose a random sample within them (strata, by contrast, differ from each other)
● Ex. pick 4 out of the 12 BMO offices to interview

4. Systematic Sampling
● Systematic sampling: sampling in a certain order (ex. every 100th item is tested)

5. Multistage Sampling
● Multistage sampling: combining different types of sampling together (ex. stratified sampling within cluster sampling)

Valid Survey
● Need to make sure your survey can yield the info you need; keep 4 questions in mind:
  ○ What do I want to know?
  ○ Who are the appropriate respondents?
  ○ What are the best questions?
  ○ What will be done with the results?
● Things to remember:
  ○ Know what you want to know - don't ask unnecessary questions, and make sure you have a clear idea of what you want to learn from the data
  ○ Use the right sampling frame - identify the proper participants
  ○ Ask specific rather than general questions
  ○ Watch for biases - nonresponse bias or voluntary response bias (ex. angry customers voluntarily submit surveys)



  ○ Don't confuse inaccuracy and bias - bias means the survey will be systematically off, no matter how many people you interview. Accuracy can be improved with larger samples
  ○ Be careful with question phrasing - ex. say "parents" instead of just "family"
  ○ Watch for measurement errors - these occur when there are intentional or unintentional wrong answers, due to not offering a wide enough range of possible answers to pick from. Be careful with answer phrasing. Do pilot tests of the draft survey on a small sample to find flaws
  ○ Be sure you want a representative sample - be a good experimenter and don't try to create biased results

Bad Sampling
● Nonresponse bias: when individuals don't respond to questions, don't answer the phone, or don't want to be part of the sample
● Voluntary response bias: only counting voluntary participants results in bias, because they are strongly opinionated or motivated (ex. angry customers)
● Convenience sampling: interviewing people who are easy to approach
● Bad sampling frame: interviewing people not in your survey sampling frame
● Undercoverage: some portion of the population isn't sampled at all, or has a smaller representation than it should have
● What can go wrong: nonrespondents, long dull surveys, response bias, and push polls
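Voluntary response bias can be simulated. This sketch assumes made-up numbers: 20% of customers are angry, but angry customers are far more likely to submit the survey:

```python
# Simulate voluntary response bias: the voluntary sample badly overstates
# the share of angry customers, because anger motivates responding.
import random

random.seed(4)
population = [("angry" if random.random() < 0.2 else "happy") for _ in range(10_000)]

# Assumed response rates: angry customers respond 90% of the time, happy ones 10%
respond = {"angry": 0.9, "happy": 0.1}
sample = [c for c in population if random.random() < respond[c]]

true_share = population.count("angry") / len(population)      # about 0.20
sample_share = sample.count("angry") / len(sample)            # far above 0.20
print(round(true_share, 2), round(sample_share, 2))
```

Even though only about 1 in 5 customers is angry, the voluntary sample is dominated by them; no amount of extra responses fixes this, because the bias is systematic.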

Chapter 4

Displaying data
● Making a picture for data visualization is the most important rule of data analytics
● Summarizes huge amounts of data into easy-to-follow graphs and plots
● Well-designed data graphics are strong tools to convey the meaning behind data
● Visualization plays an important role in telling the story of the data

Tables
● Frequency table: puts data into piles and a table
● Frequency distribution: groups data into categories and records the number of observations in each category
● Relative frequency table: uses a % for each category, relative to the 100% total
● Contingency table: shows how the values of one variable are contingent on the values of a second variable
● A contingency table holds bivariate data, since it features two variables
● Marginal distribution: the total count for a variable, without reference to the values of the other variables
● Each cell of a contingency table gives the count for a combination of values of both variables
● Conditional distribution: the distribution of a variable restricting the "who" to only a smaller group of individuals (ex. only show the results for Russia out of all the countries)

Charts
● Area principle: the area occupied by a part of the graph should correspond to the magnitude of the value it represents (take up more space on the graph for larger values)
● Bar chart: displays the distribution of one categorical variable, showing the counts for each category next to each other for easy comparison
● Segmented (stacked) bar chart: treats each bar as a whole, and divides it proportionally into segments corresponding to the % in each group
● Pie chart: shows the whole group of cases as a circle; each slice of the pie is proportional to the fraction of the whole in each category. Make sure your data represents 100% of something (categories can't overlap)

Simpson's Paradox
● A phenomenon that arises when averages or percentages are taken across different groups, and these group averages appear to contradict the overall averages
● Caused by different group sizes leading to different percentages
● Ex. Student A gets a lower mark than Student B on each of two quizzes, but when the quizzes are combined, Student A has a higher overall score - which seems impossible
● The lesson is to combine comparable measurements for comparable individuals, because combining percentages can be misleading
● Usually better to compare percentages within each level rather than across levels
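The quiz example above can be made concrete with invented scores. Student A loses each quiz on percentage, yet wins overall, because the quizzes have very different sizes:

```python
# A concrete instance of Simpson's paradox with made-up quiz scores,
# stored as (points earned, points possible).
a = {"quiz1": (10, 20), "quiz2": (85, 100)}
b = {"quiz1": (8, 15),  "quiz2": (9, 10)}

for quiz in ("quiz1", "quiz2"):
    pa = a[quiz][0] / a[quiz][1]
    pb = b[quiz][0] / b[quiz][1]
    print(quiz, f"A={pa:.1%}", f"B={pb:.1%}")   # B wins each quiz

# Combine across quizzes: total earned / total possible
overall_a = sum(e for e, _ in a.values()) / sum(p for _, p in a.values())
overall_b = sum(e for e, _ in b.values()) / sum(p for _, p in b.values())
print(f"overall A={overall_a:.1%}, B={overall_b:.1%}")  # yet A wins overall
```

The reversal happens because A's strong quiz is worth 100 points while B's strong quiz is worth only 10, so the group sizes dominate the combined percentage.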

Chapter 5

Histogram
● Histogram: plots bin counts as the heights of bars and describes the overall shape of the data
● No gaps between bars; a gap means there's a gap in the data
● Quantitative data is separated into bins of equal width
● Categorical/qualitative data is represented by its own bar per category
● The number of bins depends on how much data there is: ln(n) / ln(2) with n data points
  ○ Ex. 29 data points: ln 29 / ln 2 ≈ 4.86, so 5 bins (rounded)
● Determine the count for each bin
● Have a consistent rule for data that falls on bin boundaries
  ○ Ex. go into the higher bin when on a boundary: $5 goes into the $5-$10 bin, not $0-$5

Relative frequency histogram
● Vertical axis is the % of the total number of cases
● Shape is exactly the same as the normal histogram

Stem and leaf display
● Like a histogram, but shows the individual values
● Used for quantitative values
● Use part of each number to name the bins (the stem)
● Use the next digit of each number to make the leaves
● Round numbers to 1 decimal place
● May split leaves if there's too much data in one leaf
● Ex. 2.06, 2.22, 2.44, 3.28, 3.34 → 2 | 124 and 3 | 33
● Ex. 0.3, 0.5 → 0 | 35
● Ex. -0.4, -0.8 → -0 | 48

Shape of graph
● Mode: the value at the centre of a bump/peak in the graph
● Unimodal distribution: one main peak
● Bimodal distribution: 2 peaks; indicates there are two groups in the data
● Multimodal distribution: 3+ peaks
● Uniform: no clear modes; bars are approximately the same height
● Symmetry: the graph can be folded through the middle and be symmetrical
● Tail: the thinner end of a distribution
● Skewed: one tail stretches out farther than the other. Amounts of things (ex. time, people) can't be negative and have no natural limit, so a right-skewed distribution is common, with the tail to the right
● Outliers: values that stand away from the body of the distribution. They can affect every statistical method we will study, can be an informative part of your data, can be errors, and should be discussed in any conclusions drawn about the data
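The bin rule and boundary rule above can be sketched with made-up data:

```python
# Number of bins ≈ ln(n)/ln(2), then count how many values fall in each
# equal-width bin, with boundary values going into the higher bin.
import math

data = [3, 7, 8, 12, 15, 18, 21, 22, 26, 29, 31, 34, 38, 41, 44]
n = len(data)
bins = round(math.log(n) / math.log(2))     # 15 points: ln 15 / ln 2 ≈ 3.9 → 4 bins
width = (max(data) - min(data)) / bins      # equal bin width

counts = [0] * bins
for y in data:
    # Integer division sends boundary values up; the max clamps into the last bin
    i = min(int((y - min(data)) / width), bins - 1)
    counts[i] += 1
print(bins, counts)  # → 4 [4, 4, 3, 4]
```

Every data point lands in exactly one bin, so the counts sum back to n, which is a useful sanity check when building histograms by hand.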

Centre of graph
● Mean: the average of all the data. It is the balancing point, can be misleading, and is the better choice for symmetric unimodal data. The mean of a sample is written ȳ; the mean of a population is written 𝛍.
  ○ Mean of two values A and B = (A + B) / 2
  ○ Geometric mean = (A × B)^(1/2)
● Median: the value splitting the graph into two equal areas; the midpoint of the data; the better choice for skewed data; isn't affected by outliers
● Mode: the value that occurs most often

Spread
● Symmetric data: use the mean and standard deviation for spread
● Non-symmetric data: use the median and interquartile range for spread
● Range = max - min
● Range is a weak measure, because outliers affect it a lot
● The number line can be asymmetric; the quartile with the larger range indicates skew
● Interquartile range (IQR) = Q3 - Q1
  ○ Ex. 1, 5, 8, 10, 12, 15, 23, 27 (8 data values)
  ○ Q1 position = 8 values / 4 quadrants = 2nd data value = 5
  ○ Q3 position = (3 quadrants)(8 values) / 4 quadrants = 6th data value = 15
  ○ IQR = 15 - 5 = 10
  ○ If a quartile position is a decimal, always round up no matter the decimal!

● Variance: the standard deviation squared.

  s² = Σ(y − ȳ)² / (n − 1)

● Standard deviation: uses the deviations of all data values from the mean. To measure the spread of sample data:
  ○ 1. Each value - mean = deviation
  ○ 2. Square each deviation
  ○ 3. Add up all the squared deviations
  ○ 4. Divide by (number of values - 1)
  ○ 5. Square root the result of step 4

  s = √( Σ(y − ȳ)² / (n − 1) )

● Standard deviation for population data uses the population mean 𝛍 and divides by N instead of (n − 1):

  σ = √( Σ(y − 𝛍)² / N )

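The spread measures above can be computed on the chapter's own 8-value example, following the positional quartile rule and the five standard-deviation steps:

```python
# IQR via the course's positional rule (Q1 at position n/4, Q3 at 3n/4,
# decimal positions rounded up), plus the 5-step sample standard deviation.
import math

data = sorted([1, 5, 8, 10, 12, 15, 23, 27])
n = len(data)

q1 = data[math.ceil(n / 4) - 1]        # position 2 → value 5
q3 = data[math.ceil(3 * n / 4) - 1]    # position 6 → value 15
iqr = q3 - q1                          # 15 - 5 = 10

mean = sum(data) / n                   # ȳ
deviations = [y - mean for y in data]  # step 1: value - mean
squared = [d ** 2 for d in deviations] # step 2: square each deviation
total = sum(squared)                   # step 3: add them up
variance = total / (n - 1)             # step 4: divide by n - 1 (this is s²)
s = math.sqrt(variance)                # step 5: square root

print(iqr, round(s, 2))  # → 10 8.8
```

Note the IQR (10) is much smaller than the standard deviation would suggest here, because the largest values (23, 27) pull hard on the squared deviations but barely touch the quartiles.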
Coefficient of variation
● Coefficient of variation (sample) = standard deviation / mean
● Has no units
● Measures how much variability exists compared with the mean
● Ex. compare the coefficient of variation for different bank investments; higher variation is attractive to traders who want to see stocks change a lot

Adding measures of centre/spread
● Means can be added up to get a total
● Medians and modes can NOT be added together
● Variances can be added up to get a total, as long as the variables are uncorrelated
● Other measures of spread can NOT be added together

Grouped data
● When data is grouped (like a histogram), use the midpoint of each range as that group's value when computing the mean

5 Number Summary
● 5 number summary: report the max, min, upper quartile Q3, lower quartile Q1, and median
● Typically used to summarize stocks
● Boxplot: displays the 5 number summary as a central box with whiskers that extend to the non-outlying values. Boxplots are effective for comparing groups.
  ○ Draw a single vertical axis spanning the extent of the data
  ○ Draw short horizontal lines at Q1, Q3, and the median; draw vertical lines around them to create a box
  ○ Draw vertical lines out of the box, extending (1.5)(IQR) above Q3 and below Q1. These are the whiskers.
  ○ The horizontal line at the end of each whisker is the fence.
  ○ Plot any outliers outside the fences with special symbols (ex. circles, stars)
● Examining boxplots
  ○ If the median line is in the middle of the box, the data is symmetric
  ○ If the median line is not in the middle of the box, the data is skewed
  ○ If the whiskers aren't the same length, that also shows skewness
● When comparing many groups, use multiple boxplots on one scale

Percentiles
● Put the data in order from smallest to largest (round positions up no matter the decimal)
  ○ Example: 15, 15, 16, 18, 24, 24, 25, 26, 26, 27, 30, 31
  ○ Option 1, if the position is a decimal: to find the 80th percentile, calculate 80% of the 12 data values, which is 9.6, rounded up to 10. The 10th data value is 27.
  ○ Option 2, if the position is an integer: to find the 50th percentile, calculate 50% of the 12 data values, which is the 6th data value. Take the average of the 6th and 7th data values, which is 24.5.

Standardizing
● Standardize different types of values to compare them
● See how far from the mean the value is, then measure that distance in standard deviations
● Z-score: shows how many standard deviations a value is above or below the mean
● Positive z-score = above the mean; negative z-score = below the mean

● Z-score formula: z = (y − ȳ) / s
● Outliers are values with z > 3 or z < -3
● Ex. Tim Hortons created more jobs with a lower average salary; Starbucks created fewer jobs with a higher average salary. Which company is better? Use standardization to compare.

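The standardization idea above can be sketched with made-up salary numbers (none of these figures are real company data):

```python
# Z-scores let you compare values measured on different scales: here, two
# hypothetical job offers judged against each company's own salary distribution.
avg_th, sd_th = 35_000, 4_000     # assumed Tim Hortons salary mean / sd
avg_sb, sd_sb = 45_000, 8_000     # assumed Starbucks salary mean / sd

offer_th, offer_sb = 41_000, 53_000   # two hypothetical offers

z_th = (offer_th - avg_th) / sd_th    # 1.5 sd above Tim Hortons' mean
z_sb = (offer_sb - avg_sb) / sd_sb    # 1.0 sd above Starbucks' mean
print(z_th, z_sb)  # → 1.5 1.0
```

Even though the Starbucks offer is larger in dollars, the Tim Hortons offer is further above its own mean, so relative to each company's distribution it is the stronger offer.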
Analysis of relative location
● If data is symmetric and bell shaped, it's usually approximated by the normal distribution
● The normal distribution is used as an approximation in many real-world applications
● Empirical rule: gives the % of the data contained within 1, 2, and 3 standard deviations of the average; aka the 68-95-99.7% rule
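The 68-95-99.7 rule above can be checked on simulated bell-shaped data (the mean of 100 and sd of 15 are arbitrary choices):

```python
# Simulate bell-shaped data and check the empirical rule: about 68% of
# values fall within 1 sd of the mean, and about 95% within 2 sd.
import random

random.seed(3)
data = [random.gauss(100, 15) for _ in range(10_000)]   # mean 100, sd 15

within_1sd = sum(85 <= y <= 115 for y in data) / len(data)
within_2sd = sum(70 <= y <= 130 for y in data) / len(data)
print(round(within_1sd, 2), round(within_2sd, 2))  # roughly 0.68 and 0.95
```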

Time series plot
● Time series plot: a display of values against time
● Shows point-to-point variation (line or dots)
● Ex. price of stocks over time, population over time
● Stationary: a time series whose behaviour does not change over time

Transform skewed data
● Skewed data is hard to summarize: it's hard to tell where the centre is, and whether the most extreme values are outliers or part of the tail
● Ex. a company's salaries, ranging from labour workers to the CEO
● Re-express or transform the data:
  ○ Skewed to the right: use logs or square roots
  ○ Skewed to the left: square or cube the data values
● This makes the extreme values smaller, and the histogram more symmetric and easier to read
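The salary example above can be sketched with made-up numbers. On the raw scale the CEO outlier drags the mean far above the median; after a log transform the two sit much closer together:

```python
# Re-express right-skewed data with a log transform: the mean and median
# of the transformed values are much closer, so the data is easier to summarize.
import math

salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 60_000, 1_000_000]  # CEO outlier
logs = [math.log10(s) for s in salaries]

mean_raw = sum(salaries) / len(salaries)          # 180,000 — pulled up by the CEO
median_raw = sorted(salaries)[len(salaries) // 2] # 45,000
print(mean_raw, median_raw)

mean_log = sum(logs) / len(logs)
median_log = sorted(logs)[len(logs) // 2]
print(round(mean_log, 2), round(median_log, 2))   # now much closer together
```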

Chapter 6

Scatterplots
● Best for observing the relationship and any association between 2 quantitative variables
● Direction
  ○ Positive: lower left to upper right
  ○ Negative: upper left to lower right
  ○ No relationship: slope of 0 and no correlation
● Form
  ○ Straight: points stretched out in a straight line
  ○ Curved: points increase or decrease in a steady curve
● Strength
  ○ Strong: points clustered tightly in a single stream
  ○ Weak: points spread out
  ○ Calling an association "perfectly" positive or negative means the points form a perfect line
● Unusual features
  ○ Outliers

Variables in scatterplots
● Bivariate analysis: statistical analysis of 2 variables at the same time
● Scatterplots often don't show the origin if there are no values near it
● X-axis: the x variable plays the role of the explanatory/predictor/independent variable
● Y-axis: the y variable plays the role of the response/dependent variable

Correlation
● Sₓ, Sᵧ: the sample standard deviations
● x̄ and ȳ: the averages
● Standardize the independent and dependent variables: Zₓ = (x − x̄) / Sₓ and Zᵧ = (y − ȳ) / Sᵧ
● Correlation coefficient: a numerical measure of the direction and strength of a linear association. 3 equivalent formulas to use:
  ○ R = Σ Zₓ Zᵧ / (n − 1)
  ○ R = Σ (x − x̄)(y − ȳ) / [(n − 1) Sₓ Sᵧ]
  ○ R = Σ (x − x̄)(y − ȳ) / √( Σ(x − x̄)² Σ(y − ȳ)² )
● Covariance: an alternative to the correlation coefficient. Cov(X, Y) = r Sₓ Sᵧ
● Units for covariance are the units of x times the units of y. Use covariance when you need to talk in terms of units; use the correlation coefficient if you don't care about units.
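The three correlation formulas above are equivalent, which can be checked on made-up data, along with the covariance relationship Cov(X, Y) = r Sₓ Sᵧ:

```python
# Compute the correlation coefficient three ways and confirm they agree,
# then recover the covariance from r and the standard deviations.
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

# Formula 1: average product of z-scores
r1 = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y)) / (n - 1)
# Formula 2: summed cross-products over (n - 1)·Sx·Sy
r2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
# Formula 3: cross-products over the root of the summed squares
r3 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / math.sqrt(
    sum((xi - xbar) ** 2 for xi in x) * sum((yi - ybar) ** 2 for yi in y))

cov = r1 * sx * sy   # covariance, in x-units times y-units
print(round(r1, 3), round(r2, 3), round(r3, 3), round(cov, 3))
```

All three values of r match (about 0.775 here), and the covariance (1.5) carries the units of x times y, which is why it is preferred only when units matter.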

Correlation coefficient properties
● The sign of the correlation coefficient gives the direction
● Correlation is always between -1 and 1 (between perfectly negative and perfectly positive)
● Correlation treats x and y symmetrically (the correlation of x with y is the same as y with x)
● Correlation has no units
● Correlation is not affected by changes in centre or scale (ex. changing dollars from CAD to euros will not change the correlation)
● Correlation is sensitive to unusual observations/outliers

Correlation Conditions
● Correlation measures the strength of the linear association between 2 quantitative variables. Check these conditions before you use correlation:
● 1. Quantitative variables condition: correlation is only used for quantitative variables with units
● 2. Linearity condition: correlation is only used for straight, linear data. It's a judgement call whether the data is straight enough.
● 3. Outlier condition: when there are one or more outliers, it's good to report the correlation both with and without them

Correlation table
● Gives a summary of info at a glance
● Diagonal cells are always 1.000
● The upper half of the table is symmetrical with the lower half

