Title | 1DA3 Data Detailed Lecture Notes |
---|---|
Course | Business Data Analytics |
Institution | McMaster University |
Pages | 235 |
File Size | 14 MB |
File Type | |
Total Downloads | 141 |
Total Views | 903 |
Hi everyone, just a heads up: the doc link will be disabled half an hour before the exam. However, the doc can be downloaded now for personal use. Please download it before the link closes. Thanks and Good Luck!
Winter Semester 2021
Teacher: Behrouz Bakhtiari
BUSINESS STATISTICS
CHAPTERS 1 & 2: DATA AND VARIABLES
What Is Data? ●
Data values or OBSERVATIONS are INFORMATION collected regarding some subject
●
Data can be numbers, names, etc. and tell us the “Who” and the “What”
●
Data are useless without their CONTEXT
●
Data are often organized into a DATA TABLE
Example of Data Table:
Cases = Rows Variables = Columns ●
Rows of the data table correspond to individual CASES about “Whom” (or about “Which” if they are not people) we record some characteristics
●
CHARACTERISTICS recorded about each individual or case are called VARIABLES
●
These are usually shown as the columns of a data table and identify “What” has been measured
●
Data tables are cumbersome for complex data sets (too messy and complicated), so often two or more separate data tables are linked together in a RELATIONAL DATABASE
●
Each data table included in the database is a RELATION because it is about a specific set of cases with information about each of these cases for all of the variables
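A minimal sketch of how two data tables can be linked into a relational structure, assuming a shared identifier column. The table contents and the column names (`customer_number`, `purchase_amount`, etc.) are hypothetical and only illustrate the idea:

```python
# Two hypothetical data tables: each dict is a case (row), each key a variable (column).
customers = [
    {"customer_number": 101, "name": "Ana", "city": "Hamilton"},
    {"customer_number": 102, "name": "Ben", "city": "Toronto"},
]
purchases = [
    {"customer_number": 101, "purchase_amount": 25.00},
    {"customer_number": 102, "purchase_amount": 40.50},
    {"customer_number": 101, "purchase_amount": 12.75},
]

# Index the customer table by its unique identifier.
customers_by_id = {row["customer_number"]: row for row in customers}

# Link each purchase to its customer (a simple join on the identifier).
linked = [{**customers_by_id[p["customer_number"]], **p} for p in purchases]

for row in linked:
    print(row)
```

Keeping customers and purchases in separate tables avoids repeating the name and city on every purchase row; the identifier is what makes the link possible.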
Types of Variables ●
Categorical (aka Qualitative): names categories, indicates whether or not a case falls into a certain category ○
Ex. type of ice cream preferred: the ice cream names (chocolate chip, mint and vanilla) are the category names
●
Quantitative: measures numerical values with/without units. Tells us about the QUANTITY of something ○
Ex. Number of cars owned in certain geographical location by families
○
Some quantitative variables have units (purchase amount = $) and some are unitless (click count = # of clicks)
○
DATE IS NOT QUANTITATIVE IT IS QUALITATIVE
●
Is “customer number” categorical or quantitative? It is a unique identifier, so it is a categorical variable
●
Some variables could be categorical and quantitative (age)
●
Counting: core of statistics, usually count things to get insight into the world
●
Bar graphs and pie charts used for visualizing categorical variables ○
Ex. Counting cases in each category, visualization can be bar graph and pie chart
○
Ex. counting how many of something was observed
●
Identifiers: identify cases in databases/datasets. An identifier is unique ○
A categorical variable
○
Does not have units
○
Helps to combine different datasets and makes relational databases possible
○
Are not to be analyzed
○
Ex. customer number
Other Data Types: ○
Nominal: categorical variables used only to name categories (school attended)
○
Ordinal: if a variable can be ordered (categorical or quantitative) ■
Ex. satisfaction level and purchase amount (unsatisfied as 1, neither satisfied nor dissatisfied as 2 and satisfied is 3)
○
Time Series: data that are gathered at regular intervals over time ■
Ex. temperature of days in September
○
Cross-sectional data: when data for several variables are measured at the same point in time ■
Ex. determining sales revenue, number of customers and expenses of the last month of business
First example is cross-sectional. Second example is time series and quantitative.
Data Collection: ●
Primary Data: collected by the researcher/analyst ○
That has a specific question like a specific product of your company that no one has researched before
●
Secondary Data: collected by another party (like Statistics Canada) and then obtained by the researcher/analyst
●
“When” and “Where” data was collected is important
●
“How” data is collected can make the difference between insight and nonsense (data collected through internet polls vs. polling agencies)
CHAPTER 3: SURVEY AND SAMPLING ●
Sampling ○
Why do we take samples? ■
Provide insight into behaviours of a population
■
Population is big
■
Observing the whole population is impossible or costly or too time consuming
■
Only a sample of the population is AVAILABLE to observe or is observable
○
Data analysis helps us draw insight about the population by observing and analyzing the sample
●
Sample Data collected is either ○
Cross-sectional data
○
Time-series data
●
Structured Data: well-defined length and format
●
Unstructured Data: no pre-defined format (doctor’s notes, reports, video data, etc.) NOT IMPORTANT
●
Big Data
Features of Sampling: ●
Examine a part of the whole ○
Use SAMPLE SURVEYS: questions designed to give us answers about some characteristics of the sample
○
Sample may be BIASED: a biased sample over- or under-emphasizes certain characteristics of the population ■
Gives us a biased understanding of the characteristics of the population
○
Individuals (cases) for samples must be selected RANDOMLY
●
Randomize ○
Protects us by giving us a representative sample even for effects we were unaware of
○
Seems fair because nobody can guess the outcome before it happens and because usually some underlying set of outcomes will be equally likely
○
Sample Variability: sample-to-sample differences ■
Ex. average height of McMaster undergrad business students by drawing samples from different sections of 1DA3
●
Sample size is what matters ○
Size of the sample determines what we can conclude from the data REGARDLESS OF THE SIZE OF THE POPULATION
○
How big of a sample do we need? ■
Depends on what we are estimating
■
Too small sample size may not be representative of pop
■
Prefer a sample that is a good representative of the pop and is as small as possible
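The claim that sample size matters regardless of population size can be checked with a small simulation. This is only a sketch under stated assumptions: the populations are hypothetical, normally distributed heights (mean 170, standard deviation 10), and the helper name `spread_of_sample_means` is made up for illustration.

```python
import random
import statistics

def spread_of_sample_means(population, n, draws=300, seed=0):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]
    return statistics.stdev(means)

# Two hypothetical populations of very different sizes, same underlying distribution.
gen = random.Random(42)
small_pop = [gen.gauss(170, 10) for _ in range(10_000)]
large_pop = [gen.gauss(170, 10) for _ in range(100_000)]

spreads = {
    (name, n): spread_of_sample_means(pop, n)
    for name, pop in (("small", small_pop), ("large", large_pop))
    for n in (25, 400)
}

for (name, n), value in spreads.items():
    print(f"population={name:5s} n={n:3d} spread of sample means={value:.2f}")
```

In this setup the sample-to-sample variability shrinks as n grows and is roughly the same for both populations at a given n, which is the point of the bullet above.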
Population and Parameters ●
Census: sample that includes observations from the entire pop
●
Census is usually not the best idea
○
Difficult, impractical or cumbersome to perform
○
Pop characteristic may change
○
Cannot perform censuses often
●
Models use mathematics to represent reality (ex. mean) ○
Parameters: key numbers in models that represent reality
○
Population Parameter: parameter used in a model for a population
●
Need to estimate population parameters through the sample data
●
Sample Statistic: anything calculated from a sample. A sample whose statistics estimate the corresponding population parameters accurately is said to be representative
●
Goal is to use sample statistics from the sample to estimate population parameter
Simple Random Sample (SRS) ●
Sample drawn so that every possible sample of the size we plan to draw has an equal chance of being selected
●
Sampling Frame: must be defined to select a random sample; it is a list of individuals (or cases, records) from which the sample will be drawn
●
Once we have the sampling frame, we can assign a sequential number to each individual in the sampling frame and draw random numbers to identify those to be sampled
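The numbering-and-drawing procedure above can be sketched as follows. The sampling frame of 200 student records and the function name `simple_random_sample` are hypothetical; the list index stands in for the sequential number assigned to each individual.

```python
import random

# Hypothetical sampling frame: a list of 200 student records.
sampling_frame = [f"student_{i:03d}" for i in range(1, 201)]

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct cases so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    # Sequential numbers 0..len(frame)-1 play the role of the assigned IDs.
    chosen_numbers = rng.sample(range(len(frame)), n)
    return [frame[i] for i in chosen_numbers]

sample = simple_random_sample(sampling_frame, n=10, seed=1)
print(sample)
```

Fixing `seed` only makes the draw repeatable for checking; leave it as `None` for an actual survey draw.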
Other Random Sample Designs ●
What all statistical sampling designs have in common is that chance, rather than human choice, is used to select the sample
●
Stratified sampling: slice the population into homogeneous groups, called STRATA, and use SRS within each stratum to select members. Combine the results at the end ○
Reduced sample variability is the most important benefit of stratified sampling
●
Cluster Sampling: split the pop into parts or CLUSTERS that each represent the pop. Select one or a few clusters at random and perform a census within each. If each cluster fairly represents the pop, cluster sampling will generate an unbiased sample
●
Systematic sampling: systematic approach is used to select individuals. Start from a randomized individual and follow the approach to create the sample ○
Ex. pick every 10th individual from a list of employees to create a sample of 30 individuals
●
Multistage sampling: sampling schemes that combine several methods
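A rough sketch of two of the designs above, stratified and systematic sampling, assuming a hypothetical list of 300 employees in three departments. The function names and the data are illustrative only.

```python
import random

def stratified_sample(cases, stratum_of, n_per_stratum, seed=None):
    """Run an SRS of n_per_stratum cases within each stratum, then combine."""
    rng = random.Random(seed)
    strata = {}
    for case in cases:
        strata.setdefault(stratum_of(case), []).append(case)
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return combined

def systematic_sample(cases, step, n, seed=None):
    """Start at a random case among the first `step`, then take every step-th case."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return cases[start::step][:n]

# Hypothetical employee list: 300 employees spread over three departments.
departments = ["sales", "ops", "finance"]
employees = [{"id": i, "dept": departments[i % 3]} for i in range(300)]

by_dept = stratified_sample(employees, lambda e: e["dept"], n_per_stratum=5, seed=0)
every_10th = systematic_sample(employees, step=10, n=30, seed=0)
print(len(by_dept), len(every_10th))
```

The stratified draw guarantees that every department is represented, while the systematic draw mirrors the "every 10th employee" example with a randomized starting point.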
Valid Survey ●
Survey that can yield the info you need about the population in which you are interested
●
To help ensure a valid survey, you need to ask 4 questions
○
What do I want to know?
○
Who are the right respondents?
○
What are the right questions?
○
What will be done with the results?
●
Be clear about what you want to learn, phrase the questions and answers carefully, and use the right sampling frame
●
Nonresponse bias: when individuals don't respond to questions
●
Voluntary response bias: in volunteer surveys, individuals with the strongest feelings on either side of the issue are more likely to respond; those who don't care may not bother
●
Measurement errors: when a question does not take into account all possible answers ○
Ex. radio buttons where you can only pick one of the answers listed
●
Pilot Test: a small sample from the sampling frame is surveyed first, which protects against measurement errors
●
It is important not to confuse INACCURACY with BIAS. Both create errors but the roles are different ○
We want our samples to be ACCURATE which means they represent the population accurately
○
Samples of very small sizes cannot accurately represent the pop
○
Increasing sample size and more randomization increases accuracy
○
Bias arises from the way we collect samples
Circle 1 (top left): what we want; all of our samples point to more or less the same point
Circle 2 (bottom left): very small sample sizes
Circle 3 (top right): all clustered on one side of the target
Circle 4 (bottom right): huge variability between points
Ex. not every neighbourhood is the same as another; consider the biases that may come from the questions asked and from the sampling, and use randomized sampling with the right sampling method ●
Large sample sizes lead to very little variability
●
No way we can remove variability
●
Bias comes from how we select the questions and how we structure them
CHAPTER 4: DISPLAYING AND DESCRIBING CATEGORICAL DATA
Displaying Data: ●
Data visualization is an important part of statistical or data analysis
●
It summarizes huge amounts of data into easy to follow, easy to digest graphs and plots ○
Billions of GB of data are generated every day
●
Well-designed data graphics are strong tools to convey meaning behind the data
●
Visualization plays an important role in telling the story of the data
●
The importance of good visualization can never be overemphasised
Examples of Poor Data Visualization
Example 1: too many categories (cluttered), no percentages or numbers provided, only one variable measured. Pie charts are always used for categorical variables. Make sure that the number of categories shown in a pie chart is not too large.
Example 2: never make graphs 3D, the percentages do not add up to 100%, the source is opinion so it is inaccurate, the blue is bigger than the red. Only make 2D graphs. One of the important elements in visualization is size; things that are 3D trick us into thinking elements that are closer to us are bigger than objects that are farther away. Use a Venn diagram or maybe a bar graph. A pie chart implies that you are showing the data in its entirety and that the categories are mutually exclusive.
Example 3: things that are closer to us look bigger than those farther away, angle interferes when printed
Example 4: cannot see part of the data, labels for the axes are not provided; even taking one of the bars, you still cannot pick out the exact height of that bar, which interferes with the interpretation of the data
Example 5: lines are connected even though the horizontal axis has categorical data, no labels, cannot use a line graph for these types of categorical variables, the data is not ordinal and you cannot seem to change the order of the categorical data. Should have used a bar graph, maybe even a double bar to compare the two days too.
Example 6: margin of error given as the source, which it is not; what is today has no data attached to it; percentages don't add up to 100%. Could use a line graph since you are observing something over time. Consider changing the question. Is showing qualitative instead of quantitative. There are gaps and holes in the data. Change questions to have mutually exclusive categories.
Charts ●
Bar Charts: displays the distribution of ONE categorical variable, showing the counts for each category next to each other for easy comparison, aka bar graph ○
Simple bar graphs look at only one categorical variable
●
Pie Charts: show the WHOLE GROUP as a circle (“pie”) sliced into pieces. The size of each piece is PROPORTIONAL to the fraction of the whole in each category.
Example above shows one categorical variable (vertical bar graph). Example Pie Chart implies that there are other things not included, showing all of the data
Frequency Tables ●
Organizes data by recording counts and category names
●
Can create pie charts, bar graphs and relative frequency out of it
●
Relative Frequency Table: displays the PROPORTIONS or PERCENTAGES that lie in each category rather than the counts
Frequency Distribution ●
Groups data into categories and records the number of (counts number of) observations in each category
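As a small illustration of a frequency table and the matching relative frequency table, here is a sketch using hypothetical ice cream preference responses (the data are made up; the category names come from the earlier example):

```python
from collections import Counter

# Hypothetical responses for one categorical variable.
responses = ["vanilla", "mint", "chocolate chip", "vanilla", "vanilla",
             "mint", "chocolate chip", "vanilla", "mint", "vanilla"]

# Frequency table: count of observations in each category.
frequency = Counter(responses)

# Relative frequency table: proportion of observations in each category.
total = sum(frequency.values())
relative_frequency = {cat: count / total for cat, count in frequency.items()}

for category, count in frequency.most_common():
    print(f"{category:15s} {count:3d} {relative_frequency[category]:6.1%}")
```

The counts feed a bar chart directly, and the proportions are what a pie chart slices up.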
Contingency Tables ●
Shows how the values of one variable are contingent on the value of another variable (2 variables in table) ○
Ex. data was collected on the use of social networks in different countries. To show how social network use is varied by countries, we can display the data in a contingency table
●
Marginal distribution: the marginal distribution of a variable in a contingency table is the total count that occurs without reference to the value of the other variable. These totals are on the margins of the table. We care only about one variable and ignore the value of the other variable.
●
Visualize the marginal distribution: you can visualize the last row and last column in separate bar graphs. Data inside the table is the result of 2 different categorical variables interacting so you can use segmented or multiple bar graphs. 5039 is not included in the marginal distribution because it is the total
●
Cell: each cell of a contingency table gives the count for a combination of values of both variables ○
Eg. country and social network use
○
Related to the idea of joint probabilities
●
Segmented Bar Chart: divides a bar proportionally into segments corresponding to the percentage in each group. ○
Ex. we could display the Super Bowl viewer data in a chart that treats each bar as the “whole” and divides it proportionally into segments corresponding to the percentage in each group
Conditional Distribution ●
We may want to restrict variables in a distribution to show the distribution for just those CASES THAT SATISFY A SPECIFIED CONDITION. ○
Ex. distribution of social networking use given that the country of focus is Egypt
○
Closely related to conditional probability
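Marginal and conditional distributions can be computed from a contingency table as sketched below. The counts are hypothetical (they are not the table from the lecture), and the variable names are chosen only for illustration.

```python
# Hypothetical contingency table: counts[country][uses_social_network]
counts = {
    "Egypt":  {"yes": 30, "no": 70},
    "Canada": {"yes": 60, "no": 40},
    "Brazil": {"yes": 50, "no": 50},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distributions: totals for one variable, ignoring the other.
marginal_country = {country: sum(row.values()) for country, row in counts.items()}
marginal_use = {}
for row in counts.values():
    for answer, n in row.items():
        marginal_use[answer] = marginal_use.get(answer, 0) + n

# Conditional distribution of social network use, given the country is Egypt.
egypt_total = marginal_country["Egypt"]
conditional_given_egypt = {a: n / egypt_total for a, n in counts["Egypt"].items()}

print(marginal_country, marginal_use, grand_total)
print(conditional_given_egypt)
```

The marginals correspond to the last row and last column of the table, while the conditional distribution only looks at the cases in the Egypt row.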
Simpson’s Paradox ●
Results from inappropriately combining percentages of different groups
●
The paradox appears when a certain trend appears in several groups of data, but disappears or reverses when these groups are combined
●
Get different or contradicting stories in data ○
Ex: two sales reps Peter and Katerina. Each sells printer paper and USB flash drives (which is more difficult to sell). Peter argues that he’s the better salesperson since he closed 83% of his last 120 prospects. Katerina closed only 78%. Is Peter really the better sales rep? What could be the error here?
○
Katerina was given fewer prospects for a certain product and closed a larger share of them than Peter did
○
Treatment for Kidney Stones (small vs. large stones): ■
Treatment A is more comprehensive and involves open surgical procedures
■
Treatment B is less comprehensive and involves small punctures
■
Out of the 350 patients who received each treatment (small and large stones combined), the number of successes is (AGGREGATE RESULTS BELOW) ■
Treatment A: 273, resulting in a 78% success rate (273/350 = 78%)
■
Treatment B: 289, resulting in an 83% success rate (289/350 = 83%)
■
Which treatment is suggested for a patient with kidney stone (unknown size)?
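To see the reversal, the aggregate results need to be split by stone size. The notes only give the combined totals; the subgroup counts below are the ones commonly reported for this kidney stone study and are consistent with those totals (81 + 192 = 273 and 234 + 55 = 289, each out of 350), so treat them as an outside assumption rather than part of the lecture.

```python
# (successes, patients) by treatment and stone size. The subgroup split is an
# assumption taken from the commonly cited study, not from these notes.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(treatment, size=None):
    """Success rate for one stone size, or for all patients combined if size is None."""
    groups = results[treatment]
    selected = [groups[size]] if size else list(groups.values())
    successes = sum(s for s, _ in selected)
    patients = sum(n for _, n in selected)
    return successes / patients

for treatment in results:
    print(
        f"Treatment {treatment}: "
        f"small {success_rate(treatment, 'small'):.0%}, "
        f"large {success_rate(treatment, 'large'):.0%}, "
        f"combined {success_rate(treatment):.0%}"
    )
```

With these counts, Treatment A has the higher success rate within each stone size, yet Treatment B looks better once the groups are combined, because A was used far more often on the harder large-stone cases.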
●
Possible reasons for the Simpson’s Paradox ○
Size of groups: when the effect of the difference in groups is ignored, the groups with a higher sample size have a greater influence on the combined results, proportionate to their size
○
Confounding variables: lurking variables that influence the results when two groups with significantly different behaviours are combined (ex. The size of the kidney stones in example above)
●
What does it mean for data analysis? ○
Analysis should be comprehensive and nuanced
○
Content knowledge is important - investigate further if data is showing results that are counterintuitive
○
Understand the limitations of data - if data is not detailed enough it may give misleading results
○
Data in aggregate vs. data in groups can give you different and sometimes contradictory results
●
Simpson’s Paradox can be avoided by ○
Reviewing frequency tables
○
Reviewing correlation among variables
○
Investigating any lurking (confounding) variables that may result in significant differences between groups ■
If you find that there is a confounding variable, review the data and try to control for it to see if different results appear
○
A comprehensive and deep level of content knowledge (domain knowledge...