1DA3 Data Detailed Lecture Notes PDF

Title 1DA3 Data Detailed Lecture Notes
Course Business Data Analytics
Institution McMaster University
Pages 235


Description

Hi everyone, just a heads up: the doc link will be disabled half an hour before the exam. However, the doc can be downloaded now for personal use. Please download it before the link closes. Thanks and Good Luck! Winter Semester 2021 Teacher: Behrouz Bakhtiari

BUSINESS STATISTICS

CHAPTERS 1 & 2: DATA AND VARIABLES

What Is Data?

Data values or OBSERVATIONS are INFORMATION collected regarding some subject



Data can be numbers, names, etc. and tell us the “Who” and the “What”



Data are useless without their CONTEXT



Data are often organized into a DATA TABLE

Example of Data Table:

Cases = Rows, Variables = Columns

Rows of the data table correspond to individual CASES about “Whom” (or about “Which” if they are not people) we record some characteristics



CHARACTERISTICS recorded about each individual or case are called VARIABLES



These are usually shown as the columns of a data table and identify “What” has been measured



Data tables are cumbersome for complex data sets (too messy and complicated), so often two or more separate data tables are linked together in a RELATIONAL DATABASE



Each data table included in the database is a RELATION because it is about a specific set of cases with information about each of these cases for all of the variables

Types of Variables

Categorical (aka Qualitative): names categories and indicates whether or not a case falls into a certain category

Ex. type of ice cream preferred; the ice cream names (chocolate chip, mint and vanilla) are the category names



Quantitative: measures numerical values, with or without units. Tells us about the QUANTITY of something

Ex. number of cars owned by families in a certain geographical location



Some quantitative variables have units (purchase amount = $) and some are unitless (click count = # of clicks)



DATE IS NOT QUANTITATIVE; IT IS QUALITATIVE



Is “customer number” categorical or quantitative? It is a unique identifier, so it is categorical



Some variables can be treated as either categorical or quantitative, depending on how they are used (e.g., age in years vs. age group)



Counting: the core of statistics; we usually count things to gain insight into the world



Bar graphs and pie charts are used for visualizing categorical variables

Ex. counting the cases in each category; the counts can be visualized with a bar graph or a pie chart



Ex. counting how many of something was observed





Identifiers: identify cases in databases/datasets. An identifier is unique

A categorical variable



Does not have units



Helps to combine different datasets and makes relational databases possible



Are not to be analyzed



Ex. customer number
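
To illustrate how an identifier links separate data tables, here is a minimal sketch in Python. The two tables, the column names (`customer_number`, `name`, `city`, `purchase_amount`) and all values are hypothetical and are not taken from the lecture; the point is only that the identifier is used for matching rows and is not analyzed itself.

```python
# Hypothetical data: two small tables that share a "customer_number" identifier.
customers = [
    {"customer_number": 101, "name": "Ana", "city": "Hamilton"},
    {"customer_number": 102, "name": "Ben", "city": "Toronto"},
]
purchases = [
    {"customer_number": 101, "purchase_amount": 25.00},
    {"customer_number": 102, "purchase_amount": 40.50},
    {"customer_number": 101, "purchase_amount": 12.75},
]

# Index the customer table by its identifier; this works because identifiers are unique.
customers_by_id = {row["customer_number"]: row for row in customers}

# Link each purchase (a case in one table) to its customer record in the other table.
linked = [
    {**customers_by_id[p["customer_number"]], "purchase_amount": p["purchase_amount"]}
    for p in purchases
]

for row in linked:
    print(row)
```

The identifier only tells us which rows belong together; quantities such as `purchase_amount` are what would actually be analyzed.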

Other Data Types:

Nominal: categorical variables used only to name categories (school attended)



Ordinal: a variable (categorical or quantitative) whose values can be ordered

Ex. satisfaction level (dissatisfied as 1, neither satisfied nor dissatisfied as 2 and satisfied as 3) and purchase amount



Time Series: data that are gathered at regular intervals over time



Ex. temperature of days in September

Cross-sectional data: data for several variables measured at the same point in time

Ex. determining sales revenue, number of customers and expenses of the last month of business

First example is cross-sectional; second example is time series and quantitative

Data Collection:

Primary Data: collected by the researcher/analyst

Used when there is a specific question, e.g., about a specific product of your company that no one has researched before



Secondary Data: collected by another party (e.g., Statistics Canada) and obtained by the researcher/analyst



“When” and “Where” data was collected is important



“How” data is collected can make the difference between insight and nonsense (data collected through internet polls vs. polling agencies)

CHAPTER 3: SURVEY AND SAMPLING

Sampling

Why do we take samples?

Provide insight into behaviours of a population



Population is big



Observing the whole population is impossible, costly or too time-consuming

Only a sample of the population is AVAILABLE to observe or is observable

Data analysis helps us draw insight about the population by observing and analyzing the sample



Sample data collected is either

Cross-sectional data



Time-series data



Structured Data: well-defined length and format



Unstructured Data: no pre-defined format (doctor’s notes, reports, video data, etc.) NOT IMPORTANT



Big Data

Features of Sampling:

Examine a part of the whole

Use SAMPLE SURVEYS: questions designed to give us answers on some characteristics of the sample



Sample may be BIASED: a biased sample over- or under-emphasizes certain characteristics of the population



Gives us a biased understanding of the characteristics of the population

Individuals (cases) for samples must be selected RANDOMLY



Randomize

Protects us by giving us a representative sample even for effects we were unaware of



Seems fair because nobody can guess the outcome before it happens and because usually some underlying set of outcomes will be equally likely



Sample Variability: sample-to-sample differences

Ex. average height of McMaster undergrad business students estimated by drawing samples from different sections of 1DA3



Sample size is what matters

Size of the sample determines what we can conclude from the data REGARDLESS OF THE SIZE OF THE POPULATION



How big of a sample do we need?

Depends on what we are estimating



A sample that is too small may not be representative of the population



Prefer a sample that represents the population well and is as small as possible
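
The claim that sample size matters regardless of population size can be checked with a small simulation. This is only a sketch under assumed conditions: the two populations of heights (mean 170 cm, standard deviation 10 cm), the population sizes and the helper name `spread_of_sample_means` are all hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable


def spread_of_sample_means(population, sample_size, repeats=300):
    """Draw many simple random samples and return the std. dev. of their means."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


# Hypothetical populations of heights (cm): same shape, very different sizes.
small_pop = [random.gauss(170, 10) for _ in range(5_000)]
large_pop = [random.gauss(170, 10) for _ in range(100_000)]

# Same sample size (100) drawn from populations of different sizes.
spread_small_pop = spread_of_sample_means(small_pop, 100)
spread_large_pop = spread_of_sample_means(large_pop, 100)

# Much smaller sample size (10) drawn from the large population.
spread_tiny_sample = spread_of_sample_means(large_pop, 10)

print(f"n=100, population 5,000:   spread of sample means = {spread_small_pop:.2f}")
print(f"n=100, population 100,000: spread of sample means = {spread_large_pop:.2f}")
print(f"n=10,  population 100,000: spread of sample means = {spread_tiny_sample:.2f}")
```

In this setup the sample-to-sample variability is driven by the sample size: the two runs with 100 cases give similar spreads even though one population is twenty times larger, while the run with 10 cases varies much more.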

Population and Parameters

Census: a sample that includes observations from the entire population



Census is usually not the best idea





Difficult, impractical or cumbersome to perform



Population characteristics may change



Cannot perform censuses often

Models use mathematics to represent reality (ex. mean)

Parameters: key numbers in models that represent reality



Population Parameter: parameter used in a model for a population



Need to estimate population parameters through the sample data



Sample Statistic: anything calculated from a sample. A sample statistic that accurately estimates the corresponding population parameter is said to be representative



Goal is to use sample statistics to estimate the population parameter

Simple Random Sample (SRS)

Sample drawn so that every possible sample of the size we plan to draw has an equal chance of being selected



Sampling Frame: a list of individuals (or cases, records) from which the sample will be drawn; it must be defined in order to select a random sample



Once we have the sampling frame, we can assign a sequential number to each individual in the sampling frame and draw random numbers to identify those to be sampled
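
The numbering-and-drawing procedure described above can be sketched in Python as follows. The sampling frame of 200 employees and the sample size of 30 are hypothetical; `random.sample` draws without replacement, so every possible sample of that size is equally likely.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: a list of 200 employees.
sampling_frame = [f"employee_{i}" for i in range(1, 201)]

# Assign a sequential number (1..N) to each individual in the frame.
numbered_frame = dict(enumerate(sampling_frame, start=1))

# Draw random numbers without replacement to identify who is sampled.
sample_size = 30
drawn_numbers = random.sample(range(1, len(sampling_frame) + 1), sample_size)
srs = [numbered_frame[n] for n in drawn_numbers]

print(sorted(drawn_numbers))
print(srs[:5])
```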

Other Random Sample Designs

What all statistical sampling designs have in common is that chance, rather than human choice, is used to select the sample



Stratified sampling: slice the population into homogeneous groups, called STRATA, and use SRS within each stratum to select members. Combine the results at the end



Reduced sample variability is the most important benefit of stratified sampling

Cluster Sampling: split the population into parts or CLUSTERS that each represent the population, then perform a census within one or a few clusters chosen at random. If each cluster fairly represents the population, cluster sampling will generate an unbiased sample



Systematic sampling: a systematic approach is used to select individuals. Start from a randomly selected individual and follow the approach to create the sample

Ex. pick every 10th individual from a list of employees to create a sample of 30 individuals



Multistage sampling: sampling schemes that combine several methods
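
The stratified, cluster and systematic designs above can be compared side by side with a small sketch. Everything here is assumed for illustration: a hypothetical list of 300 employees, `department` used as the stratum, `office` used as the cluster, and the three helper function names.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 300 employees, 3 departments, 10 offices.
population = [
    {
        "id": i,
        "department": ["Sales", "Finance", "IT"][i % 3],
        "office": f"office_{i % 10}",
    }
    for i in range(300)
]


def group_by(cases, key):
    """Split the cases into groups that share the same value of `key`."""
    groups = {}
    for case in cases:
        groups.setdefault(case[key], []).append(case)
    return groups


def stratified_sample(cases, key, per_stratum):
    """Use SRS within each stratum, then combine the results."""
    sample = []
    for members in group_by(cases, key).values():
        sample.extend(random.sample(members, per_stratum))
    return sample


def cluster_sample(cases, key, n_clusters):
    """Pick a few clusters at random and take a census within them."""
    clusters = group_by(cases, key)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]


def systematic_sample(cases, step):
    """Start from a random individual, then take every `step`-th one."""
    start = random.randrange(step)
    return cases[start::step]


stratified = stratified_sample(population, "department", per_stratum=10)
clustered = cluster_sample(population, "office", n_clusters=2)
systematic = systematic_sample(population, step=10)

print(len(stratified), len(clustered), len(systematic))
```

With 300 employees, taking every 10th individual yields a sample of 30, which matches the systematic sampling example in the notes. A multistage design would chain these helpers, for instance by choosing clusters first and then drawing an SRS inside each chosen cluster.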

Valid Survey

Survey that can yield the information you need about the population in which you are interested



To help ensure a valid survey, you need to ask 4 questions





What do I want to know?



Who are the right respondents?



What are the right questions?



What will be done with results?

Be clear about what you want to learn, phrase the questions and answers carefully, and use the right sampling frame



Nonresponse bias: when individuals don't respond to questions



Voluntary response bias: in volunteer surveys, individuals with the strongest feelings on either side of the issue are more likely to respond; those who don't care may not bother



Measurement errors: when a question does not take into account all possible answers



Ex. radio buttons that allow only one of the listed answers to be picked

Pilot Test: a small sample from the sampling frame is surveyed first, which protects against measurement errors



It is important not to confuse INACCURACY with BIAS. Both create errors but the roles are different

We want our samples to be ACCURATE which means they represent the population accurately



Samples of very small sizes cannot accurately represent the population



Increasing the sample size and randomizing more both increase accuracy



Bias arises from the way we collect samples

Circle 1 (top left): what we want; all of our samples point to more or less the same point
Circle 2 (bottom left): very small sample sizes
Circle 3 (top right): all clustered on one side of the target
Circle 4 (bottom right): huge variability between points

Ex. not every neighbourhood is the same as another; consider the biases that may come from the questions asked, and use randomized sampling and the right sampling method

Large sample sizes lead to very little variability



There is no way to remove variability entirely



Bias depends on how we select the questions and how we structure them

CHAPTER 4: DISPLAYING AND DESCRIBING CATEGORICAL DATA

Displaying Data:

Data visualization is an important part of statistical or data analysis



It summarizes huge amounts of data into easy-to-follow, easy-to-digest graphs and plots

Billions of GB of data are generated every day



Well-designed data graphics are strong tools to convey meaning behind the data



Visualization plays an important role in telling the story of the data



The importance of good visualization can never be overemphasised

Examples of Poor Data Visualization

Example 1: too many categories (cluttered), no percentages or numbers provided, only one variable measured. A pie chart is always used for categorical variables. Make sure the number of categories shown in a pie chart is not too large.

Example 2: never make graphs 3D; the percentages do not add up to 100%; the source is opinion, so it is inaccurate; the blue is bigger than the red. Only make 2D graphs. One of the important elements in visualization is size, and 3D graphics trick us into thinking that elements closer to us are bigger than those farther away. Use a Venn diagram or maybe a bar graph. A pie chart implies that you are showing the data in its entirety and that the categories are mutually exclusive.

Example 3: things that are closer to us look bigger than those farther away; the angle interferes when printed

Example 4: part of the data cannot be seen; labels for the axes are not provided; even taking a single bar, you cannot pick out its exact height, which interferes with the interpretation of the data

Example 5: the lines are connected even though the horizontal axis has categorical data; no labels. A line graph cannot be used for this type of categorical variable: the data is not ordinal, so the order of the categories is arbitrary. Should have used a bar graph, maybe even a double bar graph to compare the two days.

Example 6: the margin of error is given as the source, which it is not; what is today has no data attached to it; the percentages don't add up to 100%. Could use a line graph since something is being observed over time. Consider changing the question. It is showing qualitative instead of quantitative data. There are gaps and holes in the data. Change the questions to have mutually exclusive categories.

Charts

Bar Charts: display the distribution of ONE categorical variable, showing the counts for each category next to each other for easy comparison (aka bar graphs)



Simple bar graphs look at only one categorical variable

Pie Charts: show the WHOLE GROUP as a circle (“pie”) sliced into pieces. The size of each piece is PROPORTIONAL to the fraction of the whole in each category.

The example above shows one categorical variable (vertical bar graph). A pie chart implies that it shows all of the data, even if other things are not included

Frequency Tables

Organizes data by recording counts and category names



Pie charts, bar graphs and relative frequency tables can be created from it



Relative Frequency Table: displays the PROPORTIONS or PERCENTAGES that lie in each category rather than the counts
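
A frequency table and its relative frequency table can be built by counting, as in this minimal sketch. The ten survey responses are hypothetical; the category names reuse the ice cream example from earlier in the notes.

```python
from collections import Counter

# Hypothetical responses to "which type of ice cream do you prefer?"
responses = [
    "vanilla", "mint", "chocolate chip", "vanilla", "vanilla",
    "mint", "chocolate chip", "vanilla", "mint", "vanilla",
]

# Frequency table: category name -> count.
frequency = Counter(responses)

# Relative frequency table: category name -> proportion of all cases.
total = sum(frequency.values())
relative_frequency = {category: count / total for category, count in frequency.items()}

print(f"{'Category':<16}{'Count':>6}{'Percent':>9}")
for category, count in frequency.most_common():
    print(f"{category:<16}{count:>6}{relative_frequency[category]:>9.0%}")
```

The counts are what a bar chart would display, and the proportions are what the slices of a pie chart would represent.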

Frequency Distribution

Groups data into categories and records the number of observations (counts) in each category

Contingency Tables

Shows how the values of one variable are contingent on the value of another variable (2 variables in the table)

Ex. data were collected on the use of social networks in different countries. To show how social network use varies by country, we can display the data in a contingency table



Marginal distribution: the marginal distribution of a variable in a contingency table is the total count that occurs without reference to the value of the other variable. The totals are on the margins of the table. We care only about one variable and ignore the value of the other variable.



Visualize the marginal distribution: you can visualize the last row and last column in separate bar graphs. Data inside the table is the result of 2 different categorical variables interacting, so you can use segmented or multiple bar graphs. 5039 is not included in the marginal distribution because it is the total



Cell: each cell of a contingency table gives the count for a combination of values of both variables





Ex. country and social network use



Related to the idea of joint probabilities

Segmented Bar Chart: divides a bar proportionally into segments corresponding to the percentage in each group.

Ex. we could display the Super Bowl viewer data in a chart that treats each bar as the “whole” and divides it proportionally into segments corresponding to the percentage in each group
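
The marginal distributions of a contingency table are just its row and column totals. The sketch below uses made-up counts (the country names and all numbers are hypothetical, and they are not the values from the lecture's table) to show how the margins and the grand total are obtained from the cells.

```python
# Hypothetical contingency table: rows = country, columns = social network use.
table = {
    "Country A": {"Uses": 40, "Does not use": 60},
    "Country B": {"Uses": 70, "Does not use": 30},
    "Country C": {"Uses": 90, "Does not use": 110},
}

# Marginal distribution of country: row totals, ignoring social network use.
row_totals = {country: sum(cells.values()) for country, cells in table.items()}

# Marginal distribution of social network use: column totals, ignoring country.
column_totals = {}
for cells in table.values():
    for category, count in cells.items():
        column_totals[category] = column_totals.get(category, 0) + count

# The grand total sits in the corner of the table and belongs to neither margin.
grand_total = sum(row_totals.values())

print("Row totals:", row_totals)
print("Column totals:", column_totals)
print("Grand total:", grand_total)
```

Each margin could be drawn as its own bar graph, while the cells inside the table are what a segmented or side-by-side bar chart would show.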

Conditional Distribution

We may want to restrict variables in a distribution to show the distribution for just those CASES THAT SATISFY A SPECIFIED CONDITION.

Ex. the distribution of social networking use given that the country of focus is Egypt



Closely related to conditional probability
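
A conditional distribution can be computed by keeping only the row that satisfies the condition and converting its counts to proportions. This is a sketch with invented numbers: the counts, the second country and the helper name `conditional_distribution` are hypothetical, and only the Egypt condition comes from the example above.

```python
# Hypothetical counts: rows = country, columns = social network use.
table = {
    "Egypt": {"Uses": 30, "Does not use": 45, "No internet": 75},
    "Country B": {"Uses": 120, "Does not use": 60, "No internet": 20},
}


def conditional_distribution(table, condition):
    """Distribution of the column variable for cases in the given row only."""
    cells = table[condition]
    total = sum(cells.values())
    return {category: count / total for category, count in cells.items()}


given_egypt = conditional_distribution(table, "Egypt")

for category, proportion in given_egypt.items():
    print(f"P({category} | Egypt) = {proportion:.0%}")
```

Dividing by the row total rather than the grand total is what makes the distribution conditional, which is the same step used for a conditional probability.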

Simpson’s Paradox

Results from inappropriately combining percentages of different groups



The paradox appears when a certain trend appears in several groups of data, but disappears or reverses when these groups are combined



Get different or contradicting stories in data

Ex: two sales reps, Peter and Katerina. Each sells printer paper and USB flash drives (which are more difficult to sell). Peter argues that he’s the better salesperson since he closed 83% of his last 120 prospects. Katerina closed only 78%. Is Peter really the better sales rep? What could be the error here?



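The overall percentages ignore the product mix: Katerina was given fewer prospects for a certain product and closed most of them, so her overall rate is not directly comparable to Peter's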



Treatment for Kidney Stones (small vs. large stones):

Treatment A is more comprehensive and involves open surgical procedures



Treatment B is less comprehensive and involves small punctures



Out of the 350 patients in each treatment group (with small and large stones combined), the number of successes is (AGGREGATE RESULTS BELOW)

Treatment A: 273, resulting in a 78% success rate (273/350 = 78%)



Treatment B: 289, resulting in an 83% success rate (289/350 ≈ 83%)



Which treatment is suggested for a patient with kidney stone (unknown size)?
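
The reversal can be seen by splitting the aggregate results by stone size. The notes above give only the totals (273/350 and 289/350); the by-size counts in this sketch are an assumption, taken from the commonly cited figures of the kidney stone study that these totals match, and should be checked against the lecture's own table.

```python
# (successes, patients) per treatment and stone size.
# ASSUMPTION: the by-size split is not in the notes; it is the commonly cited
# breakdown whose totals equal the aggregate figures 273/350 and 289/350.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate within one group."""
    return successes / patients


def aggregate_rate(groups):
    """Success rate after combining all stone sizes for one treatment."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients


for treatment, groups in results.items():
    for size, (s, n) in groups.items():
        print(f"Treatment {treatment}, {size} stones: {s}/{n} = {rate(s, n):.0%}")
    print(f"Treatment {treatment}, all stones: {aggregate_rate(groups):.0%}")
```

Under these assumed counts, Treatment A has the higher success rate for small stones and for large stones, yet Treatment B looks better in aggregate, because A was used far more often on the harder large-stone cases.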



Possible reasons for Simpson’s Paradox

Size of groups: when the effect of the difference in groups is ignored, the groups with a higher sample size have a greater influence on the combined results, proportionate to their size



Confounding variables: lurking variables that influence the results when two groups with significantly different behaviours are combined (ex. The size of the kidney stones in example above)



What does it mean for data analysis?

Analysis should be comprehensive and nuanced



Content knowledge is important - investigate further if data is showing results that are counterintuitive



Understand the limitations of data - if data is not detailed enough it may give misleading results



Data in aggregate vs. data in groups can give you different and sometimes contradictory results



Simpson’s Paradox can be avoided by

Reviewing the frequency table



Reviewing the correlation among variables



Investigating any lurking (confounding) variables that may result in significant differences between groups

If you find that there is a confounding variable, review the data and try to control for it to see if different results appear



A comprehensive and deep level of content knowledge (domain knowledge...

