Ch 1 Exploring Data PDF

Title Ch 1 Exploring Data
Course Statistics
Institution University of San Francisco
Pages 4
File Size 224.1 KB
File Type PDF

Summary

Lecture notes on Ch 1...


Description

Topic: Ch 1: Exploring Data (pgs. 2-81) Pd. 1

Questions:

Notes:

Introduction: Data Analysis, Making Sense of Data

- Statistics: the science of data
- Individuals: the objects described by a set of data
- Variable: any characteristic of an individual
  1. Quantitative (numerical)
  2. Categorical (represented through pie/bar graphs)
- Distributions:
  1. Marginal (row/column totals)
  2. Conditional (specific: within one row/column)
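A minimal sketch of the marginal and conditional distributions listed above. The two-way table, its counts, and the function names are made up for illustration and are not from the notes.

```python
# Hypothetical two-way table of counts (made-up numbers).
# Rows: gender, columns: opinion.
table = {
    "Female": {"Agree": 30, "Disagree": 20},
    "Male": {"Agree": 10, "Disagree": 40},
}


def grand_total(table):
    """Total count over the whole table."""
    return sum(sum(row.values()) for row in table.values())


def marginal_rows(table):
    """Marginal distribution of the row variable: row totals / grand total."""
    total = grand_total(table)
    return {name: sum(row.values()) / total for name, row in table.items()}


def marginal_columns(table):
    """Marginal distribution of the column variable: column totals / grand total."""
    total = grand_total(table)
    column_totals = {}
    for row in table.values():
        for column, count in row.items():
            column_totals[column] = column_totals.get(column, 0) + count
    return {column: t / total for column, t in column_totals.items()}


def conditional_within_row(table, row_name):
    """Conditional distribution of the column variable within one row."""
    row = table[row_name]
    row_total = sum(row.values())
    return {column: count / row_total for column, count in row.items()}


print(marginal_rows(table))
print(marginal_columns(table))
print(conditional_within_row(table, "Female"))
print(conditional_within_row(table, "Male"))
```

In this made-up table the two conditional distributions differ (Agree is more common among females than among males), which the marginal distributions alone would not reveal.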

Can you turn a quantitative variable into a categorical one?

1.1 Analyzing Categorical Data

- alternative to a bar graph: a segmented bar graph

- When pictures are used in place of bars, the graph is not to scale and is inaccurate
- Marginal distributions tell us nothing about the relationship between the variables
- best to use a (segmented) bar graph to represent a conditional distribution

How to organize a statistical problem?

1.2 Displaying Quantitative Data with Graphs

How can we picture the distribution of a quantitative variable?

- Use a pie chart or bar graph to display the distribution of a categorical variable more vividly
- Use a pie chart only when you want to emphasize each category’s relation to the whole
- Bar graphs can compare any set of quantities that are measured in the same units
- Don’t replace the bars in a bar graph with pictures

Two-Way Tables & Marginal Distributions
- a two-way table shows the relationship between 2 categorical variables
  1. Ex: Gender & Opinion
- Marginal Distribution: the distribution of values of one variable among all individuals described by the table
  **add up the values across the whole table (row/column totals) for the marginal distribution**
- Conditional Distribution: describes the values of that variable among individuals who have a specific value of another variable
  - looks at one specific row or column
- 4 steps to organize:
  1. State: What’s the question you’re trying to answer?
  2. Plan: How will you go about answering the question? What statistical techniques does this problem call for?
  3. Do: Make graphs and carry out needed calculations
  4. Conclude: Give your practical conclusion in the setting of the real-world problem
- Association: two variables are associated if specific values of one variable tend to occur in common w/ specific values of the other
- Simpson’s Paradox: an association between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
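A small sketch of Simpson’s Paradox as described above. The counts are made up so that the reversal shows up. The option names "A" and "B", the group labels, and the function names are assumptions for illustration only.

```python
# Made-up (successes, trials) counts for two options within two groups.
# The group is the "third variable".
data = {
    "A": {"group 1": (9, 10), "group 2": (30, 100)},
    "B": {"group 1": (80, 100), "group 2": (2, 10)},
}


def group_rates(data):
    """Success rate of each option within each group."""
    return {
        option: {group: s / n for group, (s, n) in groups.items()}
        for option, groups in data.items()
    }


def combined_rates(data):
    """Success rate of each option after combining all groups."""
    rates = {}
    for option, groups in data.items():
        successes = sum(s for s, _ in groups.values())
        trials = sum(n for _, n in groups.values())
        rates[option] = successes / trials
    return rates


print(group_rates(data))     # A has the higher rate inside each group
print(combined_rates(data))  # B has the higher rate once groups are combined
```

The reversal happens because A’s trials are mostly in the group with low success rates, while B’s trials are mostly in the group with high success rates.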

Dotplots
- each data value is shown as a dot above its location on a number line
- describe the pattern of the distribution: SOCS (Shape, Outliers, Center, Spread)
- look for rough symmetry or clear skewness
- use "-ly" words when describing shape: roughly, slightly, greatly
- skewed right: the long tail is on the right, so most of the data sits on the left
- skewed left: the long tail is on the left, so most of the data sits on the right
- Unimodal (one peak), bimodal (two peaks), multimodal (many peaks)

Stemplots
- gives a quick picture of the shape of a distribution while including the actual numerical values in the graph
- “Stem” and “Leaf”
- What if you want to compare the # of pairs of shoes the males and females have? Use a back-to-back stemplot
- do not work well w/ large data sets
- if you split stems, make sure each stem is assigned an equal # of possible leaf digits

Histogram
- the most common graph of the distribution of one quantitative variable is the histogram
- don’t confuse histograms and bar graphs
- draw histograms w/ no space between the bars to show the equal-width classes
- use a dotplot, stemplot, or histogram to show the distribution of a quantitative variable
- when examining any graph, look for the overall pattern (SOCS)
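A rough sketch of how a basic stemplot could be built, assuming non-negative whole-number data where the stem is everything but the last digit and the leaf is the last digit. The function name and the shoe counts are made up for illustration.

```python
from collections import defaultdict


def stemplot(values):
    """Return the lines of a simple stemplot for a non-empty list of
    non-negative integers (stem = tens and above, leaf = ones digit)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)

    lines = []
    # Include every stem between the smallest and largest,
    # even stems with no leaves, so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(digit) for digit in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {leaf_text}".rstrip())
    return lines


# Made-up numbers of pairs of shoes.
pairs_of_shoes = [13, 7, 22, 10, 35, 12, 50, 13, 38]
for line in stemplot(pairs_of_shoes):
    print(line)
```

Keeping the empty stem (4 in this made-up data) matters: dropping it would hide the gap between the bulk of the data and the value 50.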

1.3 Describing Quantitative Data with Numbers

How do you compare the median and mean?

- the most common measure of center is the ordinary arithmetic average, or mean
- The mean x̄: add the values, then divide by the number of values
- the mean tells us how large each observation in the data set would be if the total were split equally among all the observations
- The median M: the midpoint of a distribution
- mean & median measure center in different ways
- the mean and median of a roughly symmetric distribution are close together
- in a skewed distribution, the mean is usually farther out in the long tail than is the median
- Measuring spread: The Interquartile Range (IQR)
- the simplest measure of variability is the range: highest value - lowest value
- IQR: Quartile 3 - Quartile 1
- The 1.5 x IQR rule for outliers: an observation is an outlier if
  1. # < Q1 - (1.5 x IQR), or
  2. # > Q3 + (1.5 x IQR)
- when you find outliers, try to find an explanation for them
- The 5-number summary: consists of the smallest observation, the 1st quartile, median, 3rd quartile, largest observation
  - Minimum, Q1, M, Q3, Maximum
- to get a quick summary of both center and spread, combine all 5 numbers
- Standard Deviation sx: measures the typical distance of the observations from their mean
- find the avg of the squared distances & then take the sq root
- ex: 4, 5, 5, 4, 4, 2, 2, 6 // Mean: 32/8 = 4

  x    x - mean    (x - mean)^2
  4     0           0
  5     1           1
  5     1           1
  4     0           0
  4     0           0
  2    -2           4
  2    -2           4
  6     2           4
  ------------------------------
  sum of squared distances = 14

  - 8 = # of values
  - dividing by n = 8: σ = sq rt of 14/8 ≈ 1.32
  - the sample standard deviation sx divides by n - 1 = 7 instead: sx = sq rt of 14/7 ≈ 1.41
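The calculations above can be sketched as follows. This assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the count is odd); software often uses other conventions. The function names are hypothetical.

```python
import math
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1/Q3 are the medians of the lower/upper halves,
    excluding the overall median when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


def outliers(values):
    """Values flagged by the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


def standard_deviations(values):
    """Return (divide-by-n result, divide-by-(n - 1) result)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_of_squares / n), math.sqrt(sum_of_squares / (n - 1))


data = [4, 5, 5, 4, 4, 2, 2, 6]  # the example from the notes
print(five_number_summary(data))
print(outliers(data))
print(standard_deviations(data))
```

For the example data both versions of the standard deviation start from the same sum of squared distances (14); they differ only in the divisor.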

How to choose measure of center & spread:

- the median & IQR are usually better than the mean & SD for describing a skewed distribution or a distribution w/ strong outliers
- measures of center: mean, median
- measures of spread: IQR, SD
- Resistant (little or no change from extreme values): Median, IQR
- Non-resistant (change): Mean, SD...
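A quick illustration of resistance for the measures of center, assuming a made-up change to the example data in which the largest value is swapped for an extreme one. The variable and function names are hypothetical.

```python
import statistics

data = [2, 2, 4, 4, 4, 5, 5, 6]
# Made-up change: swap the largest value (6) for an extreme value (60).
with_outlier = data[:-1] + [60]


def center_summary(values):
    """Mean and median of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


print(center_summary(data))
print(center_summary(with_outlier))
```

The median stays put while the mean is pulled toward the extreme value. The same kind of comparison could be made between the IQR and the SD.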

