Lecture Notes, Lectures 1-16 - Min Zhou PDF

Title Lecture Notes, Lectures 1-16 - Min Zhou
Author Nicole Davis
Course Introduction to Statistical Analysis in Sociology
Institution University of Victoria
Pages 65



Description

Stats 271

Analyzing data and translating it into knowledge

3 major aspects:
o 1. Design: obtaining data
o 2. Description: summarizing and visualizing data
o 3. Inference: making predictions based on data

Basic Concepts:
1. Subject: the entity we measure in a study
2. Population: the total set of subjects of interest
3. Sample: the subset of the population for whom we have (or plan to have) data
4. Sampling: the process of selecting a sample
5. Random Sampling: sampling that makes a sample representative of the population
6. Data: the information gathered
   o Ex. quantitative terms from surveys
7. Variable: any characteristic observed for the subjects in a study
   o One variable: univariate analysis
   o Two variables: bivariate analysis
   o More than 2 variables: multivariate analysis
8. Descriptive Statistics: methods for summarizing and visualizing the data obtained from a sample
9. Inferential Statistics: methods for making predictions about a population based on data obtained from a sample of that population

How is Soci 271 organized?
• Descriptive Statistics: One Variable, Two Variables, More Variables
• Inferential Statistics

Introduction to Statistical Analysis-Part 1



Variable
o Any characteristic that is observed for the subjects in a study

Observation
o The data values we observe for a variable

Two Types of Data
o Categorical variables
  - Assign observations to separate and distinct classes or categories
  - Sex (male, female)
  - Religion (Protestant, Catholic, Jewish, other, none)
  - Class (1st year, sophomore, junior, senior)
  - Annual income (< $40K, $40K-$80K, > $80K)
  - A special categorical variable: the binary variable, which has only 2 categories (ex. gender: male or female)
o Quantitative variables
  - Assign meaningful numerical scores to observations
  - Age
  - GPA (range 0-9)
  - Number of courses completed
  - Annual income (in dollars)
o Income can be measured using categories or specific scores
o The values of the variable, not the substance, determine whether it is categorical or quantitative
o Scores for categorical variables are just labels, not meaningful numbers
o Distinguishing types of data is important because different methods apply when analyzing categorical and quantitative data (the third dimensions)

Levels of Measurement
o Nominal variables
  - Assign numbers only to signal that observations in different categories are not the same
  - The order of the numbers is not meaningful for such measures
  - The numbers cannot be ranked
  - The numbers cannot be manipulated mathematically
  - Ex. For religious preference, we might assign numbers as follows: 1. Protestant 2. Catholic 3. Jewish 4. Other 5. None
o Ordinal variables
  - Assign numbers to indicate that one observation has more (or less) of something than another. The numbers do not tell you how much more, though.
  - The order of the numbers is meaningful; they can be ranked
  - Differences or ratios of these numbers are not meaningful
  - Ex. For happiness, we might assign numbers as follows: 1. Not at all happy 2. Not very happy 3. Somewhat happy 4. Very happy
o Interval-ratio variables
  - Quantitative variables
  - Numbers are meaningful (real numbers)
  - All mathematical operations can be applied
  - Ex. $50,000 is $10,000 more than $40,000

Varieties of Quantitative Variables
o Discrete variable (count variable)
  - Has a basic unit of measurement that cannot be subdivided
  - Always measured in whole numbers
  - Ex. Number of students in a class
  - Ex. Number of pets per household
o Continuous variable
  - Takes numbers that can be subdivided infinitely (at least theoretically)
  - Can be measured in precise, detailed numerical forms (not only whole numbers)
  - Ex. Annual income (in dollars, may be fractional)
  - Ex. Weight

Note:
o Don't be excessively literal in thinking about whether a measure is continuous
  - Although the smallest unit of Canadian currency is a cent, income is continuous for practical purposes
  - If a variable has many scored values (though not an infinite number of values), it can be treated as continuous
o In practice, some continuous variables are measured as discrete
  - Although education is, in theory, a continuous variable, it is often measured in years





Summary

Types of Data           Levels of Measurement   Discrete/Continuous   Examples
Categorical Variable    Nominal                 Discrete              Sex, race
Categorical Variable    Ordinal                 Discrete              Social class, opinion scales
Quantitative Variable   Interval-Ratio          Discrete              Number of children
Quantitative Variable   Interval-Ratio          Continuous            Height, income



What is a good measurement?
o No matter what type of variable they are, good variables should be measured according to the following principles
• Mutually exclusive and exhaustive
  o Measurements must be mutually exclusive: if an observation is assigned one value, it can't be assigned another
  o Measurements must be exhaustive: all observations must be assigned some value
• Validity of measurement
  o Does the measurement correspond to the "true concept"?
  o Since we don't observe concepts directly, we use a valid variable to capture the concept
  o Sometimes this can be difficult: do grades or test scores measure "ability" or "achievement"?
• Reliability
  o How closely do two measurements of the same thing correspond?
  o If you repeat a measurement of the same thing, it should give you the same value
  o Does the bathroom scale give the same weight if you weigh yourself twice within a short period of time?
  o Do two or more tests on a subject rank students in the same order?

Descriptive Statistics for a Categorical Variable 

Use numbers: Frequency Table
o A frequency table lists all possible values of a variable and their frequencies and/or relative frequencies
o Components
  - Categories
  - Frequency
  - Relative frequency (proportion or percentage)
  - Total
  - Sometimes a frequency table also includes the cumulative frequency/percentage/proportion

Religious Preference   Frequency   Percentage
Protestant                  1025        64.18
Catholic                     351        21.98
Jewish                        33         2.07
None                         146         9.14
Other                         42         2.63
Total                       1597       100

• The 1st column tells what categories the variable has
• The "frequency" column tells the number of people in a category: there are 1,597 in all, 351 of whom are Catholic
• The "percentage" column gives relative frequencies as the fraction of the total found in a given category
• Sometimes "proportion" is reported instead of "percentage"
o For Catholic, the proportion is 351/1597=0.2198
• Mode: the category with the highest frequency

Number of Children   Frequency   Percentage   Cumulative Percentage
0                          107         9.44                    9.44
1                          313        27.63                   37.07
2-3                        504        44.48                   81.55
4-5                        177        15.62                   97.17
6 or more                   32         2.82                  100
Total                     1133       100                     100

Cumulative percentage
o The percentage of the current category plus the percentages of all the categories that came before it

Use Graphs
o Bar graph
  - Percentage or frequency on the y-axis
  - Categories on the x-axis, placed in any order
o Pareto chart (a special type of bar graph)
  - Orders the categories by frequency, from highest to lowest
o Pie chart
  - The size of each slice represents the percentage of each category

To recover a frequency when only percentages are given, multiply the total by the percentage (expressed as a proportion)
To compute a percentage when only frequencies are given, add up all the frequencies to get the total, divide the category's frequency by the total, and multiply by 100
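The frequency-table calculations described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frequency_table` and the dictionary layout are assumptions made for this example, and the counts are the number-of-children frequencies from the table above.

```python
def frequency_table(counts):
    """Build rows of (category, frequency, percentage, cumulative percentage).

    `counts` is a hypothetical dict mapping each category to its frequency,
    listed in the order the categories should appear in the table.
    """
    total = sum(counts.values())
    rows = []
    running = 0  # running count of observations seen so far
    for category, freq in counts.items():
        running += freq
        pct = 100 * freq / total         # relative frequency as a percentage
        cum_pct = 100 * running / total  # cumulative percentage
        rows.append((category, freq, round(pct, 2), round(cum_pct, 2)))
    return rows, total


# Frequencies from the number-of-children table
children = {"0": 107, "1": 313, "2-3": 504, "4-5": 177, "6 or more": 32}
rows, total = frequency_table(children)

# The mode is the category with the highest frequency
mode = max(children, key=children.get)

for category, freq, pct, cum_pct in rows:
    print(f"{category:<10} {freq:>5} {pct:>7.2f} {cum_pct:>7.2f}")
print(f"{'Total':<10} {total:>5}")
print("Mode:", mode)
```

Because this sketch accumulates counts rather than adding up already-rounded percentages, a cumulative percentage can differ in the last digit from a table built by summing rounded values.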

Introduction to Statistical Analysis-Part 2

We describe a quantitative variable visually (via graphs) and via summary statistics (numbers)
• Typically we want to summarize three features of a quantitative variable:
o Central tendency: what is a typical observation?
o Variability: how spread out are the observations?
o Shape of distribution: what is the overall pattern of the observations?

Measures of Central Tendency
• What is an "average" or "typical" value of a quantitative variable X?
• Mean
o The sum of the observations divided by the number of observations:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
• Median
o The midpoint of the observations when they are ordered from smallest to largest
o Middle value
o First, rank the observations from lowest to highest; then find the X value that divides all the observations into two equal halves

• Mode
o The most common value

Example
We survey 7 people on their years of education: 16, 1, 9, 16, 16, 18, 12
Mode = 16
Median = 16
Mean = 88/7 = 12.57

Measure   Definition          Type of Variable
Mean      Average value       Only for quantitative variables
Median    Middle value        For both quantitative and ordinal categorical variables
Mode      Most common value   For all types of variables, including nominal categorical variables

Comparison Between Mean and Median
• The advantage of the mean:



o Uses the numerical values of all the data, while the median only uses the order of the data (and the value in the middle)
o More informative
o Example: Comparison of education between two regions
  Region 1: 6, 9, 12, 12, 13
  Region 2: 12, 12, 12, 16, 19
  Region 1: median = 12, mean = 10.4
  Region 2: median = 12, mean = 14.2
• The advantage of the median:
o Resistant to extreme values (outliers), while the mean can be highly influenced by outliers
o Example: A survey of a group of people about their annual income (in thousand dollars)
  65, 60, 55, 65, 85, 75, 70, 75, 2000
  Without the 2000 observation: median = 67.5, mean = 68.75
  With the 2000 observation: median = 70, mean = 283.33

Measures of Variability/Spread/Dispersion
• Variability
o How spread out the observations of a variable are
• Measures of spread:
o Range
  - The difference between the largest and smallest observations
  - The larger the range, the more spread out the data
  - Limitation: the range only uses the largest and smallest observations, and neglects the information in all other observations
o Variance and Standard Deviation
  - Closely connected
  - Use all the data
  - Describe the dispersion of all observations around their mean value
  - Based on the differences, or deviations, of observations from the mean: $D_i = X_i - \bar{X}$
  - Variance: the average squared deviation from the mean

Formula:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
  - Squaring each difference makes them all positive numbers (to avoid negative numbers cancelling out positive numbers)
  - Calculate the "average" squared deviation here by dividing by n-1 (rather than n) for technical reasons (degrees of freedom)

Standard Deviation:
  - The square root of the variance
  - Shares the same unit of measurement as the original data, instead of the "squared units" that apply when using the variance
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \quad \text{or} \quad s = \sqrt{s^2}$$

Example
We survey 6 people on their years of education: 9, 12, 12, 14, 16, 21
Mean = 14
Deviations: 9-14=-5, 12-14=-2, 12-14=-2, 14-14=0, 16-14=2, 21-14=7
Squared deviations: (-5)^2=25, similarly 4, 4, 0, 4, 49
Sum of squared deviations: 25 + 4 + 4 + 0 + 4 + 49 = 86
Variance: 86/(6-1) = 17.2
Standard deviation = square root of 17.2 = 4.15

How to Construct Histograms
• For discrete variables
o Ex. courses you have completed in college
• For continuous variables
o Ex. Annual income
o Create intervals first
• For discrete variables with too many possible values
o Ex. School size (number of students)
o Usually create intervals first

Distribution
• The values the variable takes and the frequency of occurrence of each value
• The histogram and the smooth curve are especially useful in displaying the shape of the distribution

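Returning to the two worked examples in this part (the income survey with an outlier, and the variance of the six education values), the arithmetic can be checked with a short Python sketch. The helper name `sample_variance` is an assumption made for this illustration; `statistics.mean` and `statistics.median` come from Python's standard library.

```python
import math
import statistics


def sample_variance(values):
    """Average squared deviation from the mean, dividing by n - 1."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)


# Years of education for the 6 people in the variance example
education = [9, 12, 12, 14, 16, 21]
variance = sample_variance(education)
std_dev = math.sqrt(variance)
print("mean:", statistics.mean(education))
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))

# Annual incomes (in thousand dollars), without and with the outlier
incomes = [65, 60, 55, 65, 85, 75, 70, 75]
incomes_with_outlier = incomes + [2000]
for label, data in [("without outlier", incomes),
                    ("with outlier", incomes_with_outlier)]:
    print(label,
          "median:", statistics.median(data),
          "mean:", round(statistics.mean(data), 2))
```

The median moves only slightly when the extreme income is added, while the mean changes by a large amount, which is the resistance property described in the comparison between mean and median.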
Introduction to Statistical Analysis in Sociology

Histogram and Smooth Curve
• A survey of high schools about their size

The Shape of the Distribution
• The mode of the distribution (highest point)
o Unimodal, bimodal, etc.
• The tail of the distribution
o Right tail, left tail
• Symmetry/skewness
o Symmetric, skewed to the left, skewed to the right

The Shape of the Distribution: Symmetry
• A distribution having the mirror-image property around its middle is symmetric

The Shape of the Distribution: Skewness
• Skewed to the right
• Skewed to the left

The Shape of the Distribution: Measures of Central Tendency Revisited
• Graph 1: Symmetric distribution, the mean equals the median.
• Graph 2: Left-skewed distribution, the mean is smaller than the median.
• Graph 3: Right-skewed distribution, the mean is larger than the median.
• If the mean is greater than the median, the distribution is "right"- or "positively"-skewed. This means that the extreme observations have high values.
• If the mean is less than the median, the distribution is "left"- or "negatively"-skewed, in which case the extreme observations have low values.

The Shape of the Distribution: Measures of Spread Revisited
• Identify spread from a distribution curve

The Shape of the Distribution
• Curves are frequently used to show the shape of distributions (familiarize yourself with distributions shown in curves)
• Some curves with unique shapes are particularly important in statistics.

Bell-Shaped Distribution
• What is a bell-shaped distribution like?
o Unimodal, symmetric, and shaped like a bell

Bell-Shaped Distribution
• Why pay attention to bell-shaped distributions?
o They have very good properties (the Empirical Rule):
  (1) About 68% of the observations fall within 1 standard deviation of the mean
  (2) About 95% of the observations fall within 2 standard deviations of the mean
  (3) Nearly all observations fall within 3 standard deviations of the mean
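As a rough illustration of the Empirical Rule stated above, the following Python sketch turns a mean and a standard deviation into the three intervals. The function name `empirical_rule_intervals` and the values mean = 100, standard deviation = 15 are hypothetical, chosen only for this example, and the rule itself applies only to bell-shaped distributions.

```python
def empirical_rule_intervals(mean, sd):
    """Return the 1-, 2- and 3-standard-deviation intervals around the mean.

    For a bell-shaped distribution these hold about 68%, about 95%,
    and nearly all of the observations, respectively.
    """
    return {
        "about 68%": (mean - sd, mean + sd),
        "about 95%": (mean - 2 * sd, mean + 2 * sd),
        "nearly all": (mean - 3 * sd, mean + 3 * sd),
    }


# Hypothetical values, not taken from the notes
intervals = empirical_rule_intervals(mean=100, sd=15)
for share, (low, high) in intervals.items():
    print(f"{share} of observations fall within ({low}, {high})")
```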

Example: Empirical Rule
• Scores on an exam have a bell-shaped distribution with mean=75 and standard deviation=6. Using the Empirical Rule, describe the distribution in detail.
• About 68% of the observations are between 75-6=69 and 75+6=81, that is, within the interval (69, 81).
• About 95% of the observations are between 75-2x6=63 and 75+2x6=87, that is, within the interval (63, 87).
• Nearly all (more than 99%) of the observations are between 75-3x6=57 and 75+3x6=93, that is, within the interval (57, 93).

Measures of Position with a Distribution
• Describe the position of an observation in the distribution.
• Two measures:
o (1) Percentile
o (2) Z-score

Measures of Position with a Distribution: (1) Percentile
• The pth percentile is a value such that p percent of the observations fall below or at that value.
o e.g., If you are at the 95th percentile on the SATs, then 95% of those who took the test scored lower than you.
o e.g., On a survey of annual income, if you are at the 35th percentile, then 35% of all people surveyed make less money than you, and 65% make more money than you.
• Difference between percentile and percentage: percentiles are actually scores, while percentages are relative frequencies.
• Percentiles depend on the order/rank of observations.
• Some percentiles have special names:
o The 50th percentile is the median (middle value, or Q2)
o The 25th percentile is the "first quartile" (Q1)
o The 75th percentile is the "third quartile" (Q3)

• The two quartiles and the median divide a distribution into 4 equal (25% each) segments.

Other Use of Percentile (1)
• A new measure of spread based on percentiles: the interquartile range
• The interquartile range (IQR) is the difference between the third and first quartiles.
• IQR = Q3 - Q1
• It is the range of the middle half of the data.
• It is not as sensitive to extreme/outlying values as the range.

Other Use of Percentile (2)
• A good way to summarize the distribution in numbers: the five-number summary
o Five numbers:
  (1) Minimum
  (2) First quartile (Q1)
  (3) Median (Q2)
  (4) Third quartile (Q3)
  (5) Maximum

Other Use of Percentile (3)
• A method to identify outliers:
o The 1.5*IQR criterion for identifying outliers
o An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.

Example
On a survey on test scores: Minimum=32, Q1=78, median=85, Q3=90, maximum=98
(1) Range = 98-32 = 66
(2) IQR = 90-78 = 12
(3) Are there potential outliers? 1.5*IQR = 1.5*12 = 18; 78-18 = 60 and 90+18 = 108. Any scores below 60 or above 108 are potential outliers. There is a score of 32, so there is at least one outlier at the lower end.
(4) In which direction is the distribution skewed? Skewed to the left.

Summary: Use of Percentile/Quartile
• Describe the position of an observation
• Describe the spread of the distribution
• Summarize the distribution in 5 numbers
• Identify outliers

Measures of Position with a Distribution: (2) Z-score
• The z-score tells how many standard deviations a score lies above (if z is positive) or below (if z is negative) the mean:
• Z = (observation - mean) / standard deviation
• $Z = (x - \bar{x})/s$
• Z-scores are also called standard scores, because they are unit-free.
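The z-score formula above, and its inverse (recovering a raw value from a z-score), can be sketched in Python as follows. The function names `z_score` and `value_from_z` are assumptions for this illustration; the mean of 75 and standard deviation of 6 are reused from the exam-score example in the Empirical Rule section of these notes.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd


def value_from_z(z, mean, sd):
    """Invert the z-score formula to recover the raw value."""
    return mean + z * sd


# Exam scores with mean 75 and standard deviation 6
print(z_score(87, mean=75, sd=6))        # positive: above the mean
print(z_score(69, mean=75, sd=6))        # negative: below the mean
print(value_from_z(-1, mean=75, sd=6))   # raw score one s.d. below the mean
```

Because the result is expressed in standard deviations rather than exam points, the same function works unchanged for hours worked, income, or any other quantitative variable.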
Example
In the hours worked (weekly) distribution, the mean is 41.9 hours and the standard deviation is 13.4 hours.
• (1) 40 hours is the median (50th percentile). What is its z-score? Z = (40-41.9)/13.4 = -0.14 (0.14 standard deviations below the mean).
• (2) What is the z-score for 85 hours? Z = (85-41.9)/13.4 = 3.22 (3.22 standard deviations above the mean, very unusual).
• (3) What is the actual value for a z-score of -1? X = 41.9+(-1)*13.4 = 28.5 (28.5 hours worked has a z-score of -1).

Summary: Use of Z-Score
• Measure the position of an observation.
• Identify outliers: large z-scores (either positive or negative) are unusual.

Further Thinking

• These two measures of position, percentiles and z-scores, have a systematic relationship for some specific distributions, such as the bell-shaped distribution.
• How are percentiles and z-scores related for a bell-shaped distribution? (Think about the Empirical Rule.)

Box Plot
• A graph that displays the "5-number summary" of a distribution (Minimum, First quartile, Median, Third quartile, Maximum), as well as outliers if any.
• Components:
o (1) the box with a line inside
o (2) two whiskers (the lines extending from the box)
o (3) some outlying dots (if any)
• The box spans the IQR (Q1 is the lower hinge, Q3 is the upper hinge).
• The line inside the box is the median.
• "Whiskers" extend up to 1.5*IQR beyond the box, but no further than the maximum or minimum observation.
• Observations more than 1.5*IQR away from a hinge are flagged as "outliers", represented by single symbols (e.g. dots).

Example: Box Plot
• A box plot on IQ tests:



• Five-number summary:
o Minimum=68
o Q1=91
o Median=99
o Q3=107
o Maximum=141
• Other information:
o Two outliers: 136, 141
o Range: 141-68=73; IQR: 107-91=16
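The 1.5*IQR criterion can be applied to the IQ box-plot example above with a small Python sketch. The function names `iqr_fences` and `flag_outliers` are assumptions for this illustration, and the quartiles are taken as given rather than computed from raw data, since the raw IQ scores are not listed in the notes.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper cut-offs of the 1.5*IQR criterion."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3):
    """Return the values lying beyond either fence."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Quartiles from the IQ box plot: Q1 = 91, Q3 = 107
low, high = iqr_fences(91, 107)
print("fences:", low, high)

# Only the observations named in the example: minimum, median, and the two flagged scores
print("outliers:", flag_outliers([68, 99, 136, 141], q1=91, q3=107))
```

The minimum of 68 lies just inside the lower fence, so the lower whisker reaches the minimum, while 136 and 141 fall beyond the upper fence and are drawn as separate dots.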

Introduction to Statistical Analysis in Sociology: Part 4

Bell-Shaped Distribution
• The "Empirical Rule" states that for bell-shaped distributions:
o About 68% of observations lie within 1 standard deviation of the mean
o About 95% lie within 2 standard deviations of the mean
o Nearly all (more than 99%) lie within 3 standard deviations of the mean
• The first statement implies that:
o one standard deviation beneath the mean is about the 16th percentile
o one standard deviation above the mean is about the 84th percentile
• Graphically, the first assertion of the Empirical Rule looks like this:
o Z-score -1 (1 s.d. beneath the mean) is at about the 16th percentile
o Z-score +1 (1 s.d. above the mean) is at about the 84th percentile
o So Pr[-1 < z-score ≤ +1] = 84-16 = 68%
• Similarly, the second assertion of the Empirical Rule implies that:
o The 2.5th percentile of a bell-shaped distribution is about 2 standard deviations beneath the mean
o The 97.5th percentile of a bell-shaped distribution is about 2 standard deviations above the mean
o So Pr[-2 < z-score ≤ +2] =...
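One way to explore the link between z-scores and percentiles is to assume that the bell-shaped distribution is exactly a normal distribution, which is a stronger assumption than the notes make. Under that assumption, the sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the helper names `percentile_of_z` and `share_within` are hypothetical.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1, so values are z-scores
standard_normal = NormalDist(mu=0, sigma=1)


def percentile_of_z(z):
    """Percentage of observations at or below the given z-score."""
    return 100 * standard_normal.cdf(z)


def share_within(k):
    """Percentage of observations within k standard deviations of the mean."""
    return percentile_of_z(k) - percentile_of_z(-k)


for z in (-2, -1, 1, 2):
    print(f"z = {z:+d}: about the {percentile_of_z(z):.1f}th percentile")

for k in (1, 2, 3):
    print(f"within {k} s.d.: about {share_within(k):.1f}%")
```

The exact normal figures differ a little from the rounded percentiles quoted in the Empirical Rule, which is a convenient approximation rather than an exact statement.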

