Title | Math 1041 lecture notes |
---|---|
Author | Michael Shi |
Course | Statistics for Life and Social Science |
Institution | University of New South Wales |
FACULTY OF SCIENCE SCHOOL OF MATHEMATICS AND STATISTICS
MATH1041 STATISTICS FOR LIFE AND SOCIAL SCIENCES Course Pack Part B Lecture Notes
Semester 1, 2016
What is statistics?
statistic – summary of data (which are measures of events)
Descriptive statistics
field of statistics – the collecting, analysing and understanding of data measured with uncertainty
Who needs to know about statistical methods? Anyone who collects, analyses or wants to understand data. (And a lot of people do!)
Course aims
This course provides an introduction to statistics: the study of collecting, analysing, and interpreting data. Statistics plays a fundamental role in quantitative research (research involving data). Some examples of fields in which quantitative research plays a major role are psychology, biology, physics, economics, ...
Some example problems
Does smoking when you’re pregnant affect your child’s development?
A study was conducted (Johns et al 1993) using guinea pigs to address this question. Ten pregnant guinea pigs were injected with nicotine tartrate, and ten were not. Offspring were then given an “intelligence test”, a maze through which they had to pass to find food.
Results (number of errors the offspring made in the maze):
Group | Sample size | Mean | Standard deviation |
---|---|---|---|
Control | 10 | 23.4 | 12.3 |
Treatment | 10 | 44.3 | 21.5 |
Is there evidence that guinea pigs in the treatment group (those whose mums were “smokers”) were slower learners, on average?
Butter side up or butter side down?
The Mythbusters were testing the myth that toast lands butter side down more often than butter side up. To do this, they pushed 24 pieces of toast off a table, and found that 14 landed butter side down. Is this good evidence that toast lands butter side down more often than butter side up?
Skills to be developed
In this course, you will learn how to approach designing studies and analysing data to answer research questions like the above. In particular, at the end of this course, you will be able to:
1. Recognise which analysis procedure is appropriate for a given research problem involving one or two variables
2. Understand principles of study design
3. Apply probability theory to practical problems
4. Apply statistical procedures on a computer using Microsoft Excel or R
5. Interpret computer output for a statistical procedure
6. Calculate confidence intervals and conduct hypothesis tests by hand for small datasets
7. Understand the usefulness of Statistics in your professional area
Lecture 2: Graphs
During this lecture, we will meet common graphs used for visualising data. Graphing data is a key step in analyses – it is important to use the appropriate graph(s) for your situation!
• Introduction – the role of graphs
• Quantitative or categorical?
• Recommended graphical tools
• Class survey data
Introduction – the role of graphs
Data → Information
Data are just a bunch of numbers. A major goal of Statistics is to make them informative.
Tools for Making Data Informative
• Graphical tools (today’s class).
• Summary measures (next class).
Which type of graph to use depends on:
• Whether you are summarising one variable or looking at the relationship between two variables
• Whether the variables are quantitative or categorical (qualitative)
Recommended graphical tools
• If you want to summarise one variable
  • and it is quantitative: a histogram or boxplot
  • and it is categorical: a bar graph (or “bar chart”)
• If you want to explore the relationship between two variables
  • and both are quantitative: a scatterplot
  • and both are categorical: a clustered bar chart or a jittered scatterplot
  • and one is categorical, the other quantitative: comparative boxplots or comparative histograms
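The recommended graphical tools can be read as a small decision rule. The sketch below encodes that rule in Python; the function name `recommend_graph` and the string labels are illustrative assumptions, not part of the course materials.

```python
def recommend_graph(*variable_types):
    """Return the graph recommended in the notes for one or two variables.

    Each argument must be "quantitative" or "categorical".
    (Hypothetical helper written for illustration only.)
    """
    kinds = sorted(variable_types)
    for kind in kinds:
        if kind not in ("quantitative", "categorical"):
            raise ValueError(f"unknown variable type: {kind!r}")

    # One variable
    if kinds == ["quantitative"]:
        return "histogram or boxplot"
    if kinds == ["categorical"]:
        return "bar graph"

    # Two variables (sorted, so "categorical" comes before "quantitative")
    if kinds == ["quantitative", "quantitative"]:
        return "scatterplot"
    if kinds == ["categorical", "categorical"]:
        return "clustered bar chart or jittered scatterplot"
    if kinds == ["categorical", "quantitative"]:
        return "comparative boxplots or comparative histograms"

    raise ValueError("expected one or two variables")


# Example: satisfaction (quantitative) compared across gender (categorical)
print(recommend_graph("quantitative", "categorical"))
```

The argument order does not matter, because the types are sorted before they are matched against the rule.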
Quantitative or categorical?
A categorical variable places an individual into one of several categories.
A quantitative variable takes numerical values, measured on a scale.
Which of the following variables are quantitative, and which are categorical?
• gender
• satisfaction with UNSW (from 0 to 10)
• time travelling to UNSW
• method of travelling to UNSW
What sort of graph would you use to summarise:
• gender of MATH1041 students
• satisfaction with UNSW (from 0 to 10)
• time travelling to UNSW
• method of travelling to UNSW
What to look for in a graph
When commenting on a graph of a quantitative variable, consider:
• the location (where most of the data are) and spread (or variability) of the data
• the shape of the data (symmetric, left-skewed or right-skewed?)
• if there are any unusual observations
[Figure: Change in location. Histograms of Frequency against Variable]
[Figure: Spread. Histograms of Frequency against Variable, labelled “Smaller Spread” and “Larger Spread”]
[Figure: Typical shapes: Symmetric. Histogram of Frequency against Variable]
[Figure: Typical shapes: Skewed to the left. Histogram of Frequency against Variable]
[Figure: Typical shapes: Skewed to the right. Histogram of Frequency against Variable]
The following histogram depicts the scoring average of players from the National Basketball Association (NBA) up to the 2008 season.
[Figure: Histogram of points per game. Horizontal axis: Points per game (2 to 30); vertical axis: Frequency (0 to 120)]
Comment on the location, spread and shape of the histogram.
Identify the variable(s) involved in the following questions, whether they are quantitative or categorical, and what sort of graph you would use to answer the questions:
• Do male and female MATH1041 students have different levels of satisfaction with UNSW?
• How much do students pay for a haircut?
• Does the amount you pay for a haircut depend on your gender?
• Is there a relationship between how much the class spends on their hair and how much they would charge for their labour?
• Is there a relationship between how funny the class found the boomerang joke and the frostbite joke?
• Of the 6 ‘jokes’ in the survey, which ones did we think were most funny?
Another way to think about it:
one variable
variable type: | categorical | quantitative |
---|---|---|
useful graphs: | bar chart | boxplot or histogram |
two variables
variable type: | both categorical | one categorical, one quantitative | both quantitative |
---|---|---|---|
useful graphs: | clustered bar chart | comparative boxplots | scatterplot |
Some Other Graphical Tools
Stem-and-leaf plots (e.g. p. 10 of Moore et al.)
Pie charts (controversial – can be bettered by a bar chart.)
Time plots (e.g. pp. 20-22 of Moore et al.). Suitable for time-ordered data. Common in the financial pages of newspapers.
Dot plots: “Poor person’s” histogram.
Class Survey Data
During the remainder of the class, we will look at other results in the “getting to know the class” exercise. These graphs illustrate what is regarded by statisticians as “best practice”. Unfortunately, not all of these graph types are supported by Excel! (These graphs will also be available on UNSW Moodle soon.)
Statistics Packages
Graphs (and indeed most statistical procedures) are most easily implemented using a computer, and a statistics package specially developed for data analysis. Common programs used for statistics:
• SAS
• SPSS (PASW)
• Excel (We will use Excel. Sadly, Excel doesn’t do boxplots!)
• R (We’ll use R too – used for most graphs in the lecture notes)
• Minitab
• S+ (S-PLUS)
Fancy graphs
We have covered some fundamental graphical tools. But new tools are constantly being developed and modified.
Depending on the problem at hand, there is nothing to stop you devising your own graphical display!
A good example of an improvised graphical display is the moving bubble plot used by Prof Hans Rosling in: http://www.youtube.com/watch?v=jbkSRLYSojo
Lectures 3-4: Numerical summaries
This lecture, we will meet common types of numerical summaries of data – ways of summarising the key properties of data using a few numbers.
• Introduction
• Summaries of categorical variables
• Summaries of quantitative variables
• Five-number summaries
• Outlier detection
• Linear transformations
Introduction
From last lecture:
Data → Information
Data are just a bunch of numbers. A major goal of Statistics is to make them informative.
Data analysis for one or two variables
one variable
variable type: | categorical | quantitative |
---|---|---|
useful graphs: | bar chart | boxplot or histogram |
two variables
variable type: | both categorical | one categorical, one quantitative | both quantitative |
---|---|---|---|
useful graphs: | clustered bar chart | comparative boxplots | scatterplot |
useful numbers: This lecture
Tools for Making Data Informative
• Graphical tools (last class).
• Numerical summaries (this week’s classes).
Relationship to Textbook
• Graphical tools: Section 1.1 Displaying Distributions with Graphs
• Numerical summaries: Section 1.2 Describing Distributions with Numbers
Types of numerical summary
Examples of numerical summaries are:
• proportions or percentages
• mean or average
• median
• interquartile range (IQR)
• standard deviation
Data analysis for one or two variables
one variable
variable type: | categorical | quantitative |
---|---|---|
useful graphs: | bar chart | boxplot or histogram |
useful numbers: | table of frequencies | mean and sd |
two variables
variable type: | both categorical | one categorical, one quantitative | both quantitative |
---|---|---|---|
useful graphs: | clustered bar chart | comparative boxplots | scatterplot |
Recommended numerical summaries
If you want to summarise one categorical variable: table of frequencies or percentages
If you want to summarise one quantitative variable:
Measures of: | location | spread |
---|---|---|
Commonly used: | mean ($\bar{x}$) | standard deviation ($s$) |
Robust to outliers: | median ($M$) | interquartile range (IQR) |
Summaries of categorical variables
Consider the data from the class survey last lecture.
• Is gender a quantitative or categorical variable?
• What type of numerical summary would you use for gender of MATH1041 students?
Numerical Summary of Gender
Gender | Frequency | % |
---|---|---|
Female | 208 | 57.94 |
Male | 151 | 42.06 |
Satisfaction with UNSW
The mean satisfaction rating with UNSW is 7.61
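As a quick check, the percentages in the gender summary can be reproduced from the frequencies. This is a minimal sketch; the variable names are illustrative, and only the two counts come from the notes.

```python
# Frequencies taken from the gender summary in the notes.
counts = {"Female": 208, "Male": 151}

total = sum(counts.values())

# Percentage for each category, rounded to 2 decimal places as in the notes.
percentages = {gender: round(100 * count / total, 2)
               for gender, count in counts.items()}

for gender, count in counts.items():
    print(gender, count, percentages[gender])
```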
Summaries of quantitative variables
Measures of location
Given measurements of a quantitative variable, an obvious question is: How large (or small) are the values?
Measures of “location” tell us how large (or small) the typical value is.
The mean is just another name for what is commonly called the average of a set of numbers.
A common notation for the mean (used in textbook) is $\bar{x}$.
The mean also has a physical interpretation as the centre of gravity of the data.
Travel Times to UNSW
The mean travel time to UNSW is 332.76 minutes.
Labour cost
The mean cost of labour is $318,929.86
The Mean Can be Heavily Influenced by Outliers
The problem is that a couple of “entrepreneurs” gave unrealistic answers: someone this semester said $10 billion!
If the entrepreneurs are removed (any value over $1,000), then the mean changes from $318,929.86 to $63.37
The Median – an Alternative to the Mean
Even after removing big outliers, the mean is still not describing the “typical” labour cost very well.
A more satisfactory numerical summary in this case is:
The median labour cost is $35
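The effect of a single outlier on the mean and on the median can be seen with a small experiment. The sketch below uses hypothetical labour costs (the class survey data are not reproduced in these notes) and Python's standard `statistics` module.

```python
import statistics

# Hypothetical labour costs in dollars (illustrative values, NOT the class survey data).
costs = [20, 25, 30, 35, 40, 50, 60]

# The same data plus one unrealistic "entrepreneur" answer of $10 billion.
with_outlier = costs + [10_000_000_000]

for label, data in [("without outlier", costs), ("with outlier", with_outlier)]:
    print(label,
          "| mean:", round(statistics.mean(data), 2),
          "| median:", statistics.median(data))
```

The mean jumps by many orders of magnitude when the outlier is added, while the median barely moves.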
Definition of the median
The median is the “middle value”. For n values sorted as $x_1, x_2, \ldots, x_n$, the median is:
• $x_{(n+1)/2}$ if n is odd
• the average of $x_{n/2}$ and $x_{n/2+1}$ if n is even
Textbook notation: the textbook refers to the median as M.
Often Mean and Median are Close
Variable | $\bar{x}$ | M |
---|---|---|
UNSW satisf. | 7.61 | 8 |
Travel time | 332.76 min | 60 min |
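The definition of the median translates directly into code. This is a minimal sketch of the odd and even cases; the function name `median` and the toy inputs are illustrative.

```python
def median(values):
    """Median as defined in the notes: the middle value of the sorted data."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: x_{(n+1)/2} in 1-based indexing is xs[mid] in 0-based indexing
        return xs[mid]
    # n even: average of x_{n/2} and x_{n/2+1} (1-based)
    return (xs[mid - 1] + xs[mid]) / 2


print(median([3, 1, 2]))      # 2
print(median([4, 1, 3, 2]))   # 2.5
```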
Computing the Median
A sample of 26 UNSW satisfaction ratings led to the following stem-and-leaf plot.
Data: 5, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9
55
6
7777777777
888888888
9999
What is the median of these data? Answer:
Mean versus Median
A dynamic appreciation for the differences between mean and median is provided by the Mean and Median Applet on the textbook web-site www.whfreeman.com/ips8e
Medians and Boxplots
The median ratings for the jokes were
Joke: | baby | boomerang | crash | deathstar | frost | sperm |
---|---|---|---|---|---|---|
Median: | 2 | 3 | 3 | 4 | 3 | 6 |
These correspond to the horizontal bars in the boxplots!
How about the edges of the box?
These correspond to the medians of the lower and upper half of the data, and are called:
Q1 = first quartile
Q3 = third quartile
(note: the median can be thought of as the second quartile, M = Q2)
Class Exercise: Computing Q1 and Q3
Recall the UNSW satisfaction dataset from a few slides ago:
55
6
7777777777
888888888
9999
What are Q1 and Q3 for these data? Answer:
Measures of spread
Asthma example
A few years ago a colleague was involved in a study that explored possible genetic differences between asthmatics and non-asthmatics.
The histograms on the next slide show FENO = Fraction of Expired Nitric Oxide (“biomarker” for asthma) for two groups A and B with genetic differences.
[Figure: histograms of FENO for Group A and FENO for Group B, each with a horizontal axis from 0 to 50]
Spread of a Set of Data
The two groups do not differ considerably in their central location.
But they do differ substantially in their spread.
[Figure: comparative boxplots of FENO (axis from 10 to 50) for groups A and B]
Simple Measure of Spread
A simple measure of spread is the height of the box part of the boxplot. From earlier slides this is:
Q3 − Q1 = interquartile range = IQR
IQR for Small Example
For the following dataset (26 UNSW satisfaction scores):
55
6
7777777777
888888888
9999
recall that Q1 = 7 and Q3 = 8. This means that the interquartile range is
IQR = Q3 − Q1 = 8 − 7 = 1.
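The quartiles and IQR for the 26 satisfaction scores can be checked in a few lines. The sketch below takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, and assumes that for an odd number of values the overall median is left out of both halves; that convention is an assumption here, since the notes only use an even-sized example. The helper name `quartiles` is illustrative.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data.

    Assumption: when n is odd, the overall median is excluded from both halves.
    Requires at least two values.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[n - half:]
    return statistics.median(lower), statistics.median(upper)


# The 26 UNSW satisfaction scores from the notes.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

q1, q3 = quartiles(ratings)
iqr = q3 - q1
print(q1, q3, iqr)  # 7 8 1
```

Statistical packages use several slightly different quartile rules, so software output may differ a little from this hand method on other datasets.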
Standard Deviation – another measure of spread
Another measure of spread is the standard deviation, denoted by $s$ and calculated as
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}}$$
where $x_1, x_2, \ldots, x_n$ denote the data.
Note: It is easy to calculate s using statistics mode on your calculator. There is a standard deviation button: usually $\sigma_{n-1}$ or $s_x$.
Example
Recall our sample data on satisfaction with UNSW:
55
6
7777777777
888888888
9999
n = 26
$x_1 = 5, x_2 = 5, x_3 = 6, x_4 = 7, \ldots, x_{26} = 9.$
You can use your calculator to show that for this dataset,
$$\bar{x} \simeq 7.46$$
$$s = \sqrt{\frac{(5 - 7.46)^2 + (5 - 7.46)^2 + \cdots + (9 - 7.46)^2}{25}} \simeq 1.067$$
The standard deviation for this data set is 1.067 (to 3 decimal places).
Often IQR and s are Similar
Variable | IQR | s |
---|---|---|
UNSW satisf. | 1 | 1.34 |
Hair cost | $25 | $475.38 |
The Standard Deviation Can be Heavily Influenced by Outliers
For the labour cost data we get
s = $4,533,995.12
But if the outliers (values greater than $1,000) are removed from the sample then we get
s = $110.45
This still seems a little high though... probably because there are still some people who replied with pretty large values.
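The worked example on the 26 satisfaction ratings can be verified numerically. The sketch below applies the standard deviation formula (with divisor n − 1) step by step and, as a cross-check, compares the result with `statistics.stdev` from Python's standard library; the variable names are illustrative.

```python
import math
import statistics

# The 26 UNSW satisfaction ratings from the notes.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

n = len(ratings)
mean = sum(ratings) / n

# Sum of squared deviations from the mean, divided by n - 1, then square-rooted.
squared_deviations = sum((x - mean) ** 2 for x in ratings)
s = math.sqrt(squared_deviations / (n - 1))

print(round(mean, 2), round(s, 3))  # 7.46 1.067

# Cross-check against the library implementation of the sample standard deviation.
assert math.isclose(s, statistics.stdev(ratings))
```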
IQR is Hardly Affected by Outliers
For the full labour costs data we get IQR = $39.5
With the outliers (values greater than $1000) omitted we get IQR = $35
Five-Number Summaries
Textbook advocates the five-number summary:
Min. Q1 M Q3 Max.
where Min. and Max. are the smallest and largest values.
Five-Number Summary for UNSW Satisfaction
Min. | Q1 | M | Q3 | Max. |
---|---|---|---|---|
1 | 7 | 8 | 8 | 10 |
Five-Number Summary for Travel Times (in minutes)
Min. | Q1 | M | Q3 | Max. |
---|---|---|---|---|
0 | 17.75 | 60 | 90 | 100000 |
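A five-number summary can be assembled from the pieces already introduced. The sketch below is illustrative only: it applies the "medians of the lower and upper halves" rule for Q1 and Q3 (leaving the overall median out of both halves when n is odd, which is an assumption) to the sample of 26 satisfaction ratings, not to the full class data summarised in the tables. The function name `five_number_summary` is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) for a list of at least two numbers.

    Q1 and Q3 are the medians of the lower and upper halves of the sorted data.
    Assumption: when n is odd, the overall median is excluded from both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[n - half:])
    return xs[0], q1, statistics.median(xs), q3, xs[-1]


# Sample of 26 UNSW satisfaction ratings used earlier in the notes.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

print(five_number_summary(ratings))  # (5, 7, 7.5, 8, 9)
```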
Class Exercise: Five-Number Summary
A poll of age in years of 20 randomly chosen students led to the data:
18 19 20 21 22 23 24 25
18 19 19 19 19 19 20 21 21 21
Data analysis for one or two variables
one variable
two variables
variable type:
categorical
quantitative
useful graphs:
bar chart
boxplot or histogram
useful numbers:
table of frequencies