Math 1041 lecture notes PDF

Title	Math 1041 lecture notes
Author	michael Shi
Course	Statistics for Life and Social Science
Institution	University of New South Wales
Pages	194
File Size	4.3 MB
File Type	PDF


FACULTY OF SCIENCE SCHOOL OF MATHEMATICS AND STATISTICS

MATH1041 STATISTICS FOR LIFE AND SOCIAL SCIENCES Course Pack Part B Lecture Notes

Semester 1, 2016

Descriptive statistics

What is statistics?

statistic – summary of data (which are measures of events)

field of statistics – the collecting, analysing and understanding of data measured with uncertainty

Who needs to know about statistical methods? Anyone who collects, analyses or wants to understand data. (And a lot of people do!)

Course aims

This course provides an introduction to statistics: the study of collecting, analysing, and interpreting data. Statistics plays a fundamental role in quantitative research (research involving data). Some examples of fields in which quantitative research plays a major role are psychology, biology, physics, economics, ...

Some example problems

Does smoking when you’re pregnant affect your child’s development?

A study was conducted (Johns et al 1993) using guinea pigs to address this question. Ten pregnant guinea pigs were injected with nicotine tartrate, and ten were not. Offspring were then given an “intelligence test”, a maze through which they had to pass to find food.

Results (number of errors the offspring made in the maze):

Group       Sample size   Mean   Standard deviation
Control     10            23.4   12.3
Treatment   10            44.3   21.5

Is there evidence that guinea pigs in the treatment group (those whose mums were “smokers”) were slower learners, on average?

Butter side up or butter side down?

The Mythbusters were testing the myth that toast lands butter side down more often than butter side up. To do this, they pushed 24 pieces of toast off a table, and found that 14 landed butter side down. Is this good evidence that toast lands butter side down more often than butter side up?

Skills to be developed

In this course, you will learn how to approach designing studies and analysing data to answer research questions like the above. In particular, at the end of this course, you will be able to:
1. Recognise which analysis procedure is appropriate for a given research problem involving one or two variables
2. Understand principles of study design
3. Apply probability theory to practical problems
4. Apply statistical procedures on a computer using Microsoft Excel or R
5. Interpret computer output for a statistical procedure
6. Calculate confidence intervals and conduct hypothesis tests by hand for small datasets
7. Understand the usefulness of Statistics in your professional area

Lecture 2: Graphs

During this lecture, we will meet common graphs used for visualising data. Graphing data is a key step in analyses – it is important to use the appropriate graph(s) for your situation!

• Introduction – the role of graphs
• Quantitative or categorical?
• Recommended graphical tools
• Class survey data

Introduction – the role of graphs

Data → Information

Data are just a bunch of numbers. A major goal of Statistics is to make them informative.

Tools for Making Data Informative

• Graphical tools (today’s class).
• Summary measures (next class).

Which type of graph to use depends on:
• Whether you are summarising one variable or looking at the relationship between two variables
• Whether the variables are quantitative or categorical (qualitative)

Quantitative or categorical?

A categorical variable places an individual into one of several categories.
A quantitative variable takes numerical values, measured on a scale.

Which of the following variables are quantitative, and which are categorical?
• gender
• satisfaction with UNSW (from 0 to 10)
• time travelling to UNSW
• method of travelling to UNSW

Recommended graphical tools

• If you want to summarise one variable
  • and it is quantitative: a histogram or boxplot
  • and it is categorical: a bar graph (or “bar chart”)
• If you want to explore the relationship between two variables
  • and both are quantitative: a scatterplot
  • and both are categorical: a clustered bar chart or a jittered scatterplot
  • and one is categorical, the other quantitative: comparative boxplots or comparative histograms

What sort of graph would you use to summarise:
• gender of MATH1041 students
• satisfaction with UNSW (from 0 to 10)
• time travelling to UNSW
• method of travelling to UNSW
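The “Recommended graphical tools” rules can be written down as a small lookup, which makes the two deciding questions explicit: how many variables, and what type each one is. The following is only a sketch in Python (the course itself uses Excel and R); the function name `recommend_graph` and the string labels are illustrative choices, not part of the course material.

```python
# Sketch of the "Recommended graphical tools" rules as a lookup table.
# The names below are illustrative assumptions, not course code.

RECOMMENDED_GRAPHS = {
    ("quantitative",): "histogram or boxplot",
    ("categorical",): "bar graph",
    ("quantitative", "quantitative"): "scatterplot",
    ("categorical", "categorical"): "clustered bar chart or jittered scatterplot",
    ("categorical", "quantitative"): "comparative boxplots or comparative histograms",
}


def recommend_graph(*variable_types):
    """Return the recommended graph for one or two variable types."""
    # Sorting means the order in which two variables are given does not matter.
    key = tuple(sorted(variable_types))
    if key not in RECOMMENDED_GRAPHS:
        raise ValueError(
            "expected one or two types, each 'quantitative' or 'categorical'"
        )
    return RECOMMENDED_GRAPHS[key]


print(recommend_graph("quantitative"))
print(recommend_graph("quantitative", "categorical"))
```

To use it, first decide the type of each variable in the research question, then pass those types in.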

What to look for in a graph

When commenting on a graph of a quantitative variable, consider:
• the location (where most of the data are) and spread (or variability) of the data
• the shape of the data (symmetric, left-skewed or right-skewed?)
• if there are any unusual observations

[Figure: Change in location. Histograms of frequency against the variable.]

[Figure: Spread. Histograms showing a smaller spread and a larger spread.]

[Figure: Typical shapes: Symmetric]

[Figure: Typical shapes: Skewed to the left]

[Figure: Typical shapes: Skewed to the right]

The following histogram depicts the scoring average of players from the National Basketball Association (NBA) up to the 2008 season.

[Figure: Histogram of points per game. Frequency (0 to 120) against points per game (0 to 30).]

Comment on the location, spread and shape of the histogram.

Identify the variable(s) involved in the following questions, whether they are quantitative or categorical, and what sort of graph you would use to answer the questions:
• Do male and female MATH1041 students have different levels of satisfaction with UNSW?
• How much do students pay for a haircut?
• Does the amount you pay for a haircut depend on your gender?
• Is there a relationship between how much the class spends on their hair and how much they would charge for their labour?
• Is there a relationship between how funny the class found the boomerang joke and the frostbite joke?
• Of the 6 ‘jokes’ in the survey, which ones did we think were most funny?

Another way to think about it:

Number of variables   Variable type                       Useful graphs
one                   categorical                         bar chart
one                   quantitative                        boxplot or histogram
two                   both categorical                    clustered bar chart
two                   one categorical, one quantitative   comparative boxplots
two                   both quantitative                   scatterplot

Some Other Graphical Tools

Stem-and-leaf plots (e.g. p. 10 of Moore et al.)

Pie charts (controversial – can be bettered by a bar chart.)

Time plots (e.g. pp. 20-22 of Moore et al.). Suitable for time ordered data. Common in the financial pages of newspapers.

Dot plots. “Poor person’s” histogram.

Statistics Packages

Graphs (and indeed most statistical procedures) are most easily implemented using a computer and a statistics package specially developed for data analysis. Common programs used for statistics:
• SAS
• SPSS (PASW)
• Excel (We will use Excel. Sadly, Excel doesn’t do boxplots!)
• R (We’ll use R too – used for most graphs in the lecture notes)
• Minitab
• S+ (S-PLUS)

Class Survey Data

During the remainder of the class, we will look at other results in the “getting to know the class” exercise. These graphs illustrate what is regarded by statisticians as “best practice”. Unfortunately, not all of these graph types are supported by Excel! (These graphs will also be available on UNSW Moodle soon)

Fancy graphs

We have covered some fundamental graphical tools. But new tools are constantly being developed and modified.

Depending on the problem at hand, there is nothing to stop you devising your own graphical display!

A good example of an improvised graphical display is the moving bubble plot used by Prof Hans Rosling in: http://www.youtube.com/watch?v=jbkSRLYSojo

Lectures 3-4: Numerical summaries

This lecture, we will meet common types of numerical summaries of data – ways of summarising the key properties of data using a few numbers.

• Introduction
• Summaries of categorical variables
• Summaries of quantitative variables
• Five-number summaries
• Outlier detection
• Linear transformations

Introduction

From last lecture: Data → Information

Data are just a bunch of numbers. A major goal of Statistics is to make them informative.

Data analysis for one or two variables

Number of variables   Variable type                       Useful graphs
one                   categorical                         bar chart
one                   quantitative                        boxplot or histogram
two                   both categorical                    clustered bar chart
two                   one categorical, one quantitative   comparative boxplots
two                   both quantitative                   scatterplot

useful numbers: This lecture

Tools for Making Data Informative

• Graphical tools (last class).
• Numerical summaries (this week’s classes).

Relationship to Textbook

• Graphical tools: Section 1.1 Displaying Distributions with Graphs
• Numerical summaries: Section 1.2 Describing Distributions with Numbers

Data analysis for one or two variables

Number of variables   Variable type                       Useful graphs          Useful numbers
one                   categorical                         bar chart              table of frequencies
one                   quantitative                        boxplot or histogram   mean and sd
two                   both categorical                    clustered bar chart
two                   one categorical, one quantitative   comparative boxplots
two                   both quantitative                   scatterplot

Types of numerical summary

Examples of numerical summaries are:
• proportions or percentages
• mean or average
• median
• interquartile range (IQR)
• standard deviation

Recommended numerical summaries

If you want to summarise one categorical variable: table of frequencies or percentages

If you want to summarise one quantitative variable:

Measures of:          location     spread
Commonly used:        mean (x̄)     standard deviation (s)
Robust to outliers:   median (M)   interquartile range (IQR)

Summaries of categorical variables

Consider the data from the class survey last lecture.
• Is gender a quantitative or categorical variable?
• What type of numerical summary would you use for gender of MATH1041 students?

Numerical Summary of Gender

Gender   Frequency   %
Female   208         57.94
Male     151         42.06

Summaries of quantitative variables

Measures of location

Given measurements of a quantitative variable, an obvious question is: How large (or small) are the values?

Measures of “location” tell us how large (or small) the typical value is.

Satisfaction with UNSW

The mean satisfaction rating with UNSW is 7.61

The mean is just another name for what is commonly called the average of a set of numbers.

A common notation for the mean (used in textbook) is x̄

The mean also has a physical interpretation as the centre of gravity of the data.
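The “centre of gravity” description of the mean can be checked numerically: the distances of the values below the mean exactly balance the distances of the values above it, so the deviations from the mean add up to zero. A minimal sketch in Python follows; the four data values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical data (not from the class survey), picked for easy arithmetic.
values = [2, 4, 4, 10]

# The mean (x-bar) is the sum of the values divided by how many there are.
mean = sum(values) / len(values)

# Centre-of-gravity property: deviations from the mean cancel out.
deviations = [x - mean for x in values]

print("mean:", mean)
print("deviations from the mean:", deviations)
print("sum of deviations:", sum(deviations))
```

Moving a single value far away from the rest shifts the balance point with it, which is why the mean is sensitive to outliers.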

Travel Times to UNSW

The mean travel time to UNSW is 332.76 minutes.

Labour cost

The mean cost of labour is $318,929.86

The Mean Can be Heavily Influenced by Outliers

The problem is that a couple of “entrepreneurs” gave unrealistic answers: someone this semester said $10 billion!

If the entrepreneurs are removed (any value over $1,000), then the mean changes from $318,929.86 to $63.37

The Median – an Alternative to the Mean

Even after removing big outliers, the mean is still not describing the “typical” labour cost very well.

A more satisfactory numerical summary in this case is:

The median labour cost is $35
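The effect described above is easy to reproduce on a small made-up dataset. The sketch below uses Python's standard `statistics` module; the eight labour costs are hypothetical (they are not the class survey data), with the last value playing the role of an “entrepreneur” answer.

```python
from statistics import mean, median

# Hypothetical labour costs in dollars (NOT the class survey data).
# The last value plays the role of an "entrepreneur" answer.
labour_costs = [15, 20, 30, 40, 50, 60, 100, 1_000_000]

# Same rule as in the notes: drop any value over $1,000.
trimmed = [x for x in labour_costs if x <= 1000]

print("mean, all values:        ", mean(labour_costs))
print("mean, outlier removed:   ", mean(trimmed))
print("median, all values:      ", median(labour_costs))
print("median, outlier removed: ", median(trimmed))
```

In this sketch a single extreme answer moves the mean by several orders of magnitude, while the median changes only slightly.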

Definition of the median

The median is the “middle value”. For n values sorted as \(x_1, x_2, \dots, x_n\), the median is:
• \(x_{(n+1)/2}\) if n is odd
• the average of \(x_{n/2}\) and \(x_{n/2+1}\) if n is even

Textbook notation: the textbook refers to the median as M.

Often Mean and Median are Close

               x̄            M
UNSW satisf.   7.61         8
Travel time    332.76 min   60 min

Mean versus Median

A dynamic appreciation for the differences between mean and median is provided by the Mean and Median Applet on the textbook web-site www.whfreeman.com/ips8e

Computing the Median

A sample of 26 UNSW satisfaction ratings led to the following stem-and-leaf plot.

55
6
7777777777
888888888
9999

Data: 5, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9

What is the median of these data? Answer:
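The definition of the median (sort the data, then take the middle value, or the average of the two middle values when n is even) translates directly into a few lines of code. The following Python sketch is one way to check a hand calculation for the 26 ratings; the function name `median_by_definition` is an illustrative choice.

```python
def median_by_definition(values):
    """Median as defined in the notes: sort, then take the middle value(s)."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    if n % 2 == 1:
        # n odd: the value in position (n+1)/2.
        # Python lists count from 0, hence the "- 1".
        return x[(n + 1) // 2 - 1]
    # n even: average of the values in positions n/2 and n/2 + 1.
    return (x[n // 2 - 1] + x[n // 2]) / 2


# The 26 UNSW satisfaction ratings from the stem-and-leaf plot.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

print("n =", len(ratings))
print("median =", median_by_definition(ratings))
```

Because n = 26 is even, the second branch applies and the result is the average of the 13th and 14th sorted values.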

Medians and Boxplots

The median ratings for the jokes were

Joke:     baby   boomerang   crash   deathstar   frost   sperm
Median:   2      3           3       4           3       6

These correspond to the horizontal bars in the boxplots!

How about the edges of the box?

These correspond to the medians of the lower and upper half of the data; and are called:

Q1 = first quartile
Q3 = third quartile

(note: the median can be thought of as the second quartile, M = Q2)

Class Exercise

Computing Q1 and Q3. Recall the UNSW satisfaction dataset from a few slides ago:

55
6
7777777777
888888888
9999

What are Q1 and Q3 for these data? Answer:

Measures of spread

Asthma example

A few years ago a colleague was involved in a study that explored possible genetic differences between asthmatics and non-asthmatics.

The histograms on the next slide show FENO = Fraction of Expired Nitric Oxide (“biomarker” for asthma) for two groups A and B with genetic differences.

[Figure: histograms of FENO for Group A and FENO for Group B, each on a horizontal axis from 0 to 50.]

Spread of a Set of Data

The two groups do not differ considerably in their central location.

But they do differ substantially in their spread.

[Figure: comparative boxplots of FENO (axis from 10 to 50) for groups A and B.]

Simple Measure of Spread

A simple measure of spread is the height of the box part of the boxplot. From earlier slides this is:

Q3 − Q1 = interquartile range = IQR

IQR for Small Example

For the following dataset (26 UNSW satisfaction scores):

55
6
7777777777
888888888
9999

recall that Q1 = 7 and Q3 = 8. This means that the interquartile range is

IQR = Q3 − Q1 = 8 − 7 = 1.
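The quartiles used here, the medians of the lower and upper halves of the sorted data, can be computed the same way in code. The Python sketch below reproduces Q1 = 7, Q3 = 8 and IQR = 1 for the 26 ratings. One assumption is flagged in the code: when n is odd, this sketch leaves the middle value out of both halves, a case the notes do not spell out (statistics packages use a variety of conventions and may give slightly different quartiles).

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data.

    Assumption (not stated in the notes): when n is odd, the middle value
    is excluded from both halves. Other conventions exist.
    """
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower = x[:half]        # smallest half of the data
    upper = x[n - half:]    # largest half of the data
    return median(lower), median(upper)


# The 26 UNSW satisfaction ratings from the notes.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

q1, q3 = quartiles(ratings)
iqr = q3 - q1
print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```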

Standard Deviation – another measure of spread

Another measure of spread is the standard deviation, denoted by s and calculated as

\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} \]

where \(x_1, x_2, \dots, x_n\) denote the data.

Note: It is easy to calculate s using statistics mode on your calculator. There is a standard deviation button: usually σn−1 or sx.

Often IQR and s are Similar

               IQR   s
UNSW satisf.   1     1.34
Hair cost      $25   $475.38

The Standard Deviation Can be Heavily Influenced by Outliers

For the labour cost data we get

s = $4,533,995.12

But if the outliers (values greater than $1,000) are removed from the sample then we get

s = $110.45

This still seems a little high though... probably because there are still some people who replied with pretty large values.

Example

Recall our sample data on satisfaction with UNSW:

55
6
7777777777
888888888
9999

n = 26

\(x_1 = 5, x_2 = 5, x_3 = 6, x_4 = 7, \dots, x_{26} = 9\).

You can use your calculator to show that for this dataset,

\[ \bar{x} \simeq 7.46 \]

\[ s = \sqrt{\frac{(5 - 7.46)^2 + (5 - 7.46)^2 + \dots + (9 - 7.46)^2}{25}} \simeq 1.067 \]

The standard deviation for this data set is 1.067 (to 3 decimal places).
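The worked example for the 26 satisfaction ratings can also be checked without a calculator. The Python sketch below applies the formula for s term by term (sum of squared deviations from the mean, divided by n − 1, then square-rooted) and compares the result with `statistics.stdev` from the standard library, which uses the same n − 1 divisor. The helper name `sample_sd` is an illustrative choice.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Standard deviation s with the n - 1 divisor, as in the notes."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sum(values) / n
    # Sum of squared deviations from the mean.
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return sqrt(squared_deviations / (n - 1))


# The 26 UNSW satisfaction ratings from the notes.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

x_bar = sum(ratings) / len(ratings)
s = sample_sd(ratings)

print("mean:", round(x_bar, 2))
print("s by the formula:", round(s, 3))
print("s from statistics.stdev:", round(stdev(ratings), 3))
```

The hand formula and the library function should agree; this mirrors the σn−1 (or sx) button on a calculator.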

IQR is Hardly Affected by Outliers

For the full labour costs data we get IQR = $39.5

With the outliers (values greater than $1000) omitted we get IQR = $35

Five-Number Summaries

Textbook advocates the five-number summary:

Min.   Q1   M   Q3   Max.

where Min. and Max. are the smallest and largest values.

Five-Number Summary for UNSW Satisfaction

Min.   Q1   M   Q3   Max.
1      7    8   8    10

Five-Number Summary for Travel Times (in minutes)

Min.   Q1      M    Q3   Max.
0      17.75   60   90   100000
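A five-number summary simply collects the pieces already defined: the smallest value, Q1, the median, Q3 and the largest value. The Python sketch below computes it for the sample of 26 satisfaction ratings used in earlier slides. That sample is smaller than the full class data summarised above, so its summary need not match. As before, the quartiles are taken as medians of the lower and upper halves, and leaving the middle value out of both halves when n is odd is an assumption of this sketch.

```python
from statistics import median


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max).

    Q1 and Q3 are the medians of the lower and upper halves of the sorted
    data. Assumption: for odd n the middle value is left out of both halves.
    """
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    return (x[0], median(x[:half]), median(x), median(x[n - half:]), x[-1])


# The sample of 26 UNSW satisfaction ratings from earlier slides.
ratings = [5] * 2 + [6] + [7] * 10 + [8] * 9 + [9] * 4

print("Min, Q1, M, Q3, Max:", five_number_summary(ratings))
```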

Class Exercise: Five-Number Summary

A poll of age in years of 20 randomly chosen students led to the data:
18 19 20 21 22 23 24 25
18 19 19 19 19 19 20 21 21 21

Data analysis for one or two variables

one variable    two variables

variable type:

categorical

quantitative

useful graphs:

bar chart

boxplot or histogram

useful numbers:

table of frequencies