MATH1041-Lec - Lecture notes All PDF

Title MATH1041-Lec - Lecture notes All
Course Statistics for Life and Social Science
Institution University of New South Wales




MATH1041 Chapter 1: Study Design

1. Data Collection and Organisation & Variable Types

o Data sets
Population: the collection of all individuals/items/objects under consideration in a statistical study, usually determined by what we want to know.
Cases: the members, objects, units, subjects or individuals from the population, from which information (data) is collected.
IDs/labels: the identification code of each individual.
Sample: a subset of the population.
Sample size n: the number of cases/observations in the sample.
Variable: a characteristic of the cases that can be measured, collected, recorded or counted. A variable is either categorical or quantitative:
- Categorical: places an individual into one of several categories, eg. gender.
- Quantitative: takes numerical values for which arithmetic (eg. averaging) makes sense, eg. temperature.

No. of variables: the total number of variables recorded, measured or collected.
Voluntary response sample: consists of people who choose themselves by responding to a general appeal. (Biased/skewed, because people with strong opinions, especially negative ones, are more likely to respond.)
Census: the procedure of systematically acquiring and recording information about every individual in a given population of interest (ie. the whole population).

2. Sources of Data

Anecdotal data: data collected haphazardly (such as data from your own experience).
 Eg. Suppose that a few of my friends are left-handed and are good at maths – does this mean left-handers are good at maths? (No: a handful of anecdotes is not evidence.)
Available data: data that were produced in the past for another purpose but that may help answer a present question.
 Eg. Government census data, Australian Bureau of Statistics (ABS) data.
Collecting your own data:
 Eg. Taking a quick class survey in a MATH1041 lecture.

A census is a survey of the whole population. Usually it cannot be done because of time, money and ethical constraints. Samples are often very informative – eg. accurately predicting the winner of an election by polling only 1,000 Australians.

3. Observational Studies vs Experiments

Observational study: individuals are observed and variables of interest are measured, but there is no attempt to influence responses.
 Eg. Ask several UNSW students how much time it takes them to get ready in the morning.
Experiment: some treatment is deliberately imposed on individuals, and we observe their response.
 Eg. The next week, half the students above were asked to drink a coffee in the morning and the other half not to drink any coffee. Then, at noon, they were asked if they had felt sleepy in the morning.



3.1 Observational Studies and Association

Observational studies can be used to find an association between 2 variables.

 Eg. The length of stay in a hospital is associated with the size of the hospital.

But often in science we are concerned with causation (causal link), not just association:

 Eg. Lack of sleep causes one's attention span to decrease.
 Eg. Low temperature applied to water causes ice to form.

Problem with observational studies: association does not mean causation

Lurking variable: a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
 Eg. "level of fitness" = response variable (the variable that changes or responds)
"amount of time spent exercising weekly" = explanatory variable
lurking variables = intensity of exercise, age, etc.

2 variables are confounded when their effects on a response variable cannot be distinguished from each other without further investigation.

Confounded variables may be either explanatory or lurking variables, or both.

A variable is a confounding variable if: (a) it is unobserved, and (b) its effects on the response variable are hard to distinguish from the effects of the chosen explanatory variable without further investigation.


There are many possible explanations for an observed association:
 Common response (ie. response to a common cause)
 Causation
 Confounding


4. Experimental Designs

An experiment allows us to demonstrate causation. We can make an intervention (the cause) and see whether or not there is an effect. In a carefully designed experiment, the intervention is the only possible explanation for any effect we observe, so we have demonstrated a cause-and-effect link.

 Eg. Does smoking cause cancer?
Common response: there could be a genetic factor that predisposes people both to nicotine addiction and to lung cancer.
Confounding: it might be that smokers live unhealthy lives in other ways (diet, alcohol, lack of exercise), and that such other habits, confounded with smoking, are a cause of lung cancer.


4.1 Principles of Experimental Design

Subjects (experimental units): the individuals on which the experiment is done.
Treatment: a specific experimental condition applied to subjects.
Factor: an explanatory variable in the experiment – a variable that is manipulated in different treatments.
Levels: each treatment is formed by combining a specific level of each of the factors in the experiment.
Response variable: the variable of primary interest, measured on subjects after treatment.

EXAMPLE: ARE SMALLER CLASS SIZES BETTER?
Observational studies suggest so, but small class sizes tend to happen in rich neighbourhoods. Hence the Tennessee STAR program experiment:


The subjects were 6385 students beginning Kindergarten. Each student was randomly assigned to one of three treatments: a regular class (22–25 students) with 1 teacher; a regular class (22–25 students) with 1 teacher and a full-time teacher's aide; or a small class (13–17 students). Each treatment was a level of a single factor: the size of the class.

The students stayed in the same kind of class for four years, with a single cohort of students progressing from kindergarten through third grade. After that, they went to regular classes. In later years the students from smaller classes had better results on the response variable, their marks on standard tests.


EXAMPLE: WHAT ARE THE EFFECTS OF REPEATED EXPOSURE TO ADVERTISING?
60 students view a 40-min TV program, but the program includes ads for a new smartphone. Different students see different ads – some see a 30 sec ad, some see a 90 sec ad. Each ad is shown either once, 3 times or 5 times. Students are then asked if they intend to purchase the new smartphone.

Subjects: the 60 students in the experiment.
Factor 1: the duration of the ad; and
Factor 2: the number of times the ad is shown.
There are therefore 2 × 3 = 6 treatments (each combination of duration and number of showings).

4.2 Compare, Randomise and Repeat

All experiments should employ the following principles:
 Compare 2 (or more) treatments. One group of subjects should be a suitably chosen control group (eg. subjects receiving a dummy treatment, ie. a placebo). These control subjects will be compared to subjects in the treatment group (receiving the treatment of interest).
 Randomise the assignment of subjects to treatments (to remove selection bias).
 Repeat the treatment on many subjects, to reduce chance variation.

4.3 Independent vs Dependent Variables

Explanatory variables = independent variables
Response variables = dependent variables

HOW DO YOU RANDOMISE? USE R

In R, randomisation is performed by the sample() function.
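As an illustration, sample() can randomly allocate subjects to a treatment group and a control group; the subject names and group sizes below are invented for this sketch:

```r
# Hypothetical pool of 6 subjects, to be split into two groups of 3
subjects <- c("Ann", "Bob", "Cat", "Dan", "Eve", "Fay")

set.seed(1)  # fix the seed so the random allocation is reproducible

# sample() without replacement draws a random subset of the subjects:
# these 3 receive the treatment, the remaining 3 form the control group
treatment_group <- sample(subjects, size = 3)
control_group   <- setdiff(subjects, treatment_group)

treatment_group
control_group
```

Because the allocation is random, neither group is systematically different from the other, which is what removes selection bias.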



4.4 Flow Charts

Flow charts give an outline of the design of an experiment.

4.5 Choice of Control Group

The control group should differ from the other treatments only in the application of the treatment of interest. A common type of control group receives a placebo – a dummy treatment. Sometimes called an experimental control, it controls for unforeseen effects that the experimental manipulation may have on subjects.

4.6 Types of Experiments

Randomised Comparative Experiment: subjects are randomly allocated to one of several treatments, and responses are compared across treatment groups.

Matched pairs designs are a type of randomised comparative experiment that produces more precise results than complete randomisation, because we are controlling for variation in response across pairs. Subjects are broken into pairs (that have similar properties) and each of the 2 treatments is applied to one subject from each pair.
 Identical twin studies – allow us to control for genetics.
 Before-after experiments – two measurements are taken on each subject, controlling for variation across subjects.


A block is a group of subjects known before the experiment to be similar in some way that might affect their response to treatment.

In a randomised block design, the random assignment of subjects to treatments is carried out separately within each block. A matched pairs design is a special case of a randomised block design, where the blocks are the pairs and there are 2 treatments.
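As a sketch of a randomised block design on made-up data, the random assignment can be carried out separately within each block; here the blocks are matched pairs and there are 2 treatments:

```r
set.seed(2)

# Hypothetical matched-pairs setup: 4 pairs of similar subjects, 2 per pair
pair <- rep(1:4, each = 2)

# Randomise separately within each pair: shuffle the two treatment labels,
# so exactly one subject per pair gets "A" and the other gets "B"
treatment <- unlist(lapply(1:4, function(p) sample(c("A", "B"))))

data.frame(pair, treatment)
```

Randomising within each pair, rather than across all 8 subjects at once, guarantees the treatments are balanced inside every block.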

4.7 Cautions about Experiments

1. Choose an appropriate control. The only thing that should vary across treatments is/are the factor(s) of interest. (Use a placebo?)
2. Beware of bias. If the administrator of the treatment knows what is being applied, this may bias the way they work with the subject. Hence, in a double-blind experiment, neither the subject nor the administrator of the treatment knows what is being applied.
3. Replicate = repeat the entire treatment independently for different subjects. When repeating a treatment for different subjects, it is important that all treatment steps are repeated: applying the treatment in one go to 1000 subjects is not the same as applying it separately to ten groups of 100 people each.
4. The experiment needs to be realistic. For experiments to say anything about the real world, they need to duplicate real-world conditions.

Chapter 2: Descriptive Statistics

1. Numerical Summaries for One Variable

# The numbers that we use to summarise data depend on:
a) the type of the variable(s) (categorical or quantitative)
b) the number of variables (1 vs 2)


1.1 Exploratory Analysis

Exploratory analysis consists of describing the main features of the data in a dataset. This description can be done by providing numerical summaries of the variables involved:
o Proportions or percentages
o Mean or average
o Median
o Interquartile range (IQR)
o Standard deviation

1.2 Numerical Summary for a Categorical Variable

A table of frequencies for one categorical variable:
- lists the possible categories (even those not observed),
- together with the count, percent or proportion of cases in each category.
# % = proportion × 100
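In R, such a table can be sketched with table() and prop.table(); the data below are invented, and using a factor lets unobserved categories appear with a count of 0:

```r
# Hypothetical categorical variable: pet ownership for 10 cases
pets <- factor(c("dog", "cat", "dog", "dog", "fish", "cat",
                 "dog", "cat", "dog", "dog"),
               levels = c("cat", "dog", "fish", "bird"))  # "bird" is never observed

counts      <- table(pets)         # count of cases in each category
proportions <- prop.table(counts)  # proportion of cases in each category
percentages <- 100 * proportions   # % = proportion x 100

counts       # "bird" shows up with a count of 0
percentages
```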

1.3 Numerical Summary for a Quantitative Variable

Include measures of:
- the location (where most of the data are), AND
- the spread (or variability) of the data.

1.4 Measures of Location

Measures of location show how large (or small) the typical value is:
o Mean
o Median
o Quartiles Q1 AND Q3

MEAN: mean(x) = (x_1 + x_2 + x_3 + … + x_n) / n

MEDIAN (of the ordered data):
median(x) = x_((n+1)/2) if n is odd;
median(x) = the average of x_(n/2) and x_((n/2)+1) if n is even.
> The median is the "middle value": half the observations are below the median while half are above.

Mean vs. Median
 Outliers are unusually large (or small) data values that tend to be quite far away from where the bulk of the data is contained.
 The mean can be grossly affected by outliers, whereas the median is robust (resistant) to outliers.

QUARTILES Q1 AND Q3:
The 1st and 3rd quartiles are the medians of the bottom and top halves of the data (after the median splits the data into two groups). The quartiles divide the data into four groups which each contain roughly 25% of the data.
Q1 = median of the observations whose position in the ordered data is to the left of the location of the overall median.
Q3 = median of the observations whose position in the ordered data is to the right of the location of the overall median.


# Q2 (2nd quartile) is just the median (M = Q2)

1.5 Measures of Spread
o Interquartile range
o Standard deviation
o Variance

IQR = Q3 – Q1

STANDARD DEVIATION:
sd(x) = sqrt( [ (x_1 − mean(x))^2 + (x_2 − mean(x))^2 + … + (x_n − mean(x))^2 ] / (n − 1) )

VARIANCE: var(x) = sd(x)^2 (the sd squared)

1.6 Recommended Numerical Summaries for ONE Quantitative Variable

Five-number summary: min, Q1, median, Q3, max – it compiles the measures encountered above.
In R: fivenum() gives the five-number summary; summary() also gives the mean.
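The summaries above can be sketched on a small invented data set:

```r
# Hypothetical sample, n = 7 (written in sorted order for readability)
x <- c(2, 4, 4, 5, 7, 9, 11)

mean(x)      # (2+4+4+5+7+9+11)/7 = 6
median(x)    # n is odd, so the middle (4th) ordered value: 5
IQR(x)       # Q3 - Q1
var(x)       # sum of squared deviations from the mean, divided by n - 1
sd(x)        # square root of the variance
fivenum(x)   # five-number summary: min, Q1, median, Q3, max
summary(x)   # five-number summary plus the mean
```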

2. RStudio and Graphical Summaries for One Variable

The mean can be interpreted as the centre of gravity of the distribution of the data.


1.5×IQR Criterion for Outliers

An observation is a suspected outlier if it lies below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. On a boxplot, the suspected outliers are the observations outside the whiskers.
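A minimal sketch of the 1.5×IQR rule on invented data; note the fences below use R's default quantile() convention, which can differ slightly from the hand-calculation method used in class:

```r
# Hypothetical data with one extreme value
x <- c(3, 4, 5, 5, 6, 7, 8, 30)

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr   # observations below this are suspected outliers
upper_fence <- q3 + 1.5 * iqr   # observations above this are suspected outliers

outliers <- x[x < lower_fence | x > upper_fence]
outliers   # here, only the extreme value 30 is flagged
```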

When commenting on a graph of a quantitative variable, describe the overall pattern of the data in terms of:

 The location (where most of the data are) and spread (or variability) of the data;
 The shape of the data (symmetric, left-skewed or right-skewed);

Figure 1: Left-skewed. Figure 2: Right-skewed. Figure 3: Symmetrical.

 Whether there are any unusual observations (suspected outliers).

3. Relationships Between Two Variables

3.1 At Least 1 Categorical Variable

NUMERICALLY:
- Break the data up by the categories of that variable
- Summarise the data in each category using the appropriate one-variable numerical method
- Compare

GRAPHICALLY:
- Break the data up by the categories of that variable
- Summarise the data in each category using the appropriate one-variable graphical method
- Compare


3.2 Quantitative Variables

SCATTERPLOT
Describing the relationship between 2 quantitative variables:
o Existent versus non-existent (is there a relationship between the 2 variables?)
o Strong versus weak
o Increasing versus decreasing (as x increases, y increases – positive; as x increases, y decreases – negative)
o Linear versus non-linear (semi-defined/defined shape)

CORRELATION
A number that measures how 2 quantities are associated. It measures the strength of the linear statistical relationship between 2 quantitative variables.

 Standardisation
To standardise a set of data x = {x_1, …, x_n}, rescale it to have a mean of zero and a standard deviation of one.

Each case's value x̃_i in the new transformed set of data x̃ indicates its difference from the mean, mean(x), of the original set of data, in numbers of standard deviations, sd(x), of the original data.
 Eg. After you standardise, a value of 0.5 indicates that the value for that case is half a standard deviation above the mean, while a value of −2 indicates that the case has a value two standard deviations below the mean.

How to standardise:
1. The mean is subtracted from the value for each case, resulting in a mean of zero.
2. The difference between the individual's value and the mean is divided by the standard deviation, which results in a standard deviation of one.

Standardised value: x̃_i = (x_i − mean(x)) / sd(x), i = 1 … n.

 Pearson's Coefficient of Correlation r
It measures the strength and direction of a linear relation between x and y. Its formula involves standardised data (standardisation neutralises scale effects):

r = r(x, y) = (x̃_1 ỹ_1 + x̃_2 ỹ_2 + … + x̃_n ỹ_n) / (n − 1)

 Properties of the Correlation r
# r always lies between −1 and 1; values near ±1 indicate a strong linear relationship, values around 0.5/−0.5 a moderate one, and values near 0 a weak one.
# The cases in the x data set and in the y data set must be the same (meaning the x and y coordinates of one dot in the scatterplot must have been measured on the same individual).
# r only measures the strength of the linear relationship. If there is a strong association but not a linear one, r does not have to be close to −1 or 1.
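The standardisation step and the formula for r can be sketched directly, and checked against R's built-in cor() function (the data are invented):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Standardise: subtract the mean, then divide by the standard deviation
zx <- (x - mean(x)) / sd(x)   # zx now has mean 0 and sd 1
zy <- (y - mean(y)) / sd(y)

# Pearson's r from the standardised values...
r_manual <- sum(zx * zy) / (length(x) - 1)

# ...agrees with the built-in correlation function
r_builtin <- cor(x, y)

c(r_manual, r_builtin)
```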


5. Least-Squares Regression

 Explanatory and Response Variables
Regression is a method of explaining the relationship between two quantitative variables when the variables are not interchangeable, ie. of explaining how a response variable is related to explanatory variables.
 Eg. One variable is believed to cause changes in, predict, or explain variations of the other variable.

 The variable that is used to explain or predict is the explanatory variable. The other is the response variable.
 It is conventional in scatterplots used in regression to put the explanatory variable on the horizontal x-axis and the response variable on the vertical y-axis. (Simple linear regression.)

Least-squares is a mathematical method for determining a "line of best fit" through the scatterplot points. The line is chosen to minimise the sum of the squared vertical distances between each point and the line.

Straight lines: y = b_0 + b_1 x
where b_0 = y-intercept (the value of y when x = 0), and b_1 = slope/gradient of the line.
# The value of b_1 represents the magnitude of the effect of x on y.

The fitted line is ŷ = b_0 + b_1 x. The values of ŷ are called fitted or predicted values, meaning we can predict values of the response variable from the least-squares regression line.

# A linear regression equation should only be used to make predictions for values of the explanatory variable within the range of the actual data – otherwise this is extrapolating.


# Check the appropriateness of the basic regression assumptions, so that the assumption of linearity is satisfied. Always plot the data first to check whether regression is appropriate.
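A least-squares line can be fitted in R with lm(); the variable names and data below are invented for this sketch:

```r
# Hypothetical data: hours studied (explanatory) vs exam mark (response)
hours <- c(1, 2, 3, 4, 5, 6)
mark  <- c(52, 55, 61, 64, 70, 74)

plot(mark ~ hours)           # always plot the data first!

fit <- lm(mark ~ hours)      # least-squares regression of mark on hours
coef(fit)                    # b0 (intercept) and b1 (slope)

# Predict only within the range of the data (1 to 6 hours):
# predicting outside that range would be extrapolating
predict(fit, newdata = data.frame(hours = 3.5))
```

Note the formula interface mark ~ hours puts the response on the left and the explanatory variable on the right, matching the y-axis/x-axis convention above.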


6. Residuals and the r-Squared Value

 Linear Regression Assumptions
When fitting a least-squares line to data on y and x, the assumption is that the overall structure of the data is linear. The line should go right through the points, and the points should be scattered on both sides of the line with no apparent structure (pattern).

6.1 Residual Plots

The residuals from a least-squares regression are obtained by subtracting the fitted values ŷ (aka predicted values) from the response values: residual = y − ŷ.

A residual plot is a scatterplot of the residuals against the explanatory variable x.
Interpreting residual plots: if the regression line captures the overall pattern of the data, there should be no pattern in the residuals.

6.2 Measuring the Strength of a Linear Regression

An important quantity throughout regression is the coefficient of determination R². It measures the strength of the regression (linear or not).

For simple linear least-squares regression (with an intercept), R² equals the square of Pearson's correlation r, so we write it r² instead.

r measures the strength of linear relationships between two variables on an equal footing. The r² coefficient is the proportion of the variance in the response variable that is predictable from the explanatory variable:

r² = (variance of the ŷ values) / (variance of the y values)

# (The variance is the square of the standard deviation.)
So r² is the % of variation in y that is explained by the linear regression.


A small r² does not necessarily mean there is no relationship: r² only assesses linear relationships.
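Both points can be sketched numerically on invented data: for a simple linear fit, the R² reported by summary() matches cor(x, y)², while a perfect but symmetric curved relationship gives an r (and hence r²) of essentially zero:

```r
x <- c(-3, -2, -1, 0, 1, 2, 3)
y_linear <- 2 * x + c(0.1, -0.2, 0, 0.1, -0.1, 0.2, -0.1)  # roughly linear
y_curved <- x^2                                            # strong relationship, but not linear

fit <- lm(y_linear ~ x)
r2  <- summary(fit)$r.squared

r2                   # close to 1: the line explains almost all the variation
cor(x, y_linear)^2   # equals r2 for simple linear regression

cor(x, y_curved)     # essentially 0, despite the perfect (quadratic) relationship
```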

Chapter 3: Probability, Discrete Random Variables and the Binomial Distribution

1. Probability

 A phenomenon or an experiment is said to be random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.

 The probability of a given outcome (or set of outcomes) of a random experiment is the proportion of times this outcome (or set of outcomes) will occur, in an infinitely large number of repetitions; the so-called...


Similar Free PDFs