Notes PDF

Title Notes
Course Statistics for Life and Social Science
Institution University of New South Wales
Pages 32


MATH1041 – Statistics for Life and Social Science

Lecture 1
Section 1: Introduction to Data Collection and Organisation

A major goal of statistics is to make data informative

Steps involved in a statistical analysis:

• What data to collect? (This depends on the research question and who asks it.)
• How to collect data in a clever way? (Design of experiments, computer simulation.)
• How to organise your data? (In paper notebooks, databases, even DNA for long-term storage.)
• How to describe your data? (File format and size, file content, statistical descriptive summaries.)
• How to analyse data? (Relationships, statistical inference.)

Data Sets (I)

A data set is a collection of data (i.e., numbers, qualifiers, pieces of information). In statistics, data sets usually come from actual observations collected on cases, obtained by sampling a population of such cases.

Most commonly, a data set corresponds to the contents of a single database table, or a single rectangular statistical table. Each row of the table holds the observations (values) of a few variables (such as height and weight) on one given element (case) of the population. Each column of the table represents a particular variable. The number of rows, that is, the number of cases in the data set, is called the sample size.

Data Sets (II)

You should always clearly identify the:

• Population: the collection of all individuals or items or objects under consideration in a statistical study, usually determined by what we want to know.
• Cases: the members, objects, units, subjects or individuals from the population, from which information (i.e., data) is collected.
• IDs/labels: the identification code of each individual.
• Sample: a subset of the population.
• Sample size n: the number of cases/observations in the sample.
• Variable: a characteristic of the cases that can be measured, collected, recorded or counted.
• Number of variables: the total number of variables recorded, measured or collected.

A Real Data Set Example

A researcher decides to conduct a study about the population of Tasmanian devils (cases) living in Tasmania. She manages to capture 30 specimens in the wild (sample), and for each one (Specimen 1, Specimen 2, etc.) (IDs/labels) she records the weight, the presence or absence of the disease, the severity of the disease on a scale of 1 (mild) to 5 (extremely severe) and the location where the animal was captured (variables). The sample size is n = 30 and the number of variables is 4.
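As a rough illustration of how such a data set maps onto a rectangular table, the sketch below stores one row per case and one key per variable. The three records, the field names and all values are invented for illustration; they are not data from the study.

```python
# Hypothetical records for illustration only; the values are invented.
devils = [
    {"id": "Specimen 1", "weight_kg": 7.2, "diseased": True,  "severity": 3,    "location": "Site A"},
    {"id": "Specimen 2", "weight_kg": 6.1, "diseased": False, "severity": None, "location": "Site B"},
    {"id": "Specimen 3", "weight_kg": 8.4, "diseased": True,  "severity": 1,    "location": "Site A"},
]

# The ID column labels each case and is not counted as a variable.
variables = [name for name in devils[0] if name != "id"]

sample_size = len(devils)             # number of rows (cases)
number_of_variables = len(variables)  # number of recorded characteristics

print("n =", sample_size)
print("variables:", variables)
```

With the full sample from the example, the same counts would give n = 30 and 4 variables.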

Data Sets (III)

A data set also refers to a computer file having a record organisation, namely a standard way that information is encoded for storage (called the format of the file). Many different file formats exist, usually recognised by looking at the file extension (e.g., .png for images, .txt for plain text, .html for web pages, .zip for compressed files, .xls for Excel files, .ods for OpenOffice spreadsheets (similar to Excel but free), etc.).

Section 2: Variable Types

Quantitative or Categorical?

The type of a variable is either categorical or quantitative.

• A categorical variable places an individual into one of several categories.
• A quantitative variable takes numerical values, for which arithmetic operations such as averaging make sense.

E.g. the variable temperature is a quantitative variable, whereas the variable gender is a categorical variable.
Note: for a quantitative variable, it is important to give the units of measurement.

Section 3: The Various Sources of Data

Introduction

Statistics is the science of collecting, organising, and interpreting data.

• If there is a problem with the way we have collected data, this leads to problems with the analysis and interpretation of the data.

A strategy for using data in research:

• Identify the key research question: the question you want to answer.
• Decide on the population to be studied: the set of people/things to which your research question refers.
• Decide which variables to measure. These should be informed by your research question.
• Obtain data from the population of interest to answer your research question.

Sources of Data (I)

Anecdotal Data:

• Anecdotal data is haphazardly collected (such as data from your own experience). Note that anecdotal evidence is based on haphazardly selected individual cases, which often come to our attention because they are striking in some way. These cases need not be representative of any large group of cases.
• Anecdotal data is not usually a sound basis for drawing conclusions.

Sources of Data (II)

Available Data:

• Available data are data that were produced in the past for some other purpose but that may help answer a present question.

Sources of Data (III)

Collecting your own data:

• You can of course collect your own data. There are several ways of doing this.

Some Definitions When Collecting Data

Population and Sample:

• The entire group of individuals that we want information about is called the population. A sample is a part of the population that we actually examine in order to gather information.

Voluntary response sample:

• A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.

Census:

• A census is the procedure of systematically acquiring and recording information about every individual in a given population of interest.

Census vs Sample

When collecting your own data, you can take a census or take a sample. Surveying the whole population is called a census. Samples are often very informative, so taking a census is often a waste of time.
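To see why a sample can be nearly as informative as a census, the sketch below compares the mean of a whole simulated population with the mean of a simple random sample drawn from it. The population (10,000 heights in cm), the sample size of 200 and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 heights in cm (simulated values).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Census: measure every individual.
census_mean = statistics.mean(population)

# Sample: measure only n = 200 randomly chosen individuals.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"census mean: {census_mean:.1f}")
print(f"sample mean: {sample_mean:.1f}")
```

The two means should be close even though the sample measures only 2% of the population, which is the sense in which a census is often unnecessary.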

Lecture 2
Section 4: Observational Studies vs Experiments

Observational study: individuals are observed and variables of interest are measured, but there is no attempt to influence responses.
Experiment: some treatment is deliberately imposed on individuals, and we observe their responses.

We can use an observational study to find an association between two variables.

• The length of stay in a hospital is associated with the size of the hospital.

But often in science we are concerned with causation, not just association:

• Lack of sleep causes one’s attention span to decrease.
• Low temperature applied to water causes ice to form.

Important Definitions

Lurking variable: a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

• E.g. studies show that people’s level of fitness tends to increase with the amount of time they spend exercising weekly.

Confounding: two variables are confounded when their effects on a response variable cannot be distinguished from each other without further investigation. Confounded variables may be either explanatory variables or lurking variables or both.

A variable is called a confounding variable if:
(a) it is unobserved, and
(b) its effects on the response variable are hard to distinguish from the effects of the chosen explanatory variable without further investigation.

There are many possible explanations for an observed association:

• Common response (i.e., response to a common cause)
• Causation
• Confounding

Common Response

Note: Temperature is not observed. Both ice cream sales and heat stroke cases change in response to changes in temperature.
Note: Dashed lines mean “this was not looked at in the study.”
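A small simulation can make the idea of common response concrete. In the sketch below, ice cream sales and heat stroke cases are each generated from temperature plus independent noise, and neither depends on the other, yet they end up strongly correlated. All numbers (temperature range, slopes, noise levels) are invented assumptions.

```python
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical daily temperatures (degrees Celsius): the unobserved common cause.
temperature = [random.uniform(15, 40) for _ in range(365)]

# Both responses depend on temperature plus independent noise,
# and neither depends on the other.
ice_cream_sales = [20 * t + random.gauss(0, 50) for t in temperature]
heat_strokes = [0.5 * t + random.gauss(0, 2) for t in temperature]

r = pearson(ice_cream_sales, heat_strokes)
print(f"correlation between sales and heat strokes: {r:.2f}")
```

A study that recorded only sales and heat strokes would see the association but not the temperature that produces it.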

Causation

Note: The gravitational attraction of the moon causes the rise and fall of ocean tides.

Confounding

E.g. Body mass index, or BMI, is one way to assess whether your weight is in the healthy range. The BMI combines a person’s height and weight to form a measure that can help predict their risk of developing disease.

It is not uncommon to observe a strong association between the BMI of children and the BMI of their parents.

Note: Child diet is not observed. Here, it is a confounding variable: the influence of heredity is mixed up with influences from the child’s environment. Parent BMI and child diet are two potential causes of child BMI, but because of their association, they are confounded.

Why Do an Experiment?

An experiment allows us to demonstrate causation. We can make an intervention (the cause) and see whether or not there is an effect! In a carefully designed experiment, the intervention is the only possible explanation for any effect we observe, so we have demonstrated a cause-and-effect link.

Principles of Experimental Design

Example: Are smaller class sizes better? Observational studies suggest so, but small class sizes tend to happen in rich neighbourhoods. Hence the Tennessee STAR program experiment:

• The subjects were 6385 students beginning kindergarten.
• Each student was randomly assigned to one of three treatments: regular class (22 to 25 students) with one teacher; regular class (22 to 25 students) with one teacher and a full-time teacher’s aide; and small class (13 to 17 students).
• Each treatment was a level of a single factor: the size of the class.

The students stayed in the same kind of class for four years, with a single cohort of students progressing from kindergarten through third grade. After that, they went to regular classes. In later years the students from smaller classes had better results on the response variable, their marks on standard tests.

Principles of Experimental Design: Vocabulary

Subjects: individuals on which the experiment is done.
Treatment: a specific experimental condition applied to subjects.
Factor: an explanatory variable in the experiment, i.e., a variable that is manipulated in different treatments.
Levels: each treatment is formed by combining a specific level of each of the factors in the experiment.
Response variable: the variable of primary interest that is measured on subjects, after treatment.

Compare, Randomise and Repeat

All experiments should employ the following principles:

• Compare two (or more) treatments. One group of subjects should be a suitably chosen control group (e.g., subjects receiving a dummy treatment, i.e., a placebo). These control subjects will be compared to subjects in the treatment group (receiving the treatment of interest).
• Randomise assignment of subjects to treatments (to remove selection bias).
• Repeat the treatment on many subjects, to reduce chance variation.
• Replicate the entire treatment on another group of subjects.
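The "randomise" principle can be sketched in a few lines: shuffle the subjects, then deal them out to the treatments in turn. The 30 subject IDs and the treatment labels (which paraphrase the class-size example) are hypothetical, and equal group sizes are an assumption of this sketch rather than a requirement stated in the notes.

```python
import random

random.seed(42)

# Hypothetical subject IDs; the treatment labels paraphrase the STAR example.
subjects = [f"student_{i}" for i in range(1, 31)]
treatments = ["regular", "regular_with_aide", "small"]

# Shuffle, then deal subjects out in turn so group sizes are equal.
shuffled = subjects[:]
random.shuffle(shuffled)
groups = {t: shuffled[i::len(treatments)] for i, t in enumerate(treatments)}

for name, members in groups.items():
    print(name, len(members))
```

Because chance alone decides who lands in which group, the experimenter cannot (even unconsciously) steer particular subjects towards a treatment.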

Independent vs Dependent Variables

It is very important in a study/experiment to be able to recognise the explanatory variables (also called independent variables) and the response variable (also called the dependent variable).

Flow Charts

Flow charts are useful for giving an outline of the design of an experiment. For the effects of advertising example, the following diagram/flow chart could be used:

Notice the random allocation of individuals to each treatment.

Choice of Control Group

The control group should differ from other treatments only in the application of the treatment of interest. A common type of control group is a placebo, i.e., a dummy treatment. This is sometimes called an experimental control. It controls for unforeseen effects that experimental manipulation may have on subjects. E.g., “the placebo effect”: patients often feel better when a doctor gives them a treatment, no matter what the treatment is!

Summary

• Various sources of data are available, some more reliable than others for drawing conclusions in a research study.
• Understand the difference between a census and a sample.
• Distinguish between observational studies and experiments.
• Association does not mean causation.
• The three possible explanations for an observed association: common response, causation, confounding.
• An experiment allows us to demonstrate causation.
• For a proper experiment, one should always compare, randomise, repeat and replicate.

Section 5: Experimental Design

Types of Experiment (I)

Randomised comparative experiment (RCE): in an RCE, subjects are randomly allocated to one of several treatments, and responses are compared across treatment groups.

Types of Experiment (II)

Matched pairs designs are a special type of randomised comparative experiment that can produce much more precise results than complete randomisation, because we are controlling for variation in response across pairs. Common examples of matched pairs designs:

• Identical twin studies: these allow us to control for genetics!
• Before-after experiments: we take two measurements on each subject, and control for variation across subjects.

Matched pairs design: in a matched pairs experiment (MPE), we break subjects into pairs (that have similar properties) and apply each of the two treatments to one subject from each pair.
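A minimal sketch of the assignment step in a matched pairs design: within each pair, chance decides which member receives which treatment. The pair labels and treatment names are hypothetical.

```python
import random

random.seed(7)

# Hypothetical pairs of similar subjects (for example, identical twins).
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2"), ("D1", "D2")]

assignment = {}
for pair in pairs:
    # Randomly order the two members; the first gets treatment 1.
    first, second = random.sample(pair, 2)
    assignment[first] = "treatment_1"
    assignment[second] = "treatment_2"

for subject, treatment in sorted(assignment.items()):
    print(subject, treatment)
```

Every pair contributes exactly one subject to each treatment, so differences between pairs cannot be mistaken for a treatment effect.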

Types of Experiment (III)

Block: a block is a group of subjects known before the experiment to be similar in some way that might affect their response to treatment.

• E.g. one block of men and one block of women.

In a randomised block design, the random assignment of subjects to treatments is carried out separately within each block. A matched pairs design is a special case of a randomised block design, where the blocks are the pairs, and there are two treatments.

Cautions about Experiments

Below are some common problems when designing experiments:

• Choose an appropriate control. The only thing that should vary across treatments is/are the factor(s) of interest. (Use a placebo?)
• Beware of bias. If the administrator of the treatment knows what is being applied, this may (even subconsciously) bias the way they work with the subject. Hence in a double-blind experiment, neither the subject nor the administrator of the treatment knows what is being applied.
• Replicate, i.e., repeat the entire treatment independently for different subjects. When repeating a treatment for different subjects, it is important that all treatment steps are repeated: applying the treatment in one go to 1000 subjects is not the same as applying it separately to ten groups of 100 people each. Why?
• Try to make your experiment realistic. For experiments to say anything about the real world, they need to duplicate real-world conditions!



Lecture 3
Section 1: Numerical Summaries for One Variable

Numerical summaries are ways of summarising the key properties of data using a few numbers.

Exploratory Analysis

An exploratory analysis consists of describing the main features of the data in a data set. This description can be done by providing numerical summaries of the variables involved, such as:

• proportions or percentages
• mean or average
• median
• interquartile range (IQR)
• standard deviation

Numerical Summary for a Categorical Variable

We can summarise one categorical variable using a table of frequencies:

• a list of the possible categories,
• together with the count, percentage or proportion of cases in each category.
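A table of frequencies can be built directly from the raw observations. The sketch below uses Python's `collections.Counter` on a short list of made-up capture locations; the category names and counts are purely illustrative.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (capture location).
locations = ["north", "south", "north", "east", "north", "south", "east", "north"]

counts = Counter(locations)
n = len(locations)

# One row per category: count, proportion and percentage of cases.
table = {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}

for cat, (count, prop, pct) in sorted(table.items()):
    print(f"{cat:<6} {count:>3} {prop:>6.3f} {pct:>6.1f}%")
```

The counts sum to the sample size and the proportions sum to 1, which is a quick check that no case was dropped.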

Numerical Summary for a Quantitative Variable

When summarising a quantitative variable numerically, make sure to include measures of:

• the location (where most of the data are), and
• the spread (or variability) of the data.

Measure of Location

Measures of location tell us how large (or small) the typical value is. A useful measure of location could be the mean.

• The mean is just another name for what is commonly called the average value of a set of numbers.

The Median – An Alternative to the Mean

On the other hand, the median is the middle observation in the data set. In the salary example, this value is $70,000, a much better representation of what an employed engineer earns with the firm.

Median: the median is the “middle value”.

Mean v Median

Outliers are usually large (or small) data values that tend to be quite far away from where the bulk of the data is contained. The mean can be grossly affected by outliers, whereas the median is robust to outliers. Because the mean can be grossly affected by outliers, it can be meaningless. In that case you should use the median. Unlike the median, the mean does NOT have to be around the middle of the data.

The Quartiles Q1 and Q3

Recall that the median splits our data into two groups which each contain roughly 50% of the data. We can further examine our data by looking at the medians of the bottom and top halves of the data. These measures are known as the first and third quartiles. The quartiles divide the data into four groups which each contain roughly 25% of the data (i.e., a quarter of the data, hence the name). Note that the second quartile Q2 is just the median (M = Q2).

• Q1: the first quartile Q1 is the median of the observations whose positions in the ordered data are to the left of the location of the overall median (inclusive).
• Q3: the third quartile Q3 is the median of the observations whose positions in the ordered data are to the right of the location of the overall median (inclusive).
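The sketch below illustrates two points from this part of the notes: how an outlier moves the mean far more than the median, and how the quartiles can be computed as medians of the lower and upper halves. The salary figures are invented. The handling of "(inclusive)" is one reading of the definition (for an odd number of observations the middle observation is counted in both halves), so other software may return slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    Assumption: for an odd number of observations the middle observation
    is included in both halves; for an even number the data split evenly.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    lower, upper = data[:half], data[n - half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

# Hypothetical salaries in thousands of dollars.
salaries = [50, 60, 70, 80, 90]
with_outlier = salaries + [1000]

print(statistics.mean(salaries), statistics.median(salaries))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
print(quartiles([1, 2, 3, 4, 5, 6, 7]))
```

In this made-up example, adding the single extreme salary moves the mean from 70 to 225, while the median only moves from 70 to 75.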

Measure of Spread

A measure of location (such as the mean or median) alone can be misleading. For example, two countries with the same median family income are very different if one has extremes of wealth and poverty, and the other has little variation among families. So in addition to reporting the location of our data we should also report the spread of our data.

IQR – A Measure of Spread

A simple measure of spread is the interquartile range.

Standard Deviation

Another measure of spread is the standard deviation.
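The notes name these two measures without giving formulas, so the sketch below relies on the usual definitions as assumptions: the IQR is taken as Q3 − Q1 (with quartiles computed as medians of the lower and upper halves of the ordered data), and the standard deviation is the sample version with divisor n − 1. The data values are made up.

```python
import math
import statistics

data = sorted([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical measurements
n = len(data)

# IQR = Q3 - Q1, with quartiles taken as medians of the lower and upper halves.
half = (n + 1) // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[n - half:])
iqr = q3 - q1

# Sample standard deviation, written out from its definition (divisor n - 1).
mean = sum(data) / n
s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print("IQR:", iqr)
print("standard deviation:", round(s, 3))
```

The hand-written standard deviation agrees with `statistics.stdev(data)`, which uses the same n − 1 divisor.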

Recommended Numerical Summaries

If you want to numerically summarise one quantitative variable, use:

Summary

• The type of variable determines the choice of the numerical summary.
• Numerical exploratory analysis (a.k.a. numerical summary) on one variable.
• Calculating the location and spread of data using numerical summaries.
• Limitations of some of these statistical tools.

Section 2: RStudio and Graphical Summaries

The Role of Graphs

Graphing data is a key step in analyses, and it is important to use the appropriate graphs for your situation. A major goal of statistics is to make data informative.

Recommended Graphical Tools

Which type of graph to use depends on:

• Whether you are summarising one variable or looking at the relationship between two variables.
• Whether the variables are quantitative or categorical.
• If you want to graphically summarise one variable:
  o and it is categorical: a bar chart,
  o and it is quantitative: a histogram or a boxplot.

What to Look for in a Graph

When commenting on a graph of a quantitative variable, consider describing the overall pattern of the data in terms of:

• the location (where most of the data are) and spread (or variability) of the data;
• the shape of the data (symmetric, left-skewed or right-skewed).

And also, indicate:

• if there are any unusual observations (called suspected outliers).
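To make these features concrete without any plotting library, the sketch below prints a rough text histogram. It is written in Python rather than in RStudio, and the data values and the bin width of 5 are invented; the code assumes non-negative integer data.

```python
from collections import Counter

# Hypothetical right-skewed data: most values are small, a few are large.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 15]
bin_width = 5

# Count how many observations fall in each bin: 0-4, 5-9, 10-14, 15-19.
bins = Counter((v // bin_width) * bin_width for v in values)

for start in range(0, max(values) + 1, bin_width):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'*' * bins[start]}")
```

Most of the stars sit in the lowest bin (location), the bars tail off to the right (right-skewed shape), and the lone value in the last bin, separated by an empty bin, is a suspected outlier.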

Other Graphical Tools

