Stat2114 LECT Notes PDF

Title Stat2114 LECT Notes
Author ishan puri
Course Abstract Algebra
Institution George Mason University
Pages 147


STAT2114

Week 1: Introduction to sample surveys

What is a survey?
• An examination of an aggregate unit, usually:
  o Human beings
  o Economic institutions
  o Social institutions
• By a systematic collection of data from the unit

Purpose of a survey
• To provide information (collected data) for answering a research question, testing a study hypothesis, or examining the impact of various factors on some (e.g. social) phenomenon

Often the purpose is descriptive, e.g.
• How much do people spend on food?
• Of those entitled to claim unemployment benefits, what proportion actually do?
• What brand of toothpaste do people use?
• Which brand of milk do people prefer?

The purpose can also be comparative, e.g.
• To test some hypothesis suggested by sociological or other theory, e.g.
  o That domestic violence is associated with particular income and educational levels
  o That parents who physically abuse their children were often themselves physically abused as children
• To assess the influence of various factors, which can be manipulated by public action, upon some social phenomenon, e.g.
  o Phenomenon: numbers of 16-20-year-olds starting to smoke
  o Factors: banning smoking in public places; restricting sales of cigarettes to people over 18; anti-smoking campaigns in the media

What is a sample survey?
• A sample survey is a survey where data are collected on a sample of individuals from a population of interest, via some kind of statistical sampling procedure/method
• A new concept: the finite nature of the population (Note: the statistical sampling you have considered so far generally assumes an infinite population, or sampling with replacement.)
• Note: we hope that the sample is “representative” of the population!

What is sample survey design?
• Statistics is used to gather information from a sample to make inferences about a target population
• A target population is the population about which information is wanted
• A sample is usually a subset drawn from a target population
• To provide accurate information about the population, a good sample needs to be representative of the population
• Methods/procedures for selecting a good (survey) sample are referred to as sample survey designs


Alternative to sampling: complete enumeration of the population, i.e. a census

Advantages of a sample survey over complete enumeration
• Reduced cost
• Greater speed
• Greater scope
• A census, by contrast, is limited because people lack the patience to answer many questions and tend to provide inaccurate answers

Why bother with a census?
• Statutory requirement: in Australia, an Act of Parliament requires a census every 5 years
• Important for determining electoral boundaries

Technical terms/definitions:
• Element: an object on which one or more measurements are taken, also known as an observation unit
• Population: a collection of elements about which we wish to make inferences
• To take a sample we need a list of the elements in the population
  o Often such lists do not exist
  o There may be a list of disjoint collections of elements which cover the population
• Unit: a collection of elements
• Sampling units: non-overlapping units, covering the entire population, from which the sample is selected (each unit is a collection of elements)
  o E.g. population of interest = all school kids
  o Element = a kid attending school
  o Unit = school in NSW
  o In most cases, the unit is the same as the element
• Sampling frame: a list of sampling units
  o E.g. schools in NSW form the first frame
  o Children within a school form a second frame
• A sample is a collection of units drawn from a frame (or frames)

Principal steps in a sample survey:
• Establishing the survey objective or purpose
• Defining the population to be sampled
• Choosing appropriate sampling frame(s)
• Selecting a suitable sample survey design
• Deciding on the method of obtaining data
• Defining the data (variables) to be collected
• Creating and pre-testing the survey questionnaire
• Collecting, managing and analysing the data

Establishing survey objective or purpose
• The central purpose behind the survey must be established
  o For a survey of travel patterns, for example, each of the following aims would produce a different design and structure:

    - Obtain a detailed description of public transport (trains, buses and ferries) usage
    - Develop a predictive model of travel behaviour
    - Find mobility patterns of population subgroups, e.g. disabled people, senior citizens
    - Assess the likely impact of changes in transport policy
• With the central purpose established, itemise sub-topics that relate to that purpose
  o E.g. for the aim “Assess likely impact of changes in transport policy”, possible sub-topics include:
    - Current roles of public & private transport
    - Likely growth in car numbers
    - Safety considerations
    - Attitudes to road building, tolls, parking restrictions, etc.
• Formulate specific information requirements relating to these sub-topics
  o E.g. for the sub-topic “Current roles of public & private transport” we need detailed information:
    - Record journeys for each unit (person)
    - Mode of transport
    - Purpose
    - Respondents’ perceptions of the suitability of public transport for various types of journey

Defining the population to be sampled
• Decide which groups of people are relevant to the survey. This is the population.
• Some may be difficult to include:
  o Very young children
  o People in hospitals, institutions
• Some may be ruled out on budgetary grounds
Note:
• The population definition should cover those who will not participate in the survey as well as those who will
• Identify the population unambiguously
  o E.g. if we are sampling a population of farms, “farm” must be clearly defined so that borderline cases can be decided

Choosing the sampling frame:
• Once the target population is clearly defined, choose a sampling frame (or frames) whose list of units covers all the units in the population, or at least agrees closely with the target population.
• With the frame(s) formed, we then need to decide how many units to select.

How big should the sample be?
• To work out how big the sample should be, in general we need to know the degree of precision required
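
To illustrate how the required precision drives the sample size, here is a minimal sketch. It is not taken from the notes: it assumes simple random sampling from a very large population, the usual normal approximation for a proportion, and a 95% confidence level. The function name and its defaults are illustrative only.

```python
import math


def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Approximate sample size for estimating a proportion.

    Assumptions (not from the notes): simple random sampling from a very
    large population and the normal approximation n = z^2 * p * (1 - p) / margin^2.
    z = 1.96 corresponds to 95% confidence; p = 0.5 is the most conservative
    guess for the unknown proportion.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be a proportion between 0 and 1")
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# A margin of error of 5 percentage points
n_5pct = sample_size_for_proportion(0.05)
# A tighter margin of 2 percentage points needs a much larger sample
n_2pct = sample_size_for_proportion(0.02)
print(n_5pct, n_2pct)
```

Under these assumptions, halving the margin of error roughly quadruples the sample size, which is why the degree of precision has to be fixed before the sample size can be chosen. For a small finite population the required size would be somewhat smaller than this approximation suggests.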

Deciding the sampling design/method
• Now we need to choose an appropriate sampling design/method to obtain the sample of units from the target population, so that sufficient and useful information can be obtained from the sample.
• There are a number of sampling designs, including probability and non-probability sampling. We’ll come back to this a bit later in this lecture.

Deciding the method of obtaining data:
• Personal interviews
• Telephone survey
• Mail or email survey
• Online (web) survey (very common nowadays)
• Other

Pre-test the survey questionnaire (pilot study):
• It is very useful to try out the questionnaire on a small group of 30 to 100 people, obtaining feedback and studying their responses
• After testing the questionnaire, improve it by:
  o Refining wording
  o Fixing layout and ordering
  o Pruning the questionnaire to a reasonable length

Response rate:
• Response rates > 85% are rare
• Commonly 75%-80% is considered good
• Postal or email surveys usually achieve a much lower response rate (for example, because they are treated as junk mail)

Note: The analyst needs to consider whether the nonrespondents are missing at random, or whether there is some common reason for nonresponse. The latter will lead to serious bias in the results of the survey. (Read Lohr, Chapter 8)

Example: telephone surveys
• Market research companies routinely conduct surveys over the telephone.
• They usually ring in the early evening when people are likely to be at home. (In Sydney, usually 1/3 accept the call and 1 in 15 calls results in an interview.)
• Who will the respondents be?
• How will this bias the results of the surveys?
• Example (continued)
  o “If the nonrespondents tend to differ from the respondents, then the biases in the results from using only the respondents may make the entire study worthless.”
  o “Many surveys reported in academic journals on purchasing, for example, have response rates between 10% and 15%. It is difficult to see how anything can be concluded about the population in such a survey.”

Different types of sampling design

Probability sampling:
• Every possible sample has a known probability of being selected
• This allows us to make such statements as:
  o Our estimate is UNBIASED
  o We are 95% confident that our estimated proportion is within 2% of the true population proportion

Most of this course will be concerned with probability sampling, including:

Simple random sampling (to be covered in Week 2):
• Every possible sample has the same probability of selection (so each subject in the target population has the same chance/probability of being selected into the sample).

Stratified random sampling (to be covered in Weeks 3 & 4):
• If a population can be broken down into non-overlapping subpopulations (called strata), it may make sense to take probability samples within each stratum, particularly if the variability of whatever is being measured is small within a stratum, but there are large variations between strata.
• In that situation, estimates of population parameters generally have much smaller variances than under simple random sampling.

Other types of probability sampling will also be considered:
• Cluster sampling (to be covered in Week 6)
• Systematic sampling (also in Week 6)

Non-probability sampling: quota sampling
• E.g. a certain population contains
  o 60% males
  o 40% females
• A simple random sample almost certainly won't contain exactly 60% males
• Quota sampling selects subjects one at a time, until there are exactly 60% males
• Quota sampling takes a variety of forms but all have the common feature:

  o Once the general breakdown of the desired sample has been decided (how many men, women, number in each age group, etc.), the choice of actual sampling units is left up to the interviewer.
• It can be seen as a form of stratified sampling, but with unknown selection probabilities
• Quotas given to interviewers are calculated from available data (census data etc.), so that the classes (e.g. genders, age groups, ethnic groups) are represented in the correct proportions.
• A factor, e.g. gender, is chosen for quota control if it separates the population into strata which may differ in their opinions on the subject under study.
• Age and gender are widely used for quota controls: interviewers have little trouble deciding to which group a respondent belongs!
• But quota controls should take into account differences between groups of people in the probability of being available for interview.
  o E.g. if working women differ in opinions from non-working women, and interviews are conducted by day-time home visits, then the sample will be seriously biased.
• Unlike gender, social class is much more difficult to determine, so there is no reliable statistical basis for setting quotas.
  o The definition of social class usually involves a combination of objective factors (occupation, income, education) and subjective factors (appearance, speech, etc.).
  o The definition of social class is vague. If it is used for quota control, it leaves some play for the interviewer’s subjective judgement and thus may introduce bias.

Problem with quota sampling:
• Is it representative of the target population in all respects?
• Interviewers tend to select people readily at hand (easy availability)

Main arguments against quota sampling
• Can't calculate the likely sampling error
• Often fails to be representative
  o E.g. a top age group of age ≥ 65 would be filled mainly by people in the 65 to 70 years range. Older people would be underrepresented.
• Social class problem
• Strict control of fieldwork is hard, as it's not easy to check whether interviewers place people in the correct category, or in a category that is hard to fill

Main arguments in favour of quota sampling
• Less costly
  o But this argument is difficult to assess. The more quota controls, the more expensive the survey, yet the controls are needed to avoid selection bias.
  o There is evidence that quota samples have greater sampling variability than probability samples.
  o Quota and probability sampling should be compared in terms of cost for a prescribed level of precision.
  o No call-backs required
  o Don’t have to travel long distances to track down pre-selected respondents
• Easy administratively
  o Avoids problems of random sample selection, such as:

    - non-contactable people
    - call-backs
• Fieldwork can be done quickly
• Quota sampling does not require sampling frames, which may be difficult to obtain.
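
The quota sampling discussion notes that a simple random sample will almost certainly not contain exactly 60% males. The sketch below puts a number on that remark using the hypergeometric distribution. The population size (1000, of whom 600 are male) and the sample size (100) are hypothetical values chosen for illustration; they do not come from the notes.

```python
import math


def prob_exact_count(pop_size, pop_males, sample_size, sample_males):
    """Probability that a simple random sample (without replacement)
    contains exactly `sample_males` males: the hypergeometric probability."""
    ways_males = math.comb(pop_males, sample_males)
    ways_females = math.comb(pop_size - pop_males, sample_size - sample_males)
    ways_total = math.comb(pop_size, sample_size)
    return ways_males * ways_females / ways_total


# Hypothetical population: 1000 people, 600 of them male (60%).
# Probability that an SRS of 100 contains exactly 60 males.
p_exact = prob_exact_count(1000, 600, 100, 60)
print(f"P(exactly 60 males in a sample of 100) = {p_exact:.3f}")
```

With these illustrative numbers the probability of hitting the 60% split exactly is small, so most simple random samples miss it by at least a little. Quota sampling forces the split, but at the price of unknown selection probabilities.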

Quota sampling and opinion polls:
• Prior to and including 1948, US Presidential polls were based on quota sampling.
• Quota sampling was first seriously questioned after the 1948 election.
• Gallup polls predicted:
  o Truman (Democrat): underdog (44.5%)
  o Dewey (Republican): favourite (49.5%)
• Actual results were:
  o Truman: 49.5%
  o Dewey: 45.1%
• Note: quota controls were sex, age, education, colour and veteran status; the lower education group was largely under-represented in all polls.

Sources of error in surveys
• A sample cannot give complete information. There will always be errors in the estimation of population parameters of interest.
• Such errors are known as sampling errors.
• Sampling errors are controlled and estimated by carefully designing the survey.

Other, non-sampling errors:
• Non-response
• Inaccurate responses
• Selection bias

Non-response:
Non-response may be in
• individual question(s)
• the entire questionnaire (e.g. never at home, refuse to respond, fail to mail back questionnaire)
Errors may be introduced as respondents may differ from non-respondents in important ways.
Example: survey to determine attitudes to an increase in the parking fee on campus
• It is likely that mainly those strongly opposed will reply.
Note: non-response in postal surveys is a particular problem, as there are no clues to the characteristics of non-respondents (e.g. approximate age, gender, etc.).

Postal surveys:
Respondents are likely to be:
• favourably disposed to the survey’s aims
• politically/socially active
• in a higher socio-economic group
• receptive to new ideas
• rapid decision-makers
• high achievers
• used to communication by post and filling out forms
Non-respondents (or late respondents) are likely to be:

• elderly, withdrawn
• living in urban, rather than suburban, areas
• feeling that they will be judged by their responses
• feeling inadequate to supply information

Dealing with non-response:
• Call-backs
  o plan a fixed number of call-backs
  o call-backs at different times of day and different days of the week
• Rewards and incentives
  o offer a reward for response
  o rewards should be offered to potential participants only after they have been selected

Inaccurate responses:
• Sensitive issues:
  o Did you cheat on your tax form?
    - Most respondents will say they didn’t (sometimes anonymous questionnaires are better for such questions)
• Ambiguous questions
  o Are you a social drug user?
    - Occasional? Regular? Once?
    - Is alcohol a social drug?
  o Number of years of education?

Selection bias:
Example: the Literary Digest poll
• In 1936, F. D. Roosevelt completed his first term in the White House.
• Most predicted FDR to have an easy victory in the presidential race against the Republican Alfred Landon.
• Not so the Literary Digest, which predicted an overwhelming victory for Landon: Landon 57%, FDR 43%.
Note:
• Literary Digest had correctly predicted the results of all presidential elections since 1916.
• The 1936 prediction was based on the largest poll in history: 2.4 million respondents!
• But FDR won by a landslide, 62% to Landon’s 38%
• What went wrong?
  o The questionnaire was mailed to 10 million people
  o There were 2.4 million respondents
  o The mailing list was compiled from
    - Telephone books
    - Club membership lists
    - Automobile club lists
  o Hence the mailing list screened out the poor, which introduced selection bias

Wording of questions
• Vital, as the wording can steer respondents towards a supposedly correct answer, which they then feel inclined to choose
Biased wording:

• Are you in favour of Australia becoming a republic? (The wording suggests that it is ‘correct’ to be in favour of a republic.)
vs
• Are you in favour of or opposed to Australia becoming a republic? (This is neutral.)

Effect of the question on survey responses
Factual questions, examples:
• What is your regular hourly rate of pay on this job as of 30th April?
• When asked “How many rooms in the household?”, 1 in 6 gave the wrong answer.
• Answering factual questions is not always as easy as it might seem.
• Need a precise definition of the “fact” to be collected. What is counted as a room here?
Opinion questions, examples:
• E.g. “As you know, many older people share a home with their grown children. Do you think this is generally a good idea or a bad idea?”
  o For this kind of question, you need to be careful with the wording.

SGTA exercises week 1:

Week 2: Simple random sampling

Some probability concepts
Say we have a random variable Y which can take on the values $y_1, y_2, \ldots, y_k$ with probabilities $p_1, p_2, \ldots, p_k$. Then Y is a discrete random variable.

We must have $p_i \ge 0$ for all $i$ and $\sum_{i=1}^{k} p_i = 1$.

Expected value of a discrete random variable
Expected value of Y: $E(Y) = \mu = \sum_{i=1}^{k} y_i p_i$

Example: expected value
You are plagued by possums in your roof at night. Let the random variable Y be the number of possums in your roof on a particular night. After careful observation over a period of time, you have determined the following probabilities for Y:

y          0     1     2     3
P(Y = y)   0.3   0.4   0.2   0.1

What these probabilities mean is:
• On 30% of nights there are no possums, on 40% of nights 1 possum, etc.
• The probability that you will observe no possums on a randomly chosen night is 0.3, one possum 0.4, etc.

Expected number of possums: $E(Y) = 0 \times 0.3 + 1 \times 0.4 + 2 \times 0.2 + 3 \times 0.1 = 1.1$
Note: this gives the average number of observed possums per night in the long run.

Expected value of a function of a discrete random variable
Say we are interested in a function of Y, g(Y). Examples: $g(Y) = Y^2$, $g(Y) = \sqrt{Y}$
The expected value of g(Y) is given by $E[g(Y)] = \sum_{i=1}^{k} g(y_i) p_i$

Possums example: let $g(Y) = Y^2$. The expected square of the number of possums is
$E(Y^2) = 0^2 \times 0.3 + 1^2 \times 0.4 + 2^2 \times 0.2 + 3^2 \times 0.1 = 2.1$

Variance of a discrete random variable
The variance of a discrete random variable Y is defined as $\mathrm{Var}(Y) = \sigma^2 = E[(Y - \mu)^2]$

It is the expected squared deviation of Y from its mean.


Alternative formula for the variance
$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2$
Possum example: $E(Y) = 1.1$ and $E(Y^2) = 2.1$, so $\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = 2.1 - 1.1^2 = 0.89$
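
The possum calculations can be checked numerically. The sketch below uses the probabilities implied by the worked example (0.3, 0.4, 0.2, 0.1 for 0, 1, 2, 3 possums); the helper name `expected_value` is illustrative only.

```python
# Possum distribution from the worked example: number of possums -> probability
pmf = {0: 0.3, 1: 0.4, 2: 0.2, 3: 0.1}


def expected_value(pmf, g=lambda y: y):
    """E[g(Y)] = sum of g(y) * P(Y = y) over all possible values y."""
    return sum(g(y) * p for y, p in pmf.items())


mean = expected_value(pmf)                                # E(Y)
mean_sq = expected_value(pmf, lambda y: y ** 2)           # E(Y^2)
var_def = expected_value(pmf, lambda y: (y - mean) ** 2)  # E[(Y - mu)^2]
var_alt = mean_sq - mean ** 2                             # E(Y^2) - [E(Y)]^2

print(round(mean, 4), round(mean_sq, 4), round(var_def, 4), round(var_alt, 4))
```

Both variance formulas should agree with each other, and with the hand calculation, up to floating-point rounding.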

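Covariance
For two random variables $Y_1$ and $Y_2$ with means $\mu_1 = E(Y_1)$ and $\mu_2 = E(Y_2)$, the covariance between $Y_1$ and $Y_2$ is defined as
$\mathrm{Cov}(Y_1, Y_2) = E[(Y_1 - \mu_1)(Y_2 - \mu_2)]$

Variance of a sum of random variables
It can be shown, by extending the pattern seen for the two-variable case, that for n random variables
$\mathrm{Var}\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} \mathrm{Var}(Y_i) + 2 \sum_{i<j} \mathrm{Cov}(Y_i, Y_j)$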
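
As a rough check of the two-variable case, Var(Y1 + Y2) = Var(Y1) + Var(Y2) + 2 Cov(Y1, Y2), the sketch below compares both sides of the identity on a small joint distribution. The joint probabilities are invented purely for illustration and do not come from the notes.

```python
# Hypothetical joint distribution of (Y1, Y2); probabilities invented for illustration
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def expect(g):
    """E[g(Y1, Y2)]: sum of g(y1, y2) * P(Y1 = y1, Y2 = y2) over the joint distribution."""
    return sum(g(y1, y2) * p for (y1, y2), p in joint.items())


mu1 = expect(lambda a, b: a)
mu2 = expect(lambda a, b: b)
var1 = expect(lambda a, b: (a - mu1) ** 2)
var2 = expect(lambda a, b: (b - mu2) ** 2)
cov12 = expect(lambda a, b: (a - mu1) * (b - mu2))

# Left-hand side: variance of the sum computed directly from its definition
var_sum_direct = expect(lambda a, b: (a + b - mu1 - mu2) ** 2)
# Right-hand side: sum of variances plus twice the covariance
var_sum_formula = var1 + var2 + 2 * cov12

print(round(var_sum_direct, 6), round(var_sum_formula, 6))
```

The two printed values should coincide. If the covariance were zero, the variance of the sum would reduce to the sum of the variances.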

Also,


Finite populations of size N The population values for a finite population of size N are denoted as

The population variance is

Simple random sampling (SRS)
• Appears at some stage, or in some form, in most sampling procedures
• Easiest to deal with analytically
• Appropriate if one has no prior information about structure in the population to be sampled
• SRS: a simple random sample is a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected
• Note: sampling is done without replacement
  o This procedure is called simple random sampling
  o The sample obtained is a simple random sample
• A consequence is that each individual in the population of size N has exactly the same chance/probability (= n/N) of being selected into the sample.
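
The SRS definition can be illustrated with a small sketch. It draws a sample without replacement using Python's standard `random` module (a stand-in for whatever selection method the course uses), then enumerates every possible sample for a toy population of N = 5 and n = 2 to confirm that each unit appears with probability n/N. The population size, sample size and function name are illustrative assumptions.

```python
import itertools
import random
from collections import Counter


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Toy population labelled 1, ..., N (illustrative sizes)
N, n = 5, 2
labels = list(range(1, N + 1))

# One possible sample
sample = simple_random_sample(labels, n, seed=1)

# Enumerate all possible samples of size n; under SRS each is equally likely
all_samples = list(itertools.combinations(labels, n))

# Inclusion probability of each unit = share of possible samples that contain it
counts = Counter(unit for s in all_samples for unit in s)
inclusion_prob = {unit: counts[unit] / len(all_samples) for unit in labels}

print(sample)
print(len(all_samples), inclusion_prob)
```

Here there are 10 possible samples and each unit appears in 4 of them, giving an inclusion probability of 4/10 = n/N. Equal inclusion probabilities follow from SRS but do not by themselves define it; the defining property is that every sample of size n is equally likely.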

Selecting a SRS in practice
• Label the population 1, 2, …, N



use the table ...

