GEA1000 LECTURE NOTES PDF

Title GEA1000 LECTURE NOTES
Course Career Catalyst
Institution National University of Singapore
Pages 54


GEA1000 Consolidated Notes

Population
- The entire group (of individuals or objects) that we want to know something about.

Types of research question, with examples

1. Make an estimate about the population
- What is the average number of hours that students study each week?
- What proportion of all Singapore students is enrolled in a university?

2. Test a claim about the population
- Is the average course load for a university student greater than 20 units?
- Do the majority of students qualify for student loans?

3. Compare two subpopulations
- In university X, do female students have a higher GPA than male students?
- Are student athletes more likely than non-athletes to do final year projects?

4. Investigate a relationship between two variables in the population (compares two subpopulations)
- Is there a relationship between the average number of hours students spend each week on Facebook and their GPA?
- Does drinking coffee help students pass the math exam?

SAMPLING

Population vs Sample
- Population of interest: the group about which the researcher wants to draw conclusions.
- Population parameter: a numerical fact about a population.
- Sample: a portion of the population selected for the study.
- Estimate: an inference about the population’s parameter, based on the information obtained from the sample.

Sampling Frame
- The ‘source material’ from which a sample is drawn.
- The sample is drawn from the sampling frame.
- The sampling frame may not cover the whole population of interest.
- One of the conditions for generalisability: the sampling frame is equal to or greater than the population of interest.

Census vs Sample
- Census: an attempt to reach out to the entire population of interest. No sampling frame.
- Sample: a portion of the population selected for the study. Drawn from a sampling frame.
- Why sample instead of collecting data on the whole population? Cost and speed.

In short: you have a population you want to know something about. A sample is when you gather data on a chunk of the population; a census is when you gather data on the entire population.

BIAS

Selection Bias
- Associated with the researcher’s biased selection of units.
- Causes: imperfect sampling frame; non-probability sampling.

Non-response Bias
- Associated with the participants’ non-disclosure of information related to the study.
- Causes: participants are disinterested, find it inconvenient, or are unwilling to disclose sensitive information.

PROBABILITY SAMPLING

What is it?
- A sampling scheme in which the selection process is via a known randomised mechanism.
- The probability of selection need not be the same for all units of the population.
- The general idea is to have an element of chance in the selection process, so as to eliminate biases associated with selection.

Types of Probability Sampling

Simple random sample
- What is it: units are selected randomly from the sampling frame. Mechanism: random number generator (Eg: random-digit dialling). Sample results do not change haphazardly from sample to sample; variability is due to chance.
- Advantages: good representation of the population.
- Disadvantages: time-consuming; subject to non-response and accessibility of information.

Systematic sample
- What is it: a method of selecting units from a list by applying a selection interval k and a random starting point from the first interval.
- Advantages: simpler selection process as opposed to simple random sampling.
- Disadvantages: potentially underrepresents the population if the list is non-random.

Stratified random sample
- What is it: breaking down the population into strata. Units within each stratum are similar in nature, but size may vary across strata. A simple random sample is taken from every stratum. Example: sample count (General Election).
- Advantages: good representation of the sample by stratum.
- Disadvantages: requires a sampling frame and criteria for classifying the population into strata.

Cluster random sample
- What is it: breaking down the population into clusters, randomly sampling a fixed number of clusters, and including all observations from the selected clusters. Example: mental wellness surveys in schools.
- Advantages: less time-consuming and less costly.
- Disadvantages: requires a larger sample size in order to achieve a low margin of error. High variability due to dissimilar clusters or a small number of clusters.
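The four sampling plans can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 100 unit IDs, the two strata, the ten "school" clusters and all function names are hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n distinct units from the sampling frame purely by chance."""
    return rng.sample(frame, n)

def systematic_sample(frame, k, rng):
    """Pick a random start within the first interval of length k, then every k-th unit."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from every stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random(0)
frame = list(range(1, 101))  # hypothetical sampling frame of 100 unit IDs
strata = {"year1": frame[:60], "year2": frame[60:]}  # hypothetical strata of unequal size
clusters = {f"school{i}": frame[i * 10:(i + 1) * 10] for i in range(10)}  # hypothetical clusters

srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
```

Note how the systematic sample is fully determined once the random start is drawn, which is why a patterned (non-random) list can lead to a poor sample.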

NON-PROBABILITY SAMPLING

Convenience sampling
- What is it: the researcher uses the subjects that are most easily available to participate in the research study. Eg: mall survey.
- Issues:
1. Demographics of mall-goers: teenagers, retired people, people who are more affluent. Other groups (those who are neither teenagers nor retirees, and the not so affluent) are left out. This is a good example of selection bias.
2. Individuals asked to do the survey may not respond. This could lead to non-response bias.

Volunteer sampling
- What is it: the researcher actively seeks volunteers to participate in the study. Eg: online polls.
- Issues: those who did not respond are left out of the study. This presents a clear problem of non-response bias in a volunteer sample.
- Selection bias? Non-response bias

GENERALISABILITY CRITERIA
1. Good sampling frame
2. Probability-based sampling
3. Large sample size
4. Minimal non-response

VARIABLES

Types of variables

Categorical variables
- Nominal: no intrinsic ordering for the categories. Eg, nationality.
- Ordinal: categories come with some natural ordering, and numbers are often used to represent the ordering. Eg, happiness level.

Numerical variables
- Discrete: one where the possible values of the variable form a set of numbers with “gaps”. Eg, modular credits.
- Continuous: one that can take on all possible numerical values in a given range or interval. Eg, time.

MEDIAN
- The median of a numerical variable in a data set is the middle value after arranging the values of the data set in ascending or descending order. With an even number of values, it is the average of the two middle values.

Properties of median
- Adding a constant value (positive or negative) to all the data points changes the median by that constant value.
- Multiplying all the data points by a constant value c results in the median being multiplied by c.
- Similar to the mean, knowing the median alone does not tell us about the frequency of occurrence of scores, nor how the scores are distributed within the class.
- More generally, the median of a numerical variable does not tell us the total value, frequency of occurrence or the distribution of data points of the numerical variable.
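The first two properties can be checked numerically. The sketch below uses Python's standard `statistics.median`; the scores and the constant are made-up values chosen only for illustration.

```python
import statistics

scores = [3, 7, 8, 12, 20]  # hypothetical data
c = 5                       # hypothetical constant

median_original = statistics.median(scores)
median_shifted = statistics.median([x + c for x in scores])  # add c to every point
median_scaled = statistics.median([x * c for x in scores])   # multiply every point by c

# Two data sets with the same median but different totals and spread:
other = [0, 1, 8, 100, 500]
same_median = statistics.median(other) == median_original
different_total = sum(other) != sum(scores)
```

The last two lines illustrate the final property: knowing the median alone does not pin down the total or the distribution of the data.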

Quartiles and Interquartile Range
- The first quartile, usually denoted by Q1, is the 25th percentile of the data values, and the third quartile, usually denoted by Q3, is the 75th percentile of the data values.
- The interquartile range is IQR = Q3 - Q1.

Similarities between IQR and Standard deviation
- The IQR is always non-negative; this follows from the fact that Q3 is at least as large as Q1.
- Adding a constant value c (positive or negative) to all the data points does not change the IQR.
- Multiplying all the data points by a constant value c results in the IQR being multiplied by |c|.
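A quick numerical check of the IQR properties, as a sketch only. It assumes Python's `statistics.quantiles` with its default quartile method; other software (and the course's own convention) may compute quartiles slightly differently, and the data and constants are hypothetical.

```python
import math
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1, using Python's default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical data
c = -3                             # hypothetical (negative) multiplier

iqr_original = iqr(data)
iqr_shifted = iqr([x + 10 for x in data])  # adding a constant
iqr_scaled = iqr([c * x for x in data])    # multiplying by c
```

A negative multiplier is used on purpose: the IQR is scaled by |c|, not by c, so it stays non-negative.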

Using summary statistics appropriately
- The mean and standard deviation are a pair of summary statistics that attempt to describe the central tendency and dispersion of the data.
- The median and IQR are another pair of summary statistics that attempt to describe the central tendency and dispersion of the data.
- Given that we have defined 2 ways of measuring the central tendency and spread of the data, it may occur to some to ask “So which notion of central tendency and dispersion is more useful?”
- The answer, in short, depends on the distribution of the data points. Briefly speaking, the median is often used in preference to the mean when the distribution of points is not symmetrical.
- An example of this occurs in HDB prices. The median is typically chosen as the measure of central tendency for HDB prices since there may be flats which are extremely expensive relative to the price of most of the HDB flats.
- This will be covered in greater detail in subsequent chapters and units when we learn the idea of skewness of a distribution.
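To see why the median is preferred for skewed data, here is a small sketch with made-up prices (in thousands of dollars). The numbers are purely hypothetical and are not real HDB data.

```python
import statistics

# Hypothetical resale prices; the last flat is unusually expensive.
prices = [400, 420, 450, 480, 500, 1400]
typical = prices[:-1]  # the same flats without the extreme one

mean_all = statistics.mean(prices)
median_all = statistics.median(prices)

# How much does one extreme flat move each measure?
mean_shift = mean_all - statistics.mean(typical)
median_shift = median_all - statistics.median(typical)
```

In this sketch the single expensive flat pulls the mean up far more than the median, so the median stays closer to what most flats cost.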

MODE
- The mode of a variable is the value of the variable that appears most frequently.
- Excel command: “=MODE”

Interpretation of mode as “peaks”
- When we are describing the distribution of points of a discrete variable, the mode can be interpreted as a “peak” of the distribution.
- In the context of probability, a peak of the distribution refers to the value that has the highest probability of occurring.

EXPERIMENTS

Types of Study Designs
1. Experimental
2. Observational

Experiment vs Observational Study

Experimental study
- Assigned by the researcher.
- Can provide evidence of a cause-and-effect relationship.

Observational study
- Assignment by the subjects themselves.
- Can provide evidence of ‘association’, not cause and effect.
- Saves time and money.
- Presents fewer ethical issues.

CAUSE AND EFFECT RELATIONSHIP
- To establish a cause-and-effect relationship, we want to make sure that the independent variable is the only factor that impacts the dependent variable.
- How do we account for the effects of all the other variables? By random assignment.

RANDOM ASSIGNMENT
- Random assignment is an impartial procedure that uses chance.
- If the number of subjects is large, by the laws of probability, the treatment and control groups will tend to be similar in all aspects.
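A minimal simulation of random assignment, assuming 2000 hypothetical subjects whose ages are a variable the researcher does not control. The subject list, the seed and the function name are illustrative only.

```python
import random

def randomly_assign(subjects, rng):
    """Shuffle the subjects by chance and split them into two equal-sized groups."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(42)
# Hypothetical subjects as (id, age) pairs.
subjects = [(i, rng.randint(18, 60)) for i in range(2000)]

treatment, control = randomly_assign(subjects, rng)

mean_age_treatment = sum(age for _, age in treatment) / len(treatment)
mean_age_control = sum(age for _, age in control) / len(control)
age_gap = abs(mean_age_treatment - mean_age_control)
```

With this many subjects, the two groups should end up with very similar average ages even though age was never used in the assignment, which is the point of the second bullet above.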

PLACEBO
- Placebo: a treatment with no active ingredients and no effect (for example, give choffy instead of coffee).
- Placebo effect: the response observed when subjects receive a placebo treatment but still show some positive effects.
- While giving a placebo seems like a good idea, there has been a considerable amount of research showing that people who receive a treatment with no active ingredients can also show positive effects. Merely thinking that they received some form of treatment was enough to observe some response in the subjects, even if the treatment does nothing! This is called the placebo effect.

BLINDING
- Blinded subjects do not know whether they are in the treatment or control group.
- A placebo that is very similar to the treatment can be chosen to help make the blinding effective.
- The subjects are blind to the treatment to prevent their own beliefs about the treatment from affecting the results.
- For the coffee example, subjects in both the treatment and control groups will be given a cup of drink every morning. However, the study will ensure that both treatment and placebo smell and taste the same, to prevent subjects from knowing whether they are drinking coffee or choffy!
- Blinded assessors do not know whether they are assessing the treatment group or the control group.

Double-blinding
- An experiment is called double-blind if both subjects and assessors are blinded about the assignment.

RATES

Analysing 1 categorical variable:

- Alternatively, we can calculate the rate of successful treatments, which is 831/1050 (i.e. successful treatments / all treatments) = 0.79, or 79%. Here, we can see that a majority of the treatments are successful, since rate(Success) > rate(Failure).
- We will be using rates for much of this chapter. Intuitively, we can think of a rate as a fraction, proportion, or a percentage. This is useful for understanding some of its properties. For example, we note that 0% ≤ rate(X) ≤ 100%.

SYMMETRY RULE

Here NB means "not B" and NA means "not A".

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)

The rate of A given B is more than the rate of A given NB if and only if the rate of B given A is more than the rate of B given NA.

rate(A | B) < rate(A | NB) ⇔ rate(B | A) < rate(B | NA)

The rate of A given B is less than the rate of A given NB if and only if the rate of B given A is less than the rate of B given NA.

rate(A | B) = rate(A | NB) ⇔ rate(B | A) = rate(B | NA)

The rate of A given B is equal to the rate of A given NB if and only if the rate of B given A is equal to the rate of B given NA.

⇔ means the two relationships occur together. So, showing that one side is true implies that the other side is also true.
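The symmetry rule can be checked on the treatment data. In the sketch below, the totals (700 patients on X, 350 on Y, 831 successes out of 1050) come from the notes, but the per-treatment success counts (542 and 289) are an assumption: they are inferred from the quoted overall success rates of 77.4% and 82.6% and are not stated in the text.

```python
# Per-treatment counts. Patient totals are from the notes; the success
# counts are inferred from the quoted rates (assumption).
success = {"X": 542, "Y": 289}
total = {"X": 700, "Y": 350}

all_success = sum(success.values())
all_total = sum(total.values())
all_failure = all_total - all_success

# A = Success, B = Treatment X (so NB = Treatment Y, NA = Failure)
rate_success_given_x = success["X"] / total["X"]
rate_success_given_y = success["Y"] / total["Y"]
rate_x_given_success = success["X"] / all_success
rate_x_given_failure = (total["X"] - success["X"]) / all_failure

# Both comparisons point in the same direction, as the rule says.
direction_by_treatment = rate_success_given_x < rate_success_given_y
direction_by_outcome = rate_x_given_success < rate_x_given_failure
```

Either comparison alone is enough to conclude that success is negatively associated with treatment X in the combined data.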

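Consequence of the symmetry rule
- To identify whether there is any association, it is enough to compare either pair of rates:
1. rate(A | B) and rate(A | NB), OR
2. rate(B | A) and rate(B | NA)
- If the two rates in a pair are equal, there is no association between A and B. If they differ, there is an association, and both pairs give the same direction.
- Example: rate(Success | X) < rate(Success | Y) means a negative association between successful treatments and treatment X. Check: rate(X | Success) < rate(X | Failure).

BASIC RULE ON RATES
- The overall rate(A) will always lie between rate(A | B) and rate(A | NB).

Consequences of the basic rule on rates
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).
2. If rate(B) = 50%, then rate(A) is exactly halfway between rate(A | B) and rate(A | NB), i.e. rate(A) = [rate(A | B) + rate(A | NB)] / 2.
3. If rate(A | B) = rate(A | NB), then rate(A) = rate(A | B) = rate(A | NB).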
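One way to see why the basic rule and its consequences hold is to write the overall rate as a weighted average. This derivation is not spelled out in the notes; it is a standard identity for proportions, shown here as a sketch.

```latex
\[
\operatorname{rate}(A)
  = \operatorname{rate}(B)\,\operatorname{rate}(A \mid B)
  + \operatorname{rate}(NB)\,\operatorname{rate}(A \mid NB),
\qquad
\operatorname{rate}(B) + \operatorname{rate}(NB) = 1 .
\]
% The weights are non-negative and sum to 1, so rate(A) lies between the two conditional rates.
% Consequence 1: as rate(B) -> 1, the weight on rate(A | B) dominates.
% Consequence 2: with rate(B) = 1/2, rate(A) = (rate(A | B) + rate(A | NB)) / 2.
% Consequence 3: if both conditional rates equal r, then rate(A) = r.
\[
\text{Treatment data: }
\tfrac{700}{1050}\,(77.4\%) + \tfrac{350}{1050}\,(82.6\%) \approx 79.1\% .
\]
```

The numerical line uses only figures quoted in the notes (700 patients on X, 350 on Y, and the two overall success rates) and reproduces the overall success rate of about 79%.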

Linking back to consequences 2 and 3

Linking back to data set at hand

SIMPSON’S PARADOX

Previously, we found that Treatment Y was positively associated with success overall, but individually across large and small stones, Treatment X was positively associated with success. This was an example of a Simpson’s paradox, a phenomenon whereby the direction of association gets reversed when the groups are combined.

As a side note, in this example there are only two subgroups: small stones and large stones. In examples where there are more than two subgroups, we will call it a Simpson’s paradox as long as the majority of the individual subgroup associations are opposite to the overall association. For example, if there happen to be three subgroups, as long as 2 out of 3 of them are opposite to the overall association, we will call that scenario a Simpson’s paradox.

Analysing 3 categorical variables

- Let’s look at the success rates by treatment and stone size. We can see the Simpson’s paradox occurring, whereby the rate of success amongst small and large stones is better individually for treatment X, but the overall rate is better for Y.
- We see that treatment X has been used to treat mostly patients with large stones (526 large vs 174 small). Thus, from the Basic Rule on Rates, we know that the overall success rate of X will lie a lot closer to 72.4% than to 92.5%. As it turns out, the overall success rate of X is 77.4%.
- Treatment Y, on the other hand, has been used to treat mainly patients with small stones (80 large vs 270 small), and that means the overall success rate of Y will lie a lot closer to its success rate across small stones, which is 86.7%. As it turns out, the overall success rate of Y is very close, at 82.6%.
- Because the overall success rate of X is so close to 72.4% and the overall success rate of Y is so close to 86.7%, the overall success rate of X is lower than the overall success rate of Y.
- Success rates for each stone type: the success rates for large stones range from 68.8% to 72.4%, whereas the success rates for small stones range from 86.7% to 92.5%. We see that, in general, large stones have a lower rate of success than small stones. In other words, large stones are more difficult to cure.

Conclusion for Paradox
- Treatment X is in fact better than treatment Y, but because people have been using treatment X to treat the more difficult cases, this lowers the overall success rate of treatment X.
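The reversal can be reproduced with a short script. The patient counts per treatment and stone size are from the notes; the success counts (381, 161, 55, 234) are an assumption, chosen because they match every rate quoted above, but they are not stated in the text.

```python
# (successes, patients) by treatment and stone size. Patient counts are from
# the notes; success counts are inferred from the quoted rates (assumption).
data = {
    "X": {"large": (381, 526), "small": (161, 174)},
    "Y": {"large": (55, 80), "small": (234, 270)},
}

def subgroup_rate(treatment, size):
    """Success rate within one stone-size subgroup of one treatment."""
    successes, patients = data[treatment][size]
    return successes / patients

def overall_rate(treatment):
    """Success rate of a treatment with both stone sizes combined."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return successes / patients

x_better_in_each_subgroup = all(
    subgroup_rate("X", size) > subgroup_rate("Y", size) for size in ("large", "small")
)
y_better_overall = overall_rate("Y") > overall_rate("X")
```

Both flags come out true at the same time, which is exactly the paradox: X wins within each stone size, yet Y wins once the subgroups are combined.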

Stone size → CONFOUNDER

Note: when a Simpson’s paradox occurs, it implies that there is definitely a confounding variable present. This, however, does not mean that a confounder necessarily leads to a Simpson’s paradox.

CONFOUNDERS
- A confounder is a third variable that is associated with both the independent and the dependent variable whose relationship we are investigating.
- There is a higher proportion of large stones being assigned to treatment X as compared to treatment Y. Thus, large stones are positively associated with treatment X.
- The rate of success given large stones is 0.719, which is lower than the rate of success given small stones, which is 0.890. In other words, large stones are negatively associated with success.
- By comparing rates, we have shown that stone size is associated with treatment type and that stone size is associated with success. Therefore, stone size is a confounder when we are investigating the relationship between treatment type and success.
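The two associations that make stone size a confounder can be checked by comparing rates. As before, the patient counts are from the notes and the success counts are an assumption inferred from the quoted rates; the helper name `totals` is hypothetical.

```python
# (successes, patients) per cell. Patient counts are from the notes;
# success counts are inferred from the quoted rates (assumption).
counts = {
    ("X", "large"): (381, 526), ("X", "small"): (161, 174),
    ("Y", "large"): (55, 80), ("Y", "small"): (234, 270),
}

def totals(treatment=None, size=None):
    """Sum (successes, patients) over the cells matching the given filters."""
    successes = patients = 0
    for (t, z), (s, n) in counts.items():
        if treatment in (None, t) and size in (None, z):
            successes += s
            patients += n
    return successes, patients

# Association 1: stone size vs treatment type
rate_large_given_x = totals("X", "large")[1] / totals("X")[1]
rate_large_given_y = totals("Y", "large")[1] / totals("Y")[1]

# Association 2: stone size vs success
s_large, n_large = totals(size="large")
s_small, n_small = totals(size="small")
rate_success_given_large = s_large / n_large
rate_success_given_small = s_small / n_small
```

Since the rates differ in both comparisons, stone size is associated with both treatment type and success, matching the conclusion in the last bullet.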

DISTRIBUTIONS
- Univariate EDA
- Histograms
- Box plots

SCATTERPLOTS
- Bivariate EDA
- Correlation coefficient
- Linear regression

Construction of box plots

Bivariate EDA to find association
1. Scatterplots
2. Correlation coefficients
3. Regression analysis

CORRELATION COEFFICIENT
- The correlation coefficient r is a measure of linear association.
- Its range is between -1 and 1.
- It summarises the direction and strength of linear association.

Interpreting r value
- r > 0 → positive association
- r < 0 → negative association
- r = 0 → no linear association
- r = 1 → perfect positive association
- r = -1 → perfect negative association

Properties of r
- Not affected by the following:
1. Interchanging the 2 variables
2. Adding a number to all values of a variable
3. Multiplying all values of a variable by a positive number
- Affected by:
1. the amount of variability in the data
2. differences in the shapes of the 2 distributions
3. lack of linearity
4. the presence of 1 or more "outliers"
5. characteristics of the sample
6. measurement error

Take note
1. Correlation does not imply causation.
2. The correlation coefficient does not measure non-linear association.
3. How outliers affect the correlation coefficient: outliers may decrease the strength of linear association between two variables, but they may also increase the correlation coefficient.
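The properties of r listed above can be verified numerically. The sketch below computes r from its usual definition (the notes do not give the formula, so this is the standard Pearson form); the data values are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [2, 4, 5, 7, 9, 12]            # hypothetical x values
gpa = [4.5, 4.1, 4.2, 3.6, 3.4, 2.9]   # hypothetical y values

r = correlation(hours, gpa)
r_swapped = correlation(gpa, hours)                    # interchange the variables
r_shifted = correlation([x + 10 for x in hours], gpa)  # add a number to all x
r_scaled = correlation([3 * x for x in hours], gpa)    # multiply all x by a positive number

# A perfectly curved relationship can still give r = 0 (no linear association).
xs = [-2, -1, 0, 1, 2]
r_parabola = correlation(xs, [x ** 2 for x in xs])
```

The parabola example illustrates the second "Take note" point: r = 0 rules out a linear association, not an association in general.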

LINEAR REGRESSION
- We model the relationship by a straight line $Y = mX + b$.

Two things linear regression helps us with
- Given some data points, we get a line $Y = mX + b$, which gives us:
1. A prediction of the average Y for a given X.
2. An understanding of how Y changes on average for a 1-unit change in X.

How to find the regression line
- Define the i-th residual $e_i$ as the difference between the observed outcome and the predicted outcome for the i-th observation.
- We want to minimise $e_1^2 + \cdots + e_n^2$, where $n$ is the number of data points.
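A minimal sketch of finding the line that minimises the sum of squared residuals. The closed-form expressions for the slope and intercept used here are the standard least-squares solution, which the notes do not state; the data points and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def sum_squared_residuals(xs, ys, m, b):
    """e_1^2 + ... + e_n^2, where e_i = observed y_i - predicted y_i."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, m, b)

# Any other line should do no better; try a few nearby alternatives.
alternatives = [(m + 0.1, b), (m - 0.1, b), (m, b + 0.5), (m, b - 0.5)]
worse = [sum_squared_residuals(xs, ys, m2, b2) for m2, b2 in alternatives]

predicted_at_3 = m * 3 + b  # predicted average y when x = 3
```

The slope `m` is read as the average change in Y for a 1-unit increase in X, and `predicted_at_3` is a prediction of the average Y at X = 3.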

Slope vs Correlation Coefficient
- The slope of the regression line and the correlation coefficient are related by $m = \frac{S_y}{S_x} r$, where $S_y$ is the standard deviation of $y$ and $S_x$ is the standard deviation of $x$.
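The relationship between the slope and r can be confirmed on a small example. This sketch assumes sample standard deviations (Python's `statistics.stdev`); the ratio of the two standard deviations is the same if population standard deviations are used for both. The data are hypothetical.

```python
import math
import statistics

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]

mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

slope = sxy / sxx                # least-squares slope m
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

s_x = statistics.stdev(xs)
s_y = statistics.stdev(ys)
slope_from_r = (s_y / s_x) * r   # m = (S_y / S_x) * r
```

Because both standard deviations are positive, the slope and r always have the same sign, but a steep slope does not by itself mean a strong correlation.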

FIND HOW TO MODEL IN CHAPTER 3 NOTES..

EXTRAPOLATION

CANNOT USE YOUR Y TO FIND X: the regression line predicts Y from X, so it should not be used to work backwards from a Y value to X.

SAMPLE SPACE AND EVENTS
- A sample space, denoted by S, is the set of all possible outcomes of a random process (also known as a probability experiment).
- An event of the sample space is a subset of the sample space.
- Example: S = {1, 2, 3, 4, 5, 6}, with throwing the dice being a random process.
● Event A can be A = {1, 3, 5}
● Event B can be B = {2, 4, 6}

Discrete random variable: one where the sample space of outcomes forms a discrete numerical variable.
● Eg. throwing a dice. S = {1, 2, 3, 4, 5, 6}

Continuous random variable: one where the sample space of outcomes forms a continuous numerical variable.
● Eg. Normally distributed IQ scores wit...

