Biostats 100B Final Cheat Sheet
Author: Idean Roohani
Course: Introduction to Biostatistics
Institution: University of California, Los Angeles



Description

MIDTERM 1

Data & Ways to Collect It
Quantitative data: variables for which numerical results make sense.
- Continuous quantitative: any real number within a reasonable range.
- Discrete quantitative: may take on only certain values (e.g., whole numbers).
Qualitative data: does not involve numbers; refers to qualities or attributes (categorical measurements).
Scales of measurement:
- Nominal: data that can only be classified into categories (qualitative).
- Ordinal: order matters (one value being larger than another is meaningful), but the numerical value itself does not. RANKING.
- Interval: differences between values are meaningful, but ratios are not; values can be ordered and compared with addition and subtraction (but not multiplication and division). Zero is not significant.
- Ratio: data that can be ordered and meaningfully compared with all operations (addition, subtraction, multiplication, and division). Zero is significant.
Collection of Data / Intro to Sampling
Non-probability sampling: some elements have no chance of being selected for the sample (aka chunk/convenience sampling); introduces bias into the sample (bias is favoritism toward certain elements).
Probability sampling: every element of the population has some chance of being included in the sample.
Simple Random Sampling (SRS): all elements have an equal probability of being selected at each step of the sampling process.
- Obtain a population frame
- Assign every element a number
- Select numbers at random in a systematic way
- Match the selected numbers to the corresponding population elements to produce the sample
Stratified Random Sampling: corrects the (often invalid) SRS assumption that the population is homogeneous by dividing it into strata/subpopulations and taking an SRS from each stratum.
Cluster Random Sampling: randomly select clusters of units, then use every element in each selected cluster.
- Clusters should be easily defined so each population element is uniquely assigned
- Clusters should be relatively small, since all their elements are used
- Clusters are sampled mainly to reduce cost
- When a cluster is too big, two-stage cluster sampling is used (an SRS is performed within the cluster)
- A population frame is needed to perform the random sample
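The two probability-sampling schemes above can be sketched in a few lines of Python. A minimal illustration with a made-up frame of patient IDs tagged by site (the IDs, site names, and function names are invented for this example):

```python
import random

# Hypothetical population frame: 12 patient IDs, each tagged with a stratum (site).
frame = [(f"P{i:02d}", "siteA" if i < 8 else "siteB") for i in range(12)]

def simple_random_sample(frame, n, seed=0):
    """SRS: every element has an equal chance of selection (drawn without replacement)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_random_sample(frame, n_per_stratum, seed=0):
    """Divide the frame into strata, then take an SRS within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit, stratum in frame:
        strata.setdefault(stratum, []).append((unit, stratum))
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

srs = simple_random_sample(frame, 4)
strat = stratified_random_sample(frame, 2)   # guarantees 2 units from each site
```

Note how the stratified version guarantees representation from each subpopulation, which a plain SRS does not.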

3) The Median is simply the 50th percentile of the sample; to calculate, place observations in numerical order and find the middle. THE BEST MEASURE OF LOCATION IS THE MEDIAN.

4) The Mode is the category with the most observations in it (most frequent value).
5) The Midrange is the average of the largest and smallest observations:

   Midrange = (largest observation + smallest observation) / 2 = (x_(1) + x_(n)) / 2

Measures of Dispersion
1) Mean Absolute Deviation is the average absolute distance of observations from their mean: MAD = Σ|x_i − x̄| / n

2) Variance is the average squared deviation of observations from their mean:

   Variance (s²) = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)

3) Standard Deviation is the square root of the variance:

   SD (s) = √[ Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) ]

4) The Range is the largest observation in the sample minus the smallest observation. Rough rule of thumb: R ≈ 4·SD, so s ≈ R/4.
5) Interquartile Range (IQR) = 75th percentile − 25th percentile (Q3 − Q1)
   Lower Fence = Q1 − 1.5·IQR (observations below this are outliers)
   Upper Fence = Q3 + 1.5·IQR (observations above this are outliers)
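All of the dispersion measures above can be computed with Python's standard library. A sketch on a made-up sample (note that `statistics.quantiles` uses the "exclusive" percentile convention by default, which may differ slightly from a textbook's percentile rule):

```python
import statistics

x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 15.0]   # made-up sample

mean = statistics.mean(x)
mad = sum(abs(v - mean) for v in x) / len(x)   # mean absolute deviation
var = statistics.variance(x)                   # sample variance, n - 1 divisor
sd = statistics.stdev(x)                       # square root of the variance
rng_ = max(x) - min(x)                         # the range

# Quartiles, IQR, and outlier fences.
q1, q2, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
```

For this sample the quartiles are (4, 5, 9), so IQR = 5 and the fences sit at −3.5 and 16.5; no observation falls outside them.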

Five Number Summary
1) Minimum  2) Maximum  3) Median  4) Q1  5) Q3
Displayed graphically as a Boxplot.

Systematic Random Sampling: a method of random sampling used when population information is ordered sequentially.
- Let N = number of elements in the population, n = number of elements in the sample, and k = N/n (if k is not an integer, round up to the next integer).
- Randomly select an integer between 1 and k; call this number a.
- The sample then consists of the a-th, (a+k)-th, ..., (a+(n−1)k)-th sequentially numbered elements of the population.
Misleading Samples
- A bigger sample is not necessarily better than a smaller one, especially if there is bias (the bias will be accentuated).
- The population frame must be chosen carefully (many frames reflect only a portion of the population).
- When sampling human populations, be aware of potential nonrespondents, as this can produce selection bias.
- Hawthorne effect: the awareness of being observed changes behavior (the individual likely modifies their response).
Measures of Location and Variability
- Measures of Location describe the center of the data (roughly, the average value).
- Measures of Dispersion describe the variability of the data about its center.
- Statistics are calculated values that summarize or describe the sample data.
Measures of Location / Measures of Center
1) The Arithmetic Mean is the sum of the observations in the sample divided by the number of observations: x̄ = Σx_i / n
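The systematic sampling recipe above is mechanical enough to sketch directly. A minimal illustration over frame positions 1..N (the function name and parameter values are invented for the example):

```python
import math
import random

def systematic_sample(N, n, seed=0):
    """Systematic random sampling over a sequentially ordered frame 1..N.

    The interval k is N/n rounded up to the next integer; a random start
    a in 1..k yields the sample a, a+k, ..., a+(n-1)k (positions past N
    are dropped).
    """
    k = math.ceil(N / n)
    a = random.Random(seed).randint(1, k)
    return [a + i * k for i in range(n) if a + i * k <= N]

sample = systematic_sample(N=100, n=10)   # e.g. positions a, a+10, ..., a+90
```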

- If the sample is large, use a frequency table to calculate the arithmetic mean.

Left skewed (negative skew): long tail on the left side. Right skewed (positive skew): long tail on the right side.
On a boxplot, the bottom whisker extends to the minimum and the top whisker to the maximum.

MIDTERM 2

Probability
Probability describes the likelihood of an EVENT happening. A statistical experiment involves outcomes determined by chance variation; all possible outcomes are known in advance, meaning we know what could happen, but not what will happen.

Rules of Probability

- Grouped-data formula: x̄ = Σ(f_i · x_i) / N, where x_i = midpoint of the interval, f_i = frequency in the interval, and N = Σf_i.
If the data are "well behaved" (essentially symmetric), then 50% of observations fall at or below the mean and 50% above it. If the data are skewed, the mean is not a good measure of location; the median is preferred over the mean.
2) The Geometric Mean is the average of the logarithmic values, converted back to base 10: GM = 10^(average of the log10 values).

The Normal Distribution
Shorthand: N(μ, σ). If the variable Y follows a normal distribution, then:
- About 68% of the y's are within ±1 SD of the mean
- About 95% of the y's are within ±2 SD of the mean
- About 99.7% of the y's are within ±3 SD of the mean
Properties of the Normal Distribution:
1) The curve extends to +∞ and −∞ and is asymptotic to the horizontal axis (touches the axis only at ±∞).
2) The total area under the curve is 1 unit.
3) μ and σ completely determine the particular normal curve.
4) The curve is symmetric about μ (mean = median = mode).

The Standard Normal Distribution (aka Z-Scores)
With continuous variables, the probability of any exact value is virtually 0 (because 64.01415015" is different from 64.0252952", and neither is exactly 64"). Another way to think about it: there is no area under a single point of a curve, only under a range.
Frequentist approach: one sample is interpreted by thinking about all the samples that could have been observed.
Bayesian principle: prior knowledge is incorporated into the analysis.
Prior distribution: a graph of possible values of μ (x-axis = μ, y-axis = strength of belief) based on prior knowledge.
We say that the sample mean (x̄) is an unbiased estimate of the population mean (μ). The sample median is NOT an unbiased estimate of the population median; neither is the mode or the midrange.
Applying Z-Transformations
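The 68-95-99.7 rule above can be verified numerically from the normal CDF, which the standard library exposes through `math.erf` (no external packages assumed):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Probability mass within +/- k standard deviations of the mean:
within = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
# within[1] ≈ 0.683, within[2] ≈ 0.954, within[3] ≈ 0.997
```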

Sampling Distribution of X̄
Mean: μ_x̄ = μ
Variance: σ²_x̄ = σ²/n
Standard Deviation: σ_x̄ = σ/√n
If we are dealing with x̄ rather than an individual measurement X, the Central Limit Theorem justifies using the normal distribution under these circumstances.
Standard Error: SE_x̄ = s/√n
Standard Error incorporates the two factors that affect reliability:
1) The inherent variability of the observations (standard deviation: s)
2) The sample size (n)
Z-Score of the Sampling Distribution: Z = (x̄ − μ) / (σ/√n)
Z-Score of an individual observation: Z = (x − μ) / σ
A Z-value tells us how many SDs an observation is from the mean.

Hypothesis Testing Errors
- Type I Error (Alpha Error / False Positive) = finding significant evidence for Ha when H0 is true.
- Type II Error (Beta Error / False Negative) = not finding significant evidence for Ha when Ha is true.
- Judgment of the null hypothesis is based on the p-value calculation.
- The risk of a Type I error is a probability computed under the assumption that H0 is true; it is always limited by the chosen significance level (α).
- The risk of a Type II error is a probability computed under the assumption that Ha is true.
- α error = level of significance = P(Type I error) = producer's risk
- β error = P(Type II error) = consumer's risk
- α and β are inversely related, just like sensitivity and specificity.
- P(Type I error) + P(Type II error) does NOT equal 1 (if it did, you would always be making a mistake).
- P(Type II error) = β; its specific value depends on the true parameter value.
- We can keep β small by increasing the sample size: increasing n decreases β.
- β may be as high as 20%, but the range is usually 10-20% for β in study design.
- POWER is defined as 1 − β; increasing the sample size increases power.
- As α decreases, β increases and power decreases.

The t-Distribution
- a probability distribution (area under the curve = 1)
- a small-sample distribution; its main use is when σ is unknown and n is small
- the variance (or SD) of the z-distribution is smaller than that of the t-distribution
- there are an infinite number of t-distributions, depending on the degrees of freedom

Type I and Type II errors are inversely related: as one increases, the other decreases. The Type I, or α (alpha), error rate is usually set in advance by the researcher. The Type II error rate for a given test is harder to know because it requires estimating the distribution of the alternative hypothesis, which is usually unknown.

FINAL MATERIAL
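The sampling-distribution facts above (μ_x̄ = μ and σ_x̄ = σ/√n) are easy to check by simulation. A sketch with made-up population parameters (μ = 50, σ = 12, n = 36, so σ_x̄ should be 12/6 = 2):

```python
import random
import statistics

rng = random.Random(0)
mu, sigma, n, reps = 50.0, 12.0, 36, 4000

# Draw many samples of size n from N(mu, sigma) and record each sample mean.
xbars = [statistics.mean(rng.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

mean_of_xbars = statistics.mean(xbars)   # should be close to mu = 50
sd_of_xbars = statistics.stdev(xbars)    # should be close to sigma / sqrt(n) = 2.0
```

The histogram of `xbars` would also look approximately normal even if the population were not, which is the CLT at work.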

The sampling distribution of x̄ depends on the sample size n in two ways:
1) Its standard deviation is inversely proportional to √n.
2) If the population distribution is not normal, the shape of the sampling distribution of x̄ depends on n, being more nearly normal for larger n.
As the sample size increases, the sample mean and sample standard deviation tend to approach the population mean and population standard deviation more closely, and the standard error tends to approach 0.

Central Limit Theorem & When to Use Z Statistics vs. t Statistics
Suppose you are sampling from an infinite population with a finite variance; if n is sufficiently large, the sampling distribution of x̄ will be approximately normal.
- "Sufficiently large" is n > 30.
- If n > 30, it is okay to use s to approximate σ in your z-statistic.
- If n ≤ 30, you must use the t-statistic.
One special case is worth mentioning: if the sampled population is exactly normal, then regardless of the sample size, the distribution of x̄ is normal (not approximately normal, but exactly normal). This fact is not a direct consequence of the CLT; if we can assume we are sampling from a normal population, the theorem need not be invoked. It is only when we cannot make this assumption that the CLT demonstrates its utility.

Confidence Intervals
Improving the accuracy of an estimate requires a trade-off: the estimate becomes less precise.
Precision is represented by the width of the CI; accuracy is represented by the level of confidence.
- Wider CI = less precise estimate, but more accurate (more likely to contain the parameter).
- A higher confidence level can be attained without compromising precision by increasing the sample size n.
- Width of the Confidence Interval = (x̄ + z_p·s/√n) − (x̄ − z_p·s/√n) = 2·z_p·s/√n

The t-Distribution: mean = median = mode = center = 0

Paired Design
The observations (Y1, Y2) occur in pairs; the observational units in a pair are linked in some way, so they have more in common with each other than with members of another pair.
Example Hypotheses:
Null hypothesis (H0): μD = 0
- Ex: Hunger when taking mCPP is no different from hunger when taking a placebo.
Alternative hypothesis (HA): μD ≠ 0
- Ex: Hunger when taking mCPP is different from hunger when taking a placebo.
- Pairing in an experimental design can serve to reduce bias, to increase precision, or both.
- The primary purpose of pairing is to increase precision.
- The extra information made available by an effectively paired design is entirely wasted if an unpaired analysis is used.
- When we pair two samples, we eliminate a covariate.
- A paired design reduces to performing a 1-sample t-test on the differences.
Paired data are analyzed with a 1-sample t-test.
- It is harder to reject the null hypothesis with a paired t-test than with a 2-sample t-test (fewer degrees of freedom), so we are more likely to make a Type II error.
- The paired t-test has less power than the 2-sample t-test.
- Pair when you need to; don't do it all the time, because you lose power.
Qualitative Data (aka categorical data, count data, and dichotomous data)
- Parameter = population proportion (π)
- Sample proportion = p
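The paired analysis above reduces to a 1-sample t-test on the within-pair differences. A stdlib-only sketch (the hunger ratings are invented for illustration, and without a t-table lookup the code stops at the t statistic and its degrees of freedom):

```python
import math
import statistics

# Hypothetical paired hunger ratings: the same subject measured under
# drug (mCPP) and placebo conditions.
drug    = [3, 5, 4, 6, 2, 5, 4, 3]
placebo = [5, 6, 6, 7, 4, 6, 6, 5]

diffs = [d - p for d, p in zip(drug, placebo)]
n = len(diffs)
dbar = statistics.mean(diffs)      # mean difference, estimates mu_D
sd = statistics.stdev(diffs)       # SD of the differences

# One-sample t-test on the differences, testing H0: mu_D = 0 vs HA: mu_D != 0.
t_stat = dbar / (sd / math.sqrt(n))
df = n - 1                         # n pairs -> n - 1 degrees of freedom
```

Compare `t_stat` against the t-distribution with `df` degrees of freedom; note df = n − 1 here versus 2n − 2 for an unpaired 2-sample test, which is the degrees-of-freedom cost of pairing.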

True/False Question Bank
1. The smaller the level of confidence, the shorter the confidence interval. TRUE
2. For a fixed confidence level, when the sample size increases, the width of the confidence interval for the population mean, μ, decreases. TRUE
3. For a fixed confidence level, when the sample size increases, the margin of error of the confidence interval for a population mean decreases. TRUE
4. The null hypothesis H0 is considered correct until proven otherwise. TRUE
5. If the null is rejected, this means that the null is not true. FALSE
6. If the null is not rejected, this means that the null hypothesis is true. FALSE
7. As the sample size increases, the power of the test of the null hypothesis, H0, increases. TRUE
8. α = 1 − β FALSE
9. If we reject H0, then we can say we have conclusively proven H1 is true. FALSE
10. If we reject H0, then we can say we have conclusively proven H0 is false. FALSE
11. If we take a large enough sample from any population, then the histogram for the sample data would be expected to be bell-shaped. FALSE
12. If we take a large random sample, the mean of the sample will be approximately equal to the median. FALSE
13. If we reject H0 at α = 0.05, the probability is 0.95 that H1 is correct. FALSE
14. The larger the sample size, the larger the confidence interval. FALSE
15. The alternative hypothesis, H1, is considered correct until demonstrated otherwise. FALSE
16. As the sample size decreases, the probability of a β error decreases. FALSE
17. In systematic random sampling, any selected sequence of observations is just as likely to be chosen as any other distinct sequence. TRUE
18. In stratified random sampling, we define strata so that the individuals in each stratum are as heterogeneous as possible. FALSE
19. In any symmetric distribution the mean, median, and mode have the same value. FALSE
20. Since x-bar is an unbiased estimate of the population mean, any value of x-bar that we observe must be very close to the population mean. FALSE
21. The sample standard deviation is used with skewed data as a measure of variability of the sample. FALSE
22. The sample midrange can be used as a reasonable estimate of the sample standard deviation. FALSE
23. Subtracting the arithmetic mean from each observation in a sample, the sum of the values is always zero. TRUE
24. Subtracting the arithmetic mean from each observation in a sample results in a group of values whose mean is always zero. TRUE
25. Subtracting the arithmetic mean from each observation in a sample results in a group of values whose standard deviation is always zero. FALSE
26. If X1 = 3, X2 = 81, X3 = 27, X4 = 3, and X5 = 729, then the geometric mean is 27. TRUE
27. Parameters are characteristics of a population, are fixed for that population, but are known under most circumstances. FALSE
28. Parameters are characteristics of a population, are fixed for that population, though inevitably unknown in statistical problems. TRUE
29. The standard error of the mean increases with decreasing sample size. TRUE
30. The standard error of the mean increases with increasing population size. FALSE
31. If the standard deviation for a population of measurements is 16.0 meters, the standard deviation computed from a sample of sixteen observations from this population would be expected to be approximately 4.0 meters. FALSE
32. If the standard deviation for a population of measurements is 25 mg, the standard deviation computed from a sample of 625 observations from this population would be expected to be approximately 1.0 mg. FALSE
33. If in a sample of size 2 the observations are 0 and 2, the variance is 2. TRUE
34. One can obtain a true random sample of the individuals in a community by using the information from the Department of Social Security. FALSE
35. One can obtain a true random sample of the individuals in a community by using the information from the Department of Motor Vehicles. FALSE
36. One can obtain a true random sample of the individuals in a community by using the voter registration database. FALSE
37. Individual arithmetic means computed from random samples can vary among themselves but are usually equal to the mean of the population. FALSE
38. Arithmetic means computed from random samples vary among themselves and may differ from the mean of the population. TRUE
39. In any skewed distribution, mean > median. FALSE
40. After subtracting the median from each observation in a sample, the sum of the values is always zero. FALSE

Old Midterm Questions
In testing H0: μ = 100 with α = 0.05 and n = 50:
a) If n is changed to 15 and α is left unchanged, what happens to β? INCREASES
b) If n is changed to 100 and α is left unchanged, what happens to 1 − β? INCREASES
c) If α is changed to 0.01 and n is left unchanged, what happens to β? INCREASES
d) If H0 is not correct, but the true μ is actually very close to 100, β should be close to: 0, 1, or 0.5? 1
e) If H0 is not correct, but the true μ is actually very far from 100, β should be close to: 0, 1, or 0.5? 0
If our significance level, α, is set equal to 1, then we always reject H0.
If we sample the entire population, then α = β = 0.
If we set α = 0.05, then β must be... Can't tell with the information given.
Suppose we test a new device in 20 healthy volunteers and measure the change in their heart rate. The mean was 10 and the standard deviation was 5. Compute the 90% confidence interval for μ, the population mean change in heart rate.

x̄ ± t·(s/√n) → 10 ± 1.729·(5/√20) → (8.067, 11.933); 8.067 < μ < 11.933

A 90% confidence interval for μ is (1642, 2158). Possible interpretations:
(i) P(μ < 1642 or μ > 2158) = 0.10
(ii) P(1642 < μ < 2158) = 0.90
(iii) If we were to take another sample from this population, we would probably get an interval with different endpoints.
(iv) The probability is 90% that if we were to take another sample, the new x̄ would be between 1642 and 2158.
(v) The formula that generated the interval in question yields a correct interval (contains μ) 90% of the time.
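The heart-rate CI computation above can be reproduced directly; the critical value 1.729 (t with 19 df at 90% confidence) is taken from a t-table, since the standard library has no t quantile function:

```python
import math

# Worked example: n = 20 volunteers, x-bar = 10, s = 5, 90% confidence.
n, xbar, s = 20, 10.0, 5.0
t_crit = 1.729                      # t_{0.05, df = 19}, from a t-table

margin = t_crit * s / math.sqrt(n)  # margin of error
ci = (xbar - margin, xbar + margin)
# ci ≈ (8.067, 11.933)
```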

By 10 pm: x = 14 hours → Z = (x − μ)/σ = (14 − 10)/5 = +0.80 → P(Z ≤ 0.80) ≈ 0.7881
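The phone-recharge calculation above, checked numerically (again using `math.erf` for the normal CDF rather than a Z-table):

```python
import math

# Time between charges ~ N(10, 5) hours; charged at 8 am, so "by 10 pm"
# means P(X <= 14).
mu, sigma, x = 10.0, 5.0, 14.0
z = (x - mu) / sigma                            # +0.80
p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # P(Z <= 0.80) ≈ 0.7881
```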

Conclusion statements:
Since p-value < α, we reject H0; there is significant evidence to support the claim that (write out Ha, substituting words for μ).
Since p-value > α, we fail to reject H0; there is not significant evidence to support the claim that (write out Ha, substituting words for μ).
...

