STA 205 Composite Notes (NKU) PDF

Title STA 205 Composite Notes (NKU)
Course Introduction To Statistical Methods
Institution Northern Kentucky University
Pages 20

Summary

Every lecture was attended, notes including everything on board and what the professor said. ...


Description

1/16

Idea of the course: to understand how people use data in different situations to research and learn (can be applied to education, medicine, etc.). Ideally, you would like to understand an entire population, but due to data and time constraints, this is often not possible, so you look at a subset, make an inference, and extend that to the entire population.

- Statistics as #s- mean, median, mode, range, standard deviation, proportions, growth rate, batting averages, etc.
- Statistics as a science- collecting data (taking a sample from the population, or a census), calculating, and drawing a conclusion
- Population- “all” (CS: all is the keyword here) values of interest; it does not refer only to people, and can refer to all of the ages/incomes/moths/anything in a set.
- N- the size of a population (CS: most likely an estimate)
- Census- a way to observe EVERY value in the population
- Sample- SPECIFIC subset of the population (CS: special number of values)
- n- sample size
- Survey- to look at part of the population (CS: if you perform a survey of the population, you are taking a sample of the population; the sample is the product of the survey)
- Parameter- population values (the population mean, median, proportion, standard deviation), any of the numbers that refer to all of the values in the population
- Statistics- sample values
- All statistics is estimation. There is a grey area where problems can be left open ended, leaving you to make an educated decision.
- Random sampling- every value in the population has the same opportunity of being selected
- Convenience sampling- the sample is easy to obtain
- Self-selected (Volunteer)- participants volunteer or choose to be part of a study (ex: polls, questionnaires)
- NOTE: You need to understand your population to understand the kind of sample you have.

Two Different Types of Data we will be working with
- Categorical
  - Data can be categorized based on a certain quality or attribute (ex: dividing the classroom based on age, gender, eye color, hair color, major, zip code, area code, social security numbers)
  - Proportions
  - If you are dealing with categorical data, you are seeking a percentage
- Quantitative
  - Numerical (every set of quantitative data is numerical, but not every data set made up of numbers is quantitative)
  - Averages are meaningful (if we wrote down the phone number of everyone in the room and found the average, that would not be meaningful)
  - Age, weight, height, distance, GPAs

1/18

Review from last class
- Keep in mind: the population is the overall set of data
- The sample is a smaller subset of the population
- A census is getting information from every thing/person/value in the population, but we generally do not have time for that, so instead we SURVEY the population.
- We should be able to recognize the difference between quantitative and categorical data

Ex: A researcher wants to understand the amount of money spent on fast food weekly among 18 to 24 year old college students in the Greater Cincinnati Area. Suppose he takes a sample of 1000 college students from NKU whose ages range from 18 to 24 years and finds a mean amount of $35.50 with a standard deviation of $11.75.
- Describe the population for this scenario.
  - 18 to 24 year old college students in the Greater Cincinnati Area
- Is the sample taken Convenient, Self-selected, or Random?
  - Convenient. This can be identified by the fact that the sample is confined to one school.
- Is the mean of $35.50 a parameter or a statistic?
  - A statistic. We know this because it is a value describing a SAMPLE, not a POPULATION.

What to do for StatCrunch (Doing these steps is a 10 point quiz)
- Go to StatCrunch.com
- -> EXPLORE
- -> GROUPS
- Search: CULBERTSON
- Look for: -> Culbertson STA 205 Spring 2019 (at top)
- -> JOIN
- On Wed, give her a full sheet of paper with First and Last Name and Username and “Section 3”

StatCrunch Tips
- Bar Plots
  - Always give your graphs a title
  - It is helpful to put the value above the bar (when creating a bar plot)
  - There is a difference between frequency and relative frequency. Relative frequency is a proportion or percentage of the whole, while regular frequency is just a count (how many).
- Different graphs are more appropriate for different sets of data/situations
- Pie charts
  - Without the key, pie charts are useless.
  - When the slices are so thin they are almost indistinguishable from each other, a bar plot may be a better choice of representation to show differences in the data.

1/23

1/25

- Categorical data, even if numerical, is not something you can find a useful/meaningful average from.
- Bar graphs vs pie graphs
  - It is often easier to decide which bar is taller than which slice of the pie is bigger
- On a relative frequency bar graph, the labeling is a decimal which lets you know the percentage or proportion of the entire set per category.
- When doing a frequency table, all you have to do is select the data set; usually you do not have to change from the default settings
- Everything we are doing for the next two weeks is under the Graph and Stat menus

Histograms- bar graph that represents quantitative data that has been organized into categories or classes
- Characteristics
  - Bars touch
  - Class intervals are of equal width (bin width)
  - End values (right boundary of interval) are counted in the next category or bar
  - CS: The big thing to emphasize is that the bars touch and the widths of said bars are equal.
  - CS: anything that falls at the end of a category is actually counted in the next category
  - When intervals are ranges, for example from one to ten, we do not actually know the individual values from the histogram, only the frequency of scores in that range
- Unimodal histograms- have 1 peak
- Bimodal histograms- have 2 peaks
- Multimodal histograms- have 3 or more peaks
- Interpreting a histogram
  - Minimum to maximum
  - Most frequent category (modal class)
  - Typical range (includes center and spread when available)
  - Atypical range (outliers)
  - Shape
- Skewed data
  - Skewed right
    - Starts tall and decreases, mean is to the right of the median (median < mean)
  - Skewed left
    - Starts small and increases, mean is to the left of the median (median > mean)
- When looking at a histogram, always look for the minimum and maximum first (note: the starting point is not always 0!!! Don’t assume!!!)
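The mean-versus-median comparison for skewed data can be checked numerically. The sketch below is only an illustration: the two data sets are made up, `describe_skew` is a hypothetical helper name, and comparing mean to median is a rule of thumb rather than a replacement for looking at the histogram.

```python
from statistics import mean, median

def describe_skew(values, tolerance=0.0):
    """Suggest a skew direction by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m - med > tolerance:
        return "skewed right"   # mean pulled up by a tail on the right
    if med - m > tolerance:
        return "skewed left"    # mean pulled down by a tail on the left
    return "roughly symmetric"

# Hypothetical data: mostly small values with a long tail to the right.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9, 15]
# Hypothetical data: mostly large values with a long tail to the left.
left_tailed = [1, 7, 12, 13, 13, 14, 14, 14, 15, 15]

print(describe_skew(right_tailed))  # mean 4.2 > median 2.5
print(describe_skew(left_tailed))   # mean 11.8 < median 13.5
```

The `tolerance` argument is an assumption added so that tiny differences between mean and median can still be called roughly symmetric.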


- When there are two most frequent categories (modal classes), mention them both, and do not lump them together
- A good way to think of “typical”- where is the majority of this data?
  - Tip- start where it is tallest, then spread out to get more bars when needed
- One way to use StatCrunch to determine the shape of a graph (especially if that shape isn’t obvious) is to indicate where the mean and median are on the graph. This tells you whether the graph is skewed at all.
- The most frequent category always needs to be described as a range (from one value to another) when analyzing a histogram
- Do not be scared of including too much in the typical range; most often it will be the bars that stand out to you as being above the rest
- When identifying the atypical values/range, do not mention the “space” between the atypical data and the rest of the data, because we have no data in that gap

2/1

Properties of standard deviation
- s ≥ 0 (standard deviation has to be greater than or equal to 0)
- If s = 0, all values are the same
- The larger s gets, the more spread out the values are
- The mean is more descriptive of the data when s is small

Interquartile range- the length of the middle 50% of ordered data
- Properties of the IQR
  - IQR ≥ 0 (non negative quantity)
  - If IQR = 0, then Q1 and Q3 are the same, and the middle 50% of values are the same, including the median
  - The larger the IQR is, the more spread out the data
  - The median is more descriptive of the data when the IQR is small
- IQR = Q3 - Q1
  - Q1 = Quartile 1 - marks the lower 25% of ordered data
  - Q3 = Quartile 3 - marks the lower 75% of ordered data
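A quick way to see these properties is to compute s and the IQR for a few small data sets. This is a sketch using Python's standard `statistics` module; the three lists are made up, `iqr` is a hypothetical helper, and the `"inclusive"` quartile method is an assumption (software packages differ in how they compute quartiles).

```python
from statistics import stdev, quantiles

def iqr(values):
    """Interquartile range: Q3 - Q1 (inclusive quartile method assumed)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

same = [5, 5, 5, 5, 5]    # every value identical
tight = [4, 5, 5, 5, 6]   # values close together
wide = [1, 3, 5, 7, 9]    # values spread out

for name, data in [("same", same), ("tight", tight), ("wide", wide)]:
    print(name, "s =", round(stdev(data), 3), "IQR =", iqr(data))
```

Here `same` gives s = 0, `tight` has a small s with an IQR of 0 (its middle values are all 5), and `wide` has the largest s and IQR.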

[Figure: diagram labeling Q1, Q2, Q3, and the IQR]

Shape

- Skewed left has tail to the left
  - Mean is smaller than median (because the mean is being pulled down)
- Skewed right has tail to the right
  - Mean is bigger than median (because the mean is getting pulled up)
- Roughly symmetric: mean and standard deviation (if the shape is symmetric, these are the appropriate measures)
- If your data is skewed, so is the mean

Box-plots (5 pt summary)
- Minimum
- Q1
- Median
- Q3
- Maximum
- Ex: 0, 0, 3, 5, 5, 7, 7, 7, 9
  - Minimum: 0
  - Q1: 3
  - Median: 5
  - Q3: 7
  - Maximum: 9
- Finding outliers
  - Any value less than the lower fence or more than the upper fence is an outlier
  - Lower fence = Q1 - 1.5 IQR
  - Upper fence = Q3 + 1.5 IQR
  - From previous example
    - Q1 = 3 and Q3 = 7
    - IQR = 7 - 3 = 4
    - Lower fence = 3 - 1.5(4) = -3
    - Upper fence = 7 + 1.5(4) = 13
    - No value in the example is below -3 or above 13, so it has no outliers

To get the summary stats in StatCrunch, go to the Stat option, then Summary Stats (second option), and select a column.
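The box-plot example above can be reproduced in a few lines. This is a sketch, not the StatCrunch procedure: the function names are hypothetical, and the `"inclusive"` quartile method is an assumption chosen because it gives Q1 = 3 and Q3 = 7 for this data (Python's default method would give different quartiles).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]

def fences(q1, q3):
    """Lower and upper outlier fences: Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [0, 0, 3, 5, 5, 7, 7, 7, 9]
minimum, q1, med, q3, maximum = five_number_summary(data)
lower, upper = fences(q1, q3)
outliers = [x for x in data if x < lower or x > upper]

print(minimum, q1, med, q3, maximum)  # 0 3.0 5.0 7.0 9
print(lower, upper, outliers)         # -3.0 13.0 []
```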

2/4/19
- When you have symmetrical data, you want to refer to the mean as the measure of center (and standard deviation is the corresponding measure of variation).
- When data is skewed, the mean is also skewed, so you refer to the median as the measure of center (and the IQR as the corresponding measure of variation).
- A question asking about variability is not asking about how small the whole range is, but how small the IQR box is.
- If a question asks how well the median represents the data, it is actually asking how small the IQR box is
- When determining if a data set is roughly symmetric, you need to look at the whole range of the data set to determine if there is a significant difference between the mean and median.
- The median will ALWAYS be in your typical range.


2/6/19
- Be more sensitive about rounding rules
  - Round .5 and up to the number above
  - This will be important with the unit we are heading into
- The “typical” range can be just one bar if it is exceptionally taller than the rest (note: it is not the height of the bar that describes that category, though; that will be along the x axis)

2/11/19
- Normal distributions
  - Bell shaped
  - Symmetric
  - Centered at mean
  - Approximately 68% of data is within one standard deviation of the mean
    - One standard deviation on either side of the mean is 34% of the data
  - Approx 95% of the data is within 2 standard deviations of the mean
    - The band between one and two standard deviations on either side of the mean is another 13.5 percent of the data (not including the preceding data, only the data in that band)
  - Approximately 99.7% of data is within 3 standard deviations of the mean
    - The band between two and three standard deviations on either side of the mean is 2.35% of the data
  - μ = population mean
  - σ = population standard deviation
  - CS: If a value is more than 3 standard deviations away from the mean, it is an outlier.
- Example:
  - Suppose a certain population is normally distributed with a mean of 75 and a standard deviation of 4. Describe the distribution as completely as possible.
  - μ = 75, σ = 4
  - Approx 68% of data is between 71 and 79
  - Approx 95% of data is between 67 and 83
  - Approx 99.7% of data is between 63 and 87

2/13/19
- Probability ~ Proportion ~ Percentage
- A Z-score tells how many standard deviations a value x is from the mean μ.
  - Z = (x - μ) / σ, where σ = standard deviation
- Then, use the Z Score Chart (in Canvas) to convert the number you get from that equation into a four digit number (THAT FOUR DIGIT NUMBER IS NOT THE Z SCORE! IT IS CONSIDERED THE PROPORTION OR PERCENTAGE)


- The z score does not equal or imply the area.
- ALWAYS value minus mean
- To find the in-between, find two z scores, look at the chart twice, and subtract (the smaller one from the larger one)
- ON BOARD:
  - Find 2 z scores
  - Find 2 areas in the chart
  - Subtract smaller area from larger area.

2/15/19
- μ = 67
- σ = 5
- Find the proportion of values between 65 and 70.
  - Z = (70-67) / 5 = 0.6 ⟶ 0.7257
  - Z = (65-67) / 5 = -0.4 ⟶ 0.3446
  - 0.7257 - 0.3446 = 0.3811
  - 2 z scores, two lookups in the same chart, and the result should not be negative because areas (and percentages and probabilities) should never be negative
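The "two z scores, two areas, subtract" procedure can be sketched with the standard library's `NormalDist`, whose `cdf` plays the role of the Z chart (area to the left of a z score). The function name `proportion_between` is hypothetical. The code does not round z or the areas the way the chart does, so the result can differ from the chart answer in the fourth decimal place.

```python
from statistics import NormalDist

def proportion_between(low, high, mu, sigma):
    """Area under a normal curve between two values."""
    standard = NormalDist()          # standard normal: mean 0, sd 1
    z_low = (low - mu) / sigma       # ALWAYS value minus mean
    z_high = (high - mu) / sigma
    # Subtract the smaller area from the larger area.
    return standard.cdf(z_high) - standard.cdf(z_low)

p = proportion_between(65, 70, mu=67, sigma=5)
print(round(p, 3))  # about 0.381, in line with the chart result 0.3811

# The same function reproduces the 68% rule for mu = 75, sigma = 4.
print(round(proportion_between(71, 79, mu=75, sigma=4), 2))
```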

- Z = (x-μ) / σ
- To solve for x, algebraically manipulate the formula to become:
  - x = μ + Zσ
- Then use the chart backwards
  - Go inside the body of the chart and find the closest thing you can to the percentage being asked about (ex: if it’s 90 percent, then find .9)
  - If you’re looking for something like the top 10 percent, and the problem requires a whole number as the response, then round up (no matter what the decimal is) to the next whole number.
- The Z score at the mean will always be 0

2/18/19
- If two areas in the chart are equally close to the percentage, average the two corresponding z scores to find the z score.
- When you’re on the negative side of the mean, you will be on the negative side of the z chart
- The Z score is what tells you how many standard deviations you are away from the mean

2/20/19
- 5% Rule
  - When a probability is less than 5%, we consider the event to be unusual and contradictory to the original parameters.
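Using the chart backwards (x = μ + Zσ) can be sketched with `NormalDist.inv_cdf`, which returns the z score for a given area to the left. The values μ = 67 and σ = 5 are reused here purely as an illustration, and `value_at_percentile` is a hypothetical name. Because the code does not round z to two decimals the way the chart does, the cutoff may differ slightly from a chart-based answer.

```python
import math
from statistics import NormalDist

def value_at_percentile(area_below, mu, sigma):
    """Solve Z = (x - mu) / sigma for x, given the area to the left of x."""
    z = NormalDist().inv_cdf(area_below)   # the backwards chart lookup
    return mu + z * sigma

# Top 10 percent means 90 percent of the area lies below the cutoff.
cutoff = value_at_percentile(0.90, mu=67, sigma=5)
print(round(cutoff, 2))

# If the answer must be a whole number, round up to the next whole number.
print(math.ceil(cutoff))
```

At an area of 0.5 the function returns the mean itself, which matches the note that the Z score at the mean is 0.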


→ We are going to assume we did not select the outlier on the first try. In real life, you would test several samples before determining that the original parameters could be incorrect; however, this is a lower level class, and we are just trying to understand the concept.
- An acceptable wording of this would be: A probability of ____ is less than 5% and shows that the sample value contradicts the originally claimed parameter.
- CS: This essentially means, this new thing contradicts the old thing.

Sampling distribution of the proportion
- Population proportion: p
  → This is the proportion of the population which a problem is paying special attention to (ex: if you are looking at 24.6% of a population of 100, then p would be 0.246). This is helpful in honing your focus on the part of the population that actually matters to this problem story.
- Sample proportion: p̂ (p hat: p under a circumflex)
- Anything that deals with a proportion is categorical
- Random = representative (means the same thing in math “stories”)
- CS: Most “p hats” will be close to p
  → This is the same theory as any other normal distribution: most values will be close to the mean! Because the mean of the sampling distribution is p, and all of the values within that distribution are p hats, logically, the majority of p hats will be close to the mean (p). (68% will be within one standard deviation, 95% will be within 2 standard deviations, etc.)
- CS: Think of your population as a big bag of items. You get some out, and you randomly get a percentage. Go back in, get a new sample and a second percentage. The idea is that every time you have a new sample, you would have a new p hat or new percentage.
- The mean of p hat will always equal p
  → Because p hats make up the values of the sampling distribution, and the mean of those p hats is p (the proportion given, of course in decimal form).
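The "big bag of items" idea can be simulated: draw many random samples, compute a p hat for each, and check that the p hats average out close to p. This is only a sketch. The sample size of 100, the 2000 repetitions, the fixed seed, and the name `simulate_p_hats` are all assumptions for illustration; p = 0.246 is taken from the example above.

```python
import random
from statistics import mean

def simulate_p_hats(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n and return each p hat."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(num_samples):
        # Each draw has the attribute of interest with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats

p_hats = simulate_p_hats(p=0.246, n=100, num_samples=2000)

print(p_hats[:5])              # every new sample gives a new p hat
print(round(mean(p_hats), 3))  # the p hats average out close to p
```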
( x’s (different from x bars)
- To describe this population (guaranteed test question)
  - Mu of x bar = mu
  - Sigma of x bar = sigma over the square root of n
- Normal vs not normal
  - n ≥ 30 or pop is already normal
  - “30 rule”
- CS: Xbar and Ybar are the same thing, don’t let that trip you up in Imath.
- In these equations, “n” refers to the sample size
  - If n is at least 30, then the sampling distribution of x bar is approximately normal
  - CS: The population being skewed doesn’t mean the sampling distribution isn’t normal
  - BUT if the population is normal, the sampling distribution is always normal
  - When you have a normal population, the sample size doesn’t matter
- When Imath asks “large” or “small” ⟶ larger or smaller than 30, refers to the 30 rule
- CS: Drawing pictures will help you solve these problems because visualizing it can often eliminate silly mistakes or misunderstandings.
- If a probability is considered “small” (less than 5%), there will ALWAYS BE A CONTRADICTION WITH THE ORIGINAL PARAMETER
  - If there is a contradiction, the contradiction would be with μ, not x bar.
  - All of these conjectures will have to do with μ
- A 0 on the end of a decimal means that it was rounded.
  - CS: So if you add a 0 on the end to get it to the “round to 4 decimal places” requirement, that is technically wrong.
- There is a challenge in differentiating between μ and x bar.
  - Be careful mixing these up because it affects all of your calculations
- Standard error is the standard deviation of the sampling distribution.
- Sometimes these stories require you to use just logic, not always numbers.
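The two x bar formulas (mu of x bar = mu, sigma of x bar = sigma over the square root of n) combine with the 5% rule as in the sketch below. The numbers μ = 67, σ = 5, n = 36, and the sample mean of 69 are made up for illustration, and the function names are hypothetical. The code assumes the 30 rule is met so the normal curve applies.

```python
import math
from statistics import NormalDist

def xbar_distribution(mu, sigma, n):
    """Mean and standard error of the sampling distribution of x bar."""
    return mu, sigma / math.sqrt(n)

def prob_xbar_above(x_bar, mu, sigma, n):
    """P(sample mean > x_bar), assuming n >= 30 or a normal population."""
    mean_xbar, standard_error = xbar_distribution(mu, sigma, n)
    z = (x_bar - mean_xbar) / standard_error
    return 1 - NormalDist().cdf(z)

mu, sigma, n = 67, 5, 36                  # hypothetical claimed parameters
mean_xbar, se = xbar_distribution(mu, sigma, n)
p = prob_xbar_above(69, mu, sigma, n)     # hypothetical sample mean of 69

print(mean_xbar, round(se, 4))
print(round(p, 4))
if p < 0.05:
    # 5% rule: the sample value contradicts the claimed mu.
    print("less than 5%: contradicts the originally claimed parameter")
```

Note that the contradiction is stated about μ, not about x bar, which matches the notes.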

3/6/19
- Confidence interval (of the proportion)- takes a statistic (p hat) and makes an estimate about p
  - Why does this matter? ⟶ The confidence interval basically means I can be ___% confident, based on the sample (knowing nothing about the overall population), that a value/parameter of the population is between these two values. You can use a sample to learn about a population; that is important because that is what most of statistics is about: using a sample to learn about an overall population.
- Sampling distribution for the proportion
  - Given the 10 rule is met, there is a normal curve
- If you want to have . . . .
  - 90% confidence ⟶ Z = 1.645
  - 95% confidence ⟶ Z = 1.960
  - 99% confidence ⟶ Z = 2.576
- When you take sample values and make some kind of statement about the population, there has to be context. This statement is an inference.
- CS: You don’t have to convert the bounds back into percentages, but it reads better that way.
- A larger confidence level does mean more accuracy but will make the interval broader (less precise). In order to maintain the accuracy but also be more precise we would need a larger sample.
- The story questions surrounding confidence intervals will ALWAYS give you the percentage to calculate
- Changing terminology in this section’s story problems
  - Conjecture = alternative hypothesis (Ha)
  - Initial statement about a parameter = null hypothesis (H0)
  - 5% rule = Decision Rule (five percent doesn’t apply to all the stories, this changes)
    ⟶ We will accept the alternative hypothesis when the p-value (probability) is less than α.
  - P value = the probability of getting a sample value at least as extreme as the one observed when the null hypothesis is true
  - α = .01, .05, or .10 (depending on the story)
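A confidence interval for a proportion can be sketched as below. The notes above give the Z values but not the interval formula, so the standard one is assumed here: p hat ± Z × sqrt(p hat (1 - p hat) / n). The sample of 120 out of 400 and the function name are made up for illustration.

```python
import math

# Z values for each confidence level, as listed in the notes.
Z_STAR = {90: 1.645, 95: 1.960, 99: 2.576}

def proportion_confidence_interval(successes, n, confidence=95):
    """Interval p_hat +/- Z * standard error (standard formula, assumed)."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = Z_STAR[confidence] * standard_error
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 of 400 have the attribute, so p hat = 0.30.
low, high = proportion_confidence_interval(120, 400, confidence=95)
print(round(low, 4), round(high, 4))

# Higher confidence gives a broader (less precise) interval.
for level in (90, 95, 99):
    lo, hi = proportion_confidence_interval(120, 400, confidence=level)
    print(level, round(hi - lo, 4))
```

With context, the 95% result would read: we can be 95% confident that the population proportion is between about 25.5% and 34.5%.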

3/8/19
- Validity Requirements
  - Random sample
  - n * p hat ≥ 10
  - n * (1 - p hat) ≥ 10
- If the denominator of a fraction (ex: standard error) gets smaller, the value increases.
  - Called an “inverse relationship”
- Significance level = α
  - Usually, α = .01, α = .05, or α = .10
- Conjectures
  - Accepting the alternative is the same thing as rejecting the null.
  - Rejecting the null is the same thing as accepting the alternative
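The validity requirements and the decision rule can be put together in a small sketch. The notes do not give the formula for the p-value, so the usual one-proportion z calculation is assumed here (standard error computed from the null value). The null value of 0.50, the sample of 116 out of 200, and all function names are made up for illustration.

```python
import math
from statistics import NormalDist

def validity_met(p_hat, n):
    """Check the two numeric requirements. Whether the sample is random
    cannot be checked from the numbers and has to come from the story."""
    return n * p_hat >= 10 and n * (1 - p_hat) >= 10

def upper_tail_p_value(p_hat, p_null, n):
    """P(sample proportion >= p_hat) if the null hypothesis is true.
    Assumes the standard one-proportion z calculation."""
    standard_error = math.sqrt(p_null * (1 - p_null) / n)
    z = (p_hat - p_null) / standard_error
    return 1 - NormalDist().cdf(z)

def decide(p_value, alpha):
    """Decision rule: support the conjecture when p-value < alpha."""
    if p_value < alpha:
        return "support the conjecture"
    return "do not support the conjecture"

# Hypothetical story: H0 says p = 0.50, the conjecture (Ha) says p > 0.50.
n, p_hat, p_null = 200, 116 / 200, 0.50
print(validity_met(p_hat, n))
p_value = upper_tail_p_value(p_hat, p_null, n)
print(round(p_value, 4))
for alpha in (0.01, 0.05, 0.10):
    print(alpha, decide(p_value, alpha))
```

The same sample can support the conjecture at one α and not at another, which is why the story has to state the significance level.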

3/18/19
- Decision Rule:
  - When the p-value (probability) is small (less than α) we can support the conjecture.
    ⟶ The sample value doesn’t have to equal the population value (or parameter) because there is a cer...

