Data 8 Midterm Study Guide (Autosaved) PDF

Title	Data 8 Midterm Study Guide (Autosaved)
Author	Elle Baker
Course	Introduction to Data Science
Institution	University of California, Berkeley
Pages	4


Summary

Professor Adhikari...

Description

Data 8 Midterm Study Guide

Lecture 1: Learning about the world from data using computation
• Exploration
  o Identifying patterns
  o Visualization
• Inference
  o Drawing reliable conclusions about the world
  o Statistics
• Prediction
  o Informed guesses about unobserved data
  o Machine learning

Lecture 2: Cause and Effect
Association: any kind of link or relation.
Causation: leading to the outcome.
We usually care about causation when we make decisions. It involves counterfactual reasoning: thinking about things that could have happened but didn't.
John Snow: theorized that cholera came from water, not bad air. He collected data and made visualizations and maps.
Key to establishing causality: if the treatment and control groups are similar apart from the treatment, then differences between the outcomes in the two groups can be ascribed to the treatment.
If you assign the treatment and control groups at random, they are more likely to be similar apart from the treatment.

Lecture 3
Assignment statements (e.g. hours_per_week = 24*7) don't have a value; they just perform an action. An assignment changes the meaning of the name to the left of the equals sign. The name is thus bound to a value, not to an equation.
Python doesn't auto-update: you have to reassign values as you go along.
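The point about names being bound to values rather than equations can be checked with a minimal sketch. The names hours_per_day and stale_value are made up for illustration and do not come from the lecture.

```python
# An assignment binds a name to a value; it does not set up an equation.
hours_per_day = 24
hours_per_week = hours_per_day * 7   # right side is evaluated once, now: 168

# Rebinding hours_per_day does not touch hours_per_week.
hours_per_day = 12
stale_value = hours_per_week         # still 168, Python did not auto-update

# To pick up the change, run the assignment again.
hours_per_week = hours_per_day * 7   # now 84
```

After the second assignment to hours_per_day, the old result stays in place until hours_per_week is reassigned explicitly.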

Lecture 4: Numbers and Strings, Arrays
Int: integer.
Float: a number with a fractional part (decimal point).
There can be some loss of precision when you store floats, but in this class it shouldn't throw you off.
You can convert a float to an int with int(), e.g. int(10 / 5).
Concatenation: piecing strings together with a plus sign.
You can't multiply a string by a float.

Lecture 5
Minard: 5 scale map

Lecture 19
Observed significance level, or p-value: the chance, under the null hypothesis, that the test statistic is equal to the value observed in the data or is even further in the direction that supports the alternative.
Small values of the statistic support the alternative hypothesis. If the mass is to the right, that favors the alternative.
Figure out how the p-value will be computed before you do any kind of simulation.

A/B Testing
Comparing two samples that are a little different: figure out how they differ.
A: non-smokers
B: smokers
Statistic: difference between average weights, B - A.
So a negative statistic supports the alternative hypothesis (smokers have lighter babies).

Lecture 22

Figure out the viewpoints that the question wants. The alternative will be the opposing viewpoint. The test statistic should help me decide between the null and the alternative: what kinds of values will make me lean towards the alternative?

This bus has a 70% chance of being late. It could be not true, or in either direction.
Data: watch the bus for 200 days.
Null: the bus has a 70% chance of being late. The null can't be "it's due to chance" because we were given a very specific percentage to test.
Alternative: the chance of "late" is more than 70%.
Test statistic (any of these works): % late - 70, number of days late - 140, number of days late, % of days late.
A positive value of the first two statistics (or a large value of the last two) supports the alternative.
For the p-value, the direction that supports the alternative: look right!

This bus has a 70% chance of being late. It's not late that often!
Data: watch the bus for 200 days.
Null: the bus has a 70% chance of being late. The null can't be "it's due to chance" because we were given a very specific percentage to test.
Alternative: the chance of "late" is less than 70%.
Test statistic (any of these works): % late - 70, number of days late - 140, number of days late, % of days late.
A negative value of the first two statistics (or a small value of the last two) supports the alternative.
For the p-value, the direction that supports the alternative: look left!

This bus has a 70% chance of being late. That's not true!
Data: watch the bus for 200 days.
Null: the bus has a 70% chance of being late. The null can't be "it's due to chance" because we were given a very specific percentage to test.
Alternative: the chance of "late" is not 70%.
Test statistic: abs(number of days late - 140) or abs(% days late - 70).
For the p-value, the direction that supports the alternative: larger distances from 0.
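    from datascience import *
    import numpy as np

    # Null model: late with chance 0.7, not late with chance 0.3
    null_proportions = make_array(0.7, 0.3)

    def distance_under_null():
        # Simulate 200 days under the null; take the proportion of late days
        proportion_late = sample_proportions(200, null_proportions).item(0)
        return abs(proportion_late - 0.7)

    distances = make_array()
    for i in np.arange(10000):
        distances = np.append(distances, distance_under_null())

    # The column label 'Distance' is assumed; the original call was cut off
    distance_tbl = Table().with_column('Distance', distances)
    observed_statistic = abs(150/200 - 0.7)
    empirical_p = np.count_nonzero(distances >= observed_statistic) / 10000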
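The same simulation can cover all three alternatives from the bus example (more than 70%, less than 70%, not 70%). The sketch below is an illustration only: it uses Python's standard library instead of the datascience library, assumes an observed 150 late days out of 200, and the function names and the seed parameter are made up. It compares integer counts of late days rather than proportions.

```python
import random

def simulate_late_days(rng, days=200, p_late=0.7):
    """Count the late days in one simulated run of `days` days under the null."""
    return sum(rng.random() < p_late for _ in range(days))

def empirical_p_values(observed_late, days=200, p_late=0.7,
                       repetitions=10000, seed=0):
    """Empirical p-values for the three alternatives, from one set of simulations."""
    rng = random.Random(seed)            # fixed seed so reruns give the same answer
    expected = round(days * p_late)      # 140 late days expected under the null
    counts = [simulate_late_days(rng, days, p_late) for _ in range(repetitions)]
    observed_distance = abs(observed_late - expected)
    return {
        # alternative "more than 70%": look right
        "more_than_70": sum(c >= observed_late for c in counts) / repetitions,
        # alternative "less than 70%": look left
        "less_than_70": sum(c <= observed_late for c in counts) / repetitions,
        # alternative "not 70%": distance from 140 at least as large as observed
        "not_70": sum(abs(c - expected) >= observed_distance
                      for c in counts) / repetitions,
    }

p_values = empirical_p_values(observed_late=150)
```

Working with counts is a design choice tied to the Lecture 4 note about floats: in floating point, abs(130/200 - 0.7) evaluates slightly smaller than abs(150/200 - 0.7), so a run with 130 late days would not be counted as "at least as far" when proportions are compared with >=. With integers, abs(130 - 140) and abs(150 - 140) are both exactly 10.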

p-value: it is the chance, assuming that the bus is late 70% of the time, that we get a statistic that is 0.05 or greater....
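As a cross-check on the simulated value, the same chance can be computed exactly. This is not the simulation approach used in the notes: it is a sketch that assumes the number of late days in 200 follows a binomial distribution with chance 0.7 under the null, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k late days out of n when each day is late with chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_two_sided_p(observed_late=150, days=200, p_late=0.7):
    """Chance under the null of a distance from 140 at least as large as observed."""
    expected = round(days * p_late)                  # 140
    observed_distance = abs(observed_late - expected)
    return sum(
        binomial_pmf(k, days, p_late)
        for k in range(days + 1)
        if abs(k - expected) >= observed_distance    # counts of 150+ or 130 and below
    )

exact_p = exact_two_sided_p()
```

The empirical p-value from 10,000 simulations should land close to exact_p, though not exactly on it, because the simulation is random.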