Title | Data 8 Tutoring Solutions for Week 14 Worksheet 2 |
---|---|
Course | Introduction to Data Science |
Institution | University of California, Berkeley |
This is the final review worksheet/practice exam provided by the tutoring session for Data 8. It has questions as well as answers. These are only available to students who took the weekly tutoring session. ...
Tutoring Week 14: Final Review

Welcome to the final week of Data 8 tutoring. Congratulations! The tutors have enjoyed working with you this semester and hope you've found this course useful, and we hope you continue to be involved in data science at Berkeley in the future.

Table of Contents
1 Expressions and Probability
2 Distributions, Centers, Sample Means
3 Bootstrapping and Confidence Intervals
4 Correlation, Regression, Least Squares, Residuals
5 Classification

This week's worksheet is quite long, so we recommend examining the table of contents above and prioritizing the sections you'd like more practice on. Great resources along with this worksheet are:
● Past semesters' exams: http://data8.org/fa17/resources.html
● General and topical review sessions during RRR week: https://piazza.com/class/j6axyxfljct5w3?cid=1569
1 Expressions and Probability

Review of the most important table and array functions: http://data8.org/fa17/resources.html

1) We have a table called johnsmith containing all of John Smith's family members, along with the field they work in, their position in the job, and their salary.

Field | Name | Position | Salary |
---|---|---|---|
Journalism | John Smith | Editor | 113,203 |
Music | Betty Smith | Singer | 52,492 |
Medicine | Mike Smith | Doctor | 294,291 |

… (97 rows omitted)
Write a line of code to compute the described values.

a. The total amount of salary paid to everyone in John's family
    np.sum(johnsmith.column("Salary"))

b. The name of the fifth highest paid family member
    johnsmith.sort("Salary", descending=True).column("Name").item(4)

c. Multiple members of the Smith family are in the field of Journalism, multiple members are in the field of Music, and so on. How many family members are in the field with the most people in it?
    max(johnsmith.group("Field").column("count"))

d. Surprise! Everybody gets a 10% raise. What is the average salary of the family members now?
    np.average(johnsmith.column("Salary") * 1.1)

Bayes' Rule

$$P(\text{Event A given Event B}) = \frac{P(\text{Event A and Event B})}{P(\text{Event B})}$$
2) In a population, 66% of the population are in Class X and the remaining 34% are in Class Y. For people in Class X, a classifier has an accuracy of 81% (that is, among Class X people, 81% are classified as Class X and 19% as Class Y). For people in Class Y, the accuracy of the classifier is 90%. One person is picked at random from the population.

a. What is the chance that the person is classified correctly?
P(Correct) = 0.66 * 0.81 + 0.34 * 0.9 = 0.8406

b. Given that the person is classified correctly, what is the chance that the person is in Class Y?
P(Class Y given Correct) = P(Class Y and Correct) / P(Correct) = 0.34 * 0.9 / 0.8406 ≈ 0.364

c. If the person was in Class X, what is the chance that the person is classified correctly?
P(Correct given Class X) = 0.81
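The arithmetic in question 2 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative and not part of the worksheet, and the numbers are the ones stated in the problem.

```python
# Worked check of question 2 (variable names are illustrative).
p_x, p_y = 0.66, 0.34        # share of the population in Class X and Class Y
acc_x, acc_y = 0.81, 0.90    # classifier accuracy within each class

# Part a: total probability of a correct classification
p_correct = p_x * acc_x + p_y * acc_y

# Part b: Bayes' rule, P(Class Y given Correct) = P(Class Y and Correct) / P(Correct)
p_y_given_correct = (p_y * acc_y) / p_correct

# Part c: the condition is already "Class X", so the answer is just the Class X accuracy
p_correct_given_x = acc_x

print(round(p_correct, 4), round(p_y_given_correct, 3), p_correct_given_x)
```

Part b divides by P(Correct) because the condition "classified correctly" shrinks the set of outcomes we are looking at to the correctly classified people.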
2 Distributions, Centers, Sample Means

Summary Statistic | Code | Explanation | Example (Data: 4, 3, 1, 5, 6) |
---|---|---|---|
Mean | np.mean() | Summing all values and dividing by the number of values there are | (4 + 3 + 1 + 5 + 6) / 5; Mean = 3.8 |
Median | np.median() | Ordering all the values and finding the value in the middle | Ordered: 1, 3, 4, 5, 6; Median = 4 |
Standard Deviation | np.std() | Root mean square of deviations from average | (((4-3.8)^2 + (3-3.8)^2 + (1-3.8)^2 + (5-3.8)^2 + (6-3.8)^2) / 5)^(1/2); Standard Deviation ≈ 1.72 |
Variance | np.std()**2 | By definition, the variance is equal to the standard deviation squared | 1.72^2; Variance = 2.96 |
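The example column can be reproduced without NumPy. The sketch below uses Python's standard `statistics` module as a stand-in, on the assumption that its population versions (`pstdev`, `pvariance`) follow the same divide-by-n convention as the default `np.std()`.

```python
import statistics

data = [4, 3, 1, 5, 6]

mean = statistics.mean(data)            # sum of the values divided by their count
median = statistics.median(data)        # middle value after sorting
sd = statistics.pstdev(data)            # population SD: root mean square of deviations
variance = statistics.pvariance(data)   # population variance, equal to sd ** 2

print(mean, median, round(sd, 2), round(variance, 2))
```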
Footnote 1: https://www.inferentialthinking.com/chapters/12/5/variability-of-the-sample-mean.html
Distributions

1) Using the histogram, match the term to its value.
Values to match: 7.9, 18.6, 4.3, 7.0

Mean: 7.9
Median: 7.0
Standard Deviation: 4.3
Variance: 18.6
2) Using the histograms, calculate each quantity below. If you cannot determine the answer with the information given, write Unknown.

a. The percentage of fathers who are at least 66 inches but less than 72 inches.
(66-72): 2 inches * (7 + 15 + 15) %/inch = 74%

b. The percentage of mothers who are at least 60 and less than 65 inches.
Unknown. We cannot tell how heights are distributed within a bin.

c. The number of fathers who are at least 64 inches tall.
(100% - 2 inches * 1 %/inch) * 200 fathers = 98% * 200 fathers = 196 fathers
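The area principle behind these answers (percent in a bin = bin width * bar height) can be written as a small helper. This is a sketch: the function name is made up for illustration, and the bar heights are the ones quoted in the solutions above.

```python
def percent_in_bins(bin_width, heights_percent_per_unit):
    """Add up bar areas: each bar holds width * height percent of the data."""
    return sum(bin_width * h for h in heights_percent_per_unit)

# Part a: three 2-inch bins with heights 7, 15, and 15 percent per inch
fathers_66_to_72 = percent_in_bins(2, [7, 15, 15])

# Part c: one 2-inch bin below 64 inches with height 1 percent per inch,
# so the remaining percentage of the 200 fathers is at least 64 inches tall
percent_below_64 = percent_in_bins(2, [1])
fathers_at_least_64 = (100 - percent_below_64) / 100 * 200

print(fathers_66_to_72, fathers_at_least_64)
```

Part b has no such calculation because 65 inches falls inside a bin, and the histogram does not show how the heights are spread within a single bar.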
Chebyshev’s Bounds
Standard Units
● Standard units tell you how many standard deviations above or below the average a value is. For a roughly normal distribution, values typically fall between -5 and 5 standard units.
● Formula for conversion into standard units:
    def standard_units(numbers_array):
        return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)

Variability of the Sample Mean:
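3) Circle which bound is a Chebyshev bound and box which bound is a normal distribution. Explain your reasoning.

Range | Chebyshev | Normal |
---|---|---|
Average ± 1.5 SD | 55.56% | 86.6% |
Average ± 2.5 SD | 84% | 98.8% |
Average ± 3.5 SD | 91.84% | 99.95% |

Reasoning: Chebyshev's bound, 1 - 1/z^2, is a minimum that holds for every distribution, so it is the smaller percentage in each pair. The normal percentage applies only to bell-shaped data and is larger.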
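Both columns of question 3 can be recomputed directly. The sketch below assumes the standard identity P(|Z| < z) = erf(z / sqrt(2)) for a normal distribution; the function names are illustrative.

```python
import math

def chebyshev_lower_bound(z):
    """Minimum proportion within z SDs of the mean, for any distribution."""
    return 1 - 1 / z ** 2

def normal_within(z):
    """Proportion within z SDs of the mean for a normal distribution."""
    return math.erf(z / math.sqrt(2))

for z in (1.5, 2.5, 3.5):
    print(z, round(chebyshev_lower_bound(z), 4), round(normal_within(z), 4))
```

For every z the normal proportion is larger, which is how the two kinds of bound can be told apart.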
4) Use Chebyshev's bounds to calculate the proportion of values that are within 3 standard deviations of the mean for any given distribution.
At least 1 - 1/3^2 = 1 - 1/9 ≈ 0.889

Chebyshev's Bounds

5) You want to buy a Toyota Prius so you can drive to Costco. You conduct a survey of used Prius prices. You manage to record the prices of a random sample of 400 cars, and you see that this sample has a mean of $4500 and a standard deviation of $100. (Footnote 2)

a. By Chebyshev's bounds, what is the minimum number of cars between $4200 and $4800 in your sample?
($4200, $4800) = (mean - 3 * SD, mean + 3 * SD). By Chebyshev's bounds, at least (1 - 1/9) * 400 ≈ 355.6 cars must be in this range, so at least 356 cars.

b. Can we compute the proportion of used Priuses between $4200 and $4800 in the population?
No. Since we only have the SD of the sample and NOT the original population, we can't draw any strict conclusions about the population distribution. We estimate that at least ~88.9% of the population is in this range, but due to the error in the sample mean and SD, we're unsure.

Footnote 2: 5c and 5d omitted from final version of worksheet.
3 Bootstrapping and Confidence Intervals

1) Understanding confidence intervals

Code Skeleton for Bootstrapping Confidence Intervals

    results = make_array()
    resamples = 10000  # Or however many desired
    for i in np.arange(resamples):
        # table is your sample table; sample() resamples its rows with replacement
        sample = table.sample().column("Column Title")
        test_stat = ...  # Compute test statistic from sample table
        results = np.append(results, test_stat)
    # Compute confidence interval from results
    left_end = percentile(2.5, results)
    right_end = percentile(97.5, results)
    confidence_interval = (left_end, right_end)

Note: Confidence intervals don't make conclusions about where the true population parameter lies, but where the predictions of the parameter based on the bootstrapped regression lines (which are based on the sample) lie. This is a common misconception! E.g. a 95% CI means that 95 out of 100 times you do a regression on the sample, the resulting regression line's prediction of the parameter will fall in the CI. Thus the "confidence" isn't in the true parameter, but in your prediction process.

a. Using the code above, what percent is our confidence interval? What does our confidence interval represent?
This is a 95% confidence interval. We have 95% "confidence" in the process (bootstrap) of estimating the parameter that has resulted in the range we constructed. The interval is a range in which we believe the true parameter lies.

b. How many times would we expect our interval to contain the true parameter if we repeated this process many times?
95% of the time. If we were to repeat this process many times and construct a confidence interval each time, we expect 95% of these confidence intervals to contain the true parameter.
c. What is the relationship between confidence intervals and p-values? How would someone use confidence intervals to decide to reject or fail to reject the null hypothesis?
When making a determination using confidence intervals, a 95% confidence interval is associated with a 5% p-value cutoff: the confidence level and the cutoff add up to 100%. If we construct a 95% confidence interval and the parameter value specified by the null hypothesis lies OUTSIDE the range, then one can reject the null hypothesis using a 5% p-value cutoff. If it lies inside the range, then one would fail to reject the null hypothesis.

d. What p-value cutoff does a 95% confidence interval correspond to?
Choosing a 95% confidence interval is an arbitrary choice; it is a convention in statistics to use a 5% p-value cutoff, which corresponds with a 95% confidence interval.

Bootstrapping

2) Suppose we want to know the average distance an incoming student lives from Berkeley. To solve our problem, we take a random sample of 10 students from the population of 100 students called distances. Compute a 95% confidence interval for the parameter.

a. Using bootstrapping

    results = make_array()
    resamples = 10000
    for i in np.arange(resamples):
        # Resample the rows with replacement and take the Distance column
        sample = distances.select('Distance').sample().column(0)
        test_stat = np.mean(sample)
        results = np.append(results, test_stat)
    left_end = percentile(2.5, results)
    right_end = percentile(97.5, results)
    CI = (left_end, right_end)

b. Using normal distribution

    # mean ± 2 * SD / sqrt(sample size)
    upper_bound = np.mean(distances.column('Distance')) + 2 * np.std(distances.column('Distance')) / np.sqrt(100)
    lower_bound = np.mean(distances.column('Distance')) - 2 * np.std(distances.column('Distance')) / np.sqrt(100)
    CI = (lower_bound, upper_bound)
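For readers without the `datascience` package, the two approaches in question 2 can be sketched with the standard library alone. Everything here is an assumption made for illustration: the distances are invented, the `percentile` helper only imitates the "smallest value at least as large as p% of the values" idea, and the normal interval divides by the square root of the number of rows in the sample.

```python
import math
import random
import statistics

def percentile(p, values):
    """Smallest value that is at least as large as p percent of the values."""
    ordered = sorted(values)
    k = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[k - 1]

def bootstrap_mean_ci(sample, resamples=5000, seed=0):
    """95% bootstrap interval for the mean: resample with replacement, keep each mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(resamples):
        resample = rng.choices(sample, k=len(sample))  # same size, with replacement
        means.append(statistics.mean(resample))
    return percentile(2.5, means), percentile(97.5, means)

def normal_mean_ci(sample):
    """Approximate 95% interval: mean plus or minus 2 * SD / sqrt(sample size)."""
    center = statistics.mean(sample)
    se = statistics.pstdev(sample) / math.sqrt(len(sample))
    return center - 2 * se, center + 2 * se

# Hypothetical distances in miles, invented purely for illustration
distances = [12, 45, 3, 230, 18, 60, 7, 150, 25, 40]
boot_ci = bootstrap_mean_ci(distances)
norm_ci = normal_mean_ci(distances)
print(boot_ci, norm_ci)
```

With a sample this small and skewed the two intervals can differ noticeably, since the normal approximation is symmetric around the sample mean and the bootstrap interval need not be.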
Writing Null and Alternative Hypotheses

Null: The key to a complete and correct null hypothesis is stating that 1) the theoretical probability distribution holds, and 2) any observed variation from this distribution is just due to random chance.

Note: Be sure to include the question's context when writing your answer. E.g. if the question is about a dice, your null hypothesis should include that: 1) All sides of the dice are weighted equally, and 2) Any observed variation from this distribution is due to random chance.

Alternative: The observed event occurred under a different probability distribution.

A/B Testing

3) We have a sample of Berkeley students in a table students with their height and information about whether or not they are enrolled in Data 8. We want to know if Data 8 students are taller, on average, than non-Data 8 students.

a. What are our null and alternative hypotheses?
Null: There is no difference between the average height of Data 8 and non-Data 8 students, and any variation from a difference of 0 is due to random chance.
Alternative: Data 8 students are taller, on average, than non-Data 8 students.

What should we use as our test statistic?
mean height of Data 8 students - mean height of non-Data 8 students

b. Fill in the missing lines of code to simulate an A/B test of our experiment.

Note: When grouping a column of True/False values, the resulting table has two rows, with the items in row 0 being False, and row 1 being True. Also, consider using the sample function, Table.sample(#, with_replacement=). By default, # = number of rows in the table and with_replacement=True.

    def test_stat(table):
        # Row 0 is the False group (non-Data 8), row 1 is the True group (Data 8)
        means = table.group("DS8", np.mean)
        diffs = means.column("Height mean").item(1) - means.column("Height mean").item(0)
        return diffs

    stats = make_array()
    for i in np.arange(1000):
        # Shuffle without replacement so heights are paired with labels at random
        simulated_heights = students.select('Height').sample(with_replacement=False)
        simulated_data8 = students.select('DS8').sample(with_replacement=False)
        # The label column keeps the name 'DS8' so test_stat also works on students
        simulated_outcomes = Table().with_columns(
            "Height", simulated_heights.column(0),
            "DS8", simulated_data8.column(0))
        simulated_stat = test_stat(simulated_outcomes)
        stats = np.append(stats, simulated_stat)

c. Compute the empirical p-value:

    observed = test_stat(students)
    p = np.count_nonzero(stats >= observed) / 1000

d. Do you think Data 8 students are taller?
We don't have real data, this is just for fun :-)
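The same A/B test can be sketched in plain Python for readers who want to run it without a `students` table. The heights and enrollment flags below are invented for illustration, and the function names are not from the worksheet.

```python
import random
import statistics

def difference_of_means(heights, in_data8):
    """Mean height of the True group minus mean height of the False group."""
    data8 = [h for h, flag in zip(heights, in_data8) if flag]
    others = [h for h, flag in zip(heights, in_data8) if not flag]
    return statistics.mean(data8) - statistics.mean(others)

def permutation_p_value(heights, in_data8, repetitions=2000, seed=0):
    """Share of shuffled statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = difference_of_means(heights, in_data8)
    shuffled = list(heights)
    count = 0
    for _ in range(repetitions):
        rng.shuffle(shuffled)  # break any link between height and label
        if difference_of_means(shuffled, in_data8) >= observed:
            count += 1
    return count / repetitions

# Hypothetical sample, made up for illustration only
heights = [64, 70, 68, 72, 66, 69, 71, 65, 67, 73]
in_data8 = [True, False, True, False, True, False, True, False, True, False]
p_value = permutation_p_value(heights, in_data8)
print(p_value)
```

Only the heights are shuffled here; the labels stay fixed, so both group sizes are the same in every repetition. A large p-value means the shuffled data often produce a difference at least as large as the observed one, so the null hypothesis would not be rejected.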
4 Correlation, Regression, Least Squares, Residuals

Correlation
● Association is any type of relationship between two variables
● Correlation is a specific type of association where the relationship between two variables is linear

We use the correlation coefficient (r) to measure the strength of a linear relationship between two variables. r is a number between -1 and 1, and is calculated as the mean of the product of two variables in standard units.

    r = np.mean(standard_units(x) * standard_units(y))

Changing the units of a variable does not affect the r-value because the variables are standardized when computing the correlation coefficient. The correlation coefficient has no units and is unaffected by the order of the variables.

Residuals

residual = y − fitted value of y = y − height of regression line at x

Root Mean Squared Error

    RMSE = np.sqrt(np.mean((observed - predicted) ** 2))  # Code

$$\text{RMSE} = \sqrt{\frac{(\text{observed}_1 - \text{predicted}_1)^2 + (\text{observed}_2 - \text{predicted}_2)^2 + \dots}{\text{length of observed}}}$$ (Algebraic)
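The definitions of r and RMSE can be tried out on a tiny data set. The sketch below uses the standard library in place of NumPy and made-up numbers; `standard_units`, `correlation`, and `rmse` are illustrative helpers that follow the formulas above.

```python
import math
import statistics

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """r = mean of the products of x and y in standard units."""
    products = [a * b for a, b in zip(standard_units(x), standard_units(y))]
    return statistics.mean(products)

def rmse(observed, predicted):
    """Root of the mean of the squared differences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(statistics.mean(squared))

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_rescaled = correlation([100 * v for v in x], y)  # change of units for x
r_swapped = correlation(y, x)                      # order of the variables swapped
print(round(r, 4), round(r_rescaled, 4), round(r_swapped, 4))
```

All three printed values agree, which illustrates that r has no units and does not depend on which variable is called x.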
1) If we have two variables with correlation coefficient r = 0.8, can we assume the two variables have a linear association?
No. r measures the strength of a linear association, if it exists. We can still compute r for two variables that don't have a linear association using the given formula, but that r wouldn't prove a linear relationship exists.

2) In the right graph, is the regression line overestimating, underestimating, or accurately estimating the trend? If it is overestimating or underestimating, what can we do to the slope to make it more accurate?
The line underestimates most points. We can fix it by increasing the slope.
3) The following table, movie, depicts the number of movies a person has watched in the last week, along with a number from 0 to 10 that quantifies their satisfaction.

a) What are the steps to find the equation of the regression line?
● Find r
● Multiply the ratio of SDy to SDx with r to find the slope
● Find the y-intercept using average of y − slope * average of x
● Plug into the equation of a line, y = mx + b

r ≈ 0.023, m ≈ 0.466, b ≈ 6.922
y = 0.466x + 6.922
b) How would you calculate the residuals for each point?
For each pair (Movies Watched m, Satisfaction s), residual = s - (slope * m + b).
(No need for solutions; just make sure they know how to calculate residuals!)

c) What do the residuals sum to?
0
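The regression steps and the residual calculation from question 3 can be sketched as follows. The movie table itself is not reproduced here, so the (movies watched, satisfaction) pairs are invented and will not give the r, m, and b quoted above; the function names are illustrative.

```python
import statistics

def correlation(x, y):
    """r = mean of the products of x and y in standard units."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return statistics.mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def regression_line(x, y):
    """Slope = r * SDy / SDx; intercept = average of y - slope * average of x."""
    r = correlation(x, y)
    slope = r * statistics.pstdev(y) / statistics.pstdev(x)
    intercept = statistics.mean(y) - slope * statistics.mean(x)
    return slope, intercept

def residuals(x, y):
    """Observed y minus the height of the regression line at each x."""
    slope, intercept = regression_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]

# Hypothetical (movies watched, satisfaction) pairs, made up for illustration
movies = [1, 2, 3, 4, 5]
satisfaction = [2, 4, 5, 4, 5]

slope, intercept = regression_line(movies, satisfaction)
res = residuals(movies, satisfaction)
print(round(slope, 3), round(intercept, 3), round(sum(res), 10))
```

The residuals sum to 0 up to floating-point rounding, matching the answer to part c.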
Linear Regression

Linear regression is one of the ways we can predict an output from a given input. We focus on two equations:
● Standard units: $y_{SU} = r \cdot x_{SU}$
● Original units: $\frac{\text{estimate of } y - \text{average of } y}{\text{SD of } y} = r \cdot \frac{\text{the given } x - \text{average of } x}{\text{SD of } x}$

In the following graphs, visually, you can see that both equations are identical (you can derive this).
We end up using the original units equation more often than not, since we do not have to convert into and out of standard units. When doing so, re...