Data 8 Tutoring Solutions for Week 14 Worksheet 2 PDF

Title	Data 8 Tutoring Solutions for Week 14 Worksheet 2
Course	Introduction to Data Science
Institution	University of California, Berkeley
Pages	13
File Size	736.7 KB
File Type	PDF

Summary

This is the final review worksheet/practice exam provided by the tutoring session for Data 8. It has questions as well as answers. These are only available to students who took the weekly tutoring session. ...

Description

Tutoring Week 14: Final Review Welcome to the final week of Data 8 tutoring—congratulations! The tutors have enjoyed working with you this semester and hope you’ve found this course useful, and we hope you continue to be involved in data science at Berkeley in the future. Table of Contents 1……Expressions and Probability 2……Distributions, Centers, Sample Means 3……Bootstrapping and Confidence Intervals 4……Correlation, Regression, Least Squares, Residuals 5……Classification This week’s worksheet is quite long, so we recommend examining the table of contents above and prioritizing the sections you’d like more practice on. Great resources along with this worksheet are ● Past semesters’ exams: http://data8.org/fa17/resources.html ● General and topical review sessions during RRR week: https://piazza.com/class/j6axyxfljct5w3?cid=1569

1  Expressions and Probability

Review of the most important table and array functions: http://data8.org/fa17/resources.html

1) We have a table called johnsmith containing all of John Smith’s family members, along with the field they work in, their position in the job, and their salary.

Field	Name	Position	Salary
Journalism	John Smith	Editor	113,203
Music	Betty Smith	Singer	52,492
Medicine	Mike Smith	Doctor	294,291
… (97 rows omitted)

 Write a line of code to compute the described values. a. The total a mount o f s alary p aid t o e veryone i n J ohn’s family np.sum(johnsmith.column(“Salary”)) b. The name o f t he f ifth h ighest p aid f amily member johnsmith.sort(“Salary”, descending=True).column(“Name”).item(4) 

c. Multiple members o f t he S  mith f amily a re i n t he f ield o f J ournalism, m  ultiple m  embers a re i n the field of   Music,   and   so on. How many family members are in the field with the most people in it? max(johnsmith.group(“Field”).column(“count”)) d. Surprise! Everybody g ets a  1 0% r aise. W  hat i s t he a verage s alary o f t he f amily m  embers now? np.average(johnsmith.column(“Salary”) * (  1.1)) Bayes’ Rule P (Event A given Event B ) =

P (Event A and Event B) P (Event B )

2) In a population, 66% of the population are in Class X and the remaining 34% are in Class Y. For people in Class X, a classifier has an accuracy of 81% (that is, among Class X people, 81% are classified as Class X and 19% as Class Y). For people in Class Y, the accuracy of the classifier is 90%. One person is picked at random from the population.

a. What is the chance that the person is classified correctly?
P(Correct) = 0.66 * 0.81 + 0.34 * 0.9 = 0.8406

b. Given that the person is classified correctly, what is the chance that the person is in Class Y?
P(Class Y given Correct) = 0.34 * 0.9 / 0.8406 ≈ 0.364

c. If the person was in Class X, what is the chance that the person is classified correctly?
P(Correct given Class X) = 0.81
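The arithmetic in question 2 can be checked with a few lines of plain Python. This is only a sketch; the variable names (p_x, acc_x, and so on) are made up for illustration and do not come from the worksheet.

```python
# Class proportions and per-class classifier accuracy, taken from the question.
p_x, p_y = 0.66, 0.34
acc_x, acc_y = 0.81, 0.90

# a. Total probability of a correct classification.
p_correct = p_x * acc_x + p_y * acc_y

# b. Bayes' rule: P(Y given Correct) = P(Y and Correct) / P(Correct).
p_y_given_correct = (p_y * acc_y) / p_correct

# c. Conditioning on Class X simply reads off the Class X accuracy.
p_correct_given_x = acc_x

print(round(p_correct, 4))
print(round(p_y_given_correct, 3))
print(p_correct_given_x)
```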

2  Distributions, Centers, Sample Means

Summary Statistic	Code	Explanation	Example (Data: 4, 3, 1, 5, 6)
Mean	np.mean()	Summing all values and dividing by the number of values there are	(4 + 3 + 1 + 5 + 6) / 5; Mean = 3.8
Median	np.median()	Ordering all the values and finding the value in the middle	Ordered: 1, 3, 4, 5, 6; Median = 4
Standard Deviation	np.std()	Root mean square of deviations from average	(((4-3.8)^2 + (3-3.8)^2 + (1-3.8)^2 + (5-3.8)^2 + (6-3.8)^2) / 5)^(1/2); Standard Deviation = 1.72
Variance	np.std()**2	By definition, the variance is equal to the standard deviation squared	1.72^2; Variance = 2.96
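The example column can be reproduced directly with numpy. This is a small sketch using the same five data values; note that np.std divides by the number of values (not by one less), which matches the "root mean square of deviations" definition used here.

```python
import numpy as np

data = np.array([4, 3, 1, 5, 6])

mean = np.mean(data)            # sum of values / number of values
median = np.median(data)        # middle value after sorting
sd = np.std(data)               # root mean square of deviations from the mean
variance = np.std(data) ** 2    # SD squared

print(mean, median, round(sd, 2), round(variance, 2))
```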

1 https://www.inferentialthinking.com/chapters/12/5/variability-of-the-sample-mean.html

Distributions 1) Using the histogram, match the term to its value. Mean 7 .9 Median 7.0 Standard Deviation 4.3 Variance 18.6

7.9 18.6 4.3 7.0

2) Using the histograms, calculate each quantity below. If you cannot determine the answer with the information given, write Unknown.

a. The percentage of fathers who are at least 66 inches but less than 72 inches.
(66-72): 2 inches * (7 + 15 + 15) %/inch = 74%

b. The percentage of mothers who are at least 60 and less than 65.
Unknown. We cannot tell how heights are distributed within a bin.

c. The number of fathers who are at least 64 inches tall.
(100% - 2 inches * 1 %/inch) * 200 fathers = 98% * 200 fathers = 196 fathers

Chebyshev’s Bounds

Standard Units ● Given a normal   distribution, standard units tell you how many standard   deviations   above or below the average (typically within -5 and 5 standard units) a value is. ● Formula for conversion into standard units def standard_units(numbers_array): return ( numbers_array -  np.mean  (numbers_array  )) /  np.std  (numbers_array) Variability of the Sample Mean:

3) Circle which bound is a Chebyshev bound and box which bound is a normal distribution. Explain your reasoning.

Range	Chebyshev	Normal
Average ± 1.5 SD	55.56%	86.6%
Average ± 2.5 SD	84%	98.8%
Average ± 3.5 SD	91.84%	99.9999%

Reasoning: Chebyshev’s bound, 1 - 1/z^2, is a minimum that holds for every distribution, so in each pair it is the smaller percentage. The normal distribution concentrates more of its values near the average, so its percentage is the larger one.

4) Use Chebyshev’s bounds to calculate the proportion of values that are within 3 standard deviations of the mean for any given distribution.
At least 1 - 1/3^2 = 1 - 1/9 ≈ 0.889

Chebyshev’s Bounds
5) You want to buy a Toyota Prius so you can drive to Costco. You conduct a survey of used Prius prices. You manage to record the prices of a random sample of 400 cars, and you see that this sample has a mean of $4500 and a standard deviation of $100 (footnote 2).

a. By Chebyshev’s bounds, what is the minimum number of cars between $4200 and $4800 in your sample?
($4200, $4800) = (mean - 3 * SD, mean + 3 * SD). By Chebyshev’s, at least (1 - 1/9) * 400 ≈ 355.6 cars must be in this range. Since a count is a whole number, that means at least 356 cars.
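Chebyshev’s bound is simple enough to tabulate. The sketch below uses a hypothetical helper name, chebyshev_lower_bound, to compute 1 - 1/z^2 for the ranges used in this section and then applies it to the 400-car sample, rounding up because a count of cars must be a whole number.

```python
import math

def chebyshev_lower_bound(z):
    """Minimum proportion of values within z SDs of the mean (useful for z > 1)."""
    return 1 - 1 / z ** 2

# Bounds for the ranges that appear in this section.
for z in (1.5, 2.5, 3, 3.5):
    print(z, round(chebyshev_lower_bound(z), 4))

# Question 5a: $4200 to $4800 is mean +/- 3 SD in a sample of 400 cars.
sample_size = 400
min_cars = math.ceil(chebyshev_lower_bound(3) * sample_size)
print(min_cars)
```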

2 5c and 5d omitted from final version of worksheet

b. Can we c ompute t he p roportion o f u sed P  riuses b etween $ 4200 a nd $ 4800 i n t he p  opulation? No. Since   we   only   have   the   SD   of the sample and NOT the original population,   we can’t draw any strict conclusions about the population distribution. We   estimate   that   ~88.8%   of the population is in this range, but due to the error in the sample mean   and   SD,   we’re unsure.

3  Bootstrapping and Confidence Intervals

1) Understanding confidence intervals

Code Skeleton for Bootstrapping Confidence Intervals
    results = make_array()
    resamples = 10000  # Or however many desired
    for i in np.arange(resamples):
        sample = Table.sample().column("Column Title")
        test_stat = ...  # Compute test statistic from sample table
        results = np.append(results, test_stat)
    # Compute confidence interval from results
    left_end = percentile(2.5, results)
    right_end = percentile(97.5, results)
    confidence_interval = (left_end, right_end)

Note: Confidence intervals don’t make conclusions about where the true population parameter lies, but where the predictions of the parameter based on the bootstrapped regression lines (which are based on the sample) lie. This is a common misconception! E.g. a 95% CI means that 95 out of 100 times you do a regression on the sample, the resulting regression line’s prediction of the parameter will fall in the CI. Thus the “confidence” isn’t in the true parameter, but in your prediction process.

a. Using the code above, what percent is our confidence interval? What does our confidence interval represent?
This is a 95% confidence interval. We have 95% “confidence” in the process (bootstrap) of estimating the parameter that has resulted in the range we constructed. This interval represents a range in which we believe the true parameter lies.

b. How many times would we expect our interval to contain the true parameter if we repeated this process many times?
95% of the time. If we were to repeat this process many times and construct a confidence interval for each time we repeated the process, we expect 95% of these confidence intervals to contain the true parameter.

c. What is t he r elationship b etween c onfidence i ntervals a nd p -values? H  ow w  ould s omeone use confidence intervals to decide to reject or fail to reject the null   hypothesis? When making determination using confidence intervals, using a 95% confidence interval is associated with a 5% p-value cutoff. The tradeoff is that they typically add up to 100. If we   construct   a  confidence interval of 95% and the true parameter lies OUTSIDE the range, then one can reject the null hypothesis using a 5% p-value cutoff. If   it   lies   inside the range, then one would fail to reject the null hypothesis. d. What p-value c utoff d oes a  9 5% c onfidence i nterval c orrespond t o? Choosing a 95% confidence interval is an arbitrary choice—it is a convention in statistics to use a 5% p-value cutoff, which corresponds with a 95% confidence interval. Bootstrapping 2) Suppose   we want to know the average distance an incoming student lives from Berkeley. To   solve   our   problem,   we take a random sample of 10 students from   the population of 100 students called distances  . Compute a 95% confidence interval for the parameter. a. Using bootstrapping results = m  ake_array() resamples = 1  0000 for i in   np.arange(r  esamples): sample = distances.select(‘Distance’).sample().column(0)

test_stat = np.mean(sample) results = np.append(results, test_stat) left_end = p  ercentile(2.5, results) right_end = p  ercentile(97.5, results) CI = (  left_end, right_end) b.   Using normal distribution   *  upper_bound = n  p.mean(distances.column(‘Distance’)) + 2    np.std(distances.column(‘Distance’)) lower_bound = n  p.mean(distances.column(‘Distance’)) - 2   *     np.std(distances.column(‘Distance’)) /  np.sqrt(100) CI = (  lower_bound, upper_bound)

Writing Null and Alternative Hypotheses Null: The key to a complete and correct null hypothesis is stating that 1) the theoretical probability distribution holds, and 2) any observed variation from this distribution is just due to random chance. Note: B  e s ure t o  i nclude  t he  q  uestion’s c ontext  w  hen  w  riting y our a  nswer. E  .g. i f t he q  uestion  i s  d  ice, y our n  ull h  ypothesis  s hould i nclude  t hat:  1  ) A  ll  s ides o f t he  d  ice  a  re w  eighted  e qually, about a and  2) A  ny o bserved v ariation f rom  t his d  istribution i s  due t o  r andom c hance. Alternative: The observed event occurred under a different probability distribution. A/B Testing 3) We   have   a sample of Berkeley students in a table students   with their height and information about whether or   not they are enrolled in Data 8t. We want to know if Data 8  students   are taller, on average, than non-Data 8 students. a.   What are our null and alternative hypotheses? Null: T  here is no difference between the average height of Data 8 and non-Data 8  students,   and   any variation from a difference of 0 is   due   to random chance. Alternative: Data 8 students are sooo tall! What should we use as our test statistic? mean height   of Data 8 students - mean height of non-Data 8 students b.   Fill   in   the   missing   lines of code to simulate an A/B test of our experiment. 
Note: When grouping a column of True/False values, the resulting table has two rows, with the items in row 0 being False, and row 1 being True. Also, consider using the sample function, Table.sample(#, with_replacement=). By default, # = number of rows in the table and with_replacement=True.

    def test_stat(table):
        # Row 0 is False (non-Data 8), row 1 is True (Data 8)
        means = table.group("DS8", np.mean)
        diffs = means.column("Height mean").item(1) - means.column("Height mean").item(0)
        return diffs

    stats = make_array()
    for i in np.arange(1000):
        # Shuffle without replacement so each height and each label is used exactly once
        simulated_heights = students.select('Height').sample(with_replacement=False)
        simulated_data8 = students.select('DS8').sample(with_replacement=False)
        # Keep the label column named 'DS8' so test_stat also works on students
        simulated_outcomes = Table().with_columns(
            "Height", simulated_heights.column(0),
            "DS8", simulated_data8.column(0))
        simulated_stat = test_stat(simulated_outcomes)
        stats = np.append(stats, simulated_stat)

c. Compute the empirical p-value:
    observed = test_stat(students)
    p = np.count_nonzero(stats >= observed) / 1000

d. Do you think Data 8 students are taller?
We don’t have real data, this is just for fun :-)

4  Correlation, Regression, Least Squares, Residuals

Correlation
● Association is any type of relationship between two variables
● Correlation is a specific type of association where the relationship between two variables is linear

We use the correlation coefficient (r) to measure the strength of a linear relationship between two variables. r is a number between -1 and 1, and is calculated as the mean of the product of two variables in standard units.
    r = np.mean(standard_units(x) * standard_units(y))
Changing the units of a variable does not affect the r-value because the variables are standardized when computing the correlation coefficient. The correlation coefficient has no units and is unaffected by the order of the variables.

Residuals
residual = y − fitted value of y = y − height of regression line at x

Root Mean Squared Error
    RMSE = np.sqrt(np.mean((observed - predicted) ** 2))  # Code
    RMSE = sqrt(((observed_1 - predicted_1)^2 + (observed_2 - predicted_2)^2 + …) / length of observed)  # Algebraic
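The properties of r stated above (no units, unaffected by the order of the variables) can be checked numerically. In this sketch the arrays x and y are made-up values, and correlation and rmse are illustrative helper names wrapping the two formulas from this section.

```python
import numpy as np

def standard_units(numbers_array):
    """Convert an array to standard units."""
    return (numbers_array - np.mean(numbers_array)) / np.std(numbers_array)

def correlation(x, y):
    """Mean of the product of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return np.sqrt(np.mean((observed - predicted) ** 2))

# Hypothetical data, not from the worksheet.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = correlation(x, y)
r_rescaled = correlation(x * 2.54, y)   # change the units of x
r_swapped = correlation(y, x)           # swap the order of the variables

print(round(r, 4), round(r_rescaled, 4), round(r_swapped, 4))
print(rmse(y, np.full_like(y, np.mean(y))))  # RMSE of predicting the mean everywhere
```

All three correlations print the same value, and the RMSE of always predicting the mean of y equals the SD of y.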

  0.8,   can   we   assume   the   two   variables   have   a 1) If we have two variables with correlation coefficient  r  = linear association? No. r measures   the strength of   a  linear   association, if it   exists.   We   can   still   compute   r  for two   variables    prove   a  linear  that don’t have a linear association using the given formula, but that r wouldn’t relationship exists. 2) In the right graph, is the regression line overestimating, underestimating, or accurately estimating the trend? If it is overestimating or underestimating, what can we do to the slope to make it more accurate? The line underestimates most points. We can fix it by increasing slope.

3) The following table, movie, depicts the number of movies a person has watched in the last week, along with a number from 0 to 10 that quantifies their satisfaction.

a) What are the steps to find the equation of the regression line?
● Find r
● Multiply ratio of SDy to SDx with r to find the slope
● Find the y-intercept using average of y − slope * average of x
● Plug into the equation of a line, y = mx + b

r ≈ 0.023, m ≈ 0.466, b ≈ 6.922
y = 0.466x + 6.922
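The four steps above translate almost line for line into numpy. The movie table itself is not reproduced here, so the two arrays in this sketch are made-up values and will not give the worksheet’s numbers; fit_line is a hypothetical helper name.

```python
import numpy as np

def fit_line(x, y):
    """Return (r, slope, intercept) of the regression line for predicting y from x."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    r = np.mean(x_su * y_su)                       # step 1: find r
    slope = r * np.std(y) / np.std(x)              # step 2: r * (SD of y / SD of x)
    intercept = np.mean(y) - slope * np.mean(x)    # step 3: avg of y - slope * avg of x
    return r, slope, intercept

# Hypothetical data standing in for the movie table.
movies = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 7.0])
satisfaction = np.array([6.0, 8.0, 5.0, 9.0, 7.0, 10.0])

r, slope, intercept = fit_line(movies, satisfaction)

# Step 4: plug into y = mx + b, then compute residuals (observed - fitted).
fitted = slope * movies + intercept
residuals = satisfaction - fitted

print(round(r, 3), round(slope, 3), round(intercept, 3))
print(round(abs(residuals.sum()), 10))
```

The residuals of a regression line sum to zero (up to floating-point rounding), which the last line prints.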

b) How would you calculate the residuals for each point?
For each pair (Movies Watched m, Satisfaction s), residual = s - (slope * m + b).
No need for solutions; just make sure they know how to calculate residuals!

c) What do the residuals sum to?
0

Linear Regression Linear regression   is   one   of   the ways we can use to predict an output with   a given   input.  We focus on two equations:  = r  * x SU  ) ● Standard units (y SU (estimate of y − average of y) (the given x − average of x) ● Original units  ( =  r * SD of x SD of y In the following graphs, visually, you can see that both equations are identical (you can derive this).

We end   up   using   the   original units equation more often than not, since   we   do not have to     of   standard   units.    When   doing so, re...