R-retake exam SS 2021/2022 PDF

Title	R-retake exam SS 2021/2022
Author	zhaoxuan guan
Course	Data analysis and visualization in R
Institution	Technische Universität München
Pages	24


Chair of Bioinformatics and Computational Biology
Department of Informatics
Technical University of Munich

Compliance to the code of conduct: I hereby assure that I solve and submit this exam myself under my own name by only using the allowed tools listed below.

Signature or full name if no pen input available:

Data Analysis and Visualization in R

Exam: IN2339 / Retake
Examiner: Julien Gagneur
Date: Monday 29th June, 2020
Time: 08:15 – 09:45

IN-DataViz-1-20200629-E5115-01

Working instructions

• This exam consists of 24 pages with a total of 11 problems. Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 27 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources:
  – Slides, exercises and notes from the lectures.
  – Other content from the internet.
  – You are not allowed to communicate with anyone except the examiners during the exam and during the oral questioning (one hour following the written exam). Hence, you can consult a forum for an existing post, but you are not allowed to post any question or result on any communication medium (e.g. forum, WhatsApp group, social media, etc.) up to one hour following the exam.
  – You should answer the questions using the knowledge (data analysis and statistical methods) and R packages taught during the lecture, with the sole exception of the dslabs R package, whose datasets will be needed. In this respect, consulting other content from the internet is probably a bad idea, as it may hint towards methods and code that were not taught.
  – The R libraries you can use are: data.table, ggplot2, tidyr, dslabs, magrittr and dplyr. Load them in your R session by running the following code: library(data.table); library(ggplot2); library(tidyr); library(dslabs); library(magrittr); library(dplyr). Be sure to have them already installed using install.packages("data.table"), and so on for each of those libraries.
• Filling the exam
  – Download the pdf to your computer and edit it there. Make sure that your pdf editor supports native text input fields. Check the list of pdf readers in this document: https://tumexam.de/static/handreichung_submissions_students.pdf
  – Do not work with the pdf loaded in a web browser, as it does not save your edits.
  – Answer by typing, no handwriting or sketching. Write into the solution box inside the pdf document.
  – Not all R outputs (e.g. tables or plots) are required, except for the answer to the question. Simply copy the executed code from R to the solution box in the exam. If the question states "justify", provide a short justification in plain English. In this case, only providing the code is not enough.
  – We do not accept any additional files.
  – Some questions are worth one point, others two points. No half-points will be given.
• Interactions with examiners and oral questioning
  – The examiners are reachable during the exam and for the oral questioning via a zoom meeting.
  – The zoom meeting will be open from 8.00 to 11.00.
  – The written exam will start at 8.15 sharp, at which point the exam will be downloadable from TUMexam.
  – Your exam should be uploaded back to TUMexam by 9.50 sharp.
  – Do not switch on your microphone or your camera, and do not share your screen during the written exam.
  – If you have any question during the written exam, you should primarily use the zoom conference chat with direct messages to the examiner.
  – If your zoom connection breaks during the written exam, first try to reconnect. If it keeps failing, you can post questions at [email protected]
  – The oral questioning starts immediately after the written exam (9.45-10.45).
  – The purpose of the oral questioning is to ensure your identity and that you did the exam by yourself. You should be able to explain why you gave a particular answer to a question (i.e. what your reasoning was). It does not matter whether your answer to the question is right or wrong. We only want to make sure that it comes from you. In the oral questioning you will no longer be allowed to consult any document.
  – You are not allowed to communicate with anyone except the examiners during the entire hour reserved for oral questioning, even if you have already been orally questioned.
  – You must be reachable at all times by videoconference during the oral questioning hour. If your zoom connection breaks, immediately inform us at [email protected] and propose an alternative videoconference channel (preferably WhatsApp). We will not store your phone number after the oral questioning.
  – For the oral questioning, switch on the camera and microphone. Give us your matriculation number, first name, and last name as they appear in TUMonline by copy-pasting this information in the chat window. Show your student ID and face. We will then ask you a few questions about your submission, to verify that you wrote it yourself.


Problem 1 (6 credits)

a) Question Nr. 4HR86EZ11LNA05GE44NY1

The olive dataset from the dslabs package contains the % of 8 fatty acids found in Italian olive oils.

library(dslabs)
data(olive)
head(olive)
##           region         area palmitic palmitoleic stearic oleic linoleic
## 1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
## 2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
## 3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
## 4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
## 5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
## 6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
##   linolenic arachidic eicosenoic
## 1      0.36      0.60       0.29
## 2      0.31      0.61       0.29
## 3      0.31      0.63       0.29
## 4      0.50      0.78       0.35
## 5      0.50      0.80       0.46
## 6      0.51      0.70       0.44

Write R code using ggplot2 to plot the distribution of the % of the fatty acids except oleic using density plots with different line colors to distinguish each fatty acid. Give meaningful labels to both axes.
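One possible way to approach this kind of task is sketched below. It is not the official solution: the reshaping with melt, the intermediate names (olive_dt, olive_long, olive_sub) and the axis labels are assumptions chosen for illustration.

```r
library(data.table)
library(ggplot2)
library(dslabs)

data(olive)
olive_dt <- as.data.table(olive)

# Reshape to long format: one row per (oil sample, fatty acid) pair.
olive_long <- melt(olive_dt, id.vars = c("region", "area"),
                   variable.name = "fatty_acid", value.name = "percentage")

# Drop oleic, then draw one density curve per remaining fatty acid.
olive_sub <- olive_long[fatty_acid != "oleic"]
p <- ggplot(olive_sub, aes(x = percentage, color = fatty_acid)) +
  geom_density() +
  labs(x = "Fatty acid content (%)", y = "Density", color = "Fatty acid")
print(p)
```

The long format is what lets a single color aesthetic separate the fatty acids; in wide format each acid would need its own layer.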


b) Question Nr. 3QM41LNA09GZ73DB55AQ5

The mpg dataset from the ggplot2 package contains different car features, mostly involving fuel.

library(ggplot2)
data(mpg)
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

Write R code that produces the following plot.

[Plot: four facets titled cyl, displ, hwy and cty, each showing a density curve. The x axis is labeled "value" and the y axis "density"; the tick ranges differ between facets.]
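A faceted density plot of this kind can be sketched as follows. This is an illustration under assumptions, not the reference answer: the free axis scales are inferred from the differing tick ranges in the plot, and the names vars, mpg_dt and mpg_long are hypothetical.

```r
library(data.table)
library(ggplot2)

vars <- c("cyl", "displ", "hwy", "cty")

# Convert the four columns to numeric so that melt() stacks a single type.
mpg_dt <- as.data.table(mpg)[, lapply(.SD, as.numeric), .SDcols = vars]

# Long format: the column 'variable' holds the feature name, 'value' its value.
mpg_long <- melt(mpg_dt, measure.vars = vars)

# One density facet per feature; free scales are an assumption from the plot.
p <- ggplot(mpg_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~variable, scales = "free")
print(p)
```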


c) Question Nr. 0GW52IK81LNA06UY23CW8

The admissions dataset from the dslabs package provides the number of applicants and admitted students to 6 different majors stratified by gender. Write R code using ggplot2 to plot the difference between applicants and admitted students for each major, using bars and stratified by gender using facets. Give meaningful labels to both axes. Do not mind if you obtain negative values.

library(dslabs)
data(admissions)
admissions
##    major gender admitted applicants
## 1      A    men       62        825
## 2      B    men       63        560
## 3      C    men       37        325
## 4      D    men       33        417
## 5      E    men       28        191
## 6      F    men        6        373
## 7      A  women       82        108
## 8      B  women       68         25
## 9      C  women       34        593
## 10     D  women       35        375
## 11     E  women       24        393
## 12     F  women        7        341
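A minimal sketch of such a bar plot follows. To keep it self-contained, the table printed above is retyped by hand instead of being loaded from dslabs; the column name difference and the axis labels are assumptions.

```r
library(data.table)
library(ggplot2)

# Values copied from the printed admissions table.
admissions_dt <- data.table(
  major = rep(LETTERS[1:6], times = 2),
  gender = rep(c("men", "women"), each = 6),
  admitted = c(62, 63, 37, 33, 28, 6, 82, 68, 34, 35, 24, 7),
  applicants = c(825, 560, 325, 417, 191, 373, 108, 25, 593, 375, 393, 341)
)

# Difference between applicants and admitted students; may be negative.
admissions_dt[, difference := applicants - admitted]

p <- ggplot(admissions_dt, aes(x = major, y = difference)) +
  geom_col() +
  facet_wrap(~gender) +
  labs(x = "Major", y = "Applicants minus admitted students")
print(p)
```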


Problem 2 (2 credits)

Which operation has been applied to table A and table B to return the result table? Justify your answer. Write one line of R code that would produce the result table assuming a data table A and a data table B in the working environment.

Table A:

id  CreditCard        CCV  type
15  1837655746651971  582  l
21  5927428911423246  221  i
14  7393954899774435  142  r
23  7844437946592947  479  l
16  7364376521545978  881  l
13  3923818281216234  698  o
8   1764682661721638  566  o
24  2622321425978251  528  o
19  7271112241595296  393  o
18  4225693846619738  421  r

Table B:

firstName  lastName    customer_id
Aamina     el-Sinai    16
Marcus     Hendrix     13
Derek      Martinez    8
Muntasir   al-Sharifi  24
Alexis     Smith       19
Alexis     Arreola     18
Khanea     Forrest     1
Julia      Deronde     6
Keith      Hart        25
Tiana      Ramirez     9
Adam       Highman     2

Result table:

id  CreditCard        CCV  type  firstName  lastName
1   NA                NA   NA    Khanea     Forrest
2   NA                NA   NA    Adam       Highman
6   NA                NA   NA    Julia      Deronde
8   1764682661721638  566  o     Derek      Martinez
9   NA                NA   NA    Tiana      Ramirez
13  3923818281216234  698  o     Marcus     Hendrix
14  7393954899774435  142  r     NA         NA
15  1837655746651971  582  l     NA         NA
16  7364376521545978  881  l     Aamina     el-Sinai
18  4225693846619738  421  r     Alexis     Arreola
19  7271112241595296  393  o     Alexis     Smith
21  5927428911423246  221  i     NA         NA
23  7844437946592947  479  l     NA         NA
24  2622321425978251  528  l     Muntasir   al-Sharifi
25  NA                NA   NA    Keith      Hart
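A result that keeps every id from both tables and fills the missing side with NA has the shape of a full outer join. The sketch below illustrates that operation on two small stand-in tables (three rows of A and two rows of B, with columns trimmed); it is an illustration of the join type, not an official answer key.

```r
library(data.table)

# Small stand-ins for tables A and B (subset of rows and columns).
A <- data.table(id = c(15, 8, 21), CCV = c(582, 566, 221))
B <- data.table(firstName = c("Derek", "Khanea"), customer_id = c(8, 1))

# Full outer join: all = TRUE keeps unmatched rows from both sides.
# by.x / by.y map the differently named key columns onto each other.
result <- merge(A, B, by.x = "id", by.y = "customer_id", all = TRUE)
print(result)
```

The key column keeps the name given in by.x, and the merged table is sorted by that key, which matches the ordering by id seen in the result table.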


Problem 3 (2 credits)

a) Question Nr. 6CK12JZ6CD9OQ44JL0

Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution shown as a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.

[Plot: density curves over x from −15 to 15; x axis "x", y axis "density".]

[Plots: four normal Q-Q plots labeled A, B, C and D; x axis "theoretical" (−4 to 4), y axis "sample" (−15 to 15).]

b) Question Nr. 3JN34PA3DD3YU35RP0

Which normal Q-Q plot (i.e. Q-Q plot against the standard Normal distribution) A, B, C, or D corresponds to the distribution shown as a solid line in the plot below? The standard Normal distribution, i.e. the Gaussian distribution with mean 0 and variance 1, is shown in the plot below with a dashed line.

[Plot: density curves over x from −15 to 15; x axis "x", y axis "density".]

[Plots: four normal Q-Q plots labeled A, B, C and D; x axis "theoretical" (−4 to 4), y axis "sample". Panels A, C and D range from −15 to 15 on the y axis, panel B from 0.00 to 1.00.]

Problem 4 (2 credits)

Question Nr. 9OQ7L2ZT32SI73AD2

The brexit_polls dataset from the dslabs package contains poll outcomes for 127 polls performed by different pollsters either online or by telephone (poll_type).

library(dslabs)
data(brexit_polls)
head(brexit_polls)
##    startdate    enddate   pollster poll_type samplesize remain leave undecided
## 1 2016-06-23 2016-06-23     YouGov    Online       4772   0.52  0.48      0.00
## 2 2016-06-22 2016-06-22    Populus    Online       4700   0.55  0.45      0.00
## 3 2016-06-20 2016-06-22     YouGov    Online       3766   0.51  0.49      0.00
## 4 2016-06-20 2016-06-22 Ipsos MORI Telephone       1592   0.49  0.46      0.01
## 5 2016-06-20 2016-06-22    Opinium    Online       3011   0.44  0.45      0.09
## 6 2016-06-17 2016-06-22     ComRes Telephone       1032   0.54  0.46      0.00
##   spread
## 1   0.04
## 2   0.10
## 3   0.02
## 4   0.03
## 5  -0.01
## 6   0.08

You are interested in generating a table that shows the polls of June 2016 only. Write R code to create such a table with the same column names (header displayed below).

##     startdate    enddate   pollster poll_type samplesize remain leave undecided
## 1: 2016-06-23 2016-06-23     YouGov    Online       4772   0.52  0.48      0.00
## 2: 2016-06-22 2016-06-22    Populus    Online       4700   0.55  0.45      0.00
## 3: 2016-06-20 2016-06-22     YouGov    Online       3766   0.51  0.49      0.00
## 4: 2016-06-20 2016-06-22 Ipsos MORI Telephone       1592   0.49  0.46      0.01
## 5: 2016-06-20 2016-06-22    Opinium    Online       3011   0.44  0.45      0.09
## 6: 2016-06-17 2016-06-22     ComRes Telephone       1032   0.54  0.46      0.00
##    spread
## 1:   0.04
## 2:   0.10
## 3:   0.02
## 4:   0.03
## 5:  -0.01
## 6:   0.08
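Filtering a data.table by calendar month can be sketched as below. The helper polls_in_month is hypothetical, the toy table stands in for as.data.table(brexit_polls), and filtering on startdate rather than enddate is an assumption, since the question does not say which date defines "polls of June 2016".

```r
library(data.table)

# Hypothetical helper: keep rows whose date column falls in a given month.
# 'month' is a string of the form "YYYY-MM".
polls_in_month <- function(dt, month, date_col = "startdate") {
  dt[format(dt[[date_col]], "%Y-%m") == month]
}

# Toy stand-in; the second row is made up to show a poll outside June.
polls <- data.table(
  startdate = as.Date(c("2016-06-23", "2016-05-30", "2016-06-17")),
  enddate   = as.Date(c("2016-06-23", "2016-06-02", "2016-06-22")),
  pollster  = c("YouGov", "Example", "ComRes")
)

june_polls <- polls_in_month(polls, "2016-06")
print(june_polls)
```

Row filtering in the i argument leaves the columns untouched, so the result keeps the same header as the input.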


Problem 5 (4 credits)

a) Question Nr. 4FS41VJ01LT35KP42ZM8

Consider the dataset "brca". Which statistical test that we studied do you suggest to test the association between the variable "concavity_se" and the variable "outcome"? Assume normally distributed values of the variable "concavity_se" given the value of "outcome". Justify the choice of the test and provide the two-sided p-value rounded to two significant digits using signif(...,digits=2).
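For a quantitative variable that is assumed normal within each of two groups, a two-sample t-test is a common choice. The sketch below runs one on simulated data; the data frame toy and its columns are made up, and whether the lecture expects the Welch variant (R's default) or the equal-variance variant (var.equal = TRUE) is not stated in the exam.

```r
set.seed(1)

# Simulated stand-in: a numeric feature measured in two outcome groups.
toy <- data.frame(
  outcome = rep(c("B", "M"), each = 50),
  value   = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)

# Two-sided two-sample t-test; R defaults to the Welch variant.
res <- t.test(value ~ outcome, data = toy)

# Round the p-value to two significant digits, as the question asks.
p_value <- signif(res$p.value, digits = 2)
print(p_value)
```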


b) Question Nr. 5GH17ZA51LS07UH62QI6

Consider the dataset "olive". Which statistical test that we studied do you suggest to test the association between the variable "oleic" and the variable "stearic"? Do not make any assumption of normality. Justify the choice of the test and provide the two-sided p-value rounded to two significant digits using signif(...,digits=2).
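For two quantitative variables without a normality assumption, a rank-based test such as the Spearman correlation test is one candidate. The sketch below uses simulated vectors x and y as stand-ins for the two columns; it shows the mechanics only and is not the graded answer.

```r
set.seed(2)

# Simulated stand-ins for two quantitative variables with a monotonic link.
x <- rnorm(100)
y <- -0.5 * x + rnorm(100)

# Spearman's rank correlation test; two-sided by default.
res <- cor.test(x, y, method = "spearman")

# Round the p-value to two significant digits.
p_value <- signif(res$p.value, digits = 2)
print(p_value)
```

On real data with tied values, cor.test warns that it cannot compute an exact p-value and falls back to an approximation.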


Problem 6 (2 credits)

a) QuestionId: 0PE69QK81ME3BC1VR8

Which of the explanatory variables 'rawpoll_clinton' and 'rawpoll_trump' explains the most variance of the response variable 'rawpoll_mcmullin' in the 'polls_us_election_2016' dataset from the dslabs package? Assume that the assumptions of linear regression are met. Provide code and justify your answer.

b) QuestionId: 3FQ41UA96LY2U0K14MQ1

Consider the 'polls_us_election_2016' dataset from the dslabs package. Assuming that all assumptions of linear regression are met, is the effect of 'rawpoll_clinton' on 'rawpoll_mcmullin' significant at the significance level of 0.05? Justify.
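Both parts can be approached with univariate linear models: compare the R² of one fit per explanatory variable, then read the coefficient p-value from the model summary. The sketch below does this on simulated data; toy, x1, x2 and y are made-up names standing in for the poll columns.

```r
set.seed(42)
n <- 200

# Simulated stand-in: y depends strongly on x1 and only weakly on x2.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + 0.2 * toy$x2 + rnorm(n)

# One univariate linear model per explanatory variable.
fit1 <- lm(y ~ x1, data = toy)
fit2 <- lm(y ~ x2, data = toy)

# R squared: fraction of the response variance explained by each model.
r2 <- c(x1 = summary(fit1)$r.squared, x2 = summary(fit2)$r.squared)
print(r2)

# p-value of the t-test on the slope of x1 (null hypothesis: slope is 0).
p_x1 <- summary(fit1)$coefficients["x1", "Pr(>|t|)"]
print(p_x1 < 0.05)
```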


Problem 7 (2 credits)

a) Question Nr. 8ZF48OQ3WP45KH55GQ0

We consider a linear regression model parameterized as $y_i = \alpha + \beta \cdot x_i + \epsilon_i$, where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\epsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value.

Does the following plot provide evidence against the assumptions of the linear regression? Justify.

[Plot: Q-Q plot; x axis "theoretical quantiles" (−2 to 2), y axis "response quantiles" (−10 to 5).]

b) Question Nr. 0JU8SP106HN41KM58HL9

We consider a linear regression model parameterized as $y_i = \alpha + \beta \cdot x_i + \epsilon_i$, where $i = 1 \dots N$ denotes the data point indices, $y_i$ is the response variable, $\alpha$ and $\beta$ the coefficients, $x_i$ the explanatory variable and $\epsilon_i$ the error term. Let $\hat{y}_i$ be the fitted value.

Does the following plot provide evidence against the assumptions of the linear regression? Justify.

[Plot: y against x; x axis "x" (0 to 60), y axis "y" (0 to 600).]

Problem 8 (2 credits)

Question Nr. 4LY90YX24AQ71F66EX2

library(dslabs)

Consider the "brca" dataset from the dslabs package. Fit a logistic regression model which predicts the response variable brca$y given the feature perimeter_se. Assume that all assumptions of the logistic regression model are met. Starting from an initial probability of 10 % of malignant (cancer), by how much does the probability of developing a malignant (cancer) increase when the feature perimeter_se increases by 0.8?
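In a logistic regression, a change of delta in the feature adds beta * delta to the log-odds, so the new probability follows from shifting the logit of the starting probability. The sketch below shows this arithmetic on a model fitted to simulated data; prob_after_shift is a hypothetical helper and the simulated coefficient has nothing to do with the real brca fit.

```r
# Hypothetical helper: probability after the feature increases by 'delta',
# starting from probability p0, for a logistic slope 'beta'.
prob_after_shift <- function(p0, beta, delta) {
  plogis(qlogis(p0) + beta * delta)  # logit -> shift by beta * delta -> back
}

set.seed(7)

# Simulated stand-in for a binary outcome and one numeric feature.
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(-1 + 1.5 * x))

fit <- glm(y ~ x, family = binomial)
beta <- coef(fit)[["x"]]

# Increase in probability when x grows by 0.8, starting from 10 %.
increase <- prob_after_shift(0.1, beta, 0.8) - 0.1
print(increase)
```

With a factor response such as brca$y, glm treats the first factor level as failure and the other level as success, so it is worth checking the level order before interpreting the sign of the slope.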


Problem 9 (2 credits)

QuestionId: 4YF2RH7TD00JE84ZN0

Consider the features smoothness_mean, radius_mean from the brca dataset. Provide R code that plots a ROC curve of both features as predictors of malignancy (variable brca$y == "M"), and indicate the feature that has the highest true positive rate when the false positive rate is 0.4.
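A ROC curve can be computed by hand by sweeping a threshold over the feature values and recording the true and false positive rates. The sketch below uses base R on a tiny made-up example; roc_points and tpr_at are hypothetical helpers, and the code assumes that larger feature values indicate the positive class.

```r
# Hypothetical helper: ROC points for a numeric score and a logical label.
# Assumes higher scores point towards the positive class.
roc_points <- function(score, is_positive) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  # Rate of positives / negatives predicted positive at each threshold.
  tpr <- sapply(thresholds, function(t) mean(score[is_positive] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[!is_positive] >= t))
  # Prepend the (0, 0) corner, reached when nothing is predicted positive.
  data.frame(threshold = c(Inf, thresholds), fpr = c(0, fpr), tpr = c(0, tpr))
}

# Hypothetical helper: best TPR reachable without exceeding a given FPR.
tpr_at <- function(roc, max_fpr) max(roc$tpr[roc$fpr <= max_fpr])

# Tiny made-up example: two positives with high scores, two negatives.
score <- c(0.9, 0.8, 0.3, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, FALSE)

roc <- roc_points(score, is_positive)
print(roc)
print(tpr_at(roc, 0.4))
```

Computing roc_points once per feature and stacking the two data frames with a feature column would allow plotting both curves in one ggplot2 call, for example with geom_step and a color aesthetic.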


Problem 10 (2 credits)

Question Nr. 4IL88XL37CT72GN68OW7

library(dslabs)
library(data.table)

Consider the variab...