FIT3152 Lecture 02 PDF

Title FIT3152 Lecture 02
Author Wong Kai Jeng
Course Data Science
Institution Monash University

FIT3152 Data analytics – Lecture 2

Visualization of data

Draft: do not circulate

•  Recent examples
•  Common themes
•  First steps: looking at the data
•  Visualization using R
•  Visualization for analysis
•  Presentation quality graphics

FIT3152 Data analytics – Lecture 2

Slide 1

Dean’s Student Forum

Meet your Dean, Prof. Frieder Seible. Please join us to share your ideas.

CLAYTON: Monday 8 August, 12 noon – 1pm, New Horizons, 20 Research Way, G29 & G30

CAULFIELD: Tuesday 9 August, 11am – 12 noon, Building H, Level 7, Room H7.84

Register at: tinyurl.com/ITForum2016

Slide 3

RMIT Analytics Competition

https://sites.google.com/site/rmitanalytics/

Slide 4

Recap on R from last lecture


Slide 5

Data Structures

Data is stored in R using data structures (objects) to which functions (methods) are applied.

Array
•  Contains data of the same type.
•  Vector: 1D, Matrix: 2D, Array: 3+ dimensions.

Data Frame
•  Row x Column data format – each column is a vector.

List
•  An ordered collection of (possibly different) types.

Slide 6
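The structures listed above can be sketched in a few lines of R. This is only an illustration: the object names (v, m, a, df, l) and the values are made up and do not come from the lecture.

```r
# Vector: 1D, every element has the same type
v <- c(3.95, 2.96, 2.95)

# Matrix: 2D, still a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: 3 or more dimensions
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: row x column format, each column is a vector,
# and different columns may hold different types
df <- data.frame(id = 1:3, price = v, brand = c("A", "B", "C"))

# List: an ordered collection of (possibly different) types
l <- list(numbers = v, label = "toothbrush", table = df)

# Inspect what was created
class(v); class(m); class(a); class(df); class(l)
dim(m); dim(a)
str(l)
```

Calling str() on any of these objects is a quick way to check which structure you are actually working with.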

Descriptive statistics

Problem: describe a simple data set, calculate some basic statistics, draw a simple histogram

> thedata …
> sd(thedata)  # calculate standard deviation
[1] 4.743416
> hist(thedata)  # draw a basic histogram

Slide 7
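A minimal sketch of the same workflow, assuming a small made-up vector. The values assigned to thedata below are hypothetical, chosen only for illustration.

```r
# Hypothetical data, chosen only for illustration
thedata <- c(3, 6, 9, 12, 15)

mean(thedata)     # arithmetic mean
median(thedata)   # middle value
sd(thedata)       # sample standard deviation
summary(thedata)  # min, quartiles, median, mean, max

# Draw a basic histogram; 'main' sets the title
hist(thedata, main = "Histogram of thedata")
```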

Bivariate data

The data:
•  In 1998, Choice magazine tested 1500 toothbrushes and made a summary of price and function. Are these two factors related?

Toothbrush.csv

Price   Function
3.95    65.10
2.96    78.00
2.95    72.00
0.66    40.00
0.69    57.00
3.20    61.00
1.08    49.00
3.69    76.00
…       …

Data from Selvanathan, Australian Business Statistics (Abridged 4th Ed)

Slide 8

Bivariate data

Problem: analyse the relationship between price and function.
•  Read the data and create a data frame

> Toothbrush …
> cor(Toothbrush)  # which is x or y not important
             Price  Function
Price    1.0000000 0.6645614
Function 0.6645614 1.0000000

Slide 9
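Reading the data and computing the correlation could look roughly like the sketch below. It uses only the eight rows printed on the data slide, so the correlation it produces will differ from the value computed on the full file. The CSV layout (a header row with the column names Price and Function) is an assumption.

```r
# The eight rows shown on the data slide; the full Toothbrush.csv
# is not reproduced here, and the header layout is an assumption
csv_text <- "Price,Function
3.95,65.10
2.96,78.00
2.95,72.00
0.66,40.00
0.69,57.00
3.20,61.00
1.08,49.00
3.69,76.00"

# read.csv(text = ...) parses a string; with the real file on disk
# this would be read.csv("Toothbrush.csv")
Toothbrush <- read.csv(text = csv_text)
str(Toothbrush)

# Correlation matrix: symmetric, with 1 on the diagonal
r <- cor(Toothbrush)
r
```

Because the correlation matrix is symmetric, it does not matter which variable is treated as x and which as y.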

Scatterplot

> plot(Toothbrush)
> # the default plot putting Function on y axis

[Figure: scatterplot of Function (y axis) against Price (x axis)]

Slide 10

Bivariate data (attach function)

Problem: calculate the regression equation
•  The ‘attach’ function lets you call columns by name without having to specify the data set – assuming the column name is unique amongst attached data sets.

> attach(Toothbrush)

•  Scatterplot using Price and Function

> plot(Price, Function)

Slide 11

Bivariate data

Problem: calculate the regression equation cont.
•  To calculate the regression equation define variable ‘fitted’ and use linear model (lm) function.

> fitted = lm(Function ~ Price)
> fitted

Call:
lm(formula = Function ~ Price)

Coefficients:
(Intercept)        Price
     44.020        6.942

•  Now overplot the fitted model on scatterplot

> abline(fitted)

Slide 12
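The regression steps can also be written without attach(), as in the sketch below. It rebuilds a data frame from the eight rows printed on the data slide only, so its coefficients will not match the 44.020 and 6.942 obtained from the full file. The object name fit is used here instead of fitted, an illustrative choice that avoids masking R's built-in fitted() function.

```r
# Eight rows from the data slide; the full data set is not reproduced
Toothbrush <- data.frame(
  Price    = c(3.95, 2.96, 2.95, 0.66, 0.69, 3.20, 1.08, 3.69),
  Function = c(65.10, 78.00, 72.00, 40.00, 57.00, 61.00, 49.00, 76.00)
)

# Fit Function as a linear function of Price; 'data =' replaces attach()
fit <- lm(Function ~ Price, data = Toothbrush)
coef(fit)  # intercept and slope for this eight-row subset

# Scatterplot with the fitted line drawn over it
plot(Function ~ Price, data = Toothbrush)
abline(fit)

# Predicted Function score for a hypothetical price of 2.50
predict(fit, newdata = data.frame(Price = 2.50))

# The same prediction by hand, using the coefficients printed on the slide
44.020 + 6.942 * 2.50  # 61.375
```

The last line shows how the regression equation is read: predicted Function = intercept + slope × Price.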

Scatterplot + regression line

> plot(Price, Function)
> abline(fitted)  # overplotting

[Figure: scatterplot of Function against Price with the fitted regression line]

Slide 13

Estimation/Hypothesis testing

The data:
•  The number of claims processed by two workers is given below. For convenience create two vectors:

> Workers …
> attach(Workers)

Workers.csv

WorkerA   23 45 21 22 17 42 45 41 49 19
WorkerB   33 23 19 51 32 15

Slide 14

Estimation/Hypothesis testing

Problem 1:
•  Calculate the confidence interval for the average number of claims processed by Worker A.

Problem 2:
•  Can we conclude that Worker A processes more claims than Worker B?

Slide 15

Estimation/Hypothesis testing

Quick comparison of data using a boxplot:

> boxplot(WorkerA, WorkerB)

[Figure: side-by-side boxplots of WorkerA (1) and WorkerB (2)]

Slide 16

EHT: Problem 1

Performing a t.test (with alternative that mean ≠ 0) to generate confidence interval.

> t.test(WorkerA)

        One Sample t-test

data:  WorkerA
t = 7.93, df = 9, p-value = 2.374e-05
alternative hypothesis: true mean not equal to 0
95 percent confidence interval:
 23.1574 41.6426
sample estimates:
mean of x
     32.4

•  Specify confidence level as a parameter to change default, for example

> t.test(WorkerA, conf.level = 0.55)

Slide 17
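To see where the interval comes from, the sketch below recomputes it by hand from the WorkerA values on the data slide and compares it with the interval reported by t.test(). The helper names (n, xbar, se, tcrit, ci_manual, ci_ttest) are illustrative.

```r
# WorkerA values as listed on the data slide
WorkerA <- c(23, 45, 21, 22, 17, 42, 45, 41, 49, 19)

n     <- length(WorkerA)
xbar  <- mean(WorkerA)            # sample mean, 32.4
se    <- sd(WorkerA) / sqrt(n)    # standard error of the mean
tcrit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

# Confidence interval: mean plus/minus critical value times standard error
ci_manual <- c(xbar - tcrit * se, xbar + tcrit * se)
ci_manual

# The same interval as extracted from the t.test result
ci_ttest <- t.test(WorkerA)$conf.int
ci_ttest

# A different confidence level only changes the critical value
t.test(WorkerA, conf.level = 0.99)$conf.int
```

Both approaches give the interval shown on the slide (about 23.16 to 41.64). Raising the confidence level widens the interval.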

EHT: Problem 2

Performing a t.test to test whether the means are different:

> t.test(WorkerA, WorkerB)

        Welch Two Sample t-test

data:  WorkerA and WorkerB
t = 0.5333, df = 10.634, p-value = 0.6048
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.21422  18.34755
sample estimates:
mean of x mean of y
 32.40000  28.83333

Slide 18
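Problem 2 asks whether Worker A processes more claims, which is a one-sided question. The sketch below runs the two-sided test from the slide and then a one-sided version using the alternative argument. Treating the one-sided test as the better match for the question is an interpretation, not something stated on the slide.

```r
# Values as listed on the data slide
WorkerA <- c(23, 45, 21, 22, 17, 42, 45, 41, 49, 19)
WorkerB <- c(33, 23, 19, 51, 32, 15)

# Two-sided Welch test, as on the slide
two_sided <- t.test(WorkerA, WorkerB)
two_sided$statistic  # t, about 0.53
two_sided$p.value    # about 0.60

# One-sided test: is the mean of WorkerA greater than the mean of WorkerB?
one_sided <- t.test(WorkerA, WorkerB, alternative = "greater")
one_sided$p.value    # half the two-sided value, because t is positive
```

With p-values this large, the data give no grounds to conclude that Worker A processes more claims than Worker B.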

t.test: syntax

From the help file:

•  Description

Performs one and two sample t-tests on vectors of data.

•  Usage

t.test(x, ...)

## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

Slide 19

Time series analysis

The data:
•  The quantity of pre-mixed concrete produced, Australia (Mar 1990 – Mar 2013). From ABS: 8301.0 Production of Selected Construction Materials.

Concrete.csv

SeasYear   PreMix000
Mar-90     3814
Jun-90     3862
Sep-90     4049
Dec-90     3886
Mar-91     3114
Jun-91     3238
Sep-91     3365
Dec-91     3418
…          …

Slide 20

Time series analysis

Problem: read the data and declare as class ts:

> Concrete …
> plot(PreMix000)

[Figure: time series plot of PreMix000 against Time, 1990 to 2013]

Slide 22
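Declaring a series as class ts might look like the sketch below. It uses only the first eight quarterly values from the data slide, and assumes the series is quarterly with Mar-90 as the first quarter of 1990. The name premix is illustrative.

```r
# First eight quarterly values from the data slide (Mar-90 to Dec-91)
premix <- c(3814, 3862, 4049, 3886, 3114, 3238, 3365, 3418)

# Declare as a time series: 4 observations per year, starting in
# quarter 1 of 1990 (assumed from the Mar-90 label)
PreMix000 <- ts(premix, start = c(1990, 1), frequency = 4)

class(PreMix000)  # "ts"
PreMix000         # printed as a year-by-quarter table

# plot() on a ts object draws a line chart against time
plot(PreMix000)
```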

Time series analysis

Problem: decompose the time series

> decomp …
> plot(decomp)  # object stores all info about time series

[Figure: "Decomposition of additive time series", with panels observed, trend, seasonal and random against Time, 1990 to 2010]

Slide 23
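A sketch of an additive decomposition using decompose() from base R. Because the full concrete series is not reproduced here, the example builds a synthetic quarterly series. All of its values, the seed and the names (trend, seasonal, noise, series) are made up for illustration.

```r
set.seed(1)  # make the synthetic noise reproducible

# Synthetic quarterly series: rising trend + fixed seasonal pattern + noise
n_quarters <- 40
trend    <- seq(3000, 5000, length.out = n_quarters)
seasonal <- rep(c(-300, 100, 250, -50), times = n_quarters / 4)
noise    <- rnorm(n_quarters, sd = 100)
series   <- ts(trend + seasonal + noise, start = c(1990, 1), frequency = 4)

# Additive decomposition into trend, seasonal and random components
decomp <- decompose(series)
names(decomp)   # components stored in the result
decomp$figure   # estimated seasonal effect for each quarter

# One panel each for observed, trend, seasonal and random
plot(decomp)
```

The trend is estimated with a moving average, so its first and last few values are NA.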

A few review questions…


Slide 24

Question 1

Predict the output from the following commands:

> X …
> class(X)

(a) numeric
(b) character
(c) numeric, character

Slide 29

Question 6

Predict the output from the following commands:

> X …

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Slide 48

Print head and tail

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

Slide 49

…or a selection of rows

> iris[10:15,]  # by convention [ ] index rows
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

> iris[11,]  # single row
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11          5.4         3.7          1.5         0.2  setosa

Slide 50

…or part of a single column

> iris[10:20, "Sepal.Length"]  # identify column as string
 [1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1
> # or
> iris$Sepal.Length[10:20]  # identify column by name
 [1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1
> # or
> iris[10:20, 1]  # identify column by number
 [1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1

Slide 51
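The three indexing forms can be checked against each other with identical(), as in this short sketch. The variable names and the drop = FALSE variant are additions for illustration.

```r
# Three ways to take rows 10 to 20 of the Sepal.Length column
by_string <- iris[10:20, "Sepal.Length"]   # column identified as string
by_name   <- iris$Sepal.Length[10:20]      # column identified by name
by_number <- iris[10:20, 1]                # column identified by number

# All three return the same numeric vector
identical(by_string, by_name)
identical(by_name, by_number)

# Selecting a single column drops to a vector by default;
# drop = FALSE keeps the result as a one-column data frame
sub_df <- iris[10:20, "Sepal.Length", drop = FALSE]
class(sub_df)
```

Identifying columns by name is usually the safer choice, since a column number silently points at a different variable if the column order changes.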

Summary

Create a mean + 5 point summary of each numerical column, and a count of each level for factor columns.

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Slide 52

Class activity

The data set ‘mpg’ is contained in the ggplot2 package. Get to know it:

> summary
> str
> head
> tail
> unique  # particular columns
> ?

Slide 53

Seeing the data


Slide 54

Base graphics

These are the graphic functions built into the basic R installation.
•  High level graphics functions create new graphs with axes, labels and titles.
•  Low level graphics functions then annotate plots with points, lines and text.
•  See ATHR page 48.

Slide 55

Base graphics: high level functions

Some functions we have used/will use include:

> plot      # Scatterplot
> pairs     # Scatterplot matrix
> hist      # Histogram
> stem      # Stem-and-leaf plot
> boxplot   # Box-and-whisker plot
> barplot   # Bar plot
> dotchart  # Dot plot

•  See ATHR page 49

Slide 56

Base graphics: low level functions

Some low level functions include:

> lines   # Draw lines between given coordinates
> text    # Draw text at given coordinates
> abline  # Draw a line of given intercept and slope, or a horizontal and/or vertical line
> axis    # Add an axis
> arrows  # Draw arrows
> grid    # Add a rectangular grid
> legend  # Add a legend (a key)

•  See ATHR page 50

Slide 57
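The split between high level and low level functions is easiest to see in a single plot: one high level call creates the graph, and several low level calls then annotate it. The sketch below assumes the eight toothbrush rows from the earlier data slide. The label text and legend entries are illustrative.

```r
# Eight rows from the toothbrush data slide
x <- c(3.95, 2.96, 2.95, 0.66, 0.69, 3.20, 1.08, 3.69)      # Price
y <- c(65.10, 78.00, 72.00, 40.00, 57.00, 61.00, 49.00, 76.00)  # Function

# High level: creates a new plot with axes and labels
plot(x, y, xlab = "Price", ylab = "Function")

# Low level: each call adds to the existing plot
grid()                                   # rectangular grid
abline(h = mean(y), lty = 2)             # horizontal line at mean Function
abline(v = mean(x), lty = 3)             # vertical line at mean Price
lines(sort(x), y[order(x)])              # join the points in order of Price
text(x[1], y[1], labels = "first row", pos = 1)  # text below first point
legend("bottomright",
       legend = c("mean Function", "mean Price"),
       lty = c(2, 3))                    # key for the two reference lines
```

Low level functions fail with an error if no plot has been created yet, so the high level call always comes first.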

Base graphics: graphics parameters

Some graphics parameters include:

> main        # Title of the plot
> ylab, xlab  # Labels for the y-axis and x-axis
> type        # Plot type (points, lines, both, ...)
> pch         # Plot character (circles, dots, symbols, ...)
> lty         # Line type (solid, dots, dashes, ...)
> lwd         # Line width
> col         # Colour of plot characters
> ...and many others, see: help(par)

•  See ATHR page 50

Slide 58
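A short sketch showing several of these parameters in one call, followed by par() to set a parameter for subsequent plots. The data are the eight concrete values from the earlier data slide. The titles, labels and the specific parameter values are illustrative choices.

```r
# First eight values of the concrete series from the data slide
premix <- c(3814, 3862, 4049, 3886, 3114, 3238, 3365, 3418)

# Parameters passed directly to a high level function
plot(premix,
     type = "b",            # both points and lines
     pch  = 16,             # filled circle
     lty  = 2,              # dashed line
     lwd  = 2,              # double line width
     col  = "blue",         # colour of points and lines
     main = "Pre-mixed concrete",
     xlab = "Quarter (index)",
     ylab = "PreMix000")

# par() changes settings for later plots; it returns the old values
op <- par(mfrow = c(1, 2))          # two plots side by side
plot(premix, type = "l")            # lines only
plot(premix, type = "p", pch = 4)   # points only, drawn as crosses
par(op)                             # restore the previous layout
```

Saving the value returned by par() and restoring it afterwards keeps a layout change from leaking into later plots.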

Boxplot

Each variable can be viewed as a boxplot distinguished by level:

> boxplot(Sepal.Length ~ Species, data = iris)

[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica]

Slide 59

Scatterplot

> with(iris, plot(Sepal.Length, Sepal.Width))
> # using ‘with’ simplifies column names etc.

[Figure: scatterplot of Sepal.Width against Sepal.Length]

Slide 60

Scatterplot + colour

> with(iris, plot(Sepal.Length, Sepal.Width, col = Species))

[Figure: scatterplot of Sepal.Width against Sepal.Length, coloured by Species]

Slide 61

Scatterplot + plot symbol

> with(iris, plot(Sepal.Length, Sepal.Width, col = Species, pch = as.numeric(Species)))

[Figure: scatterplot of Sepal.Width against Sepal.Length, with colour and plot symbol set by Species]

Slide 62
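Colour and symbol can be driven by Species because a factor is stored as integer codes (1, 2, 3), and those codes select colours from the current palette and plotting symbols by number. The sketch below makes the mapping explicit. The vector my_cols is an illustrative choice of colours.

```r
# A factor is stored as integer codes, one per level
codes <- as.numeric(iris$Species)
table(iris$Species, codes)   # each species maps to exactly one code

# col = Species uses these codes to pick from the current palette
palette()[1:3]

# The same idea with explicitly chosen colours: index a colour
# vector by the codes, so code 1 gets the first colour, and so on
my_cols <- c("red", "green3", "blue")
with(iris, plot(Sepal.Length, Sepal.Width,
                col = my_cols[codes],   # colour chosen per species
                pch = codes))           # symbol 1, 2 or 3 per species
```

Indexing a colour vector by the codes gives control over which colour each level receives, instead of relying on the palette defaults.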

Scatterplot + jitter

> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species)))

[Figure: scatterplot of jitter(Sepal.Width) against jitter(Sepal.Length), with colour and plot symbol set by Species]

Slide 63
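Jitter helps because the iris measurements are recorded to one decimal place, so several flowers share exactly the same coordinates and are drawn on top of each other. The sketch below counts those overlaps before and after jittering. The seed and variable names are illustrative.

```r
set.seed(42)  # jitter() is random; fix the seed for reproducibility

x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Number of points that exactly repeat an earlier (x, y) pair
sum(duplicated(data.frame(x, y)))

# jitter() adds a small amount of uniform random noise to each value
xj <- jitter(x)
yj <- jitter(y)

# After jittering, the overlapping points are separated
sum(duplicated(data.frame(xj, yj)))

# The noise is small relative to the measurement resolution of 0.1
max(abs(xj - x))
```

Jitter changes only how the points are displayed. Any calculation should still use the original, unjittered values.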

Scatterplot + labels

> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species), main = "Iris Data", xlab = "Sepal Length", ylab = "Sepal Width"))

[Figure: "Iris Data", jittered scatterplot of Sepal Width against Sepal Length, with colour and plot symbol set by Species]

Slide 64

Scatterplot + legend

> # Follow the plot command with:
> with(iris, legend(6.5, 4.4, as.vector(unique(Species)), pch = unique(Species), col = unique(Species)))

[Figure: "Iris Data", jittered scatterplot of Sepal Width against Sepal Length, with a legend for setosa, versicolor and virginica]

Slide 65

Complete plot command

> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species), main = "Iris Data", xlab = "Sepal Length", ylab = "Sepal Width"))
> with(iris, legend(6.5, 4.4, as.vector(unique(Species)), pch = unique(Species), col = unique(Species)))

Slide 66
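The same plot can be laid out over several lines, with the legend built from the factor levels and explicit integer codes. This is a variation on the commands above, not a replacement for them. The names sp_levels and sp_codes, the seed, and the use of the "topright" keyword in place of coordinates are illustrative choices.

```r
set.seed(1)  # jitter() is random; fix the seed for a reproducible picture

# Legend labels and the matching codes used for colour and symbol
sp_levels <- levels(iris$Species)   # "setosa" "versicolor" "virginica"
sp_codes  <- seq_along(sp_levels)   # 1 2 3

# Jittered scatterplot with colour, symbol, title and axis labels
with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width),
                col  = Species,
                pch  = as.numeric(Species),
                main = "Iris Data",
                xlab = "Sepal Length",
                ylab = "Sepal Width"))

# Legend: the keyword places it in the top right corner of the plot
legend("topright", legend = sp_levels, pch = sp_codes, col = sp_codes)
```

Using a position keyword means the legend does not need new coordinates when the axis ranges change.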

Saving graphics

Diverting graphics from RStudio window to a file:
•  The code below opens a file, diverts the output from RStudio to a named file (of type wmf in this case) and then closes the diversion.

> setwd("C:/")  # default output directory
> win.metafile(file = "toothbrush.wmf")  # setup diversion
> plot( ... )  # plotting functions in here
> dev.off()  # close file and diversion

•  See ATHR page 57

Slide 67
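win.metafile() is only available on Windows. The sketch below follows the same open, plot, close pattern but swaps in pdf(), which is available on all platforms. The output path (a temporary directory) and the plotted values are illustrative.

```r
# Hypothetical output location: a file in R's temporary directory
out_file <- file.path(tempdir(), "toothbrush.pdf")

# Open the file device; plots now go to the file, not the screen
pdf(file = out_file, width = 6, height = 4)   # size in inches

# Plotting functions go here
plot(c(3.95, 2.96, 2.95), c(65.10, 78.00, 72.00),
     xlab = "Price", ylab = "Function")

# Close the device so the file is written and output returns to normal
dev.off()

file.exists(out_file)
```

Forgetting dev.off() is a common mistake. Until the device is closed the file may be incomplete, and later plots continue to be diverted into it.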

Viewing more dimensions


Slide 68

All interactions: scatterplot matrix

The default method for a scatterplot matrix is

> pairs(iris)

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]

Slide 69

Scatterplot matrix

Adding colour

> pairs(iris[1:5], pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

[Figure: scatterplot matrix of the iris data, with points coloured by Species]

