Title | FIT3152 Lecture 02 |
---|---|
Author | Wong Kai Jeng |
Course | Data Science |
Institution | Monash University |
Pages | 98 |
File Size | 3.7 MB |
FIT3152 Data analytics – Lecture 2
Visualization of data
• Recent examples
• Common themes
• First steps: looking at the data
• Visualization using R
• Visualization for analysis
• Presentation quality graphics
Draft: do not circulate
Slide 1
Dean’s Student Forum
Meet your Dean, Prof. Frieder Seible. Please join us to share your ideas.
CLAYTON: Monday 8 August, 12 noon – 1 pm, New Horizons, 20 Research Way, G29 & G30
CAULFIELD: Tuesday 9 August, 11 am – 12 noon, Building H, Level 7, Room H7.84
Register at: tinyurl.com/ITForum2016
Slide 3
RMIT Analytics Competition
https://sites.google.com/site/rmitanalytics/
Slide 4
Recap on R from last lecture
Slide 5
Data Structures
Data is stored in R using data structures (objects) to which functions (methods) are applied.
Array
• Contains data of the same type.
• Vector: 1D, Matrix: 2D, Array: 3+ dimensions.
Data Frame
• Row x column data format – each column is a vector.
List
• An ordered collection of (possibly different) types.
Slide 6
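The structures listed on the Data Structures slide can be built in a few lines. This is a minimal sketch with made-up values; none of the object names or numbers come from the lecture.

```r
# Illustrative values only; these objects are not from the lecture data.
v <- c(1.5, 2.0, 3.5)                # vector: 1D, every element numeric
m <- matrix(1:6, nrow = 2)           # matrix: 2D, every element integer
a <- array(1:24, dim = c(2, 3, 4))   # array: 3 dimensions

# Data frame: rows x columns, each column is a vector (types may differ by column)
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# List: ordered collection whose elements may have different types
l <- list(numbers = v, table = df, flag = TRUE)

class(v)
class(m)
class(a)
class(df)
class(l)
dim(a)      # the three dimension sizes of the array
```

Mixing types inside a vector, matrix or array forces R to coerce everything to one type, which is why a data frame or list is used when columns or elements differ.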
Descriptive statistics
Problem: describe a simple data set, calculate some basic statistics, draw a simple histogram
> thedata <- ...
> sd(thedata) # calculate standard deviation
[1] 4.743416
> hist(thedata) # draw a basic histogram
Slide 7
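A sketch of the same steps on a small vector. The values below are hypothetical stand-ins, since the slide's own data set is not shown, so the results will differ from the output above.

```r
# Hypothetical data; the lecture's own values are not reproduced here.
thedata <- c(12, 15, 9, 20, 17, 11, 14)

mean(thedata)      # arithmetic mean
median(thedata)    # middle value
sd(thedata)        # sample standard deviation (divides by n - 1)
summary(thedata)   # min, quartiles, mean, max

# pdf(NULL) discards the drawing so the sketch also runs non-interactively;
# drop it and dev.off() to see the histogram on screen.
pdf(NULL)
h <- hist(thedata, main = "Histogram of thedata")
dev.off()
h$counts           # number of observations in each bin
```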
Bivariate data
The data:
• In 1998, Choice magazine tested 1500 toothbrushes and made a summary of price and function. Are these two factors related?
Toothbrush.csv
Price  Function
3.95   65.10
2.96   78.00
2.95   72.00
0.66   40.00
0.69   57.00
3.20   61.00
1.08   49.00
3.69   76.00
…      …
Data from Selvanathan, Australian Business Statistics (Abridged 4th Ed)
Slide 8
Bivariate data
Problem: analyse the relationship between price and function.
• Read the data and create a data frame
> Toothbrush <- ...
> cor(Toothbrush) # which is x or y is not important
             Price  Function
Price    1.0000000 0.6645614
Function 0.6645614 1.0000000
Slide 9
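One way to read the data and compute the correlation is sketched below. To keep it self-contained it writes only the eight rows visible on the data slide to a temporary file (the row pairing and the file path are assumptions), so the correlation will not match the full-data value shown above.

```r
# Only the eight rows visible on the slide; the real Toothbrush.csv has more.
rows <- c("Price,Function",
          "3.95,65.10", "2.96,78.00", "2.95,72.00", "0.66,40.00",
          "0.69,57.00", "3.20,61.00", "1.08,49.00", "3.69,76.00")
csv_path <- tempfile(fileext = ".csv")   # stand-in for Toothbrush.csv
writeLines(rows, csv_path)

Toothbrush <- read.csv(csv_path)         # header row becomes the column names
str(Toothbrush)

r <- cor(Toothbrush)                     # 2 x 2 correlation matrix
r
r["Price", "Function"]                   # same value as r["Function", "Price"]
```

The matrix is symmetric, which is why it does not matter which variable is treated as x or y.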
Scatterplot
> plot(Toothbrush)
> # the default plot putting Function on y axis
[Figure: scatterplot of Function (y axis) against Price (x axis)]
Slide 10
Bivariate data (attach function)
Problem: calculate the regression equation
• The ‘attach’ function lets you call columns by name without having to specify the data set, assuming the column name is unique amongst attached data sets.
> attach(Toothbrush)
• Scatterplot using Price and Function
> plot(Price, Function)
Slide 11
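The effect of attach can be compared with two alternatives that leave the search path untouched. This is a sketch on a small hypothetical data frame; the variable names m1, m2 and m3 are illustrative.

```r
# Hypothetical stand-in for the Toothbrush data frame
Toothbrush <- data.frame(Price    = c(3.95, 2.96, 0.66, 3.20),
                         Function = c(65.1, 78.0, 40.0, 61.0))

attach(Toothbrush)                    # columns become visible by bare name
m1 <- mean(Price)
detach(Toothbrush)                    # remove it from the search path again

m2 <- with(Toothbrush, mean(Price))   # same lookup, limited to one expression
m3 <- mean(Toothbrush$Price)          # fully explicit reference

c(m1, m2, m3)                         # all three agree
```

Calling detach when finished avoids the name clashes the slide warns about when several attached data sets share a column name.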
Bivariate data
Problem: calculate the regression equation cont.
• To calculate the regression equation, define the variable ‘fitted’ and use the linear model (lm) function.
> fitted = lm(Function ~ Price)
> fitted
Call:
lm(formula = Function ~ Price)
Coefficients:
(Intercept)        Price
     44.020        6.942
• Now overplot the fitted model on the scatterplot
> abline(fitted)
Slide 12
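The regression steps can be sketched without attach by passing the data frame to lm directly. The eight points are the rows visible on the data slide (pairing assumed), so the coefficients will differ from the full-data output above. The name fit and the price of 2.50 are illustrative choices.

```r
# Eight (Price, Function) pairs read from the data slide; pairing is assumed.
Toothbrush <- data.frame(
  Price    = c(3.95, 2.96, 2.95, 0.66, 0.69, 3.20, 1.08, 3.69),
  Function = c(65.1, 78.0, 72.0, 40.0, 57.0, 61.0, 49.0, 76.0))

fit <- lm(Function ~ Price, data = Toothbrush)   # data = avoids attach()
coef(fit)                                        # intercept and slope

# Predicted Function score at a hypothetical price of 2.50
pred <- predict(fit, newdata = data.frame(Price = 2.5))
pred

pdf(NULL)                                        # discard drawing; remove to view
plot(Function ~ Price, data = Toothbrush)
abline(fit)                                      # overplot the fitted line
dev.off()
```

Naming the model fit rather than fitted sidesteps any confusion with the base function fitted(), although R resolves both correctly.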
Scatterplot + regression line
> plot(Price, Function)
> abline(fitted) # overplotting
[Figure: scatterplot of Function against Price with the fitted regression line]
Slide 13
Estimation/Hypothesis testing
The data:
• The number of claims processed by two workers is given below. For convenience create two vectors:
> Workers <- ...
> attach(Workers)
Workers.csv
WorkerA: 23 45 21 22 17 42 45 41 49 19
WorkerB: 33 23 19 51 32 15
Slide 14
Estimation/Hypothesis testing
Problem 1:
• Calculate the confidence interval for the average number of claims processed by Worker A.
Problem 2:
• Can we conclude that Worker A processes more claims than Worker B?
Slide 15
Estimation/Hypothesis testing
Quick comparison of data using a boxplot:
> boxplot(WorkerA, WorkerB)
[Figure: side-by-side boxplots of WorkerA (1) and WorkerB (2)]
Slide 16
EHT: Problem 1
Performing a t.test (with alternative that mean ≠ 0) to generate a confidence interval.
> t.test(WorkerA)
        One Sample t-test
data:  WorkerA
t = 7.93, df = 9, p-value = 2.374e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 23.1574 41.6426
sample estimates:
mean of x
     32.4
• Specify the confidence level as a parameter to change the default, for example
> t.test(WorkerA, conf.level = 0.55)
Slide 17
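The interval reported by t.test can be checked by hand from the mean, the standard error and the t quantile. This sketch uses the WorkerA values from Workers.csv; the names se, tq and manual_ci are illustrative.

```r
WorkerA <- c(23, 45, 21, 22, 17, 42, 45, 41, 49, 19)   # from Workers.csv

n  <- length(WorkerA)
se <- sd(WorkerA) / sqrt(n)              # standard error of the mean
tq <- qt(0.975, df = n - 1)              # two-sided 95% critical value
manual_ci <- mean(WorkerA) + c(-1, 1) * tq * se

res <- t.test(WorkerA)                   # default conf.level = 0.95
res$conf.int                             # interval computed by t.test
manual_ci                                # agrees with the line above

t.test(WorkerA, conf.level = 0.99)$conf.int   # higher confidence, wider interval
```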
EHT: Problem 2
Performing a t.test to test whether the means are different:
> t.test(WorkerA, WorkerB)
        Welch Two Sample t-test
data:  WorkerA and WorkerB
t = 0.5333, df = 10.634, p-value = 0.6048
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.21422  18.34755
sample estimates:
mean of x mean of y
 32.40000  28.83333
Slide 18
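The fractional degrees of freedom in the output come from the Welch approximation, which t.test uses by default because var.equal = FALSE. As a sketch, the comparison below runs the pooled-variance version alongside it; the names welch and pooled are illustrative.

```r
WorkerA <- c(23, 45, 21, 22, 17, 42, 45, 41, 49, 19)
WorkerB <- c(33, 23, 19, 51, 32, 15)

welch  <- t.test(WorkerA, WorkerB)                    # default: unequal variances
pooled <- t.test(WorkerA, WorkerB, var.equal = TRUE)  # pooled-variance t-test

welch$parameter    # fractional degrees of freedom (Welch approximation)
pooled$parameter   # n1 + n2 - 2 degrees of freedom
welch$p.value
pooled$p.value
```

With these two samples the choice makes little practical difference, since neither test comes close to rejecting equal means.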
t.test: syntax
From the help file:
• Description
Performs one and two sample t-tests on vectors of data.
• Usage
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
Slide 19
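Problem 2 asks whether Worker A processes more claims, which is a one-sided question. A sketch of how the alternative and mu arguments from the usage above might be applied follows; the null value of 30 in the last call is a hypothetical choice for illustration.

```r
WorkerA <- c(23, 45, 21, 22, 17, 42, 45, 41, 49, 19)
WorkerB <- c(33, 23, 19, 51, 32, 15)

# Alternative hypothesis: mean(WorkerA) - mean(WorkerB) > 0
one_sided <- t.test(WorkerA, WorkerB, alternative = "greater")
two_sided <- t.test(WorkerA, WorkerB)

one_sided$p.value
two_sided$p.value / 2   # equal to the one-sided value when t is positive

# mu sets the null value: does Worker A's mean differ from 30?
t.test(WorkerA, mu = 30)$p.value
```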
Time series analysis
The data:
• The quantity of pre-mixed concrete produced, Australia (Mar 1990 – Mar 2013). From ABS: 8301.0 Production of Selected Construction Materials.
Concrete.csv
SeasYear  PreMix000
Mar-90    3814
Jun-90    3862
Sep-90    4049
Dec-90    3886
Mar-91    3114
Jun-91    3238
Sep-91    3365
Dec-91    3418
…         …
Slide 20
Time series analysis
Problem: read the data and declare as class ts:
> Concrete <- ...
> plot(PreMix000)
[Figure: time series plot of PreMix000 against Time, 1990 to 2010]
Slide 22
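Declaring a quarterly series as class ts might look like the sketch below. It uses only the first eight values shown on the data slide and assumes the March quarter is quarter 1; the lecture's actual commands are not recoverable from the slide.

```r
# First eight quarterly values from the data slide (Mar 1990 to Dec 1991).
# Assumption: the March quarter is treated as quarter 1 of the year.
PreMix000 <- ts(c(3814, 3862, 4049, 3886, 3114, 3238, 3365, 3418),
                start = c(1990, 1), frequency = 4)

class(PreMix000)       # "ts"
start(PreMix000)       # year and quarter of the first observation
end(PreMix000)         # year and quarter of the last observation
frequency(PreMix000)   # observations per year

window(PreMix000, start = c(1991, 1))   # extract the 1991 quarters only
```

Once the series carries its start and frequency, plot labels the time axis in years automatically.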
Time series analysis
Problem: decompose the time series
> decomp <- ...
> plot(decomp) # object stores all info about time series
[Figure: Decomposition of additive time series, with panels observed, trend, seasonal and random against Time, 1990 to 2010]
Slide 23
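The components stored in a decomposition object can be inspected directly. Because the concrete series is not available here, this sketch substitutes UKgas, a quarterly series that ships with R, and uses the decompose function as one plausible way to produce the plot above.

```r
x <- UKgas                   # built-in quarterly series, used as a stand-in
decomp <- decompose(x)       # additive decomposition by default

names(decomp)                # components stored in the object
decomp$figure                # one seasonal effect per quarter

# In an additive model: observed = trend + seasonal + random.
# The moving-average trend is NA at both ends, so compare where it is defined.
ok <- !is.na(decomp$trend)
recon <- decomp$trend + decomp$seasonal + decomp$random
all.equal(as.numeric(recon[ok]), as.numeric(x[ok]))

pdf(NULL)                    # discard drawing; remove to view the four panels
plot(decomp)
dev.off()
```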
A few review questions…
Slide 24
Question 1
Predict the output from the following commands:
> X <- ...
> class(X)
(a) numeric
(b) character
(c) numeric, character
Slide 29
Question 6
Predict the output from the following commands:
> X <- ...
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Slide 48
Print head and tail
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
Slide 49
…or a selection of rows
> iris[10:15,] # by convention [ ] index rows
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
10          4.9         3.1          1.5         0.1  setosa
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
> iris[11,] # single row
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11          5.4         3.7          1.5         0.2  setosa
Slide 50
…or part of a single column
> iris[10:20, "Sepal.Length"] # identify column as string
[1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1
> # or
> iris$Sepal.Length[10:20] # identify column by name
[1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1
> # or
> iris[10:20, 1] # identify column by number
[1] 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1
Slide 51
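The three forms above return the same vector, which can be confirmed with identical. The sketch also shows two related idioms not on the slide: drop = FALSE to keep a data frame, and a logical row filter. The threshold 7.5 is an arbitrary illustration.

```r
by_string <- iris[10:20, "Sepal.Length"]   # column identified as a string
by_name   <- iris$Sepal.Length[10:20]      # column identified by name
by_number <- iris[10:20, 1]                # column identified by position

identical(by_string, by_name)
identical(by_name, by_number)

# drop = FALSE keeps the result as a one-column data frame instead of a vector
one_col_df <- iris[10:20, "Sepal.Length", drop = FALSE]
class(one_col_df)

# Rows can also be selected with a logical condition
big <- iris[iris$Sepal.Length > 7.5, ]
nrow(big)
```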
Summary
Create a mean + 5 point summary of each numerical column and a list of factor types
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Slide 52
Class activity
The data set ‘mpg’ is contained in the ggplot2 package. Get to know it:
> summary
> str
> head
> tail
> unique # particular columns
> ?
Slide 53
Seeing the data
Slide 54
Base graphics
These are the graphic functions built into the basic R installation.
• High level graphics functions create new graphs with axes, labels and titles.
• Low level graphics functions then annotate plots with points, lines and text.
• See ATHR page 48.
Slide 55
Base graphics: high level functions
Some functions we have used/will use include:
> plot # Scatterplot
> pairs # Scatterplot matrix
> hist # Histogram
> stem # Stem-and-leaf plot
> boxplot # Box-and-whisker plot
> barplot # Bar plot
> dotchart # Dot plot
• See ATHR page 49
Slide 56
Base graphics: low level functions
Some low level functions include:
> lines # Draw lines between given coordinates
> text # Draw text at given coordinates
> abline # Draw a line of given intercept and slope, or a horizontal and/or vertical line
> axis # Add an axis
> arrows # Draw arrows
> grid # Add a rectangular grid
> legend # Add a legend (a key)
• See ATHR page 50
Slide 57
Base graphics: graphics parameters
Some graphics parameters include:
> main # Title of the plot
> ylab, xlab # Labels for the y-axis and x-axis
> type # Plot type (points, lines, both, ...)
> pch # Plot character (circles, dots, symbols, ...)
> lty # Line type (solid, dots, dashes, ...)
> lwd # Line width
> col # Colour of plot characters
> ...and many others, see: help(par)
• See ATHR page 50
Slide 58
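High level functions, low level functions and graphics parameters are normally combined in one plot. The sketch below does this on the built-in iris data; the colours, positions and label text are arbitrary choices for illustration.

```r
pdf(NULL)   # discard drawing so the sketch runs anywhere; remove to view

# High level call: starts a new plot, with parameters set as arguments
plot(iris$Petal.Length, iris$Petal.Width,
     main = "Iris petals", xlab = "Petal length", ylab = "Petal width",
     type = "p", pch = 19, col = "grey40")

# Low level calls: annotate the existing plot
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
abline(fit, lty = 2, lwd = 2, col = "red")       # line from intercept and slope
abline(h = mean(iris$Petal.Width), lty = 3)      # horizontal reference line
text(2, 2.2, "dashed: least-squares line")       # text at given coordinates
grid()                                           # rectangular grid
legend("bottomright", legend = c("fit", "mean width"),
       lty = c(2, 3), col = c("red", "black"))

dev.off()
```

A low level function only works after a high level call has opened a plot, so the order of the calls matters.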
Boxplot
Each variable can be viewed as a boxplot distinguished by level:
> boxplot(Sepal.Length ~ Species, data = iris)
[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica]
Slide 59
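boxplot also returns the numbers it draws, which can be useful for checking what the picture shows. A small sketch follows; the name b is illustrative.

```r
pdf(NULL)   # discard drawing; remove to view the plot
b <- boxplot(Sepal.Length ~ Species, data = iris)
dev.off()

b$names   # one box per factor level
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # values drawn as individual outlier points

# The middle row of b$stats matches the per-species medians
tapply(iris$Sepal.Length, iris$Species, median)
```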
Scatterplot
> with(iris, plot(Sepal.Length, Sepal.Width))
> # using ‘with’ simplifies column names etc.
[Figure: scatterplot of Sepal.Width against Sepal.Length]
Slide 60
Scatterplot + colour
> with(iris, plot(Sepal.Length, Sepal.Width, col = Species))
[Figure: scatterplot of Sepal.Width against Sepal.Length, coloured by Species]
Slide 61
Scatterplot + plot symbol
> with(iris, plot(Sepal.Length, Sepal.Width, col = Species, pch = as.numeric(Species)))
[Figure: scatterplot of Sepal.Width against Sepal.Length, with colour and plot symbol set by Species]
Slide 62
Scatterplot + jitter
> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species)))
[Figure: scatterplot of jitter(Sepal.Width) against jitter(Sepal.Length), with colour and plot symbol set by Species]
Slide 63
Scatterplot + labels
> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species), main = "Iris Data", xlab = "Sepal Length", ylab = "Sepal Width"))
[Figure: "Iris Data" scatterplot of Sepal Width against Sepal Length, with colour and plot symbol set by Species]
Slide 64
Scatterplot + legend
> # Follow the plot command with:
> with(iris, legend(6.5, 4.4, as.vector(unique(Species)), pch = unique(Species), col = unique(Species)))
[Figure: "Iris Data" scatterplot of Sepal Width against Sepal Length, with a legend listing setosa, versicolor and virginica]
Slide 65
Complete plot command
> with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species), main = "Iris Data", xlab = "Sepal Length", ylab = "Sepal Width"))
> with(iris, legend(6.5, 4.4, as.vector(unique(Species)), pch = unique(Species), col = unique(Species)))
Slide 66
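A variation on the complete command takes the legend entries from the factor levels and places the legend by keyword rather than by coordinates. This is a sketch of an alternative, not the lecture's version. It relies on the point that a factor used for col or pch is treated as its integer codes 1, 2, 3.

```r
sp <- iris$Species      # factor; its integer codes are 1, 2, 3
lv <- levels(sp)        # "setosa", "versicolor", "virginica"

pdf(NULL)               # discard drawing; remove to view the plot
with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width),
                col = sp, pch = as.numeric(sp),
                main = "Iris Data",
                xlab = "Sepal Length", ylab = "Sepal Width"))

# Codes 1..3 in the legend match the codes used for the points
legend("topright", legend = lv,
       pch = seq_along(lv), col = seq_along(lv))
dev.off()
```

Using levels rather than unique keeps the legend order tied to the factor coding even if the rows of the data frame are reordered.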
Saving graphics
Diverting graphics from RStudio window to a file:
• The code below opens a file, diverts the output from RStudio to a named file (of type wmf in this case) and then closes the diversion.
> setwd("C:/") # default output directory
> win.metafile(file = "toothbrush.wmf") # setup diversion
> plot( ... ) # plotting functions in here
> dev.off() # close file and diversion
• See ATHR page 57
Slide 67
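win.metafile is only available on Windows. The same open, plot, close pattern works on any platform with the pdf device, as in this sketch. The file name and the temporary directory are hypothetical choices.

```r
out_file <- file.path(tempdir(), "toothbrush.pdf")   # hypothetical output path

pdf(file = out_file, width = 6, height = 4)   # open the file device (inches)
plot(iris$Sepal.Length, iris$Sepal.Width)     # plotting functions go here
dev.off()                                     # close the file and the diversion

file.exists(out_file)                         # TRUE once the device is closed
```

Nothing is written to disk in a usable form until dev.off() is called, so forgetting it is the usual cause of empty or unreadable output files.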
Viewing more dimensions
Slide 68
All interactions: scatterplot matrix
The default method for a scatterplot matrix is
> pairs(iris)
[Figure: scatterplot matrix with panels Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
Slide 69
Scatterplot matrix
Adding colour
> pairs(iris[1:5], pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
[Figure: scatterplot matrix of the iris data with points coloured by species]