RGuide - btuh PDF

Title RGuide - btuh
Author Saharsha Bhandari
Course Data Science
Institution University of Sydney
Pages 26
File Size 1.3 MB
File Type PDF

03/09/2020

RGuide

1 Introduction to R/RStudio
2 Getting started with RStudio
3 Import data into R
4 Structure of data
5 Graphical Summaries
6 Numerical Summaries
7 Normal model
8 Linear model
9 Non-Linear Model
10 Simulate chance
11 Simulate chance variability (box model)
12 Sample surveys
13 Test for a proportion (using simulation)
14 Tests for a mean
15 Tests for relationships
16 More tests for relationships (Diagnostics)
17 FAQs
17.1 Why am I getting weird variable names?


Teach yourself R in DATA1001/DATA1901

Aim: This is a self-study guide to R. It allows you to consolidate your learning from labs by learning new R commands through one simple data set, mtcars, which is already stored in R. Sections marked ** are more for DATA1901, and for students who want to extend themselves.

1 Introduction to R/RStudio

R is an incredibly powerful open-source language for statistical analysis, which we will run through RStudio. If you have never coded before, it may seem hard at first, but it will soon become easier!

www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html#5_graphical_summaries


1.1 Find RStudio (on campus)

RStudio is available across campus in all university labs and on BYOdevice (https://byod.sydney.edu.au).

1.2 Install R and RStudio (on your own computer)

If you can, we recommend that you install both R (https://cran.csiro.au/) and RStudio (https://www.rstudio.com/products/rstudio/download/) on your own computer. R and RStudio are separate programs: first install R, and then install RStudio as the user interface. See this DataCamp blog (https://www.datacamp.com/community/tutorials/installing-R-windows-mac-ubuntu) for extra instructions on installation.

1.2.1 Install R

Go to R Project (http://cran.csiro.au/). Download the relevant version: Linux, Mac or Windows. When you are finished, you should see an icon on your desktop with a large capital R.

1.2.2 Install RStudio

1.2.3 Open RStudio

1.3 Using RStudio (if you don’t have a computer, or it’s not working)

1.3.0.1 Use the Ed Workspace in a browser

It’s easy to run an RStudio workspace on the Ed Discussion Board (https://edstem.org/dashboard). Datasets like the Australian Road Fatality data are already stored in Ed, in \course\data . The only downside is that you can’t import your own data into Ed (for projects).


1.3.0.2 Use an R compiler

You can run an R compiler in a browser, or on a tablet such as an iPad. Here is one example: https://rextester.com/l/r_online_compiler. Again, you can’t import your own data (for projects).

1.3.0.3 Use RStudio Cloud

You can run RStudio in the cloud for free. You can import your own data.


1.3.0.4 Use RStudio on iPad

This is relatively new, so not recommended for most students. See details here (https://levelup.gitconnected.com/using-rstudio-with-an-ipad-cb9f013bb3f).

2 Getting started with RStudio

2.1 Use RStudio as a calculator

Go to the Console (usually LHS at the bottom, where you see the cursor > ).

Copy the following commands into the bottom Console. After each command, press the Enter/Return key on your keyboard.
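The guide's own commands did not survive in this copy, so the lines below are only an illustrative sketch of the kind of calculator-style commands you could try in the Console. The variable name x is an arbitrary choice.

```r
# Illustrative commands only (not the guide's original list).
1 + 2          # addition
7 * 6          # multiplication
2^10           # powers
sqrt(16)       # square root
x <- 5         # store a value under the name x
x / 2          # reuse the stored value
```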

Alternatively, you can copy the commands into the top Script Window. Highlight all the commands, and click Run. Note that the output is in the bottom Console.

2.2 Experiment with data in R

Many data sets (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) are already loaded into R. They are good to experiment with, and are used in the R/LQuizzes. For a list of all data available, type data() .

For example, we’re going to consider mtcars . To view the data, simply type its name.
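As a minimal sketch, the two steps just described look like this in the Console:

```r
data()     # list the data sets that ship with R
mtcars     # print the whole mtcars data frame (32 rows)
```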


Clearly this is not recommended for large data sets! Instead, look at the first or last rows using the head or tail functions.

You can find out about the data by using help() .
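A short sketch of these three functions applied to mtcars (the choice of n = 3 is arbitrary):

```r
head(mtcars)          # first 6 rows (the default)
tail(mtcars, n = 3)   # last 3 rows
help(mtcars)          # open the help page; ?mtcars does the same
```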

Or, read about the mtcars data set (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html).

2.3 Get organised: course folder

It’s very important to set up a neat file management system for the semester. This is best practice, and will make your life easier! Create a course folder on your Desktop, eg DATA1001files .

2.4 Knit a given RMarkdown file

RMarkdown is a clever document/file which can save all your R code and comments in one place for easy editing. You have been given the file Lab1.Rmd on your Labs page on Canvas.

Download the file Lab1.Rmd , and store it in your new course folder DATA1001files .
Double-click on the file and notice how it automatically opens in the top LHS window of RStudio.
The information at the top is called the YAML, which you can edit (eg in author: "xxx" , replace xxx with your name).
Render the file using Knit . This will create Lab1.html , which you can open in a browser. This becomes your final report.
You can create new RMarkdown files by duplicating Lab1.Rmd and renaming it Project1.Rmd etc.
To customise your RMarkdown file, see the RMarkdown Cheat Sheet (https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).

2.5 Neat workflow

There is a handy way to streamline your workflow, so that the output of the .Rmd opens next to your input in RStudio. This makes it very easy to edit and see your results.

1. Open preferences


2. Select R Markdown
3. Select ‘Viewer Pane’
4. Click ‘Apply’

Now knit your LabDemo.Rmd file and see what happens - neat!

This will remain the setup for next time you open RStudio.

2.6 Install packages

What to do if a package is missing? Suppose you try to knit your .Rmd, only to face the error below:

Error in library(somepackage) : there is no package called ‘somepackage’

(Note that somepackage is being used as a placeholder here! Replace it with the name of the actual package that is missing.) No need to fret! All we need to do is install the package somepackage from CRAN, which is a repository for R packages.


What is an R package? Good question. You can think of an R package like an app on the app store: it contains sets of functions that give R extra functionality. For example, R was originally designed to be used very much like a calculator (very basic!). To be able to knit reports, we need new packages that extend R’s capabilities beyond the basic packages downloaded with R.

CRAN is very much like the App Store - but for R! So how do we access CRAN? Right through the command line! In the Console in R, you simply need to type install.packages("somepackage")

R then automatically searches CRAN for that package and downloads it. You can then try knitting your report again! Sometimes, more than one package will be missing - if this is the case, just repeat the steps above until they are all installed. In summary: if you want to use a certain package, install it with install.packages . This is a one-off step. The package will now appear in the list of packages in the bottom LHS window.

Then, every time you start a new .Rmd document or a new session in R, you need to call up the package by using library .
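The install-once, load-every-session pattern can be sketched as follows. The name stored in pkg is a stand-in chosen for this sketch: "tools" ships with R, so nothing is downloaded when you run it. In practice you would normally just type install.packages("somepackage") once and library(somepackage) in each session.

```r
pkg <- "tools"   # stand-in name; replace with the package you actually need

# One-off step: install from CRAN only if the package is missing.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Every new session or .Rmd: load the package.
# character.only = TRUE is needed because the name is held in a variable.
library(pkg, character.only = TRUE)
```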

3 Import data into R

There are so many different ways to import data into R, so you have lots of options. It can be confusing at first. So just experiment with the different methods below, and find what suits you!

3.1 From the internet

You can import data directly from the web. If you right click on the datasets in Canvas, you’ll get the URL. For example, see Lab2 Road .
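The actual Canvas URL is not reproduced in this guide, so the sketch below uses a made-up address and leaves the read step commented out. It assumes the file is a CSV, in which case read.csv accepts a URL in place of a file name. The names data_url and road are arbitrary.

```r
# Hypothetical address: replace it with the URL copied from Canvas.
data_url <- "https://www.example.com/AllFatalities.csv"

# Uncomment once data_url points at a real CSV file:
# road <- read.csv(data_url)
# head(road)
```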

This is a super easy method, which we will use in the labs. However, it doesn’t work if your data doesn’t have a URL (eg when you find your own data for a Project).


3.2 From your folder

Download a data set from the Canvas Lab page, for example AllFatalities.csv from Lab2. Put the data file into the DATA1001files folder where your .Rmd file is located.

Read the data into RStudio.

3.3 From a data subfolder

Longer-term, when you have more data files, it can be useful to create a subfolder called data within your DATA1001files folder.

Now you can store all your data inside data and then read your data straight into R.

Note: This method works well, unless your working directory is not pointed at your DATA1001files folder, ie your computer needs to know where to get the data file from. See the next section for how to set your working directory to your DATA1001files folder, if it is currently pointing somewhere else.
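Here is a self-contained sketch of reading a file that sits next to your .Rmd (3.2) and one inside a data subfolder (3.3). So that it runs on any computer, it builds a temporary stand-in for the DATA1001files folder and saves mtcars as example.csv; these names and files are assumptions for illustration, not the real Canvas data.

```r
# Temporary stand-in for your DATA1001files folder, with a data subfolder.
course_dir <- file.path(tempdir(), "DATA1001files")
dir.create(file.path(course_dir, "data"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(course_dir)   # point R at the course folder

# Stand-in data files (mtcars saved as CSV).
write.csv(mtcars, "example.csv", row.names = FALSE)
write.csv(mtcars, "data/example.csv", row.names = FALSE)

from_folder    <- read.csv("example.csv")        # file next to the .Rmd
from_subfolder <- read.csv("data/example.csv")   # file inside data/

setwd(old_wd)   # restore the original working directory
```

With your own files, only the two read.csv lines are needed, with example.csv replaced by your file name.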

3.4 Using file.choose() and working directories

The working directory is where RStudio is pointing, ie where it draws files from and where it saves files to. It is generally best practice to store your data near your .Rmd file in the DATA1001files folder. In that case, if you open the .Rmd file directly from there, RStudio will set the working directory to that folder. However, if you are not sure what’s happening with your working directory, here’s one easy plan!

1. Check where your current working directory is.

2. Ask RStudio to browse the files on your computer. This will give you the file pathway (ie where the file is stored).

For example, for Mac users some possible paths might be:
desktop: “/Users/johnsmith/Desktop”
folder on desktop: “/Users/johnsmith/Desktop/DATA1001”


dropbox: “/Users/johnsmith/Dropbox”

3. Set the working directory to that path.

4. Read your data in.
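The four steps above can be sketched as follows. To keep the sketch runnable anywhere, tempdir() stands in for your real folder path and mtcars is saved as a stand-in file called example.csv; both are assumptions for illustration. The file.choose() step opens a file browser, so it only runs in an interactive session.

```r
getwd()                          # 1. where is R pointing right now?

if (interactive()) {
  chosen <- file.choose()        # 2. browse to a file; returns its full path
  print(dirname(chosen))         #    the folder that contains it
}

course_dir <- tempdir()          # stand-in for e.g. "/Users/johnsmith/Desktop/DATA1001"
old_wd <- setwd(course_dir)      # 3. point R at that folder

write.csv(mtcars, "example.csv", row.names = FALSE)   # stand-in data file
dat <- read.csv("example.csv")   # 4. read the data using just the file name

setwd(old_wd)                    # put the working directory back
```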

Note: Students often find working directories confusing at first! But once you have mastered it, it becomes straightforward.

3.5 Using Import dataset

Another way to import data is to use the “Import Dataset” tab in RStudio.

3.6 Note about Excel files

Note: If your file is in Excel format, eg PBS2015.xlsx , then you need the package readxl to be installed first. For example, consider this Excel data here (http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/data/PBS2015.xlsx).
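A hedged sketch of reading the Excel file with read_excel from the readxl package. It assumes you have already downloaded PBS2015.xlsx into your working directory; the guard means nothing happens if the package or the file is missing. The name pbs is an arbitrary choice.

```r
# install.packages("readxl")   # one-off step, if readxl is not installed yet

if (requireNamespace("readxl", quietly = TRUE) && file.exists("PBS2015.xlsx")) {
  pbs <- readxl::read_excel("PBS2015.xlsx")   # reads the first sheet by default
  print(head(pbs))
}
```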

3.7 More info This free Data Camp course (https://learn.datacamp.com/courses/introduction-to-importing-data-in-r) is good!

3.8 Data Wrangling **

Data preparation is an essential part of statistical analysis and can be very time-consuming. It can involve cleaning or tidying, cleansing, scrubbing, reshaping, reforming, splitting and combining. It must be performed carefully and transparently. The aim is to change Messy (or Raw or Dirty) Data into Clean Data, which is correctly and consistently formatted and ready for analysis. This can involve removing redundant or useless information, standardising information (like calendar dates), separating or combining columns, and dealing with warnings.

Simple data can be cleaned in Excel. Get rid of any extra formatting, so that the data looks like:

ID (if applicable)   Variable1   Variable2   Variable3
1                    14          25          34.4
2                    15          23          19.7

More complex cleaning can be done through a package like tidyverse (http://tidyverse.org/). Install the package, as a one-off command.

Each new session of RStudio, you will need to load the package.

See the cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) and article (https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf).

4 Structure of data

We will use mtcars to illustrate the structure of data.

4.1 Classify variables

Recall there are 2 main types of variables: qualitative and quantitative, which R calls Factor and num . View the data.


Calculate the dimensions of the data set using dim .

This means that there are 32 rows (the types of cars) and 11 variables (properties of the cars).

List the names of the variables.

See how R has classified the variables by viewing the structure of the data using str .

where ‘num’ is a quantitative (numerical) variable and ‘Factor’ is a qualitative (categorical) variable.
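A minimal sketch of the three commands described in this subsection:

```r
dim(mtcars)     # number of rows, then number of columns
names(mtcars)   # the variable names
str(mtcars)     # how R has classified each variable
```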

4.2 Isolate a variable

Choose one variable from the data frame by using DataName$VariableName and store the result in a vector.

Note that RStudio has code completion, so it will auto-predict your commands. When you type mtcars$ , the names of all the variables will come up.

See the classification of 1 variable using class .

See the length of 1 variable.

Calculate the sum of a (quantitative) variable.

If at any command you get the answer NA, it means that you need to specify what to do with missing values. See Resource (https://stat.ethz.ch/R-manual/R-devel/library/base/html/sum.html) on how to solve this.

Sort the data in increasing order.

Work out (https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html) how to sort the data in decreasing order.

Sum the 5 lowest values of the variable.
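One possible way to carry out the steps in this subsection, using mpg as the example variable (the choice of variable is an assumption; any quantitative column works). The na.rm line appends an artificial NA purely to show how missing values can be skipped.

```r
mpg <- mtcars$mpg               # isolate one variable as a vector
class(mpg)                      # classification of the variable
length(mpg)                     # number of values
sum(mpg)                        # total of all values
sum(c(mpg, NA), na.rm = TRUE)   # na.rm = TRUE ignores missing values
sort(mpg)                       # increasing order
sort(mpg, decreasing = TRUE)    # decreasing order
sum(sort(mpg)[1:5])             # sum of the 5 lowest values
```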

4.3 Select subset

Pick the 1st and 5th elements of the vector mpg .
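A short sketch, assuming mpg is the vector taken from mtcars$mpg:

```r
mpg <- mtcars$mpg
mpg[c(1, 5)]   # square brackets pick elements by position
```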

4.4 Change classification

You may not agree with R’s initial classification, and want to change it.

For example, note that the number of carburetors carb is classified as num . Reclassify carb as a factor .

To change from a factor to a num :

## Warning: NAs introduced by coercion

Note: (1) The warning message is not a problem - it is just alerting you to the introduction of NAs. (2) It is a mistake to just use as.numeric() on a factor: that returns the internal level codes rather than the original values.
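A sketch of both conversions on a copy of mtcars (the copy, named cars here, is an assumption made so that the built-in data set is left untouched). For carb all levels are numbers, so this particular example does not trigger the coercion warning.

```r
cars <- mtcars                        # work on a copy
class(cars$carb)                      # starts as "numeric"

cars$carb <- as.factor(cars$carb)     # reclassify carb as a factor
class(cars$carb)                      # now "factor"

as.numeric(cars$carb)                 # the mistake: returns level codes 1 to 6
as.numeric(as.character(cars$carb))   # recovers the original values
```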


5 Graphical Summaries

The graphical summary must match up with the type of variable(s).

Variable                        Type of summary
1 Qualitative                   (Single) Barplot
1 Quantitative                  Histogram or (Single) Boxplot
2 Qualitative                   Double Barplot, Mosaicplot
2 Quantitative                  Scatterplot
1 Quantitative, 1 Qualitative   Double Boxplot

5.1 Barplot

A barplot is used for qualitative variables. Which variables are qualitative in mtcars ? Produce a single barplot of the gears.

Notice this is not useful! This is because R has classified gear as a quantitative variable. Instead, first summarise gears into a table as follows.

##
##  3  4  5
## 15 12  5
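A sketch of the sequence just described: the unhelpful barplot of the raw values, then the table of counts, then the barplot of that table. The name counts is an arbitrary choice.

```r
barplot(mtcars$gear)           # not useful: draws one bar per car

counts <- table(mtcars$gear)   # number of cars for each number of gears
counts
barplot(counts)                # one bar per category
```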

Now customise the barplot.

Make the names of bars perpendicular to axis.
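One possible customisation, offered as a sketch: the titles and colour are arbitrary choices, and las = 2 is the graphical parameter that draws axis labels perpendicular to the axis.

```r
counts <- table(mtcars$gear)
barplot(counts,
        main = "Number of gears",
        xlab = "Gears",
        ylab = "Number of cars",
        col  = "lightblue",
        las  = 2)   # axis labels perpendicular to the axis
```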

Now consider 2 qualitative variables: gear and cyl . Produce a double barplot by faceting or filtering the barplot of gear by cyl .
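The guide's own code is not shown, so this is just one base R approach: build a two-way table and draw side-by-side bars with beside = TRUE. The name counts2 and the axis labels are arbitrary.

```r
counts2 <- table(mtcars$cyl, mtcars$gear)   # rows: cyl, columns: gear
counts2

barplot(counts2,
        beside      = TRUE,                 # bars side by side, not stacked
        legend.text = rownames(counts2),    # legend shows the cyl groups
        xlab        = "Gears",
        ylab        = "Number of cars")
```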

What do you learn?

5.2 Histogram

A histogram is used for quantitative variables. Which variables are quantitative in mtcars ? Produce a histogram of the weights using hist .

Produce a probability histogram of the weights. What is the difference? Why do the 2 histograms have an identical shape here?

In this course, we will consider the probability histogram (2nd one) which means that the total area of the histogram is 1. What does the histogram tell us about weights of the cars? To see what customisations for hist are available, use help .

Try this histogram of the weights.
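A sketch of the histograms in this subsection. With equal-width bins, the density scale is just the counts rescaled, which is why the two shapes match. The customised version at the end is an illustration with arbitrary titles, colour and number of breaks, not the guide's original plot; the weight unit (1000 lbs) comes from the mtcars help page.

```r
hist(mtcars$wt)                 # raw histogram: heights are counts
hist(mtcars$wt, freq = FALSE)   # probability histogram: heights are densities

# Check that the total area of the probability histogram is 1.
h <- hist(mtcars$wt, plot = FALSE)
sum(h$density * diff(h$breaks))

help(hist)                      # list the available customisations

hist(mtcars$wt, freq = FALSE, breaks = 10,
     main = "Car weights", xlab = "Weight (1000 lbs)", col = "lightgreen")
```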

Experiment with the customisations to see how they work. Try a histogram of the gross horsepower. What do you learn? Produce a histogram of mpg. What do you learn?

5.3 Boxplot

A boxplot is another summary for quantitative variables. Produce a single boxplot for the weights of cars.


Produce a horizontal boxplot.

Which orientation do you prefer? Compare to the histogram of weights above: what different features are highlighted by a boxplot? Customise your boxplot.

Now consider dividing the weights (quantitative) by cylinders (qualitative). Produce a double boxplot, by filtering or faceting wt by cyl .
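The boxplots in this subsection can be sketched in base R as follows. The titles and colour are arbitrary, and the formula wt ~ cyl is one way of splitting the weights by cylinders.

```r
boxplot(mtcars$wt)                      # single (vertical) boxplot
boxplot(mtcars$wt, horizontal = TRUE)   # horizontal boxplot

# A customised version.
boxplot(mtcars$wt, horizontal = TRUE,
        main = "Car weights", xlab = "Weight (1000 lbs)", col = "lightblue")

# One box per number of cylinders.
boxplot(wt ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Weight (1000 lbs)")
```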

What do you learn about the weights of cars? See Car Weight, page 6 (http://web.mit.edu/sloan-autolab/research/beforeh2/files/MacKenzie%20Zoepf%20Heywood%20Car%20Weight%20Trends%20-%20IJVD.pdf). Try faceting the weights by another qualitative variable.

5.4 Mosaicplot

A mosaic plot visualises 2 qualitative variables.
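As a sketch, a mosaic plot of gear against cyl (the pair of variables and the labels are assumptions; any 2 qualitative variables work):

```r
tab <- table(mtcars$gear, mtcars$cyl)   # two-way table of counts
mosaicplot(tab,
           main = "Gears by cylinders",
           xlab = "Gears",
           ylab = "Cylinders")
```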

5.5 Scatterplot

A scatter plot considers the relationship between 2 quantitative variables. What does this plot tell us?

Customise your scatterplot. You can change the plotting symbols (http://www.statmethods.net/advgraphs/parameters.html).

Add a line of best fit.

We’ll explore this further in Section 6. We can compare pairs of multiple quantitative variables using plot or pairs .
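A sketch of the scatterplot steps, using wt and mpg as the 2 quantitative variables (an assumed choice). The plotting symbol, colours and the 3 columns passed to pairs are arbitrary.

```r
plot(mtcars$wt, mtcars$mpg)              # basic scatterplot

plot(mtcars$wt, mtcars$mpg,              # customised scatterplot
     pch = 19, col = "darkblue",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy against weight")

fit <- lm(mpg ~ wt, data = mtcars)       # least-squares line of best fit
abline(fit, col = "red")                 # add the line to the current plot

pairs(mtcars[, c("mpg", "wt", "hp")])    # scatterplots of every pair
```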

Which variables seem to be related?

5.6 ggplot

All the previous (base R) plots can be done in ggplot , allowing much greater customisation. First, download the package ggplot2 (http://tidyverse.org/) into RStudio. This is a one-off command.

Each time you open RStudio, load the ggplot2 package.

5.6.1 Barplot

Produce a single barplot.

Produce a double barplot.
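A sketch of both barplots, assuming ggplot2 is already installed. Wrapping gear and cyl in factor() tells ggplot to treat them as qualitative; the object names p1 and p2 and the labels are arbitrary.

```r
library(ggplot2)

# Single barplot: number of cars for each number of gears.
p1 <- ggplot(mtcars, aes(x = factor(gear))) +
  geom_bar() +
  labs(x = "Gears", y = "Number of cars")
p1

# Double barplot: split each gear group by cylinders, side by side.
p2 <- ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
  geom_bar(position = "dodge") +
  labs(x = "Gears", y = "Number of cars", fill = "Cylinders")
p2
```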

5.6.2 Histogram

Produce a histogram of the weights.

Using aes(y=..density..) turns a raw histogram into a probability histogram.
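A sketch of the probability histogram, assuming ggplot2 is installed. It uses after_stat(density), which is the newer spelling of ..density.. (the dot-dot form is deprecated from ggplot2 3.4.0). The number of bins is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10) +
  labs(x = "Weight (1000 lbs)", y = "Density")
p
```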

5.6.3 Boxplot

Produce a single boxplot.

Produce a double boxplot.

geom_jitter plots the points with a small amount of random noise. We use it to investigate over-plotting in small data sets.

What do the following customisations do?

Now consider dividing the weights by another qualitative variable.
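A sketch of the single and double boxplots with jittered points, assuming ggplot2 is installed. The jitter width and labels are arbitrary, and cyl is used as the qualitative variable for the split.

```r
library(ggplot2)

# Single boxplot of the weights (x = "" gives one unlabelled group).
p1 <- ggplot(mtcars, aes(x = "", y = wt)) +
  geom_boxplot() +
  labs(x = NULL, y = "Weight (1000 lbs)")
p1

# Double boxplot: weights split by cylinders, with the raw points jittered on top.
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(x = "Cylinders", y = "Weight (1000 lbs)")
p2
```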

5.6.4 Scatterplo...

