Title | RGuide - btuh |
---|---|
Author | Saharsha Bhandari |
Course | Data Science |
Institution | University of Sydney |
Pages | 26 |
File Size | 1.3 MB |
File Type | |
Total Downloads | 47 |
Total Views | 130 |
03/09/2020
RGuide
1 Introduction to R/RStudio
2 Getting started with RStudio
3 Import data into R
4 Structure of data
5 Graphical Summaries
6 Numerical Summaries
7 Normal model
8 Linear model
9 Non-Linear Model
10 Simulate chance
11 Simulate chance variability (box model)
12 Sample surveys
13 Test for a proportion (using simulation)
14 Tests for a mean
15 Tests for relationships
16 More tests for relationships (Diagnostics)
17 FAQs
17.1 Why am I getting weird variable names?
Teach yourself R in DATA1001/DATA1901
Aim
This is a self-study guide to R. It allows you to consolidate your learning from labs by learning new R commands through one simple data set, mtcars, which is already stored in R. Sections marked ** are more for DATA1901, and for students who want to extend themselves.
1 Introduction to R/RStudio
R is an incredibly powerful open-source language for statistical analysis, which we will run through RStudio. If you have never coded before, it may seem hard at first, but it will soon become easier!
www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html#5_graphical_summaries
1.1 Find RStudio (on campus)
RStudio is available across campus in all university labs and on BYOdevice (https://byod.sydney.edu.au).
1.2 Install R and RStudio (on your own computer)
If you can, we recommend that you install both R (https://cran.csiro.au/) and RStudio (https://www.rstudio.com/products/rstudio/download/) onto your own computer. R and RStudio are separate packages. First install R, and then install RStudio as the user interface. See this DataCamp blog (https://www.datacamp.com/community/tutorials/installing-R-windows-mac-ubuntu) for extra instructions on installation.
1.2.1 Install R
Go to R Project (http://cran.csiro.au/). Download the relevant version: Linux, Mac or Windows. When you are finished, you should see an icon on your desktop with a large capital ‘R’.
1.2.2 Install RStudio
1.2.3 Open RStudio
1.3 Using RStudio (if you don’t have a computer, or it’s not working)
1.3.0.1 Use the Ed Workspace in a browser
It’s easy to run an RStudio workspace on the Ed Discussion Board (https://edstem.org/dashboard). Datasets like the Australian Road Fatality data are already stored in Ed, in \course\data . The only downside is that you can’t import your own data into Ed (for projects).
1.3.0.2 Use an R compiler
You can run an R compiler in a browser, including on a tablet such as an iPad. Here is one example: click here (https://rextester.com/l/r_online_compiler). Again, you can’t import your own data (for projects).
1.3.0.3 Use RStudio Cloud
You can run RStudio in the cloud for free. You can import your own data.
1.3.0.4 Use RStudio on iPad
This is relatively new, so it is not recommended for most students. See details here (https://levelup.gitconnected.com/using-rstudio-with-an-ipad-cb9f013bb3f).
2 Getting started with RStudio
2.1 Use RStudio as a calculator
Go to the Console (usually LHS at the bottom, where you see the cursor > ).
Copy the following commands into the bottom Console. After each command, press the Enter/Return key on your keyboard. Code
Alternatively, you can copy the commands into the top Script Window. Highlight all the commands, and press the Run button. Note that the output appears in the bottom Console.
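A minimal sketch of the kind of commands to try at the > prompt, assuming simple arithmetic is all that is needed here. The particular numbers and the object name x are illustrative, not taken from the guide:

```r
# Basic arithmetic: R prints the answer after each line
1 + 2
10 / 4
2^3            # 2 to the power of 3

# Built-in functions work like calculator buttons
sqrt(16)

# Store a result in a named object (x is an arbitrary name), then reuse it
x <- 5 * 3
x + 1
```

Each line can be run on its own in the Console, or all of them together from the Script Window.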
2.2 Experiment with data in R
Many data sets (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) are already loaded into R. They are good to experiment with, and are used in the R/LQuizzes. For a list of all data available, type data() . Code
For example, we’re going to consider mtcars . To view the data, simply type its name. Code
Clearly this is not recommended for large data sets! Instead, look at the first or last rows using the head or tail functions. Code
You can find out about the data by using help() . Code
Or, read about the mtcars data set (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html).
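The viewing commands described above can be sketched as follows. This is one plausible version rather than the guide's own code; the choice of n = 3 is arbitrary, and the help page is only opened in an interactive session so the snippet also runs when knitted:

```r
# Look at only the first or last rows instead of the whole data set
head(mtcars)          # first 6 rows by default
tail(mtcars, n = 3)   # last 3 rows (n is optional)

# Open the help page describing mtcars (skipped when not interactive)
if (interactive()) {
  help(mtcars)
}
```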
2.3 Get organised: course folder
It’s very important to set up a neat file management system for the semester. This is best practice, and will make your life easier! Create a course folder on your Desktop, e.g. DATA1001files .
2.4 Knit a given RMarkdown file
RMarkdown is a clever document/file which can save all your R code and comments in one place for easy editing. You have been given the file Lab1.Rmd on your Labs page on Canvas. Download the file Lab1.Rmd , and store it in your new course folder DATA1001files . Double-click on the file and notice how it automatically opens in the top LHS window of RStudio. The information at the top is called the YAML, which you can edit: e.g. in author: "xxx" , replace xxx with your name. Render the file using Knit . This will create Lab1.html , which you can open in a browser. This becomes your final report. You can create new RMarkdown files by duplicating Lab1.Rmd and renaming it Project1.Rmd etc. To customise your RMarkdown file, see the RMarkdown Cheat Sheet (https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).
2.5 Neat workflow
There is a handy way to streamline your workflow, so that the output of the .Rmd opens next to your input in the RStudio console. This makes it very easy to edit and see your results.
1. Open preferences
2. Select R Markdown
3. Select ‘Viewer Pane’
4. Click ‘Apply’
Now knit your LabDemo.Rmd file and see what happens - neat!
This will remain the setup for next time you open RStudio.
2.6 Install packages
What to do if a package is missing?
Suppose you try to knit your .Rmd, only to face the error below:
Error in library(somepackage) : there is no package called ‘somepackage’
(Note, of course, that somepackage is being used as a placeholder here! Replace it with the name of the actual package that is missing.) No need to fret! All we need to do is install the package somepackage from CRAN, which is a repository for R packages.
What is an R package?
Good question. You can think of an R package like an app on the app store: it contains sets of functions that give R extra functionality. For example, R was originally designed to be used very much like a calculator (very basic!). To be able to knit reports, we need new packages that extend R’s capabilities beyond the most basic packages downloaded with R.
CRAN is very much like the App store - but for R! So how do we access CRAN? Right through the command line! In the console in R, you simply need to type
install.packages("somepackage")
R then automatically searches CRAN for that package and downloads it. You can then try knitting your report again! Sometimes more than one package will be missing; if this is the case, just repeat the steps above until they are all installed.
In summary: If you want to use a certain package, use the following code. This is a one-off step. The package will now appear in the list of packages in the bottom LHS window. Code
Then, every time you start a new .Rmd document or a new session in R, you need to load the package using library . Code
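The install-once, load-every-session pattern can be sketched as below. To keep the example runnable without downloading anything, it assumes the package of interest is stats, which ships with R; in practice pkg would hold the name of whichever package is missing:

```r
# Name of the package we want; "stats" is only a stand-in that is always
# available, so nothing is downloaded when this example runs
pkg <- "stats"

# One-off step: install from CRAN only if the package is not already there
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Every session: load the package so its functions can be used
library(pkg, character.only = TRUE)
```

When typing the name directly, the usual form is simply library(somepackage); character.only = TRUE is needed here only because the name is stored in a variable.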
3 Import data into R
There are so many different ways to import data into R, so you have lots of options. It can be confusing at first. So just experiment with the different methods below, and find what suits you!
3.1 From the internet
You can import data directly from the web. If you right click on the datasets in Canvas, you’ll get the URL. For example, see Lab2 Road . Code
This is a super easy method, which we will use in the labs. However, it doesn’t work if your data doesn’t have a URL (e.g. when finding your own data for a Project).
3.2 From your folder
Download a data set from the Canvas Lab page, for example AllFatalities.csv from Lab2. Put the data file into the DATA1001files folder where your .Rmd file is located.
Read the data into RStudio. Code
3.3 From a data subfolder
Longer-term, when you have more data files, it can be useful to create a subfolder called data within your DATA1001files folder.
Now you can store all your data inside data and then read your data straight into R. Code
Note: This method works only if your working directory is pointed at your DATA1001files folder, i.e. your computer needs to know where to get the data file from. See the next section for how to set your working directory to your DATA1001files folder, if it is currently pointing somewhere else.
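Reading a CSV from a data subfolder can be sketched as follows. So that the example runs anywhere, it builds a throwaway copy of the folder layout in a temporary directory and writes a tiny made-up file toy.csv; the folder stand-in, the file name and its contents are all assumptions for illustration, not course data:

```r
# Stand-in for the course folder; in practice this is DATA1001files
course_dir <- file.path(tempdir(), "DATA1001files")
dir.create(file.path(course_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny made-up CSV into the data subfolder
toy <- data.frame(ID = 1:2, Variable1 = c(14, 15))
write.csv(toy, file.path(course_dir, "data", "toy.csv"), row.names = FALSE)

# Read it back; the full path is used so the working directory does not matter
mydata <- read.csv(file.path(course_dir, "data", "toy.csv"))
head(mydata)
```

If the working directory is already the course folder, the shorter read.csv("data/toy.csv") does the same job. read.csv() also accepts a URL string in place of a file path.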
3.4 Using file.choose() and working directories
The working directory is where RStudio is pointing, i.e. where it draws files from and where it saves files to. It is generally best practice to store your data near your .Rmd file in the DATA1001files folder. In that case, if you open the .Rmd file directly from there, RStudio will set the working directory to that folder. However, if you are not sure what’s happening with your working directory, here’s one easy plan!
1. Check where your current working directory is. Code
2. Ask RStudio to browse the files on your computer. This will give you the file pathway (i.e. where the file is stored). Code
For example, for Mac users some possible paths might be:
desktop: “/Users/johnsmith/Desktop”
folder on desktop: “/Users/johnsmith/Desktop/DATA1001”
dropbox: “/Users/johnsmith/Dropbox”
3. Set the working directory to that path. Code
4. Read your data in. Code
Note: Students often find working directories confusing at first! But once you have mastered it, it becomes straightforward.
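The four steps can be sketched as below, assuming the file you pick is a CSV. file.choose() opens a file browser, so that part is wrapped in interactive() and only runs in a live RStudio session; the object names path and mydata are arbitrary:

```r
# 1. Check the current working directory
getwd()

if (interactive()) {
  # 2. Browse for the data file; the result is its full path
  path <- file.choose()

  # 3. Point the working directory at the folder containing that file
  setwd(dirname(path))

  # 4. Read the data using just the file name (assumes a CSV file)
  mydata <- read.csv(basename(path))
}
```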
3.5 Using Import dataset
Another way to import data is to use the “Import Dataset” tab in RStudio.
3.6 Note about Excel files
Note: If your file is in Excel format, e.g. PBS2015.xlsx , then you first need to install the package readxl . For example, consider this Excel data here (http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/data/PBS2015.xlsx). Code
3.7 More info
This free DataCamp course (https://learn.datacamp.com/courses/introduction-to-importing-data-in-r) is good!
3.8 Data Wrangling **
Data preparation is an essential part of statistical analysis and can be very time-consuming. It can involve cleaning or tidying, cleansing, scrubbing, reshaping, reforming, splitting and combining. It must be performed carefully and transparently. The aim is to change Messy (or Raw or Dirty) Data into Clean Data, which is correctly and consistently formatted and ready for analysis. This can involve removing redundant or useless information, standardising information (like calendar dates), separating or combining columns, and dealing with warnings. Simple data can be cleaned in Excel. Get rid of any extra formatting, so that the data looks like:
ID (if applicable) | Variable1 | Variable2 | Variable3 |
---|---|---|---|
1 | 14 | 25 | 34.4 |
2 | 15 | 23 | 19.7 |
More complex cleaning can be done through a package like tidyverse (http://tidyverse.org/). Install the package, as a one-off command. Code
Each new session of RStudio, you will need to load the package. Code
See cheat sheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) and article (https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf).
4 Structure of data
We will use mtcars to illustrate the structure of data.
4.1 Classify variables
Recall there are 2 main types of variables: qualitative and quantitative, which R calls Factor and num respectively. View the data.
Code
Calculate the dimensions ( dim ) of the data set. Code
This means that there are 32 rows (the types of cars) and 11 variables (properties of the cars).
List the names of the variables. Code
See how R has classified the variables by viewing the structure ( str ) of the data. Code
where ‘num’ is a quantitative (numerical) variable and ‘Factor’ is a qualitative (categorical) variable.
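The three commands named above can be sketched as follows, using the built-in mtcars data:

```r
dim(mtcars)     # number of rows, then number of columns
names(mtcars)   # the variable names
str(mtcars)     # how R has classified each variable (num, Factor, ...)
```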
4.2 Isolate a variable
Choose one variable from the data frame by using DataName$VariableName and store the result in a vector. Code
Note that RStudio has code completion, so it will auto-predict your commands. When you type mtcars$ , the names of all the variables will come up.
See the classification ( class ) of 1 variable. Code
See the length of 1 variable. Code
Calculate the sum of a (quantitative) variable. Code
If at any command you get the answer NA, it means that you need to specify what to do with missing values. See Resource (https://stat.ethz.ch/R-manual/R-devel/library/base/html/sum.html) on how to solve this.
Sort the data in increasing order. Code
Work out (https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html) how to sort the data in decreasing order. Code
Sum the 5 lowest values of the variable. Code
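One way to carry out the steps of this section, assuming the variable of interest is mpg (the vector name mpg is reused in the next section; any other numeric column would work the same way):

```r
# Isolate one variable and store it as a vector
mpg <- mtcars$mpg

class(mpg)     # classification of the variable
length(mpg)    # how many values it holds
sum(mpg)       # total of the values

# If the answer were NA, missing values can be dropped before summing
sum(mpg, na.rm = TRUE)

sort(mpg)                      # increasing order
sort(mpg, decreasing = TRUE)   # decreasing order

# Sum of the 5 lowest values: sort, keep the first five, then add
sum(sort(mpg)[1:5])
```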
4.3 Select subset
Pick the 1st and 5th elements of the vector mpg . Code
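A short sketch of this selection using square-bracket indexing, with mpg defined again so the snippet stands alone:

```r
mpg <- mtcars$mpg

# c(1, 5) lists the positions wanted; the square brackets pick them out
mpg[c(1, 5)]
```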
4.4 Change classification
You may not agree with R’s initial classification, and want to change it. Code
For example, note that the number of carburetors carb is classified as num . Reclassify carb as a factor . Code
To change from a factor to a num : Code
## Warning: NAs introduced by coercion
Note: (1) The warning message is not a problem - it is just alerting you to the introduction of NAs. (2) Note the mistake above that occurs if you just use as.numeric() .
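A hedged sketch of both conversions, using carb as in the text. The object name carb_factor is arbitrary, and this is not necessarily the guide's own code. It illustrates a known pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original values, so the factor is converted to character first.

```r
# num -> factor
carb_factor <- as.factor(mtcars$carb)
class(carb_factor)
levels(carb_factor)

# factor -> num, the pitfall: this returns level codes (1, 2, 3, ...)
as.numeric(carb_factor)

# factor -> num, recovering the original values
carb_num <- as.numeric(as.character(carb_factor))
class(carb_num)
```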
5 Graphical Summaries
The graphical summary must match up with the type of variable(s).
Variable | Type of summary |
---|---|
1 Qualitative | (Single) Barplot |
1 Quantitative | Histogram or (Single) Boxplot |
2 Qualitative | Double Barplot, Mosaicplot |
2 Quantitative | Scatterplot |
1 Quantitative, 1 Qualitative | Double Boxplot |
5.1 Barplot
A barplot is used for qualitative variables. Which variables are qualitative in mtcars ? Produce a single barplot of the gears. Code
Notice this is not useful! This is because R has classified gear as a quantitative variable. Instead, first summarise gears into a table as follows. Code
##
##  3  4  5
## 15 12  5
Code
Now customise the barplot. Code
Make the names of bars perpendicular to axis. Code
Now consider 2 qualitative variables: gear and cyl . Produce a double barplot by faceting or filtering the barplot of gear by cyl . Code Code Code
What do you learn?
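The barplot steps above can be sketched in base R as follows. The titles, labels and object names counts and counts2 are illustrative choices, and the double barplot uses a two-way table with beside = TRUE, which is one of several ways to do it:

```r
# Summarise gear into a table of counts, then plot the table
counts <- table(mtcars$gear)
counts
barplot(counts)

# Customised version; las = 2 turns the bar names perpendicular to the axis
barplot(counts, main = "Number of gears", xlab = "Gears",
        ylab = "Number of cars", las = 2)

# Double barplot: counts of cyl (rows) within each gear (columns)
counts2 <- table(mtcars$cyl, mtcars$gear)
barplot(counts2, beside = TRUE, legend.text = TRUE,
        xlab = "Gears", ylab = "Number of cars")
```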
5.2 Histogram
A histogram is used for quantitative variables. Which variables are quantitative in mtcars ? Produce a histogram ( hist ) of the weights. Code
Produce a probability histogram of the weights. What is the difference? Why do the 2 histograms have an identical shape here? Code
In this course, we will consider the probability histogram (2nd one) which means that the total area of the histogram is 1. What does the histogram tell us about weights of the cars? To see what customisations for hist are available, use help . Code
Try this histogram of the weights. Code
Experiment with the customisations to see how they work. Try a histogram of the gross horsepower. What do you learn? Produce a histogram of mpg. What do you learn?
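A sketch of the raw and probability histograms in base R. The customisations in the last call (title, label, colour) are illustrative rather than the guide's own, and the area check is added here to make the "total area is 1" statement concrete:

```r
# Raw (frequency) histogram of the weights
h_raw <- hist(mtcars$wt)

# Probability histogram: freq = FALSE puts density on the vertical axis
h_prob <- hist(mtcars$wt, freq = FALSE)

# Total area = sum of (bar height x bar width), which should equal 1
total_area <- sum(h_prob$density * diff(h_prob$breaks))
total_area

# A customised version
hist(mtcars$wt, freq = FALSE, main = "Weights of cars",
     xlab = "Weight (1000 lbs)", col = "lightblue")
```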
5.3 Boxplot
A boxplot is another summary for quantitative variables. Produce a single boxplot for the weights of cars.
Code
Produce a horizontal boxplot. Code
Which orientation do you prefer? Compare to the histogram of weights above: what different features are highlighted by a boxplot? Customise your boxplot. Code
Now consider dividing the weights (quantitative) by cylinders (qualitative). Produce a double boxplot, by filtering or faceting wt by cyl . Code Code
What do you learn about the weights of cars? Car Weight - see page 6 (http://web.mit.edu/sloan-autolab/research/beforeh2/files/MacKenzie%20Zoepf%20Heywood%20Car%20Weight%20Trends%20-%20IJVD.pdf) Try faceting the weights by another qualitative variable.
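The boxplots of this section can be sketched in base R as below. The formula wt ~ cyl is one common way to split a quantitative variable by a qualitative one; the labels are illustrative:

```r
# Single boxplot of the weights, vertical then horizontal
boxplot(mtcars$wt)
boxplot(mtcars$wt, horizontal = TRUE)

# Double boxplot: one box of wt for each value of cyl
b <- boxplot(wt ~ cyl, data = mtcars,
             xlab = "Number of cylinders", ylab = "Weight (1000 lbs)")

# Number of cars in each cylinder group
b$n
```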
5.4 Mosaicplot
A mosaic plot visualises 2 qualitative variables. Code
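A minimal sketch, assuming the two qualitative variables are cyl and gear as in the barplot section (the guide's own choice of variables is not shown here):

```r
# Two-way table of counts, with named dimensions for the axis labels
tab <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))

# Each tile's area is proportional to the count in that cell
mosaicplot(tab, main = "Cylinders by gears")
```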
5.5 Scatterplot
A scatter plot considers the relationship between 2 quantitative variables. What does this plot tell us? Code
Customise your scatterplot. You can change the plotting symbols (http://www.statmethods.net/advgraphs/parameters.html). Code
Add a line of best fit. Code
We’ll explore this further in Section 6. We can compare pairs of multiple quantitative variables using plot or pairs . Code Code Code
Which variables seem to be related?
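A sketch of the scatterplot steps, assuming weight and mpg as the two quantitative variables and mpg, wt and hp for the pairs plot. These choices, and the plotting symbol and colour, are illustrative:

```r
# Basic scatterplot of mpg against weight
plot(mtcars$wt, mtcars$mpg)

# Customised: pch sets the plotting symbol, col the colour
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon", pch = 19, col = "blue")

# Line of best fit: fit a linear model, then draw it on the current plot
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit)

# Scatterplots for every pair of the selected variables
pairs(mtcars[, c("mpg", "wt", "hp")])
```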
5.6 ggplot
All the previous (base R) plots can be done in ggplot , allowing much greater customisation. First, install the ggplot2 package (http://tidyverse.org/) in RStudio. This is a one-off command. Code
Each time you open RStudio, load the ggplot2 package. Code
5.6.1 Barplot
Produce a single barplot. Code Code Code
Produce a double barplot. Code Code
5.6.2 Histogram
Produce a histogram of the weights. Code
Using aes(y=..density..) turns a raw histogram into a probability histogram.
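A hedged sketch of the probability histogram in ggplot2. It assumes ggplot2 (version 3.3.0 or later) is installed and skips the plot otherwise. It uses after_stat(density), the newer spelling of ..density.. in recent ggplot2 versions; the number of bins and the labels are arbitrary choices:

```r
# Only run the plotting code if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # y = after_stat(density) rescales bar heights so the total area is 1
  p <- ggplot(mtcars, aes(x = wt)) +
    geom_histogram(aes(y = after_stat(density)), bins = 10) +
    labs(x = "Weight (1000 lbs)", y = "Density")
  print(p)
}
```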
5.6.3 Boxplot
Produce a single boxplot. Code
Produce a double boxplot. Code
geom_jitter plots the points with a small amount of random noise. We use it to investigate over-plotting in small data sets. Code
What do the following customisations do? Code Code Code
Now consider dividing the weights by another qualitative variable. Code Code
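A sketch of a double boxplot with jittered points in ggplot2, assuming the weights are split by cyl. The package guard, the jitter width and the labels are illustrative choices, not the guide's own code:

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # factor(cyl) tells ggplot to treat cyl as qualitative, giving one box per value
  p_box <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) +
    geom_boxplot() +
    geom_jitter(width = 0.1) +   # overlay the raw points with a little noise
    labs(x = "Number of cylinders", y = "Weight (1000 lbs)")
  print(p_box)
}
```

Swapping cyl for another qualitative variable, such as gear or am, splits the weights a different way.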
5.6.4 Scatterplo...