This is a basic intro to R, and an example of the most common workflow I employ when using R.
R is a free, open-source programming language. Though a programming language in the strictest sense, what sets it apart from other general purpose languages (e.g. Python, C++) is the variety of statistical methods and data-manipulation operations it affords the user, and the ease with which these can be used as encapsulated in simple functions. This means it may be more appropriate to think of R as a language-based environment for statistical analyses.
R is available for download from “The Comprehensive R Archive Network” (CRAN): https://cran.r-project.org/. Simply download the version for your platform and install it.
This R download comes by default with its own user interface. You can use it if you wish, but I recommend RStudio instead as a means to interact with R.
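RStudio is a more user-friendly interface. It consists of four “panels”: the source editor (where you write and save scripts), the console (where code is executed), the environment/history panel, and the files/plots/packages/help panel.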
The general idea is that you execute R-code either by typing directly into the console, or “running” code from the source using the “Run” button. Pressing “Run” while your cursor is on line 1 of the source editor sends line 1 to the console for execution.
The basic arithmetic operators are all fairly intuitive:
1 + 1
## [1] 2
4 - 1
## [1] 3
12 * 3
## [1] 36
12 / 3
## [1] 4
4 ^ 2
## [1] 16
We can also compute logarithms, roots, etc.
# Natural log
log(4)
## [1] 1.386294
# Log base 10 or base 2
log10(4)
## [1] 0.60206
log2(4)
## [1] 2
# Log with an arbitrary base
logb(4, base = 3)
## [1] 1.26186
# Square-root
sqrt(4)
## [1] 2
# Or
4 ^ (1/2)
## [1] 2
8 ^ (1/3)
## [1] 2
Comments (i.e. unexecuted text) in R are preceded by the # symbol, by the way.
In R, everything that “exists” in a session is an object. We assign values (or components of other objects) into variables (which are objects!) using a little arrow (<-) rather than an equal sign.
x <- 2
When we execute an object-call alone…
x
## [1] 2
… it displays the contents of that object.
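As a small sketch of assignment in practice (the names x_squared and z are made up for illustration): an object can be built from another object, and although = also assigns at the top level, <- is the convention followed throughout this tutorial.

```r
# Assign a value, then build a new object from it
x <- 2
x_squared <- x ^ 2
x_squared

# The equal sign also assigns at the top level,
# but the arrow is the usual convention in R code
z = 5
z
```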
Now, you may notice the little [1] to the left of the output. This is because almost all objects in R are “vectors”. Even if an object only has one component, it is a vector of length 1. When we have multiple items (grouped together by concatenation using c())…
y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
y
## [1] 2 4 6 8 10 12 14 16 18 20
…it displays them all. Should the output run onto multiple lines, the next line will have the index (e.g. [63]) of the first item on the new line, like so
y <- seq(1, 99, by = 2)
y
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
We use what is essentially this notation if we wish to access a specific position’s value…
y[14]
## [1] 27
…or set of positions’ values…
y[c(1, 18, 23)]
## [1] 1 35 45
…or range of positions’ values.
y[10:20]
## [1] 19 21 23 25 27 29 31 33 35 37 39
Vectors in R are composed of items of a single type (e.g. numerics, characters, logicals, etc.). Should you want to mix types, you must construct a “list”, or else R will “coerce” all the items in the vector to the same type (usually character).
my_good_vector <- 1:10
typeof(my_good_vector)
## [1] "integer"
my_bad_vector <- c(1, 2, 3, 4, "Carex sp.", TRUE, FALSE)
typeof(my_bad_vector)
## [1] "character"
my_list <- list(1, "a", TRUE, "Carex sp.")
typeof(my_list)
## [1] "list"
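To make the coercion rule a little more concrete, here is a minimal sketch. It assumes nothing beyond base R. The double square brackets used at the end to pull one item out of a list are not covered elsewhere in this tutorial, so treat them as an aside.

```r
# Mixing numbers and logicals: the logicals are coerced to numbers
typeof(c(1, 2, TRUE))

# Mixing anything with text: everything is coerced to character
typeof(c(1, TRUE, "a"))

# A list keeps each item's own type
my_list <- list(1, "a", TRUE, "Carex sp.")
sapply(my_list, typeof)

# Double square brackets pull a single item out of a list
my_list[[4]]
```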
Data-frames are basically tables, and each column can have a different data type.
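Values in any column can even be NA, a special placeholder for missing data.
# One way to build the data-frame printed below (an assumption that matches the output)
data <- data.frame(x = 1:10, y = seq(1, 19, by = 2), something = rep(c("a", "b"), times = 5))
data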
## x y something
## 1 1 1 a
## 2 2 3 b
## 3 3 5 a
## 4 4 7 b
## 5 5 9 a
## 6 6 11 b
## 7 7 13 a
## 8 8 15 b
## 9 9 17 a
## 10 10 19 b
We extract (i.e. get values from) a data-frame with the following syntax: data[row_index, column_index]
# The 1st row of the 1st column (i.e. 1 cell)
data[1, 1]
## [1] 1
# The 6th row of the 2nd column
data[6, 2]
## [1] 11
# The 1st and 5th rows of the 1st column
data[c(1, 5), 1]
## [1] 1 5
# The 3rd row of every column
data[3, ]
## x y something
## 3 3 5 a
# Every row of the 1st column
data[, 1]
## [1] 1 2 3 4 5 6 7 8 9 10
# The 3rd to 8th rows of columns 1 and 2
data[3:8, c(1, 2)]
## x y
## 3 3 5
## 4 4 7
## 5 5 9
## 6 6 11
## 7 7 13
## 8 8 15
We can also call on columns by name using the “extract dollar sign” $, but then we only specify the rows numerically
data$x[1]
## [1] 1
data$y[2:7]
## [1] 3 5 7 9 11 13
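Returning to the NA placeholder mentioned earlier, the sketch below shows how a missing value behaves inside a data-frame column. The data-frame df_with_na and its values are made up purely for illustration.

```r
# A small, made-up data-frame with one missing value in column y
df_with_na <- data.frame(
  x = 1:4,
  y = c(1, NA, 5, 7),
  something = c("a", "b", "a", "b")
)

# Which values of y are missing?
is.na(df_with_na$y)

# Many summary functions return NA when any value is missing...
mean(df_with_na$y)

# ...unless we ask them to remove the missing values first
mean(df_with_na$y, na.rm = TRUE)
```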
As I said above, everything that exists in R is an object. This includes functions. Also: basically everything that “happens” in R is a function.
We’ve been using them a lot above (e.g. typeof() to get an object’s type, log() for logarithms, etc.).
Some other useful functions I encourage you to explore:
length()
seq()
mean(), max(), median(), etc.
data.frame()
na.exclude()
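A quick way to get a feel for these functions is to try them on small vectors. The vectors v and w below are made-up examples, not part of the analysis that follows.

```r
# Build a sequence: 2, 4, ..., 20
v <- seq(from = 2, to = 20, by = 2)

length(v)   # how many items
mean(v)
max(v)
median(v)

# na.exclude() drops missing values (and keeps a note of where they were)
w <- c(3, NA, 7)
na.exclude(w)
```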
So, let’s take an example. Let’s say you have some data (hopefully in a CSV and not an Excel spreadsheet), and want to make some scatterplots and fit a few linear regressions. Let’s say you also want to compare model-fits between some regression models, or plot your model’s predicted curves. You’d probably also want to save your figures.
Let’s do that! Before we get started, I recommend doing all these things in one folder (from here on called my-analysis/). That means your data is in there, and this is where we will store our script, and where we will save the outputted figures.
Here we tell R which folder to regard as the main folder for this session, relative to which we will refer to all files.
First, ask R where it is currently looking (i.e. by “getting” the working directory)
getwd()
## [1] "/Users/ruanvanmazijk/R-tut-for-Muasya-lab"
This is where I am at the moment (my default), but I want to be in my-analysis/. So, I need to “set” the working directory, relative to the current one
setwd("my-analysis")
getwd()
## [1] "/Users/ruanvanmazijk/R-tut-for-Muasya-lab/my-analysis"
If I wanted to move to a directory completely outside of the default, I could use a complete path instead, e.g. setwd("/Users/ruanvanmazijk/MSc-genome-ecophys/analyses").
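If you want to try getwd() and setwd() without touching your real folders, a sketch like the following may help. It assumes a throwaway my-analysis folder created inside R's temporary directory; project_dir and old_wd are names invented for this example.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# A practice folder, created in a temporary location for this sketch
project_dir <- file.path(tempdir(), "my-analysis")
dir.create(project_dir, showWarnings = FALSE)

# Move into the folder and confirm where we are
setwd(project_dir)
basename(getwd())

# Return to the original working directory
setwd(old_wd)
```

Building paths with file.path() rather than pasting strings together avoids worrying about which slash your operating system uses.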
Here, we read the CSV into the R-session, and assign it into an object so that we can access it in memory later in our session
iris <- read.csv("iris.csv")
This data-set comes preloaded in R, but I thought it useful to use it to show how we would import it were it in CSV format. It contains morphological data for various Iris spp. Let’s look at our data now.
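Because the data-set is built in, iris.csv may not exist on your machine yet. One way to create it and read it back is sketched below. The file is written to a temporary folder here, and the names csv_path and iris_from_csv are made up for illustration.

```r
# Write R's built-in iris data to a CSV file (in a temporary folder here)
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(datasets::iris, csv_path, row.names = FALSE)

# Read it back in, just as in the read.csv() call above
iris_from_csv <- read.csv(csv_path)

# Check the number of rows and columns, and the column names
dim(iris_from_csv)
names(iris_from_csv)
```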
The iris data-frame is quite big, and we don’t want to overwhelm our console with hundreds of lines of numbers. Instead, let’s just look at the first 6 rows (head()). We might also want to know what “types” of data R has interpreted to be in each column of our data-frame (summary())
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Note that the summary() output displays a five-number summary (plus the mean) for numeric data, and counts of each species, as species has been interpreted as a category (a “factor” in R-lingo).
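One caveat worth knowing: since R 4.0.0, read.csv() reads text columns as plain character vectors by default (stringsAsFactors = FALSE), so your own summary() output may show only a length and class for such a column. The sketch below, using a small made-up data-frame called plants, shows how converting the column to a factor brings back the per-category counts.

```r
# A small, made-up data-frame with a text column
plants <- data.frame(
  height = c(10.2, 8.5, 12.1, 9.9),
  species = c("setosa", "virginica", "setosa", "versicolor"),
  stringsAsFactors = FALSE
)

# As text, summary() reports only the length and class
summary(plants$species)

# Convert to a factor to get counts per category
plants$species <- as.factor(plants$species)
summary(plants$species)
levels(plants$species)
```

The same effect can be had at import time by passing stringsAsFactors = TRUE to read.csv().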
I am interested in the allometry between petal length and width, for example. To plot these variables against each other, do as follows
# Syntax: plot(formula = y ~ x, data)
plot(formula = Petal.Length ~ Petal.Width, data = iris)
You can also use the slightly different syntax plot(formula = iris$Petal.Length ~ iris$Petal.Width), which lacks a data argument. You can also specify x and y directly (though note that now x is first)
# Syntax: plot(x, y)
plot(x = iris$Petal.Width, y = iris$Petal.Length)
Now I want to fit a linear regression, here of sepal length against petal length. The syntax is very similar to that for plotting!
my_model <- lm(formula = iris$Sepal.Length ~ iris$Petal.Length)
# Or
my_model <- lm(formula = Sepal.Length ~ Petal.Length, data = iris)
The raw model object is kind of uninformative and hard to read
my_model
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##
## Coefficients:
## (Intercept) Petal.Length
## 4.3066 0.4089
So, we employ one of the most useful functions in R: summary(). This function can be called on multiple object types (like we used for data-frames earlier). But on model objects it is your bread-and-butter
summary(my_model)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24675 -0.29657 -0.01515 0.27676 1.00269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
## Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
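The printed summary is for reading, but the numbers in it can also be pulled out for further use. The sketch below refits the same model and then, as one possible way of comparing model-fits, sets it against a second candidate using AIC(). The second model (my_model_2, with petal width added) is purely a hypothetical example.

```r
# Refit the model from above
my_model <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Intercept and slope as a named vector
coef(my_model)

# The full coefficient table from the summary, as a matrix
coef(summary(my_model))

# R-squared on its own
summary(my_model)$r.squared

# A second, hypothetical candidate model to compare against
my_model_2 <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)

# Lower AIC suggests a better trade-off between fit and complexity
AIC(my_model, my_model_2)
```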
Here I employ another one of R’s strengths: community packages. The R community has made over 6000 packages, free for anyone to use. They are extra add-ons to R (I explain more below). The function visreg() (from the package also called visreg) takes a model object as an input and makes very nice plots of the data and fit for that model. Observe:
# Install the package
install.packages("visreg", dependencies = TRUE)
# Load it into R's brain
library(visreg)
# Use the function "visreg" from the package "visreg"
visreg::visreg(my_model)
Note, the package::function syntax is optional. I could simply call on visreg() directly, as the package visreg has been loaded. But sometimes you don’t want to load a whole package into memory (e.g. on a very computationally intensive project), so we use :: to steal one function only from a package, use it, and put it back.
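The :: syntax works for any installed package, including those that ship with R. The sketch below demonstrates it with stats::median(), and then guards the visreg call with requireNamespace() so that the script still runs on a machine where visreg has not been installed. The guard is my own addition, not something visreg requires.

```r
# Use one function from a package without attaching the package
stats::median(c(1, 3, 10))

# Only call visreg if it is actually installed
if (requireNamespace("visreg", quietly = TRUE)) {
  my_model <- lm(Sepal.Length ~ Petal.Length, data = iris)
  visreg::visreg(my_model)
} else {
  message("visreg is not installed; run install.packages(\"visreg\") first")
}
```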
The simplest and easiest way to save a figure is to use the “Export” button in the plots pane in RStudio.
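If you would rather save figures from your script (so that they are regenerated every time the script runs), one common approach is to open a file-based graphics device, plot, and then close the device. The sketch below uses pdf(). The file name petal-allometry.pdf and the temporary folder are assumptions for illustration, and in a real project you would likely save into my-analysis/ instead.

```r
# Open a PDF file as the graphics device (written to a temporary folder here)
fig_path <- file.path(tempdir(), "petal-allometry.pdf")
pdf(fig_path, width = 6, height = 5)  # size in inches

# Everything plotted now goes into the file instead of the plots pane
plot(Petal.Length ~ Petal.Width, data = iris)

# Close the device to finish writing the file
dev.off()

# Confirm that the file was written
file.exists(fig_path)
```

Other devices such as png() follow the same open, plot, close pattern.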
R packages represent bundles of functions and methods for specific computations. This is similar to the idea of modules from other languages.
R has several “base” packages, each responsible for different aspects of day-to-day work in R. For example, “base” (the name of one of the base packages itself) handles basic programming things and arithmetic, “graphics” is what R uses to make plots, and “stats” is where statistical methods are deployed from.
The reason I emphasise this is that community-made, open-source packages work in the same way: someone has likely encountered the problem you’re having in your work, and written functions and code that solve it neatly and reliably. So take advantage of that and Google around for packages that suit your needs.
Some I use very often include
ggplot2 for more advanced figures
tidyr and dplyr for data-frame manipulation and reshaping
raster and sp for adding GIS functionality to R
stringr for advanced text manipulation and character matching (very good REGEX functionality)
Some more esoteric/special-use packages include
dismo for making species distribution models
taxize for querying taxonomic databases with your species names, and for dealing with taxonomic hierarchies as objects in R
pacman for automated R package installation
rmarkdown for document preparation that relies on R input (how I made this tutorial!)
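Before installing anything, it can be handy to check which packages you already have. The sketch below uses only base R, and the vector wanted is just an example selection from the list above. The install step is left commented out because it needs an internet connection.

```r
# Packages we would like to have available
wanted <- c("ggplot2", "tidyr", "dplyr", "stringr")

# Which of these are already installed?
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
is_installed

# The ones still missing
missing_pkgs <- wanted[!is_installed]
missing_pkgs

# Install only what is missing (uncomment to run)
# install.packages(missing_pkgs, dependencies = TRUE)
```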