This is a basic intro to R, and an example of the most common workflow I employ when using R.

1 What is R?

R is a free, open-source programming language. Though a programming language in the strictest sense, what sets it apart from general-purpose languages (e.g. Python, C++) is the variety of statistical methods and data-manipulation operations it offers, and the ease with which these can be used through simple functions. It may therefore be more appropriate to think of R as a language-based environment for statistical analyses.

2 Setting up R for yourself

R is available for download from “The Comprehensive R Archive Network” (CRAN): https://cran.r-project.org/. Simply download the version for your platform and install it.

This R download comes by default with its own user interface. You can use it if you wish, but I recommend RStudio instead as a means to interact with R.

RStudio is a more user-friendly interface, made up of four “panels”:

  1. Top-left: the “Source”, or script-editor, where we will write and modify our saved R-scripts
  2. Bottom-left: the “Console”, where R code is run directly—i.e. sent to R’s “brain”, and where text output from our code is displayed
  3. Top-right: the “Environment”, where we can see what objects and values our current R-session has in memory, and “History”, where we see a record of recently executed R-code
  4. Bottom-right: the “Files”, “Plots”, “Packages”, and “Help” tabs, discussed later

The general idea is that you execute R-code either by typing directly into the console, or “running” code from the source using the “Run” button. Pressing “Run” while your cursor is on line 1 of the source editor sends line 1 to the console for execution.

3 R language basics

3.1 Mathematical operations

These are all fairly intuitive:

1 + 1
## [1] 2
4 - 1
## [1] 3
12 * 3
## [1] 36
12 / 3
## [1] 4
4 ^ 2
## [1] 16

We can also compute logarithms, roots, etc.

# Natural log
log(4)
## [1] 1.386294
# Log base 10 or base 2
log10(4)
## [1] 0.60206
log2(4)
## [1] 2
# Log with an arbitrary base
logb(4, base = 3)
## [1] 1.26186
# Square-root
sqrt(4)
## [1] 2
# Or
4 ^ (1/2)
## [1] 2
8 ^ (1/3)
## [1] 2

Comments (i.e. unexecuted text) in R are preceded by the # symbol, by the way.

3.2 Objects

3.2.1 Vectors & lists

In R, everything that “exists” in a session is an object. We assign values (or components of other objects) into variables (which are objects!) using a little arrow (<-). An equal sign (=) also works for assignment, but the arrow is the convention.

x <- 2

When we execute an object-call alone…

x
## [1] 2

… it displays the contents of that object.

Now, you may notice the little [1] to the left of the output. This is because almost all objects in R are “vectors”. Even if an object only has one component, it is a vector of length 1. When we have multiple items (grouped together by concatenation using c())…

y <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
y
##  [1]  2  4  6  8 10 12 14 16 18 20

…it displays them all. Should the output run onto multiple lines, each new line starts with the index (e.g. [24]) of its first item. Here y is redefined as the odd numbers from 1 to 99, like so

y <- seq(1, 99, by = 2)
y
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99

We use what is essentially this notation if we wish to access a specific position’s value…

y[14]
## [1] 27

…or set of positions’ values…

y[c(1, 18, 23)]
## [1]  1 35 45

…or range of positions’ values.

y[10:20]
##  [1] 19 21 23 25 27 29 31 33 35 37 39
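
Positions are not the only way to pick items out of a vector. As a small sketch (the names big_values and last_five are only illustrative), a logical condition or a negative index can go inside the square brackets too:

```r
# Same vector as above: the odd numbers from 1 to 99
y <- seq(1, 99, by = 2)

# Logical indexing: keep only the values greater than 90
big_values <- y[y > 90]
big_values
## [1] 91 93 95 97 99

# Negative indexing: drop the first 45 positions and keep the rest
last_five <- y[-(1:45)]
last_five
## [1] 91 93 95 97 99
```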

Vectors in R are composed of items of a single type (e.g. numerics, characters, logicals, etc.). Should you want to mix types, you must construct a “list”, or else R will “coerce” all the items in the vector to the same type (usually character).

my_good_vector <- 1:10
typeof(my_good_vector)
## [1] "integer"
my_bad_vector <- c(1, 2, 3, 4, "Carex sp.", TRUE, FALSE)
typeof(my_bad_vector)
## [1] "character"
my_list <- list(1, "a", TRUE, "Carex sp.")
typeof(my_list)
## [1] "list"
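
To see that a list really does keep each item's own type, here is a short sketch using the same my_list as above (item_types is just an illustrative name). Note the difference between single brackets, which return a smaller list, and double brackets, which return the item itself:

```r
my_list <- list(1, "a", TRUE, "Carex sp.")

# Single brackets return a (shorter) list...
typeof(my_list[2])
## [1] "list"

# ...double brackets return the item itself, with its own type
typeof(my_list[[2]])
## [1] "character"
typeof(my_list[[3]])
## [1] "logical"

# sapply() applies typeof() to every item in the list
item_types <- sapply(my_list, typeof)
item_types
## [1] "double"    "character" "logical"   "character"
```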

3.2.2 Data-frames

Data-frames are basically tables, and each column can have a different data type.

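Values in any column can even be NA, a special placeholder for missing values.

# One way to build the example data-frame shown below
data <- data.frame(
  x = 1:10,
  y = seq(1, 19, by = 2),
  something = rep(c("a", "b"), times = 5)
)
data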
##     x  y something
## 1   1  1         a
## 2   2  3         b
## 3   3  5         a
## 4   4  7         b
## 5   5  9         a
## 6   6 11         b
## 7   7 13         a
## 8   8 15         b
## 9   9 17         a
## 10 10 19         b

We extract (i.e. get values from) a data-frame with the following syntax: data[row_index, column_index]

# The 1st row of the 1st column (i.e. 1 cell)
data[1, 1]
## [1] 1
# The 6th row of the 2nd column
data[6, 2]
## [1] 11
# The 1st and 5th rows of the 1st column
data[c(1, 5), 1]
## [1] 1 5
# The 3rd row of every column
data[3, ]
##   x y something
## 3 3 5         a
# Every row of the 1st column
data[, 1]
##  [1]  1  2  3  4  5  6  7  8  9 10
# The 3rd to 8th rows of columns 1 and 2
data[3:8, c(1, 2)]
##   x  y
## 3 3  5
## 4 4  7
## 5 5  9
## 6 6 11
## 7 7 13
## 8 8 15

We can also call on columns by names using the “extract dollar sign” $, but then we only specify the rows numerically

data$x[1]
## [1] 1
data$y[2:7]
## [1]  3  5  7  9 11 13
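
Column names and logical conditions also work inside the square brackets. The sketch below rebuilds the example table so that it runs on its own; the construction of data and the name rows_a are my assumptions for illustration:

```r
# Rebuild the example data-frame used in this section
data <- data.frame(
  x = 1:10,
  y = seq(1, 19, by = 2),
  something = rep(c("a", "b"), times = 5)
)

# Columns can be named inside the square brackets, too
data[1:3, "y"]
## [1] 1 3 5

# A logical condition in the row position keeps only the matching rows
rows_a <- data[data$something == "a", ]
nrow(rows_a)
## [1] 5
rows_a$x
## [1] 1 3 5 7 9
```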

3.3 Functions

As I said above, everything that exists in R is an object. This includes functions. Also: basically everything that “happens” in R is a function.

We’ve been using them a lot above (e.g. typeof() to get an object’s type, log() for logarithms, etc.).

Some other useful functions I encourage you to explore:

  • length()
  • seq()
  • mean(), max(), median(), etc.
  • data.frame()
  • na.exclude()
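
As a starting point for exploring these, here is a minimal sketch on a made-up vector (values is purely illustrative) that contains one NA:

```r
values <- c(4, 8, NA, 15, 16, 23, 42)

length(values)                  # number of items, NA included
## [1] 7
seq(from = 0, to = 1, by = 0.25)
## [1] 0.00 0.25 0.50 0.75 1.00
mean(values)                    # an NA makes the result NA...
## [1] NA
mean(values, na.rm = TRUE)      # ...unless we ask for it to be removed
## [1] 18
max(values, na.rm = TRUE)
## [1] 42
median(values, na.rm = TRUE)
## [1] 15.5
length(na.exclude(values))      # na.exclude() drops the NA items
## [1] 6
```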

4 Example R workflow

So, let’s take an example. Let’s say you have some data (hopefully in a CSV and not an Excel spreadsheet), and want to make some scatterplots and fit a few linear regressions. Let’s say you also want to compare model-fits between some regression models, or plot your model’s predicted curves. You’d probably also want to save your figures.

Let’s do that! Before we get started, I recommend doing all these things in one folder (from here on called my-analysis/). That means your data is in there, and this is where we will store our script, and where we will save the output figures.

4.1 Setting the working directory: telling R where we are working

Here we tell R which folder to regard as the main folder for this session, relative to which we will refer to all files.

First, ask R where it is currently looking (i.e. by “getting” the working directory)

getwd()
## [1] "/Users/ruanvanmazijk/R-tut-for-Muasya-lab"

This is where I am at the moment (my default), but I want to be in my-analysis/. So, I need to “set” the working directory, relative to the current one

setwd("my-analysis")
getwd()
## [1] "/Users/ruanvanmazijk/R-tut-for-Muasya-lab/my-analysis"

If I wanted to move to a directory completely outside of the default, I could use a complete path instead, e.g. setwd("/Users/ruanvanmazijk/MSc-genome-ecophys/analyses").
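
If you want to try this without touching your own folders, the sketch below does the same thing inside R's temporary directory. The use of tempdir() and the names old_wd and analysis_dir are my assumptions for illustration, not part of the workflow above:

```r
# Remember where we started; setwd() invisibly returns the old directory
old_wd <- setwd(tempdir())

# file.path() builds paths with the right separator for your platform
analysis_dir <- file.path(getwd(), "my-analysis")
dir.create(analysis_dir, showWarnings = FALSE)

setwd(analysis_dir)
basename(getwd())
## [1] "my-analysis"

# Go back to where we started
setwd(old_wd)
```

Storing the old directory makes it easy to return to it at the end of a script.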

4.2 Importing your data

Here, we read the CSV into the R-session, and assign it into an object so that we can access it in memory later in our session

iris <- read.csv("iris.csv")

This data-set comes preloaded in R, but I thought it useful to use it to show how we would import it were it in CSV format. It contains morphological data for various Iris spp. Let’s look at our data now.

4.3 Exploring your data

The iris data-frame is quite big, and we don’t want to overwhelm our console with hundreds of lines of numbers. Instead, let’s just look at the first 6 rows (head()). We might also want to know what “types” of data R has interpreted to be in each column of our data-frame (summary())

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Note that the summary() output displays a five-number summary (plus the mean) for numeric data, and counts of each species, as Species has been interpreted as a category (a “factor” in R-lingo).
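
One caveat: from R 4.0.0 onwards, read.csv() leaves text columns as plain character by default, so Species would only show up as a factor if you ask for it. A minimal sketch, assuming we first write the built-in iris data to a temporary CSV (csv_path and iris_from_csv are illustrative names):

```r
# Write R's built-in iris data to a temporary CSV so there is a file to read
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(datasets::iris, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE asks for text columns to become factors
iris_from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)

class(iris_from_csv$Species)
## [1] "factor"
levels(iris_from_csv$Species)
## [1] "setosa"     "versicolor" "virginica"
dim(iris_from_csv)
## [1] 150   5
```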

4.4 Exploratory scatter plots

I am interested in the allometry between petal length and width, for example. To plot these variables against each other, do as follows

# Syntax: plot(formula = y ~ x, data)
plot(formula = Petal.Length ~ Petal.Width, data = iris)

You can also use the slightly different syntax plot(formula = iris$Petal.Length ~ iris$Petal.Width), which lacks a data argument. You can also specify x and y directly (though note that now x is first, and there is no data argument)

# Syntax: plot(x, y)
plot(x = iris$Petal.Width, y = iris$Petal.Length)

4.5 Models!

4.5.1 Fitting models

I want to test a relationship like the one we plotted above, here between sepal length and petal length. The syntax is very similar!

my_model <- lm(formula = iris$Sepal.Length ~ iris$Petal.Length)
# Or
my_model <- lm(formula = Sepal.Length ~ Petal.Length, data = iris)

4.5.2 Model inspection and interpretation

The raw model object is kind of uninformative and hard to read

my_model
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##       4.3066        0.4089

So, we employ one of the most useful functions in R: summary(). This function can be called on multiple object types (like we used for data-frames earlier). But on model objects it is your bread-and-butter

summary(my_model)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
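
The numbers in this printout can also be pulled out as values, which is handy if you want to report or reuse them. A small sketch, refitting the same model so that it runs on its own (model_summary is an illustrative name):

```r
my_model <- lm(formula = Sepal.Length ~ Petal.Length, data = iris)
model_summary <- summary(my_model)

# Named vector of the fitted coefficients
round(coef(my_model), 4)
##  (Intercept) Petal.Length 
##       4.3066       0.4089 

# R-squared, as reported near the bottom of the summary
round(model_summary$r.squared, 2)
## [1] 0.76

# The coefficient table is a matrix; index it by row and column name
round(model_summary$coefficients["Petal.Length", "Std. Error"], 5)
## [1] 0.01889
```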

4.6 Plotting model fits

Here I employ another one of R’s strengths: community packages. The R community has made over 6000 packages, free for anyone to use. They are extra add-ons to R (I explain more below). The function visreg() (from the package also called visreg) takes a model object as an input and makes very nice plots of the data and fit for that model. Observe:

# Install the package
install.packages("visreg", dependencies = TRUE)
# Load it into R's brain
library(visreg)
# Use the function "visreg" from the package "visreg"
visreg::visreg(my_model)

Note, the package::function syntax is optional. I could simply call on visreg() directly, as the package visreg has been loaded. But sometimes you don’t want to load a whole package into memory (e.g. on a very computationally intensive project), so we use :: to borrow one function only from a package, use it, and put it back.
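
The :: syntax works for any installed package, including the ones that ship with R. The sketch below shows it with stats, and then guards the visreg call with requireNamespace(), in case visreg is not installed on your machine (has_visreg is an illustrative name):

```r
# stats is installed with R, so this works without any library() call
stats::median(c(1, 3, 10))
## [1] 3

# requireNamespace() reports whether a package is installed,
# without attaching it to the session
has_visreg <- requireNamespace("visreg", quietly = TRUE)

my_model <- lm(formula = Sepal.Length ~ Petal.Length, data = iris)
if (has_visreg) {
  visreg::visreg(my_model)
} else {
  message("visreg is not installed; run install.packages(\"visreg\") first")
}
```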

4.7 Saving figures and outputs

The simplest way is to use the “Export” button in the plots pane in RStudio.
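
If you would rather save figures from your script, so that the step is repeatable, base R can write plots straight to a file. A minimal sketch, assuming a PNG is wanted; the file name and the use of tempdir() are hypothetical, and in practice you would save into my-analysis/:

```r
# Hypothetical output file; tempdir() is used only so the sketch runs anywhere
figure_path <- file.path(tempdir(), "petal-allometry.png")

# 1. Open a PNG "device": plots now go to the file, not the plots pane
png(filename = figure_path, width = 800, height = 600)

# 2. Draw the plot as usual
plot(Petal.Length ~ Petal.Width, data = iris)

# 3. Close the device, which finishes writing the file
dev.off()

file.exists(figure_path)
## [1] TRUE
```

The same three steps work with pdf() in place of png() if you need a vector figure.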

5 R packages

R packages represent bundles of functions and methods for specific computations. This is similar to the idea of modules from other languages.

R has several “base” packages, each responsible for different aspects of day-to-day work in R. For example, “base” (the name of one of the base packages itself) handles basic programming things and arithmetic, “graphics” is what R uses to make plots, and “stats” is where statistical methods are deployed from.

I emphasise this because community-made, open-source packages work in the same way: someone has likely encountered the problem you’re having in your work, and written functions and code that solve it neatly and reliably. So take advantage of that and Google around for packages that suit your needs.

Some I use very often include

  • ggplot2 for more advanced figures
  • tidyr and dplyr for data-frame manipulation and reshaping
  • raster and sp for adding GIS functionality to R
  • stringr for advanced text manipulation and character matching (very good REGEX functionality)

Some more esoteric/special-use packages include

  • dismo for making species distribution models
  • taxize for querying taxonomic databases with your species names, and for dealing with taxonomic hierarchies as objects in R
  • pacman for automated R package installation
  • rmarkdown for document preparation that relies on R input (how I made this tutorial!)
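
Before installing any of these, it can help to check which ones you already have. The helper below, missing_packages(), is a hypothetical function written for this sketch (it is not part of R or of pacman); it only reports what is missing and leaves the actual installation commented out:

```r
# Hypothetical helper: report which of the requested packages are not installed
missing_packages <- function(packages) {
  installed <- rownames(installed.packages())
  setdiff(packages, installed)
}

wanted <- c("ggplot2", "dplyr", "tidyr", "stringr")
to_install <- missing_packages(wanted)

# Install only what is missing (needs an internet connection)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```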