1 Introduction

Despite the countless blog posts, textbooks, and online courses focused on R that litter the internet, this document attempts to provide an overview of how R is typically used in paleolimnological analyses at PEARL.

The purpose of the guide is to aid both new and experienced users refine their many inevitable internet searches prefaced with “r help …”; as even with some familiarity with a given statistical technique, it may be unclear how to perform it using R.

The guide begins with some basic information for users new to R; however, the majority of content is focused on common analyses used in paleolimnological studies and the specific packages and functions that can facilitate them. The guide was created using R Markdown, specifically to allow easy updates and a gradual evolution over time. Therefore, please share any suggestions you may have to improve future versions.

2 Fundamental Concepts

2.1 What is R?

R is an open-source programming language and software environment for statistical computing and graphics. R is freely available for Linux, macOS, and Windows and the current version can be obtained directly from the project website: https://www.r-project.org/.

2.2 R vs. RStudio

As R is a command line program, many people choose to control it through a graphical interface. There are several different graphical interfaces or integrated development environments (IDE) to choose from, but the one most commonly used at PEARL is RStudio. RStudio is an open-source IDE for R, freely available for Linux, macOS, and Windows, and its current version can be obtained directly from its project website: https://www.rstudio.com/.

RStudio is not required for anything described in this guide, it simply provides multiple panes, menus, and buttons to make some aspects of R more accessible. However, for simplicity, the guide assumes RStudio is installed, and that if you choose to either use R on its own or with a different IDE, that you are also capable of translating the described operations to your particular setup.

2.3 Objects, Classes, and Functions

Objects are the fundamental elements of the R programming language. Within R, an object is simply a data structure with some attributes. Therefore, essentially everything that you work with during an R session will be an object of some kind.

Most objects have a class attribute that provides some definition regarding the type of data it contains. Classes include things like ‘character’, ‘numeric’, ‘logical’, ‘data.frame’ and ‘function’.

Functions are the principal tools used in R. Functions are objects that can take arguments as input, and return a new object as output. Some are quite simple (e.g. mean returns the mean value of a numeric object), whereas others are much more complex (e.g. those used to perform ordinations). Many different functions are available within a base installation of R; however, as you become familiar with R, it becomes possible to write functions of your own, that can be shared via scripts and packages.

2.4 Packages

All R functions and datasets are stored in packages, including everything contained within a base install of R. However, this base functionality can be extended through external packages that contain additional functions (along with relevant documentation). It may be helpful to view this relationship as packages are the ‘apps’ available for the R ‘operating system’. The main source for external packages is the Comprehensive R Archive Network (CRAN) repository (the ‘app store’ in the OS analogy).

It is also possible to install external packages from a local archive file (e.g. .zip or .tar.gz), and this is how the jezioro package (that contains several functions and datasets created at PEARL) is currently distributed. The latest version of jezioro is always available on the PEARL website.

R is able to access the CRAN repository directly, and external packages are typically installed either through the console, or using an IDE such as RStudio.

2.4.1 Installing a package using the console

To install the package rmarkdown using the console, you would enter the following command into the lower left pane of RStudio (the ‘Console’ pane):

install.packages("rmarkdown")

However, the functions contained within that package do not become available until it is loaded into memory with the library command.

library("rmarkdown")

2.4.2 Installing a package using RStudio

To install a package using RStudio:

  • click on the ‘Packages’ tab of the lower right pane of RStudio (the ‘Viewer’ pane)
  • click the ‘Install’ button
  • at the prompt, type the name of the package you wish to install (e.g. rmarkdown)
  • finally, click the ‘Install’ button and the package will be downloaded and installed

2.4.3 Useful packages in the CRAN repository

The jezioro package contains several functions and data sets written at PEARL and is distributed as an archive file; however many other packages useful to the analyses performed at PEARL are available from the CRAN repository. These include:

  • analogue - Gavin Simpson’s stat package needed for some WA techniques
  • dplyr - A set of tools for working with data frame like objects
  • ggplot2 - A system for ‘declaratively’ creating graphics, based on the book “The Grammar of Graphics”
  • magrittr - Provides a mechanism to chain commands with a forward-pipe operator, %>%
  • rioja - Functions for the analysis of Quaternary science data (the successor to C2)
  • rLakeAnalyzer - Physical limnology package
  • rmarkdown - A package for converting R Markdown documents into a variety of formats
  • roxygen2 - A package to simplify documenting your own packages
  • vegan - Ordination methods, diversity analysis and other functions for community and vegetation ecologists

2.5 Help Documentation

All objects within R should have documentation accessible with the help function. To bring up the documentation for a function, package, or dataset, enter ‘help(“object of interest”)’ into the console and the help file should appear in the ‘Viewer’ pane of RStudio (on the bottom right). For example, to bring up the documentation for the help function itself, enter:

help("help")

The ? is a shortcut for the help function, so entering the following will also bring up the documentation for the help function.

?help

The help documentation for a package (which should contain a list of all of its functions) is accessed in the same way. So to bring up the help file for the rmarkdown package installed earlier, use:

?rmarkdown

However, this will generate an error if the package is not loaded into memory. To search for help documentation on objects that are installed, but not currently loaded, use a double question mark ??.

??rmarkdown

3 Basic Usage

3.1 The Console

After installing both R and RStudio, and then starting RStudio the screen will be split into four panes as shown in the official RStudio Cheat Sheet.

By default these panes are:

  • Upper left, the ‘Source’ pane (where files and data are displayed)
  • Upper right, the ‘Environment’ pane (which displays objects and packages currently memory and a history of commands)
  • Lower right, the ‘Viewer’ pane (which contains several tabs including a file manager, one for plots, a package browser, and one for help files)
  • Lower left, the ‘Console’ pane (where commands are entered)

Typing directly into the console is the simplest way to use R. For example, entering

2+2

will return the output,

## [1] 4

Where the [1] indicates that the first element of the result (in this case, there is only one element) has a value of 4.

As you start performing more complex operations, you will find that it is easier to enter your input in the ‘Source’ pane, where it can saved to a file for later use, rather than retyping everything each time. These R script files typically have an .R extension.

Individual lines can be sent from the ‘Source’ pane to the console, by moving the cursor to the desired line and either hitting the ‘Run’ button, or using the keyboard shortcut ‘Ctrl + Enter’ (or ‘Ctrl + R’). The same can also be done for selections of text.

3.2 Assignment

3.2.1 Assigning a numeric value to an object

The next logical step after determining that 2+2 = 4, is storing that information for later use. This is done by ‘assigning’ the value ‘4’ to an object we will call x by using the assignment operator <- (you could also use =, but that has some restrictions on its use, whereas <- can be used anywhere).

x <- 4

You should now see x and its value displayed in the ‘Environment’ pane. Which means it is stored in memory and can be called back again using the console.

x
## [1] 4

Inspecting the class of x with the class function reveals that it is ‘numeric’.

class(x)
## [1] "numeric"

Which means that it can be used in arithmetic operations.

x + 2
## [1] 6

x currently contains only one element (i.e. it has a length of 1), but this can be changed by reassigning multiple values to x using the c or ‘combine’ function. Below we reassign x with two 8’s, changing it to a numeric vector with two elements.

x <- c(8, 8)
x
## [1] 8 8
class(x)
## [1] "numeric"
length(x)
## [1] 2

Note that x can still be used in arithmetic operations. Adding ‘5’ to x will add ‘5’ to each of element of x.

x + 5
## [1] 13 13

3.2.2 Numeric vs. Character

As mentioned in the introduction, ‘numeric’ is one of several available classes. Another class that you will frequently encounter is ‘character’. A ‘character’ element is typically a string of alphanumeric characters, such as words.

y <- "blue"
class(y)
## [1] "character"

We can combine ‘character’ elements into a vector with the c function.

y <- c(y, "red", y)
y
## [1] "blue" "red"  "blue"

Predictably, errors occur when you try to perform addition on characters.

y + 2 
## Error in y + 2: non-numeric argument to binary operator

However, what about if you have a vector with both data types?

z <- c(2, 4, 6, "blue")
class(z)
## [1] "character"
z + 2
## Error in z + 2: non-numeric argument to binary operator

So, it only takes one ‘character’ element, to force the assignment of the z object to the ’character` class.

3.3 Classes

3.3.1 Vectors

Within R, all objects have a class. The class defines what kind of information the object contains and how it may be used. The previous example demonstrated some characteristics of ‘numeric’ vs ‘character’ objects.

In general, data will most often be classified as either a vector, data frame or matrix.

As seen previously, a vector is a sequence of data elements.

x <- c(1,2,3)
y <- c(2,4,6)

Operations can be performed on the vector as a whole.

x+2
## [1] 3 4 5

Or on individual elements of a vector using [] to indicate which element.

y[3]*2
## [1] 12

Or using multiple vectors.

x+y
## [1] 3 6 9

3.3.2 Matrices

A matrix is a two dimensional array. So lets make one by combining x and y using the rbind function. Note that in addition to the data elements, table1 also has row names (‘x’ and ‘y’) and column names ([,1], [,2], and [,3]).

table1 <- rbind(x, y)
table1
##   [,1] [,2] [,3]
## x    1    2    3
## y    2    4    6
class(table1)
## [1] "matrix" "array"

Individual elements of a matrix can be also accessed using [] to indicate the indices of each dimension separated by a comma (row then column).

table1[2,3]
## y 
## 6
table1[1,3] + table1[2,1]
## x 
## 5
table1*2
##   [,1] [,2] [,3]
## x    2    4    6
## y    4    8   12

3.3.3 Matrices vs. Data Frames

A matrix can contain only one type of data (i.e. only ‘numeric’ or only ‘character’). In contrast, a data frame is a table of data that can (but doesn’t necessarily) contain mixed data types. They can be created with data.frame function.

table2 <- data.frame(Col1=c(1, 2), Col2=c("black", "white"))
table2
##   Col1  Col2
## 1    1 black
## 2    2 white
class(table2)
## [1] "data.frame"

There are many more classes within R, but initially you will likely spend most of your time working with matrices and data frames. It is important to stay mindful of the differences between these two classes.

A matrix can be changed into a data frame using the as.data.frame function.

x <- as.data.frame(table1)
class(x)
## [1] "data.frame"
x
##   V1 V2 V3
## x  1  2  3
## y  2  4  6

Similarly, a data frame can be changed into a matrix by using the as.matrix function. However, matrices must contain only one data type, and a number can be converted into a character, but not vice versa. Thus, when converting table2 to a matrix, the numeric values are converted to character values (denoted in quotation marks), and are no longer available for arithmetic operations.

x <- as.matrix(table2)
class(x)
## [1] "matrix" "array"
x
##      Col1 Col2   
## [1,] "1"  "black"
## [2,] "2"  "white"
x[1,1]*2
## Error in x[1, 1] * 2: non-numeric argument to binary operator

This becomes important when trying to combine matrices of different data types. For example, lets combine table1 a numeric matrix, with x (the copy of table2 with everything converted to ‘character’ data) using the cbind function.

x <- cbind(table1, x)
x
##               Col1 Col2   
## x "1" "2" "3" "1"  "black"
## y "2" "4" "6" "2"  "white"

By default, output inherits class from the input. So here, x must be a matrix, and matrices can contain only one data type, therefore all the numeric values in both table1 and x are converted to character values (as denoted by the quotation marks).

class(x)
## [1] "matrix" "array"
x[1,3] + x[2,1]
## Error in x[1, 3] + x[2, 1]: non-numeric argument to binary operator

In this case, it would probably make more sense for the output to be a data frame, containing both numeric and character columns. One way to do this, is by nesting the as.data.frame function within the cbind function. As we are now combining objects of different classes, R must choose a class for the output object. In this case it defaults to a data frame, as this can be done withoutchange any of the data with either object.

x <- cbind(table1, as.data.frame(table2))
class(x)
## [1] "data.frame"
x
##   1 2 3 Col1  Col2
## x 1 2 3    1 black
## y 2 4 6    2 white
x[1,3] + x[2,1]
## [1] 5

3.4 Functions

The past few examples should have provided some insight regarding the potential strength of functions. Functions allow very complex operations and can be combined in many ways.

Functions follow the general form: function(input, optional arguments).

For example, the function round, will round an input value ‘x’ to ‘n’ significant digits, when round(x,digits=n) is entered into the console. Replacing ‘x’ and ‘n’ with actual numbers gives something like:

round(3.141593, digits=5)
## [1] 3.14159
round(3.141593, digits=2) 
## [1] 3.14

The available documentation for a function can be viewed by entering a ? before its name into the console.

?round

A base install of R contains many functions and so far we have already used: class, c, length, rbind, cbind, as.data.frame, as.matrix, and round. It is also possible to define your own (that can incorporate other functions) with the syntax: function(input){operations}.

For example, we can create a function that will calculate what percentage each element of some input data contributes to the total sum of the input: function(x){x/sum(x)}. Assigning (<-) this function to an object called percentage then allows it to be applied repeatedly to different datasets.

percentage <- function(x){x/sum(x)}

Although not particularly useful on objects containing only a single element, we can use the c function to combine several numbers into a vector, then run percentage on the vector as a whole.

percentage(10)
## [1] 1
percentage(c(5,5,10))
## [1] 0.25 0.25 0.50

3.5 Importing and Exporting Files

Most of the data you will want to analyze with R, will be generated outside of R. Thus, it will need to be imported before you work with it. The simplest procedure to import data files is:

  • First convert your data file into .csv format (this can be done in excel or whatever spreadsheet program you are using)
  • The import will done by entering the read.csv function into the console along with several arguments
  • The file=file.choose() argument will open a dialog box that will allow you select the file you wish to import
  • The header=T argument will assign the top row of the imported file to the column names of the generated object
  • The stringsAsFactors=FALSE argument will ensure character strings stay as character strings
count.data <- read.csv(file=file.choose(), header=T, stringsAsFactors=FALSE)

Similarly, to export an object from within your R environment to a .csv file, use the write.csv function. If we wanted to export the object output.data to a .csv file, we would enter the following into the console to open up a dialog box to select the location of the new file.

write.csv(output.data, file=file.choose())

3.6 Graphics

In addition to statistical analyses, one of the main uses of R is the generation of high quality figures. However, creating figures with R is a complex topic, and currently beyond the scope of this guide.

To begin, basic plots can be generated with the plot function.

x <- c(1,2,3,4)
y <- c(2,4,6,8)

plot(x,y, type ='l')

?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /usr/lib/R/library
##   base                  /usr/lib64/R/library
## 
## 
## Using the first match ...

Complex plots are also possible with the plot function, but there are also alternate graphic systems available through external packages such as ggplot2.

3.7 Using Packages

A few different methods for installing external packages were given in the Introduction of this guide. However, before the functions within a package become available for use, the package needs to be loaded into memory.

For example, if you installed the rmarkdown function earlier, trying to bring up the help file for its render function without the package loaded should generate an error.

?render

Load an installed package (in this case rmarkdown) into memory with the library function, and try again.

library(rmarkdown)
?render

4 Data Analysis Tools

This section lists common data analysis techniques used at PEARL and the functions and/or packages that contain the tools necessary to perform them.

Over time I would like to add some detailed example code for each of these techniques.

4.1 Normality

To check your the normality of your data, you can use the Shapiro-Wilk Normality Test provided in a base installation of R by the shapiro.test function.

shapiro.test(input.data)

If you are dealing with a large number of variables, it may make sense to examine all of them at once using the apply function.

apply(input.data, 2, shapiro.test)

4.2 Transformations

Data can be square-root transformed using the sqrt function. Data can be log (base 10) transformed using log10. Note that by default the log function will calculate the natural logarithm (i.e. ln).

sqrt(input.data)
log10(input.data)
log(input.data)

4.3 Standardization

The decostand function within the vegan package provides several standardization methods (e.g. ‘chi square’ and ‘hellinger’)

?decostand

4.3.1 Z-Scores

Z-scores can be calculated with the scale function available in a base installation of R. Checking the details of the help file reveals that scale has two arguments, center and scale.

?scale
  • If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns
  • If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise.

Therefore, z-scores for input.data can be calculated with:

scale(input.data, center=TRUE, scale=TRUE)

4.4 Correlation

Correlations (Pearson, Kendall, and Spearman) can be examined using the cor function.

cor(input.data)

A nice correlation matrix can be produced using the corrplot package.

library(corrplot)
newdatacor = cor(my.dat[1:13])
corrplot(newdatacor, method = "number")

data(mtcars)
M <- cor(mtcars)
set.seed(0)

corrplot(M, method = "number", col = "black", cl.pos = "n")

4.5 Species Diversity (Hill’s N2)

Hill’s N2 values (Hill 1973)1 can be calculated for your data using the Hill.N2 function provided by the rioja package.

N2=Hill.N2(input.data, margin=2)

4.6 Rarefaction

Rarefaction is a method to correct for bias in species richness between samples of unequal size, by standardizing across samples to the number of species expected in a sample of the same total size as the smallest sample.

This can be done within R using the rarefy function provided by the vegan package.

?rarefy

4.7 Age-Depth Modelling

4.7.1 Gamma Dating

At PEARL, we currently use ScienTissiME to produce 210Pb chronologies. However, when reanalyzing cores dated prior to 2012-2013, they have been dated with the binford functions provided by the jezioro package.

4.7.2 Radiocarbon Dating

For time-scales too long for 210Pb analyses, an implementation of the Bacon approach to age-depth modelling (i.e. using Bayesian statistics to reconstruct accumulation histories for deposits, by combining radiocarbon and other dates with prior information; Blaauw and Christen, 20112) is provided by the package rbacon.

4.8 Ordination

Tools to perform ordinations are provided by the vegan package.

A detrended correspondance analysis (DCA) can be performed using the function decorana. For example, to run a DCA on a matrix called “speciesData” (arranged with species as columns and samples as rows).

library(vegan)
outputDCA <- decorana(speciesData)
summary(outputDCA)
plot(outputDCA)

A principal component analysis (PCA) can be performed using the function rda (Note that the “scale” argument can be used to scale values to unit variance).

outputPCA <- rda(speciesData, scale=TRUE)
summary(outputPCA, axes=4)

The biplot function can be use to generate a PCA biplot with species scores indicated by biplot arrows (Note that the scaling argument can be used to scale either site(1), species (2), or both by eigenvalues. See ?biplot.rda for more details.

biplot(outputPCA, scaling=2) 

A redundancy analysis (RDA) is performed by the rda function when two matrices are provided as input, one containing the species data and the second containing the environmental data.

outputRDA <- rda(speciesData ~  ., envData)
outputRDA
summary(outputRDA, scaling=0, axes=4)
plot(outputRDA, scaling=2)

To manually perform forward selection on an RDA model use add1.cca

testRDA <- rda(speciesData ~ 1, envData) # defines starting model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratios
testRDA <- update (testRDA, . ~ . + pH) # update model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratio
testRDA <- update (testRDA, . ~ . + Ca) # updates model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # continue until no more variables can improve model

anova (testRDA, by="terms", step=500) # tests terms in final model
testRDA
summary(testRDA, scaling=0, axes=4)
plot(testRDA, scaling=2) 

Finally, examine the variable inflation factors in both the full model and reduced models.

vif.cca(outputRDA)
vif.cca(testRDA)

4.9 ANOSIM and SIMPER

An ANOSIM can be performed using the anosim function provided by the vegan package. Similarity percentages can be determined with the simper function also provided by the vegan package.

indVAL?

4.10 Transfer Functions

Transfer functions can be built using the WA function provided by the rioja package.

For example, the midge/VWHO calibration set described in detail by Quinlan and Smol3 4 is provided as a dataset called vwhoQuinlan2010 by the jezioro package.

Therefore the Quinlan and Smol midge/VWHO model can be built using the following code:

library(jezioro)
library(rioja)
data(vwhoQuinlan2010)
vwho.model <- WA(vwhoQuinlan2010[,3:47], vwhoQuinlan2010[,2], tolDW=TRUE)

The Quinlan and Smol papers settled on the WA.inf.tol_XVal model as having the best performance metrics (RMSE=1.98 mg/L, R2=0.5998). To evaluate that the vwho.model object has the same performance as the model described by Quinlan and Smol, use the crossval function also provided by rioja.

library(rioja)
rioja::crossval(vwho.model, cv.method="loo", verbose=FALSE, ngroups=10, nboot=100, h.cutoff=0, h.dist=NULL)
## 
## Method : Weighted Averaging
## Call   : WA(y = vwhoQuinlan2010[, 3:47], x = vwhoQuinlan2010[, 2], tolDW = TRUE) 
## 
## Tolerance DW       : Yes 
## Monotonic deshrink : No 
## No. samples        : 54 
## No. species        : 45 
## Cross val.         : loo 
## 
## Deshrinking regression coefficients:
##           Inverse d/s Classical d/s
## wa.b0        -14.1270        4.1927
## wa.b1          3.7760        0.1758
## wa.tol.b0    -12.0037        3.9839
## wa.tol.b1      3.4135        0.2007
## 
## Performance:
##                    RMSE      R2  Avg.Bias  Max.Bias    Skill
## WA.inv           1.8168  0.6636    0.0000    2.7096  66.3643
## WA.cla           2.2302  0.6636    0.0000    2.5285  49.3166
## WA.inv.tol       1.7577  0.6852    0.0000    2.4689  68.5171
## WA.cla.tol       2.1235  0.6852    0.0000    2.1941  54.0510
## WA.inv_XVal      1.9896  0.5967   -0.0209    3.0363  59.6633
## WA.cla_XVal      2.3260  0.6085   -0.0389    2.9700  44.8688
## WA.inv.tol_XVal  1.9817  0.5998    0.0017    2.8652  59.9821
## WA.cla.tol_XVal  2.2577  0.6117   -0.0011    2.7239  48.0607

Confident that your model is the same as the one described in the manuscripts, it can then be applied down-core to midge sedimentary assemblage data to reconstruct VWHO concentrations.

4.11 Merging Data Sets

Data sets can be merged on common columns using the join function provided by the analogue package.

This can be useful when comparing downcore data sets with a calibration set, by creating a merged data set adding rows/columns (populated by zeroes) as necessary for those taxa absent from either of the original datasets.

library(analogue)
calSet <- data.frame(taxa1=c(0.25, 0.75, 0.25),taxa2=c(0.20, 0.05, 0.35), taxa3=c(0.55, 0.20, 0.50), row.names=c("cal1", "cal2", "cal3"))

testSet <- data.frame(taxa1=c(0.25, 0.25, 0.50),taxa4=c(0.75, 0.10, 0.15), row.names=c("test1", "test2", "test3"))

join(calSet, testSet, type="outer", split=FALSE)

The aggregate funciton provided in a base installation of R can also be useful when harmonizing disparate data sets.

4.12 Analog Matching

Identifying whether a fossil data set has any close modern analogs within a calibration set (i.e. evaluating a transfer function) can be done using the analog function provided by the analogue package.

The analogue package also contains a vignette with a detailed analog matching example that follows the approach described in Flower et al. 19975 and Simpson et al. 20056.

library(analogue)
vignette("analogue_methods")

4.13 Multivariate Regression Trees

Regression trees can be constructed using the rpart package.

4.14 Breakpoint Analysis

Davies Test

4.15 LOESS

Locally Estimated Scatterplot Smoothing (LOESS) can be performed using the loess function provided by the bast stats package.

4.16 GAMs

Generalized Additive Models (or GAMs) can be constructed using the mgcv package.

GAMs are increasingly applied in paleolimnological analyses because:

  • Time series from the sediment record are often irregularly spaced in time, complicating the use of classical statistical time series models.
  • Paleolimnological analyses of temporal trends often use linear regressions or (non-)parametric correlations despite potential violations of the underlying assumptions that likely occur due to temporal dependencies within the dataset.
  • LOESS estimated trends contain inherent subjectivity regarding the parameters chosen to achieve the LOESS fit (e.g. the span)
  • GAMS are statistical models that can be used to estimate trends as smooth functions of time; however, unlike LOESS, GAMs use automatic smoothness selection methods to objectively determine the complexity of the fitted trend.
  • As formal statistical models, GAMs allow for potentially complex, non-linear trends, provide an account of model uncertainty, and can identify periods of significant temporal change.

Simpson (2018)7 provides a detailed introduction to the use of GAMs and their construction using R (the supplementary information for the article contains the annotated R code used to perform the described analyses).

5 Advanced Usage

5.1 Data Manipulation Verbs

The dplyr package provides several functions that act intuitively as verbs for data manipulation.

These functions include: filter, select, mutate, arrange, and summarise.

5.2 Forward Pipe: %>%

The %>% command provided by the magrittr package functions similarly to the Unix pipe “|” command. It “pipes” the output from one function directly into the next (as its first argument), allowing intermediate steps to be bypassed, reducing the need for repetition in the code.

Note that due to the special characters in %>%, pulling up it’s help file requires:

?magrittr::`%>%`

A simple example:

x <- c(5,4,3,2,1)
x %>% mean
## [1] 3

A more complicated example using %>% in combination with some of the data frame manipulation verbs provided by dplyr:

Import the vwhoQuinlan2010 dataset from jezioro, and convert it to a data frame.

data(vwhoQuinlan2010)
vwhoQuinlan2010 <- as.data.frame(vwhoQuinlan2010)

Calculate the mean VWHO of those lakes (rows) where TANYSL <10 and MICROP >40.

filter(vwhoQuinlan2010, MICROP>30) %>% filter(TANYSL<10) %>% summarise(mean(VWHO))
##   mean(VWHO)
## 1        4.6

5.3 Plotting with ggplot2

ggplot2 is a data visualization package that implements concepts outlined in ‘The Grammar of Graphics’ a book ‘about grammatical rules for creating perceivable graphs’.

The syntax used to produce plots with ggplot2 is quite different from that of the base graphics system. For example:

library(ggplot2)

activities <- read.csv(file="data/ggplot2_activities.csv", header =T)

ggplot(activities, aes(x=Lake, y=Pb210, ymin=Pb210-Pb210err, ymax=Pb210+Pb210err)) +
  theme(legend.position=c(0.90, 0.85), axis.text.x=element_text(size=8)) +
  geom_point(aes(reorder(Lake, -Pb210))) +
  geom_errorbar(aes(reorder(Lake, -Pb210), colour=Interval)) +
  ylab("Total 210Pb Activity (Bq/kg)")

The ggplot2 package is able to construct very complex figures (click here for an extensive list of examples), and there are extensive resources available detailing its use. However, at PEARL we occasionally need to break some of its rules (e.g multiple y-axes), thus ggplot2 is not the best tool for producing stratigraphies that show both depth and age.

5.4 Composite Plots

5.5 Basic Maps

With the appropriate shapefiles, basic maps can be generated using ggplot2 and rgdal (Bindings for the ‘Geospatial’ Data Abstraction Library).

library(ggplot2)
library(rgdal)

For example, to add a few points to an outline of Ontario:

  1. Read in the shapefile, and convert it to a dataframe for use with ggplot2.
ontario<-readOGR("data/ontarioShapefile/Ont2.shp", layer="Ont2")
ontario_df <- fortify(ontario)
  1. Import (or in this example create) a dataframe containing the data for the plot.
sites <- data.frame(name=c("queens", "lake"), decLatitude=c(44.22, 45.18), decLongitude=c(-76.50, -78.83), stringsAsFactors = FALSE)
  1. Generate the plot with geom_polygon and add the points with geom_point. Note that by default this will use a Mercator projection centered on the Prime Meridian.
map <- ggplot() + 
  geom_polygon(data = ontario_df, aes(x = long, y = lat, group = group), fill="grey40") +
  geom_point(data=sites, aes(x=decLongitude, y=decLatitude))
print(map)

  1. The projection can be changed with coord_map. For example, to change the previous map to a Lambert projection o a Lambert projection with a true scale between latitudes of 40 and 70 degrees N:
#The projection can be changed using 'coord_map'
map <- map +
  coord_map("lambert", lat0=40, lat1=60)
print(map)