Despite the countless blog posts, textbooks, and online courses focused on R that litter the internet, this document attempts to provide an overview of how R is typically used in paleolimnological analyses at PEARL.
The purpose of the guide is to help both new and experienced users refine their many inevitable internet searches prefaced with "r help …", since even with some familiarity with a given statistical technique, it may be unclear how to perform it using R.
The guide begins with some basic information for users new to R; however, the majority of content is focused on common analyses used in paleolimnological studies and the specific packages and functions that can facilitate them. The guide was created using R Markdown, specifically to allow easy updates and a gradual evolution over time. Therefore, please share any suggestions you may have to improve future versions.
R is an open-source programming language and software environment for statistical computing and graphics. R is freely available for Linux, macOS, and Windows and the current version can be obtained directly from the project website: https://www.r-project.org/.
As R is a command line program, many people choose to control it through a graphical interface. There are several different graphical interfaces or integrated development environments (IDE) to choose from, but the one most commonly used at PEARL is RStudio. RStudio is an open-source IDE for R, freely available for Linux, macOS, and Windows, and its current version can be obtained directly from its project website: https://www.rstudio.com/.
RStudio is not required for anything described in this guide; it simply provides multiple panes, menus, and buttons that make some aspects of R more accessible. However, for simplicity, the guide assumes RStudio is installed, and that if you choose to use R on its own or with a different IDE, you are capable of translating the described operations to your particular setup.
Objects are the fundamental elements of the R programming language. Within R, an object is simply a data structure with some attributes. Therefore, essentially everything that you work with during an R session will be an object of some kind.
Most objects have a class attribute that defines the type of data the object contains. Classes include things like 'character', 'numeric', 'logical', 'data.frame' and 'function'.
Functions are the principal tools used in R. Functions are objects that can take arguments as input and return a new object as output. Some are quite simple (e.g. mean returns the mean value of a numeric object), whereas others are much more complex (e.g. those used to perform ordinations). Many different functions are available within a base installation of R; however, as you become familiar with R, it becomes possible to write functions of your own that can be shared via scripts and packages.
All R functions and datasets are stored in packages, including everything contained within a base install of R. However, this base functionality can be extended through external packages that contain additional functions (along with relevant documentation). It may be helpful to view this relationship as follows: packages are the 'apps' available for the R 'operating system'. The main source for external packages is the Comprehensive R Archive Network (CRAN) repository (the 'app store' in the OS analogy).
It is also possible to install external packages from a local archive file (e.g. .zip or .tar.gz), and this is how the jezioro package (that contains several functions and datasets created at PEARL) is currently distributed. The latest version of jezioro is always available on the PEARL website.
R is able to access the CRAN repository directly, and external packages are typically installed either through the console, or using an IDE such as RStudio.
To install the package rmarkdown using the console, you would enter the following command into the lower left pane of RStudio (the 'Console' pane):
install.packages("rmarkdown")
However, the functions contained within that package do not become available until it is loaded into memory with the library command.
library("rmarkdown")
To install a package using RStudio: select the 'Packages' tab in the lower right pane, click the 'Install' button, and enter the name of the desired package (e.g. rmarkdown).

The jezioro package contains several functions and data sets written at PEARL and is distributed as an archive file; however, many other packages useful to the analyses performed at PEARL are available from the CRAN repository. These include:
- analogue - Gavin Simpson's stat package needed for some WA techniques
- dplyr - A set of tools for working with data-frame-like objects
- ggplot2 - A system for 'declaratively' creating graphics, based on the book "The Grammar of Graphics"
- magrittr - Provides a mechanism to chain commands with a forward-pipe operator, %>%
- rioja - Functions for the analysis of Quaternary science data (the successor to C2)
- rLakeAnalyzer - Physical limnology package
- rmarkdown - A package for converting R Markdown documents into a variety of formats
- roxygen2 - A package to simplify documenting your own packages
- vegan - Ordination methods, diversity analysis and other functions for community and vegetation ecologists

All objects within R should have documentation accessible with the help function. To bring up the documentation for a function, package, or dataset, enter 'help("object of interest")' into the console and the help file should appear in the 'Help' pane of RStudio (on the bottom right). For example, to bring up the documentation for the help function itself, enter:
help("help")
The ? is a shortcut for the help function, so entering the following will also bring up the documentation for the help function.
?help
The help documentation for a package (which should contain a list of all of its functions) is accessed in the same way. So to bring up the help file for the rmarkdown package installed earlier, use:
?rmarkdown
However, this will generate an error if the package is not loaded into memory. To search for help documentation on objects that are installed, but not currently loaded, use a double question mark ??.
??rmarkdown
After installing both R and RStudio, starting RStudio will present a screen split into four panes, as shown in the official RStudio Cheat Sheet.
By default these panes are:

- 'Source' (top left): where scripts are written, edited, and saved
- 'Console' (bottom left): where commands are entered and output is displayed
- 'Environment'/'History' (top right): the objects currently in memory and the commands previously entered
- 'Files'/'Plots'/'Packages'/'Help'/'Viewer' (bottom right): a file browser, plot window, package list, and help documentation
Typing directly into the console is the simplest way to use R. For example, entering
2+2
will return the output,
## [1] 4
Here the [1] indicates that the first element of the result (in this case, there is only one element) has a value of 4.
As you start performing more complex operations, you will find that it is easier to enter your input in the 'Source' pane, where it can be saved to a file for later use, rather than retyping everything each time. These R script files typically have an .R extension.
Individual lines can be sent from the ‘Source’ pane to the console, by moving the cursor to the desired line and either hitting the ‘Run’ button, or using the keyboard shortcut ‘Ctrl + Enter’ (or ‘Ctrl + R’). The same can also be done for selections of text.
The next logical step after determining that 2+2 = 4 is storing that information for later use. This is done by 'assigning' the value '4' to an object we will call x by using the assignment operator <- (you could also use =, but that has some restrictions on its use, whereas <- can be used anywhere).
x <- 4
You should now see x and its value displayed in the 'Environment' pane, which means it is stored in memory and can be called back again using the console.
x
## [1] 4
Inspecting the class of x with the class function reveals that it is 'numeric'.
class(x)
## [1] "numeric"
This means that it can be used in arithmetic operations.
x + 2
## [1] 6
x currently contains only one element (i.e. it has a length of 1), but this can be changed by reassigning multiple values to x using the c or 'combine' function. Below we reassign x with two 8's, changing it to a numeric vector with two elements.
x <- c(8, 8)
x
## [1] 8 8
class(x)
## [1] "numeric"
length(x)
## [1] 2
Note that x can still be used in arithmetic operations. Adding '5' to x will add '5' to each element of x.
x + 5
## [1] 13 13
As mentioned in the introduction, ‘numeric’ is one of several available classes. Another class that you will frequently encounter is ‘character’. A ‘character’ element is typically a string of alphanumeric characters, such as words.
y <- "blue"
class(y)
## [1] "character"
We can combine 'character' elements into a vector with the c function.
y <- c(y, "red", y)
y
## [1] "blue" "red" "blue"
Predictably, errors occur when you try to perform addition on characters.
y + 2
## Error in y + 2: non-numeric argument to binary operator
However, what about if you have a vector with both data types?
z <- c(2, 4, 6, "blue")
class(z)
## [1] "character"
z + 2
## Error in z + 2: non-numeric argument to binary operator
So, it only takes one 'character' element to force the assignment of the z object to the 'character' class.
Within R, all objects have a class. The class defines what kind of information the object contains and how it may be used. The previous example demonstrated some characteristics of ‘numeric’ vs ‘character’ objects.
In general, data will most often be classified as either a vector, data frame or matrix.
As seen previously, a vector is a sequence of data elements.
x <- c(1,2,3)
y <- c(2,4,6)
Operations can be performed on the vector as a whole.
x+2
## [1] 3 4 5
Or on individual elements of a vector using [] to indicate which element.
y[3]*2
## [1] 12
Or using multiple vectors.
x+y
## [1] 3 6 9
A matrix is a two-dimensional array. So let's make one by combining x and y using the rbind function. Note that in addition to the data elements, table1 also has row names ('x' and 'y') and column names ([,1], [,2], and [,3]).
table1 <- rbind(x, y)
table1
## [,1] [,2] [,3]
## x 1 2 3
## y 2 4 6
class(table1)
## [1] "matrix" "array"
Individual elements of a matrix can also be accessed using [] to indicate the indices of each dimension separated by a comma (row then column).
table1[2,3]
## y
## 6
table1[1,3] + table1[2,1]
## x
## 5
table1*2
## [,1] [,2] [,3]
## x 2 4 6
## y 4 8 12
A matrix can contain only one type of data (i.e. only 'numeric' or only 'character'). In contrast, a data frame is a table of data that can (but doesn't necessarily) contain mixed data types. Data frames can be created with the data.frame function.
table2 <- data.frame(Col1=c(1, 2), Col2=c("black", "white"))
table2
## Col1 Col2
## 1 1 black
## 2 2 white
class(table2)
## [1] "data.frame"
There are many more classes within R, but initially you will likely spend most of your time working with matrices and data frames. It is important to stay mindful of the differences between these two classes.
A matrix can be changed into a data frame using the as.data.frame function.
x <- as.data.frame(table1)
class(x)
## [1] "data.frame"
x
## V1 V2 V3
## x 1 2 3
## y 2 4 6
Similarly, a data frame can be changed into a matrix by using the as.matrix function. However, matrices must contain only one data type, and a number can be converted into a character, but not vice versa. Thus, when converting table2 to a matrix, the numeric values are converted to character values (denoted in quotation marks), and are no longer available for arithmetic operations.
x <- as.matrix(table2)
class(x)
## [1] "matrix" "array"
x
## Col1 Col2
## [1,] "1" "black"
## [2,] "2" "white"
x[1,1]*2
## Error in x[1, 1] * 2: non-numeric argument to binary operator
This becomes important when trying to combine matrices of different data types. For example, let's combine table1, a numeric matrix, with x (the copy of table2 with everything converted to 'character' data) using the cbind function.
x <- cbind(table1, x)
x
## Col1 Col2
## x "1" "2" "3" "1" "black"
## y "2" "4" "6" "2" "white"
By default, output inherits its class from the input. So here, x must be a matrix, and matrices can contain only one data type; therefore all the numeric values in both table1 and x are converted to character values (as denoted by the quotation marks).
class(x)
## [1] "matrix" "array"
x[1,3] + x[2,1]
## Error in x[1, 3] + x[2, 1]: non-numeric argument to binary operator
In this case, it would probably make more sense for the output to be a data frame, containing both numeric and character columns. One way to do this is by nesting the as.data.frame function within the cbind function. As we are now combining objects of different classes, R must choose a class for the output object. In this case it defaults to a data frame, as this can be done without changing any of the data within either object.
x <- cbind(table1, as.data.frame(table2))
class(x)
## [1] "data.frame"
x
## 1 2 3 Col1 Col2
## x 1 2 3 1 black
## y 2 4 6 2 white
x[1,3] + x[2,1]
## [1] 5
The past few examples should have provided some insight into the power of functions. Functions allow very complex operations and can be combined in many ways.
Functions follow the general form: function(input, optional arguments).
For example, the function round will round an input value 'x' to 'n' decimal places when round(x, digits=n) is entered into the console. Replacing 'x' and 'n' with actual numbers gives something like:
round(3.141593, digits=5)
## [1] 3.14159
round(3.141593, digits=2)
## [1] 3.14
The available documentation for a function can be viewed by entering a ? before its name into the console.
?round
A base install of R contains many functions, and so far we have already used: class, c, length, rbind, cbind, as.data.frame, as.matrix, and round. It is also possible to define your own (that can incorporate other functions) with the syntax: function(input){operations}.
For example, we can create a function that will calculate what proportion each element of some input data contributes to the total sum of the input: function(x){x/sum(x)}. Assigning (<-) this function to an object called percentage then allows it to be applied repeatedly to different datasets.
percentage <- function(x){x/sum(x)}
Although not particularly useful on objects containing only a single element, we can use the c function to combine several numbers into a vector, then run percentage on the vector as a whole.
percentage(10)
## [1] 1
percentage(c(5,5,10))
## [1] 0.25 0.25 0.50
Most of the data you will want to analyze with R will be generated outside of R. Thus, it will need to be imported before you work with it. The simplest procedure to import data files is to enter the read.csv function into the console along with several arguments:

- the file=file.choose() argument will open a dialog box that will allow you to select the file you wish to import
- the header=T argument will assign the top row of the imported file to the column names of the generated object
- the stringsAsFactors=FALSE argument will ensure character strings stay as character strings
argument will ensure character strings stay as character stringscount.data <- read.csv(file=file.choose(), header=T, stringsAsFactors=FALSE)
Similarly, to export an object from within your R environment to a .csv file, use the write.csv function. If we wanted to export the object output.data to a .csv file, we would enter the following into the console to open up a dialog box to select the location of the new file.
write.csv(output.data, file=file.choose())
In addition to statistical analyses, one of the main uses of R is the generation of high quality figures. However, creating figures with R is a complex topic, and currently beyond the scope of this guide.
To begin, basic plots can be generated with the plot function.
x <- c(1,2,3,4)
y <- c(2,4,6,8)
plot(x,y, type ='l')
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /usr/lib/R/library
## base /usr/lib64/R/library
##
##
## Using the first match ...
Complex plots are also possible with the plot function, but there are also alternate graphic systems available through external packages such as ggplot2.
A few different methods for installing external packages were given in the Introduction of this guide. However, before the functions within a package become available for use, the package needs to be loaded into memory.
For example, if you installed the rmarkdown package earlier, trying to bring up the help file for its render function without the package loaded should generate an error.
?render
Load an installed package (in this case rmarkdown) into memory with the library function, and try again.
library(rmarkdown)
?render
This section lists common data analysis techniques used at PEARL and the functions and/or packages that contain the tools necessary to perform them.
Over time I would like to add some detailed example code for each of these techniques.
To check the normality of your data, you can use the Shapiro-Wilk Normality Test provided in a base installation of R by the shapiro.test function.
shapiro.test(input.data)
If you are dealing with a large number of variables, it may make sense to examine all of them at once using the apply function.
apply(input.data, 2, shapiro.test)
Data can be square-root transformed using the sqrt function. Data can be log (base 10) transformed using log10. Note that by default the log function will calculate the natural logarithm (i.e. ln).
sqrt(input.data)
log10(input.data)
log(input.data)
The decostand function within the vegan package provides several standardization methods (e.g. 'chi square' and 'hellinger').
?decostand
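For example, a Hellinger transformation could be applied with something like the following minimal sketch (assuming the hypothetical input.data object used above is a matrix of species counts):

library(vegan)
# Hellinger-transform the (assumed) count matrix 'input.data'
input.hel <- decostand(input.data, method="hellinger")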
Z-scores can be calculated with the scale function available in a base installation of R. Checking the details of the help file reveals that scale has two arguments, center and scale.
?scale
- if center is TRUE, centering is done by subtracting the column means of x from their corresponding columns
- if scale is TRUE, scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise

Therefore, z-scores for input.data can be calculated with:
scale(input.data, center=TRUE, scale=TRUE)
Correlations (Pearson, Kendall, and Spearman) can be examined using the cor function.
cor(input.data)
A nice correlation matrix can be produced using the corrplot package.
library(corrplot)
newdatacor = cor(my.dat[1:13])
corrplot(newdatacor, method = "number")
data(mtcars)
M <- cor(mtcars)
set.seed(0)
corrplot(M, method = "number", col = "black", cl.pos = "n")
Hill's N2 values (Hill 1973) can be calculated for your data using the Hill.N2 function provided by the rioja package.
N2=Hill.N2(input.data, margin=2)
Rarefaction is a method to correct for bias in species richness between samples of unequal size, by standardizing across samples to the number of species expected in a sample of the same total size as the smallest sample.
This can be done within R using the rarefy function provided by the vegan package.
?rarefy
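A minimal sketch, assuming a hypothetical count matrix count.data with samples as rows and taxa as columns (raw counts, not percentages):

library(vegan)
# size (total count) of the smallest sample
raremax <- min(rowSums(count.data))
# expected species richness of each sample, rarefied to 'raremax' individuals
rarefy(count.data, sample=raremax)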
At PEARL, we currently use ScienTissiME to produce 210Pb chronologies. However, cores dated prior to 2012-2013 were dated with the binford functions provided by the jezioro package, which may still be needed when reanalyzing those cores.
For time-scales too long for 210Pb analyses, an implementation of the Bacon approach to age-depth modelling (i.e. using Bayesian statistics to reconstruct accumulation histories for deposits, by combining radiocarbon and other dates with prior information; Blaauw and Christen, 2011) is provided by the package rbacon.
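As a minimal sketch, the core function of the rbacon package is Bacon; running it on the example core distributed with the package (and accepting the default priors) looks something like:

library(rbacon)
# run the age-depth model on the package's example core ("MSB2K") using default settings
Bacon("MSB2K")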
Tools to perform ordinations are provided by the vegan package.
A detrended correspondence analysis (DCA) can be performed using the function decorana. For example, to run a DCA on a matrix called "speciesData" (arranged with species as columns and samples as rows):
library(vegan)
outputDCA <- decorana(speciesData)
summary(outputDCA)
plot(outputDCA)
A principal component analysis (PCA) can be performed using the function rda (note that the "scale" argument can be used to scale values to unit variance).
outputPCA <- rda(speciesData, scale=TRUE)
summary(outputPCA, axes=4)
The biplot function can be used to generate a PCA biplot with species scores indicated by biplot arrows (note that the scaling argument can be used to scale either sites (1), species (2), or both by eigenvalues; see ?biplot.rda for more details).
biplot(outputPCA, scaling=2)
A redundancy analysis (RDA) is performed by the rda function when two matrices are provided as input, one containing the species data and the second containing the environmental data.
outputRDA <- rda(speciesData ~ ., envData)
outputRDA
summary(outputRDA, scaling=0, axes=4)
plot(outputRDA, scaling=2)
To manually perform forward selection on an RDA model, use add1.cca:
testRDA <- rda(speciesData ~ 1, envData) # defines starting model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratios
testRDA <- update (testRDA, . ~ . + pH) # update model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratio
testRDA <- update (testRDA, . ~ . + Ca) # updates model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # continue until no more variables can improve model
anova (testRDA, by="terms", step=500) # tests terms in final model
testRDA
summary(testRDA, scaling=0, axes=4)
plot(testRDA, scaling=2)
Finally, examine the variable inflation factors in both the full model and reduced models.
vif.cca(outputRDA)
vif.cca(testRDA)
An ANOSIM can be performed using the anosim function provided by the vegan package. Similarity percentages can be determined with the simper function, also provided by the vegan package.
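A minimal sketch, assuming the speciesData matrix from above and a hypothetical factor groups assigning each sample (row) to a group:

library(vegan)
# test whether community composition differs among the groups
groupTest <- anosim(speciesData, grouping=groups, distance="bray", permutations=999)
summary(groupTest)
# identify which taxa contribute most to between-group dissimilarity
simper(speciesData, group=groups)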
indVAL?
Transfer functions can be built using the WA function provided by the rioja package.
For example, the midge/VWHO calibration set described in detail by Quinlan and Smol (2001, 2010) is provided as a dataset called vwhoQuinlan2010 by the jezioro package.
Therefore the Quinlan and Smol midge/VWHO model can be built using the following code:
library(jezioro)
library(rioja)
data(vwhoQuinlan2010)
vwho.model <- WA(vwhoQuinlan2010[,3:47], vwhoQuinlan2010[,2], tolDW=TRUE)
The Quinlan and Smol papers settled on the WA.inv.tol_XVal model as having the best performance metrics (RMSE=1.98 mg/L, R2=0.5998). To verify that the vwho.model object has the same performance as the model described by Quinlan and Smol, use the crossval function also provided by rioja.
library(rioja)
rioja::crossval(vwho.model, cv.method="loo", verbose=FALSE, ngroups=10, nboot=100, h.cutoff=0, h.dist=NULL)
##
## Method : Weighted Averaging
## Call : WA(y = vwhoQuinlan2010[, 3:47], x = vwhoQuinlan2010[, 2], tolDW = TRUE)
##
## Tolerance DW : Yes
## Monotonic deshrink : No
## No. samples : 54
## No. species : 45
## Cross val. : loo
##
## Deshrinking regression coefficients:
## Inverse d/s Classical d/s
## wa.b0 -14.1270 4.1927
## wa.b1 3.7760 0.1758
## wa.tol.b0 -12.0037 3.9839
## wa.tol.b1 3.4135 0.2007
##
## Performance:
## RMSE R2 Avg.Bias Max.Bias Skill
## WA.inv 1.8168 0.6636 0.0000 2.7096 66.3643
## WA.cla 2.2302 0.6636 0.0000 2.5285 49.3166
## WA.inv.tol 1.7577 0.6852 0.0000 2.4689 68.5171
## WA.cla.tol 2.1235 0.6852 0.0000 2.1941 54.0510
## WA.inv_XVal 1.9896 0.5967 -0.0209 3.0363 59.6633
## WA.cla_XVal 2.3260 0.6085 -0.0389 2.9700 44.8688
## WA.inv.tol_XVal 1.9817 0.5998 0.0017 2.8652 59.9821
## WA.cla.tol_XVal 2.2577 0.6117 -0.0011 2.7239 48.0607
Once you are confident that your model is the same as the one described in the manuscripts, it can then be applied down-core to midge sedimentary assemblage data to reconstruct VWHO concentrations.
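A minimal sketch of that final step, assuming a hypothetical matrix coreData of down-core midge percentages whose columns use the same taxon codes as the calibration set:

# apply the calibration model to the down-core assemblages
vwho.recon <- predict(vwho.model, newdata=coreData)
# inferred VWHO for each interval, one column per WA model variant
head(vwho.recon$fit)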
Data sets can be merged on common columns using the join function provided by the analogue package.
This can be useful when comparing downcore data sets with a calibration set, by creating a merged data set that adds rows/columns (populated by zeroes) as necessary for those taxa absent from either of the original datasets.
library(analogue)
calSet <- data.frame(taxa1=c(0.25, 0.75, 0.25),taxa2=c(0.20, 0.05, 0.35), taxa3=c(0.55, 0.20, 0.50), row.names=c("cal1", "cal2", "cal3"))
testSet <- data.frame(taxa1=c(0.25, 0.25, 0.50),taxa4=c(0.75, 0.10, 0.15), row.names=c("test1", "test2", "test3"))
join(calSet, testSet, type="outer", split=FALSE)
The aggregate function provided in a base installation of R can also be useful when harmonizing disparate data sets.
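For example, a minimal sketch assuming a hypothetical data frame taxonCounts with a 'lake' column and several numeric count columns (multiple rows per lake):

# sum every count column within each level of 'lake'
aggregate(. ~ lake, data=taxonCounts, FUN=sum)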
Identifying whether a fossil data set has any close modern analogs within a calibration set (i.e. evaluating a transfer function) can be done using the analog function provided by the analogue package.
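A minimal sketch, assuming hypothetical data frames calSpecies (calibration set) and fossilSpecies (fossil samples) that share the same taxon columns (e.g. after merging with join as described above):

library(analogue)
# dissimilarities between each fossil sample and every calibration-set sample
modernAnalogs <- analog(calSpecies, fossilSpecies, method="chord")
# minimum dissimilarity (i.e. the closest modern analog) for each fossil sample
minDC(modernAnalogs)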
The analogue package also contains a vignette with a detailed analog matching example that follows the approach described in Flower et al. (1997) and Simpson et al. (2005).
library(analogue)
vignette("analogue_methods")
Regression trees can be constructed using the rpart package.
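A minimal sketch, assuming a hypothetical data frame modelData containing a response column named 'response' and several predictor columns:

library(rpart)
# fit a regression tree predicting 'response' from all other columns
treeModel <- rpart(response ~ ., data=modelData)
# plot the tree and label its nodes
plot(treeModel)
text(treeModel)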
Davies Test
Locally Estimated Scatterplot Smoothing (LOESS) can be performed using the loess function provided by the base stats package.
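A minimal sketch, assuming hypothetical numeric vectors depth and proxy of equal length:

# fit a LOESS smoother of the proxy values against depth ('span' controls the degree of smoothing)
loessFit <- loess(proxy ~ depth, span=0.75)
# plot the raw data and overlay the fitted smoother (ordered by depth)
plot(depth, proxy)
ord <- order(depth)
lines(depth[ord], predict(loessFit)[ord])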
Generalized Additive Models (or GAMs) can be constructed using the mgcv package.
GAMs are increasingly applied in paleolimnological analyses.
Simpson (2018) provides a detailed introduction to the use of GAMs and their construction using R (the supplementary information for the article contains the annotated R code used to perform the described analyses).
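A minimal sketch, assuming a hypothetical data frame coreData with columns 'age' (the estimated age of each interval) and 'proxy' (the measured variable):

library(mgcv)
# fit a GAM with a smooth term for age, using REML to estimate the degree of smoothness
gamFit <- gam(proxy ~ s(age, k=10), data=coreData, method="REML")
summary(gamFit)
# plot the fitted smooth trend
plot(gamFit)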
The dplyr package provides several functions that act intuitively as verbs for data manipulation.
These functions include: filter, select, mutate, arrange, and summarise.
%>%
The %>% command provided by the magrittr package functions similarly to the Unix pipe "|" command. It "pipes" the output from one function directly into the next (as its first argument), allowing intermediate steps to be bypassed, reducing the need for repetition in the code.
Note that due to the special characters in %>%, pulling up its help file requires:
?magrittr::`%>%`
A simple example:
library(magrittr)
x <- c(5,4,3,2,1)
x %>% mean
## [1] 3
A more complicated example using %>% in combination with some of the data frame manipulation verbs provided by dplyr:
Import the vwhoQuinlan2010 dataset from jezioro, and convert it to a data frame.
data(vwhoQuinlan2010)
vwhoQuinlan2010 <- as.data.frame(vwhoQuinlan2010)
Calculate the mean VWHO of those lakes (rows) where TANYSL <10 and MICROP >30.
library(dplyr)
filter(vwhoQuinlan2010, MICROP>30) %>% filter(TANYSL<10) %>% summarise(mean(VWHO))
## mean(VWHO)
## 1 4.6
ggplot2
ggplot2 is a data visualization package that implements concepts outlined in 'The Grammar of Graphics', a book 'about grammatical rules for creating perceivable graphs'.
The syntax used to produce plots with ggplot2 is quite different from that of the base graphics system. For example:
library(ggplot2)
activities <- read.csv(file="data/ggplot2_activities.csv", header =T)
ggplot(activities, aes(x=Lake, y=Pb210, ymin=Pb210-Pb210err, ymax=Pb210+Pb210err)) +
theme(legend.position=c(0.90, 0.85), axis.text.x=element_text(size=8)) +
geom_point(aes(reorder(Lake, -Pb210))) +
geom_errorbar(aes(reorder(Lake, -Pb210), colour=Interval)) +
ylab("Total 210Pb Activity (Bq/kg)")
The ggplot2 package is able to construct very complex figures (an extensive list of examples is available online), and there are extensive resources available detailing its use. However, at PEARL we occasionally need to break some of its rules (e.g. multiple y-axes), thus ggplot2 is not the best tool for producing stratigraphies that show both depth and age.
With the appropriate shapefiles, basic maps can be generated using ggplot2 and rgdal (Bindings for the 'Geospatial' Data Abstraction Library).
library(ggplot2)
library(rgdal)
For example, to add a few points to an outline of Ontario, first import the shapefile and fortify it into a data frame usable by ggplot2.
ontario<-readOGR("data/ontarioShapefile/Ont2.shp", layer="Ont2")
ontario_df <- fortify(ontario)
Then create a data frame containing the coordinates of the sites to be plotted.
sites <- data.frame(name=c("queens", "lake"), decLatitude=c(44.22, 45.18), decLongitude=c(-76.50, -78.83), stringsAsFactors = FALSE)
Finally, plot the outline with geom_polygon and add the points with geom_point. Note that by default this will use a Mercator projection centered on the Prime Meridian.
map <- ggplot() +
geom_polygon(data = ontario_df, aes(x = long, y = lat, group = group), fill="grey40") +
geom_point(data=sites, aes(x=decLongitude, y=decLatitude))
print(map)
The projection can be changed using coord_map. For example, to change the previous map to a Lambert projection with a true scale between latitudes of 40 and 60 degrees N:
#The projection can be changed using 'coord_map'
map <- map +
coord_map("lambert", lat0=40, lat1=60)
print(map)
Stratigraphies can be produced using the strat.plot function provided by the rioja package.
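A minimal sketch, assuming a hypothetical data frame coreSpecies of down-core species percentages (taxa as columns) and a vector coreDepths giving the depth of each interval:

library(rioja)
# plot each taxon as a separate panel against depth, with depth increasing down the y-axis
strat.plot(coreSpecies, yvar=coreDepths, y.rev=TRUE, scale.percent=TRUE, ylabel="Depth (cm)")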
Word clouds are a neat way to present and visualize text data. The following example demonstrates generating the word frequency tables (along with text stemming and stem completion) necessary to produce word clouds.
There are several ways to produce a word cloud using R; here functions from the tm package are used to mine text data, from SnowballC to stem text, and from wordcloud2 to generate the word clouds. If not already installed on your system, all three packages are available from CRAN, and can be installed with:
install.packages(c("wordcloud2", "tm", "SnowballC"))
Once the packages are installed, make them available for use by loading them:
library(wordcloud2)
library(tm)
library(SnowballC)
Word clouds require words, so import your text data into the R environment:
rawText <- readLines(file.choose())
The following example uses the first three paragraphs from the introduction of Volume 1 of the DPER series as input data.
rawText <- "Paleolimnology, the interpretation of past conditions and processes in lake basins, is a multidisciplinary science whose roots extend back nearly two centuries. Despite this long history of development, the science has seen a surge of interest and application over the past decade. Today paleolimnology assumes a pivotal role in paleoclimatic and global change investigations, many fields of environmental science, and hydrocarbon and mineral resource exploration and exploitation. Associated with this dramatic increase in researchactivity involving lake sediments, there has been an equally rapid advance in the techniques and methods used by paleolimnologists. The objective of this volume is to provide a state-of-the-art summary of the major field methods, chronological techniques, and concepts used in the study of large-scale lacustrine basin analysis. This and the other techniques volumes in this series build on the foundation provided by previous compilations of paleoenvironmental techniques, such as Kummel & Raup (1965), Bouma (1969), Carver (1971), Berglund (1986), Gray (1988), Tucker (1988), Warner (1990), and Rutter & Catto (1995), many of which continue to serve as essential handbooks. However, the development of new and different methods for studying lake sediments over the past decade, as well as advancements and modifications to old methods, have provided impetus for a new series of monographs for this rapidly expanding topic. Three additional books from this series deal with other components of paleolimnology. Volume 2 (Last & Smol, 2001) focuses on the vast array of physical, mineralogical, and geochemical parameters that can be applied to the interpretation of lake histories. Volumes 3 and 4 (Smol et al., 2001a, b) address the great range of biological techniques that have been and continue to be such an important aspect of many paleolimnological efforts. Although chapters in each of these books discuss the quantitative aspects of lake sediment interpretations, a separate volume on statistical and data handling approaches in paleolimmnology is currently in preparation (Birks et al., in preparation). Our intent with this series of volumes is to provide sufficient methodological and technical detail to allow both practitioners and newcomers to the area of paleolimnology to use them as handbooks/laboratory manuals and as primary research texts. Although we recognize that the study of pre-Quaternary lakes is a very rapidly growing component in the vast arena of paleolimnology, this volume is directed primarily towards those involved in Quaternary lacustrine deposits and Quaternary environmental change. Nonetheless, many of the techniques and approaches are applicable to other time frames. Finally, we anticipate that many of the techniques discussed in these volumes apply equally well to marine, estuarine, and other depositional environments, although we have not specifically targeted non-lacustrine settings."
To produce a word cloud, a word frequency table is required (i.e. a two column table with the first column listing each word present in the input text, and the second listing the number of times it appears). This type of text mining can be done using tools provided by the tm package.
Some preprocessing of the input text is necessary. The tm package uses a data structure known as a corpus. So, first collapse the rawText data object into a single character string, then convert it into a corpus using the Corpus function provided by the tm package.
rawText <- paste(unlist(rawText, use.names=FALSE), collapse=" ")
rawText <- Corpus(VectorSource(rawText))
Next, clean the text data within the corpus using tm_map to convert all characters to lower case, as well as remove any special characters, punctuation, stop words or extra white space.
cleanCorpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(tolower))
##Replace all instances of “/”, “@” and “|” with a space:
replaceWithSpace <- content_transformer(function(x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, replaceWithSpace, "/")
corpus <- tm_map(corpus, replaceWithSpace, "@")
corpus <- tm_map(corpus, replaceWithSpace, "\\|")
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
cleanText <- cleanCorpus(rawText)
Note that you may want to prevent some additional words from appearing in the word cloud, beyond those contained in the tm stop word list. If so, add them to the character vector cutWords.
cutWords <- c("can", "will", "also")
cleanText <- tm_map(cleanText, removeWords, cutWords)
Now the TermDocumentMatrix function within the tm package can be used to produce a frequency table for each word present in cleanText:
createFreqTable <- function(corpus){
freqTable <- TermDocumentMatrix(corpus)
freqTable <- as.matrix(freqTable)
freqTable <- sort(rowSums(freqTable),decreasing=TRUE)
freqTable <- data.frame(word = names(freqTable), freq=freqTable)
return(freqTable)
}
cleanFreqTable <- createFreqTable(cleanText)
A quick examination of the frequency table will reveal that individual members of word families such as ‘volume’ and ‘volumes’ or ‘provide’ and ‘provided’ are tabulated separately. Thus, the frequency table could be simplified by reducing members of each word family to their root and then combining each instance. This process is known as stemming the document.
head(cleanFreqTable, 40)
To generate a stemmed version of cleanText, use the stemDocument function provided by tm and then generate a new frequency table using our createFreqTable function from earlier:
stemText <- tm_map(cleanText, stemDocument)
stemFreqTable <- createFreqTable(stemText)
However, it is unlikely that you would want to use the actual roots in your word cloud (i.e. the root of the 'technique' word family is 'techniqu', etc.):
head(stemFreqTable, 40)
The next step is to complete the roots using the most frequent occurrence in the input text of the inflected/derived words. This is done using another function provided by tm, stemCompletion; however, it works upon character vectors, so both cleanText and stemText must be converted first:
corpus2CharVect <- function(corpus){
corpus <- unlist(corpus[[1]][1], use.names=FALSE)
corpus <- strsplit(corpus, " ")
corpus <- unlist(corpus, use.names=FALSE)
return(corpus)
}
cleanWords <- corpus2CharVect(cleanText)
stemWords <- corpus2CharVect(stemText)
Complete the contents of stemWords using cleanWords as the dictionary, and convert the output to a corpus, so that the final frequency table can be created.
completeText <- stemCompletion(stemWords, cleanWords, type="prevalent")
completeText <- Corpus(VectorSource(completeText))
completeFreqTable <- createFreqTable(completeText)
head(completeFreqTable, 40)
The frequency table of the stemmed and completed text will be the input for the wordcloud2 function. The visual appearance of the word cloud can be modified using specific arguments to change its size/shape/colour etc. Also note that if the most common words do not appear, it may be because the default sizes are too large for the plot window; this can be fixed by reducing the value given to the size argument.
?wordcloud2
wordcloud2(data=completeFreqTable, size=.5, ellipticity = 0.65, shape="circle")
R Markdown is an authoring framework that allows a single file to both save and execute code, and also generate high quality formatted reports with the rmarkdown package.
This document is maintained in a single file written in R Markdown called RGuide.rmd; the html file is then generated with:
library(rmarkdown)
render("labGuide.rmd")
Hill MO (1973) Diversity and evenness: a unifying notation and its consequences. Ecology 54: 427-432
Blaauw M, Christen JA (2011) Flexible paleoclimate age-depth models using an autoregressive gamma process. Bayesian Analysis 6: 457-474
Quinlan R, Smol JP (2001) Chironomid-based inference models for estimating end-of-summer hypolimnetic oxygen from south-central Ontario shield lakes. Freshwater Biology 46: 1529-1551
Quinlan R, Smol JP (2010) Use of Chaoborus subfossil mandibles in models for inferring past hypolimnetic oxygen. Journal of Paleolimnology 44: 43-50
Flower RJ, Juggins S, Battarbee RW (1997) Matching diatom assemblages in lake sediment cores and modern surface sediment samples: the implications for lake conservation and restoration with special reference to acidified systems. Hydrobiologia 344: 27-40
Simpson GL, Shilland EM, Winterbottom JM, Keay J (2005) Defining reference conditions for acidified waters using a modern analogue approach. Environmental Pollution 137: 119-133
Simpson GL (2018) Modelling Palaeoecological Time Series Using Generalised Additive Models. Frontiers in Ecology and Evolution doi:10.3389/fevo.2018.00149