Data Visualisation Using the Package ggplot2

The package graphics

R has numerous plotting functions in the base package graphics. Whilst they can be useful for quickly drawing up a plot, customising them can prove highly challenging.

?plot
?boxplot
?hist

Go to the Examples at the bottom of each help page and copy a few lines

The package ggplot2 gives much more flexibility and power. However, it has a unique syntax and approach where we plot by adding layers of relevant information, based on the concepts outlined in The Grammar of Graphics. As a result, we won’t explore the base graphics capabilities of R any further as ggplot2 is generally far superior. ggplot2 is also one of the core tidyverse packages.

The package ggplot2

Start a new script for this topic called DataVisualisation.R and add the usual first line.

library(tidyverse)

We’ll also use the transport dataset from our previous session. This time, we’ll just overwrite the height column with the same values in m instead of cm. We’ll also remove X1 and add the BMI information directly as we’re loading the data. Hopefully you can see the usefulness of the %>% from this example. This way we never need to edit our raw data files, but perform all manipulation in R in a completely reproducible manner. This strategy also allows for additional data to be provided as more experiments or data points are added.

Remember to load all packages at the top of your R script for easier reference in the future!

csvFile <- file.path("~", "data", "intro_r", "transport.csv")
file.exists(csvFile)
data <- read_csv(csvFile) %>%
  mutate(height = height/100, BMI = weight / height^2) %>%
  select(-X1)
data

The main ggplot2 function is ggplot(). In this first stage we first define the aesthetic mappings using aes(). These simply define what is plotted on which axis, and what defines the colour/shape etc. When calling this function alone no data will be plotted, and we get a blank plot area only.

ggplot(data, aes(x = weight, y = height))

Adding geometry

After defining the aesthetic mappings, we:

  1. Tell R that “more is to come” by adding a + symbol at the end of the line
  2. Add the required geometry using various geom_...() functions

Let’s start by plotting points, for which we need geom_point().

ggplot(data, aes(x = weight, y = height)) +
  geom_point()

There are numerous aesthetics specifically available for geom_point(), so let’s check the help page.

?geom_point

We can colour points, so let’s do this based on the transport column. Note that if we add this during the call to ggplot() this will be passed down to every subsequent geom_* that we call.

ggplot(data, aes(x = weight, y = height, colour = transport)) +
  geom_point()

We can also specify the shape to represent another variable, such as gender.

ggplot(data, aes(x = weight, y = height, colour = transport, shape = gender)) +
  geom_point()

Let’s leave the general aesthetics in ggplot(), with the colour and the shape aesthetics being specifically assigned to the points. Whilst this looks the same on the plot, this will help as we build up the complexity of the plot.

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender))

Anything set outside of the aes() function are general across all points. Here we can make all points larger. Size is a relative scale and doesn’t specifically represent a value in pixels, mm or anything similar.

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender), size = 3)

Layering Multiple Geometry Types

In addition to the points, we can easily add a smoothed line. This defaults to a loess fit, so let’s select a regression line straight away by setting the method argument to "lm".

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender)) +
  geom_smooth(method = "lm")

By default, the standard errors for the line are represented by shading, and we can remove these bands by setting se = FALSE.

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender)) +
  geom_smooth(method = "lm", se = FALSE)

Axis and Legend Labels

Axis and legend labels can be added using labs(). The default values are the column names, so we can tidy those up fairly easily.

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (kg)", y = "Height (m)", shape = "Gender", colour = "Transport")

Using Facets

One of the most useful features of ggplot2 is the ability to create multiple sub-plots using the function facet_wrap(). Let’s create separate plots based on the gender column. This will also plot separate regression lines.

ggplot(data, aes(x = weight, y = height)) +
  geom_point(aes(colour = transport, shape = gender)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (kg)", y = "Height (cm)", shape = "Gender", colour = "Transport") +
  facet_wrap(~gender) 

In important piece of syntax we have just seen for the first time is the use of the tilde (~). In R this is how we often define statistical models, and con often be interpreted as depends on, so in this call to facet_wrap() we are asking ggplot to draw facets that depend on the values in the gender column.

Different geoms

Enter geom_ in the Console followed by the tab key. Numerous options will appear such as geom_density() and geom_boxplot(). You may need to think carefully about what is plotted on which axis for some of these so pay careful attention.

Here’s geom_density() setting a degree of transparency to look amazing.

ggplot(data, aes(x = height, fill = gender)) +
  geom_density(alpha = 0.5)

We can easily make boxplots.

ggplot(data, aes(x = gender, y =height, fill = gender)) +
  geom_boxplot()

We could plot separate facets based on the transport method as well.

ggplot(data, aes(x = gender, y =height, fill = gender)) +
  geom_boxplot() +
  facet_wrap(~transport)

Histograms use geom_histogram()

ggplot(data, aes(x = height)) +
  geom_histogram(bins = 20)

These tend to look terrible out of the box and need a little tweaking though.

ggplot(data, aes(x = height)) +
  geom_histogram(bins = 20, fill = "grey50", colour = "black")

Combining With dplyr

We can use dplyr (or any other) function to summarise our data before plotting, then pipe (%>%) into ggplot() without modifying our original data object. As we saw last time, this is loaded by default when we called library(tidyverse).

Let’s start working towards a barplot with error bars, which will use geom_bar(). geom_bar() has a slightly unusual set of default arguments, so we will need to provide the argument stat = "identity". This simply tells geom_bar() to use the values provided and not to try summing anything before plotting.

data %>%
  group_by(transport, gender) %>%
  summarise(mn_height = mean(height), sd_height = sd(height)) %>%
  ggplot(aes(x = transport, y = mn_height, fill = transport)) +
  geom_bar(stat = "identity") +
  facet_wrap(~gender) +
  guides(fill = FALSE)

To add error bars we now just use geom_errorbar()

data %>%
  group_by(transport, gender) %>%
  summarise(mn_height = mean(height), sd_height = sd(height)) %>%
  ggplot(aes(x = transport, y = mn_height, fill = transport)) +
  geom_bar(stat = "identity") +
  geom_errorbar(
    aes(ymin = mn_height - sd_height, ymax = mn_height + sd_height),
    width = 0.6
  ) +
  facet_wrap(~gender) +
  guides(fill = FALSE)

Notice that geom_errorbar() needed a lot of information, so we spread our code out across multiple lines. This is a fairly useful trick for making your code more readable at a later point in time.

Bonus Material

Please only attempt this section if you have more than an hour left of the practical session. Otherwise, this is simply here as a helpful guide for future reference.

Setting Themes

ggplot2 uses the idea of themes to control the overall appearance. The default has white grid-lines and a grey background, with a good alternative being theme_bw().

ggplot(data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  theme_bw()

Try some of the alternatives like theme_void() and theme_classic()

The theme() function is also where you set axis.text, legend and a whole raft of other attributes. The syntax often uses elements to set an attribute, which can take a while to come to grips with. Let’s check the help page.

?theme

As an example, changing text within themes uses element_text()

?element_text

Let’s change the text on the x-axis of our last plot. Hopefully this gives you enough ideas on how to customise your own plots.

ggplot(data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
What do hjust and vjust do in the above code? They control the horizontal and vertical adjustment of the text, using the direction of the letters as the reference point. hjust = 1 means right alignment of the text in the horizontal direction, whilst vjust = 0.5 means centre alignment in the vertical direction. Both these values take vales between 0 & 1.

Legend Positions

We can also move the legend to multiple places:

ggplot(data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  theme_bw() +
  theme(legend.position = "bottom")

Or we can use co-ordinates to place it inside the plotting region. This method treats the plot area as a grid of width = height = 1.

ggplot(data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  theme_bw() +
  theme(legend.position = c(0.85, 0.15))

Colours and Axes

The scale of axes or fill colours can also be set manually to be reverse or on the log10 scale.

ggplot(data, aes(x = gender, y = height, fill = gender)) +
  geom_boxplot() +
  theme_bw() +
  theme(legend.position = c(0.85, 0.15)) +
  scale_y_continuous(limits = c(1.5, 2)) +
  scale_fill_manual(values = c("grey70", "lightblue"))

In the last line of the above, we’ve also set the colour manually. Once you get used to this plotting approach, it’s very easy to customise and make great looking plots.