
Inspecting Data

Once we have our data loaded into R, the first thing we should do is check that it looks right.

The toothData file contains data on the effect of Vitamin C on tooth growth in guinea pigs, using two delivery methods and three dose levels. The main variable of interest is the length of the teeth, contained in the column len. The Vitamin C was delivered using two different types of supplement, and these are recorded in the supp column as OJ and VC. The dose given is recorded in the column dose as Low, Med and High, representing levels of 0.5, 1 and 2 mg respectively.

We can look at the first few rows using the function head():

head(toothData)
##    len supp dose
## 1  4.2   VC  Low
## 2 11.5   VC  Low
## 3  7.3   VC  Low
## 4  5.8   VC  Low
## 5  6.4   VC  Low
## 6 10.0   VC  Low

We can obtain an overall summary of the data using the function summary():

summary(toothData)
##       len        supp      dose   
##  Min.   : 4.20   OJ:30   Low :20  
##  1st Qu.:13.07   VC:30   Med :20  
##  Median :19.25           High:20  
##  Mean   :18.81                    
##  3rd Qu.:25.27                    
##  Max.   :33.90

Or we could count the rows to find how many samples there were.

nrow(toothData)
## [1] 60

If you found 60 rows, everything has loaded correctly so we can move on.
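A few further checks can confirm that each column has the expected type and that the design is balanced. The following is a minimal sketch in base R. It assumes that toothData matches R's built-in ToothGrowth dataset with the numeric dose recoded as Low, Med and High, so it rebuilds the object that way in order to run on its own; if you have already loaded the file, skip the first two statements.

```r
# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose (0.5, 1, 2) recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Column types: len should be numeric, supp and dose should be factors
str(toothData)

# The order of the dose levels controls the order used in plots and summaries
levels(toothData$dose)

# Check for missing values anywhere in the data
anyNA(toothData)

# Count the samples for each combination of supplement and dose
table(toothData$supp, toothData$dose)
```

If the dose levels appear in alphabetical order (High, Low, Med) rather than Low, Med, High, the column was probably read as plain text and would need converting to a factor with the levels given explicitly.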

Plotting the data using ggplot2

Installing ggplot2

R makes visually inspecting the data a relatively straightforward task, and in particular, the package ggplot2 has some powerful plotting functions. If you don’t already have this package installed, you can install it using the following command:

install.packages("ggplot2")

This is the simplest method for installing packages from the main R package repository, known as CRAN. A package is essentially a collection of functions for some type of analysis. To make the functions that ggplot2 contains available, we load the entire library of functions using:

library("ggplot2")

ggplot2 works in a slightly unusual way for an R package, in that we build up layers (or lines) of code, which can give us enormous power and flexibility in the way we plot our data. In the first line, we specify the R object (e.g. toothData) that we are plotting, and then we specify which column goes on which axis, as part of the plotting aesthetics (aes()). In the following lines, we will plot the dose levels on the x-axis, and the tooth length on the y-axis.

By adding the + at the end of the first line, we are telling R that there is another line of code to come, and this is where we add the plotting “geometry”. Here, the geometry we are choosing is a boxplot and as this is our last layer, we don’t need the + at the end of this line.

ggplot(toothData, aes(x = dose, y = len)) +
  geom_boxplot()
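Because a plot is built up in layers, it can also be saved as an R object and extended later. The sketch below illustrates this. It assumes toothData matches the built-in ToothGrowth dataset (rebuilt here so the code runs on its own), and the object names p and p_box are just illustrative.

```r
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# The first layer alone defines the data and axes, but draws no geometry
p <- ggplot(toothData, aes(x = dose, y = len))

# Adding a geometry to the saved object gives the same boxplot as before
p_box <- p + geom_boxplot()

# Typing the name of the object draws the plot
p_box
```

Note that the + needs to sit at the end of a line rather than at the start of the next one, otherwise R treats the first line as a complete command.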

By specifying the additional parameter to fill the boxes using the supplement method (fill = supp), the boxes will now be separated based on this parameter, at each dose level.

ggplot(toothData, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot()

Four of the most commonly used plotting aesthetics are listed below, giving us up to four columns we can use in the plots. Others, such as shape and size, are also available.

  1. x: the column to be displayed on the x-axis
  2. y: the column to be displayed on the y-axis
  3. fill: the column to be used to fill the shapes we are using to plot.
  4. colour: the column to be used to change the outline of the shapes.
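An aesthetic placed inside aes() is mapped to a column, whereas the same argument given directly to the geometry sets one fixed value for every shape. The sketch below contrasts the two. It assumes toothData matches the built-in ToothGrowth dataset, and the names p_mapped and p_fixed and the colour "grey80" are arbitrary choices for illustration.

```r
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Mapped: the fill depends on the supp column, so each dose gets two boxes
p_mapped <- ggplot(toothData, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot()

# Fixed: every box gets the same fill, so each dose gets one box
p_fixed <- ggplot(toothData, aes(x = dose, y = len)) +
  geom_boxplot(fill = "grey80")

p_mapped
p_fixed
```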

There are also many different types of plot geometry we can easily generate using ggplot2. Instead of boxplots, we could have plotted the raw data points, using the supplement method as the colour:

ggplot(toothData, aes(x = dose, y = len, colour = supp)) +
  geom_point()
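Because dose is categorical, many of the points above sit on top of each other. One possible remedy, shown here as a sketch only, is to spread the points sideways using geom_jitter(). The code assumes toothData matches the built-in ToothGrowth dataset, and the width of 0.2 is an arbitrary choice.

```r
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# width adds random horizontal noise within each dose level;
# height = 0 requests no vertical noise for the tooth lengths
p_jitter <- ggplot(toothData, aes(x = dose, y = len, colour = supp)) +
  geom_jitter(width = 0.2, height = 0)

p_jitter
```

The horizontal positions are random, so the plot will look slightly different each time it is drawn.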

In the next example, we are looking at the distribution of tooth length across the entire dataset as a histogram.

ggplot(toothData, aes(x = len)) +
  geom_histogram()
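With no further arguments, geom_histogram() divides the data into 30 bins and prints a message suggesting that a better value be chosen. The sketch below sets the bin width explicitly. It assumes toothData matches the built-in ToothGrowth dataset, and the width of 2 is only an illustrative choice.

```r
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# binwidth is given in the units of the x-axis (here, tooth length)
p_hist <- ggplot(toothData, aes(x = len)) +
  geom_histogram(binwidth = 2)

p_hist
```

Trying a few different values for binwidth is usually worthwhile, as the apparent shape of a distribution can change with the bins chosen.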

Instead of a histogram, we could also plot a smoothed density.

ggplot(toothData, aes(x = len)) +
  geom_density()

We could also plot the densities for each Vitamin C supplement method. The additional parameter being passed to the density here (alpha) controls the transparency of the colours used to shade each density.

ggplot(toothData, aes(x = len, fill = supp)) +
  geom_density(alpha = 0.5)

ggplot2 can also separate plots based on a categorical variable, using an additional layer that breaks the plot into “facets”. Here we use the R formula syntax, in which the tilde ~ can be read as “is a function of”. By leaving the left-hand side blank, we are implying that everything is a function of dose, so a separate panel will be generated for each dose level.

ggplot(toothData, aes(x = len, fill = supp)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~dose)
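The left-hand side of the formula does not have to be blank. As a sketch, the related function facet_grid() places the variable on the left-hand side in rows and the variable on the right-hand side in columns. The code assumes toothData matches the built-in ToothGrowth dataset, and the name p_grid is illustrative.

```r
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Rows are separated by supp, columns by dose,
# giving one panel for each combination of the two
p_grid <- ggplot(toothData, aes(x = len, fill = supp)) +
  geom_density(alpha = 0.5) +
  facet_grid(supp ~ dose)

p_grid
```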

We can produce very detailed plots using multiple lines of code, which unfortunately we don’t have enough time to fully explore today.

ggplot(toothData, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot() +
  labs(x = "Dose Level", y = "Tooth Length") +
  theme_bw()

What did the command labs(...) do?

What effect did the line theme_bw() have?

Exploring the data using dplyr

The R package dplyr makes exploring data.frame objects very simple, and contains numerous functions for this purpose.

If you don’t have it installed, then use the same procedure as for ggplot2.

library(dplyr)

We can now filter the data using filter() instead of subset(). This behaves like the Auto Filter in Excel, so we can apply multiple filters together, and only the rows which satisfy every condition are kept.

filter(toothData, dose == "Low")
filter(toothData, dose == "Low", supp == "VC")

In R, the syntax for not equal (!=) is the same as in most programming languages.

filter(toothData, dose != "Low", supp == "VC")

Again, note that we are not creating any new R objects in the above. We are just printing the results into the Console.
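To keep a filtered result for later use, it needs to be assigned to a new object. The sketch below shows this, along with the %in% operator for matching against more than one value. It assumes toothData matches the built-in ToothGrowth dataset, and the names lowVC and notLow are illustrative.

```r
library(dplyr)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Assigning the result creates a new object instead of printing to the Console
lowVC <- filter(toothData, dose == "Low", supp == "VC")
nrow(lowVC)

# %in% keeps the rows whose dose matches any of the values listed
notLow <- filter(toothData, dose %in% c("Med", "High"))
nrow(notLow)
```

The original toothData object is left unchanged by both commands.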

Grouping Data

The package dplyr works very well with a relatively new R feature called the pipe (%>%), which comes from the package magrittr and is made available when dplyr is loaded. The pipe takes the output of one process and places it as the first argument to the next process. An alternative way to perform the previous step would be to use the code

toothData %>%
  filter(dose != "Low", supp == "VC")
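To check that the two forms really are equivalent, we can save both results and compare them. This is a small sketch, assuming toothData matches the built-in ToothGrowth dataset. The names nested and piped are illustrative.

```r
library(dplyr)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Standard form: the data is given as the first argument
nested <- filter(toothData, dose != "Low", supp == "VC")

# Piped form: the pipe supplies toothData as the first argument
piped <- toothData %>%
  filter(dose != "Low", supp == "VC")

# Compare the two results
identical(nested, piped)
```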

This becomes extremely powerful when we want to perform a string of steps on a data object.

toothData %>%
  group_by(supp, dose) %>%
  summarise(mn = mean(len), sd = sd(len), n = n())
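The chain can be extended with further steps. As a sketch, the version below saves the summary and sorts it by mean tooth length using arrange(). It assumes toothData matches the built-in ToothGrowth dataset and that dplyr 1.0.0 or later is installed, since the .groups argument is not available in earlier versions. The name toothSummary is illustrative.

```r
library(dplyr)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

toothSummary <- toothData %>%
  group_by(supp, dose) %>%
  # .groups = "drop" returns an ungrouped result (assumes dplyr >= 1.0.0)
  summarise(mn = mean(len), sd = sd(len), n = n(), .groups = "drop") %>%
  # Sort from the largest mean tooth length to the smallest
  arrange(desc(mn))

toothSummary
```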

This is a very powerful package, which really takes a few hours to explore fully. In combination with ggplot2 you can thoroughly explore very complex datasets.
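As one illustration of combining the two packages, the output of a dplyr step can be piped directly into ggplot(). The sketch below assumes toothData matches the built-in ToothGrowth dataset, and the name p_vc is illustrative. Note that the dplyr steps are joined using %>% while the ggplot2 layers are still joined using +.

```r
library(dplyr)
library(ggplot2)

# Assumption: toothData corresponds to the built-in ToothGrowth dataset,
# with dose recoded as a factor with levels Low, Med, High.
toothData <- datasets::ToothGrowth
toothData$dose <- factor(toothData$dose,
                         levels = c(0.5, 1, 2),
                         labels = c("Low", "Med", "High"))

# Keep only the VC samples, then plot tooth length at each dose level
p_vc <- toothData %>%
  filter(supp == "VC") %>%
  ggplot(aes(x = dose, y = len)) +
  geom_boxplot()

p_vc
```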
