Loading All Packages

library(readr)
library(dplyr)
library(tibble)
library(magrittr)
library(edgeR)
library(ggplot2)
library(ggbio)
library(biomaRt)
library(stringr)
library(reshape2)
library(magrittr)
library(ggrepel)
library(limma)
library(Glimma)
library(pheatmap)
library(ggbio)
library(GenomicRanges)
library(GenomeInfoDb)
library(biovizBase)

Data Preparation

Today’s RNA-Seq data

Today’s data was obtained from the GEO dataset GSE89057 from an experiment titled: “RNA-seq of SOX5 overexpressing primary human neuronal progenitors”

Full protocols can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89057

We will simply be using the raw counts for each gene, and differential expression analysis will not form a significant part of today’s session. Our main focus will be to explore some different options for visualising data using the R packages ggplot2 and ggbio

The astute amongst you may notice that alignments were performed using GRCh37.73. However, as we will be working with gene-level summaries, we can use GRCh38 with no significant difficulties.

Loading the data

First we need to load in the sample meta-data

mData <- read_delim("data/GSE89057_metaData.txt", delim = "\t")

Now we can load in the count data, then have a glance at the first few lines.

counts <- read.table("data/GSE89057_HTSeq_Unnormalized_Counts.txt", sep = "\t", row.names = 1) 
head(counts)

The rownames here are the gene ids, but we might like to tidy those up a little by removing the suffix.

ensemblIDs <- str_extract(rownames(counts),  "ENSG[0-9]+") 
rownames(counts) <- ensemblIDs

Getting Some Gene Annotations

Before moving on, we can add some summary information about the genes. We’ll use the package biomaRt to download some annotation data. First we’ll need to form an R object called mart which connects to the Biomart database hosted at Ensembl.

mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

Next we’ll specify the key attributes about each gene that we’re interested in, such as:

the common name
the chromosome, strand, start and end positions
the gene biotype

attr2Get <- c("ensembl_gene_id", "external_gene_name", 
              "chromosome_name", "strand", "start_position", "end_position", 
              "gene_biotype")

We can use the function getBM() to download this data directly into our R workspace. This may take a little while to extract then send the information around the world.

geneData <- getBM(attr2Get, filters = "ensembl_gene_id", values = ensemblIDs, mart = mart)

Unfortunately, it looks like some genes are missing and our downloaded data won’t match our count data exactly.

length(ensemblIDs)
nrow(geneData)

We can resolve this using the function left_join() from the package dplyr. This will take the first data_frame as the reference, then add a row filled with NA values for the EnsemblIDs not returned by the function getBM() as they will be missing from the geneData object.

geneData <- left_join(x = data_frame(ensembl_gene_id = ensemblIDs),
                      y = geneData)

Creating a `DGEList` object

Form a DGEList, ignoring some of the columns in the meta data that will already be included as art of the standard DGEListstructure.

dgeList <- DGEList(counts, 
                   samples = dplyr::select(mData, -Condition, -SampleID), 
                   group = mData$Condition,
                   genes = geneData)
dgeList <- calcNormFactors(dgeList)

The Library sizes & total read numbers don’t match, this will likely be due to:

Total_Reads counting read pairs as two separate reads
Some reads not aligning
Some reads aligning multiple times

Alignment and generation of read summaries was performed by the authors prior to uploading and is not of great importance for today.

Some Initial Plots

Plotting Read Totals using a Bar-Plot

A common plot to form might be a simple barplot, showing the total read counts for each sample. We’ll step through each of the plotting stages in the package ggplot2 to understand the key concepts first.

The plotting functions in ggplot2 require data to be in an R object type known as a data.frame, which looks similar to a spreadsheet. These objects have a series of columns, each of which is filled with the same type of values (i.e. they are known as vectors in R lingo).

The samples component of the previous dgeList object is a data.frame, so we can start with this. The first step is to pass this object to the main function ggplot().

Introducing Plotting Aesthetics

At this point we specify what we’d like as the main plotting aesthetics, and these can include what column is to be shown on the x-axis, the y-axis, as well as any columns that can be used to group the values by outline colour, fill colours or plotting shape.

To start with, we can use the Accession ID along the x-axis and the Library Size (lib.size)as the height of the bars. We can fill the bars based on the group column.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group))

Choosing a type of geometry for the plot

This initial command simply lays out the plotting area for us, but doesn’t plot the data. To generate an actual plot, ggplot2 requires us to specify the plotting geometry using a function called a geom which we add to the initial plot layout. There are many types of geom functions, but here we’ll start with geom_bar(). By default this tries to add any like values, so we’ll need to use the additional command stat ="identity" to tell the function to just plot the actual numbers we give it.

Note that this now requires as to use a + symbol at the end of the initial function. This tells R there is more to come, and that we are adding layers to this object.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group)) +
  geom_bar(stat = "identity")

Changing Axis and Legend Titles

The title on the x-axis looks OK, but we can make the titles of the legend and the y-axis look a little nicer by adding capital letters. ggplot2 does this by using the labs() function which we can add as an additional layer, with a + sign after the call to geom_bar().

ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size", fill = "Group")

We might even like to plot those library sizes in millions, which is very easy to do using ggplot2. This can be specified in the initial layout by telling ggplot() to just divide the lib.sizes column by one million (1e06). We’ll also change the axis label to reflect this.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group")

Using themes

Whilst some people like the default grey background, many don’t and this can be simply removed using a default plotting theme called theme_bw(). Once we’ve removed this, we’ll then start to tweak a few more aspects of the plot using this idea of a theme() as well. This is how many of the visual aspects are defined in ggplot2.

Clear the background, and note that this also adds a plot outline, and changes the axis labels and titles to black.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group") +
  theme_bw()

On a laptop, those values on the x-axis can be hard to read. To change these we’ll use the theme() function, so let’s look at this first to see what aspects of the plot we can change (there’s a lot here).

?theme

The aspect we need to change is called axis.text.x, and a range of parameters for the text can be changed. These are wrapped in a function called element_text(), so let’s look at this too before we dig any further.

?element_text

Here you can see that we’re able to change the font family, the font face, the font colour (with a u!) and many more. We’ll use this to rotate the labels by 90$^{\circ}$ so we’ll set the angle of the text.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

The use of elements

In the theme() function plotting aspects which are text based are always set using element_text(), whilst those which are line based are set using element_line(). Things like the plot outline are set using element_rect(), whilst the entire plotting aspect can be removed using element_blank(). If we’d used theme(axis.text.x = element_blank()) in the above, the x-axis labels would have been removed. If you’re racing ahead, try it..

Introducting Facets

The last plotting trick that ggplot2 uses is the ability to subset the plot into multiple facets, or sub-sections within the same plot. We could use the group column to break the previous plot into the two treatment groups using the facet_wrap() function. This uses the common R syntax of ~group which can be interpreted as is dependent on group.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~group)

In this case, ggplot2 has tried to keep the x-axis consistent across both facets, but these can be set as free from each other using scales = "free_x" inside the function.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~group, scales = "free_x")

Guides

Now we’ve done this, we might like to hide the legend, which is done using the guides() function.

ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
  geom_bar(stat = "identity") +
  labs(y = "Library Size (millions)", fill = "Group") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~group, scales = "free_x") +
  guides(fill = FALSE)

OK! Now we’ve covered a huge amount of the terminology for ggplot2, we’ll keep using this and try a few different types of plots.

Boxplots For Each Sample

Another way to look at this data, might be to make a boxplot of the gene counts within each sample. This is also simple using ggplot().

However, remember that we need to give a single column to the y-axis in the plotting aesthetics, so we’ll need to put all counts into a single column. We can do this on the fly using the function melt() from the package reshape2, which literally melts the data into a single column.

We’ll need to use the counts element of the dgeList object here, which is an R data type known as a matrix. This differs from a data.frame in that all columns have the same type of data, which in this instance is all integer values.

Note that we’ve used a useful feature known as the magrittr (%>%) which takes the output of one function, and places it into the first position of the next function. This is equivalent to a pipe in bash (|) for those who are familiar with this. Using this saves the creation of multiple similar objects in our R workspace and keeps them much less cluttered. (See https://en.wikipedia.org/wiki/The_Treachery_of_Images) to be in on this hilarious programmers joke.

Let’s have a sneak peek at how this works.

dgeList$counts %>%
  melt(varnames = c("EnsemblID", "Sample"), value.name = "Count") %>%
  head

Note that now we have a data.frame (i.e. two columns are text, whilst one is an integer), and that now our counts are all in a single column. We can also use the magrittr to place our molten data into the ggplot() function. After we’ve done this, we just keep adding layers like before, but this time our geometry will be geom_boxplot().

dgeList$counts %>%
  melt(varnames = c("EnsemblID", "Sample"), value.name = "Count") %>%
  ggplot(aes(x = Sample, y = Count)) +
  geom_boxplot()

As well as using our previous tricks to remove the grey background and rotate the axis text, we can change the scale of the y axis to be on the log10 scale.

dgeList$counts %>%
  melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
  ggplot(aes(x = SampleID, y = Count)) +
  geom_boxplot() +
  scale_y_log10()

That warning message you’ll see is just letting you know that some zero-valued entries were removed as log10(0) = -$\infty$.

Adding Information from the Meta-Data

Here we’ve started with just the counts, and after using melt() have ended up with three columns: a) the EnsemblID, b) the SampleID and c) the Count of aligned reads for that gene. This doesn’t give us enough information to colour our boxes using the treatment groups, but we know this information is in both the original mData object and the samples element of dgeList. We could get this information from either, so let’s use the left_join() approach we used earlier to get the information from the mData object. Note that we cleverly named our column SampleID to match the column in the mData object.

Let’s check the results from mData first using the head command, then we’ll make the plot

dgeList$counts %>%
  melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
  left_join(mData) %>%
  head

Note how it’s merged the two data.frame objects using the common column name. The function left_join() takes the first object as the reference (the output of melt()) and matches all the values from the second object to this. As this has effectively added the Accession column as well, we could use this as a nicer axis label.

dgeList$counts %>%
  melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
  left_join(mData) %>%
  ggplot(aes(x = Accession, y = Count, fill = Condition)) +
  geom_boxplot() +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

What do you think the `vjust = 0.5` command did in the above?

MDS plots

A common plot in RNA-Seq analysis in the MDS (or multi-dimensional scaling) plot. By default, most people use the function plotMDS() from edgeR.

plotMDS(dgeList)

We can use ggplot2 to tidy this up, by saving the output of plotMDS() as an R object. This will contain the key plotting values hidden away as a matrix, which we can pull out and use to customise our plot.

mds <- plotMDS(dgeList)

Firs we’ll convert this to a data.frame on the fly using the %>% symbol, and a couple of other handy functions to set the column names, then add the rownames as a column. Once we’ve done that, we’ll use the left_join() strategy to incorporate the metadata again. Let’s have a look first.

mds@.Data[[3]] %>%
  as.data.frame() %>%
  set_colnames(c("Dim1", "Dim2")) %>%
  rownames_to_column("SampleID") %>%
  left_join(mData)

OK, now we can plot this using geom_point()as our geometry. We’ll also increase the point size for a nicer plot, as well as manually specifying the colours.

mds@.Data[[3]] %>%
  as.data.frame() %>%
  set_colnames(c("Dim1", "Dim2")) %>%
  rownames_to_column("SampleID") %>%
  left_join(mData) %>%
  ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
  geom_point(size = 3) + 
  scale_colour_manual(values = c("black", "red")) +
  theme_bw()

It appears that one of our Control samples is inconsistent with the others, but which one is it? We can add labels to find this out, using a second layer of geometry via geom_text()

mds@.Data[[3]] %>%
  as.data.frame() %>%
  set_colnames(c("Dim1", "Dim2")) %>%
  rownames_to_column("SampleID") %>%
  left_join(mData) %>%
  ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
  geom_point(size = 3) +
  geom_text(aes(label = Accession)) +
  scale_colour_manual(values = c("black", "red")) +
  theme_bw()

OK, that’s ugly! There’s a variation called geom_text_repel() in the package ggrepel

mds@.Data[[3]] %>%
  as.data.frame() %>%
  set_colnames(c("Dim1", "Dim2")) %>%
  rownames_to_column("SampleID") %>%
  left_join(mData) %>%
  ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = Accession)) +
  scale_colour_manual(values = c("black", "red")) +
  theme_bw()

Now we have a better looking plot, and we clearly know which sample we have to investigate further.

Bonus Interactive MDS Plots

In addition to nicely formatting our MDS plot, the package Glimma enables you to create an MDS plot to provide an interactive output. This will automatically open an html file in your browser, and will also write this file to your working directory.

glMDSPlot(dgeList, labels = dgeList$samples$Accession, groups = dgeList$samples$group)

Hovering above the points will show the information about each sample. By clicking on the barplot on the right of the page, you can select the component which appears on the x-axis, and the subsequent component will appear on the y-axis.

Differential Expression

Now we’ve inspected our data prior to finding differentially expression genes, let’s use limma-voom to obtain differentially-expressed genes and estimates of fold-change (i.e. logFC). First we’ll need to create a design matrix, and then we’ll fit the precision weights with voom(), the linear model for each gene with lmFit() and moderated $t$-statistics using eBayes(). Here we’ll use the magrittr to perform this in one easy series of commands.

des <- model.matrix(~group, data = dgeList$samples)
colnames(des) <- c("Intercept", "SOX5")
fit <- dgeList %>%
  voom(design = des) %>%
  lmFit(design = des) %>%
  eBayes()

Now place the results in a separate object, using the coefficient SOX5 (i.e. the second column of the design matrix).

results <- topTable(fit, coef = "SOX5", n = length(dgeList))
head(results)

Making Our Own Volcano Plot.

An alternative to the default volcano plot is to use -log10() to transform the $p$-values (in the column P.Value in the results object), and plot this on the y-axis, with the estimated fold-change on the x-axis. We can do this using ggplot(), and in this initial call we’ve set the default point colour to grey30 and made the points slightly transparent (alpha = 0.5)

results %>%
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(colour = "grey30", alpha = 0.5) +
  theme_bw()

Colouring DE Genes

Now we might want to colour the differentially expressed genes. Here we’ll create an extra column on the fly, denoting a significantly DE gene as one with an $FDR$-adjusted $p$-value $<0.01$. Notice that we’ve added the colour strictly as part of the aesthetics for the points. The reasons for this will become clear in a few plots time.

results %>%
  mutate(DE = adj.P.Val < 0.01) %>%
  ggplot(aes(x = logFC, y = -log10(P.Value), colour = DE)) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  scale_colour_manual(values = c("grey30", "red")) +
  theme_bw()

That’s a lot of genes, so let’s include a filter such that the estimated logFC is beyond $\pm1$.

results %>%
  mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
  ggplot(aes(x = logFC, y = -log10(P.Value), colour = DE)) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  scale_colour_manual(values = c("grey30", "red")) +
  theme_bw()

And we can also add a label to some of the top genes. Let’s just add labels to the top 10. To do this, we’ve supplied a new object to the geom_text_repel() function, which is just the first 10 lines of the results object. As this won’t have the column DE, this is why we made this colouring specific to the geom_point() function a few lines back. If we’d left this as a global parameter, we’d get an error here.

results %>%
  mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  scale_colour_manual(values = c("grey30", "red")) +
  geom_text_repel(data = results[1:12,],
                  aes(label = external_gene_name)) +
  theme_bw()

We could also change the colour of the labels if we’d like. In this final plot, we’ll also add lines for a logFC of $\pm1$.

results %>%
  mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  scale_colour_manual(values = c("grey30", "red")) +
  geom_text_repel(data = results[1:12,],
                  aes(label = external_gene_name),
                  colour = "red") +
  geom_vline(xintercept = c(-1, 1), colour = "red", linetype = 2) +
  theme_bw()

Colouring Gene Subtypes

Instead of labelling the top DE genes, maybe we would like to see which lincRNA genes are differentially expressed. Increasing the transparency of the main points can help these stand out. Note also, that we’ve simply overlaid these genes using the filtered results object in exactly the same way that we’ve added the labels.

results %>%
  mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
  ggplot(aes(x = logFC, y = -log10(P.Value))) +
  geom_point(aes(colour = DE), alpha = 0.3) +
  scale_colour_manual(values = c("grey30", "red")) +
  geom_point(data = filter(results, 
                                gene_biotype == "lincRNA",
                                adj.P.Val < 0.01,
                                abs(logFC) > 1),
                  colour = "blue") +
  geom_text_repel(data = filter(results, 
                                gene_biotype == "lincRNA",
                                adj.P.Val < 0.01,
                                abs(logFC) > 1),
                  aes(label = external_gene_name),
                  colour = "blue") +
  geom_vline(xintercept = c(-1, 1), colour = "red", linetype = 2) +
  theme_bw()

Making a heatmap

Let’s make a heatmap of these DE lincRNA genes. Firstly, we’ll just create a data.frame with the information for this subset of genes

deLincRNA <- results %>%
  filter(gene_biotype == "lincRNA",
         adj.P.Val < 0.01, 
         abs(logFC) > 1) %>%
  dplyr::select(contains("gene"))

Now we’ll transform the raw counts to $log(CPM)$ (counts per million) and use the list of DE lincRNAs to extract just the data we need. Once we’ve done this, we can pass the results to the function pheatmap() from the package `pheatmap.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  pheatmap()

This has used the default pheatmap colour scheme, which isn’t too bad, but we can tweak this if we’d like. Unfortunately, the method for generating these colours can be a bit tricky so below is a simplified version. What we’re doing in the below is specifying the colours for low, mid and high values, then using the function colorRampPalette() to create a vector of 100 colours moving gradually through the three as specified.

heatCols <- colorRampPalette(c("blue", "white", "red"))(100)

The results are using the three channels of rgb which are encoded in hexadecimal values. The first two digits after the hash represent the red channel, then the next two represent the green channel with the final two digits representing the blue channel. As you can see, the first value created in heatCols is all blue (#0000FF), whilst the last is all red (#FF0000).

Now we can use these colours for our heatmap.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  pheatmap(color = heatCols)

As we can see, dendrograms have automatically been placed on the left and top of our heatmap and we can remove these to leave our rows and columns in the original order.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  pheatmap(color = heatCols,
           cluster_rows = FALSE, 
           cluster_cols = FALSE)

Personally, I like this dataset to be clustered so it shows genes and samples behaving similarly so let’s turn that back on. However, let’s change the rownames to be the common gene name, and the column names to be the GEO Accession ID.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  set_rownames(deLincRNA$external_gene_name) %>%
  set_colnames(dgeList$samples$Accession) %>%
  pheatmap(color = heatCols)

Now that we’ve used the Accession IDs, it’s not so clear which sample is which. We can add a colour guide in pheatmap() using our metadata. This is done by passing a data.frame to the argument annotation_col. However, the rownames of this data.frame must exactly match the column names of our data.matrix.

The colours for these annotations can also be set manually using the argument annotation_colors. For this argument, we need to provide a list with a named component for each annotation type (here we’ve specified Condition in the colAnnotations object). Each level of the annotation must also be specified in the colours by name.

colAnnotations <- data.frame(Condition = mData$Condition, row.names = mData$Accession)
annotColours <- list(Condition = c(Control = "black", SOX5 = "green"))
dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  set_rownames(deLincRNA$external_gene_name) %>%
  set_colnames(dgeList$samples$Accession) %>%
  pheatmap(color = heatCols,
           annotation_col = colAnnotations,
           annotation_colors = annotColours)

This way we can see that each sample type is clustering correctly.

Separators can also be used to break the heatmap based on the branches of the dendrogram. In the following we’ve specified to break the heatmap into two blocks of columns (cutree_cols = 2), and to group the rows into four blocks (cutree_rows = 4).

These block sizes were determined manually, by simple inspection and should be chosen in a way that makes clear sense. This can help give clear indications as which samples are behaving alike, and which genes are also behaving alike.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  set_rownames(deLincRNA$external_gene_name) %>%
  set_colnames(dgeList$samples$Accession) %>%
  pheatmap(color = heatCols,
           annotation_col = colAnnotations,
           annotation_colors = annotColours,
           cutree_cols = 2,
           cutree_rows = 4)

Finally, the width and height of each cell can be set manually to fit the figure for your paper. This value is defined in “points” so may take a bit of experimentation when preparing for publication.

dgeList %>%
  cpm(log = TRUE) %>%
  extract(deLincRNA$ensembl_gene_id,) %>%
  set_rownames(deLincRNA$external_gene_name) %>%
  set_colnames(dgeList$samples$Accession) %>%
  pheatmap(color = heatCols,
           cellwidth = 20,
           annotation_col = colAnnotations,
           annotation_colors = annotColours,
           cutree_cols = 2,
           cutree_rows = 4)

Another bonus Interactive Plot

First we need to define our results using the standard practices of limma. This will produce a column for each column of the design matrix with -1, 0 or 1 indicating whether a gene has been classed as down-regulated, unchanged or up-regulated respectively. The first column here refers to the mean expression & is meaningless in today’s experiment. (Formally it’s testing for an expression level of zero in the control samples.) The second column shows how many DE genes we have due to SOX5 over-expression.

sigDE <- decideTests(fit, p.value = 0.01, lfc = 1)
summary(sigDE)

Now we can launch an interactive plot. However a few things are noteworthy in the following code:

We need to specify the raw count matrix from dgeList as this can be used to manually inspect values in the interactive plot

By default this function requires a column called GeneID so we have to rename the column ensembl_gene_id. Here we’ve just done this on the fly.
We’ll use the group information in the column dgeList$samples$group to colour our samples.
DE status is indicated using the argument status = sigDE[,"SOX5"].

glMDPlot(fit, 
         counts = dgeList$counts, 
         anno = dplyr::rename(dgeList$genes, GeneID = ensembl_gene_id),
         groups = dgeList$samples$group,
         status = sigDE[,"SOX5"]
         )

Using `ggbio`

Introduction to GRanges

Some other types of plot which people find useful are Manhattan plots, for showing data along chromosomes, and circos plots. For these plot types, the package ggbio can be used, and data needs to be defined using GenomicRanges (or GRanges) objects.

At the foundation of a GRanges object is another object known as a Seqinfo object, which contains the key summary information about each chromosome, or sequence that we are plotting. Let’s download this from the UCSC for hg38, as that’s the genome for some sample data we’ll use in a minute. (Please ignore any warning about NCBI seqlevel was set to NA)

hg38 <- fetchExtendedChromInfoFromUCSC("hg38")

If this download fails, load the file from your data directory using the following code.

hg38 <- read_delim("data/chromInfo.txt", delim = "\t", col_names = FALSE) %>%
  set_names(c("UCSC_seqlevel", "UCSC_seqlength", "ref")) %>%
  dplyr::select(-ref) %>%
  mutate(NCBI_seqlevel = if_else(UCSC_seqlevel %in% paste0("chr", 1:22), 
                                 gsub("chr", "", UCSC_seqlevel), NA_character_),
         isCircular = FALSE) %>%
  arrange(as.integer(NCBI_seqlevel))

Now let’s form a seqinfo object for just the autosomes, but we’ll use the values 1:22 instead of crhr1:chr22 for our chromosome names.

autosomeSeq <-  Seqinfo(seqnames = hg38$NCBI_seqlevel[1:22],
                       seqlengths = hg38$UCSC_seqlength[1:22], 
                       isCircular = rep(FALSE, 22), 
                       genome = "hg38")
autosomeSeq

For our sample SNP data, we’ll use some that should be installed as part of the biovizBase package

snp <- read_delim("data/snps.txt", delim = "\t")
head(snp)

Here we have the location and identifier for each SNP, as well as some frequency information a $\chi^2$ statistic and $p$-value. We’ll define each SNP as a range of nucleotides within the genome, except with a width of one. For genes and exons defined as GRanges the width would correspond to the appropriate number of nucleotides, but for SNPs it’s only a single base. To form a GRanges object, it’s also best to arrange these by genomic position. After sorting we’ll also remove those for which we have no $p$-value.

snpGR <- snp %>%
  arrange(CHR, BP) %>%
  filter(!is.na(P)) %>%
  mutate(Sig = P < 0.001) %>%
  as.data.frame() %>%
  transformDfToGr(seqnames = "CHR", start = "BP", width = 1)
seqinfo(snpGR) <- autosomeSeq
snpGR

In this last step we added the correct sequence lengths to the snpGR object. Otherwise these values would have corresponded to the highest value in the BP column for each chromosome.

Karyogram Plots

Let’s check where out SNPs are using plotKaryogram() from the package ggbio. (Ignore any error about valid.GenomicRanges.seqinfo…)

plotKaryogram(snpGR)

This is an extended ggplot2 object, so we can modify using our existing strategies.

plotKaryogram(snpGR) +
  labs(x = "SNP Positions")

Manhattan Plots

To make a Manhattan Plot we simply call the function plotGrandLinear() and we can again transform the $p$-values using the -log10() strategy on the fly.

plotGrandLinear(snpGR, aes(y = -log10(P)))

This plot also has the same basic features as one created using ggplot2 so we can remove the background with theme_bw() and add a cutoff line to indicate significance

plotGrandLinear(snpGR, aes(y = -log10(P)), cutoff = 3) +
  guides(colour = FALSE) +
  theme_bw() +
  theme(panel.grid.minor = element_blank()) +
  labs(x = "Chromosome")

Circos Plots

To build all the layers of a Circos plot, we’ll need to build multiple GRanges objects. First we’ll start with a GRanges objects that just defines the chromosomes.

chrGR <- GRanges(autosomeSeq)

Firstly we’ll just defined the layout

ggbio() +
  layout_circle(chrGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
  layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 75)

Now we’ll setup the results from our SOX5 dataset.

resultsGR <- filter(results, chromosome_name %in% 1:22) %>%
  mutate(strand = if_else(strand == 1, "+", "-"),
         sigDE = abs(logFC) > 1 & adj.P.Val < 0.05) %>%
  arrange(chromosome_name, start_position) %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE,
                           seqinfo = autosomeSeq,
                           start.field = "start_position",
                           end.field = "end_position")

And we could add a layer with the significant SNPs.

ggbio() +
  layout_circle(snpGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
  layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 70) +
  layout_circle(resultsGR, geom = "line", aes(y = logFC, colour = sigDE), 
                radius = 40, trackWidth = 20) +
  layout_circle(subset(snpGR, Sig), geom = "rect", radius = 30, trackWidth = 10)

And make some links between the significant SNPs

linked <- subset(snpGR, Sig)
values(linked)$link2 <- linked[c(3, 4, 1, 2),]

ggbio() +
  layout_circle(snpGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
  layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 70) +
  layout_circle(resultsGR, geom = "line", aes(y = logFC, colour = sigDE), 
                radius = 40, trackWidth = 20) +
  layout_circle(subset(snpGR, Sig), geom = "rect", colour = "red", 
                radius = 30, trackWidth = 10) +
  layout_circle(linked, geom = "link", linked.to = "link2",radius = 20)

Data Visualisation

Steve Pederson

2 December 2016

Loading All Packages

Data Preparation

Today’s RNA-Seq data

Loading the data

Getting Some Gene Annotations

Creating a `DGEList` object

Some Initial Plots

Plotting Read Totals using a Bar-Plot

Introducing Plotting Aesthetics

Choosing a type of geometry for the plot

Changing Axis and Legend Titles

Using themes

The use of elements

Introducting Facets

Guides

Boxplots For Each Sample

Adding Information from the Meta-Data

What do you think the `vjust = 0.5` command did in the above?

MDS plots

Bonus Interactive MDS Plots

Differential Expression

Making Our Own Volcano Plot.

Colouring DE Genes

Colouring Gene Subtypes

Making a heatmap

Another bonus Interactive Plot

Using `ggbio`

Introduction to GRanges

Karyogram Plots

Manhattan Plots

Circos Plots

Data Visualisation

Steve Pederson

2 December 2016

Loading All Packages

Data Preparation

Today’s RNA-Seq data

Loading the data

Getting Some Gene Annotations

Creating a DGEList object

Some Initial Plots

Plotting Read Totals using a Bar-Plot

Introducing Plotting Aesthetics

Choosing a type of geometry for the plot

Changing Axis and Legend Titles

Using themes

The use of elements

Introducting Facets

Guides

Boxplots For Each Sample

Adding Information from the Meta-Data

What do you think the vjust = 0.5 command did in the above?

MDS plots

Bonus Interactive MDS Plots

Differential Expression

Making Our Own Volcano Plot.

Colouring DE Genes

Colouring Gene Subtypes

Making a heatmap

Another bonus Interactive Plot

Using ggbio

Introduction to GRanges

Karyogram Plots

Manhattan Plots

Circos Plots

Creating a `DGEList` object

What do you think the `vjust = 0.5` command did in the above?

Using `ggbio`