library(readr)
library(dplyr)
library(tibble)
library(magrittr)
library(edgeR)
library(ggplot2)
library(ggbio)
library(biomaRt)
library(stringr)
library(reshape2)
library(magrittr)
library(ggrepel)
library(limma)
library(Glimma)
library(pheatmap)
library(ggbio)
library(GenomicRanges)
library(GenomeInfoDb)
library(biovizBase)
Today’s data was obtained from the GEO dataset GSE89057
from an experiment titled: “RNA-seq of SOX5 overexpressing primary human neuronal progenitors”
Full protocols can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89057
We will simply be using the raw counts for each gene, and differential expression analysis will not form a significant part of today’s session. Our main focus will be to explore some different options for visualising data using the R
packages ggplot2
and ggbio
The astute amongst you may notice that alignments were performed using GRCh37.73
. However, as we will be working with gene-level summaries, we can use GRCh38
with no significant difficulties.
First we need to load in the sample meta-data
mData <- read_delim("data/GSE89057_metaData.txt", delim = "\t")
Now we can load in the count data, then have a glance at the first few lines.
counts <- read.table("data/GSE89057_HTSeq_Unnormalized_Counts.txt", sep = "\t", row.names = 1)
head(counts)
The rownames here are the gene ids, but we might like to tidy those up a little by removing the suffix.
ensemblIDs <- str_extract(rownames(counts), "ENSG[0-9]+")
rownames(counts) <- ensemblIDs
Before moving on, we can add some summary information about the genes. We’ll use the package biomaRt
to download some annotation data. First we’ll need to form an R
object called mart
which connects to the Biomart database hosted at Ensembl.
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
Next we’ll specify the key attributes about each gene that we’re interested in, such as:
attr2Get <- c("ensembl_gene_id", "external_gene_name",
"chromosome_name", "strand", "start_position", "end_position",
"gene_biotype")
We can use the function getBM()
to download this data directly into our R
workspace. This may take a little while to extract then send the information around the world.
geneData <- getBM(attr2Get, filters = "ensembl_gene_id", values = ensemblIDs, mart = mart)
Unfortunately, it looks like some genes are missing and our downloaded data won’t match our count data exactly.
length(ensemblIDs)
nrow(geneData)
We can resolve this using the function left_join()
from the package dplyr
. This will take the first data_frame
as the reference, then add a row filled with NA
values for the EnsemblIDs not returned by the function getBM()
as they will be missing from the geneData
object.
geneData <- left_join(x = data_frame(ensembl_gene_id = ensemblIDs),
y = geneData)
DGEList
objectForm a DGEList
, ignoring some of the columns in the meta data that will already be included as art of the standard DGEList
structure.
dgeList <- DGEList(counts,
samples = dplyr::select(mData, -Condition, -SampleID),
group = mData$Condition,
genes = geneData)
dgeList <- calcNormFactors(dgeList)
The Library sizes & total read numbers don’t match, this will likely be due to:
Total_Reads
counting read pairs as two separate readsAlignment and generation of read summaries was performed by the authors prior to uploading and is not of great importance for today.
A common plot to form might be a simple barplot, showing the total read counts for each sample. We’ll step through each of the plotting stages in the package ggplot2
to understand the key concepts first.
The plotting functions in ggplot2
require data to be in an R
object type known as a data.frame
, which looks similar to a spreadsheet. These objects have a series of columns, each of which is filled with the same type of values (i.e. they are known as vectors in R
lingo).
The samples
component of the previous dgeList
object is a data.frame
, so we can start with this. The first step is to pass this object to the main function ggplot()
.
At this point we specify what we’d like as the main plotting aesthetics, and these can include what column is to be shown on the x
-axis, the y
-axis, as well as any columns that can be used to group the values by outline colour
, fill
colours or plotting shape
.
To start with, we can use the Accession
ID along the x
-axis and the Library Size (lib.size
)as the height of the bars. We can fill the bars based on the group
column.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group))
This initial command simply lays out the plotting area for us, but doesn’t plot the data. To generate an actual plot, ggplot2
requires us to specify the plotting geometry using a function called a geom
which we add to the initial plot layout. There are many types of geom
functions, but here we’ll start with geom_bar()
. By default this tries to add any like values, so we’ll need to use the additional command stat ="identity"
to tell the function to just plot the actual numbers we give it.
Note that this now requires as to use a +
symbol at the end of the initial function. This tells R
there is more to come, and that we are adding layers to this object.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group)) +
geom_bar(stat = "identity")
The title on the x
-axis looks OK, but we can make the titles of the legend and the y
-axis look a little nicer by adding capital letters. ggplot2
does this by using the labs()
function which we can add as an additional layer, with a +
sign after the call to geom_bar()
.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size", fill = "Group")
We might even like to plot those library sizes in millions, which is very easy to do using ggplot2
. This can be specified in the initial layout by telling ggplot()
to just divide the lib.sizes
column by one million (1e06
). We’ll also change the axis label to reflect this.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group")
Whilst some people like the default grey background, many don’t and this can be simply removed using a default plotting theme called theme_bw()
. Once we’ve removed this, we’ll then start to tweak a few more aspects of the plot using this idea of a theme()
as well. This is how many of the visual aspects are defined in ggplot2
.
Clear the background, and note that this also adds a plot outline, and changes the axis labels and titles to black.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group") +
theme_bw()
On a laptop, those values on the x
-axis can be hard to read. To change these we’ll use the theme()
function, so let’s look at this first to see what aspects of the plot we can change (there’s a lot here).
?theme
The aspect we need to change is called axis.text.x
, and a range of parameters for the text can be changed. These are wrapped in a function called element_text()
, so let’s look at this too before we dig any further.
?element_text
Here you can see that we’re able to change the font family
, the font face
, the font colour
(with a u
!) and many more. We’ll use this to rotate the labels by 90\(^{\circ}\) so we’ll set the angle
of the text.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
In the theme()
function plotting aspects which are text based are always set using element_text()
, whilst those which are line based are set using element_line()
. Things like the plot outline are set using element_rect()
, whilst the entire plotting aspect can be removed using element_blank()
. If we’d used theme(axis.text.x = element_blank())
in the above, the x
-axis labels would have been removed. If you’re racing ahead, try it..
The last plotting trick that ggplot2
uses is the ability to subset the plot into multiple facets, or sub-sections within the same plot. We could use the group
column to break the previous plot into the two treatment groups using the facet_wrap()
function. This uses the common R
syntax of ~group
which can be interpreted as is dependent on group
.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~group)
In this case, ggplot2
has tried to keep the x
-axis consistent across both facets, but these can be set as free from each other using scales = "free_x"
inside the function.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~group, scales = "free_x")
Now we’ve done this, we might like to hide the legend, which is done using the guides()
function.
ggplot(dgeList$samples, aes(x = Accession, y = lib.size/1e06, fill = group)) +
geom_bar(stat = "identity") +
labs(y = "Library Size (millions)", fill = "Group") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~group, scales = "free_x") +
guides(fill = FALSE)
OK! Now we’ve covered a huge amount of the terminology for ggplot2
, we’ll keep using this and try a few different types of plots.
Another way to look at this data, might be to make a boxplot of the gene counts within each sample. This is also simple using ggplot()
.
However, remember that we need to give a single column to the y
-axis in the plotting aesthetics, so we’ll need to put all counts into a single column. We can do this on the fly using the function melt()
from the package reshape2
, which literally melts the data into a single column.
We’ll need to use the counts
element of the dgeList
object here, which is an R
data type known as a matrix
. This differs from a data.frame
in that all columns have the same type of data, which in this instance is all integer
values.
Note that we’ve used a useful feature known as the magrittr
(%>%
) which takes the output of one function, and places it into the first position of the next function. This is equivalent to a pipe in bash (|) for those who are familiar with this. Using this saves the creation of multiple similar objects in our R
workspace and keeps them much less cluttered. (See https://en.wikipedia.org/wiki/The_Treachery_of_Images) to be in on this hilarious programmers joke.
Let’s have a sneak peek at how this works.
dgeList$counts %>%
melt(varnames = c("EnsemblID", "Sample"), value.name = "Count") %>%
head
Note that now we have a data.frame
(i.e. two columns are text, whilst one is an integer), and that now our counts are all in a single column. We can also use the magrittr
to place our molten data into the ggplot()
function. After we’ve done this, we just keep adding layers like before, but this time our geometry will be geom_boxplot()
.
dgeList$counts %>%
melt(varnames = c("EnsemblID", "Sample"), value.name = "Count") %>%
ggplot(aes(x = Sample, y = Count)) +
geom_boxplot()
As well as using our previous tricks to remove the grey background and rotate the axis text, we can change the scale of the y
axis to be on the log10
scale.
dgeList$counts %>%
melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
ggplot(aes(x = SampleID, y = Count)) +
geom_boxplot() +
scale_y_log10()
That warning message you’ll see is just letting you know that some zero-valued entries were removed as log10(0) = -
\(\infty\).
Here we’ve started with just the counts, and after using melt()
have ended up with three columns: a) the EnsemblID
, b) the SampleID
and c) the Count
of aligned reads for that gene. This doesn’t give us enough information to colour our boxes using the treatment groups, but we know this information is in both the original mData
object and the samples
element of dgeList
. We could get this information from either, so let’s use the left_join()
approach we used earlier to get the information from the mData
object. Note that we cleverly named our column SampleID
to match the column in the mData
object.
Let’s check the results from mData
first using the head
command, then we’ll make the plot
dgeList$counts %>%
melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
left_join(mData) %>%
head
Note how it’s merged the two data.frame
objects using the common column name. The function left_join()
takes the first object as the reference (the output of melt()
) and matches all the values from the second object to this. As this has effectively added the Accession
column as well, we could use this as a nicer axis label.
dgeList$counts %>%
melt(varnames = c("EnsemblID", "SampleID"), value.name = "Count") %>%
left_join(mData) %>%
ggplot(aes(x = Accession, y = Count, fill = Condition)) +
geom_boxplot() +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
vjust = 0.5
command did in the above?A common plot in RNA-Seq analysis in the MDS (or multi-dimensional scaling) plot. By default, most people use the function plotMDS()
from edgeR
.
plotMDS(dgeList)
We can use ggplot2
to tidy this up, by saving the output of plotMDS()
as an R
object. This will contain the key plotting values hidden away as a matrix, which we can pull out and use to customise our plot.
mds <- plotMDS(dgeList)
Firs we’ll convert this to a data.frame
on the fly using the %>%
symbol, and a couple of other handy functions to set the column names, then add the rownames as a column. Once we’ve done that, we’ll use the left_join()
strategy to incorporate the metadata again. Let’s have a look first.
mds@.Data[[3]] %>%
as.data.frame() %>%
set_colnames(c("Dim1", "Dim2")) %>%
rownames_to_column("SampleID") %>%
left_join(mData)
OK, now we can plot this using geom_point()
as our geometry. We’ll also increase the point size for a nicer plot, as well as manually specifying the colours.
mds@.Data[[3]] %>%
as.data.frame() %>%
set_colnames(c("Dim1", "Dim2")) %>%
rownames_to_column("SampleID") %>%
left_join(mData) %>%
ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
geom_point(size = 3) +
scale_colour_manual(values = c("black", "red")) +
theme_bw()
It appears that one of our Control samples is inconsistent with the others, but which one is it? We can add labels to find this out, using a second layer of geometry via geom_text()
mds@.Data[[3]] %>%
as.data.frame() %>%
set_colnames(c("Dim1", "Dim2")) %>%
rownames_to_column("SampleID") %>%
left_join(mData) %>%
ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
geom_point(size = 3) +
geom_text(aes(label = Accession)) +
scale_colour_manual(values = c("black", "red")) +
theme_bw()
OK, that’s ugly! There’s a variation called geom_text_repel()
in the package ggrepel
mds@.Data[[3]] %>%
as.data.frame() %>%
set_colnames(c("Dim1", "Dim2")) %>%
rownames_to_column("SampleID") %>%
left_join(mData) %>%
ggplot(aes(x = Dim1, y = Dim2, colour = Condition)) +
geom_point(size = 3) +
geom_text_repel(aes(label = Accession)) +
scale_colour_manual(values = c("black", "red")) +
theme_bw()
Now we have a better looking plot, and we clearly know which sample we have to investigate further.
In addition to nicely formatting our MDS plot, the package Glimma
enables you to create an MDS plot to provide an interactive output. This will automatically open an html file in your browser, and will also write this file to your working directory.
glMDSPlot(dgeList, labels = dgeList$samples$Accession, groups = dgeList$samples$group)
Hovering above the points will show the information about each sample. By clicking on the barplot on the right of the page, you can select the component which appears on the x
-axis, and the subsequent component will appear on the y
-axis.
Now we’ve inspected our data prior to finding differentially expression genes, let’s use limma-voom
to obtain differentially-expressed genes and estimates of fold-change (i.e. logFC). First we’ll need to create a design matrix, and then we’ll fit the precision weights with voom()
, the linear model for each gene with lmFit()
and moderated \(t\)-statistics using eBayes()
. Here we’ll use the magrittr
to perform this in one easy series of commands.
des <- model.matrix(~group, data = dgeList$samples)
colnames(des) <- c("Intercept", "SOX5")
fit <- dgeList %>%
voom(design = des) %>%
lmFit(design = des) %>%
eBayes()
Now place the results in a separate object, using the coefficient SOX5
(i.e. the second column of the design matrix).
results <- topTable(fit, coef = "SOX5", n = length(dgeList))
head(results)
An alternative to the default volcano plot is to use -log10()
to transform the \(p\)-values (in the column P.Value
in the results
object), and plot this on the y
-axis, with the estimated fold-change on the x
-axis. We can do this using ggplot()
, and in this initial call we’ve set the default point colour to grey30
and made the points slightly transparent (alpha = 0.5
)
results %>%
ggplot(aes(x = logFC, y = -log10(P.Value))) +
geom_point(colour = "grey30", alpha = 0.5) +
theme_bw()
Now we might want to colour the differentially expressed genes. Here we’ll create an extra column on the fly, denoting a significantly DE gene as one with an \(FDR\)-adjusted \(p\)-value \(<0.01\). Notice that we’ve added the colour strictly as part of the aesthetics for the points. The reasons for this will become clear in a few plots time.
results %>%
mutate(DE = adj.P.Val < 0.01) %>%
ggplot(aes(x = logFC, y = -log10(P.Value), colour = DE)) +
geom_point(aes(colour = DE), alpha = 0.5) +
scale_colour_manual(values = c("grey30", "red")) +
theme_bw()
That’s a lot of genes, so let’s include a filter such that the estimated logFC is beyond \(\pm1\).
results %>%
mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
ggplot(aes(x = logFC, y = -log10(P.Value), colour = DE)) +
geom_point(aes(colour = DE), alpha = 0.5) +
scale_colour_manual(values = c("grey30", "red")) +
theme_bw()
And we can also add a label to some of the top genes. Let’s just add labels to the top 10. To do this, we’ve supplied a new object to the geom_text_repel()
function, which is just the first 10 lines of the results
object. As this won’t have the column DE
, this is why we made this colouring specific to the geom_point()
function a few lines back. If we’d left this as a global parameter, we’d get an error here.
results %>%
mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
ggplot(aes(x = logFC, y = -log10(P.Value))) +
geom_point(aes(colour = DE), alpha = 0.5) +
scale_colour_manual(values = c("grey30", "red")) +
geom_text_repel(data = results[1:12,],
aes(label = external_gene_name)) +
theme_bw()
We could also change the colour of the labels if we’d like. In this final plot, we’ll also add lines for a logFC of \(\pm1\).
results %>%
mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
ggplot(aes(x = logFC, y = -log10(P.Value))) +
geom_point(aes(colour = DE), alpha = 0.5) +
scale_colour_manual(values = c("grey30", "red")) +
geom_text_repel(data = results[1:12,],
aes(label = external_gene_name),
colour = "red") +
geom_vline(xintercept = c(-1, 1), colour = "red", linetype = 2) +
theme_bw()
Instead of labelling the top DE genes, maybe we would like to see which lincRNA
genes are differentially expressed. Increasing the transparency of the main points can help these stand out. Note also, that we’ve simply overlaid these genes using the filtered results
object in exactly the same way that we’ve added the labels.
results %>%
mutate(DE = adj.P.Val < 0.01 & abs(logFC) > 1) %>%
ggplot(aes(x = logFC, y = -log10(P.Value))) +
geom_point(aes(colour = DE), alpha = 0.3) +
scale_colour_manual(values = c("grey30", "red")) +
geom_point(data = filter(results,
gene_biotype == "lincRNA",
adj.P.Val < 0.01,
abs(logFC) > 1),
colour = "blue") +
geom_text_repel(data = filter(results,
gene_biotype == "lincRNA",
adj.P.Val < 0.01,
abs(logFC) > 1),
aes(label = external_gene_name),
colour = "blue") +
geom_vline(xintercept = c(-1, 1), colour = "red", linetype = 2) +
theme_bw()
Let’s make a heatmap of these DE lincRNA
genes. Firstly, we’ll just create a data.frame
with the information for this subset of genes
deLincRNA <- results %>%
filter(gene_biotype == "lincRNA",
adj.P.Val < 0.01,
abs(logFC) > 1) %>%
dplyr::select(contains("gene"))
Now we’ll transform the raw counts to \(log(CPM)\) (counts per million) and use the list of DE lincRNAs to extract just the data we need. Once we’ve done this, we can pass the results to the function pheatmap()
from the package `pheatmap.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
pheatmap()
This has used the default pheatmap
colour scheme, which isn’t too bad, but we can tweak this if we’d like. Unfortunately, the method for generating these colours can be a bit tricky so below is a simplified version. What we’re doing in the below is specifying the colours for low, mid and high values, then using the function colorRampPalette()
to create a vector of 100 colours moving gradually through the three as specified.
heatCols <- colorRampPalette(c("blue", "white", "red"))(100)
The results are using the three channels of rgb
which are encoded in hexadecimal values. The first two digits after the hash represent the red
channel, then the next two represent the green
channel with the final two digits representing the blue
channel. As you can see, the first value created in heatCols
is all blue (#0000FF
), whilst the last is all red (#FF0000
).
Now we can use these colours for our heatmap.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
pheatmap(color = heatCols)
As we can see, dendrograms have automatically been placed on the left and top of our heatmap and we can remove these to leave our rows and columns in the original order.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
pheatmap(color = heatCols,
cluster_rows = FALSE,
cluster_cols = FALSE)
Personally, I like this dataset to be clustered so it shows genes and samples behaving similarly so let’s turn that back on. However, let’s change the rownames to be the common gene name, and the column names to be the GEO Accession ID.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
set_rownames(deLincRNA$external_gene_name) %>%
set_colnames(dgeList$samples$Accession) %>%
pheatmap(color = heatCols)
Now that we’ve used the Accession IDs, it’s not so clear which sample is which. We can add a colour guide in pheatmap()
using our metadata. This is done by passing a data.frame
to the argument annotation_col
. However, the rownames of this data.frame
must exactly match the column names of our data.matrix.
The colours for these annotations can also be set manually using the argument annotation_colors
. For this argument, we need to provide a list
with a named component for each annotation type (here we’ve specified Condition
in the colAnnotations
object). Each level of the annotation must also be specified in the colours by name.
colAnnotations <- data.frame(Condition = mData$Condition, row.names = mData$Accession)
annotColours <- list(Condition = c(Control = "black", SOX5 = "green"))
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
set_rownames(deLincRNA$external_gene_name) %>%
set_colnames(dgeList$samples$Accession) %>%
pheatmap(color = heatCols,
annotation_col = colAnnotations,
annotation_colors = annotColours)
This way we can see that each sample type is clustering correctly.
Separators can also be used to break the heatmap based on the branches of the dendrogram. In the following we’ve specified to break the heatmap into two blocks of columns (cutree_cols = 2
), and to group the rows into four blocks (cutree_rows = 4
).
These block sizes were determined manually, by simple inspection and should be chosen in a way that makes clear sense. This can help give clear indications as which samples are behaving alike, and which genes are also behaving alike.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
set_rownames(deLincRNA$external_gene_name) %>%
set_colnames(dgeList$samples$Accession) %>%
pheatmap(color = heatCols,
annotation_col = colAnnotations,
annotation_colors = annotColours,
cutree_cols = 2,
cutree_rows = 4)
Finally, the width and height of each cell can be set manually to fit the figure for your paper. This value is defined in “points” so may take a bit of experimentation when preparing for publication.
dgeList %>%
cpm(log = TRUE) %>%
extract(deLincRNA$ensembl_gene_id,) %>%
set_rownames(deLincRNA$external_gene_name) %>%
set_colnames(dgeList$samples$Accession) %>%
pheatmap(color = heatCols,
cellwidth = 20,
annotation_col = colAnnotations,
annotation_colors = annotColours,
cutree_cols = 2,
cutree_rows = 4)
First we need to define our results using the standard practices of limma
. This will produce a column for each column of the design matrix with -1, 0
or 1
indicating whether a gene has been classed as down-regulated, unchanged or up-regulated respectively. The first column here refers to the mean expression & is meaningless in today’s experiment. (Formally it’s testing for an expression level of zero in the control samples.) The second column shows how many DE genes we have due to SOX5
over-expression.
sigDE <- decideTests(fit, p.value = 0.01, lfc = 1)
summary(sigDE)
Now we can launch an interactive plot. However a few things are noteworthy in the following code:
dgeList
as this can be used to manually inspect values in the interactive plotGeneID
so we have to rename the column ensembl_gene_id
. Here we’ve just done this on the fly.dgeList$samples$group
to colour our samples.status = sigDE[,"SOX5"]
.glMDPlot(fit,
counts = dgeList$counts,
anno = dplyr::rename(dgeList$genes, GeneID = ensembl_gene_id),
groups = dgeList$samples$group,
status = sigDE[,"SOX5"]
)
ggbio
Some other types of plot which people find useful are Manhattan plots, for showing data along chromosomes, and circos plots. For these plot types, the package ggbio
can be used, and data needs to be defined using GenomicRanges
(or GRanges
) objects.
At the foundation of a GRanges
object is another object known as a Seqinfo
object, which contains the key summary information about each chromosome, or sequence that we are plotting. Let’s download this from the UCSC for hg38, as that’s the genome for some sample data we’ll use in a minute. (Please ignore any warning about NCBI seqlevel was set to NA)
hg38 <- fetchExtendedChromInfoFromUCSC("hg38")
If this download fails, load the file from your data
directory using the following code.
hg38 <- read_delim("data/chromInfo.txt", delim = "\t", col_names = FALSE) %>%
set_names(c("UCSC_seqlevel", "UCSC_seqlength", "ref")) %>%
dplyr::select(-ref) %>%
mutate(NCBI_seqlevel = if_else(UCSC_seqlevel %in% paste0("chr", 1:22),
gsub("chr", "", UCSC_seqlevel), NA_character_),
isCircular = FALSE) %>%
arrange(as.integer(NCBI_seqlevel))
Now let’s form a seqinfo
object for just the autosomes, but we’ll use the values 1:22
instead of crhr1:chr22
for our chromosome names.
autosomeSeq <- Seqinfo(seqnames = hg38$NCBI_seqlevel[1:22],
seqlengths = hg38$UCSC_seqlength[1:22],
isCircular = rep(FALSE, 22),
genome = "hg38")
autosomeSeq
For our sample SNP data, we’ll use some that should be installed as part of the biovizBase
package
snp <- read_delim("data/snps.txt", delim = "\t")
head(snp)
Here we have the location and identifier for each SNP, as well as some frequency information a \(\chi^2\) statistic and \(p\)-value. We’ll define each SNP as a range of nucleotides within the genome, except with a width of one. For genes and exons defined as GRanges
the width would correspond to the appropriate number of nucleotides, but for SNPs it’s only a single base. To form a GRanges object, it’s also best to arrange these by genomic position. After sorting we’ll also remove those for which we have no \(p\)-value.
snpGR <- snp %>%
arrange(CHR, BP) %>%
filter(!is.na(P)) %>%
mutate(Sig = P < 0.001) %>%
as.data.frame() %>%
transformDfToGr(seqnames = "CHR", start = "BP", width = 1)
seqinfo(snpGR) <- autosomeSeq
snpGR
In this last step we added the correct sequence lengths to the snpGR
object. Otherwise these values would have corresponded to the highest value in the BP
column for each chromosome.
Let’s check where out SNPs are using plotKaryogram()
from the package ggbio
. (Ignore any error about valid.GenomicRanges.seqinfo
…)
plotKaryogram(snpGR)
This is an extended ggplot2
object, so we can modify using our existing strategies.
plotKaryogram(snpGR) +
labs(x = "SNP Positions")
To make a Manhattan Plot we simply call the function plotGrandLinear()
and we can again transform the \(p\)-values using the -log10()
strategy on the fly.
plotGrandLinear(snpGR, aes(y = -log10(P)))
This plot also has the same basic features as one created using ggplot2
so we can remove the background with theme_bw()
and add a cutoff line to indicate significance
plotGrandLinear(snpGR, aes(y = -log10(P)), cutoff = 3) +
guides(colour = FALSE) +
theme_bw() +
theme(panel.grid.minor = element_blank()) +
labs(x = "Chromosome")
To build all the layers of a Circos plot, we’ll need to build multiple GRanges
objects. First we’ll start with a GRanges
objects that just defines the chromosomes.
chrGR <- GRanges(autosomeSeq)
Firstly we’ll just defined the layout
ggbio() +
layout_circle(chrGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 75)
Now we’ll setup the results from our SOX5 dataset.
resultsGR <- filter(results, chromosome_name %in% 1:22) %>%
mutate(strand = if_else(strand == 1, "+", "-"),
sigDE = abs(logFC) > 1 & adj.P.Val < 0.05) %>%
arrange(chromosome_name, start_position) %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE,
seqinfo = autosomeSeq,
start.field = "start_position",
end.field = "end_position")
And we could add a layer with the significant SNPs.
ggbio() +
layout_circle(snpGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 70) +
layout_circle(resultsGR, geom = "line", aes(y = logFC, colour = sigDE),
radius = 40, trackWidth = 20) +
layout_circle(subset(snpGR, Sig), geom = "rect", radius = 30, trackWidth = 10)
And make some links between the significant SNPs
linked <- subset(snpGR, Sig)
values(linked)$link2 <- linked[c(3, 4, 1, 2),]
ggbio() +
layout_circle(snpGR, geom = "ideo", fill = "gray70", radius = 60, trackWidth = 10) +
layout_circle(chrGR, geom = "text", aes(label = seqnames), radius = 70) +
layout_circle(resultsGR, geom = "line", aes(y = logFC, colour = sigDE),
radius = 40, trackWidth = 20) +
layout_circle(subset(snpGR, Sig), geom = "rect", colour = "red",
radius = 30, trackWidth = 10) +
layout_circle(linked, geom = "link", linked.to = "link2",radius = 20)