Introduction

This session follow directly on from the previous session, picking up with GRanges objects and digging a little deeper. Please work in the same folder as Wednesday.

Last time we used an EnsDb object to import GRanges objects directly which corresponded to genes, transcripts, promoters etc. Let’s look at them a little more deeply, starting by building one, seeing how they interact with data.frame objects, then moving into the package plyranges.

If time permits, we’ll even use them for visualisations.

Genomic Ranges

The basics

Let’s start by loading the package GenomicRanges as well as the tidyverse. (You might notice we have a lot of conflicts now, so we’ll probably end up using namespaces a fair bit)

library(GenomicRanges)
library(magrittr)
library(tidyverse)

To form a GRanges object, we can actually use really simple coding strategies. As per usual, for a lot of the code, we won’t save objects into our R Environment, but will just have a look at a how a few things work.

GRanges("1:2-3:+")

Here we’ve created a simple object (without actually saving it) by passing a character vector of length 1. The function GRanges is smart enough to understand that we usually follow the chromosome with a :, separate our start & end points with a -, and assign strandedness following the range with another :.

Knowing that we can handle a character vector, we can create multiple ranges at the same time.

c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges()

Here we’ve passed a named vector and these are passed to the GRanges object.

c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges() %>%
    names()

Unlike a data.frame where the names are the column names, for a GRanges object the ranges are our features of interest, so these appear a little like rownames, but they are clearly not.

c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges() %>%
    rownames()

This makes intuitive sense because for a data.frame we’re often thinking about a statistical model so will have predictor or response variables in each column, whilst this is a very different situation here.

Seqinfo objects

If we don’t provide a Seqinfo object, `GRanges automatically makes one for us based on our provided data.

c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges() %>%
    seqinfo()

Clearly this strategy can be useful for demonstrations, or coding on the fly, but can we add a Seqinfo object? Before we try it, let’s make one. You might also notice that I’ve just used Seqinfo with an upper case S. This is common practice in S4 programming, where formal S4 classes begin with an upper-case letter. In general, functions and S3 objects begin with lower case letters.

mySeqInfo <- Seqinfo(
    seqnames = as.character(1:2),
    seqlengths = c(100, 50),
    genome = "myToy.v1"
)

The only essential component of a Seqinfo object is the sequence (or chromosome) names. This must be a character vector. In the above, we’ve also set lengths just for fun, as well as giving our genome build an informative name.

To extract information from these objects, the two most useful functions are seqlevels() and seqlengths. If you really care, genome() can be helpful too.

seqlevels(mySeqInfo)
seqlengths(mySeqInfo)

We can now pass this object to the function GRanges() as the argument seqinfo.

c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges(seqinfo = mySeqInfo) %>%
    seqinfo()

When we’re constructing our own ranges, forming a Seqinfo object and explicitly specifying this is very wise. This way we can ensure that every object we create will be compatible with the others.

Let’s save this now that we have it in good shape, with a well structured Seqinfo object as the foundation.

gr <- c(a = "1:2-3:+", b = "2:4-6:-") %>%
    GRanges(seqinfo = mySeqInfo)

Accessing other elements

If we wish to get our chromosome names we call them using the function seqnames().

seqnames(gr)

This returns a factor-Rle which is something we’ve not seen before. Rle stands for Run-Length Encoded and is a very efficient way of storing long, repetitive vectors, and this one contains a factor.

To demonstrate an Rle, if we had 10 repeated values, followed by another 12 repeated values, we could save this as a vector with 4 numbers instead of a vector with 22 numbers.

Here’s a normal character vector

rep(c("a", "b"), times = c(10, 12))

Here’s the same thing as an Rle.

rep(c("a", "b"), times = c(10, 12)) %>%
    Rle()

Hopefully this makes it clear that we have the values "a" and "b" for 10 and 12 repeats respectively. Now we’re only using 4 numbers. This is very useful for genomic information where we may have hundreds or thousands of values just containing the same chromosome names over & over. For example:

rep("chr1", 10000) %>% Rle()

This time, we used two values to represent 10,000 values. This is very memory efficient for genomic data.

The other thing you may have noticed is that our original seqnames(gr) returned a factor-Rle. This means we can return our actual chromosome (or sequence) names using seqlevels().

seqlevels(gr)

IRanges and subsetting

While we’re looking at the objects in detail, you’ll notice that the GRanges class builds on the IRanges (Integer Ranges) class, and this is how the ranges are actually stored.

ranges(gr)

Here you can see we have a start, end and width which are just integers. There’s no genome information here. We can get these individual vectors out using the functions start(), end() and width().

start(gr)
end(gr)
width(gr)

You might also notice that the width is inclusive of the start and end points, which again makes since. In our object we have ranges that encompass 2nt or 3nt.

As a final step, we can subset our ‘larger’ object by using the [] approach.

gr[1]
gr["a"]

Building GRanges objects from a data.frame

Often we start with a data.frame or other structure which is easy to coerce into a data.frame. This also gives us a chance to incorporate metadata columns, which are completely flexible in their format.

Let’s make a tibble which we’ll coerce into a GRanges object. The columns we’ll specify are the sequence name, the start position, the width of the range, as well as two random columns called score and GC. These represent possible test statistics and the GC content. After we’ve made the tibble, we’ll set rownames as these will be passed the GRanges object, so we will need to take off the pretty wrapping paper of a tibble and just coerce to a boring old data.frame.

set.seed(101) # Manually set the random seed
df <- tibble(
    seqnames = rep(c("1", "2"), each = 5),
    start = seq(2, 10, by = 2) %>% rep(times = 2),
    end = start + 0:9,
    strand = sample(c("+", "-"), 10, TRUE),
    score = rnorm(10),
    GC = runif(10)
) %>%
    as.data.frame() %>%
    set_rownames(letters[seq_len(nrow(.))])
df

To create a GRanges object from this we just pass it to makeGRangesFromDataFrame(). Because we’ve used sensible column names we don’t need to tell the function which column contains what, but sometimes, you do need to do that.

gr2 <- df %>%
    makeGRangesFromDataFrame(
        keep.extra.columns = TRUE,
        seqinfo = mySeqInfo
    )
gr2

Notice that this automatically coerced our seqnames and strand to an Rle. As well as that, we also have metadata (GC-content and our statistical results), which we can extract simply using the mcols() function.

mcols(gr2)

Alternatively we can just extract the ranges, the GRanges, or any other information.

ranges(gr2)
granges(gr2)
strand(gr2)
start(gr2)
names(gr2)
gr2[1:3, "GC"]
length(gr2)

More advanced tricks

We can merge overlapping ranges using reduce. Unfortunately, this function also exists in one of the tidyverse packages, so we’ll have to call using the namespace.

GenomicRanges::reduce(gr2)

Did that give you the output you expected? If not, can you figure out why not?

We can also split these into a GRangesList, which is a formal class built on a list, but that must contain GRanges objects in every position.

gr2 %>%
    split(f = seqnames(.))

This is a vary handy structure as most of the functions designed for a GRanges object can be directly applied to this list. Given that this is often how we obtain exon/transcript structures, these are very common objects.

Try and describe in your RMarkdown what the following functions do

shift(gr2, 10)
flank(gr2, 1)
gaps(gr2)

This one might take some very careful thought to understand

resize(gr2, width = 5, fix = 'start')
resize(gr2, width = 3, fix = "center")

plyranges

A (mainly) Australian package which was release 2-3 years ago is called plyranges. Stuart’s philosophy when writing this package was to bring the tidyverse philosophy into the S4 and GenomicRanges world. He reasoned that with dplyr, you always provide a tibble as input and return a tibble as output. This should be viable for GenomicRanges as well, and this enables the magrittr to be used.

For most of the functions in the package, he uses an underscore to separate words within function names, as we’ve seen with the tidyverse.

library(plyranges)

The first set of function in this package are the shift_ functions, where we can move our ranges. Can you spot the difference in these two functions?

gr2 %>% shift_left(1)
gr2 %>% shift_downstream(1)

In keeping with dplyr, we can now use the filter operation, which operates on values in the mcols slot.

gr2 %>% filter(GC > 0.5)

If we wish to compare ranges, we can also use another set of GenomicRanges to subset the first.

gr2 %>% filter_by_overlaps(gr)
gr2 %>% filter_by_non_overlaps(gr)

And we can group ranges, just like we group a tibble.

gr2 %>%
    group_by(seqnames) %>%
    filter(score == max(score))
gr2  %>%
    group_by(seqnames) %>%
    summarise(GC = mean(GC))

(Notice that in this last one, we didn’t return a GRanges. This is one of the few exceptions in the package)

We can also use join functions which behave similarly to those in dplyr. Again, try to explain what each of these does in your markdown.

gr %>% join_nearest(gr2)
gr %>% join_nearest_downstream(gr2)
gr %>% join_nearest_upstream(gr2)
gr %>% join_overlap_intersect(gr2)
gr %>% join_overlap_inner_directed(gr2)
gr %>% join_overlap_inner_within(gr2)

This is a package under active development so many useful features are still being added.

Visualisation of Gene Models

opts_chunk$set(
    eval = FALSE
)

Now that we’re comfortable with GRanges objects, these are very helpful for visualisation within R, using the package GViz. We’ll shift to a different dataset from here, and if you find any package, or dataset is missing from your VM, try using BiocManager::install() to install the required package(s). The first of these is BSgenome.Hsapiens.UCSC.hg19, so please (just using the Console) enter BiocManager::install("BSgenome.Hsapiens.UCSC.hg19"). This will take a minute or two, as we’re installing all of the sequence information for the human genome. When asked to update packages, please choose n in the interests of time.

library(Gviz)
library(BSgenome.Hsapiens.UCSC.hg19)

For this section, we’re going to start by plotting some CpG Islands, which are GC rich regions which are often studied for their methylation status. These come with GViz as an object called cpgIslands

data("cpgIslands")
cpgIslands

This is a GRanges object which we can now use for plotting. GViz works as a series of layers, but in a very different fashion to ggplot2, so we’ll using a different set of strategies here. This time, the layers effectively refer to different types of genomic features which appear in a fashion not dissimilar to facets under ggplot

To use this approach, we first create each layer as a track, and for this one, we’ll use the GRanges object to create an AnnotationTrack(). From there, we can simply call the workhorse function plotTracks(). (If you see an error, it’s harmless but I’m not sure what it means.)

cpg <- AnnotationTrack(cpgIslands, name="CpG")
plotTracks(cpg)

A common trick when visualising genomic data is to include what’s known as an ideogram. This is a “cartoon” representation of one chromosome, and often serves as a visual anchor for people. Let’s define a few genomic values for our plot, which we’ll make using chromosome 7 from the human genome build hg19, then we’ll make our IdeogramTrack. (This may take a moment or two to setup)

gen <- "hg19"
chr <- "chr7"
ideo <- IdeogramTrack(genome = gen, chromosome = chr)
plotTracks(list(ideo, cpg))

Notice that as we’re now using more than one track, we pass this to plotTracks() as a list. The ranges of the object cpg are used to show the genomic location on the ideogram.

One more annotation which can be informative is a GenomeAxisTrack(), which gives us the actual positions within the genomic region being displayed.

ax <- GenomeAxisTrack()
plotTracks(list(ideo, ax, cpg))

As we can see, the tracks are plotted in order, so let’s formally create a list which we can add tracks to as we go.

trackList <- list(
    ideogram = ideo,
    axis = ax,
    CpG_Islands = cpg
)
plotTracks(trackList)

The data we’re actually really interested in plotting today is contained in an object called geneModels. This is (inexplicably) provided as a data.frame, but these are clearly derived from a GRanges object. Let’s convert it straight away.

data("geneModels")
gm <- geneModels %>% 
    makeGRangesFromDataFrame(keep.extra.columns = TRUE)
genome(gm) <- gen

Now we can set this up as an additional track

trackList$genes <- GeneRegionTrack(
    range = gm, 
    chromosome = chr, 
    name ="Gene Model"
)
plotTracks(trackList)

We’re also able to show sequence data, but this is only useful when zoomed right in.

seqtrack <- SequenceTrack(Hsapiens, chromosome = chr)
plotTracks(
    c(trackList, seqtrack), 
    from=26591822, to=26591852, 
    cex=0.8
    )
gm

Modifying Parameters

So far, we’ve just used the default plotting parameters, but there are many modifications we can make. An obvious one might be to show the gene names. We know these are in the mcols slot in the GRanges objects, in the column symbol, so let’s add these

trackList$genes <- GeneRegionTrack(
    range = gm, 
    chromosome = chr, 
    name = "Gene Model",
    transcriptAnnotation = "symbol", 
    background.title = "blue"
)
trackList$CpG_Islands <- AnnotationTrack(
    range = cpgIslands, 
    name="CpG",
    background.title = "blue"
    )
plotTracks(trackList)

This is really as much as we’ll have time for, but other tracks incorporating numeric variables, such as read counts can be added as well.