TxDb Objects

27 November 2016

AnnotationData

Four common classes of annotation

Object type	Contents
OrgDb	gene-based information
BSgenome	genome sequence
TxDb	transcriptome ranges
OrganismDb	composite information

`TxDb` Objects

These are the objects with the transcriptome information

Saved using GRanges classes
Derived heavily from the GenomicRanges & IRanges packages
The key idea is to refer to the genome using ranges to define locations

Workspace Setup

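# Current Bioconductor releases install packages with BiocManager instead of biocLite()
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")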

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

txdb

This will load all the package dependencies as well

`GRanges` objects

Let's look at a GRanges object

Note that our txdb object uses EntrezGene IDs

ids <- c(BRCA1="672", PTEN="5728")
genes(txdb, filter=list(gene_id=ids))

`Rle` vectors

Vectors stored in Run Length Encoding (RLE) format

A more memory-efficient way to store positional information
Highly efficient for long regions of "no information"
Also efficient for any data with long runs of repeated values

`Rle` vectors

rle(): part of the base package in R

Rle(): the S4Vectors version

Rle vectors are used extensively in GenomicRanges

x <- c(1, 0, 0, 0, 1, 1, 2, 0, 0)
Rle(x)
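
To see what run-length encoding actually stores, here is a minimal sketch using only the base R rle() function on the same vector. The long `depth` vector is made up purely for illustration. Rle() from S4Vectors records the same runs & lengths, but as an S4 object that can be used like an ordinary vector.

```r
# The same vector as above, encoded with base R's rle()
x <- c(1, 0, 0, 0, 1, 1, 2, 0, 0)
enc <- rle(x)
enc$values   # the value of each run:  1 0 1 2 0
enc$lengths  # the length of each run: 1 3 2 1 2

# Decoding recovers the original vector exactly
identical(inverse.rle(enc), x)

# A made-up vector with long stretches of zeros
depth <- c(rep(0, 1e6), rep(5, 10), rep(0, 1e6))
encDepth <- rle(depth)
length(encDepth$values)  # only 3 runs describe ~2 million positions
object.size(encDepth) < object.size(depth)
```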

Creating a `GRanges` object

# Six ranges: two on chr1 & four on chrMT, all ending at position 20
gr <- GRanges(seqnames=Rle(c("chr1", "chrMT"), c(2, 4)),
              ranges=IRanges(start=15:20, end=20),
              strand=rep(c("+", "-", "*"), 2))

Print the object by typing gr

The essential components are:

seqnames & ranges
If strand is omitted, it defaults to *
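
As a quick check of that default, the sketch below builds a GRanges from seqnames & ranges only. The object name grNoStrand and its coordinates are made up for illustration.

```r
library(GenomicRanges)

# Only seqnames and ranges are supplied
grNoStrand <- GRanges(seqnames = "chr1",
                      ranges = IRanges(start = c(1, 10), width = 5))

# Every range is given the unstranded value "*"
strand(grNoStrand)
as.character(strand(grNoStrand))
```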

Working with a `GRanges` object

Try these commands:

seqnames(gr)
strand(gr)
ranges(gr)
seqinfo(gr)
length(gr)
gr[1]
width(gr)
start(gr)

seqinfo() returns an object of the formal class Seqinfo
Seqinfo objects contain metadata about each sequence (name, length, circularity & genome)

Adding more information

We can assign names to the ranges:

names(gr) <- paste0("Rng", LETTERS[seq_along(gr)])

These could be exons, genes, SNPs, CDS or any other feature

Now look at the object again

Adding more information

We can also add some key information about the sequences

seqlengths(gr) <- c(5e6, 1.5e5)
isCircular(gr) <- c(FALSE, TRUE)
genome(gr) <- "madeUp.v1"
seqinfo(gr)

Adding more information

GRanges objects also have columns for metadata

Let's add:

Some p-values from a hypothesis test
Alternative names for the chromosomes

mcols(gr) <- data.frame(score = 10^(-rexp(6)),
                        altChr = rep(c("G001", "G002"), 
                                     times=c(2, 4)))

Subsetting `GRanges` objects

Try these commands:

gr[1:3]
gr[1:2, 1]
subset(gr, score < 0.05)
subset(gr, width == 1)
subset(gr, start > 18)
subset(gr, start > 18 | width == 5)
table(gr$altChr)
summary(mcols(gr)[,"score"])

`GRangesList`

GRanges objects can also be grouped into GRangesList objects, e.g. all of the exons for each gene

exByGn <- exonsBy(txdb, "gene")
length(exByGn)

exByGn

`GRangesList`

As well as exonsBy(), similar methods include

transcriptsBy(), cdsBy(), threeUTRsByTranscript() + more

In the current example exons are grouped by gene, but exonsBy() can also group them by transcript (by = "tx"), while transcriptsBy() accepts "gene", "exon" or "cds"

`GRangesList`

These behave like normal list objects in R

Try these commands

exByGn[[1]]
exByGn$`1`
exByGn[1:2]
sapply(exByGn[1:10], 
       function(x){length(subset(x, width<100))}) 
unlist(exByGn[1:5])

Ask if you're unsure about what any of the above commands do
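
If the full exByGn object is too large to inspect comfortably, the same list-style operations can be tried on a small GRangesList built by hand. This is only a sketch: the gene names & coordinates below are invented.

```r
library(GenomicRanges)

# Two made-up genes, with two exons and one exon respectively
grl <- GRangesList(
  geneA = GRanges("chr1", IRanges(start = c(1, 50), width = c(20, 150))),
  geneB = GRanges("chr2", IRanges(start = 10, width = 80))
)

length(grl)        # number of list elements (genes)
elementNROWS(grl)  # number of ranges (exons) in each element
grl[["geneA"]]     # a single element is a plain GRanges

# Count the exons shorter than 100bp in each gene
sapply(grl, function(x) length(subset(x, width < 100)))

# Flatten back into a single GRanges
unlist(grl)
```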

`GenomicFeatures`

The GenomicFeatures package also contains other useful methods

promoters(txdb, upstream=100, downstream=50,
          columns = c("tx_name", "gene_id"))

library(mirbase.db)
microRNAs(txdb)[1:3]
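
promoters() is strand-aware, which is easiest to see on a small example. The sketch below applies it to two made-up ranges (the names & coordinates are invented) rather than to txdb: upstream & downstream are measured from the start of a + strand range, but from the end of a - strand range.

```r
library(GenomicRanges)

# Two made-up transcripts of the same size on opposite strands
tx <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(1000, 1501), end = c(1500, 2000)),
              strand = c("+", "-"))
names(tx) <- c("plusTx", "minusTx")

prom <- promoters(tx, upstream = 100, downstream = 50)
prom
start(prom)  # 900 1951
end(prom)    # 1049 2100
```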

Combining Data Sources

Converting From EntrezGene to Ensembl

In our object exByGn, we have a list of genes and their ranges

These are named using EntrezGene IDs
Let's get the Ensembl IDs and the biotype
First, set up a biomaRt connection

library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
                dataset = "hsapiens_gene_ensembl")

Converting From EntrezGene to Ensembl

Now we collect the genes and define the attributes we want

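entrezIDs <- names(exByGn)
# Ensembl releases from 97 onwards name this field "entrezgene_id"
attr <- c("entrezgene_id", "ensembl_gene_id", "external_gene_name", "gene_biotype")
results <- getBM(attributes = attr,
                 filters = "entrezgene_id",
                 values = entrezIDs,
                 mart = mart)
# Keep the shorter column name used in the following steps
names(results)[names(results) == "entrezgene_id"] <- "entrezgene"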

Converting From EntrezGene to Ensembl

How did we go?

summary(entrezIDs %in% results$entrezgene)
table(table(results$entrezgene))
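
Both checks can be hard to read the first time. As a sketch on made-up data (toyIDs & toyResults are invented stand-ins for entrezIDs & results), the first check counts how many of our IDs were found at all, and the nested table() counts how many IDs were returned once, twice and so on.

```r
# Made-up query IDs and a made-up result with duplicated mappings
toyIDs <- c("1", "2", "3", "4", "5")
toyResults <- data.frame(
  entrezgene = c("1", "1", "2", "3", "4", "4", "4"),
  ensembl_gene_id = paste0("ENSG", 1:7),
  stringsAsFactors = FALSE
)

# ID "5" has no mapping, so one value is FALSE
summary(toyIDs %in% toyResults$entrezgene)

# Inner table(): the number of rows returned for each ID
table(toyResults$entrezgene)

# Outer table(): two IDs appear once, one twice & one three times
table(table(toyResults$entrezgene))
```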

Converting From EntrezGene to Ensembl

Some IDs map to more than one Ensembl gene, so let's just keep the first mapping

We might also need to convert those IDs back to characters

library(dplyr)
results <- results %>%
  distinct(entrezgene, .keep_all = TRUE) %>%
  mutate(entrezgene = as.character(entrezgene))

Converting From EntrezGene to Ensembl

Now we can just use the function left_join() from dplyr

# tibble() replaces the deprecated data_frame()
merged <- tibble(entrezgene = entrezIDs) %>%
  left_join(results, by = "entrezgene")
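
The dplyr steps can also be checked on a small made-up example. toyIDs & toyResults are invented, and tibble() is used here as the current replacement for data_frame(). distinct() keeps the first row for each ID, and left_join() keeps every queried ID, filling any without a mapping with NA.

```r
library(dplyr)

# Made-up query IDs and a made-up result with integer IDs
toyIDs <- c("1", "2", "5")
toyResults <- tibble(
  entrezgene = c(1L, 1L, 2L),
  ensembl_gene_id = c("ENSG_a", "ENSG_b", "ENSG_c")
)

# Keep the first mapping for each ID & convert the IDs to characters
toyResults <- toyResults %>%
  distinct(entrezgene, .keep_all = TRUE) %>%
  mutate(entrezgene = as.character(entrezgene))

# Every queried ID is retained, in its original order
toyMerged <- tibble(entrezgene = toyIDs) %>%
  left_join(toyResults, by = "entrezgene")
toyMerged
```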
