Four common classes of annotation
| Object type | contents |
|---|---|
| OrgDb | gene based information |
| BSgenome | genome sequence |
| TxDb | transcriptome ranges |
| OrganismDb | composite information |
27 November 2016
Four common classes of annotation
| Object type | contents |
|---|---|
| OrgDb | gene based information |
| BSgenome | genome sequence |
| TxDb | transcriptome ranges |
| OrganismDb | composite information |
TxDb ObjectsThese are the objects with the transcriptome information
GRanges classesGenomicRanges & IRanges packageslibrary(BiocInstaller)
biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
This will load all the package dependencies as well
GRanges objectsLet's look at a GRanges object
Note that our txdb object used EntrezGene Ids
ids <- c(BRCA1="672", PTEN="5728") genes(txdb, filter=list(gene_id=ids))
Rle vectorsRun Length Encoding format vectors
Rle vectorsrle : Part of the base R package
Rle : S4Vectors version
Rle is used extensively in GenomicRanges
x <- c(1, 0, 0, 0, 1, 1, 2, 0, 0) Rle(x)
GRanges objectgr <- GRanges(seqnames=Rle(c("chr1", "chrMT"), c(2, 4)),
ranges=IRanges(15:20, 20),
strand=rep(c("+", "-", "*"), 2))
Print the object by typing gr
The essential components are:
seqnames & rangesstrand is omitted, the value * is addedGRanges objectseqnames(gr) strand(gr) ranges(gr) seqinfo(gr) length(gr) gr[1] width(gr) start(gr)
seqinfo() returns an object with a formal classSeqinfo objects contain metadata about each sequencenames(gr) <- paste0("Rng", LETTERS[1:length(gr)])
We can assign names to the ranges:
Now look at the object again
We can also add some key information about the sequences
seqlengths(gr) <- c(5e6, 1.5e5)
isCircular(gr) <- c(FALSE, TRUE)
genome(gr) <- c("madeUp.v1")
seqinfo(gr)
GRanges objects also have columns for metadata
Let's add:
mcols(gr) <- data.frame(score = 10^(-rexp(6)),
altChr = rep(c("G001", "G002"),
times=c(2, 4)))
GRanges objectsgr[1:3] gr[1:2, 1] subset(gr, score < 0.05) subset(gr, width==1) subset(gr, start > 18) subset(gr, start > 18 | width ==5) table(gr$altChr) summary(mcols(gr)[,"score"])
GRangesListGRanges objects can also be extended to GRangesList objects
exByGn <- exonsBy(txdb, "gene") length(exByGn)
exByGn
GRangesListAs well as the exonsBy() methods, other methods include
transcriptsBy(), cdsBy(), threeUTRsByTranscript() + moreIn the current example exons are listed by gene, but can also be listed by exon, cds or tx
GRangesListThese behave like normal list objects in R
exByGn[[1]]
exByGn$`1`
exByGn[1:2]
sapply(exByGn[1:10],
function(x){length(subset(x, width<100))})
unlist(exByGn[1:5])
Ask if you're unsure about what any of the above commands do
GenomicFeaturesAlso contains other useful methods
promoters(txdb, upstream=100, downstream=50,
columns = c("tx_name", "gene_id"))
library(mirbase.db) microRNAs(txdb)[1:3]
In our object exByGn, we have a list of genes and their ranges
EnsemblIDs and the biotypebiomaRt connectionlibrary(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl")
Now we collect the genes and define the attributes we want
entrezIDs <- names(exByGn)
attr <- c("entrezgene", "ensembl_gene_id", "external_gene_name", "gene_biotype")
results <- getBM(attributes = attr,
filters = "entrezgene",
values = entrezIDs,
mart= mart)
How did we go?
summary(entrezIDs %in% results$entrezgene) table(table(results$entrezgene))
Let's just keep the first mapping
library(dplyr) results <- results %>% distinct(entrezgene, .keep_all = TRUE) %>% mutate(entrezgene = as.character(entrezgene))
We also might need to convert those IDs back to characters
Now we can just use the function left_join() from dplyr
merged <- data_frame(entrezgene = entrezIDs) %>% left_join(results)