302: Annotations

22 July 2016

Getting Annotation Information

Annotation

Annotation packages make up a significant proportion of Bioconductor packages
Annotation is often seen as the end point of an analysis
For networks/pathways it's the starting point

`biomaRt`

The package biomaRt is based on the web interface at http://www.ensembl.org/biomart/martview

library(biomaRt)
allMarts <- listMarts()

These are the possible data sources (i.e. marts) we can access

`biomaRt`

Each mart has multiple datasets

mart <- useMart("ENSEMBL_MART_ENSEMBL")
ensDatasets <- listDatasets(mart)
library(dplyr)
filter(ensDatasets, grepl("sapiens", dataset))

`biomaRt`

We can just go straight there by selecting the dataset within useMart()

mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
                dataset = "hsapiens_gene_ensembl")

NB: This is exactly the same procedure as stepping through the selection windows in the web GUI

`biomaRt`

Now the mart & dataset have been selected

The main query function is getBM()

?getBM

This returns the requested data directly as a data.frame

`biomaRt`

Attributes and Filters

The two main pieces of data

attributes are the values we are looking for
filters, along with their values, are our search queries

To find what attributes can be downloaded from our mart

martAttributes <- listAttributes(mart)

These are the possible pieces of information we can return (use dim(martAttributes) to see how many)

`biomaRt`

Some attributes may contain large amounts of data
We can use filters to restrict the information
e.g. we may have only a few genes of interest

martFilters <- listFilters(mart)
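
Both tables are long, so it can help to search them by pattern. The following is a minimal sketch, not part of biomaRt itself: searchMartTable() is a hypothetical helper written for this slide, and it assumes the table has name and description columns, as the output of listAttributes() and listFilters() does. If martAttributes is not in the workspace, a small hand-made stand-in table is used instead.

```r
# Hypothetical helper: keep rows of an attribute/filter table whose name or
# description matches a (case-insensitive) regular expression.
searchMartTable <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
    grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Use the real table if it exists in the workspace; otherwise fall back to a
# small hand-made stand-in (illustrative only, not a real download).
attrTable <- if (exists("martAttributes")) {
  martAttributes
} else {
  data.frame(
    name = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"),
    description = c("Gene stable ID", "HGNC symbol", "Chromosome/scaffold name"),
    stringsAsFactors = FALSE
  )
}

# Find every attribute that mentions HGNC
searchMartTable(attrTable, "hgnc")
```

The same helper can be pointed at martFilters to look for a suitable filter name.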

`biomaRt`

Example 1

Let's get all the gene names on Chromosome 1
NB: We need to specify the filter, and give the filter values separately
We need to specify the mart argument every time

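# NB: older Ensembl releases named the Entrez attribute "entrezgene";
# check listAttributes(mart) if this name is rejected
genes <- getBM(attributes = c("hgnc_symbol", "entrezgene_id"), 
               filters = "chromosome_name", 
               values = "1", mart = mart)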
head(genes)

`biomaRt`

Example 2

ids <- c("ENSG00000134460", "ENSG00000163599")
attr <-  c("ensembl_gene_id", "ensembl_transcript_id")
test <- getBM(attributes = attr,
              filters = "ensembl_gene_id",
              values = ids,
              mart = mart)

Repeat the above without asking for the gene_id back

How could we also get the chromosome, strand, start & end positions in the above query?
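
One possible answer is sketched here, on the assumption that the gene-level attributes chromosome_name, strand, start_position and end_position are what is wanted (transcript-level coordinates are stored under different attribute names, so check listAttributes(mart)). The query is only run if the mart object from the earlier slides exists, because getBM() needs the biomaRt package and an internet connection.

```r
# Attributes for one possible answer: IDs plus gene-level coordinates.
# NB: start_position/end_position are gene coordinates, not transcript ones.
posAttr <- c("ensembl_gene_id", "ensembl_transcript_id",
             "chromosome_name", "strand",
             "start_position", "end_position")

# Only query Ensembl when the mart object from the earlier slides exists
if (exists("mart") && requireNamespace("biomaRt", quietly = TRUE)) {
  ids <- c("ENSG00000134460", "ENSG00000163599")
  testPos <- biomaRt::getBM(attributes = posAttr,
                            filters = "ensembl_gene_id",
                            values = ids,
                            mart = mart)
  print(head(testPos))
}
```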

`biomaRt`

We could condense each set of transcripts into a CharacterList and make a DataFrame…

We can set multiple filters:

The values must be supplied as a list (read the help page)
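
A minimal sketch of a two-filter query follows. The choice of filters (chromosome_name and biotype) is an assumption made for illustration; confirm the names against listFilters(mart). The key point is that values is a list with one element per filter, in the same order as filters. As before, the query only runs if mart already exists.

```r
# Two filters, and one list element of values per filter, in the same order
multiFilters <- c("chromosome_name", "biotype")
multiValues  <- list("1", "protein_coding")

# Only query Ensembl when the mart object from the earlier slides exists
if (exists("mart") && requireNamespace("biomaRt", quietly = TRUE)) {
  codingChr1 <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                               filters = multiFilters,
                               values = multiValues,
                               mart = mart)
  print(head(codingChr1))
}
```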

`biomaRt` and `dplyr`

Here's a problem

?select

We now have more than one function called select

How will R know which one to use?

A mart is an S4 object, a data.frame is an S3 object

`biomaRt` and `dplyr`

This is a well known problem

The specific version of a function can be called by prefixing it with the package name

The package name used this way is known as the namespace
dplyr::select() or biomaRt::select()
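
The same mechanism can be tried out without biomaRt or dplyr installed. This is a small sketch using base R only: a function named sd() is defined in the workspace so that it masks stats::sd(), and the namespace prefix is then used to reach the original.

```r
# Define a function whose name clashes with stats::sd()
sd <- function(x) "my own sd(), masking stats::sd()"

# The copy in the global environment is found first
masked <- sd(c(1, 2, 3))

# The namespace prefix forces the version from the stats package
qualified <- stats::sd(c(1, 2, 3))

masked
qualified

# Remove the masking function so sd() refers to stats::sd() again
rm(sd)
```

Writing dplyr::select() or biomaRt::select() works the same way, and makes the intended function explicit regardless of the order in which the packages were loaded.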

Annotation Hub

AnnotationData

This session relies heavily on material from

Annotation Resources

Authors: Marc RJ Carlson, Herve Pages, Sonali Arora, Valerie Obenchain, and Martin Morgan

Presented at BioC2015 July 20-22, Seattle, WA

https://github.com/mrjc42/BiocAnnotRes2015

AnnotationData

Four common classes of annotation

Object type	contents
OrgDb	gene based information
BSgenome	genome sequence
TxDb	transcriptome ranges
OrganismDb	composite information

AnnotationHub

library(AnnotationHub)
ah <- AnnotationHub()

This is a relatively new & sensibly named package
We can access & find numerous annotation types
Uses SQL-type methods
Creating this object will create a cache with the latest metadata from each data source

Annotation Hub

Get a summary:

ah

This is another S4 object

3 important components: $dataprovider, $species & $rdataclass
The additional components listed under "additional mcols()" in the summary can also be accessed with the $

Annotation Hub

We can find the data providers

unique(ah$dataprovider)

Or the different data classes in the hub

unique(ah$rdataclass)

Annotation Hub

We can find the species with annotations

sp <- unique(ah$species)
head(sp)
length(sp)

Annotation Hub

We can query for matches to any term, e.g. to look for rabbit (Oryctolagus cuniculus) annotation sources

query(ah, "Oryctolagus")

We can create smaller AnnotationHub objects, which we could then search again

Annotation Hub

We can subset easily

subset(ah, rdataclass=="GRanges")

Or if we know we want the GRanges annotations for the rabbit

subset(query(ah, "Oryctolagus"), rdataclass=="GRanges")

Annotation Hub

Or we can combine multiple search queries

Fetch the rabbit annotations, which are GRanges objects derived from Ensembl

query(ah, 
      pattern=c("Oryctolagus", "GRanges", "Ensembl"))

Annotation Hub

We can find the metadata for the whole object, or any subset we've created

meta <- mcols(ah)
meta

Annotation Hub

There's even a GUI

display(ah)

You may need to resize the Viewer window
Return an AnnotationHub object to R by selecting a row & clicking the button

Annotation Hub

Once we have the specific annotation we're interested in:

subset using the name & the double bracket method
this loads the AnnotationData object into your workspace

gr <- ah[["AH51056"]]
gr

OrgDb Objects

Organism DB objects

These follow more of an SQL-type search pattern
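
As a rough illustration of that pattern, the sketch below queries the human OrgDb package. The package (org.Hs.eg.db) and the gene symbols are assumptions chosen for illustration and are not taken from the slides; the code is skipped if the package is not installed. keytypes() lists what can be searched on, columns() lists what can be returned, and select() runs the query, much like SELECT ... WHERE in SQL.

```r
# Only run when the (assumed) human OrgDb package is installed
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  orgDb <- org.Hs.eg.db::org.Hs.eg.db

  # What can we search on, and what can we get back?
  print(head(AnnotationDbi::keytypes(orgDb)))
  print(head(AnnotationDbi::columns(orgDb)))

  # Look up two example gene symbols; the namespace prefix avoids
  # any clash with dplyr::select() or biomaRt::select()
  res <- AnnotationDbi::select(orgDb,
                               keys = c("BRCA1", "TP53"),
                               columns = c("ENTREZID", "GENENAME"),
                               keytype = "SYMBOL")
  print(res)
}
```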
