22 July 2016
biomaRt
The package biomaRt is based on the web interface at http://www.ensembl.org/biomart/martview
library(biomaRt)
allMarts <- listMarts()
These are the possible data sources (i.e. marts) we can access
Each mart has multiple datasets
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ensDatasets <- listDatasets(mart)
library(dplyr)
filter(ensDatasets, grepl("sapiens", dataset))
We can go straight there by selecting the dataset within useMart()
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")
NB: This is exactly the same procedure as working through the selection windows on the web GUI
Now that the mart & dataset have been selected, we can query them with getBM()
?getBM
This returns the requested data directly as a data.frame
The two main pieces of data:
- attributes are the values we are looking for
- filters, along with values, are our search queries
To find what attributes can be downloaded from our mart
martAttributes <- listAttributes(mart)
These are possible pieces of information we can return (dim(martAttributes))
To find what filters can be used to search our mart
martFilters <- listFilters(mart)
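Both tables are ordinary data.frames, so they can be searched in the same way as the dataset list earlier. The sketch below assumes the Ensembl mart is reachable and that we are interested in anything mentioning "entrez" or "chromosome"; the search terms and object names are only illustrative.

```r
library(biomaRt)
library(dplyr)

# Requires a network connection to the Ensembl BioMart server
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")

martAttributes <- listAttributes(mart)
martFilters <- listFilters(mart)

# Attributes whose name mentions "entrez" (illustrative search term)
entrezAttributes <- dplyr::filter(martAttributes, grepl("entrez", name))
entrezAttributes

# Filters whose name mentions "chromosome" (illustrative search term)
chrFilters <- dplyr::filter(martFilters, grepl("chromosome", name))
chrFilters
```

The name column holds the strings that are passed to the attributes and filters arguments of getBM().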
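# NB: recent Ensembl releases name the Entrez attribute "entrezgene_id";
# check listAttributes(mart) if "entrezgene" is not found
genes <- getBM(attributes = c("hgnc_symbol", "entrezgene"),
               filters = "chromosome_name",
               values = "1", mart = mart)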
head(genes)
ids <- c("ENSG00000134460", "ENSG00000163599")
attr <- c("ensembl_gene_id", "ensembl_transcript_id")
test <- getBM(attributes = attr,
              filters = "ensembl_gene_id",
              values = ids,
              mart = mart)
Repeat the above without asking for the gene_id back
How could we also get the chromosome, strand, start & end positions in the above query?
We could condense each set of transcripts into a CharacterList and make a DataFrame…
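One possible way to do this is sketched below. It uses a small stand-in for the getBM() result so that it runs offline: the transcript IDs are made up for illustration, and the object names txByGene and geneDF are hypothetical.

```r
library(IRanges)  # also attaches S4Vectors, which provides DataFrame()

# Toy stand-in for the getBM() result; the transcript IDs are made up
test <- data.frame(
  ensembl_gene_id = c("ENSG00000134460", "ENSG00000134460",
                      "ENSG00000163599"),
  ensembl_transcript_id = c("tx_A1", "tx_A2", "tx_B1"),
  stringsAsFactors = FALSE
)

# Split the transcript IDs by gene, giving a CharacterList
# with one element per gene
txByGene <- splitAsList(test$ensembl_transcript_id, test$ensembl_gene_id)

# One row per gene, with all of its transcripts held in a single column
geneDF <- DataFrame(ensembl_gene_id = names(txByGene),
                    transcripts = txByGene)
geneDF
```

This keeps one row per gene, which an ordinary data.frame cannot do when the number of transcripts differs between genes.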
We can set multiple filters:
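A minimal sketch of a query with two filters, assuming the Ensembl mart is reachable. With more than one filter, getBM() expects values as a list with one element per filter, in the same order. The choice of the "biotype" filter and the name chr1Coding are only for illustration.

```r
library(biomaRt)

# Requires a network connection to the Ensembl BioMart server
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")

# Protein-coding genes on chromosome 1:
# the first list element matches "chromosome_name", the second "biotype"
chr1Coding <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                    filters = c("chromosome_name", "biotype"),
                    values = list("1", "protein_coding"),
                    mart = mart)
head(chr1Coding)
```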
biomaRt and dplyr
Here's a problem
?select
We now have more than one function called select
How will R know which one to use?
A mart is an S4 object, a data.frame is an S3 object
This is a well-known problem
The specific version of a function can be called by prefixing it with the package name, i.e. its namespace
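As a small illustration, assuming both packages are installed and loaded in the same order as in this session, the toy data.frame below (made up for the example) is passed to the dplyr version explicitly, so the result does not depend on which package was attached last.

```r
suppressPackageStartupMessages({
  library(biomaRt)
  library(dplyr)  # attached second, so its select() masks the earlier one
})

# Toy data.frame, made up for this example
genes <- data.frame(symbol = c("A", "B"),
                    chromosome = c("1", "2"),
                    stringsAsFactors = FALSE)

# Explicitly call the dplyr version, whatever the attach order
symbolsOnly <- dplyr::select(genes, symbol)
symbolsOnly

# List the attached packages that provide a function called select
find("select")
```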
dplyr::select() or biomaRt::select()
This session relies heavily on material from
Annotation Resources
Authors: Marc RJ Carlson, Herve Pages, Sonali Arora, Valerie Obenchain, and Martin Morgan
Presented at BioC2015 July 20-22, Seattle, WA
Four common classes of annotation
| Object type | contents |
|---|---|
| OrgDb | gene based information |
| BSgenome | genome sequence |
| TxDb | transcriptome ranges |
| OrganismDb | composite information |
library(AnnotationHub)
ah <- AnnotationHub()
SQL-type methods
Get a summary:
ah
This is another S4 object
- the summary shows $dataprovider, $species & $rdataclass
- additional mcols() can also be accessed with the $
We can find the data providers
unique(ah$dataprovider)
Or the different data classes in the hub
unique(ah$rdataclass)
We can find the species with annotations
sp <- unique(ah$species)
head(sp)
length(sp)
We can query for matches to any term, e.g. to look for rabbit (Oryctolagus cuniculus) annotation sources
query(ah, "Oryctolagus")
We can create smaller AnnotationHub objects, which we could then search again
We can subset easily
subset(ah, rdataclass=="GRanges")
Or if we know we want the GRanges annotations for the rabbit
subset(query(ah, "Oryctolagus"), rdataclass=="GRanges")
Or we can combine multiple search queries
Fetch the rabbit annotations, which are GRanges objects derived from Ensembl
query(ah,
pattern=c("Oryctolagus", "GRanges", "Ensembl"))
We can find the metadata for the whole object, or any subset we've created
meta <- mcols(ah)
meta
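The same approach works on a query result. The sketch below assumes the hub is reachable and reuses the rabbit query from above. The object names rabbit and rabbitMeta are hypothetical, and the records returned will depend on the current hub snapshot.

```r
library(AnnotationHub)

# Requires a network connection (or a local cache) for the hub metadata
ah <- AnnotationHub()

# The rabbit GRanges annotations derived from Ensembl
rabbit <- query(ah, pattern = c("Oryctolagus", "GRanges", "Ensembl"))

# Record identifiers (the "AH..." names) and their titles
names(rabbit)
rabbit$title

# A narrower view of the metadata for just these records
rabbitMeta <- mcols(rabbit)[, c("title", "dataprovider", "rdataclass")]
rabbitMeta
```

The identifiers returned by names() are the ones used to retrieve an individual record.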
There's even a GUI
display(ah)
We can return an AnnotationHub() to R by selecting a row & clicking the button
Once we have the specific annotation we're interested in:
- subset using the name & the double bracket method
- this loads the AnnotationData object into your workspace
gr <- ah[["AH51056"]]
gr
Organism DB objects