22 July 2016
Four common classes of annotation
| Object type | contents |
|---|---|
| OrgDb | gene based information |
| BSgenome | genome sequence |
| TxDb | transcriptome ranges |
| OrganismDb | composite information |
BSgenome Objectslibrary("BSgenome")
available.genomes()[1:4]
BSgenome ObjectsBiostrings package ("BS")library("BSgenome")BiostringsFour important object classes:
BString - Can hold any character stringsDNAString - Restricted to DNA charactersRNAString - Restricted to RNA charactersAAString - Restricted to Amino Acid alphabetBString objectsb <- BString("I am a BString object")
b
length(b)
as.character(b)
BString objectssubseq(b, start = 5) subseq(b, start = -5, width=3) reverse(b) Views(b, start=c(1:5), end = 9)
DNAString objectsDNA_ALPHABET
DNA_BASES IUPAC_CODE_MAP[DNA_ALPHABET[1:15]]
DNAString objectsd <- DNAString("TTGAAAA-CTC-N")
d
length(d)
d[3:6]
DNAString objectsreverseComplement(d) reverse(d) alphabetFrequency(d) alphabetFrequency(d, baseOnly=TRUE) Views(d, start = 3:0, end = 5) Views(d, start = 1:5, width=6) subseq(d, start = -4, width = 3)
RNAString objectsVery similar to DNAString objects, but with a different alphabet
RNA_ALPHABET
RNA_BASES
RNAString objectsRNA_GENETIC_CODE
r <- RNAString(d)
r
r == d
alphabetFrequency(r, baseOnly = TRUE)
letterFrequency(r, c("A", "U"))
AAString objectsRestricted to the Amino Acid alphabet
Functions specific to DNAString/RNAString objects won't work
AA_ALPHABET
AA_STANDARD
AA_PROTEINOGENIC
a <- AAString("MARKSLEMSIR*")
seqtype(a)
seqtype(d)
XStringSet ObjectsAll of the above strings can be extended to: - BStringSet, DNAStringSet, RNAStringSet, AAStringSet
x <- c("CTC-NACCAGTAT", "TTGA", "TACCTAGAG")
width(x)
dss <- DNAStringSet(x)
The function width() is imported from the dependency BiocGenerics
DNAStringSet Objectsnchar(dss)
width(dss)
names(dss)
names(dss) <- paste0("String", 1:3)
dss
subseq(dss, start=-4)
dss[1]
dss[[1]]
BSgenome objectsBiocInstaller::biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")
ls(2)
This package is > 600Mb
After loading, we can just call Hsapiens as the BSgenome object
BSgenomeseqNms <- seqnames(Hsapiens) seqNms[1:5] seqLens <- seqlengths(Hsapiens) seqLens[1:5]
BSgenomeHsapiens length(Hsapiens) getSeq(Hsapiens, seqNms[1:2]) Hsapiens[["chr1"]] Hsapiens$chr1
NB: Apart from the [[ method, you always need to call sequences by name
BSgenomeContain XString objects:
- extensive suite of functions for searching & pattern matching
pattern1 <- DNAString("ACGGACCTAAT")
matchPattern(pattern1, Hsapiens$chr1)
vmatchPattern(pattern1,
getSeq(Hsapiens, seqNms[1:5]))
vcountPattern(pattern1,
getSeq(Hsapiens, seqNms[1:5]),
max.mismatch = 1)
findPalindromes(Hsapiens$chr1, min.armlength = 50)
BSgenomeThese objects are also well setup to interact with TxDb objects