303: Sequence Data

22 July 2016

BSgenome Objects

AnnotationData

Four common classes of annotation

Object type	contents
OrgDb	gene based information
BSgenome	genome sequence
TxDb	transcriptome ranges
OrganismDb	composite information

`BSgenome` Objects

library("BSgenome")
available.genomes()[1:4]

These objects hold genome sequence information
Currently 84 genomes available
Mainly UCSC & NCBI derived

`BSgenome` Objects

Mainly human & model organisms
Built around the Biostrings package ("BS")
Special methods & classes for long sequences
Loaded by default when we called library("BSgenome")

`Biostrings`

Four important object classes:

BString - Can hold any character strings
DNAString - Restricted to DNA characters
RNAString - Restricted to RNA characters
AAString - Restricted to Amino Acid alphabet

`BString` objects

b <- BString("I am a BString object")
b
length(b)
as.character(b)

`BString` objects

Try these commands:

subseq(b, start = 5)
subseq(b, start = -5, width=3)
reverse(b)
Views(b, start=c(1:5), end = 9)

`DNAString` objects

Restricted to DNA characters

DNA_ALPHABET

DNA_BASES
IUPAC_CODE_MAP[DNA_ALPHABET[1:15]]

`DNAString` objects

d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
d[3:6]

`DNAString` objects

Try these commands:

reverseComplement(d)
reverse(d)
alphabetFrequency(d)
alphabetFrequency(d, baseOnly=TRUE)
Views(d, start = 3:0, end = 5)
Views(d, start = 1:5, width=6)
subseq(d, start = -4, width = 3)

`RNAString` objects

Very similar to DNAString objects, but with a different alphabet

RNA_ALPHABET

RNA_BASES

`RNAString` objects

Try these commands

RNA_GENETIC_CODE
r <- RNAString(d)
r
r == d
alphabetFrequency(r, baseOnly = TRUE)
letterFrequency(r, c("A", "U"))

`AAString` objects

Restricted to the Amino Acid alphabet

Functions specific to DNAString/RNAString objects won't work

Try these commands

AA_ALPHABET      
AA_STANDARD       
AA_PROTEINOGENIC
a <- AAString("MARKSLEMSIR*")
seqtype(a)
seqtype(d)

`XStringSet` Objects

All of the above strings can be extended to: - BStringSet, DNAStringSet, RNAStringSet, AAStringSet

x <- c("CTC-NACCAGTAT", "TTGA", "TACCTAGAG")
width(x)
dss <- DNAStringSet(x)

The function width() is imported from the dependency BiocGenerics

`DNAStringSet` Objects

Try these commands

nchar(dss)
width(dss)
names(dss)
names(dss) <- paste0("String", 1:3)
dss
subseq(dss, start=-4)
dss[1]
dss[[1]]

Back to the `BSgenome` objects

BiocInstaller::biocLite("BSgenome.Hsapiens.UCSC.hg19")
library("BSgenome.Hsapiens.UCSC.hg19")
ls(2)

This package is > 600Mb

After loading, we can just call Hsapiens as the BSgenome object

`BSgenome`

seqNms <- seqnames(Hsapiens)
seqNms[1:5]
seqLens <- seqlengths(Hsapiens)
seqLens[1:5]

`BSgenome`

Some commands to try

Hsapiens
length(Hsapiens)
getSeq(Hsapiens, seqNms[1:2])
Hsapiens[["chr1"]]
Hsapiens$chr1

NB: Apart from the [[ method, you always need to call sequences by name

`BSgenome`

Contain XString objects:
- extensive suite of functions for searching & pattern matching

Some more interesting commands

pattern1 <- DNAString("ACGGACCTAAT")
matchPattern(pattern1, Hsapiens$chr1)
vmatchPattern(pattern1, 
              getSeq(Hsapiens, seqNms[1:5]))
vcountPattern(pattern1, 
              getSeq(Hsapiens, seqNms[1:5]),
              max.mismatch = 1) 
findPalindromes(Hsapiens$chr1, min.armlength = 50)

`BSgenome`

These objects are also well setup to interact with TxDb objects

BSgenome Objects

AnnotationData

BSgenome Objects

BSgenome Objects

Biostrings

BString objects

BString objects

Try these commands:

DNAString objects

DNAString objects

DNAString objects

Try these commands:

RNAString objects

RNAString objects

Try these commands

AAString objects

Try these commands

XStringSet Objects

DNAStringSet Objects

Try these commands

Back to the BSgenome objects

BSgenome

BSgenome

Some commands to try

BSgenome

Some more interesting commands

BSgenome

`BSgenome` Objects

`BSgenome` Objects

`Biostrings`

`BString` objects

`BString` objects

`DNAString` objects

`DNAString` objects

`DNAString` objects

`RNAString` objects

`RNAString` objects

`AAString` objects

`XStringSet` Objects

`DNAStringSet` Objects

Back to the `BSgenome` objects

`BSgenome`

`BSgenome`

`BSgenome`

`BSgenome`