knitr::opts_chunk$set(
echo = TRUE,
results = "hide",
fig.show = "hide",
message = FALSE,
warning = FALSE
)
Today we’ll look at identifying transcripts which are not included in the reference transcriptome. This is a short-read specific technique and will use the hisat2 -> stringtie
workflow.
Firstly, we’ll need to go through the usual process of setting up a new R Project, and please start today’s R Project in the ~/transcriptomics
directory and name the project/folder week12
Once you have this setup, go to the bash
Terminal in R Studio and enter the following code to download today’s data. This should take 1-2 min as it is about 2Gb. The first two lines setup the download, whilst the next two lines will extract the tarball and remove the original file.
cd ~/transcriptomics/week12
wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
tar -xzvf chrX_data.tar.gz
rm chrX_data.tar.gz
This gives us the following structure
tree chrX_data
genes/chrX.gtf
genome/chrX.fa
guevadis_phenodata.csv
hisat2
indexes which we’ll use for alignment are in indexes/
samples
As we’ll be primarily in bash
for today’s session, we’ll need to activate our transcriptomics
conda environment. We also need to install two key tools for today’s session: 1) The aligner hisat2
and 2) The tool stringtie
which we will use to identify new transcripts.
conda activate transcriptomics
conda install -c bioconda hisat2 stringtie gffcompare gffread
Choose y
when prompted
In previous sessions we’ve used the splice-aware aligner STAR which is one of the most popular RNA-Seq aligners. An alternative is the aligner hisat2
and the two are essentially similar. We’ve downloaded the indexed ‘genome’ (i.e. chrX) for today’s smaller dataset so we can just jump straight into the alignments.
First we’ll create a folder to write our alignments into:
mkdir chrX_data/bam
Now we can write a script to generate the alignments. In the interests of time, we’ll skip any checking steps, but please make sure you check your file paths and all capitalisation. Unfortunately, hisat2
outputs to SAM format only. Whilst we could write to stdout
and redirect to a BAM
for R1 in chrX_data/samples/*1.fastq.gz
do
echo -e "Found ${R1}"
R2=${R1%_1.fastq.gz}_2.fastq.gz
echo -e "The R2 file should be ${R2}"
## Create output filenames
ALNOUT=chrX_data/bam/$(basename ${R1%_1.fastq.gz}.sam)
echo -e "Alignments will be written to:\n${ALNOUT}"
# Align using hisat2
hisat2 \
-p 2 \
--dta \
-x chrX_data/indexes/chrX_tran \
-1 ${R1} \
-2 ${R2} \
-S ${ALNOUT}
# Sort the alignments and output as bam after sorting
echo -e "Sorting ${ALNOUT}"
samtools sort \
-@ 2 \
-o ${ALNOUT%.sam}.bam \
${ALNOUT}
# Index the alignments
echo -e "Indexing ${ALNOUT%.sam}.bam"
samtools index ${ALNOUT%.sam}.bam
# Remove the original SAM
rm ${ALNOUT}
done
An important thing about these alignments is that the option --dta
was set. This changes the way hisat2
creates alignments, and in particular requires a longer anchor sequence across spliced alignments. This will help reduce the number of potentially spurious alignments that may lead to the identification of poorly-supported transcripts.
Now that we have our alignments we can use them to generate a set of transcripts which are not represented in the reference. The reference as supplied with our data is given in genes/chrX.gtf
head chrX_data/genes/chrX.gtf
In this file we have 1) the chromosome, 2) reference, 3) feature type, 4) start/end/score/strand with the final few columns containing the gene id, transcript id and other key information. We’ll use this file as a reference for known transcripts as this can help identify transcripts which are expressed at low levels.
To create our ‘assembly’, let’s first create a directory
mkdir chrX_data/assembly
Now we can loop through our bam files to form a sample-specific assembly
for BAM in chrX_data/bam/*bam
do
echo -e "Creating assembly for ${BAM}"
LB=$(basename ${BAM%_chrX.bam})
echo -e "The output will be given the label ${LB}"
GTF=chrX_data/assembly/$(basename ${BAM%.bam}.gtf)
echo -e "The individual assembly will be written to\n${GTF}"
stringtie \
${BAM} \
-l ${LB} \
-p 2 \
-G chrX_data/genes/chrX.gtf \
-o ${GTF}
done
Now we have a new gtf file for each sample let’s quickly inspect these
head chrX_data/assembly/ERR188044_chrX.gtf
Notice that now we have StringTie
as the reference in the second column and once we check the tags in the final column we can see gene ids associated with this sample along with coverage and other measures of expression. Let’s check the next file and see how they line up
head chrX_data/assembly/ERR188104_chrX.gtf
This has found another set of transcripts, which are unique to this sample. However, we can see from the first one that it existed in the previous reference gtf.
The best practice is to merge transcripts across all samples, but let’s just quickly check out our first new transcripts in the sample ERR188044. We can use Gviz
and a few other tools to do these.
Most people would simply load the reference gtf and a stringtie gtf into IGV Browser, which is a standalone java-based browser for inspecting alignments, variants, gene models etc. It’s a bit too difficult to run from the VMs, so we’ll use R instead, which is a bit slower to setup, but enables far easier automation across multiple regions.
Start a new R Markdown:
---
title: "StringTie Visualisations"
author: "Some Brilliant Student"
date: "3^rd^ June, 2020"
output: html_document
editor_options:
chunk_output_type: console
---
library(tidyverse)
library(Gviz)
library(plyranges)
library(rtracklayer)
library(GenomicAlignments)
library(Rsamtools)
library(edgeR)
First let’s set up an R object for the bam files, known as a BamFileList
. This won’t load in any alignments, but sets up a connection to the directory on our disk. We can then use this to form a SeqInfo
object which can form the cornerstone of our other GenomicRanges
objects.
bfl <- list.files("chrX_data/bam/", pattern = "bam$", full.names = TRUE) %>%
BamFileList()
sq <- seqinfo(bfl)
Now we can load in our reference gtf. We’re only really interested in exons, so let’s restrict our import to just these relevant features.
refGR <- import.gff("chrX_data/genes/chrX.gtf", feature.type = "exon")
seqinfo(refGR) <- sq
refGR
For our comparison, we’ll just use the StringTie gtf from our first sample.
stGR <- import.gff("chrX_data/assembly/ERR188044_chrX.gtf", feature.type = "exon")
seqinfo(stGR) <- sq
Instead of loading the entire chromosome, let’s define a range to inspect.
filtRng <- GRanges("chrX:250000-304000")
seqinfo(filtRng) <- sq
Let’s have a look at what we have in our reference
refGR %>% find_overlaps(filtRng)
As you can see, this range contains the transcripts for the gene PLCXD1. Now we can check our StringTie output
stGR %>% find_overlaps(filtRng)
This output shows us that:
Let’s compare these using Gviz, however we know that for a GeneRegionTrack()
, Gviz
is a bit fussy about the columns. Let’s rearrange these for easier plotting.
mcols(refGR) <- mcols(refGR) %>%
as.data.frame() %>%
as_tibble() %>%
group_by(transcript_id) %>%
mutate(
exon = cumsum(type == "exon")
) %>%
ungroup() %>%
dplyr::select(
type,
gene = gene_id, exon , transcript = transcript_id,
symbol = gene_name
)
mcols(stGR) <- mcols(stGR) %>%
as.data.frame() %>%
as_tibble() %>%
mutate(
symbol = case_when(
is.na(ref_gene_name) ~ gene_id,
!is.na(ref_gene_name) ~ ref_gene_name
),
tx = case_when(
is.na(ref_gene_name) ~ transcript_id,
!is.na(ref_gene_name) ~ reference_id
)
) %>%
dplyr::select(
type,
gene = gene_id, exon = exon_number, transcript = tx,
symbol
)
Now we’ve tidied that up, let’s compare
gtrack <- GenomeAxisTrack(as(seqinfo(bfl), "GRanges"))
refTrack <- GeneRegionTrack(
range = refGR,
transcriptAnnotation = "transcript",
name = "Reference GTF"
)
stTrack <- GeneRegionTrack(
range = stGR,
transcriptAnnotation = "transcript",
name = "StringTie GTF"
)
plotTracks(
list(gtrack, refTrack, stTrack),
from = start(filtRng),
to = end(filtRng)
)
So it appears we potentially have a novel transcript unrelated to any gene, along with one supported RefSeq transcript of the second gene (PLCXD1) and two novel transcripts. Let’s add alignments to the plot to see who much supporting evidence we have.
atrack <- AlignmentsTrack(
range = bfl$ERR188044_chrX.bam$path,
chromosome = seqnames(sq)
)
plotTracks(
list(gtrack, atrack, stTrack),
from = start(filtRng),
to = end(filtRng)
)
Perhaps zooming in on our novel gene might be easier.
plotTracks(
list(gtrack, atrack, stTrack),
from = start(filtRng),
to = 267000
)
Now let’s try PLCXD1
plotTracks(
list(gtrack, atrack, stTrack),
from = 276000,
to = end(filtRng)
)
As you may see, some of these transcripts seem dubious, whilst others seem more reasonable. By default, for a single sample we only need one spliced read to generate a novel spliced-transcript. We can adjust this using the -c
argument. As you can see in the stringtie
manual there are multiple options for immediately throwing away spurious transcripts.
stringtie -h
The strategy we’re following today is to have an inclusive approach at the single sample level, but then set a higher bar to get over when we merge all of the individual GTF files.
Let’s now merge all of the individual assemblies and repeat our checking. The data we downloaded already contains a file called mergelist.txt
, however we have written our sample assemblies into the folder assembly
so we need to recreate this file. You can use bash
for this if you’re more comfortable, or you can use R
. Here’s the R
version, and you can just run this in the Console if you prefer.
list.files("chrX_data/assembly", pattern = "^ERR", full.names = TRUE) %>%
write_lines("chrX_data/mergelist.txt")
The final scripted bash
operation is now quite simple.
stringtie \
--merge \
-p 2 \
-G chrX_data/genes/chrX.gtf \
-o chrX_data/assembly/stringtie_merged.gtf \
chrX_data/mergelist.txt
head chrX_data/assembly/stringtie_merged.gtf
Let’s repeat the above process of comparing transcripts
stMergedGR <- import.gff("chrX_data/assembly/stringtie_merged.gtf", feature.type = "exon")
seqinfo(stMergedGR) <- sq
mcols(stMergedGR) <- mcols(stMergedGR) %>%
as.data.frame() %>%
as_tibble() %>%
mutate(
gene_name = case_when(
is.na(gene_name) ~ gene_id,
!is.na(gene_name) ~ gene_name
)
) %>%
dplyr::select(
type,
gene = gene_id, exon = exon_number, transcript = transcript_id,
symbol = gene_name
)
stMergedTrack <- GeneRegionTrack(
range = stMergedGR,
transcriptAnnotation = "transcript",
name = "StringTie Merged GTF"
)
plotTracks(
list(gtrack, refTrack, stMergedTrack),
from = start(filtRng),
to = end(filtRng)
)
Clearly we have even more transcripts, so how do we measure these across an entire dataset? Let’s use the bash
tool gffcompare
.
mkdir chrX_data/gffcompare
gffcompare \
-r chrX_data/genes/chrX.gtf \
-o chrX_data/gffcompare/merged \
chrX_data/assembly/stringtie_merged.gtf
The file chrX_data/gffcompare/merged.stats
is the one we want to inspect
cat chrX_data/gffcompare/merged.stats
## # gffcompare v0.11.2 | Command line was:
## #gffcompare -r chrX_data/genes/chrX.gtf -o chrX_data/gffcompare/merged chrX_data/assembly/stringtie_merged.gtf
## #
##
## #= Summary for dataset: chrX_data/assembly/stringtie_merged.gtf
## # Query mRNAs : 4464 in 1394 loci (4095 multi-exon transcripts)
## # (659 multi-transcript loci, ~3.2 transcripts per locus)
## # Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)
## # Super-loci w/ reference transcripts: 1065
## #-----------------| Sensitivity | Precision |
## Base level: 100.0 | 77.8 |
## Exon level: 98.3 | 74.9 |
## Intron level: 100.0 | 79.1 |
## Intron chain level: 100.0 | 45.3 |
## Transcript level: 99.2 | 46.7 |
## Locus level: 98.9 | 75.6 |
##
## Matching intron chains: 1856
## Matching transcripts: 2085
## Matching loci: 1074
##
## Missed exons: 0/8804 ( 0.0%)
## Novel exons: 1751/12160 ( 14.4%)
## Missed introns: 0/7946 ( 0.0%)
## Novel introns: 732/10046 ( 7.3%)
## Missed loci: 0/1086 ( 0.0%)
## Novel loci: 329/1394 ( 23.6%)
##
## Total union super-loci across all input datasets: 1394
## 4464 out of 4464 consensus transcripts written in chrX_data/gffcompare/merged.annotated.gtf (0 discarded as redundant)
The values for Sensitivity and Precision need a little explaining and a detailed description is available at the gffcompare homepage In short:
Let’s try a more stringent merge, where we require higher coverage to consider a transcript as a true novel transcript. The most simple filtering step might be to only consider transcripts with two or more supporting reads from a given sample
stringtie \
--merge \
-p 2 \
-c 2 \
-G chrX_data/genes/chrX.gtf \
-o chrX_data/assembly/stringtie_merged_cov2.gtf \
chrX_data/mergelist.txt
gffcompare \
-r chrX_data/genes/chrX.gtf \
-o chrX_data/gffcompare/merged_cov2 \
chrX_data/assembly/stringtie_merged_cov2.gtf
cat chrX_data/gffcompare/merged_cov2.stats
## # gffcompare v0.11.2 | Command line was:
## #gffcompare -r chrX_data/genes/chrX.gtf -o chrX_data/gffcompare/merged_cov2 chrX_data/assembly/stringtie_merged_cov2.gtf
## #
##
## #= Summary for dataset: chrX_data/assembly/stringtie_merged_cov2.gtf
## # Query mRNAs : 3925 in 1302 loci (3549 multi-exon transcripts)
## # (614 multi-transcript loci, ~3.0 transcripts per locus)
## # Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)
## # Super-loci w/ reference transcripts: 1071
## #-----------------| Sensitivity | Precision |
## Base level: 100.0 | 80.1 |
## Exon level: 98.4 | 79.9 |
## Intron level: 100.0 | 83.8 |
## Intron chain level: 100.0 | 52.3 |
## Transcript level: 99.2 | 53.1 |
## Locus level: 98.9 | 81.4 |
##
## Matching intron chains: 1856
## Matching transcripts: 2085
## Matching loci: 1074
##
## Missed exons: 0/8804 ( 0.0%)
## Novel exons: 1204/11319 ( 10.6%)
## Missed introns: 0/7946 ( 0.0%)
## Novel introns: 476/9485 ( 5.0%)
## Missed loci: 0/1086 ( 0.0%)
## Novel loci: 231/1302 ( 17.7%)
##
## Total union super-loci across all input datasets: 1302
## 3925 out of 3925 consensus transcripts written in chrX_data/gffcompare/merged_cov2.annotated.gtf (0 discarded as redundant)
As you can see, there has been a significant reduction in novel exons and introns with no loss of Sensitivity
stCov2GR <- import.gff("chrX_data/assembly/stringtie_merged_cov2.gtf", feature.type = "exon")
seqinfo(stCov2GR) <- sq
mcols(stCov2GR) <- mcols(stCov2GR) %>%
as.data.frame() %>%
as_tibble() %>%
mutate(
gene_name = case_when(
is.na(gene_name) ~ gene_id,
!is.na(gene_name) ~ gene_name
)
) %>%
dplyr::select(
type,
gene = gene_id, exon = exon_number, transcript = transcript_id,
symbol = gene_name
)
stCov2Track <- GeneRegionTrack(
range = stCov2GR,
transcriptAnnotation = "transcript",
name = "StringTie Coverage2 GTF"
)
plotTracks(
list(gtrack, refTrack, stCov2Track, stMergedTrack),
from = start(filtRng),
to = end(filtRng)
)
Be aware, strange things can happen.
filtRng <- subset(stCov2GR, gene == "MSTRG.11") %>%
range()
stCov2GR %>%
subset(gene == "MSTRG.11") %>%
mcols() %>%
as.data.frame()
plotTracks(
list(gtrack, refTrack, stCov2Track),
from = start(filtRng) - 5e3,
to = end(filtRng)
)
A key point to be aware of is that mostly, we’ve identified novel transcripts within pre-defined genes. If we quantify at the gene-level we may not see much change for these genes.
If we’re interested in transcript-level expression, we can create new sequences for each transcript and export a complete transcriptome.
gffread \
-w chrX_data/genome/transcripts.fa \
-g chrX_data/genome/chrX.fa \
chrX_data/assembly/stringtie_merged_cov2.gtf
From here, we could create an index for kallisto or salmon and use these tools to estimate transcript-level expression.
kallisto index -i chrX_data/genome/transcripts.idx chrX_data/genome/transcripts.fa
mkdir chrX_data/kallisto
for R1 in chrX_data/samples/*1.fastq.gz
do
echo -e "Found ${R1}"
R2=${R1%_1.fastq.gz}_2.fastq.gz
echo -e "The R2 file should be ${R2}"
SAMP=$(basename ${R1%_chrX_1.fastq.gz})
kallisto quant \
-i chrX_data/genome/transcripts.idx \
-o chrX_data/kallisto/${SAMP} \
-b 50 \
-t 2 \
${R1} ${R2}
done
counts <- list.dirs("chrX_data/kallisto") %>%
str_subset(pattern = "ERR") %>%
catchKallisto()
As we can see this is done quite easily and the raw estimates of transcript-level expression (i.e. pseudo-counts) are in the element counts$counts
.
head(counts$counts)
Analysis at the transcript-level is an unsolved problem in bioinformatics as using pseudo-counts leads to a variance structure which is not yet understood. One recommended approach is to divide by a transcript-level overdispersion estimate, as advised in the catchKallisto()
help page.
An alternative is to simply sum the counts across all transcripts within a gene. Given the issues we’ve already seen above, this is not as clear-cut as one might expect. After running stringtie
it’s very possible that transcripts from distinct genes have been merged into a single transcript.
tr2Gene <- mcols(stCov2GR) %>%
as.data.frame() %>%
as_tibble() %>%
distinct(gene, transcript)
dge <- counts$counts %>%
as.data.frame() %>%
rownames_to_column("transcript") %>%
pivot_longer(
cols = starts_with("chrX"),
names_to = "sample",
values_to = "counts"
) %>%
mutate(sample = basename(sample)) %>%
left_join(tr2Gene) %>%
group_by(gene, sample) %>%
summarise(counts = sum(counts)) %>%
pivot_wider(
id_cols = gene,
names_from = sample,
values_from = counts
) %>%
as.data.frame() %>%
column_to_rownames("gene") %>%
DGEList() %>%
calcNormFactors()
From here, we can treat this as a conventional DGE analysis, except we may have some new genes.