knitr::opts_chunk$set(
  echo = TRUE,
  results = "hide",
  fig.show = "hide",
  message = FALSE,
  warning = FALSE
)

Introduction

Today we’ll look at identifying transcripts which are not included in the reference transcriptome. This is a short-read specific technique and will use the hisat2 -> stringtie workflow.

Setup

Firstly, we’ll need to go through the usual process of setting up a new R Project, and please start today’s R Project in the ~/transcriptomics directory and name the project/folder week12

Once you have this setup, go to the bash Terminal in R Studio and enter the following code to download today’s data. This should take 1-2 min as it is about 2Gb. The first two lines setup the download, whilst the next two lines will extract the tarball and remove the original file.

cd ~/transcriptomics/week12
wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
tar -xzvf chrX_data.tar.gz
rm chrX_data.tar.gz

This gives us the following structure

tree chrX_data
  • The genes in the reference transcriptome are contained in genes/chrX.gtf
  • The sequence of the reference genome is in genome/chrX.fa
  • Phenotypic information is in guevadis_phenodata.csv
  • The hisat2 indexes which we’ll use for alignment are in indexes/
  • Our raw fasta files are in samples

As we’ll be primarily in bash for today’s session, we’ll need to activate our transcriptomics conda environment. We also need to install two key tools for today’s session: 1) The aligner hisat2 and 2) The tool stringtie which we will use to identify new transcripts.

conda activate transcriptomics
conda install -c bioconda hisat2 stringtie gffcompare gffread

Choose y when prompted

Producing alignments

In previous sessions we’ve used the splice-aware aligner STAR which is one of the most popular RNA-Seq aligners. An alternative is the aligner hisat2 and the two are essentially similar. We’ve downloaded the indexed ‘genome’ (i.e. chrX) for today’s smaller dataset so we can just jump straight into the alignments.

First we’ll create a folder to write our alignments into:

mkdir chrX_data/bam

Now we can write a script to generate the alignments. In the interests of time, we’ll skip any checking steps, but please make sure you check your file paths and all capitalisation. Unfortunately, hisat2 outputs to SAM format only. Whilst we could write to stdout and redirect to a BAM

for R1 in chrX_data/samples/*1.fastq.gz
  do
    echo -e "Found ${R1}"
    R2=${R1%_1.fastq.gz}_2.fastq.gz
    echo -e "The R2 file should be ${R2}"

    ## Create output filenames
    ALNOUT=chrX_data/bam/$(basename ${R1%_1.fastq.gz}.sam)
    echo -e "Alignments will be written to:\n${ALNOUT}"

    # Align using hisat2
    hisat2 \
      -p 2 \
      --dta \
      -x chrX_data/indexes/chrX_tran \
      -1 ${R1} \
      -2 ${R2} \
      -S ${ALNOUT}
      
    # Sort the alignments and output as bam after sorting
    echo -e "Sorting ${ALNOUT}"
    samtools sort \
      -@ 2 \
      -o ${ALNOUT%.sam}.bam \
      ${ALNOUT}
      
    # Index the alignments
    echo -e "Indexing ${ALNOUT%.sam}.bam"
    samtools index ${ALNOUT%.sam}.bam
    
    # Remove the original SAM
    rm ${ALNOUT}
    
  done

An important thing about these alignments is that the option --dta was set. This changes the way hisat2 creates alignments, and in particular requires a longer anchor sequence across spliced alignments. This will help reduce the number of potentially spurious alignments that may lead to the identification of poorly-supported transcripts.

Assembling Novel Transcripts

Now that we have our alignments we can use them to generate a set of transcripts which are not represented in the reference. The reference as supplied with our data is given in genes/chrX.gtf

head chrX_data/genes/chrX.gtf

In this file we have 1) the chromosome, 2) reference, 3) feature type, 4) start/end/score/strand with the final few columns containing the gene id, transcript id and other key information. We’ll use this file as a reference for known transcripts as this can help identify transcripts which are expressed at low levels.

Running the process

To create our ‘assembly’, let’s first create a directory

mkdir chrX_data/assembly

Now we can loop through our bam files to form a sample-specific assembly

for BAM in chrX_data/bam/*bam
  do
  
    echo -e "Creating assembly for ${BAM}"
    LB=$(basename ${BAM%_chrX.bam})
    echo -e "The output will be given the label ${LB}"
    GTF=chrX_data/assembly/$(basename ${BAM%.bam}.gtf)
    echo -e "The individual assembly will be written to\n${GTF}"
    
    stringtie \
      ${BAM} \
      -l ${LB} \
      -p 2 \
      -G chrX_data/genes/chrX.gtf \
      -o ${GTF}

  done

Checking the output

Now we have a new gtf file for each sample let’s quickly inspect these

head chrX_data/assembly/ERR188044_chrX.gtf

Notice that now we have StringTie as the reference in the second column and once we check the tags in the final column we can see gene ids associated with this sample along with coverage and other measures of expression. Let’s check the next file and see how they line up

head chrX_data/assembly/ERR188104_chrX.gtf

This has found another set of transcripts, which are unique to this sample. However, we can see from the first one that it existed in the previous reference gtf.

The best practice is to merge transcripts across all samples, but let’s just quickly check out our first new transcripts in the sample ERR188044. We can use Gviz and a few other tools to do these.

Visualising New Transcripts

Most people would simply load the reference gtf and a stringtie gtf into IGV Browser, which is a standalone java-based browser for inspecting alignments, variants, gene models etc. It’s a bit too difficult to run from the VMs, so we’ll use R instead, which is a bit slower to setup, but enables far easier automation across multiple regions.

Start a new R Markdown:

---
title: "StringTie Visualisations"
author: "Some Brilliant Student"
date: "3^rd^ June, 2020"
output: html_document
editor_options: 
  chunk_output_type: console
---
library(tidyverse)
library(Gviz)
library(plyranges)
library(rtracklayer)
library(GenomicAlignments)
library(Rsamtools)
library(edgeR)

First let’s set up an R object for the bam files, known as a BamFileList. This won’t load in any alignments, but sets up a connection to the directory on our disk. We can then use this to form a SeqInfo object which can form the cornerstone of our other GenomicRanges objects.

bfl <- list.files("chrX_data/bam/", pattern = "bam$", full.names = TRUE) %>%
  BamFileList()
sq <- seqinfo(bfl)

Now we can load in our reference gtf. We’re only really interested in exons, so let’s restrict our import to just these relevant features.

refGR <- import.gff("chrX_data/genes/chrX.gtf", feature.type = "exon")
seqinfo(refGR) <- sq
refGR

For our comparison, we’ll just use the StringTie gtf from our first sample.

stGR <- import.gff("chrX_data/assembly/ERR188044_chrX.gtf", feature.type = "exon")
seqinfo(stGR) <- sq

Instead of loading the entire chromosome, let’s define a range to inspect.

filtRng <- GRanges("chrX:250000-304000")
seqinfo(filtRng) <- sq

Let’s have a look at what we have in our reference

refGR %>% find_overlaps(filtRng)

As you can see, this range contains the transcripts for the gene PLCXD1. Now we can check our StringTie output

stGR %>% find_overlaps(filtRng)

This output shows us that:

  • We have coverage across the exons
  • Stringtie IDs have been automatically given

Let’s compare these using Gviz, however we know that for a GeneRegionTrack(), Gviz is a bit fussy about the columns. Let’s rearrange these for easier plotting.

mcols(refGR) <- mcols(refGR) %>%
  as.data.frame() %>%
  as_tibble() %>%
  group_by(transcript_id) %>%
  mutate(
    exon = cumsum(type == "exon")
  ) %>%
  ungroup() %>%
  dplyr::select(
    type, 
    gene = gene_id, exon , transcript = transcript_id,
    symbol = gene_name
  )
mcols(stGR) <- mcols(stGR) %>%
  as.data.frame() %>%
  as_tibble() %>%
  mutate(
    symbol = case_when(
      is.na(ref_gene_name) ~ gene_id,
      !is.na(ref_gene_name) ~ ref_gene_name
    ),
    tx = case_when(
      is.na(ref_gene_name) ~ transcript_id,
      !is.na(ref_gene_name) ~ reference_id
    )
  ) %>%
  dplyr::select(
    type, 
    gene = gene_id, exon = exon_number, transcript = tx,
    symbol
  )

Now we’ve tidied that up, let’s compare

gtrack <- GenomeAxisTrack(as(seqinfo(bfl), "GRanges"))
refTrack <- GeneRegionTrack(
  range = refGR, 
  transcriptAnnotation = "transcript",
  name = "Reference GTF"
)
stTrack <- GeneRegionTrack(
  range = stGR, 
  transcriptAnnotation = "transcript",
  name = "StringTie GTF"
)
plotTracks(
  list(gtrack, refTrack, stTrack),
  from = start(filtRng),
  to = end(filtRng)
)

So it appears we potentially have a novel transcript unrelated to any gene, along with one supported RefSeq transcript of the second gene (PLCXD1) and two novel transcripts. Let’s add alignments to the plot to see who much supporting evidence we have.

atrack <- AlignmentsTrack(
  range = bfl$ERR188044_chrX.bam$path,
  chromosome = seqnames(sq)
)
plotTracks(
  list(gtrack, atrack, stTrack),
  from = start(filtRng),
  to = end(filtRng)
)

Perhaps zooming in on our novel gene might be easier.

plotTracks(
  list(gtrack, atrack, stTrack),
  from = start(filtRng),
  to = 267000
)

Now let’s try PLCXD1

plotTracks(
  list(gtrack, atrack, stTrack),
  from = 276000,
  to = end(filtRng)
)

As you may see, some of these transcripts seem dubious, whilst others seem more reasonable. By default, for a single sample we only need one spliced read to generate a novel spliced-transcript. We can adjust this using the -c argument. As you can see in the stringtie manual there are multiple options for immediately throwing away spurious transcripts.

stringtie -h

The strategy we’re following today is to have an inclusive approach at the single sample level, but then set a higher bar to get over when we merge all of the individual GTF files.

Merging Transcripts

Let’s now merge all of the individual assemblies and repeat our checking. The data we downloaded already contains a file called mergelist.txt, however we have written our sample assemblies into the folder assembly so we need to recreate this file. You can use bash for this if you’re more comfortable, or you can use R. Here’s the R version, and you can just run this in the Console if you prefer.

list.files("chrX_data/assembly", pattern = "^ERR", full.names = TRUE) %>%
  write_lines("chrX_data/mergelist.txt")

The final scripted bash operation is now quite simple.

stringtie \
  --merge \
  -p 2 \
  -G chrX_data/genes/chrX.gtf \
  -o chrX_data/assembly/stringtie_merged.gtf \
  chrX_data/mergelist.txt
head chrX_data/assembly/stringtie_merged.gtf

Let’s repeat the above process of comparing transcripts

stMergedGR <- import.gff("chrX_data/assembly/stringtie_merged.gtf", feature.type = "exon")
seqinfo(stMergedGR) <- sq
mcols(stMergedGR) <- mcols(stMergedGR) %>%
  as.data.frame() %>%
  as_tibble() %>%
  mutate(
    gene_name = case_when(
      is.na(gene_name) ~ gene_id,
      !is.na(gene_name) ~ gene_name
    )
  ) %>%
  dplyr::select(
    type, 
    gene = gene_id, exon = exon_number, transcript = transcript_id,
    symbol = gene_name
  )
stMergedTrack <- GeneRegionTrack(
  range = stMergedGR, 
  transcriptAnnotation = "transcript",
  name = "StringTie Merged GTF"
)
plotTracks(
  list(gtrack, refTrack, stMergedTrack),
  from = start(filtRng),
  to = end(filtRng)
)

Clearly we have even more transcripts, so how do we measure these across an entire dataset? Let’s use the bash tool gffcompare.

mkdir chrX_data/gffcompare
gffcompare \
  -r chrX_data/genes/chrX.gtf \
  -o chrX_data/gffcompare/merged \
  chrX_data/assembly/stringtie_merged.gtf

The file chrX_data/gffcompare/merged.stats is the one we want to inspect

cat chrX_data/gffcompare/merged.stats
## # gffcompare v0.11.2 | Command line was:
## #gffcompare -r chrX_data/genes/chrX.gtf -o chrX_data/gffcompare/merged chrX_data/assembly/stringtie_merged.gtf
## #
## 
## #= Summary for dataset: chrX_data/assembly/stringtie_merged.gtf 
## #     Query mRNAs :    4464 in    1394 loci  (4095 multi-exon transcripts)
## #            (659 multi-transcript loci, ~3.2 transcripts per locus)
## # Reference mRNAs :    2102 in    1086 loci  (1856 multi-exon)
## # Super-loci w/ reference transcripts:     1065
## #-----------------| Sensitivity | Precision  |
##         Base level:   100.0     |    77.8    |
##         Exon level:    98.3     |    74.9    |
##       Intron level:   100.0     |    79.1    |
## Intron chain level:   100.0     |    45.3    |
##   Transcript level:    99.2     |    46.7    |
##        Locus level:    98.9     |    75.6    |
## 
##      Matching intron chains:    1856
##        Matching transcripts:    2085
##               Matching loci:    1074
## 
##           Missed exons:       0/8804 (  0.0%)
##            Novel exons:    1751/12160    ( 14.4%)
##         Missed introns:       0/7946 (  0.0%)
##          Novel introns:     732/10046    (  7.3%)
##            Missed loci:       0/1086 (  0.0%)
##             Novel loci:     329/1394 ( 23.6%)
## 
##  Total union super-loci across all input datasets: 1394 
## 4464 out of 4464 consensus transcripts written in chrX_data/gffcompare/merged.annotated.gtf (0 discarded as redundant)

The values for Sensitivity and Precision need a little explaining and a detailed description is available at the gffcompare homepage In short:

  • Sensitivity describes how much of the initial reference has been captured
  • Precision describes how much of the detail is unique to the StringTie gtf

A More Stringent Merge

Let’s try a more stringent merge, where we require higher coverage to consider a transcript as a true novel transcript. The most simple filtering step might be to only consider transcripts with two or more supporting reads from a given sample

stringtie \
  --merge \
  -p 2 \
  -c 2 \
  -G chrX_data/genes/chrX.gtf \
  -o chrX_data/assembly/stringtie_merged_cov2.gtf \
  chrX_data/mergelist.txt
gffcompare \
  -r chrX_data/genes/chrX.gtf \
  -o chrX_data/gffcompare/merged_cov2 \
  chrX_data/assembly/stringtie_merged_cov2.gtf
cat chrX_data/gffcompare/merged_cov2.stats
## # gffcompare v0.11.2 | Command line was:
## #gffcompare -r chrX_data/genes/chrX.gtf -o chrX_data/gffcompare/merged_cov2 chrX_data/assembly/stringtie_merged_cov2.gtf
## #
## 
## #= Summary for dataset: chrX_data/assembly/stringtie_merged_cov2.gtf 
## #     Query mRNAs :    3925 in    1302 loci  (3549 multi-exon transcripts)
## #            (614 multi-transcript loci, ~3.0 transcripts per locus)
## # Reference mRNAs :    2102 in    1086 loci  (1856 multi-exon)
## # Super-loci w/ reference transcripts:     1071
## #-----------------| Sensitivity | Precision  |
##         Base level:   100.0     |    80.1    |
##         Exon level:    98.4     |    79.9    |
##       Intron level:   100.0     |    83.8    |
## Intron chain level:   100.0     |    52.3    |
##   Transcript level:    99.2     |    53.1    |
##        Locus level:    98.9     |    81.4    |
## 
##      Matching intron chains:    1856
##        Matching transcripts:    2085
##               Matching loci:    1074
## 
##           Missed exons:       0/8804 (  0.0%)
##            Novel exons:    1204/11319    ( 10.6%)
##         Missed introns:       0/7946 (  0.0%)
##          Novel introns:     476/9485 (  5.0%)
##            Missed loci:       0/1086 (  0.0%)
##             Novel loci:     231/1302 ( 17.7%)
## 
##  Total union super-loci across all input datasets: 1302 
## 3925 out of 3925 consensus transcripts written in chrX_data/gffcompare/merged_cov2.annotated.gtf (0 discarded as redundant)

As you can see, there has been a significant reduction in novel exons and introns with no loss of Sensitivity

stCov2GR <- import.gff("chrX_data/assembly/stringtie_merged_cov2.gtf", feature.type = "exon")
seqinfo(stCov2GR) <- sq
mcols(stCov2GR) <- mcols(stCov2GR) %>%
  as.data.frame() %>%
  as_tibble() %>%
  mutate(
    gene_name = case_when(
      is.na(gene_name) ~ gene_id,
      !is.na(gene_name) ~ gene_name
    )
  ) %>%
  dplyr::select(
    type, 
    gene = gene_id, exon = exon_number, transcript = transcript_id,
    symbol = gene_name
  )
stCov2Track <- GeneRegionTrack(
  range = stCov2GR, 
  transcriptAnnotation = "transcript",
  name = "StringTie Coverage2 GTF"
)
plotTracks(
  list(gtrack, refTrack, stCov2Track, stMergedTrack),
  from = start(filtRng),
  to = end(filtRng)
)

Be aware, strange things can happen.

filtRng <- subset(stCov2GR, gene == "MSTRG.11") %>%
  range()
stCov2GR %>% 
  subset(gene == "MSTRG.11") %>%
  mcols() %>%
  as.data.frame()
plotTracks(
  list(gtrack, refTrack, stCov2Track),
  from = start(filtRng) - 5e3,
  to = end(filtRng)
)

The next step

A key point to be aware of is that mostly, we’ve identified novel transcripts within pre-defined genes. If we quantify at the gene-level we may not see much change for these genes.

If we’re interested in transcript-level expression, we can create new sequences for each transcript and export a complete transcriptome.

Running kallisto

gffread \
  -w chrX_data/genome/transcripts.fa \
  -g chrX_data/genome/chrX.fa \
  chrX_data/assembly/stringtie_merged_cov2.gtf

From here, we could create an index for kallisto or salmon and use these tools to estimate transcript-level expression.

kallisto index -i chrX_data/genome/transcripts.idx chrX_data/genome/transcripts.fa
mkdir chrX_data/kallisto

for R1 in chrX_data/samples/*1.fastq.gz
  do
    echo -e "Found ${R1}"
    R2=${R1%_1.fastq.gz}_2.fastq.gz
    echo -e "The R2 file should be ${R2}"
    SAMP=$(basename ${R1%_chrX_1.fastq.gz})
    
    kallisto quant \
      -i chrX_data/genome/transcripts.idx \
      -o chrX_data/kallisto/${SAMP} \
      -b 50 \
      -t 2 \
      ${R1} ${R2}
    
  done

Importing Transcript-Level Counts

counts <- list.dirs("chrX_data/kallisto") %>%
  str_subset(pattern = "ERR") %>%
  catchKallisto()

As we can see this is done quite easily and the raw estimates of transcript-level expression (i.e. pseudo-counts) are in the element counts$counts.

head(counts$counts)

Analysis at the transcript-level is an unsolved problem in bioinformatics as using pseudo-counts leads to a variance structure which is not yet understood. One recommended approach is to divide by a transcript-level overdispersion estimate, as advised in the catchKallisto() help page.

An alternative is to simply sum the counts across all transcripts within a gene. Given the issues we’ve already seen above, this is not as clear-cut as one might expect. After running stringtie it’s very possible that transcripts from distinct genes have been merged into a single transcript.

tr2Gene <- mcols(stCov2GR) %>%
  as.data.frame() %>%
  as_tibble() %>%
  distinct(gene, transcript)
dge <- counts$counts %>%
  as.data.frame() %>%
  rownames_to_column("transcript") %>%
  pivot_longer(
    cols = starts_with("chrX"),
    names_to = "sample",
    values_to = "counts"
  ) %>%
  mutate(sample = basename(sample)) %>%
  left_join(tr2Gene) %>%
  group_by(gene, sample) %>%
  summarise(counts = sum(counts)) %>%
  pivot_wider(
    id_cols = gene,
    names_from = sample,
    values_from = counts
  ) %>%
  as.data.frame() %>%
  column_to_rownames("gene") %>%
  DGEList() %>%
  calcNormFactors()

From here, we can treat this as a conventional DGE analysis, except we may have some new genes.