April 2019
This project is maintained by UofABioinformaticsHub
Before we move on to sequencing technologies, let’s have a look at a few important file types & methods that you’re likely to come across.
Genome browsers are applications that provide a way to view, explore and compare genomic information in a graphical environment.
Genome browsers enable researchers to visualize and browse entire genomes with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, and so on. Annotated data is usually from multiple diverse sources. They differ from ordinary biological databases in that they display data in a graphical format, with genome coordinates on one axis and the location of annotations indicated by a space-filling graphic to show the occurrence of genes and other features¹.
Often we’ll use a genome browser running on our local machines (IGV), but a good one to start with is the web-based UCSC browser. Click this link and you should see a slightly intimidating screen full of information.
You’ll be able to see:
Once you’ve got a handle on what’s there, locate the ‘hide all’ button and click that, which will just give the genomic region with no track information.
We can turn on a huge variety of “tracks” which contain genomic information that we may care about.
Let’s start by turning on the GENCODE transcripts again.
Under the Genes and Gene Predictions section, find the GENCODE v29 drop-down menu and click the arrow next to the word ‘hide’.
Change this to ‘full’ and hit one of the refresh buttons you can see scattered across the page.
Now the transcripts will appear again in a less cluttered display.
Under the hood, the browser has used this information saved as a BED file, which enables us to define genomic regions in a convenient tab-delimited format.
We’ll look at these in more detail soon.
As well as showing the transcript structure, we can also show simple genomic features like a SNP, so let’s look for the Variation region down the page a little, then set the Common SNPs(150) track to full as well. To make these changes appear, hit a refresh button again and the browser will now have this track showing. This is pretty crazy, so we can condense this using the pack option. Try this then the dense and squish options to see what difference they all make.
If you haven’t already tried it, you can click on any of the genomic features and you’ll be taken to a page containing all the key information about that feature. You can also drag your mouse over regions to zoom in, and can zoom out using the buttons at the top of the page. Type the name of your favourite gene into the search box and you’ll be able to find your way to that. If you can’t think of one, just enter IL2RA and you’ll be taken to a page full of choices. As we’re using GENCODE v29, look for that list about half way down and select one of the isoforms you can see. This will take you back to the browser, but just showing the region for the selected transcript.
Now we’ve had a brief exploration of the browser, let’s look at some file types which will enable us to upload custom features, and which are useful at numerous stages during analysis using NGS data.
There are two videos that go through the kinds of things you can do with the UCSC genome browser here and here. Take some time later to have a look at these.
BED files are a common file type for uploading your own data to the UCSC browser if you’d like to add a custom track with your own genomic features, and can also be imported into the IGV browser which we’ll explore later in the session. This format is best used for genomic regions which all represent the same type of feature (e.g. genes, promoters, sequence motifs etc). They’re also able to be used as input for numerous analytic tools, so are very useful to know about. A full description of the format is available at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
BED files are also very commonly used for interacting with a variety of NGS-related tools. We can use these to just obtain a subset of alignments from a larger file, to restrict variant calling to specific regions, etc.
The basic structure is a tab-separated file, with a minimum of three mandatory columns giving the chromosome (chrom), start (chromStart) and end (chromEnd) positions.
In this way we can simply define genomic regions of interest that we have found in our analysis, and can visualise them.
As with the vast majority of the file types we’ll come across, each line needs to have the same number of fields, with the exception of any header lines.
Unlike other file types, header lines in BED files do not start with a comment character but can only begin with the words browser or track.
The header lines are important in the context of genome browser custom annotation tracks, but most external tools will not tolerate their presence.
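To make the structure concrete, here is a minimal sketch of how a BED file could be read in Python. The function name read_bed is our own invention rather than part of any library, and the tabs in the example text are written explicitly as \t. It skips the browser and track header lines and keeps only the three mandatory columns.

```python
import io

def read_bed(handle):
    """Yield (chrom, start, end) tuples from a BED file-like object.

    read_bed is a hypothetical helper written for this illustration.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Header lines begin with the word 'browser' or 'track'
        if line.split(None, 1)[0] in {"browser", "track"}:
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"Expected at least 3 tab-separated fields: {line!r}")
        yield fields[0], int(fields[1]), int(fields[2])

example = io.StringIO(
    'track name="FOXP3 sites"\n'
    "chrX\t30671901\t30672803\n"
    "chrX\t30691567\t30692445\n"
)
regions = list(read_bed(example))
for chrom, start, end in regions:
    # BED starts are 0-based and ends are exclusive, so the length is end - start
    print(chrom, start, end, end - start)
```

Because chromStart is 0-based and chromEnd is not included in the region, subtracting the two gives the region length directly.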
Let’s start by forming our own bed file.
First cd ~ to change directory to home and then use nano (nano gk.bed) or another text editor to create a file that looks like this.
track name="FOXP3 sites"
chrX 30671901 30672803
chrX 30691567 30692445
These two regions were obtained as enriched for FOXP3 binding within the gene GK.
Go to the UCSC browser and make sure you are using the hg38 genome build.
Enter GK in the Position/Search Term text box, then just click on any of the links returned by the search.
Find ‘hide all’ and click it.
Under Genes and Gene Predictions, find GENCODE v29 and select ‘full’ using the drop-down menu.
This should just give you a generic view of the gene GK.
Now we have this view:
Click the manage custom tracks button directly below the browser.
Upload gk.bed and follow the link chrX
This will give an additional track on the browser which shows these two regions.
In addition to the mandatory columns, there are 9 optional fields which are able to be added. The order of these is fixed and these are:
name: The name of the ‘feature’
score: An integer between 0 and 1000. This can be used to control the darkness of the greyscale for each region.
strand: Can only take the values ‘.’, ‘+’ or ‘-’
thickStart: If you wish to have a feature with thick & thin sections, such as exons & introns, this sets these values
thickEnd: Same as above…
itemRgb: Set the colour of the feature in RGB format. This must be three integers between 0 and 255, separated by commas. These correspond directly to the amount of red, green or blue so 255,0,0 would correspond to red at maximum, with blue and green off. This also requires itemRgb="On" to be supplied in the header line.
blockCount, blockSizes and blockStarts: These are used to define exons (or other sub-regions) within a larger region.
Let’s add some colours to our two FOXP3 regions.
browser position chrX:30671000-30693000
track name="FOXP3 sites" itemRgb="On"
chrX 30671901 30672803 Site1 0 . 30671901 30672803 255,0,0
chrX 30691567 30692445 Site2 0 . 30691567 30692445 0,0,255
Once you’ve uploaded this, right-click the track on the browser and make sure that you have the track set to ‘full’.
Notice that we didn’t bother with the final three columns.
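If you have more than a couple of regions, typing these lines by hand gets tedious. As a rough sketch, the nine-column lines could be generated in Python like this. The helper bed9_line is hypothetical (it is not from any library), and it simply sets thickStart and thickEnd to the region itself, as we did by hand.

```python
def bed9_line(chrom, start, end, name, rgb, score=0, strand="."):
    """Return one tab-separated BED line with the first nine columns.

    bed9_line is a hypothetical helper written for this illustration.
    """
    if not 0 <= score <= 1000:
        raise ValueError("score must be an integer between 0 and 1000")
    if strand not in {".", "+", "-"}:
        raise ValueError("strand must be '.', '+' or '-'")
    if len(rgb) != 3 or any(not 0 <= v <= 255 for v in rgb):
        raise ValueError("rgb must be three integers between 0 and 255")
    # thickStart and thickEnd are set to the whole region
    fields = [chrom, start, end, name, score, strand, start, end,
              ",".join(str(v) for v in rgb)]
    return "\t".join(str(f) for f in fields)

lines = [
    "browser position chrX:30671000-30693000",
    'track name="FOXP3 sites" itemRgb="On"',
    bed9_line("chrX", 30671901, 30672803, "Site1", (255, 0, 0)),
    bed9_line("chrX", 30691567, 30692445, "Site2", (0, 0, 255)),
]
bed_text = "\n".join(lines) + "\n"
print(bed_text, end="")
```

The printed text could then be saved as gk.bed and uploaded as a custom track in the same way as before.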
There can be a little confusion about GFF and GTF files and these share some similarities with BED files. GFF (General Feature Format) files have version 2 and version 3 formats, which are slightly different. Today, we’ll just look at GTF (General Transfer Format) files, which are best considered as GFF2.2, as restrictions are placed on the type of entries that can be placed in some columns.
While BED files are generally for showing all the locations of a single type of feature, multiple feature types can be specified within one of these files. Again, like BED files, fields are tab-separated with no line provided which gives the column names. These are fixed by design, and as such explicit column names are not required.
Notice that there’s no real way to represent our FOXP3 sites as a GTF file!
This format is really designed for gene-centric features as seen in the 3rd column.
An example is given below.
Also note that header rows are not controlled, but must start with the comment character #
# Data taken from http://mblab.wustl.edu/GTF22.html
381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";
Note: People variously use GFF and GTF to talk about GFF version 2, and GFF to talk about GFF version 3. GFF2 is not compatible with GFF3, so make sure you have the correct file format if you are given a GFF file. There are conversion tools available to inter-convert them, but they are rarely reliable.
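As an illustration of the fixed column layout, the sketch below splits one GTF line into its nine fields and unpacks the final attributes column. The column names follow the GTF2.2 description linked in the example above. The function parse_gtf_line is our own hypothetical helper, and it assumes attribute values never contain a semicolon.

```python
# Column names as given in the GTF2.2 description
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dictionary.

    parse_gtf_line is a hypothetical helper written for this illustration.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    # Attributes look like: key "value"; key "value";
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record

example = '381\tTwinscan\tCDS\t380\t401\t.\t+\t0\tgene_id "001"; transcript_id "001.1";'
rec = parse_gtf_line(example)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"]["transcript_id"])
```

Real-world files often carry extra attributes, which would simply appear as additional keys in the attributes dictionary.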
The most hated format is a VCF file, which stands for Variant Call Format, but is more accurately known as Very Confusing Format.
Again, the general structure is header rows (beginning with the double comment symbol ##), followed by tab-separated columns with the actual data.
In this case, column names are provided directly above the data in a line starting with a single comment character (#).
While a flexible format, it is heavily structured with abbreviations and symbols with important meaning, e.g. phased genotypes are separated by |, while unphased ones are separated by /.
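To show how much meaning a single symbol carries, here is a small sketch that splits a genotype string on its separator and reports whether it is phased. The function parse_genotype is hypothetical, and treating a single haploid allele as unphased is our own simplifying assumption.

```python
def parse_genotype(gt):
    """Split a VCF genotype string into (allele indices, phased flag).

    parse_genotype is a hypothetical helper written for this illustration.
    """
    if "|" in gt:
        alleles, phased = gt.split("|"), True
    elif "/" in gt:
        alleles, phased = gt.split("/"), False
    else:
        # A single allele with no separator; reported as unphased here (assumption)
        alleles, phased = [gt], False
    # '.' marks a missing allele call
    return [None if a == "." else int(a) for a in alleles], phased

for gt in ["0|1", "1/1", "./."]:
    print(gt, parse_genotype(gt))
```

The integers are indices into the alleles for that variant, with 0 being the reference allele and 1 the first alternate allele.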
The example is taken from the VCFv4.2 specification, and we could spend an enormous amount of time unpacking it.
Important things to note are:
The FORMAT column defines the format of subsequent columns
FORMAT column
Most of us have seen FASTA files, and the basic format is very simple.
Information about a sequence is placed after a > symbol, and these can occur throughout the file, indicating the start of a new sequence.
Following the description line are lines of sequence data typically 50 to 80 characters long.
Sequence data can be DNA, RNA or Amino Acid data and may be upper or lower case.
>HSGLTH1 Human theta 1-globin gene fragment
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACCTAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGTCCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGACCCCTG
GCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCGGAGGA
CCGGGCGCTGGTGCGCGCCCTGTGGAAGAA
This is the format genomes are provided in by all genomic repositories such as Ensembl, NCBI and the UCSC. Each chromosome is specified by the header, with the entire sequence following.
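Since the format is so simple, it can be read with a few lines of Python. The sketch below uses a hypothetical helper, read_fasta, and two made-up toy records. For real work an established library would be a safer choice than a hand-rolled parser.

```python
import io

def read_fasta(handle):
    """Yield (description, sequence) pairs from a FASTA file-like object.

    read_fasta is a hypothetical helper written for this illustration.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new '>' line ends the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined; any text before the first '>' is ignored
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Two toy records invented for this example
example = io.StringIO(
    ">seq1 first toy record\nACGTAC\nGTAC\n>seq2 second toy record\nggcc\n"
)
records = dict(read_fasta(example))
for description, sequence in records.items():
    print(description, len(sequence))
```

Note that the sequence for each record is stitched back together from its separate lines, which is why the line length in the file does not matter.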
FASTQ files are an extension of FASTA files, and are what we usually obtain as output from our sequencing runs. We’ll spend some time exploring these later today.
SAM files are plain text Sequence Alignment/Map files, which we will also spend some time looking at later today. The binary version of a SAM file is known as a BAM file, and is the plain text information converted to the more computer-friendly binary format. This usually results in a size reduction of around 5-10 fold, and BAM files are able to be processed much more quickly by NGS tools. We’ll also have a good look at these during the course of the day.