Course Homepage
September 2017
This project is maintained by UofABioinformaticsHub
Above you learnt how to use a for
loop in a bash
script, and as you can see its pretty easy to run through a process one by one using a single variable.
In the task above we used:
for fasta in *.fa
do
grep -c "^>" ${fasta} >> ${fasta}.count
done
This loop takes your fasta sequence and counts the number of sequences in it.
Because every sequence starts with a header line (which also starts with “>”), you can just count the lines that start with the character >
. Easy
However, what happens when you need to iterate over two paired-fasta files? And not only that, you might need to do something specific to each file. This often happens when you have paired-end sequencing data, where you have sequence from the forward strand of the DNA insert, and a sequence from the reverse strand). Below we are going to run through an example of using paired-end fasta files.
A common task in any Bioinformatics pipeline setup, whether you’re analysing Sanger or Next-generation sequencing data, is answering the following question:
“I have a sequence, so how the hell do I find out what the hell this is??”
The answer is the Basic Local Alignment Search Tool (BLAST). BLAST is one of the most widely used tools and processes in computational biology, and its two publications that describe the process are actually both in the top 20 all time for scientific citations!!. This process uses local alignment to match a sequence against an existing sequence database. Using the large global sequence databases, from places such as NCBI’s nucleotide or protein archive (or Genbank), we can run BLAST to match any sequence.
The BLAST+ toolkit, which is available from NCBI, has the ability to run remote BLAST searches against those large databases, but for this example we are going to BLAST some paired-end sequences against its target genome.
To install blast
on Ubuntu, enter the command.
sudo apt install ncbi-blast+
If you’d like to try running the script, download the BLAST+ toolkit for your system then call an instructor over to help you install it. This may be quite challenging for Windows Users, but we’re game if you’re game.
Below is our script which:
Note how we can specify each of the paired-end reads by using variable replacement in the for loop, where ${fasta}
indicates the normal target, and ${fasta/R1/R2}
is the second target where we’ve replaced R1 with R2
#!/bin/bash
DIRECTORY=`pwd`
# Download the genome
wget -c ftp://ftp.ensemblgenomes.org/pub/plants/current/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
cd ${DIRECTORY}
# Create the blastdb
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
makeblastdb -in Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
-dbtype nucl \
-parse_seqids \
-out Athaliana_BLAST_db
for fasta in *_R1*.fa
do
# Lets run the R1 fasta files
blastn -db Athaliana_BLAST_db -query ${fasta} \
-outfmt 6 -max_target_seqs 1 -max_hsps 1 \
-out $(basename ${fasta} .fa).blastn.txt
# Lets run the R2 fasta files
blastn -db Athaliana_BLAST_db -query ${fasta/R1/R2} \
-outfmt 6 -max_target_seqs 1 -max_hsps 1 \
-out $(basename ${fasta/R1/R2} .fa).blastn.txt
done
In this section we’ll write a script for the dm6:ncrna fasta file. Briefly inspect the file before checking the script. We’re going to extract some key information from those sequence header lines.
head Drosophila_melanogaster.BDGP6.ncrna.fa
Now let’s look through the following script before we run it. There is a lot of extra information here!
#!/bin/bash
INFILE=$1
# Check the file has the suffix .fa or .fasta
SUFFIX=$(echo ${INFILE} | sed -r 's/.+(fasta|fa)$/\1/')
if [ ${SUFFIX} == "fa" ] || [ ${SUFFIX} == "fasta" ]; then
echo File has the suffix ${SUFFIX}
else
echo File does not have the suffix 'fa' or 'fasta'. Exiting with error.
exit 1
fi
# Define the output file by changing the suffix to .locations
OUTFILE=${INFILE%.${SUFFIX}}.locations
echo Output will be written to $OUTFILE
# Get the header lines which correspond to chromosomes, then collect the
# gene id, chromosome, start and end and write to the output file
egrep '^>.+chromosome' ${INFILE} | \
sed -r 's/.+BDGP6:([^:]*):([0-9]+):([0-9]+).+gene:([^ ]+).+/\4\t\1\t\2\t\3/g' \
> ${OUTFILE}
echo Done
shebang
, the first line takes a filename as input.fasta
or fa
is pulled from the filename and passed to the variable ${SUFFIX}
. What will this return if the suffix is not either of these?${SUFFIX}
is checked to make sure it contains only fa
or fasta
+ If this condition is met a message will print to stdout
+ If one of these conditions is not satisfied, the script will exit giving an error message (exit 1
)OUTFILE
) is defined by changing the suffix from whichever is provided to .locations
. The use of the %
to snip the filename and replace with another is a very useful trick
sed
sed
then captures the chromosome ([^:]*
), start ([0-9]+
), end ([0-9]+
) and gene id ([^ ]+
)
${OUTFILE}
`Save this as the file getLocations.sh
and make it executable using chmod +x
Now run it passing the .fa
file as the first argument.
./getLocations.sh Drosophila_melanogaster.BDGP6.ncrna.fa
In our final script, we’ll change this to allow specifying the file using an argument (or option). We’ll also demonstrate a file checking step.
#!/bin/bash
INFILE=""
# This version now includes the option -f for specifying the file
while getopts 'f:' opt ; do
case $opt in
f) INFILE=$OPTARG ;;
esac
done
# Check the input file exists
echo "Checking for a valid input file"
if [[ -a $INFILE ]]; then
echo "Found ${INFILE}"
else
echo "Could not find ${INFILE}. Exiting with error"
exit 1
fi
# Check the file has the suffix .fa or .fasta
SUFFIX=$(echo ${INFILE} | sed -r 's/.+(fasta|fa)$/\1/')
if [ ${SUFFIX} == "fa" ] || [ ${SUFFIX} == "fasta" ]; then
echo File has the suffix ${SUFFIX}
else
echo File does not have the suffix 'fa' or 'fasta'. Exiting with error.
exit 1
fi
# Define the output by changing the suffix to .locations
OUTFILE=${INFILE%.${SUFFIX}}.locations
echo Output will be written to $OUTFILE
# Get the header lines which correspond to chromosomes, then collect the
# gene id, chromosome, start and end and write to the output file
egrep '^>.+chromosome' ${INFILE} | \
sed -r "s/.+BDGP6:([^:]*):([0-9]+):([0-9]+).+gene:([^ ]+).+/\4\t\1\t\2\t\3/g" \
> ${OUTFILE}
echo Done
-f
. This is where the script will look for the input file.We could specify any number of options to our script.
For example, we may wish to change the BDGP6
to be more flexible for any other genome using a flag and input such as -g genomeCode
.