Instructions

Submission Format [2 marks]

This assignment is due by 5pm, Friday 19th June.

  • Submissions must be made as a zip archive containing 2 files:
    1. Your source R Markdown Document (with an Rmd suffix)
    2. A compiled pdf, showing all code
  • All file names within the zip archive must start with your student number. However, the name of the zip archive itself is not important, as myUni will likely modify it during submission. See "Creating a zip archive" below for help.

All questions are to be answered in the same R Markdown / PDF document, regardless of whether they require a plain text answer or the execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this in mind.

We strongly advise working within the folder ~/transcriptomics/assignment6 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

  1. Select both files using the Files pane in RStudio
  2. Click export
  3. They will automatically be placed into a single zip archive. Give the archive any informative name you consider suitable, but make sure it ends with the suffix .zip
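
The same archive can also be built from a terminal on the VM. The following is only a minimal sketch, assuming the zip utility is installed. The function names check_prefix and make_submission, the student number a1234567 and the archive name are hypothetical placeholders, not part of the course material.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the archive name are hypothetical.

# Succeeds only if every file name given starts with the student number.
check_prefix() {
  local id="$1"; shift
  local f
  for f in "$@"; do
    case "$(basename "$f")" in
      "$id"*) ;;                                   # name starts with the student number
      *) echo "Bad file name: $f" >&2; return 1 ;;
    esac
  done
}

# Checks the names, then bundles the files into <id>_assignment6.zip.
make_submission() {
  local id="$1"; shift
  check_prefix "$id" "$@" || return 1
  command -v zip >/dev/null || { echo "zip is not installed" >&2; return 1; }
  zip -j "${id}_assignment6.zip" "$@"              # -j stores files without their directory paths
}
```

After sourcing these functions, a call such as make_submission a1234567 a1234567_assignment6.Rmd a1234567_assignment6.pdf would refuse to build the archive if either file name lacked the student number prefix.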

Windows

If all files are on your local Windows machine:

  1. Using File Explorer, enter the folder containing both files
  2. Select all files simultaneously by using Ctrl + Click
  3. Right click on one of the files and select Send to > Compressed (zipped) folder
  4. Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your local macOS machine:

  1. Locate the items to zip in the Mac Finder (file system)
  2. Right-click on the file, folder, or files you want to zip
  3. Select “Compress Items”
  4. Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [10 marks]

Transcriptome assembly and genome assembly may appear similar to those who have not undertaken either process. Provide details on some of the important differences between the two, specifically detailing the unique challenges faced when performing a transcriptome assembly.

Question 2 [6 marks]

Trinity is a common tool used for de novo transcriptome assembly, whilst StringTie is commonly used for reference-guided assembly. Briefly describe the key steps involved in each method.

Question 3 [10 marks]

In the practicals from Week 12, several small scripts were used. Please assemble these into a complete pipeline including checking steps and error handling where appropriate.

  • Downloading data will not be required and you can start the process directly after completion of the download and extraction of the tarball.
  • The supplied hisat2 indexes can be used without question.
  • Your process should complete by generating transcript-level counts using kallisto.
  • Interpretation of any comparisons between your final stringtie-generated gtf and the reference gtf is not required.

Question 4 [12 marks]

For the data used in the Week 12 practicals, perform a gene-level differential expression analysis comparing the YRI and GBR populations using:

  1. The a) supplied reference chromosome, b) supplied reference gtf, c) hisat2-aligned reads and d) featureCounts
  2. The custom gtf generated using our assembly and pseudo-counts produced by kallisto

Compare the two sets of results and discuss. Some of the key points to address during the discussion are the detection of any novel genes, and comparison of logFC estimates obtained under both approaches. No biological interpretation of results is required.

Please note the sample-phenotype information is included in the file chrX-data/geuvadis_phenodata.csv.

Total: 40 marks