Instructions

Submission Format [3 marks]

This assignment is due by 5pm, Tuesday 24th March.

  • Submissions must be made as a zip archive containing 3 files:
    1. Your source R Markdown Document (with an Rmd suffix)
    2. A compiled pdf, showing all code
    3. The signed cover sheet as required by the University [NB: This is no longer required]
  • All file names within the zip archive must start with your student number. However, the name of the zip archive itself is not important, as myUni will likely modify it during submission. See here for help creating a zip archive.

All questions are to be answered in the same R Markdown / PDF document, regardless of whether they require a plain text answer or the execution of code.

We strongly advise working in the folder ~/transcriptomics/assignment1 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

  1. Select all three files using the Files pane in RStudio
  2. Click export
  3. They will automatically be placed into a single zip archive. Give this archive whatever informative name you consider suitable, but make sure it ends with the suffix .zip
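
The same archive can also be built from the R console. The following is only a sketch: hasStudentPrefix() and makeSubmissionZip() are hypothetical helpers written for this illustration, the file names in the example call are made up, and utils::zip() relies on an external zip program being available on the machine.

```r
# Hypothetical helper: TRUE for each file whose name starts with the student number
hasStudentPrefix <- function(files, studentID) {
  startsWith(basename(files), studentID)
}

# Hypothetical helper: check the file names, then bundle the files into one archive
makeSubmissionZip <- function(files, studentID,
                              zipName = paste0(studentID, "_assignment1.zip")) {
  stopifnot(
    all(file.exists(files)),
    all(hasStudentPrefix(files, studentID))
  )
  # The "j" flag stores the files without their folder paths
  utils::zip(zipfile = zipName, files = files, flags = "-j9X")
  invisible(zipName)
}

# Example call, using made-up file names:
# makeSubmissionZip(
#   c("a1234567_assignment1.Rmd", "a1234567_assignment1.pdf"),
#   studentID = "a1234567"
# )
```

Checking the names before zipping catches the most common formatting mistake before the archive is uploaded.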

Windows

If all files are on your local Windows machine:

  1. Using File Explorer, enter the folder containing all 3 files
  2. Select all files simultaneously by using Ctrl + Click
  3. Right click on one of the files and select Send to > Compressed (zipped) folder
  4. Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your local macOS machine:

  1. Locate the items to zip in the Mac Finder (file system)
  2. Right-click on a file, folder, or files you want to zip
  3. Select “Compress Items”
  4. Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [8 marks]

Choose two different RNA types and contrast them with each other. Aspects to consider may be method of transcription, cellular location, post-transcriptional processing, biological function or any other aspect which you determine to be important.

As expected from the question, the key points here are:

  • Which RNA polymerase is involved?
  • Are they exported from the nucleus?
  • Are they processed, and if so, how are they processed?
  • Do they play specific biological roles?
  • etc

If you have addressed these or similar relevant questions for your chosen RNA types, you will receive all 8 marks.

Question 2 [3 marks]

In R, you will commonly encounter 3 types of ‘unexpected’ output: 1) Errors, 2) Warnings and 3) Messages. Describe the role of each of these and how to interpret them.

  1. Errors indicate that a process has failed and the code has not run to completion. You will need to check your data, inputs, function calls etc. and debug the problem.
  2. Warnings indicate that something unexpected has happened from the perspective of the process/function authors. Importantly, the code will have run to completion. If you did not expect the warning yourself, you will need to check your data, inputs, function calls etc. and debug the problem.
  3. Messages provide information about the process and/or data. No further attention is required unless suggested by the message. Examples of messages may be a notice of pending function deprecation, or a summary of the steps taken during a process.
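
The three condition types can be triggered deliberately to see how they behave. The following is a minimal sketch: describeValues() and firstCondition() are hypothetical helpers written for this illustration, not functions from any package.

```r
# Hypothetical function that can signal each of the three condition types
describeValues <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")         # error: execution halts here
  if (anyNA(x)) warning("missing values were removed")  # warning: execution continues
  message("Summarising ", length(x), " value(s)")       # message: purely informative
  mean(x, na.rm = TRUE)
}

# Hypothetical helper that reports which type of condition is signalled first.
# tryCatch() exits as soon as a condition is caught, so this only labels the
# condition; it does not return the result of the call.
firstCondition <- function(expr) {
  tryCatch(
    {
      force(expr)
      "none"
    },
    error = function(e) "error",
    warning = function(w) "warning",
    message = function(m) "message"
  )
}

firstCondition(describeValues("a"))       # "error"
firstCondition(describeValues(c(1, NA)))  # "warning"
firstCondition(describeValues(1:3))       # "message"

# A warning and a message do not stop the call, so a value is still returned
result <- suppressMessages(suppressWarnings(describeValues(c(1, NA, 3))))
result                                    # 2
```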

Question 3 [8 marks]

Two possible definitions of a gene are given by the high-profile journal Nature and the US National Institute of Health. Discuss the limitations of these definitions, giving particular consideration to promoters and protein products which arise from multiple distinct locations within the genome. Two interesting discussions on this subject are available in this paper and this lecture. Feel free to use these resources, or find your own. Provide references where appropriate.

Some of the many issues to address in your discussion include:

  • How do we understand a gene in its historical and modern context?
  • Is it an expressed region?
  • Does it include regulatory features?
  • Is it a contiguous region of DNA?
  • Do different isoforms influence our understanding?
  • Does linkage play a role?

No-one really addressed the concept of heredity in enough detail to get the full points available. What drives heredity? Is it sequence variation? How does this play into the concept of a gene, given that regulatory elements exist?

Question 4 [13 marks]

For this question, you will need a list of file names. Each student will be given a unique set so that everyone has their own unique problems to solve. This is specifically to encourage collaboration between students without any risk of plagiarism.

To obtain your own set of file names, please execute the following lines of code, using your own student number instead of the example given ("a1234567").

source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A1Funs.R")
makeSampleNames("a1234567")

After you have run these lines of code, you will have two objects in your workspace called sampleNames and librarySizes. These are the two objects which we will work with for the next two questions

  1. Include the above code chunk in your submission, with an informative chunk label, and using a label that does not include any white-space. [2 marks]
  2. In a plain text paragraph or sentence, use the function pander() from the package pander to present the sample names that you have using in-line code of the style `r function(objectName)` [1 mark]
  3. Using the sampleNames provided, create a tibble containing the metadata for your experiment. This tibble should be named metaData and should minimally contain the columns 1) date, 2) sex, 3) group, 4) researcher, 5) reads, and 6) sampleID. You will have to use functions from stringr and dplyr to perform this task. [7 marks]
  4. Create a table summarising the number of samples per experimental group paying attention to the spread of sample sex within each group. Use pander() to present this table in your submission, including an appropriate table caption. [3 marks]
library(pander)
library(tidyverse)
  1. Using the example code above and the mock student ID, my 9 sample names were printed using the inline code `r pander(sampleNames)`.
    This gave the output R1_03_May_2018_S1_Mut_Monique_M.fastq.gz, R1_03_May_2018_S2_Mut_Monique_M.fastq.gz, R1_03_May_2018_S3_Mut_Monique_M.fastq.gz, R1_03_May_2018_S4_WT_Monique_M.fastq.gz, R1_03_May_2018_S5_WT_Monique_M.fastq.gz, R1_03_May_2018_S6_WT_Monique_M.fastq.gz, R1_03_May_2018_S7_WT_Monique_M.fastq.gz, R1_03_May_2018_S8_WT_Monique_M.fastq.gz and R1_03_May_2018_S9_WT_Monique_m.fastq.gz.

  2. Again using the example sample names, I could see 1) the dates were all from 03_May_2018, 2) all samples were male, with one specified as a lower case m, 3) my experimental groups were Mut and WT, 4) Monique was the associated researcher for all samples, 5) all reads were R1 and 6) all sample IDs were S1 to S9. This enabled me to form a tibble using the following code:

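# Pull each metadata field out of the file names using regular expressions
metaData <- tibble(sampleNames) %>%
  mutate(
    date = str_extract(sampleNames, "[0-9]+_May_[0-9]+"),
    # Capture the sex code just before the file suffix, with the dots escaped
    sex = str_replace_all(sampleNames, ".+Monique_([Mm])\\.fastq\\.gz", "\\1"),
    # One sample used a lower case "m", so standardise to upper case
    sex = str_to_upper(sex),
    group = str_extract(sampleNames, "(Mut|WT)"),
    # Set WT as the reference level
    group = factor(group, levels = c("WT", "Mut")),
    researcher = str_extract(sampleNames, "Mon[a-z]+"),
    reads = str_extract(sampleNames, "R[12]"),
    sampleID = str_extract(sampleNames, "S[0-9]+")
  )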
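
Because every student receives different sample names, patterns written around specific values such as May or Monique will not transfer directly. One possible alternative, sketched below, splits each name on the underscores instead. It assumes that every file name follows the same field order as the mock names above; exampleNames and metaDataAlt are illustrative objects, not part of the assignment.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Three of the mock sample names shown above; your own names will differ
exampleNames <- c(
  "R1_03_May_2018_S1_Mut_Monique_M.fastq.gz",
  "R1_03_May_2018_S4_WT_Monique_M.fastq.gz",
  "R1_03_May_2018_S9_WT_Monique_m.fastq.gz"
)

metaDataAlt <- tibble(sampleNames = exampleNames) %>%
  # Drop the file suffix so that the final field contains only the sex code
  mutate(stem = str_remove(sampleNames, "\\.fastq\\.gz$")) %>%
  # Assumes exactly eight underscore-delimited fields, in this order
  separate(
    stem,
    into = c("reads", "day", "month", "year", "sampleID", "group", "researcher", "sex"),
    sep = "_"
  ) %>%
  # Rebuild the date in the same format as used in the file names
  unite("date", day, month, year, sep = "_") %>%
  # Standardise inconsistent capitalisation, such as the lower case "m"
  mutate(sex = str_to_upper(sex))
```

Running count() on each new column, for example count(metaDataAlt, sex), is a quick way to spot inconsistencies such as mixed capitalisation before building summary tables.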
  1. I then summarised my sample groups and created a table using the following code:
metaData %>%
  group_by(group, sex) %>%
  tally() %>%
  rename_all(str_to_title) %>%
  pander(
    justify = "llr",
    caption = "Breakdown of experimental samples by group and sex of the sampled animal."
  )
Breakdown of experimental samples by group and sex of the sampled animal.

Group   Sex   N
-----   ---   -
WT      M     6
Mut     M     3
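
When both sexes are present, a wide layout with one column per sex can make the spread within each group easier to read. The following is a sketch only: toyMeta is hypothetical metadata invented for this illustration, not the mock dataset above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical metadata containing both sexes
toyMeta <- tibble(
  group = factor(c("WT", "WT", "WT", "Mut", "Mut", "Mut"), levels = c("WT", "Mut")),
  sex = c("M", "F", "F", "M", "M", "M")
)

wideCounts <- toyMeta %>%
  # One row per observed group/sex combination
  count(group, sex) %>%
  # One column per sex, filling combinations that never occur with zero
  pivot_wider(names_from = sex, values_from = n, values_fill = 0)
```

The resulting object can be passed to pander() in the same way as the long-format table.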

Question 5 [6 marks]

Combine your metaData object created in Question 4 with the object librarySizes and generate a barplot of the library sizes for all samples. Colour your bars by the experimental treatment group, and ensure that all axes and other labels are of a standard suitable for publication.

Do you think that any of your metadata columns may have contributed to the variation in library sizes? Provide a clear explanation. (Please note that your answer may differ from any other student’s answer)

If you left the column name as sampleNames in your metadata object, you will need to tell left_join() how to manage this.

metaData %>%
  left_join(librarySizes, by = c("sampleNames" = "sampleName")) %>%
  ggplot(aes(sampleID, lib.size / 1e6, fill = group)) +
  geom_col() +
  geom_hline(yintercept = mean(librarySizes$lib.size) / 1e6, linetype = 2) +
  facet_grid(~group, scales = "free_x", space = "free_x") +
  labs(
    y = "Library Size (millions)",
    fill = "Genotype"
  ) +
  theme_bw()

Library Sizes for our dataset, with the mean library size shown as the dashed line

In this dataset, it appears there is minimal association between treatment group and library size as the samples are approximately equally spread around the mean library size. One sample (S9) appears to have a noticeably smaller library size and may need to be checked for any contributing factors.
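
A numerical summary can back up what is seen in the plot. The sketch below uses hypothetical values in toySizes, since the real librarySizes object is unique to each student, and the cut-off of half the median library size is an arbitrary assumption chosen for illustration rather than an established threshold.

```r
library(dplyr)
library(tibble)

# Hypothetical library sizes; the real values come from makeSampleNames()
toySizes <- tibble(
  sampleID = paste0("S", 1:6),
  group = rep(c("Mut", "WT"), each = 3),
  lib.size = c(21e6, 19e6, 20e6, 22e6, 18e6, 8e6)
)

# Mean and minimum library size (in millions) for each group
groupSummary <- toySizes %>%
  group_by(group) %>%
  summarise(
    n = n(),
    meanSize = mean(lib.size) / 1e6,
    minSize = min(lib.size) / 1e6
  )

# Flag samples falling below half the median library size (arbitrary cut-off)
lowSamples <- toySizes %>%
  filter(lib.size < 0.5 * median(lib.size)) %>%
  pull(sampleID)
```

The same summary can be repeated for any other metadata column, such as date or researcher, by changing the grouping variable.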

Results

Summary of grades for assessment 1 2020, out of a possible 10 marks

highest   median
9.27      6.46