This assignment is due by 5pm, Tuesday 24th March.
All questions are to be answered in the same R Markdown / PDF, regardless of whether they require a plain text answer or the execution of code.
We strongly advise working in the folder `~/transcriptomics/assignment1` on your virtual machine. Using an R Project for each individual assignment is also strongly advised.
If all files required for submission are contained on your VM:
.zip
If all files are on your local Windows machine:
Send to > Compressed (zipped) folder
.zip
If all the files are on your local macOS machine:
Choose two different RNA types and contrast them with each other. Aspects to consider may include the method of transcription, cellular location, post-transcriptional processing, biological function or any other aspect which you determine to be important.
As expected from the question, the key points here are:
If you have addressed these or similar relevant questions for your chosen RNA types, you will receive all 8 marks.
In R, you will commonly encounter three types of ‘unexpected’ output: 1) Errors, 2) Warnings and 3) Messages. Describe the role of each of these and how to interpret them.
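As a minimal sketch of how these three condition types behave, the base R functions `stop()`, `warning()` and `message()` can be used to raise each one. The helper `checkLibSize()` and the values passed to it are hypothetical and serve only as an illustration.

```r
# Hypothetical helper that can signal each of the three condition types
checkLibSize <- function(x) {
  if (!is.numeric(x)) stop("Library size must be numeric")       # Error: execution halts here
  if (x < 1e6) warning("Library size is below 1 million reads")  # Warning: execution continues
  message("Checked a library of ", x, " reads")                  # Message: purely informative
  invisible(x)
}

# A message is informational only, and the value is still returned
res <- checkLibSize(5e6)

# A warning flags a possible problem, but a result is still produced
res_small <- withCallingHandlers(
  checkLibSize(5e5),
  warning = function(w) {
    cat("Caught warning:", conditionMessage(w), "\n")
    invokeRestart("muffleWarning")
  }
)

# An error stops the function, so no value is returned unless the error is caught
res_err <- tryCatch(
  checkLibSize("many"),
  error = function(e) {
    cat("Caught error:", conditionMessage(e), "\n")
    NA
  }
)
```

Here only the final call fails to return a library size, which reflects the practical difference: an error means no result was produced, whereas warnings and messages accompany a result that may still need checking.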
Two possible definitions of a gene are given by the high-profile journal Nature and the US National Institutes of Health. Discuss the limitations of these definitions, giving particular consideration to promoters and protein products which arise from multiple distinct locations within the genome. Two interesting discussions on this subject are available in this paper and this lecture. Feel free to use these resources, or find your own. Provide references where appropriate.
Some of the many issues to address in your discussion include:
No-one really addressed the concept of heredity in enough detail to get the full points available. What drives heredity? Is it sequence variation? How does this play into the concept of a gene, given that regulatory elements exist?
For this question, you will need a list of file names. Each student will be given a unique set so that everyone has their own unique problems to solve. This is specifically to encourage collaboration between students without any risk of plagiarism.
To obtain your own set of file names, please execute the following lines of code, using your own student number instead of the example given (`"a1234567"`).
source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A1Funs.R")
makeSampleNames("a1234567")
After you have run these lines of code, you will have two objects in your workspace called `sampleNames` and `librarySizes`. These are the two objects which we will work with for the next two questions.
Use `pander()` from the package `pander` to present the sample names that you have, using in-line code of the style `r function(objectName)`. [1 mark]
Using the `sampleNames` provided, create a `tibble` containing the metadata for your experiment. This tibble should be named `metaData` and should minimally contain the columns 1) date, 2) sex, 3) group, 4) researcher, 5) reads, and 6) sampleID. You will have to use functions from `stringr` and `dplyr` to perform this task. [7 marks]
Use `pander()` to present this table in your submission, including an appropriate table caption. [3 marks]

library(pander)
library(tidyverse)
Using the example code above and the mock student ID, my 9 sample names were printed using the inline code `r pander(sampleNames)`.
This gave the output R1_03_May_2018_S1_Mut_Monique_M.fastq.gz, R1_03_May_2018_S2_Mut_Monique_M.fastq.gz, R1_03_May_2018_S3_Mut_Monique_M.fastq.gz, R1_03_May_2018_S4_WT_Monique_M.fastq.gz, R1_03_May_2018_S5_WT_Monique_M.fastq.gz, R1_03_May_2018_S6_WT_Monique_M.fastq.gz, R1_03_May_2018_S7_WT_Monique_M.fastq.gz, R1_03_May_2018_S8_WT_Monique_M.fastq.gz and R1_03_May_2018_S9_WT_Monique_m.fastq.gz.
Again using the example sample names, I could see 1) the dates were all from `03_May_2018`, 2) all samples were male, with one specified as a lower case `m`, 3) my experimental groups were `Mut` and `WT`, 4) `Monique` was the associated researcher for all samples, 5) all reads were `R1` and 6) all sample IDs were `S1` to `S9`. This enabled me to form a `tibble` using the following code:
metaData <- tibble(sampleNames) %>%
  mutate(
    # Dates follow the pattern day_month_year, e.g. 03_May_2018
    date = str_extract(sampleNames, "[0-9]+_May_[0-9]+"),
    # Sex is the single letter before the file suffix; standardise to upper case
    sex = str_replace_all(sampleNames, ".+Monique_([Mm])\\.fastq\\.gz", "\\1"),
    sex = str_to_upper(sex),
    # Set WT as the reference level
    group = str_extract(sampleNames, "(Mut|WT)"),
    group = factor(group, levels = c("WT", "Mut")),
    researcher = str_extract(sampleNames, "Mon[a-z]+"),
    reads = str_extract(sampleNames, "R[12]"),
    sampleID = str_extract(sampleNames, "S[0-9]+")
  )
metaData %>%
  group_by(group, sex) %>%
  tally() %>%
  rename_all(str_to_title) %>%
  pander(
    justify = "llr",
    caption = "Breakdown of experimental samples by group and sex of the sampled animal."
  )
Group | Sex | N |
---|---|---|
WT | M | 6 |
Mut | M | 3 |
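The same fields could also be recovered without naming the researcher or month in the patterns. The following is a minimal base R sketch, assuming every file name has the same eight underscore-separated fields in the same order as the example names above. The object names `exampleNames`, `fieldMat` and `parsed` are illustrative only, and your own sample names may not follow this layout exactly.

```r
# Three of the example file names from the output shown above
exampleNames <- c(
  "R1_03_May_2018_S1_Mut_Monique_M.fastq.gz",
  "R1_03_May_2018_S4_WT_Monique_M.fastq.gz",
  "R1_03_May_2018_S9_WT_Monique_m.fastq.gz"
)

# Remove the file suffix, then split each name on the underscores
fields <- strsplit(sub("\\.fastq\\.gz$", "", exampleNames), "_")

# Assumption: every name has exactly 8 fields in the same order
stopifnot(all(lengths(fields) == 8))
fieldMat <- do.call(rbind, fields)

parsed <- data.frame(
  sampleNames = exampleNames,
  date = paste(fieldMat[, 2], fieldMat[, 3], fieldMat[, 4], sep = "_"),
  sex = toupper(fieldMat[, 8]),                            # standardise "m" to "M"
  group = factor(fieldMat[, 6], levels = c("WT", "Mut")),  # WT as the reference level
  researcher = fieldMat[, 7],
  reads = fieldMat[, 1],
  sampleID = fieldMat[, 5],
  stringsAsFactors = FALSE
)
parsed
```

Splitting by position is compact, but it fails as soon as one file name deviates from the expected layout, which is why the length check is included before the fields are used.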
Combine your `metaData` object created in Question 4 with the object `librarySizes` and generate a barplot of the library sizes for all samples. Colour your bars by the experimental treatment group, and ensure that all axes and other labels are of a standard suitable for publication.
Do you think that any of your metadata columns may have contributed to the variation in library sizes? Provide a clear explanation. (Please note that your answer may differ from any other student’s answer.)
If you left the column name as `sampleNames` in your metadata object, you will need to tell `left_join()` how to manage this.
metaData %>%
  # The key column is named differently in the two objects
  left_join(librarySizes, by = c("sampleNames" = "sampleName")) %>%
  ggplot(aes(sampleID, lib.size / 1e6, fill = group)) +
  geom_col() +
  # Dashed line marks the mean library size across all samples
  geom_hline(yintercept = mean(librarySizes$lib.size) / 1e6, linetype = 2) +
  facet_grid(~group, scales = "free_x", space = "free_x") +
  labs(
    x = "Sample ID",
    y = "Library Size (millions)",
    fill = "Genotype"
  ) +
  theme_bw()
In this dataset, it appears there is minimal association between treatment group and library size, as the samples are approximately equally spread around the mean library size. One sample (`S9`) appears to have a noticeably smaller library size and may need to be checked for any contributing factors.
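A visual impression like this can be backed up with a quick numeric summary. The sketch below uses made-up library sizes in a hypothetical data frame called `toy`, not the values in `librarySizes`, and the 60% cut-off for flagging a small library is an arbitrary choice for illustration.

```r
# Made-up library sizes, for illustration only
toy <- data.frame(
  sampleID = paste0("S", 1:6),
  group = factor(rep(c("Mut", "WT"), each = 3), levels = c("WT", "Mut")),
  lib.size = c(6.1e6, 7.0e6, 6.5e6, 6.8e6, 6.4e6, 3.2e6),
  stringsAsFactors = FALSE
)

# Mean library size (in millions) within each group
groupMeans <- aggregate(lib.size ~ group, data = toy, FUN = function(x) mean(x) / 1e6)
groupMeans

# Flag samples with a library size below 60% of the overall mean
# (arbitrary threshold, chosen only for this sketch)
toy$low <- toy$lib.size < 0.6 * mean(toy$lib.size)
toy[toy$low, ]
```

Comparing group means alongside a per-sample flag helps separate a group-wide effect from a single unusual sample, which are two different explanations for variation in library size.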
highest | median |
---|---|
9.27 | 6.46 |