Introduction

Today we’ll look at working with scRNA data, after alignment and assignment of reads to genes. The data we’ll work with was generated using the 10X Genomics Chromium system (i.e. a droplet-based protocol) followed by the Cell Ranger pipeline, and is available from the 10X Genomics website. This pipeline has already aligned reads to the genome using STAR and counted individual UMIs for each gene. All reads were obtained from Peripheral Blood Mononuclear Cells (PBMCs) from a single healthy donor. A summary of the Cell Ranger output can be found here.

In addition to the above, all barcodes returning a UMI count of zero have been removed, making the data a little smaller and more workable than the original (>600,000 barcodes).

Outline

Today we’ll go through some QC steps with the aim of defining clusters that represent individual cell types within the original sample. Once we’ve obtained the clusters, we’ll try to identify marker genes for each cell type. Along the way we’ll also explore multiple approaches for visualising scRNA data.

Setup

Let’s create a new R Project called week_11 inside your transcriptomics folder. Once you’ve created this folder and project, head to the Terminal and, using bash, download today’s data from this link using wget.
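
If you’ve not used wget before, the call will look something like the following, where the placeholder should be replaced with the actual link given above.

wget <link-to-wk11.tar.gz>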

Once you have obtained this file, please extract it using the following command.

tar -xzvf wk11.tar.gz

This should produce a folder called raw_feature_bc_matrix and inside that folder will be three files. Please check that you have everything required using the following command in the bash Terminal (not the R Console).

ls -alh raw_feature_bc_matrix
## total 52M
## drwxrwxr-x 2 steveped steveped 4.0K May 26 18:08 .
## drwxrwxr-x 7 steveped steveped 4.0K May 28 22:36 ..
## -rw-rw-r-- 1 steveped steveped 4.3M May 26 18:08 barcodes.tsv
## -rw-rw-r-- 1 steveped steveped 794K May 26 18:08 genes.tsv
## -rw-rw-r-- 1 steveped steveped  47M May 26 18:08 matrix.mtx

After this, please check the md5sum for each file to ensure that the data has transferred correctly. Your values should match the following output exactly.

md5sum raw_feature_bc_matrix/*
## 6ab3b16bc6c70ad66e9940ca7a8b8b86  raw_feature_bc_matrix/barcodes.tsv
## b9995e9e7ba8a6d010dbb753e61b808a  raw_feature_bc_matrix/genes.tsv
## a9af001ff63a23377fd73a87aebf2952  raw_feature_bc_matrix/matrix.mtx

R Markdown

Now that we’ve made sure we have the correct data for today, and that everything is in the correct place, let’s begin our R Markdown document. Once you’ve finished your YAML header, here are my setup and loadPackages chunks.

knitr::opts_chunk$set(
    echo = TRUE, 
    message = FALSE, 
    warning = FALSE,
    fig.align = "center"
)
library(tidyverse)
library(scales)
library(DropletUtils)
library(scater)
library(scran)
library(AnnotationHub)
library(ensembldb)
library(robustbase)
library(cowplot)
library(pheatmap)

If any of the above packages are not installed on your VMs, please install using BiocManager::install("pkgName").

theme_set(theme_bw())

Data Import

The basic R object structure we work with for scRNA data is known as a SingleCellExperiment object. Once again, this is an S4 object, so the structure is fairly inflexible, and with good reason. Given that the output from the 10X Genomics Cell Ranger pipeline is fairly common, we can import it directly using the function read10xCounts(). This will take 20-30 seconds to run.

sce <- read10xCounts("raw_feature_bc_matrix")
sce
## class: SingleCellExperiment 
## dim: 33538 234600 
## metadata(1): Samples
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Now we have our object imported, we should be set up and ready to start the filtering.

Data Pre-Processing

Barcode Ranks

The first step in any scRNA analysis is to remove empty droplets, which really means removing barcodes for which no cellular information was obtained. The GEMs may not have contained a cell, may have contained free RNA, or the library preparation may simply have failed for some other reason. The most common approach is to rank each barcode by its total count and find the clear dropoff in counts, below which barcodes are likely to represent empty droplets.

bcRanks <- sce %>%
  counts() %>%
  barcodeRanks()
bcRanks

This gives a DataFrame with each barcode’s rank and the total UMI count associated with each barcode. Hidden inside this object is a metadata element, which we can access using metadata(). The values for knee and inflection refer to the points 1) where the UMI counts begin to drop off, and 2) where the drop begins to flatten out again. The most informative of these is the knee, which gives an initial idea of how many cells we may retain as informative.

metadata(bcRanks)
bcRanks %>% subset(total > metadata(.)$knee)

We can see this visually, and it seems we may lose a few informative cells if we use this as our initial filter.

bcRanks %>%
  subset(total > 1) %>%
  as.data.frame() %>%
  ggplot(aes(rank, total)) +
  geom_point(alpha = 0.5) +
  geom_hline(
    yintercept = metadata(bcRanks)$knee,
    linetype = 2,
    colour = "green"
  ) +
  geom_hline(
    yintercept = metadata(bcRanks)$inflection,
    linetype = 2,
    colour = "red"
  ) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  labs(x = "Barcode Rank", y = "Total UMI Counts")

Finding Empty Droplets

The function emptyDrops() actually performs a fairly sophisticated analysis. By default, any barcode with a total count below 100 (lower = 100) is confidently assumed not to contain a cell, and these barcodes are used to build a profile of the expected ambient RNA. A Monte Carlo (i.e. random sampling) approach is then taken to provide a p-value for the probability of a barcode representing an empty droplet.

emptyBC <- sce %>%
  counts() %>%
  emptyDrops()
emptyBC

Note that this object is specifically designed to contain one row for every barcode in our sce object.
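
We can verify this quickly.

nrow(emptyBC) == ncol(sce)

To check which barcodes were confidently called as cells, we can subset by the FDR.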

emptyBC %>%
  subset(FDR < 0.01)

As you can see, this has identified 1205 droplets as containing a cell. Given that the original (unfiltered) data contained >600,000 barcodes, we have a very high rate (>99%) of empty droplets.
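
If you’d prefer an explicit count, summing the logical test gives the same number. Note that barcodes with totals at or below lower return NA, so these need to be excluded.

sum(emptyBC$FDR < 0.01, na.rm = TRUE)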

Let’s check how this result tracks with UMI counts, and compare it with our naive knee and inflection values obtained from barcodeRanks().

emptyBC %>%
  subset(!is.na(LogProb)) %>%
  as.data.frame() %>%
  mutate(isCell = FDR < 0.01) %>%
  ggplot(aes(Total, -LogProb, colour = isCell)) +
  geom_point() +
  geom_vline(
    xintercept = unlist(metadata(bcRanks)),
    linetype = 2,
    colour = "blue"
  ) +
  scale_colour_manual(values = c("black", "red")) +
  labs(x = "Total UMI Count", y = "-logProb")

As you can see, there may be a few outliers at the top with very large UMI counts too. Let’s save this as a new SingleCellExperiment object, only retaining the droplets which contain a cell.

sceWC <- sce[,which(emptyBC$FDR < 0.01)]

Removing Other Low Quality Cells

Adding Gene Annotations

Now we have identified the droplets which contain a cell, we still have no idea about the quality of the cells within each droplet. Some may be doublets, whilst others may be low quality (i.e. broken prior to incorporation in the droplet). As we’ll use mitochondrial reads for part of this process, we’ll need to map our gene IDs to where they actually lie on the genome. Our old friend AnnotationHub will be handy for this.

ah <- AnnotationHub() %>%
  subset(species == "Homo sapiens" & rdataclass == "EnsDb")

Given that the data was made public in Nov 2018, it’s likely that Ensembl Release 93 will be close to the gene annotations used for obtaining the UMI counts. Unfortunately, 10X Genomics have not made this information public so we can’t be sure, but my testing seemed to bear this guess out.

ah %>% subset(str_detect(title, "Ensembl 93"))
ensDb <- ah[["AH64446"]]
genesGR <- genes(ensDb)

We can place the GRanges object associated with each gene in the rowRanges slot within the SingleCellExperiment object.

rowRanges(sceWC) <- genesGR[rownames(sceWC)]

Adding QC Statistics

We can now use the mitochondrial genes to perform QC.

mito <- sceWC %>%
  rowRanges() %>%
  seqnames() %>%
  str_detect("MT")
table(mito)

We can see that we have UMI counts for 13 mitochondrial genes, and we can use these to assess whether a cell contains an excessive amount of mitochondrial RNA. The function addPerCellQC() will modify the colData of the object, so that all of our QC stats are included in the object, as opposed to having multiple objects from different sources and getting ourselves confused.

Let’s check the colData(sceWC) first, and notice that we only have two columns, with the primary column being the Barcode. Once you’ve checked that, run addPerCellQC() and check the colData() once again.

sceWC <- addPerCellQC(sceWC, subsets = list(mito = mito))
colData(sceWC)

Now we have:

  • the UMI library sizes (sum/total)
  • the number of detected genes (i.e. with at least one read)
  • a breakdown of the top 50 to 200 genes as a percentage of the library
  • the details about the mitochondrial content of each droplet
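
As a quick overview of these new columns, a simple summary can be helpful. This is just a sketch using dplyr; any of the columns in the colData could be included here.

colData(sceWC) %>%
  as.data.frame() %>%
  dplyr::select(sum, detected, subsets_mito_percent) %>%
  summary()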

We can manually use these for QC, so let’s check the library sizes against the number of detected genes first.

Using Gene Information

colData(sceWC) %>%
  as.data.frame() %>%
  ggplot(aes(detected, sum)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_log10(labels = comma) + 
  scale_y_log10(labels = comma) +
  labs(x = "Detected Genes", y = "Total UMI Count")

This seems pretty consistent using visual inspection. In particular, we are looking for cells which:

  • have unusually high counts as that may indicate a doublet
  • contain low numbers of reads/detected genes, as these are likely to be poorly sampled transcriptomes

There do appear to be some cells at the low end which are poorly sampled. We can add this directly to the colData in our SingleCellExperiment object.

sceWC$lowCounts <- colData(sceWC)$detected < 200
table(sceWC$lowCounts)

Using Mitochondrial Information

Next we’ll check the mitochondrial content by plotting the number of mitochondrial genes detected against the proportion of the library which is mitochondrial.

colData(sceWC) %>%
  as.data.frame() %>%
  mutate(subsets_mito_detected = as.factor(subsets_mito_detected)) %>%
  ggplot(aes(subsets_mito_detected, subsets_mito_percent)) +
  geom_boxplot() +
  labs(x = "Number of Mitochondrial Genes Detected", y = "Mitochondrial %")

Now we have some cells which appear to be outliers, with some being >60% mitochondrial. These are likely to be damaged cells and, as such, will not be a good sample of the true transcriptional state of the cell. We could manually set a threshold anywhere we choose, and I’d be tempted by 20% given this plot.

sceWC$hiMito <- colData(sceWC)$subsets_mito_percent > 20
table(sceWC$hiMito)

By this strategy, we would lose 63 cells. Let’s see how this relates to the small library sizes.

colData(sceWC) %>%
  as.data.frame() %>%
  group_by(lowCounts, hiMito) %>%
  tally()

Combining these two filtering approaches, we would now retain 1138 cells for downstream analysis.
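
As a quick check of that number:

sum(!(sceWC$lowCounts | sceWC$hiMito))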

Detection of Doublets

An automated approach to doublet detection is also available. This simulates doublets by summing the counts from randomly selected pairs of cells, then uses these simulated profiles to provide a score for how likely each cell is to be a doublet.

sceWC$doubletScore <- doubletCells(sceWC) 
colData(sceWC) %>%
  as.data.frame() %>%
  mutate(doubletScore = log10(1 + doubletScore)) %>%
  ggplot(aes(detected, sum)) +
  geom_point(aes(colour = doubletScore)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(labels = comma) + 
  scale_y_log10(labels = comma) +
  scale_colour_viridis_c() +
  labs(x = "Detected Genes", y = "Total UMI Count")

There are a few candidates here, and a score at the top end of the range (e.g. \(>10^4\)) may be a suitable filter.

sceWC$isDoublet <- sceWC$doubletScore > 1e4

Checking all of our QC criteria we would now have 1136 cells to keep for clustering.

sceWC$discard <- colData(sceWC) %>%
  with(
    lowCounts | hiMito | isDoublet
  )
colData(sceWC) %>%
  as.data.frame() %>%
  group_by(lowCounts, hiMito, isDoublet, discard) %>%
  tally()

Automated approaches

Alternatively, we can use automated approaches, where we simply provide all of our diagnostic criteria to an R function that determines each cell’s “outlyingness”.

sceWC$isOutlier <- colData(sceWC) %>%
  as.data.frame() %>%
  dplyr::select(sum, detected, subsets_mito_percent, doubletScore) %>%
  mutate_at(vars(sum, detected), log10) %>%
  mutate(doubletScore = log10(1 + doubletScore)) %>%
  adjOutlyingness(only.outlyingness = TRUE) %>%
  isOutlier(type = "higher") 
colData(sceWC) %>%
  as.data.frame() %>%
  group_by(lowCounts, hiMito, isDoublet, discard, isOutlier) %>%
  tally()

We can compare our automated approach with the manual approach.

a <- colData(sceWC) %>%
  as.data.frame() %>%
  ggplot(aes(detected, sum)) +
  geom_point(aes(colour = isOutlier)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(labels = comma) + 
  scale_y_log10(labels = comma) +
  scale_colour_manual(values = c("black", "red")) +
  labs(x = "Detected Genes", y = "Total UMI Count")
b <- colData(sceWC) %>%
  as.data.frame() %>%
  ggplot(aes(detected, sum)) +
  geom_point(aes(colour = discard)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(labels = comma) + 
  scale_y_log10(labels = comma) +
  scale_colour_manual(values = c("black", "red")) +
  labs(x = "Detected Genes", y = "Total UMI Count")
plot_grid(a, b, nrow = 1)

Let’s proceed with a filtered version of the above dataset, using our manual filtering approach to remove low quality cells. However, we’ll also remove any genes with zero counts at this point. As we’re now working with gene-level information, we can add the largest observed count for each gene to the rowData().

rowData(sceWC)$maxCount <- sceWC %>%
  counts() %>%
  as.matrix() %>%
  rowMaxs() 

The subset() function behaves a little differently for SingleCellExperiment objects: the subset argument is applied to the rowData() element, whilst the select argument is applied to the colData() element.

sceFilt <- subset(sceWC, subset = maxCount > 0, select = !discard)
sceFilt

Normalisation

As discussed in lectures, we can take a couple of different approaches to normalisation, and today we’ll use the deconvolution approach. Under this approach we’ll perform some quick and unsophisticated clustering and calculate normalisation factors for each cluster, in a similar manner to bulk RNA-seq. The clusters are then deconvolved back to the individual cells to give cell-specific normalisation factors. There is only a slight difference between the final values obtained under this method and those from normalisation by library size, so both are generally appropriate.

sceFilt$quickClust <- quickCluster(sceFilt)
table(sceFilt$quickClust)
sizeFactors(sceFilt) <- calculateSumFactors(sceFilt, clusters = colData(sceFilt)$quickClust)
summary(sizeFactors(sceFilt))
colData(sceFilt) %>%
  as.data.frame() %>%
  mutate(size_factors = sizeFactors(sceFilt)) %>%
  ggplot(aes(sum, size_factors)) +
  geom_point() 

Now that we have found our normalisation factors, we can add our normalised counts to the SingleCellExperiment object. We do this simply by calling the function logNormCounts(), which adds these as a new assay in the original object.

sceFilt <- logNormCounts(sceFilt)
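
We can confirm that the normalised values have been added as a new assay (logcounts) alongside the raw counts.

assayNames(sceFilt)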

Dimensional Reduction

Variable genes

Now we have all that we need to start the fun part of the analysis, which is clustering the cells and identifying cell types and marker genes. The first step towards this is to select the most variable genes, as that is where the bulk of our biological signal should come from. First we need to model our gene-level means and variances.

geneVars <- modelGeneVar(sceFilt)

We can inspect the relationship between these using the metadata() element of this object. It’s a bit messy though.

geneVars %>%
  metadata() %>%
  with(
    tibble(mean, var, trend = trend(mean))
  ) %>%
  arrange(desc(var)) %>%
  mutate(hvg = seq_len(nrow(.)) <= 1000) %>%
  ggplot(aes(mean, var, colour = hvg)) +
  geom_point() +
  geom_line(aes(y = trend), colour = "blue") +
  scale_colour_manual(values = c("black", "red"))

Now we have the gene-level variance modelled, we can select our most variable genes. There are multiple approaches we can choose, but here, we’ll just choose the 1000 most variable. Note that this may exclude genes at the low end of expression.

hvg <- getTopHVGs(geneVars, n = 1000)

PCA

You might notice that the SingleCellExperiment object we’re working with has a (blank) element called reducedDimNames. This is ready to store any output from dimensional reduction, such as PCA, tSNE or UMAP. As we’re becoming used to, we just add this by calling the function runPCA() and overwriting the original object. The genes identified as hvg are passed via the subset_row argument.

sceFilt <- runPCA(sceFilt, subset_row = hvg)
sceFilt

By default, the first 50 PCs are saved.

reducedDim(sceFilt, "PCA") %>% dim()

At some point, each additional principal component just captures technical noise, so we should find how many are informative.

sceFilt %>%
  reducedDim("PCA") %>%
  attr("percentVar") %>%
  enframe(value = "Var (%)", name = NULL) %>%
  mutate(PC = seq_along(`Var (%)`)) %>%
  ggplot(aes(PC, `Var (%)`)) +
  geom_point()

As we can see, the variance captured by each PC drops off rapidly, and beyond PC5 there is minimal contribution to the overall variance. We should choose the elbow point on this plot and, although there are automated approaches, I’d be comfortable choosing 7 by looking at that plot. Let’s remove the uninformative PCs.

reducedDim(sceFilt, "PCA") <- reducedDim(sceFilt, "PCA")[,1:7]

Now let’s look at our data, just using the first two PCs.

plotReducedDim(sceFilt, dimred = "PCA")

Let’s overlay the number of detected genes to see if the clusters correlate with this simple value. Let’s also extend out to PC3.

plotReducedDim(sceFilt, dimred = "PCA", colour_by = "detected", ncomponents = 3) 

Clearly much of the separation between cells is captured by PC1 and PC2. PC3 adds minimal information. Our next step would be to apply clustering techniques and try alternate visualisations such as tSNE and UMAP.

tSNE

The \(t\)-distributed stochastic neighbour embedding (tSNE) approach is commonly used for visualisation of scRNA data. As this is non-linear and focuses on preserving the distances between neighbouring points, it can tease apart clusters more easily. Let’s first check our results from the quickCluster() output that we used for normalisation.

plotReducedDim(sceFilt, dimred = "PCA", colour_by = "quickClust")

There are a few cells that seem to group nicely, but other clusters are less compelling. Let’s now add the tSNE output to our reducedDims element. Note that it relies on a PCA having already been performed, so we can just use our existing PCA element. As there is a step involving randomising the data, I’ve set the random seed here so we should all see similar plots.

set.seed(100)
sceFilt <- runTSNE(sceFilt, dimred = "PCA")
sceFilt

Our reducedDims element now has a TSNE component and we can plot this using very similar code to above. Instead of plotting PCA, we just change the dimred argument to be "TSNE".

plotReducedDim(sceFilt, dimred = "TSNE", colour_by = "quickClust")

UMAP

An alternative to tSNE which is gaining popularity is Uniform Manifold Approximation and Projection (UMAP). The underlying mathematics is quite different; however, like tSNE, the motivation is to produce a reduced-dimensional visualisation which demonstrates the relationships between cells.

set.seed(101)
sceFilt <- runUMAP(sceFilt, dimred = "PCA")
sceFilt
plotReducedDim(sceFilt, dimred = "UMAP", colour_by = "quickClust")

There is no correct visualisation, and how we present our results is always a matter of judgement. Here we’ll compare the three approaches, and will overlay the number of detected genes to see if this is influencing our layouts at all.

plot_grid(
  plotReducedDim(sceFilt, dimred = "PCA", colour_by = "detected") + 
    theme(legend.position = "none") +
    ggtitle("PCA"),
  plotReducedDim(sceFilt, dimred = "TSNE", colour_by = "detected") + 
    theme(legend.position = "none") +
    ggtitle("tSNE"),
  plotReducedDim(sceFilt, dimred = "UMAP", colour_by = "detected") + 
    ggtitle("UMAP"),
  nrow = 1,
  rel_widths = c(5, 5, 6)
)

Clustering

As we can see above, there is no perfect layout, just our personal choice. Identifying clusters can play a key role in selecting our preferred visualisation so let’s be a bit more rigorous than our previous method. However, as with visualisations, there is no correct solution, rather there are just different perspectives, some of which may be more informative than others. Whilst there do appear to be some distinct cell types in this dataset, there are clearly a few cells which are less definitive.

Shared Neighbour Graphs

One of the most common clustering approaches is to first determine how similar each cell is to every other cell, using our retained principal components. Once we’ve determined those distances, clusters are formed from cells which appear to be similar to each other.

g <- buildSNNGraph(sceFilt, k = 10, use.dimred = "PCA")
sceFilt$snnCluster <- igraph::cluster_walktrap(g)$membership %>%
  as.factor()
table(sceFilt$snnCluster)
plotReducedDim(sceFilt, dimred = "PCA", colour_by = "snnCluster", text_by = "snnCluster") 

After you’ve had a good look at these clusters, try alternate visualisations using tSNE or UMAP as the dimensional reduction.

Do you think our clusters are a good representation of the cell types?

Checking for artefacts

Sometimes clusters can be formed from cells showing similar expression patterns, but where these patterns are driven by other artefacts. A possible checking step is to inspect the data we have and how it is distributed amongst existing clusters. Let’s see if the number of detected genes has had any impact on cluster formation.

plotColData(sceFilt, y = "detected", x = "snnCluster", colour_by = "snnCluster") +
  labs(x = "Cluster", y = "# Detected Genes")

It is entirely possible that any discrepancy in the number of detected features reflects true biology; however, it is also possible that it is a technical artefact. At this point, it is very difficult to make any clear judgement about the source of any variation.

We can also check the library sizes in much the same way.

plotColData(sceFilt, y = "total", x = "snnCluster", colour_by = "snnCluster") +
  labs(x = "Cluster", y = "Total Reads")

Cluster Modularity

If you’re unsure about the clusters, a helpful viewpoint can be cluster modularity. This essentially looks at how connected cells are to each other. If the clustering is good, most of the connectivity (i.e. the sum of edge weights) will be within clusters, with minimal connectivity between clusters. The most common approach in scRNA clustering is to check the ratio of observed edge weights to expected edge weights.

clustMod <- clusterModularity(g, sceFilt$snnCluster, as.ratio = TRUE)
clustMod

Here values of zero indicate that clusters are not connected. Given that these are ratios, plotting on the log scale can be useful but let’s add an offset of 1 to ensure our minimum value on the log scale is zero.

log2(clustMod + 1) %>%
  pheatmap(
    color = colorRampPalette(c("white", "blue"))(100),
    cluster_cols = FALSE,
    cluster_rows = FALSE
  )

As expected from our initial PCA plot, we have one very distinct cluster, whilst some neighbouring clusters in the PCA visualisation show a degree of connectivity. Whether we need to refine our clustering is a value judgement, and we’ll try this later on.

As our modularity score represents how connected modules are, we can treat this as an adjacency matrix and also display how connected each cluster is. First we’ll make a graph from our modularity scores, then we’ll plot it, overriding the default edge widths, which are set to 1.

log2(clustMod + 1) %>%
  igraph::graph_from_adjacency_matrix(
    mode = "upper", weighted = TRUE, diag = FALSE
  ) %>%
  plot(
    edge.width = igraph::E(.)$weight*5
  )

This reinforces the view that we have one distinct cluster with 2 or 3 other related clusters. Perhaps these represent the same cell-types in different states. This is where collaboration with a biologist can be helpful (once we find the marker genes for each cluster).

k-Means Clustering

An alternative approach to clustering is to use \(k\)-means, in which we define (up to) \(k\) clusters beforehand. The algorithm then finds the optimal mean for each cluster and allocates each cell to the cluster with the nearest mean. Unfortunately, there is a bias towards clusters of the same spherical radius. The kmeans() function used to perform the clustering here is actually a base R function.

set.seed(100)
km <- reducedDim(sceFilt, type = "PCA") %>%
  kmeans(centers = 10)
table(km$cluster)
sceFilt$kMeans <- as.factor(km$cluster)
plotReducedDim(sceFilt, dimred = "PCA", colour_by = "kMeans", text_by = "kMeans") 

Here we can simply explore the relationship between clusters using a dendrogram.

dist(km$centers) %>%
  hclust("ward.D2") %>%
  plot()

Additional notes regarding k-means

As it is not built from any edges or graph-based approaches, it is very easy to over-cluster using \(k\)-means. This can actually be useful for separating the extremes of a relatively continuous cluster, and enables detection of marker genes which may denote the gradient within a cluster.
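
For example, deliberately over-clustering with a larger value of \(k\) (20 here is an arbitrary choice) and cross-tabulating against our graph-based clusters can show where an existing cluster might be split.

set.seed(100)
kmOver <- reducedDim(sceFilt, "PCA") %>%
  kmeans(centers = 20)
table(kmOver$cluster, sceFilt$snnCluster)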

Unfortunately our version of R is too out of date for this, but newer versions of scran include a function where \(k\)-means can be used as an initial clustering step before generating the shared neighbour graph. This may well become a common approach in the next few years, particularly with large datasets.

Marker Gene Detection

Genes which are DE somewhere

One of the key motivations behind any scRNA analysis is to identify cell types within one or more samples, and to determine which genes define the differences between cell types or cell states. This is performed relatively simply using the function findMarkers(), which performs a simple \(t\)-test when left with default settings.

markersAny <- findMarkers(sceFilt, groups = sceFilt$snnCluster)
subset(markersAny[[1]], FDR < 0.01)

Looking at the most significant marker genes from cluster 1 immediately highlights some of the challenges.

  • How was the p-value obtained? From just one comparison, from all comparisons, or by another method?
  • Looking across the logFC for each comparison, how do we decide whether a gene is a marker for a given cell type?

The default setting for the p-value is not made clear in the manual; however, this is set to pval.type = "any". This uses a method for combining p-values (Simes’ method), testing the null hypothesis that the gene is DE nowhere against the alternative that the gene is DE somewhere. The Top column shows the minimum rank across all comparisons for that gene, such that all genes with Top = 1 were the most highly ranked in at least one comparison. This column only appears when using the default setting of pval.type = "any", and you’ll notice that it disappears as we look at different options.
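
To make Simes’ method concrete: the p-values are sorted, each is scaled by \(n/i\), and the minimum is taken as the combined p-value. A minimal sketch (the function name simesP is just for illustration):

simesP <- function(p) {
  p <- sort(p)
  min(length(p) * p / seq_along(p))
}
simesP(c(0.001, 0.2, 0.5)) # one strong comparison is enough for a small combined value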

As you can also see, no directionality has been applied to the test, but we can rerun this setting direction = "up", which will now return results for genes that are up-regulated somewhere.

markersAny <- findMarkers(sceFilt, groups = sceFilt$snnCluster, direction = "up")
subset(markersAny[[1]], FDR < 0.01)

Notice that we can still see genes that are highly-ranked for being up in one comparison, but are still down in other comparisons.

We can quickly summarise the results for all clusters using some lapply() magic.

markersAny %>%
  lapply(subset, FDR < 0.01) %>% 
  sapply(nrow)

The last command (sapply()) is like lapply() but it attempts to simplify the output. Here it’s done well and has returned a vector instead of a list.
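
A tiny illustration of the difference, using nothing beyond base R:

lapply(1:3, sqrt) # returns a list
sapply(1:3, sqrt) # simplifies the result to a numeric vector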

Genes which are DE in many comparisons

As an alternative, we can change the value to pval.type = "some", which will now test the null hypothesis that the gene is not DE in more than half of the comparisons, against the alternative that the gene is DE in more than half of the comparisons. This involves a different strategy for combining the p-values.

markersSome <- findMarkers(sceFilt, groups = sceFilt$snnCluster, direction = "up", pval.type = "some")
markersSome %>%
  lapply(subset, FDR < 0.01) %>% 
  sapply(nrow)

As this is a more stringent test, we clearly have fewer results.

subset(markersSome[[1]], FDR < 0.01)

Now our genes appear to be more consistently up-regulated in cluster 1 compared to other clusters. When checking this, it’s always good to be aware of which clusters were the most related to each other too.

Genes which are DE in all comparisons

Clearly the next approach would be to look for marker genes which are up-regulated in each cluster compared to all other clusters, and we do this by setting pval.type = "all". This uses a different method again for combining p-values and as this is the most difficult hurdle to overcome, we’d expect to see far fewer genes.

markersAll <- findMarkers(sceFilt, groups = sceFilt$snnCluster, direction = "up", pval.type = "all")
markersAll %>%
  lapply(subset, FDR < 0.01) %>% 
  sapply(nrow)

We may see that some clusters have no unique markers under this strategy.

Let’s check our first cluster again.

subset(markersAll[[1]], FDR < 0.01)

Looking at the logFC values when comparing against all other clusters, this looks like a promising set of candidate genes. From here, we should inspect the expression patterns. Before we move on though, it might be worth discussing a couple of other alternatives to the testing strategy.

Alternative Tests

In the above sections, we used a \(t\)-test to find our markers, but findMarkers() offers two alternatives. The argument test.type can take one of three options: 1) "t"; 2) "wilcox"; or 3) "binom". The Wilcoxon rank-sum test is what’s known as a non-parametric test, in that it doesn’t rely on the assumption of normality. Instead, the ranks of the logcounts are compared across clusters; however, non-parametric tests are known to be less powerful than parametric tests when the assumptions of normality are approximately satisfied.
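
For comparison, a Wilcoxon-based version of our earlier test can be run with an otherwise identical call.

markersWilcox <- findMarkers(sceFilt, groups = sceFilt$snnCluster, test.type = "wilcox", direction = "up", pval.type = "all")
markersWilcox %>%
  lapply(subset, FDR < 0.01) %>% 
  sapply(nrow)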

The third option, test.type = "binom", models the probability of a cell within a cluster expressing the gene, and this can lend an alternative viewpoint on the data.

markersBinom <- findMarkers(sceFilt, groups = sceFilt$snnCluster, test.type = "binom", direction = "up", pval.type = "all")
markersBinom %>%
  lapply(subset, FDR < 0.01) %>% 
  sapply(nrow)

This is clearly less powerful than the \(t\)-test but, as you may realise, it may be highly appropriate in some circumstances.

Visualisation

Using PCA/TSNE

Now that we’ve identified a few genes which appear unique to a given cluster, we can visualise the expression of these genes within each cluster. Two common approaches are to overlay the logcounts over our clusters, or to produce violin plots for each cluster. Let’s start by overlaying the expression values over the clusters.

gn <- rownames(markersAll[[1]])[1]
plotReducedDim(sceFilt, dimred = "PCA", colour_by = gn)

Unfortunately, if we’d like to display the gene name instead of the Ensembl ID, we have to regenerate the fill colours. However, this is a ggplot object, so if we know the fill is just the default scale_fill_viridis_c(), this is easy enough. The alternative would be to set the rownames to the gene name/symbol at the beginning of our analysis; some people do prefer this.

plotReducedDim(sceFilt, dimred = "PCA", colour_by = gn) +
  scale_fill_viridis_c(name = rowData(sceFilt)[gn, "gene_name"])

This approach is great for overlaying a single gene (or other confounders such as doubletScore or total).
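
For example, overlaying the doublet score we calculated earlier uses an identical call.

plotReducedDim(sceFilt, dimred = "PCA", colour_by = "doubletScore")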

Violin Plots

If we wish to check out more than one marker gene, we can produce violin plots.

plotExpression(sceFilt, features = gn, x = "snnCluster", colour_by = "snnCluster")
gn <- markersAll[[1]] %>%
  subset(FDR < 0.01) %>%
  rownames() %>%
  .[seq(1, min(10, length(.)))]
plotExpression(sceFilt, features = gn, x = "snnCluster", colour_by = "snnCluster")

Once again we can take advantage of our ggplot skills and remake the facets, including a labeller function. This is a common trick used to change the labels for facets without changing your data.

gnLabeller <- as_labeller(
  rowData(sceFilt)$gene_name %>% setNames(rownames(sceFilt))
  )
plotExpression(sceFilt, features = gn, x = "snnCluster", colour_by = "snnCluster") +
  facet_wrap(~Feature, labeller = gnLabeller, ncol = 2)

Let’s get the top ranked gene from each cluster and run a quick check across all of our data.

topRank <- markersAll %>%
  lapply(subset, FDR < 0.01) %>%
  lapply(rownames) %>%
  sapply(magrittr::extract, 1) %>%
  .[!is.na(.)]
gnLabeller(topRank) %>% unlist()
plotExpression(sceFilt, features = topRank, x = "snnCluster", colour_by = "snnCluster") +
  facet_wrap(~Feature, labeller = gnLabeller, ncol = 2)

Conclusion

Now you’ve thoroughly inspected your clusters, do you believe them? Have we found markers that we have faith in? How could we pick up more complex patterns?

Please change your clustering approach and see if you are able to improve your results.