SingleRBook-base/diagnostics.Rmd at master · SingleR-inc/SingleRBook-base · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
bibliography: ref.bib
---

# Annotation diagnostics

```{r, echo=FALSE, results='asis'}
library(rebook)
chapterPreamble(cache=TRUE)
```

## Overview

In addition to the labels, `SingleR()` returns a number of helpful diagnostics about the annotation process
that can be used to determine whether the assignments are appropriate.
Unambiguous assignments corroborated by expression of canonical markers add confidence to the results;
conversely, low-confidence assignments can be pruned out to avoid adding noise to downstream analyses.
This chapter will demonstrate some of these common sanity checks on the pancreas datasets
from Chapter \@ref(more-markers) [@muraro2016singlecell;@grun2016denovo].

```{r, results='asis', echo=FALSE}
extractCached("pancreas.Rmd", "annotation", c("sceG", "pred.grun"))
```

## Based on the scores within cells

The most obvious diagnostic reported by `SingleR()` is the nested matrix of per-cell scores in the `scores` field.
This contains the correlation-based scores prior to any fine-tuning for each cell (row) and reference label (column).
Ideally, we would see unambiguous assignments where, for any given cell, one label's score is clearly larger than the others.

```{r}
pred.grun$scores[1:10,]
```

To check whether this is indeed the case,
we use the `plotScoreHeatmap()` function to visualize the score matrix (Figure \@ref(fig:score-heatmap-grun)).
Here, the key is to examine the spread of scores within each cell, i.e., down the columns of the heatmap.
Similar scores for a group of labels indicates that the assignment is uncertain for those columns,
though this may be acceptable if the uncertainty is distributed across closely related cell types.
(Note that the assigned label for a cell may not be the visually top-scoring label if fine-tuning is applied,
as the only the pre-tuned scores are directly comparable across all labels.)

```{r score-heatmap-grun, fig.cap="Heatmap of normalized scores for the Grun dataset. Each cell is a column while each row is a label in the reference Muraro dataset. The final label (after fine-tuning) for each cell is shown in the top color bar."}
library(SingleR)
plotScoreHeatmap(pred.grun)
```

We can also display other metadata information for each cell by setting `clusters=` or `annotation_col=`.
This is occasionally useful for examining potential batch effects,
differences in cell type composition between conditions,
relationship to clusters from an unsupervised analysis and so on,.
For example, Figure \@ref(fig:score-heatmap-grun-donor) displays the donor of origin for each cell;
we can see that each cell type has contributions from multiple donors,
which is reassuring as it indicates that our assignments are not (purely) driven by donor effects.

```{r score-heatmap-grun-donor, fig.cap="Heatmap of normalized scores for the Grun dataset, including the donor of origin for each cell."}
plotScoreHeatmap(pred.grun,
    annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))
```

```{r, echo=FALSE}
# Making sure this is true for the major populations.
tab <- table(sceG$donor, pred.grun$labels)
stopifnot(colSums(tab > 0)[c("acinar", "alpha", "beta", "delta", "duct")] >= 4)
```

The `scores` matrix has several caveats associated with its interpretation.
Only the pre-tuned scores are stored in this matrix, as scores after fine-tuning are not comparable across all labels.
This means that the label with the highest score for a cell may not be the cell's final label if fine-tuning is applied.
Moreover, the magnitude of differences in the scores has no clear interpretation;
indeed, `plotScoreHeatmap()` dispenses with any faithful representation of the scores
and instead adjusts the values to highlight any differences between labels within each cell.

## Based on the deltas across cells

We identify poor-quality or ambiguous assignments based on the per-cell "delta",
i.e., the difference between the score for the assigned label and the median across all labels for each cell.
Our assumption is that most of the labels in the reference are not relevant to any given cell.
Thus, the median across all labels can be used as a measure of the baseline correlation,
while the gap from the assigned label to this baseline can be used as a measure of the assignment confidence.

Low deltas indicate that the assignment is uncertain, possibly because the cell's true label does not exist in the reference.
An obvious next step is to apply a threshold on the delta to filter out these low-confidence assignments.
We use the delta rather than the assignment score as the latter is more sensitive to technical effects.
For example, changes in library size affect the technical noise and can increase/decrease all scores for a given cell,
while the delta is somewhat more robust as it focuses on the differences between scores within each cell.

`SingleR()` will set a threshold on the delta for each label using an outlier-based strategy.
Specifically, we identify cells with deltas that are small outliers relative to the deltas of other cells with the same label.
This assumes that, for any given label, most cells assigned to that label are correct.
We focus on outliers to avoid difficulties with setting a fixed threshold,
especially given that the magnitudes of the deltas are about as uninterpretable as the scores themselves.
Pruned labels are reported in the `pruned.labels` field where low-quality assignments are replaced with `NA`.

```{r}
to.remove <- is.na(pred.grun$pruned.labels)
table(Label=pred.grun$labels, Removed=to.remove)
```

However, the default pruning parameters may not be appropriate for every dataset.
For example, if one label is consistently misassigned, the assumption that most cells are correctly assigned will not be appropriate.
In such cases, we can revert to a fixed threshold by manually calling the underlying `pruneScores()` function with `min.diff.med=`.
The example below discards cells with deltas below an arbitrary threshold of 0.2,
where higher thresholds correspond to greater assignment certainty.

```{r}
to.remove <- pruneScores(pred.grun, min.diff.med=0.2)
table(Label=pred.grun$labels, Removed=to.remove)
```

This entire process can be visualized using the `plotScoreDistribution()` function,
which displays the per-label distribution of the deltas across cells (Figure \@ref(fig:score-dist-grun)).
We can use this plot to check that outlier detection in `pruneScores()` behaved sensibly.
Labels with especially low deltas may warrant some additional caution in their interpretation.

```{r score-dist-grun, fig.asp=1, fig.wide=TRUE, fig.cap="Distribution of deltas for the Grun dataset. Each facet represents a label in the Muraro dataset, and each point represents a cell assigned to that label (colored by whether it was pruned)."}
plotDeltaDistribution(pred.grun)
```

If fine-tuning was performed, we can apply an even more stringent filter
based on the difference between the highest and second-highest scores after fine-tuning.
Cells will only pass the filter if they are assigned to a label that is clearly distinguishable from any other label.
In practice, this approach tends to be too conservative as assignments involving closely related labels are heavily penalized.

```{r}
to.remove2 <- pruneScores(pred.grun, min.diff.next=0.1)
table(Label=pred.grun$labels, Removed=to.remove2)
```

## Based on marker gene expression

Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset.
The marker genes used for each label are reported in the `metadata()` of the `SingleR()` output, so we can simply retrieve them to visualize their (usually log-transformed) expression values across the test dataset.
In Figure \@ref(fig:grun-beta-heat), we use the  `plotHeatmap()` function from `r Biocpkg("scater")` to examine the expression of markers used to identify beta cells.

```{r grun-beta-heat, fig.asp=1, fig.cap="Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset. Assigned labels for each cell are shown at the top of the plot."}
all.markers <- metadata(pred.grun)$de.genes
beta.markers <- unique(unlist(all.markers$beta))
sceG$labels <- pred.grun$labels

library(scater)
plotHeatmap(sceG, order_columns_by="labels", features=beta.markers)
```

If a cell in the test dataset is confidently assigned to a particular label,
we would expect it to have strong expression of that label's markers.
We would also hope that those label's markers are biologically meaningful;
in this case, we do observe strong upregulation of insulin (_INS_) in the beta cells,
which is reassuring and gives greater confidence to the correctness of the assignment.
If the identified markers are not meaningful or not consistently upregulated,
some skepticism towards the quality of the assignments is warranted.

```{r, echo=FALSE}
# Sanity check.
stopifnot(any(grepl("^INS_", beta.markers)))
```

In practice, the heatmap may be overwhelmingly large if there too many reference-derived markers.
To resolve this, we can prune the set of markers to focus on the most interesting genes based on their test expression profiles.
Figure \@ref(fig:grun-beta-heat2) is limited to the top genes with the strongest evidence for upregulation in our test dataset using the assigned labels; such genes are effectively markers for beta cells in both the reference _and_ test datasets.
As a diagnostic plot, this is much more amenable to quick inspection to check that the expected genes are present.

```{r grun-beta-heat2, fig.asp=1, fig.cap="Heatmap of log-expression values in the Grun dataset for all marker genes upregulated in beta cells in the Muraro reference dataset, pruned to those that are also upregulated in the assigned cells in the Grun dataset. Assigned labels for each cell are shown at the top of the plot."}
# Taking the first 20 reference markers that are the top empirical markers.
library(scran)
empirical.markers <- findMarkers(sceG, sceG$labels, direction="up")
m <- match(beta.markers, rownames(empirical.markers$beta))
m <- beta.markers[rank(m) <= 20]

library(scater)
plotHeatmap(sceG, order_columns_by="labels", features=m)
```

It is straightforward to repeat this process for all labels by wrapping this code in a loop,
as shown below in Figure \@ref(fig:grun-beta-heat-all).
Note that `plotHeatmap()` is not the only function that can be used for this visualization;
we could also use `plotDots()` to create a `r CRANpkg("Seurat")`-style dot plot,
or we could use other heatmap plotting functions such as `dittoHeatmap()` from `r Biocpkg("dittoSeq")`.

```{r grun-beta-heat-all, fig.width=20, fig.height=15, fig.cap="Heatmaps of log-expression values in the Grun dataset for all marker genes upregulated in each label in the Muraro reference dataset. Assigned labels for each cell are shown at the top of each plot."}
collected <- list()
for (lab in unique(pred.grun$labels)) {
    lab.markers <- unique(unlist(all.markers[[lab]]))
    m <- match(lab.markers, rownames(empirical.markers[[lab]]))
    m <- lab.markers[rank(m) <= 20]
    collected[[lab]] <- plotHeatmap(sceG, silent=TRUE,
        order_columns_by="labels", main=lab, features=m)[[4]]
}
do.call(gridExtra::grid.arrange, collected)
```

In general, the heatmap provides a more interpretable diagnostic visualization than the plots of scores and deltas.
However, it does require more effort to inspect and may not be feasible for large numbers of labels.
It is also difficult to use a heatmap to determine the correctness of assignment for closely related labels.

## Comparing to unsupervised clustering

It can also be instructive to compare the assigned labels to the groupings generated from unsupervised clustering algorithms.
The assumption is that the differences between reference labels are also the dominant factor of variation in the test dataset;
this implies that we should expect strong agreement between the clusters and the assigned labels.
To demonstrate, we'll use the `sceG` from Chapter \@ref(pancreas-case-study)
where clusters have generated using a graph-based method [@xu2015identification].

```{r, results='asis', echo=FALSE}
extractCached("pancreas.Rmd", "clustering", c("sceG", "pred.grun"))
```

We compare these clusters to the labels generated by `r Biocpkg("SingleR")`.
Any similarity can be quantified with the adjusted rand index (ARI) with `pairwiseRand()` from the `r Biocpkg("bluster")` package.
Large ARIs indicate that the two partitionings are in agreement, though an acceptable definition of "large" is difficult to gauge;
experience suggests that a reasonable level of consistency is achieved at ARIs above 0.5.

```{r}
library(bluster)
pairwiseRand(sceG$cluster, pred.grun$labels, mode="index")
```

In practice, it is more informative to examine the distribution of cells across each cluster/label combination.
Figure \@ref(fig:grun-label-clusters) shows that most clusters are nested within labels,
a difference in resolution that is likely responsible for reducing the ARI.
Clusters containing multiple labels are particularly interesting for diagnostic purposes,
as this suggests that the differences between labels are not strong enough to drive formation of distinct clusters in the test.

```{r grun-label-clusters, fig.cap="Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Grun dataset."}
tab <- table(cluster=sceG$cluster, label=pred.grun$labels)
pheatmap::pheatmap(log10(tab+10)) # using a larger pseudo-count for smoothing.
```

```{r, echo=FALSE}
# Sanity check for the statement above.
m <- max.col(tab)
p <- tab[cbind(seq_along(m), m)]/rowSums(tab)
stopifnot(sum(p > 0.9) > 0.7)
```

The underlying assumption is somewhat reasonable in most scenarios where the labels relate to cell type identity.
However, disagreements between the clusters and labels should not be cause for much concern.
The whole point of unsupervised clustering is to identify novel variation that, by definition, is not in the reference.
It is entirely possible for the clustering and labels to be different without compromising the validity or utility of either;
the former captures new heterogeneity while the latter facilitates interpretation in the context of existing knowledge.

## Session information {-}

```{r, results='asis', echo=FALSE}
prettySessionInfo()
```