DSAVE package

Johan Gustafsson, Juan Inda

2019-12-17

1 Overview

DSAVE (Down-SAmpling based Variation Estimation) is a package that can be used for investigating the cell-to-cell variation in single-cell RNA-Seq data. Specifically, it supports the following functions:

The package works on cell populations containing similar cells. The cells are expected to be classified into cell types if the dataset has a mix of cell types, and the total cell population need to be divided into populations per cell type to give meaningful results.

2 Total Variation - estimation of cell population size needed for pooled expression profiles

DSAVE Total Variation Estimation - This metric measures the total variation, including sampling noise, for the mean expression over a gene range for a pool of single-cells of a certain size. The metric is useful for determining the pool size needed to obtain the same variation as in for example typical bulk RNA-Seq data. The calculations include generation of sampling noise only (SNO) datasets, which are randomly generated by sampling from a multinomial distribution with probabilities calculated from the mean expression of the cell population.

library(ggplot2)
library(DSAVE)
#> Registered S3 method overwritten by 'R.oo':
#>   method        from       
#>   throw.default R.methodsS3
library(progress)
library(Seurat)
library("plotly")
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout
extrDir <- downloadData("http://cf.10xgenomics.com/samples/cell-exp/1.1.0/b_cells/b_cells_filtered_gene_bc_matrices.tar.gz", "B10k")
dataDir = paste0(extrDir,"/filtered_matrices_mex/hg19")
bcells <- Read10X(data.dir = dataDir)

extrDir <- downloadData("http://cf.10xgenomics.com/samples/cell-exp/2.1.0/t_4k/t_4k_filtered_gene_bc_matrices.tar.gz", "T4k")
dataDir = paste0(extrDir,"/filtered_gene_bc_matrices/GRCh38")
tcells <- Read10X(data.dir = dataDir)

extrDir <- downloadData("http://cf.10xgenomics.com/samples/cell-exp/1.1.0/fresh_68k_pbmc_donor_a/fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz", "PBMC68k")
dataDir = paste0(extrDir,"/filtered_matrices_mex/hg19")
pbmc <- Read10X(data.dir = dataDir)
#only use the 20000 first cells to speed up the demo

pbmc = pbmc[,1:20000]
pbmcCellTypes = ctPbmc68k[1:20000]
scTotalVariation_bcells <- DSAVEGetTotalVariationPoolSize(bcells,
        upperBound = 1e5, lowerBound = 5e-1)

scTotalVariation_tcells <- DSAVEGetTotalVariationPoolSize(tcells,
          upperBound = 1e5, lowerBound = 5e-1)

DSAVEPlotTotalVariation(list(scTotalVariation_bcells, scTotalVariation_tcells), list("B cells", "T cells"))

We can see that we need to pool around 1500 T cells to on average get down to bulk variation level, while we need twice as many of the B cells, likely depending on that the sampling noise is higher in the B10k dataset due to fewer reads. It is also to be noted that the number of cells that is needed in a pool highly depends on gene expression. For highly expressed genes, a few hundreds are enough, while for lowly expressed genes several thousands are needed.

3 Variation score

The DSAVE variation score is a metric describing the non-sampling part of the cell-to-cell variation in single-cell RNA-Seq data. It is useful both as a technical quality metric (including the cell type annotation process if applicable) and as a way to compare different clustering methods on a specific dataset.

To calculate the score, DSAVE divides the variation into two components; sampling noise and BTM (Biological, Technical and Misclassification) variation. The sampling noise is a function of the number of reads per cell, while the BTM variation contains interesting information about technical artefacts, biological variation, and cell misclassifications. The sampling noise typically varies between datasets and is often the dominating variation factor, often obscuring the BTM variation if not separated with a method such as DSAVE.

BTM variation can be explained the following way: For each mapped read from a cell in a cell population, there is a certain chance that this will belong to a certain gene. This probability is based on the mean expression of that gene in the entire population. If no BTM variation exists, the reads will be sampled according to these gene probabilities the same way in all cells, and we will get only sampling noise. In this context, BTM variation is the variation between cells in the probabilities used when sampling reads.

The function uses a template, which is used for making the metric comparable across cell populations and even datasets. The template specifies how the algorithm should align the data (downsample etc.) before measuring the variation. We provide 3 different templates for human data as part of this package: The standard template, for populations of 2000 cells or more, and two variants that require only 1000 and 500 cells, respectively.

We have shown that if the BTM variation of a whole cell population is large compared to other datasets, the reason for this is most likely technical; biological variation is often smaller when compared over the whole gene range.

The metric’s use as a technical quality metric is shown below, while the use for comparing clustering methods is shown further down in this document:

templInfo <- DSAVEGetStandardTemplate()
tcells_btm <- DSAVECalcBTMScore(tcells, templInfo, skipAlignment=FALSE,
                                iterations = 15, useLogTransform=FALSE, 
                                logTPMAddon=1, silent=FALSE)
DSAVEPlotDSAVEScore(tcells_btm$DSAVEScore, "T4k", DSAVE::datasetScoresHuman)

We can see that the T4k dataset has a BTM variation that is comparable to the other datasets and that the technical quality thus is what can be expected for single-cell RNA-Seq.

5 Cell-wise variation - detect misclassified cells


so = CreateSeuratObject(counts = pbmc, project = "pbmc", min.cells = 0, min.features = 0)
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
so <- NormalizeData(so, normalization.method = "LogNormalize", scale.factor = 10000)

#Finds the most variable genes
so <- FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000)

#Make mean and variance the same for all genes:
all.genes <- rownames(so)
so <- ScaleData(so, features = all.genes)
#> Centering and scaling data matrix

#Principal component analysis
so <- RunPCA(so, features = VariableFeatures(object = so))
#> PC_ 1 
#> Positive:  CST3, SPI1, LST1, SERPINA1, LYZ, FCN1, CFD, RP11-290F20.3, IFI30, CD68 
#>     AIF1, MS4A7, PILRA, HCK, TMEM176B, TYMP, FCER1G, CFP, HLA-DRB1, LRRC25 
#>     HLA-DRA, TYROBP, HLA-DRB5, HLA-DPA1, IGSF6, S100A9, HMOX1, HLA-DPB1, CPVL, SAT1 
#> Negative:  RPS6, RPL13, LTB, RPS2, IL32, CD7, RPL34, IL7R, CCR7, JUN 
#>     CD8B, AQP3, CTSW, RGCC, CCL5, GZMK, RP11-291B21.2, PASK, CD8A, NGFRAP1 
#>     MYC, FKBP11, DUSP2, JUNB, CPA5, GZMA, CRIP2, XBP1, KLRG1, LYAR 
#> PC_ 2 
#> Positive:  NKG7, GNLY, CST7, GZMB, GZMA, FGFBP2, CCL5, GZMH, PRF1, CTSW 
#>     HOPX, FCGR3A, SPON2, CLIC3, CCL4, SRGN, TYROBP, MATK, PRSS23, CD63 
#>     S1PR5, PFN1, KLRD1, IGFBP7, GPR56, IFITM2, C1orf21, RHOC, FCRL6, TMSB4X 
#> Negative:  LTB, RPL13, RPS2, CD79A, HLA-DRA, RPL34, TCL1A, MS4A1, RPLP1, CD79B 
#>     RPS6, HLA-DQA1, LINC00926, VPREB3, FCER2, HLA-DQA2, JUNB, HLA-DMA, LY86, HLA-DOB 
#>     SPIB, HLA-DMB, BANK1, FCRLA, EAF2, HVCN1, CCR7, HLA-DQB1, HLA-DRB1, CD22 
#> PC_ 3 
#> Positive:  CD79A, MS4A1, TCL1A, CD79B, HLA-DRA, HLA-DQA1, HLA-DPB1, CD74, HLA-DPA1, HLA-DRB1 
#>     HLA-DQA2, LINC00926, HLA-DRB5, VPREB3, SPIB, FCER2, BANK1, HLA-DQB1, HLA-DMB, HLA-DOB 
#>     PDLIM1, HLA-DMA, FCRLA, GZMB, IRF8, HVCN1, IGLL5, BLK, CD22, EAF2 
#> Negative:  JUNB, AIF1, RP11-290F20.3, SERPINA1, CFD, RPL13, RPS2, IL7R, TMEM176B, RPS6 
#>     FCN1, RPL34, LST1, MS4A7, PILRA, S100A6, S100A9, IFI30, HES4, CFP 
#>     LRRC25, HMOX1, CDKN1C, LILRB2, COTL1, C1QA, LTB, AQP3, S100A11, VIM 
#> PC_ 4 
#> Positive:  PPBP, PF4, SDPR, CLU, GNG11, NRGN, HIST1H2AC, ACRBP, CMTM5, TREML1 
#>     NCOA4, TUBB1, GP9, SPARC, RUFY1, AP001189.4, ITGA2B, PTCRA, CD9, MYL9 
#>     CLDN5, MPP1, MARCH2, MAP3K7CL, AC147651.3, SNCA, TMEM40, RGS18, C2orf88, PARVB 
#> Negative:  RPS2, RPL13, RPS6, TMSB10, RPLP1, NKG7, CYBA, GNLY, RPL34, CST7 
#>     GZMA, FGFBP2, GZMB, TYROBP, GZMH, FCGR3A, CTSW, PRF1, HOPX, CD74 
#>     HLA-DRB1, CCL4, HLA-DRB5, HLA-DPB1, SPON2, CLIC3, HLA-DRA, IFITM2, HLA-DPA1, MATK 
#> PC_ 5 
#> Positive:  MS4A7, RP11-290F20.3, VMO1, CD79B, FCGR3A, HMOX1, CDKN1C, HES4, CKB, LILRA3 
#>     PILRA, CD79A, ICAM4, SERPINA1, C1QA, SIGLEC10, LYN, MS4A1, LRRC25, C5AR1 
#>     TESC, CD68, TCL1A, IFITM3, GPBAR1, CUX1, MTSS1, CXCL16, MAFB, LYPD2 
#> Negative:  LGALS2, FCER1A, S100A8, S100A9, CLEC10A, LYZ, CST3, MS4A6A, GRN, CD14 
#>     GPX1, CD1C, ENHO, ALDH2, LGALS1, RNASE6, BLVRB, CAPG, S100A4, SERPINF1 
#>     S100A6, CPVL, TYROBP, VCAN, CEBPD, IGSF6, CSF3R, HLA-DMA, RAB32, GSN

so[["extCellTypes"]] = pbmcCellTypes

so <- RunUMAP(so, dims = 1:10)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
#> 14:58:13 UMAP embedding parameters a = 0.9922 b = 1.112
#> 14:58:13 Read 20000 rows and found 10 numeric columns
#> 14:58:13 Using Annoy for neighbor search, n_neighbors = 30
#> 14:58:13 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 14:58:18 Writing NN index file to temp file C:\Users\gustajo\AppData\Local\Temp\RtmpKQCPLw\file4988133e3f7e
#> 14:58:18 Searching Annoy index using 1 thread, search_k = 3000
#> 14:58:29 Annoy recall = 100%
#> 14:58:30 Commencing smooth kNN distance calibration using 1 thread
#> 14:58:32 Initializing from normalized Laplacian + noise
#> 14:58:34 Commencing optimization for 200 epochs, with 855134 positive edges
#> 14:59:03 Optimization finished
DimPlot(so, reduction = "umap", group.by = "extCellTypes")


#show data extraction from Seurat:
so <- FindNeighbors(so, dims = 1:10)
#> Computing nearest neighbor graph
#> Computing SNN
so <- FindClusters(so, resolution = 0.5)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 20000
#> Number of edges: 616555
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8922
#> Number of communities: 11
#> Elapsed time: 6 seconds
DimPlot(so, reduction = "umap")

We show both the cell classification from the authors and the clustering performed by a newer version of Seurat. Although similar, the newer clustering agrees better with the UMAP.


dend = pbmc[, pbmcCellTypes == "Dendritic"]

dend_cell_divergence <- DSAVEGetSingleCellDivergence(dend,
                                                     minUMIsPerCell = 200, tpmLowerBound = 0,
                                                     iterations = 15,silent=FALSE)

DSAVEPlotDivergence(dend, dend_cell_divergence)

The divergence plot shows each cell’s UMI counts plotted against their divergence The most divergent cells are the ones that have the lowest (i.e. most negative) divergence. By investigating the leftmost cells by hoovering over them with the mouse pointer, we can see which 5 genes that contribute the most to the divergence, and thus get an idea about which types of cells aremost divergent. For example, some cells have MT genes as most divergent genes, which may suggest that those are low quality cells. Furthermore, we see that some cells have the gene PPBP as highly divergent, suggesting that they may be megakaryocytes. NKG7 or GNLY show up for other cells, suggesting that they may be NK cells, while immunoglobin genes such as IGLL5 show up for others, suggesting a B cell lineage.

We also take a look at cluster 7 from Seurat, which supposedly should be the same cell type. First, extract the data from the Seurat object:

clust7 = extractSeuratData(so, 7)

Then plot in the same way as above:


#also investigate the clustering from Seurat directly
clust7_divergence <- DSAVEGetSingleCellDivergence(clust7,
                                                     minUMIsPerCell = 200, tpmLowerBound = 0,
                                                     iterations = 15,silent=FALSE)

DSAVEPlotDivergence(clust7, clust7_divergence)

We see less misclassified cells here in general, although there seem to still be some NK cells present.

6 Compare clustering methods using the DSAVE BTM Variation Score

We can use the DSAVE BTM score to get an idea about which of the cell populations examined above that exhibit the most variation, which could serve as a metric of clustering quality:


templ <- DSAVEGetStandardTemplate500();
clust7_score <- DSAVECalcBTMScore(clust7, templ)
dend_score <- DSAVECalcBTMScore(dend, templ)

clust7_score$DSAVEScore
#> [1] 0.01900283
dend_score$DSAVEScore
#> [1] 0.03457516

We can see that the score is higher for the authors’ classification, suggesting that the newer Seurat clustering leads to more homogenous clusters in this case.

6 Plot the UMI Counts and mitochondrial content vs Cell Divergence to estimate suitable thresholds for cell filtering

bcells_id <- sample(1:dim(bcells)[2], 1000)
subb = as.matrix(bcells[,bcells_id])
bcells_cell_divergence <- DSAVEGetSingleCellDivergence(subb,
                                                       minUMIsPerCell = 200, tpmLowerBound = 0,
                                                       iterations = 15,silent=FALSE)
DSAVEDivUMIPlot(subb, bcells_cell_divergence)

We can see that cells with low number of UMIs are more divergent. Cells below a certain UMI count threshold are usually discarded in a quality control step. We can see that elevating that threshold would likely get rid of more divergent cells, however at the cost of loss of data. Also cells with a high UMI count are on average more divergent; this could be explained by that some of them are doublets, i.e. cases where two or more cells have been accidentally joined and treated as just one cell. Divergence could then come from that the joined cells are of different cell type.


mitoGenes <- grep(pattern = "^MT-", x = row.names(subb), value = TRUE)
fracMT <- colSums(subb[mitoGenes, ])/colSums(subb)

#for mitochondrial content, it makes sense to calculate the divergence without the MT genes, 
#because these in themselves affect the divergence:

subbNoMT = subb[-match(mitoGenes, row.names(subb)),]
bcells_cell_divergence_no_MT <- DSAVEGetSingleCellDivergence(subbNoMT,
                                                       minUMIsPerCell = 200, tpmLowerBound = 0,
                                                       iterations = 15,silent=FALSE)

DSAVEDivParamPlot(bcells_cell_divergence_no_MT, fracMT, "MT Content")

It is evident that cells with high mitochondrial content are more divergent, dispite that the the mitochondrial genes are not part of the calculation. This suggest that a high mitochondrial content is a sign of low quality for cells, which is also commonly used in many processing pipelines.