The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands or even millions of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems.
Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurement at single-cell resolution, allowing cell type clusters to be distinguished, populations of cells to be arranged according to novel hierarchies, and cells transitioning between states to be identified. This can lead to a much clearer view of the dynamics of tissue and organism development, and of structures within cell populations that had so far been perceived as homogeneous. In a similar vein, analyses based on single-cell DNA sequencing (scDNA-seq) can highlight somatic clonal structures, thus helping to track the formation of cell lineages and providing insight into evolutionary processes acting on somatic mutations.
Overall, experimental scRNA-seq protocols are similar to the methods used for bulk RNA-seq.
Pre‐processing and visualization
Raw data produced by sequencing machines are processed to obtain matrices of molecular counts (count matrices) or, alternatively, read counts (read matrices), depending on whether unique molecular identifiers (UMIs) were incorporated in the single‐cell library construction protocol. Raw data processing pipelines such as Cell Ranger, indrops, SEQC, or zUMIs take care of read quality control (QC), assigning reads to their cellular barcodes and mRNA molecules of origin (also called “demultiplexing”), genome alignment, and quantification. The resulting read or count matrices have the dimensions number of barcodes × number of transcripts.
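To make the difference between a read matrix and a count matrix concrete, the following minimal sketch tallies both from a handful of reads that are assumed to be already aligned and assigned to a barcode. The record layout `(barcode, gene, umi)` and the function name `build_matrices` are hypothetical and chosen for illustration only; real pipelines additionally correct sequencing errors in barcodes and UMIs, which is ignored here.

```python
from collections import defaultdict

def build_matrices(reads):
    """Tally reads and unique molecules per (barcode, gene) pair.

    `reads` is an iterable of (barcode, gene, umi) tuples: a simplified,
    hypothetical record for an aligned and demultiplexed read.
    """
    read_counts = defaultdict(int)  # (barcode, gene) -> number of reads
    umis_seen = defaultdict(set)    # (barcode, gene) -> distinct UMIs observed
    for barcode, gene, umi in reads:
        read_counts[(barcode, gene)] += 1
        umis_seen[(barcode, gene)].add(umi)
    # Reads sharing barcode, gene and UMI are treated as copies of one molecule.
    molecule_counts = {key: len(umis) for key, umis in umis_seen.items()}
    return dict(read_counts), molecule_counts

reads = [
    ("AAAC", "GeneA", "TTG"),
    ("AAAC", "GeneA", "TTG"),  # second read of the same molecule
    ("AAAC", "GeneA", "CCA"),
    ("GGGT", "GeneB", "ACG"),
]
read_matrix, count_matrix = build_matrices(reads)
print(read_matrix[("AAAC", "GeneA")])   # 3 reads
print(count_matrix[("AAAC", "GeneA")])  # 2 molecules
```

The matrices are stored sparsely as dictionaries keyed by (barcode, gene), since most entries of a barcodes × transcripts matrix are zero.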
Before analyzing the single‐cell gene expression data, one must ensure that all cellular barcode data correspond to viable cells. Cell QC is commonly performed based on three QC covariates: the number of counts per barcode (count depth), the number of genes per barcode, and the fraction of counts from mitochondrial genes per barcode. The distributions of these QC covariates are examined for outlier peaks that are filtered out by thresholding. These outlier barcodes can correspond to dying cells, cells whose membranes are broken, or doublets.
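The three QC covariates described above can be computed directly from a count matrix. The sketch below is a minimal illustration in plain Python; the function names, the `MT-` prefix used to recognize mitochondrial genes (a human gene symbol convention), and the thresholds in the example are all assumptions. In practice thresholds are chosen by inspecting the covariate distributions of the dataset at hand.

```python
def qc_covariates(counts, gene_names, mito_prefix="MT-"):
    """Return (count_depth, n_genes, mito_fraction) for each barcode.

    `counts` holds one row per barcode and one column per gene.
    The mitochondrial gene prefix is an assumption and depends on the annotation.
    """
    mito_cols = [j for j, name in enumerate(gene_names) if name.startswith(mito_prefix)]
    covariates = []
    for row in counts:
        depth = sum(row)                        # total counts for this barcode
        n_genes = sum(1 for c in row if c > 0)  # genes with at least one count
        mito = sum(row[j] for j in mito_cols)
        mito_fraction = mito / depth if depth > 0 else 0.0
        covariates.append((depth, n_genes, mito_fraction))
    return covariates

def passes_qc(covariate, min_depth, min_genes, max_mito_fraction):
    """Apply simple thresholds to one barcode's covariates."""
    depth, n_genes, mito_fraction = covariate
    return (depth >= min_depth
            and n_genes >= min_genes
            and mito_fraction <= max_mito_fraction)

gene_names = ["GeneA", "GeneB", "MT-CO1"]
counts = [
    [60, 30, 10],  # typical barcode
    [2, 0, 8],     # low depth, mostly mitochondrial counts
    [50, 45, 5],
]
covariates = qc_covariates(counts, gene_names)
# Toy thresholds for this tiny example only.
keep = [passes_qc(c, min_depth=50, min_genes=2, max_mito_fraction=0.2)
        for c in covariates]
print(keep)  # [True, False, True]
```

This sketch thresholds the covariates independently and only from one side. Filtering doublets would additionally require upper bounds on count depth and gene number, which are omitted here.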
Each count in a count matrix represents the successful capture, reverse transcription and sequencing of a molecule of cellular mRNA. Count depths for identical cells can differ due to the variability inherent in each of these steps. Thus, when gene expression is compared between cells based on count data, any difference may have arisen solely due to sampling effects. Normalization addresses this issue by e.g. scaling count data to obtain correct relative gene expression abundances between cells.
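As a minimal sketch of scaling normalization, the function below divides each cell's counts by its count depth and multiplies by a common target total, so that cells with the same relative composition but different depths end up with the same values. The name `scale_normalize` and the `target_sum` parameter are assumptions for illustration; scRNA‐seq‐specific methods estimate size factors in more elaborate ways.

```python
def scale_normalize(counts, target_sum):
    """Scale every cell (row) so that its counts sum to `target_sum`."""
    normalized = []
    for row in counts:
        depth = sum(row)
        if depth == 0:
            # An empty barcode carries no information; keep it at zero.
            normalized.append([0.0] * len(row))
            continue
        normalized.append([c * target_sum / depth for c in row])
    return normalized

# Two cells with identical relative expression but a tenfold depth difference.
counts = [
    [10, 30, 60],
    [1, 3, 6],
]
normalized = scale_normalize(counts, target_sum=100)
print(normalized)
```

After scaling, both rows hold the same relative abundances, so the remaining differences between cells are no longer driven by count depth alone.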
Many normalization methods exist for bulk gene expression. While some of these methods have been applied to scRNA‐seq analysis, sources of variation specific to single‐cell data such as technical dropouts have prompted the development of scRNA‐seq‐specific normalization methods.
Data correction and integration
Normalization attempts to remove the effects of count sampling. However, normalized data may still contain unwanted variability. Data correction targets further technical and biological covariates such as batch, dropout, or cell cycle effects. These covariates are not always corrected for. Instead, the decision of which covariates to consider will depend on the intended downstream analysis.
Regressing out biological effects
While correcting for technical covariates may be crucial to uncovering the underlying biological signal, correction for biological covariates serves to single out particular biological signals of interest. The most common biological data correction is to remove the effects of the cell cycle on the transcriptome. This data correction can be performed by a simple linear regression against a cell cycle score as implemented in the
Regressing out technical effects
The variants of regression models used to regress out biological covariates can also be applied to technical covariates. The most prominent technical covariates in single‐cell data are count depth and batch. Although normalization scales count data to render gene counts comparable between cells, a count depth effect often remains in the data. This count depth effect can be both a biological and a technical artifact.
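Regressing out a covariate with a simple linear model amounts to fitting, for each gene, a straight line of expression against the covariate and keeping the residuals. The sketch below does this for a single gene and a single covariate using the closed-form least-squares solution. The function name `regress_out` and the example values are hypothetical; the covariate could equally be a cell cycle score or a count depth measure.

```python
def regress_out(expression, covariate):
    """Return the residuals of the least-squares fit expression ~ a + b * covariate.

    Both arguments are equal-length sequences with one value per cell.
    """
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # A constant covariate explains nothing; only remove the mean.
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]

# Hypothetical values: one gene measured in four cells, plus a per-cell covariate.
covariate = [1.0, 2.0, 3.0, 4.0]
expression = [2.1, 3.9, 6.2, 7.8]
residuals = regress_out(expression, covariate)
print(residuals)
```

By construction the residuals have zero mean and are linearly uncorrelated with the covariate. Any nonlinear dependence on the covariate would remain in the data.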
Batch effects and data integration
Batch effects can occur when cells are handled in distinct groups. These groups can consist of cells on different chips, cells in different sequencing lanes or cells harvested at different time points. The differing environments experienced by the cells can have an effect on the measurement of the transcriptome or on the transcriptome itself. The resulting effects exist on multiple levels: between groups of cells in an experiment, between experiments performed in the same laboratory or between datasets from different laboratories.
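To illustrate what a batch effect between groups of cells looks like numerically, the following deliberately naive sketch shifts the values of one gene so that every batch has the same mean. This per-batch centering is an assumption chosen for simplicity, not a method taken from the text: it is only reasonable when all batches contain the same mix of cell types and states, and the function name `center_batches` is hypothetical.

```python
from collections import defaultdict

def center_batches(values, batches):
    """Shift one gene's values so that each batch shares the overall mean.

    `values` holds one expression value per cell and `batches` the matching
    batch label per cell.
    """
    overall_mean = sum(values) / len(values)
    grouped = defaultdict(list)
    for value, batch in zip(values, batches):
        grouped[batch].append(value)
    batch_means = {batch: sum(vs) / len(vs) for batch, vs in grouped.items()}
    # Remove each batch's own mean and restore the common overall mean.
    return [value - batch_means[batch] + overall_mean
            for value, batch in zip(values, batches)]

# Hypothetical example: the second chip reads systematically higher.
values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
batches = ["chip1", "chip1", "chip1", "chip2", "chip2", "chip2"]
corrected = center_batches(values, batches)
print(corrected)
```

The variation within each batch is left untouched. If batches differ in their biological composition, this kind of centering would also remove genuine biological differences.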