Now that we have loaded our data in seurat (using the CreateSeuratObject), we want to perform some initial QC on our cells.
While the CreateSeuratObject
imposes a basic minimum gene-cutoff, you may want to filter out cells at this stage based on technical or biological parameters. Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. In the example below, we visualize gene and molecule counts, plot their relationship, and exclude cells with a clear outlier number of genes detected as potential multiplets. Of course this is not a guaranteed method to exclude cell doublets, but we include this as an example of filtering user-defined outlier cells. We also filter cells based on the percentage of mitochondrial genes present.
# The number of genes and UMIs (nGene and nUMI) are automatically calculated # for every object by Seurat. For non-UMI data, nUMI represents the sum of # the non-normalized values within a cell We calculate the percentage of # mitochondrial genes here and store it in percent.mito using AddMetaData. # We use object@raw.data since this represents non-transformed and # non-log-normalized counts The % of UMI mapping to MT-genes is a common # scRNA-seq QC metric. NOTE: You must have the Matrix package loaded to # calculate the percent.mito values. mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE) percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data) # AddMetaData adds columns to object@meta.data, and is a great place to # stash QC stats pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
Note: In order to detect mitochondrial genes, we need to tell Seurat how to distinguish these genes. We do this using a regular expression as in “mito.genes <- grep(pattern = "^MT-"
. If your mitochondrial genes are named differently, then you will need to adjust this pattern accordingly (e.g. “mt-“, “mt.”, or “MT_” etc.).
Questions:
- What is the difference between nGenes and nUMIs?
- Can you detect the potential outliers in each plot?
- How do you feel about the quality of the cells at this initial QC step?
# GenePlot is typically used to visualize gene-gene relationships, but can # be used for anything calculated by the object, i.e. columns in # object@meta.data, PC scores etc. Since there is a rare subset of cells # with an outlier level of high mitochondrial percentage and also low UMI # content, we filter these as well par(mfrow = c(1, 2)) GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito") GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene")
Now based on our observations, we can filter out what we see as clear outliers. We will define a window of a minimum of 200 detected genes per cell and a maximum of 2500 detected genes per cell. Again, these parameters should be adjusted according to your own data and observations. For example, if you had very high coverage, you might want to adjust these parameters and increase the threshold window. As this is a guided approach, visualization of the earlier plots will give you a good idea of what these parameters should be.
# We filter out cells that have unique gene counts over 2,500 or less than # 200 Note that low.thresholds and high.thresholds are used to define a # 'gate' -Inf and Inf should be used if you don't want a lower or upper # threshold. pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))
Questions:
- How many cells did we filter out using the thresholds specified above?