Seurat part 4 – Cell clustering

So now that we have QC’ed our cells, normalized them, and determined the relevant PCAs, we are ready to determine cell clusters and proceed with annotating the clusters.

Seurat includes a graph-based clustering approach compared to (Macosko et al.). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure, for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard distance). To cluster the cells, we apply modularity optimization techniques [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function.

The FindClusters function implements the procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.6-1.2 typically returns good results for single cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters are saved in the object@ident slot.

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, 
    resolution = 0.6, print.output = 0, save.SNN = TRUE)

Questions:

What parameter would you change to include the first 12 PCAs?
If your dataset contained 4K cells, what do you think the resolution parameter be set to?

A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted summary of the parameters that were chosen.

PrintFindClustersParams(object = pbmc)

## Parameters used in latest FindClusters calculation run on: 2017-10-12 11:49:20
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10

# While we do provide function-specific printing functions, the more general
# function to print calculation parameters is PrintCalcParams().

Run Non-linear dimensional reduction (tSNE)

Seurat continues to use tSNE as a powerful tool to visualize and explore these datasets. While we no longer advise clustering directly on tSNE components, cells within the graph-based clusters determined above should co-localize on the tSNE plot. This is because the tSNE aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. As input to the tSNE, we suggest using the same PCs as input to the clustering analysis, although computing the tSNE based on scaled gene expression is also supported using the genes.use argument.

pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)

# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = pbmc)

You can save the object at this point so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, or easily share it with collaborators.

save(pbmc, file = "~/Projects/datasets/pbmc3k/pbmc_tutorial.Robj")

Finding differentially expressed genes (cluster biomarkers)

Seurat can help you find markers that define clusters via differential expression. By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time – since this will test a large number of genes that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed gains can be significant and the most highly differentially expressed genes will likely still rise to the top.

The parameters described above can be adjusted to decrease computational time. Be careful when setting these, because (and depending on your data) it might have a substantial effect on the power of detection. We suggest using the HPC nodes to perform computationally intensive steps, rather than you personal laptops.

# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
print(x = head(x = cluster1.markers, n = 5))

##                p_val avg_logFC pct.1 pct.2    p_val_adj
## S100A9  0.000000e+00  3.827593 0.996 0.216  0.00000e+00
## S100A8  0.000000e+00  3.786535 0.973 0.123  0.00000e+00
## LGALS2  0.000000e+00  2.634722 0.908 0.060  0.00000e+00
## FCN1    0.000000e+00  2.369524 0.956 0.150  0.00000e+00
## CD14   8.129864e-290  1.949317 0.663 0.029 1.11493e-285

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0, 3), 
    min.pct = 0.25)
print(x = head(x = cluster5.markers, n = 5))

##                p_val avg_logFC pct.1 pct.2     p_val_adj
## GZMB   3.854665e-190  3.195021 0.955 0.084 5.286288e-186
## IGFBP7 2.967797e-155  2.175917 0.542 0.010 4.070037e-151
## GNLY   7.492111e-155  3.514718 0.961 0.143 1.027468e-150
## FGFBP2 2.334109e-150  2.559484 0.852 0.085 3.200998e-146
## FCER1G 4.819154e-141  2.280724 0.839 0.100 6.608987e-137

# find markers for every cluster compared to all remaining cells, report
# only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, 
    thresh.use = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)

## # A tibble: 16 x 7
## # Groups:   cluster [8]
##            p_val avg_logFC pct.1 pct.2     p_val_adj cluster     gene
##            <dbl>     <dbl> <dbl> <dbl>         <dbl>  <fctr>    <chr>
##  1 9.480623e-268  1.149058 0.924 0.483 1.300173e-263       0     LDHB
##  2 3.519020e-133  1.068122 0.662 0.202 4.825984e-129       0     IL7R
##  3  0.000000e+00  3.827593 0.996 0.216  0.000000e+00       1   S100A9
##  4  0.000000e+00  3.786535 0.973 0.123  0.000000e+00       1   S100A8
##  5  0.000000e+00  2.977399 0.936 0.042  0.000000e+00       2    CD79A
##  6 1.030004e-191  2.492236 0.624 0.022 1.412548e-187       2    TCL1A
##  7 3.158793e-231  2.158812 0.974 0.230 4.331968e-227       3     CCL5
##  8 4.103013e-125  2.113428 0.588 0.050 5.626872e-121       3     GZMK
##  9 1.304430e-165  2.151881 1.000 0.316 1.788896e-161       4     LST1
## 10 1.420369e-138  2.275509 0.962 0.137 1.947894e-134       4   FCGR3A
## 11 1.127318e-215  3.763928 0.961 0.131 1.546004e-211       5     GNLY
## 12 8.182907e-187  3.334634 0.955 0.068 1.122204e-182       5     GZMB
## 13  3.261141e-48  2.729243 0.844 0.011  4.472329e-44       6   FCER1A
## 14  7.655042e-30  1.965168 1.000 0.513  1.049812e-25       6 HLA-DPB1
## 15  1.165274e-56  5.889503 1.000 0.023  1.598056e-52       7     PPBP
## 16  3.238887e-44  4.952160 0.933 0.010  4.441810e-40       7      PF4

Seurat has four tests for differential expression which can be set with the test.use parameter: ROC test (“roc”), t-test (“t”), LRT test based on zero-inflated data (“bimod”, default), LRT test based on tobit-censoring models (“tobit”) The ROC test returns the ‘classification power’ for any individual marker (ranging from 0 – random, to 1 – perfect).

cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 0, thresh.use = 0.25, 
    test.use = "roc", only.pos = TRUE)

Questions:

What is the effect of changing the DE test? Can you experiment with these tests and see what the outcome is?

We include several tools for visualizing marker expression. VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring JoyPlot, CellPlot, and DotPlot as additional methods to view your dataset.

VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))

# you can plot raw UMI counts as well
VlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)

FeaturePlot(object = pbmc, features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", 
    "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"), cols.use = c("grey", "blue"), 
    reduction.use = "tsne")

DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
# setting slim.col.label to TRUE will print just the cluster IDS instead of
# every cell name
DoHeatmap(object = pbmc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Assigning cell type identity to clusters

Fortunately in the case of this dataset, we can use canonical markers to easily match the unbiased clustering to known cell types:

Cluster ID	Markers	Cell Type
0	IL7R	CD4 T cells
1	CD14, LYZ	CD14+ Monocytes
2	MS4A1	B cells
3	CD8A	CD8 T cells
4	FCGR3A, MS4A7	FCGR3A+ Monocytes
5	GNLY, NKG7	NK cells
6	FCER1A, CST3	Dendritic Cells
7	PPBP	Megakaryocytes

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells", 
    "FCGR3A+ Monocytes", "NK cells", "Dendritic cells", "Megakaryocytes")
pbmc@ident <- plyr::mapvalues(x = pbmc@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Further subdivisions within cell types

If you perturb some of our parameter choices above (for example, setting resolution=0.8 or changing the number of PCs), you might see the CD4 T cells subdivide into two groups. You can explore this subdivision to find markers separating the two T cell subsets. However, before reclustering (which will overwrite object@ident), we can stash our renamed identities to be easily recovered later.

# First lets stash our identities for later
pbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")

# Note that if you set save.snn=T above, you don't need to recalculate the
# SNN, and can simply put: pbmc <- FindClusters(pbmc,resolution = 0.8)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, 
    resolution = 0.8, print.output = FALSE)

## Warning in BuildSNN(object = object, genes.use = genes.use, reduction.type
## = reduction.type, : Build parameters exactly match those of already
## computed and stored SNN. To force recalculation, set force.recalc to TRUE.

# Demonstration of how to plot two tSNE plots side by side, and how to color
# points based on different criteria
plot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)
plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", 
    no.legend = TRUE, do.label = TRUE)
plot_grid(plot1, plot2)

# Find discriminating markers
tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)

# Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we
# can see that CCR7 is upregulated in C0, strongly indicating that we can
# differentiate memory from naive CD4 cells.  cols.use demarcates the color
# palette from low to high expression
FeaturePlot(object = pbmc, features.plot = c("S100A4", "CCR7"), cols.use = c("green", 
    "blue"))

The memory/naive split is a bit weak, and we would probably benefit from looking at more cells to see if this becomes more convincing. In the meantime, we can restore our old cluster identities for downstream processing.

pbmc <- SetAllIdent(object = pbmc, id = "ClusterNames_0.6")
save(pbmc, file = "~/Projects/datasets/pbmc3k_final.Rda")

Excellent! And for more of these great tutorials exploring the power of Seurat, head over to the Seurat tutorial page.

Below is the complete R code used in this tutorial,

library(Seurat)
library(dplyr)
library(Matrix)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size

## 709264728 bytes

sparse.size <- object.size(x = pbmc.data)
sparse.size

## 38715120 bytes

dense.size/sparse.size

## 18.3 bytes

# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, 
    project = "10X_PBMC")

# The number of genes and UMIs (nGene and nUMI) are automatically calculated
# for every object by Seurat.  For non-UMI data, nUMI represents the sum of
# the non-normalized values within a cell We calculate the percentage of
# mitochondrial genes here and store it in percent.mito using AddMetaData.
# We use object@raw.data since this represents non-transformed and
# non-log-normalized counts The % of UMI mapping to MT-genes is a common
# scRNA-seq QC metric.  NOTE: You must have the Matrix package loaded to
# calculate the percent.mito values.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to
# stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)

# GenePlot is typically used to visualize gene-gene relationships, but can
# be used for anything calculated by the object, i.e. columns in
# object@meta.data, PC scores etc.  Since there is a rare subset of cells
# with an outlier level of high mitochondrial percentage and also low UMI
# content, we filter these as well
par(mfrow = c(1, 2))
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito")
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene")

# We filter out cells that have unique gene counts over 2,500 or less than
# 200 Note that low.thresholds and high.thresholds are used to define a
# 'gate' -Inf and Inf should be used if you don't want a lower or upper
# threshold.
pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), 
    low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))

pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", 
    scale.factor = 10000)

pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, 
    x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)

length(x = pbmc@var.genes)

## [1] 1838

pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))

pbmc <- RunPCA(object = pbmc, pc.genes = pbmc@var.genes, do.print = TRUE, pcs.print = 1:5, 
    genes.print = 5)

## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""

# Examine and visualize PCA results a few different ways
PrintPCA(object = pbmc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)

## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""

VizPCA(object = pbmc, pcs.use = 1:2)

PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)

# ProjectPCA scores each gene in the dataset (including genes not included
# in the PCA) based on their correlation with the calculated components.
# Though we don't use this further here, it can be used to identify markers
# that are strongly correlated with cellular heterogeneity, but may not have
# passed through variable gene selection.  The results of the projected PCA
# can be explored by setting use.full=T in the functions above
pbmc <- ProjectPCA(object = pbmc, do.print = FALSE)

PCHeatmap(object = pbmc, pc.use = 1, cells.use = 500, do.balanced = TRUE, label.columns = FALSE)

PCHeatmap(object = pbmc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, 
    label.columns = FALSE, use.full = FALSE)

# NOTE: This process can take a long time for big datasets, comment out for
# expediency.  More approximate techniques such as those implemented in
# PCElbowPlot() can be used to reduce computation time
pbmc <- JackStraw(object = pbmc, num.replicate = 100, do.print = FALSE)

JackStrawPlot(object = pbmc, PCs = 1:12)

PCElbowPlot(object = pbmc)

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun
# using the same graph but with a different resolution value (see docs for
# full details)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, 
    resolution = 0.6, print.output = 0, save.SNN = TRUE)

PrintFindClustersParams(object = pbmc)

## Parameters used in latest FindClusters calculation run on: 2017-10-12 11:49:20
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10

# While we do provide function-specific printing functions, the more general
# function to print calculation parameters is PrintCalcParams().

pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)

# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = pbmc)

save(pbmc, file = "~/Projects/datasets/pbmc3k/pbmc_tutorial.Robj")

# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
print(x = head(x = cluster1.markers, n = 5))

##                p_val avg_logFC pct.1 pct.2    p_val_adj
## S100A9  0.000000e+00  3.827593 0.996 0.216  0.00000e+00
## S100A8  0.000000e+00  3.786535 0.973 0.123  0.00000e+00
## LGALS2  0.000000e+00  2.634722 0.908 0.060  0.00000e+00
## FCN1    0.000000e+00  2.369524 0.956 0.150  0.00000e+00
## CD14   8.129864e-290  1.949317 0.663 0.029 1.11493e-285

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0, 3), 
    min.pct = 0.25)
print(x = head(x = cluster5.markers, n = 5))

##                p_val avg_logFC pct.1 pct.2     p_val_adj
## GZMB   3.854665e-190  3.195021 0.955 0.084 5.286288e-186
## IGFBP7 2.967797e-155  2.175917 0.542 0.010 4.070037e-151
## GNLY   7.492111e-155  3.514718 0.961 0.143 1.027468e-150
## FGFBP2 2.334109e-150  2.559484 0.852 0.085 3.200998e-146
## FCER1G 4.819154e-141  2.280724 0.839 0.100 6.608987e-137

# find markers for every cluster compared to all remaining cells, report
# only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, 
    thresh.use = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)

## # A tibble: 16 x 7
## # Groups:   cluster [8]
##            p_val avg_logFC pct.1 pct.2     p_val_adj cluster     gene
##            <dbl>     <dbl> <dbl> <dbl>         <dbl>  <fctr>    <chr>
##  1 9.480623e-268  1.149058 0.924 0.483 1.300173e-263       0     LDHB
##  2 3.519020e-133  1.068122 0.662 0.202 4.825984e-129       0     IL7R
##  3  0.000000e+00  3.827593 0.996 0.216  0.000000e+00       1   S100A9
##  4  0.000000e+00  3.786535 0.973 0.123  0.000000e+00       1   S100A8
##  5  0.000000e+00  2.977399 0.936 0.042  0.000000e+00       2    CD79A
##  6 1.030004e-191  2.492236 0.624 0.022 1.412548e-187       2    TCL1A
##  7 3.158793e-231  2.158812 0.974 0.230 4.331968e-227       3     CCL5
##  8 4.103013e-125  2.113428 0.588 0.050 5.626872e-121       3     GZMK
##  9 1.304430e-165  2.151881 1.000 0.316 1.788896e-161       4     LST1
## 10 1.420369e-138  2.275509 0.962 0.137 1.947894e-134       4   FCGR3A
## 11 1.127318e-215  3.763928 0.961 0.131 1.546004e-211       5     GNLY
## 12 8.182907e-187  3.334634 0.955 0.068 1.122204e-182       5     GZMB
## 13  3.261141e-48  2.729243 0.844 0.011  4.472329e-44       6   FCER1A
## 14  7.655042e-30  1.965168 1.000 0.513  1.049812e-25       6 HLA-DPB1
## 15  1.165274e-56  5.889503 1.000 0.023  1.598056e-52       7     PPBP
## 16  3.238887e-44  4.952160 0.933 0.010  4.441810e-40       7      PF4

cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 0, thresh.use = 0.25, 
    test.use = "roc", only.pos = TRUE)

VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))

# you can plot raw UMI counts as well
VlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)

FeaturePlot(object = pbmc, features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", 
    "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"), cols.use = c("grey", "blue"), 
    reduction.use = "tsne")

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
# setting slim.col.label to TRUE will print just the cluster IDS instead of
# every cell name
DoHeatmap(object = pbmc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells", 
    "FCGR3A+ Monocytes", "NK cells", "Dendritic cells", "Megakaryocytes")
pbmc@ident <- plyr::mapvalues(x = pbmc@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

# First lets stash our identities for later
pbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")

# Note that if you set save.snn=T above, you don't need to recalculate the
# SNN, and can simply put: pbmc <- FindClusters(pbmc,resolution = 0.8)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, 
    resolution = 0.8, print.output = FALSE)

## Warning in BuildSNN(object = object, genes.use = genes.use, reduction.type
## = reduction.type, : Build parameters exactly match those of already
## computed and stored SNN. To force recalculation, set force.recalc to TRUE.

# Demonstration of how to plot two tSNE plots side by side, and how to color
# points based on different criteria
plot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)
plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", 
    no.legend = TRUE, do.label = TRUE)
plot_grid(plot1, plot2)

# Find discriminating markers
tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)

# Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we
# can see that CCR7 is upregulated in C0, strongly indicating that we can
# differentiate memory from naive CD4 cells.  cols.use demarcates the color
# palette from low to high expression
FeaturePlot(object = pbmc, features.plot = c("S100A4", "CCR7"), cols.use = c("green", 
    "blue"))

pbmc <- SetAllIdent(object = pbmc, id = "ClusterNames_0.6")
save(pbmc, file = "~/Projects/datasets/pbmc3k_final.Rda")