Gene Set Enrichment Analysis with clusterProfiler

Install and load required packages

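# install BiocManager first if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# the `version` argument selects a Bioconductor release; leave it unset to match your R installation
BiocManager::install("clusterProfiler")
BiocManager::install("pathview")
BiocManager::install("enrichplot")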
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)

Data

The input is a CSV file containing gene names and log2 fold change values. Data like this is typically produced by a differential expression analysis tool such as DESeq2. Download sample data here.

Annotations

The sample data is from D. melanogaster, so install and load the annotation “org.Dm.eg.db” below. See all available annotations here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (19 are presently available).

# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
BiocManager::install(organism)  # organism already holds the package name as a string
library(organism, character.only = TRUE)

Prepare Input

# reading in data from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)

# we want the log2 fold change 
original_gene_list <- df$log2FoldChange

# name the vector with the gene IDs (the first CSV column, which read.csv names X when its header is empty)
names(original_gene_list) <- df$X

# omit any NA values 
gene_list<-na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
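
To check that the input is in the expected shape, the same preparation steps can be run on a small toy table. This is only a sketch: the data frame toy_df, its gene IDs and its fold change values are made up for illustration and stand in for the real CSV.

```r
# Hypothetical toy data standing in for the DESeq2 CSV (IDs and values are made up)
toy_df <- data.frame(
  X = c("FBgn0000001", "FBgn0000002", "FBgn0000003", "FBgn0000004"),
  log2FoldChange = c(-1.2, 2.5, NA, 0.3),
  stringsAsFactors = FALSE
)

# Same steps as above: extract, name, drop NAs, sort in decreasing order
toy_list <- toy_df$log2FoldChange
names(toy_list) <- toy_df$X
toy_list <- na.omit(toy_list)
toy_list <- sort(toy_list, decreasing = TRUE)

print(toy_list)

# Sanity checks worth running on the real gene_list as well
stopifnot(!anyNA(toy_list))                     # no missing values remain
stopifnot(!is.unsorted(rev(toy_list)))          # values are in decreasing order
stopifnot(!any(duplicated(names(toy_list))))    # each gene ID appears once
```

The result is a named numeric vector: the names are the gene IDs and the values are the fold changes, largest first. Duplicate gene IDs are a common cause of trouble further along, so the last check is worth keeping.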

Gene Set Enrichment

Params:

keyType: the source of the annotation (gene IDs). The options vary for each annotation. For org.Dm.eg.db, the options are:

“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID”
“ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT”
“GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL”
“PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”

Check which options are available with the keytypes function, for example keytypes(org.Dm.eg.db).

ont: one of “BP”, “MF”, “CC” or “ALL”.
nPerm: the number of permutations. The higher the number, the more accurate your result will be, but the longer the analysis will take. In recent clusterProfiler releases this argument may be deprecated; see ?gseGO for your installed version.
minGSSize: minimum number of genes in a set (gene sets with fewer than this many genes in your dataset will be ignored).
maxGSSize: maximum number of genes in a set (gene sets with more than this many genes in your dataset will be ignored).
pvalueCutoff: cutoff applied to the adjusted p-value.
pAdjustMethod: one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.
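
The pAdjustMethod names match the methods of base R's p.adjust function (see stats::p.adjust.methods). As a rough illustration of how the choice changes which results pass a 0.05 cutoff, the sketch below applies three of the methods to a made-up vector of p-values. It only demonstrates p.adjust and is not a reproduction of what gseGO does internally.

```r
# Made-up raw p-values for four hypothetical gene sets
p <- c(0.01, 0.02, 0.03, 0.04)

# Apply a few of the available adjustment methods
methods <- c("none", "BH", "bonferroni")
adj <- sapply(methods, function(m) p.adjust(p, method = m))
print(adj)

# Count how many gene sets would pass a 0.05 cutoff under each method
passing <- colSums(adj <= 0.05)
print(passing)
```

With "none", the cutoff is applied to the raw p-values, so more gene sets pass than under a multiple-testing correction.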

gse <- gseGO(geneList=gene_list, 
             ont ="ALL", 
             keyType = "ENSEMBL", 
             nPerm = 10000, 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = organism, 
             pAdjustMethod = "none")

Output

Dotplot

require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)

Enrichment Map

The enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.

# recent enrichplot releases require term similarities first: gse <- pairwise_termsim(gse)
emapplot(gse, showCategory = 10)
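
To make the idea of "overlapping gene sets" concrete, here is a minimal sketch that computes a Jaccard overlap between made-up gene sets and lists the pairs that would be joined by an edge. The gene sets and the helper jaccard are hypothetical, and the Jaccard index is used here as one common overlap measure; the exact similarity used by emapplot depends on the enrichplot version and settings.

```r
# Hypothetical gene sets (term names and gene IDs are made up)
gene_sets <- list(
  term_A = c("g1", "g2", "g3", "g4"),
  term_B = c("g3", "g4", "g5"),
  term_C = c("g8", "g9")
)

# Jaccard index: shared genes divided by all genes in either set
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix between all terms
terms <- names(gene_sets)
sim <- outer(
  seq_along(terms), seq_along(terms),
  Vectorize(function(i, j) jaccard(gene_sets[[i]], gene_sets[[j]]))
)
dimnames(sim) <- list(terms, terms)
print(round(sim, 2))

# Term pairs with any overlap; these would be connected in the network
edges <- which(sim > 0 & upper.tri(sim), arr.ind = TRUE)
print(data.frame(from = terms[edges[, "row"]], to = terms[edges[, "col"]]))
```

Here term_A and term_B share two genes and are linked, while term_C shares none and stays isolated.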

Category Netplot

The cnetplot depicts the linkages between genes and biological concepts (e.g. GO terms or KEGG pathways) as a network. This helps show which genes are involved in enriched pathways and which genes belong to multiple annotation categories.

# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)

Ridgeplot

Density plots, grouped by gene set, are generated from the frequency of fold change values per gene within each set. They help in interpreting up- and down-regulated pathways.

ridgeplot(gse) + labs(x = "enrichment distribution")
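
The idea behind each ridge can be sketched in base R: take the fold changes of the genes in a set and estimate their density. The fold change values and set names below are made up, and this is an illustration of the concept only; ridgeplot selects the genes and draws the curves in its own way.

```r
# Made-up fold changes for the genes in two hypothetical gene sets
fold_changes <- list(
  set_up   = c(1.8, 2.1, 0.9, 1.4, 2.6),
  set_down = c(-1.1, -2.3, -0.7, -1.9, -1.5)
)

# One kernel density estimate per gene set (one "ridge" each)
dens <- lapply(fold_changes, density)

# Location of each density peak on the fold change axis
peaks <- sapply(dens, function(d) d$x[which.max(d$y)])
print(peaks)
```

A ridge whose peak sits to the right of zero suggests a mostly up-regulated gene set, and one to the left suggests a mostly down-regulated set.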