Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a pre-defined set of genes (e.g., those belonging to a specific GO term or KEGG pathway) shows statistically significant, concordant differences between two biological states. This R Notebook describes the implementation of GSEA using the clusterProfiler package in R. For more information, please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
R Markdown
Follow along interactively with the R Markdown Notebook:
https://github.com/gencorefacility/r-notebooks/blob/master/gsea.Rmd
Install and load required packages
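# install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# install the builds that match your current Bioconductor release
BiocManager::install("clusterProfiler")
BiocManager::install("pathview")
BiocManager::install("enrichplot")
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (e.g. ridgeplot)
library(ggplot2)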
Data
The input is a CSV file containing a list of gene names and log2 fold change values. This data is typically produced by a differential expression analysis tool such as DESeq2. Download sample data here.
Annotations
The sample data is from D. melanogaster, so install and load the annotation “org.Dm.eg.db” below. See all available annotations here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).
# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)
Prepare Input
# reading in data from deseq2
df <- read.csv("drosphila_example_de.csv", header = TRUE)
# we want the log2 fold change
original_gene_list <- df$log2FoldChange
# name the vector
names(original_gene_list) <- df$X
# omit any NA values
gene_list <- na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
gene_list <- sort(gene_list, decreasing = TRUE)
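As a sanity check, the preparation steps above can be tried on a tiny made-up table before running them on real data. The sketch below uses a hypothetical data frame `toy_df` with invented FlyBase-style IDs and fold changes (none of these values come from the sample file). Like the code above, it assumes the gene IDs are in a column named X.

```r
# Hypothetical toy data standing in for the DESeq2 output (values are made up)
toy_df <- data.frame(
  X = c("FBgn0000001", "FBgn0000002", "FBgn0000003", "FBgn0000004"),
  log2FoldChange = c(-1.5, 2.3, NA, 0.4),
  stringsAsFactors = FALSE
)

# Build a named numeric vector: values are fold changes, names are gene IDs
toy_list <- toy_df$log2FoldChange
names(toy_list) <- toy_df$X

# Drop genes without a fold change, then rank from most up- to most down-regulated
toy_list <- toy_list[!is.na(toy_list)]
toy_list <- sort(toy_list, decreasing = TRUE)

# Checks worth running on real data too: no missing values, no duplicated IDs
stopifnot(!anyNA(toy_list), !anyDuplicated(names(toy_list)))
print(toy_list)
```

Duplicated gene IDs are worth checking for on real data, since a ranked list with repeated names is ambiguous about which fold change belongs to a gene.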
Gene Set Enrichment
Params:
keyType This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Dm.eg.db, the options are:
“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID”
“ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT”
“GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL”
“PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”
Check which options are available with the keytypes command, for example keytypes(org.Dm.eg.db).
ont the GO ontology to use, one of “BP” (biological process), “MF” (molecular function), “CC” (cellular component) or “ALL”
nPerm the number of permutations; the higher the number, the more accurate your result will be, but the longer the analysis will take.
minGSSize minimum number of genes in a set (gene sets with fewer than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in a set (gene sets with more than this many genes in your dataset will be ignored).
pvalueCutoff the p-value cutoff for reporting enriched gene sets.
pAdjustMethod the method used to correct p-values for multiple testing, one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
gse <- gseGO(geneList = gene_list,
             ont = "ALL",
             keyType = "ENSEMBL",
             nPerm = 10000,
             minGSSize = 3,
             maxGSSize = 800,
             pvalueCutoff = 0.05,
             verbose = TRUE,
             OrgDb = organism,
             pAdjustMethod = "none")
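The example call sets pAdjustMethod = "none", so the 0.05 cutoff is applied to uncorrected p-values. The method names listed above match those of base R's p.adjust, so the effect of choosing a correction can be previewed without running GSEA. The sketch below does this with p-values invented purely for illustration.

```r
# Made-up p-values for five hypothetical gene sets
p_raw <- c(0.001, 0.008, 0.02, 0.045, 0.2)

# Apply three of the methods accepted by p.adjust()
p_none <- p.adjust(p_raw, method = "none")
p_bh <- p.adjust(p_raw, method = "BH")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Count how many gene sets would pass a 0.05 cutoff under each method
passing <- c(
  none = sum(p_none <= 0.05),
  BH = sum(p_bh <= 0.05),
  bonferroni = sum(p_bonf <= 0.05)
)
print(round(cbind(p_raw, p_bh, p_bonf), 4))
print(passing)
```

Fewer sets pass as the correction becomes stricter, which is worth keeping in mind when comparing results obtained with different pAdjustMethod settings.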
Output
Dotplot
require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
Enrichment Map
The enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.
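# recent enrichplot releases expect term similarities to be computed first,
# e.g. gse <- pairwise_termsim(gse)
emapplot(gse, showCategory = 10)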
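To make "overlapping gene sets" concrete, the sketch below computes the Jaccard index (shared genes divided by all genes in either set) for a few invented gene sets. The Jaccard index is one common overlap measure and is used here as an assumption for illustration. The set names and gene IDs are hypothetical, and the exact similarity used by emapplot may depend on the enrichplot version.

```r
# Hypothetical gene sets (names and members are made up)
gene_sets <- list(
  set_A = c("g1", "g2", "g3", "g4"),
  set_B = c("g3", "g4", "g5"),
  set_C = c("g6", "g7")
)

# Jaccard index: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix; larger values would correspond to stronger edges
n <- length(gene_sets)
sim <- matrix(0, n, n, dimnames = list(names(gene_sets), names(gene_sets)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(gene_sets[[i]], gene_sets[[j]])
  }
}
print(round(sim, 2))
```

In this toy case set_A and set_B share two genes and would be connected, while set_C shares none and would sit apart.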
Category Netplot
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)
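The point about genes belonging to multiple categories can also be checked directly from set membership. The sketch below uses hypothetical categories (not taken from the gse result) and counts how many categories each gene appears in. Each row of `edges` corresponds to one gene-to-category link of the kind such a network shows.

```r
# Hypothetical gene-to-category membership (made-up names)
categories <- list(
  term_1 = c("g1", "g2", "g3"),
  term_2 = c("g2", "g3", "g4"),
  term_3 = c("g3", "g5")
)

# Long format: one row per (gene, category) pair, i.e. one edge in the network
edges <- stack(categories)
names(edges) <- c("gene", "category")

# Number of categories each gene is linked to
membership <- table(edges$gene)
shared_genes <- names(membership)[membership > 1]
print(membership)
print(shared_genes)
```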
Ridgeplot
Grouped by gene set, density plots are generated from the distribution of fold change values of the genes within each set. This is helpful for interpreting up- and down-regulated pathways.
ridgeplot(gse) + labs(x = "enrichment distribution")
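The idea behind the ridgeplot can be sketched in base R: collect the fold changes of the genes in each set and look at where their distribution sits relative to zero. The fold changes and gene sets below are invented, and the density call only approximates the kind of curve drawn per set; it is not the plotting code used by ridgeplot.

```r
# Hypothetical fold changes and gene sets (all values made up)
fold_changes <- c(g1 = 2.1, g2 = 1.4, g3 = 0.9, g4 = -0.3, g5 = -1.2, g6 = -2.0)
gene_sets <- list(
  mostly_up = c("g1", "g2", "g3"),
  mostly_down = c("g4", "g5", "g6")
)

# Fold changes of the genes in each set
per_set <- lapply(gene_sets, function(genes) unname(fold_changes[genes]))

# A median shifted away from zero hints at the direction of regulation
set_medians <- vapply(per_set, median, numeric(1))
print(set_medians)

# Kernel density estimate for one set, similar in spirit to a single ridge
dens <- density(per_set$mostly_up)
print(range(dens$x))
```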