Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a pre-defined set of genes (e.g., those belonging to a specific GO term or KEGG pathway) shows statistically significant, concordant differences between two biological states. This R Notebook describes the implementation of GSEA using the clusterProfiler package in R. For more information, please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
Follow along interactively with the R Markdown Notebook:
Install and load required packages
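# note: `version` selects the Bioconductor release, not the package version;
# drop it to use the release that matches your R installation
BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("pathview")
BiocManager::install("enrichplot")
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)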
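Reinstalling packages on every run is slow, so it can help to check first which ones are missing. The sketch below is one way to do that; `find_missing` is a hypothetical helper written for this example (it is not part of BiocManager), and the install calls are left commented out so the block has no side effects.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("clusterProfiler", "pathview", "enrichplot", "ggplot2")
to_install <- find_missing(needed)
print(to_install)

# Uncomment to install only what is missing
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# if (length(to_install) > 0) BiocManager::install(to_install)
```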
The sample data is from D. melanogaster, so install and load the annotation package “org.Dm.eg.db” below. See all available annotations here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).
# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)
# reading in data from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)
# we want the log2 fold change
original_gene_list <- df$log2FoldChange
# name the vector
names(original_gene_list) <- df$X
# omit any NA values
gene_list <- na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
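To see what the prepared input looks like without the CSV file, the same steps can be run on a small in-memory table. This is only an illustration: `toy_df` and its values are made up, and the column names `X` and `log2FoldChange` simply mirror the ones used above. The final check for duplicate names is an extra precaution, since duplicated gene IDs can make the ranking ambiguous.

```r
# Toy stand-in for the DESeq2 results table (values are made up for illustration)
toy_df <- data.frame(
  X = c("FBgn0000001", "FBgn0000002", "FBgn0000003", "FBgn0000004"),
  log2FoldChange = c(-1.5, 2.0, NA, 0.3),
  stringsAsFactors = FALSE
)

# Named numeric vector: values are fold changes, names are gene IDs
toy_list <- toy_df$log2FoldChange
names(toy_list) <- toy_df$X

# Drop NA values, then sort in decreasing order
toy_list <- na.omit(toy_list)
toy_list <- sort(toy_list, decreasing = TRUE)
print(names(toy_list))

# Sanity checks: no duplicated gene IDs, values in decreasing order
stopifnot(!anyDuplicated(names(toy_list)), !is.unsorted(rev(toy_list)))
```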
Gene Set Enrichment
keyType This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Dm.eg.db, the options are:
“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID”
“ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT”
“GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL”
“PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”
Check which options are available with the keytypes command, for example keytypes(org.Dm.eg.db).
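A mismatch between the IDs in the gene list and the chosen keyType is a common reason for empty results, so it can be worth checking how many IDs are recognized. The sketch below assumes a hypothetical helper, `fraction_mapped`, and uses made-up IDs; the part that queries the real annotation only runs if org.Dm.eg.db and AnnotationDbi are installed.

```r
# Hypothetical helper: fraction of `ids` found among `valid_keys`
fraction_mapped <- function(ids, valid_keys) {
  mean(ids %in% valid_keys)
}

# Made-up IDs, for illustration only
example_ids <- c("FBgn0000001", "FBgn0000002", "not_an_id")
example_keys <- c("FBgn0000001", "FBgn0000002", "FBgn0000003")
print(fraction_mapped(example_ids, example_keys))

# With the annotation package installed, the same check can use real keys
if (requireNamespace("org.Dm.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  db <- org.Dm.eg.db::org.Dm.eg.db
  print(AnnotationDbi::keytypes(db))
  ensembl_keys <- AnnotationDbi::keys(db, keytype = "ENSEMBL")
  print(fraction_mapped(example_ids, ensembl_keys))
}
```

In practice, `names(gene_list)` would take the place of `example_ids`.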
ont one of “BP” (biological process), “MF” (molecular function), “CC” (cellular component) or “ALL”
nPerm number of permutations; the higher the number, the more accurate your result will be, but the longer the analysis will take.
minGSSize minimum number of genes in set (gene sets with fewer than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in set (gene sets with more than this many genes in your dataset will be ignored).
pvalueCutoff p-value cutoff; gene sets whose p-value exceeds this threshold are left out of the results.
pAdjustMethod one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
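The adjustment method names are the same ones accepted by base R's `p.adjust`, so their effect can be previewed without running an enrichment. The p-values below are made up for illustration. Note that the gseGO call in this notebook passes pAdjustMethod = "none", which means no multiple-testing correction is applied.

```r
# Made-up p-values for four hypothetical gene sets
pvals <- c(0.01, 0.02, 0.03, 0.04)

# One column of adjusted p-values per method
adjusted <- sapply(
  c("none", "BH", "bonferroni"),
  function(m) p.adjust(pvals, method = m)
)
print(adjusted)

# Number of gene sets passing a 0.05 cutoff under each method
print(colSums(adjusted <= 0.05))
```

With many gene sets tested at once, an uncorrected cutoff tends to let through more false positives, so "BH" is a frequent choice when a correction is wanted.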
gse <- gseGO(geneList = gene_list,
             ont = "ALL",
             keyType = "ENSEMBL",
             nPerm = 10000,
             minGSSize = 3,
             maxGSSize = 800,
             pvalueCutoff = 0.05,
             verbose = TRUE,
             OrgDb = organism,
             pAdjustMethod = "none")
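To give a feel for the kind of statistic behind the enrichment score, here is a simplified sketch of the classic GSEA running-sum calculation on a toy ranked list. It is an illustration only and is not clusterProfiler's internal code; `running_es`, the gene names, and the values are all made up for this example.

```r
# Simplified running-sum enrichment score (illustrative sketch)
# ranked:   named numeric vector, sorted in decreasing order
# gene_set: character vector of gene IDs
# p:        weight exponent applied to the ranking statistic
running_es <- function(ranked, gene_set, p = 1) {
  in_set <- names(ranked) %in% gene_set
  hit_weight <- abs(ranked)^p
  hit_weight[!in_set] <- 0
  # Cumulative weighted fraction of hits and unweighted fraction of misses
  p_hit <- cumsum(hit_weight) / sum(hit_weight)
  p_miss <- cumsum(!in_set) / sum(!in_set)
  running <- p_hit - p_miss
  # The score is the largest deviation from zero along the ranked list
  unname(running[which.max(abs(running))])
}

toy_ranked <- c(g1 = 3, g2 = 2, g3 = 1, g4 = -1, g5 = -2, g6 = -3)
es_top <- running_es(toy_ranked, c("g1", "g2"))
es_bottom <- running_es(toy_ranked, c("g5", "g6"))
print(c(top = es_top, bottom = es_bottom))
```

A set concentrated at the top of the list gets a positive score and a set concentrated at the bottom gets a negative one, which is what the sign of the reported score reflects.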
require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
An enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.
emapplot(gse, showCategory = 10)
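Overlap between gene sets can be quantified in several ways. The Jaccard index (shared genes divided by total distinct genes) is one common choice and is assumed here purely for illustration; the helper `jaccard` and the toy sets are made up, and this is not a description of how emapplot computes its edges.

```r
# Hypothetical helper: Jaccard index between two gene sets
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

toy_sets <- list(
  term_A = c("g1", "g2", "g3"),
  term_B = c("g2", "g3", "g4"),
  term_C = c("g7", "g8")
)

# Pairwise similarity matrix; larger values mean more shared genes
sim <- outer(
  seq_along(toy_sets), seq_along(toy_sets),
  Vectorize(function(i, j) jaccard(toy_sets[[i]], toy_sets[[j]]))
)
dimnames(sim) <- list(names(toy_sets), names(toy_sets))
print(sim)
```

Terms with a high similarity would sit close together in the network, while a term sharing no genes with the others would stand alone.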
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)
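The gene-to-term links drawn in such a network can also be inspected as a plain table. The sketch below uses made-up gene sets to build one row per gene-term link and then lists the genes attached to more than one term; it only illustrates the idea and does not read anything from the `gse` object.

```r
# Made-up gene sets, for illustration only
toy_sets <- list(
  term_A = c("g1", "g2", "g3"),
  term_B = c("g2", "g3", "g4"),
  term_C = c("g3", "g5")
)

# Long table with one row per (gene, term) link
edges <- stack(toy_sets)
names(edges) <- c("gene", "term")

# Count how many terms each gene is linked to
term_count <- table(edges$gene)
shared_genes <- names(term_count[term_count > 1])
print(shared_genes)
```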
Grouped by gene set, density plots are generated from the distribution of fold change values of the genes within each set. This helps interpret up- and down-regulated pathways.
ridgeplot(gse) + labs(x = "enrichment distribution")
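The idea behind each ridge can be reproduced with base R's `density` on the fold changes of one gene set at a time. The values below are made up for illustration, and the sketch does not reproduce how ridgeplot selects genes or draws the plot.

```r
# Made-up fold changes for the genes of two hypothetical gene sets
fold_changes <- list(
  term_up = c(1.2, 1.8, 2.1, 0.9, 1.5),
  term_down = c(-2.0, -1.1, -1.7, -0.8, -1.4)
)

# One kernel density estimate per gene set, i.e. one ridge per set
densities <- lapply(fold_changes, density)

# Location of each density peak along the fold-change axis
peaks <- vapply(densities, function(d) d$x[which.max(d$y)], numeric(1))
print(peaks)
```

A peak to the right of zero suggests the set is mostly up-regulated, and a peak to the left suggests it is mostly down-regulated.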