Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a pre-defined set of genes (ex: those beloging to a specific GO term or KEGG pathway) shows statistically significant, concordant differences between two biological states. This R Notebook describes the implementation of GSEA using the clusterProfiler package in R. For more information please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
Author
R Markdown
Follow along interactively with the R Markdown Notebook:
https://github.com/gencorefacility/r-notebooks/blob/master/gsea.Rmd
Install and load required packages
BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("pathview")
BiocManager::install("enrichplot")
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)
Data
CSV file containing a list of gene names and log2 fold change values. This data is typically produced by differential expression analysis tool such as DESeq 2. Download sample data here.
Annotations
The sample data is from D melanogaster, so install and load the annotation “org.Dm.eg.db” below. See all annotations available here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).
# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)
Prepare Input
# reading in data from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)
# we want the log2 fold change
original_gene_list <- df$log2FoldChange
# name the vector
names(original_gene_list) <- df$X
# omit any NA values
gene_list<-na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
Gene Set Enrichment
Params:
keyType This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Dm.eg.db, the options are:
“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID”
“ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT”
“GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL”
“PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”
Check which options are available with the keytypes
command, for example keytypes(org.Dm.eg.db)
.
ont one of “BP”, “MF”, “CC” or “ALL”
nPerm the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take.
minGSSize minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
pvalueCutoff pvalue Cutoff.
pAdjustMethod one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
gse <- gseGO(geneList=gene_list, ont ="ALL", keyType = "ENSEMBL", nPerm = 10000, minGSSize = 3, maxGSSize = 800, pvalueCutoff = 0.05, verbose = TRUE, OrgDb = organism, pAdjustMethod = "none")
Output
Dotplot
require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
Encrichment Map:
Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.
emapplot(gse, showCategory = 10)
Category Netplot
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)
Ridgeplot
Grouped by gene set, density plots are generated by using the frequency of fold change values per gene within each set. Helpful to interpret up/down-regulated pathways.
ridgeplot(gse) + labs(x = "enrichment distribution")
GSEA Plot
Plot of the Running Enrichment Score (green line) for a gene set as the analysis walks down the ranked gene list, including the location of the maximum enrichment score (the red line). The black lines in the Running Enrichment Score show where the members of the gene set appear in the ranked list of genes, indicating the leading edge subset.
The Ranked list metric shows the value of the ranking metric (log2 fold change) as you move down the list of ranked genes. The ranking metric measures a gene’s correlation with a phenotype.
Params:
Gene Set Integer. Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc.
# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)
PubMed trend of enriched terms
Plots the number/proportion of publications trend based on the query result from PubMed Central.
terms <- gse$Description[1:3]
pmcplot(terms, 2010:2018, proportion=FALSE)
KEGG Gene Set Enrichment Analysis
For KEGG pathway enrichment using the gseKEGG()
function, we need to convert id types. We can use the bitr
function for this (included in clusterProfiler). It is normal for this call to produce some messages / warnings.
In the bitr
function, the param fromType
should be the same as keyType
from the gseGO
function above (the annotation source). This param is used again in the next two steps: creating dedup_ids
and df2
.
toType
in the bitr
function has to be one of the available options from keyTypes(org.Dm.eg.db)
and must map to one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’ because gseKEGG()
only accepts one of these 4 options as it’s keytype
parameter. In the case of org.Dm.eg.db, none of those 4 types are available, but ‘ENTREZID’ are the same as ncbi-geneid for org.Dm.eg.db so we use this for toType
.
As our intial input, we use original_gene_list
which we created above.
Prepare Input
# Convert gene IDs for gseKEGG function # We will lose some genes here because not all IDs will be converted ids<-bitr(names(original_gene_list), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb=organism)
# remove duplicate IDS (here I use "ENSEMBL", but it should be whatever was selected as keyType) dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),] # Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above df2 = df[df$X %in% dedup_ids$ENSEMBL,] # Create a new column in df2 with the corresponding ENTREZ IDs df2$Y = dedup_ids$ENTREZID # Create a vector of the gene unuiverse kegg_gene_list <- df2$log2FoldChange # Name vector with ENTREZ ids names(kegg_gene_list) <- df2$Y # omit any NA values kegg_gene_list<-na.omit(kegg_gene_list) # sort the list in decreasing order (required for clusterProfiler) kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
Create gseKEGG object
organism KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code). I define this as kegg_organism
first, because it is used again below when making the pathview plots.
nPerm the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take.
minGSSize minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
pvalueCutoff pvalue Cutoff.
pAdjustMethod one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.
keyType one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’.
kegg_organism = "dme" kk2 <- gseKEGG(geneList = kegg_gene_list, organism = kegg_organism, nPerm = 10000, minGSSize = 3, maxGSSize = 800, pvalueCutoff = 0.05, pAdjustMethod = "none", keyType = "ncbi-geneid")
Dotplot
dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)
Encrichment map:
Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.
emapplot(kk2)
Category Netplot:
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(kk2, categorySize="pvalue", foldChange=gene_list)
Ridgeplot
Helpful to interpret up/down-regulated pathways.
ridgeplot(kk2) + labs(x = "enrichment distribution")
GSEA Plot
Traditional method for visualizing GSEA result.
Params:
Gene Set Integer. Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc. Default: 1
# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)
Pathview
This will create a PNG and different PDF of the enriched KEGG pathway.
Params:
gene.data This is kegg_gene_list
created above
pathway.id The user needs to enter this. Enriched pathways + the pathway ID are provided in the gseKEGG output table (above).
species Same as organism
above in gseKEGG
, which we defined as kegg_organism
library(pathview)
# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism)
# Produce a different plot (PDF) (not displayed here)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism, kegg.native = F)
knitr::include_graphics("dme04130.pathview.png")