Gene Set Enrichment Analysis with ClusterProfiler

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a pre-defined set of genes (ex: those beloging to a specific GO term or KEGG pathway) shows statistically significant, concordant differences between two biological states. This R Notebook describes the implementation of GSEA using the clusterProfiler package in R. For more information please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html

Author

Mohammed Khalfan

R Markdown

Follow along interactively with the R Markdown Notebook:
https://github.com/gencorefacility/r-notebooks/blob/master/gsea.Rmd

Install and load required packages

BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("pathview")
BiocManager::install("enrichplot")
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)

Data

CSV file containing a list of gene names and log2 fold change values. This data is typically produced by differential expression analysis tool such as DESeq 2. Download sample data here.

Annotations

The sample data is from D melanogaster, so install and load the annotation “org.Dm.eg.db” below. See all annotations available here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).

# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)

Prepare Input

# reading in data from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)

# we want the log2 fold change 
original_gene_list <- df$log2FoldChange

# name the vector
names(original_gene_list) <- df$X

# omit any NA values 
gene_list<-na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)

Gene Set Enrichment

Params:

keyType This is the source of the annotation (gene ids). The options vary for each annotation. In the example of org.Dm.eg.db, the options are:

“ACCNUM” “ALIAS” “ENSEMBL” “ENSEMBLPROT” “ENSEMBLTRANS” “ENTREZID”
“ENZYME” “EVIDENCE” “EVIDENCEALL” “FLYBASE” “FLYBASECG” “FLYBASEPROT”
“GENENAME” “GO” “GOALL” “MAP” “ONTOLOGY” “ONTOLOGYALL”
“PATH” “PMID” “REFSEQ” “SYMBOL” “UNIGENE” “UNIPROT”

Check which options are available with the keytypes command, for example keytypes(org.Dm.eg.db).

ont one of “BP”, “MF”, “CC” or “ALL”
nPerm the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take.
minGSSize minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
pvalueCutoff pvalue Cutoff.
pAdjustMethod one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”

gse <- gseGO(geneList=gene_list, 
             ont ="ALL", 
             keyType = "ENSEMBL", 
             nPerm = 10000, 
             minGSSize = 3, 
             maxGSSize = 800, 
             pvalueCutoff = 0.05, 
             verbose = TRUE, 
             OrgDb = organism, 
             pAdjustMethod = "none")

Output

Dotplot

require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)

Encrichment Map:

Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional modules.

emapplot(gse, showCategory = 10)

Category Netplot

The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).

# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)

Ridgeplot

Grouped by gene set, density plots are generated by using the frequency of fold change values per gene within each set. Helpful to interpret up/down-regulated pathways.

ridgeplot(gse) + labs(x = "enrichment distribution")

GSEA Plot

Plot of the Running Enrichment Score (green line) for a gene set as the analysis walks down the ranked gene list, including the location of the maximum enrichment score (the red line). The black lines in the Running Enrichment Score show where the members of the gene set appear in the ranked list of genes, indicating the leading edge subset.

The Ranked list metric shows the value of the ranking metric (log2 fold change) as you move down the list of ranked genes. The ranking metric measures a gene’s correlation with a phenotype.

Params:
Gene Set Integer. Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc.

# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)

PubMed trend of enriched terms

Plots the number/proportion of publications trend based on the query result from PubMed Central.

terms <- gse$Description[1:3]
pmcplot(terms, 2010:2018, proportion=FALSE)

KEGG Gene Set Enrichment Analysis

For KEGG pathway enrichment using the gseKEGG() function, we need to convert id types. We can use the bitr function for this (included in clusterProfiler). It is normal for this call to produce some messages / warnings.

In the bitr function, the param fromType should be the same as keyType from the gseGO function above (the annotation source). This param is used again in the next two steps: creating dedup_ids and df2.

toType in the bitr function has to be one of the available options from keyTypes(org.Dm.eg.db) and must map to one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’ because gseKEGG() only accepts one of these 4 options as it’s keytype parameter. In the case of org.Dm.eg.db, none of those 4 types are available, but ‘ENTREZID’ are the same as ncbi-geneid for org.Dm.eg.db so we use this for toType.

As our intial input, we use original_gene_list which we created above.

Prepare Input

# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids<-bitr(names(original_gene_list), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb=organism)
 # remove duplicate IDS (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

# Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
df2 = df[df$X %in% dedup_ids$ENSEMBL,]

# Create a new column in df2 with the corresponding ENTREZ IDs
df2$Y = dedup_ids$ENTREZID

# Create a vector of the gene unuiverse
kegg_gene_list <- df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y

# omit any NA values 
kegg_gene_list<-na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

Create gseKEGG object

organism KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code). I define this as kegg_organism first, because it is used again below when making the pathview plots.
nPerm the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take.
minGSSize minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
maxGSSize maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
pvalueCutoff pvalue Cutoff.
pAdjustMethod one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.
keyType one of ‘kegg’, ‘ncbi-geneid’, ‘ncib-proteinid’ or ‘uniprot’.

kegg_organism = "dme"
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid")

Dotplot

dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)

Encrichment map:

 emapplot(kk2)

Category Netplot:

# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(kk2, categorySize="pvalue", foldChange=gene_list)

Ridgeplot

Helpful to interpret up/down-regulated pathways.

ridgeplot(kk2) + labs(x = "enrichment distribution")

GSEA Plot

Traditional method for visualizing GSEA result.

Params:
Gene Set Integer. Corresponds to gene set in the gse object. The first gene set is 1, second gene set is 2, etc. Default: 1

# Use the `Gene Set` param for the index in the title, and as the value for geneSetId
gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)

Pathview

This will create a PNG and different PDF of the enriched KEGG pathway.

Params:
gene.data This is kegg_gene_list created above
pathway.id The user needs to enter this. Enriched pathways + the pathway ID are provided in the gseKEGG output table (above).
species Same as organism above in gseKEGG, which we defined as kegg_organism

library(pathview)

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism)

# Produce a different plot (PDF) (not displayed here)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism, kegg.native = F)

knitr::include_graphics("dme04130.pathview.png")

KEGG Native Enriched Pathway Plot