Over-representation analysis

Over-representation (or enrichment) analysis is a statistical method that determines whether genes from pre-defined sets (e.g. those belonging to a specific GO term or KEGG pathway) are present more than would be expected (over-represented) in a subset of your data. In this case, the subset is your set of under- or over-expressed genes.

For example, the fruit fly transcriptome has about 10,000 genes. If 260 genes are categorized as “axon guidance” (2.6% of all genes carry that category), and in an experiment we find 1,000 genes are differentially expressed and 200 of those genes are in the “axon guidance” category (20% of DE genes), is that significant?

This page describes the implementation of over-representation analysis using the clusterProfiler package. It is run downstream of a differential expression analysis (e.g. with DESeq2). Full clusterProfiler documentation: clusterProfiler vignette.

Note

An R Markdown notebook covering this workflow is available at gencorefacility/r-notebooks/ora.Rmd.

Workflow overview

Load clusterProfiler and the organism-specific annotation database.
Read the differential-expression results and build a sorted, named vector of log2 fold-changes (the gene universe) and a filtered vector of significant DE genes.
Run enrichGO to test for GO-term over-representation.
Visualize the GO results (upset, wordcloud, barplot, dotplot, enrichment map, induced graph, category netplot).
Convert gene IDs and run enrichKEGG for KEGG pathway over-representation.
Visualize the KEGG results and render an enriched pathway with pathview.

Environment

This is an R workflow. Install clusterProfiler, pathview, and wordcloud once, then load them at the start of each session.

BiocManager::install("clusterProfiler", version = "3.8")
BiocManager::install("pathview")
install.packages("wordcloud")
library(clusterProfiler)
library(wordcloud)

Inputs

You need:

A differential expression results CSV (drosphila_example_de.csv in the examples below), typically produced by a tool such as DESeq2. At minimum it must contain a gene-ID column, a log2FoldChange column, and a padj column.
An organism annotation database matched to the gene IDs in that CSV. The examples here use D. melanogaster and org.Dm.eg.db.

Annotation database

Install and load the annotation package for your organism. See the full list at Bioconductor OrgDb packages (there are 19 presently available).

organism = "org.Dm.eg.db"
BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)

Prepare input

Read the DESeq2 output and build two vectors: the full gene universe (named, sorted log2 fold-changes) and the filtered set of significant DE genes (gene names only).

# reading in input from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)

# we want the log2 fold change
original_gene_list <- df$log2FoldChange

# name the vector
names(original_gene_list) <- df$X

# omit any NA values
gene_list <- na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)

# Extract significant results (padj < 0.05)
sig_genes_df = subset(df, padj < 0.05)

# From significant results, we want to filter on log2 fold change
genes <- sig_genes_df$log2FoldChange

# Name the vector
names(genes) <- sig_genes_df$X

# omit NA values
genes <- na.omit(genes)

# filter on min log2 fold change (log2FoldChange > 2)
genes <- names(genes)[abs(genes) > 2]

GO enrichment

Run enrichGO to test whether GO terms are over-represented among the significant DE genes, using the full gene universe as background.

Parameters

Ontology Options: "BP" (Biological Process), "MF" (Molecular Function), "CC" (Cellular Component).

keyType This is the source of the annotation (gene IDs). The options vary for each annotation. In the example of org.Dm.eg.db, the options are:

"ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "FLYBASE" "FLYBASECG" "FLYBASEPROT" "GENENAME" "GO" "GOALL" "MAP" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"

Check which options are available with the keytypes command, for example keytypes(org.Dm.eg.db).

Create the enrichGO object

go_enrich <- enrichGO(gene = genes,
                      universe = names(gene_list),
                      OrgDb = organism,
                      keyType = 'ENSEMBL',
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05,
                      qvalueCutoff = 0.10)

Visualizations

Upset plot

Emphasizes the genes overlapping among different gene sets.

BiocManager::install("enrichplot")
library(enrichplot)
upsetplot(go_enrich)

enrichplot upsetplot of enriched GO terms — Upset plot of GO-term over-representation, highlighting gene overlap between enriched sets.

Wordcloud

wcdf <- read.table(text = go_enrich$GeneRatio, sep = "/")[1]
wcdf$term <- go_enrich[,2]
wordcloud(words = wcdf$term, freq = wcdf$V1, scale=(c(4, .1)),
          colors = brewer.pal(8, "Dark2"), max.words = 25)

Barplot

barplot(go_enrich,
        drop = TRUE,
        showCategory = 10,
        title = "GO Biological Pathways",
        font.size = 8)

Dotplot

dotplot(go_enrich)

Enrichment map

Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. Mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.

emapplot(go_enrich)

enrichment map of enriched GO terms — Enrichment map: nodes are GO terms, edges connect terms that share genes. Functional modules cluster together.

Enriched GO induced graph

goplot(go_enrich, showCategory = 10)

induced GO graph of enriched terms — Induced GO DAG for the top 10 enriched terms, showing each term’s parents in the ontology hierarchy.

Category netplot

The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network. It is helpful for seeing which genes are involved in enriched pathways and which genes belong to multiple annotation categories.

# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(go_enrich, categorySize = "pvalue", foldChange = gene_list)

cnetplot of enriched GO terms linking genes to categories — Category netplot: gene nodes are connected to the enriched GO categories they belong to, with fold-change overlaid as colour.

KEGG pathway enrichment

For KEGG pathway enrichment we use the enrichKEGG function. KEGG requires its own identifier types, so we first convert gene IDs with bitr (included in clusterProfiler). It is normal for this call to produce some messages or warnings.

In the bitr function, the param fromType should be the same as the keyType we used for enrichGO above (the annotation source). This param is used again in the next two steps: creating dedup_ids and df2.

toType in the bitr function has to be one of the available options from keyTypes(org.Dm.eg.db) and must map to one of 'kegg', 'ncbi-geneid', 'ncib-proteinid' or 'uniprot', because enrichKEGG only accepts one of those four options as its keytype parameter. In the case of org.Dm.eg.db, none of those four types are available directly, but 'ENTREZID' is the same as 'ncbi-geneid' for org.Dm.eg.db, so we use 'ENTREZID' for toType.

As our initial input, we use original_gene_list created above.

Prepare data

# Convert gene IDs for enrichKEGG function
# We will lose some genes here because not all IDs will be converted
ids <- bitr(names(original_gene_list), fromType = "ENSEMBL",
            toType = "ENTREZID", OrgDb = "org.Dm.eg.db")

# remove duplicate IDS (here I use "ENSEMBL", but it should be
# whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

# Create a new dataframe df2 which has only the genes which were
# successfully mapped using the bitr function above
df2 = df[df$X %in% dedup_ids$ENSEMBL,]

# Create a new column in df2 with the corresponding ENTREZ IDs
df2$Y = dedup_ids$ENTREZID

# Create a vector of the gene universe
kegg_gene_list <- df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y

# omit any NA values
kegg_gene_list <- na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

# Extract significant results from df2
kegg_sig_genes_df = subset(df2, padj < 0.05)

# From significant results, we want to filter on log2 fold change
kegg_genes <- kegg_sig_genes_df$log2FoldChange

# Name the vector with the CONVERTED ID!
names(kegg_genes) <- kegg_sig_genes_df$Y

# omit NA values
kegg_genes <- na.omit(kegg_genes)

# filter on log2 fold change (PARAMETER)
kegg_genes <- names(kegg_genes)[abs(kegg_genes) > 2]

Parameters

organism KEGG organism code. The full list is here: KEGG organism list (need the 3-letter code). We define it as kegg_organism first, because it is used again below when making the pathview plots.

keyType one of 'kegg', 'ncbi-geneid', 'ncib-proteinid' or 'uniprot'.

Create the enrichKEGG object

kegg_organism = "dme"
kk <- enrichKEGG(gene = kegg_genes,
                 universe = names(kegg_gene_list),
                 organism = kegg_organism,
                 pvalueCutoff = 0.05,
                 keyType = "ncbi-geneid")

Visualizations

Barplot

barplot(kk,
        showCategory = 10,
        title = "Enriched Pathways",
        font.size = 8)

Dotplot

dotplot(kk,
        showCategory = 10,
        title = "Enriched Pathways",
        font.size = 8)

Category netplot

The cnetplot depicts the linkages of genes and biological concepts (here KEGG pathways) as a network.

# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(kk, categorySize = "pvalue", foldChange = gene_list)

cnetplot of enriched KEGG pathways linking genes to categories — Category netplot for KEGG pathway enrichment, with DE-gene fold-change overlaid as colour.

Pathview

pathview renders an enriched KEGG pathway with your DE genes overlaid, producing a PNG and a (different) PDF of the pathway map.

Parameters

gene.data The named vector kegg_gene_list created above.

pathway.id The user needs to enter this. Enriched pathways plus the pathway ID are provided in the enrichKEGG output table above.

species Same as organism above in enrichKEGG, which we defined as kegg_organism.

gene.idtype The index number (first index is 1) corresponding to your keytype from this list: gene.idtype.list.

library(pathview)

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data = gene_list, pathway.id = "dme04080",
                species = kegg_organism,
                gene.idtype = gene.idtype.list[3])

# Produce a different plot (PDF) (not displayed here)
dme <- pathview(gene.data = gene_list, pathway.id = "dme04080",
                species = kegg_organism,
                gene.idtype = gene.idtype.list[3],
                kegg.native = F)

knitr::include_graphics("dme04080.pathview.png")

KEGG native enriched pathway plot for dme04080 — KEGG native enriched pathway plot (`dme04080`) with DE-gene fold-change overlaid on pathway nodes.