Series of Metagenomics Workshop

Shotgun

During this course, participants will learn about the concepts and steps involved in analyzing the metagenomic data obtained from short read high throughput sequencing.

Participants are expected to have some basic knowledge of metagenome/microbiome as well as some biological knowledge on the subject matter.

Although the material and the methods are designed to cater for the NYU Abu Dhabi High Performance Computing environment, it is possible to run this workshop on any other system provided that the necessary software is installed, and the data is uploaded to that environment. It will also be necessary to change the input and output directories/files to accommodate such an environment.

Connecting to the HPC

Using a Mac Machine

Open the Terminal app and type:
```
ssh NetID@jubail.abudhabi.nyu.edu
```
Enter your NYU password when prompted and hit enter.
Once logged in, navigate to your personal SCRATCH directory or any other subdirectory where you want to run this workshop.

Using a Windows Machine

Open the PuTTY app, and fill out the fields as follows:
- Host name: jubail.abudhabi.nyu.edu
- Port: 22
- Click Open.
Enter your NetID and password when prompted.
Once logged in, navigate to your personal SCRATCH directory or any other subdirectory where you want to run this workshop.

MODULE 1

The Dataset

The shotgun metagenomic sequencing data used for this hands-on training was obtained from a recently published study in Microbiome by Jeilu et al. 2025 (https://doi.org/10.1186/s40168-025-02276-7). The authors deposited the data in the NCBI under the BioProject accession number PRJNA1228129.

Briefly, the study was undertaken for metagenomic profiling of airborne microbial communities from aircraft filters and face masks. This study compared the following four groups:

Aircraft filter — aircraft cabin filter removed from a commercial airplane after 8039 h of usage
Travel — face masks worn by air travelers
Hospital — face masks worn by healthcare professionals in a hospital
Control — unworn

For the sake of time and resources, we will be using the following two groups in this workshop to analyze microbial profiling:

Aircraft filter
Control

We will be using the NYUAD High Performance Computing (HPC) cluster for this workshop; however, you can run all of the analysis on any stand-alone machine (server, personal laptop/desktop, etc.) provided that the necessary software packages are pre-installed.

Setting Up the Working Directory

mkdir -p /scratch/$USER/metagenomics/
mkdir -p /scratch/$USER/metagenomics/raw_fastqs/
cd /scratch/$USER/metagenomics/
cp -r /scratch/gencore/metagenomics/* .

Quality Control and Trimming (QC/QT)

We will check the quality of reads, then perform quality trimming to remove low-quality reads, adapter sequences, and host (human) sequences. This step produces clean, high-quality reads free from host contamination for downstream analysis.

We will use KneadData, part of the Biobakery suite (https://github.com/biobakery/kneaddata).

Load the KneadData module:

module purge
module load all
module load gencore/3
module load kneaddata/0.12.4

Run kneaddata --help to explore available options, then run the tool:

kneaddata \
  -i1 raw_fastqs/AF1_1.fastq \
  -i2 raw_fastqs/AF1_2.fastq \
  --reference-db /scratch/Reference_Genomes/Public/Metagenomic/kneaddata_hg39/ \
  --reference-db /scratch/Reference_Genomes/Public/Metagenomic/kneaddata_phix/ \
  --trimmomatic-options="SLIDINGWINDOW:4:20" --trimmomatic-options="MINLEN:50" \
  --run-trim-repetitive \
  --output analysis/kneaddata/ \
  --output-prefix AF1 \
  --remove-intermediate-output \
  --fastqc fastqc \
  --threads 14

Run FastQC on the cleaned output:

fastqc analysis/kneaddata/AF1_paired_1.fastq -t 14 -o analysis/kneaddata/fastqc
fastqc analysis/kneaddata/AF1_paired_2.fastq -t 14 -o analysis/kneaddata/fastqc

Note

The results of this run have been shared with you. Compare the results you obtain by running QC/QT with the provided ready-made results. Feel free to ask questions or discuss if you obtain very different results between the two sets.

MODULE 2

Taxonomic Classification with Kraken2

Load the required modules and run Kraken2:

module purge && \
module load all && \
module load gencore/3 && \
module load kraken2/2.17.1

k2 classify \
  --db /scratch/Reference_Genomes/Public/Metagenomic/kraken2/bacteria \
  --threads 14 \
  --confidence 0.1 \
  --output analysis/kraken2/AF1.kraken \
  --unclassified-out analysis/kraken2/AF1_unbact#.fastq \
  --report analysis/kraken2/AF1_profile.txt \
  --report-zero-counts \
  --paired analysis/kneaddata/AF1_paired_1.fastq analysis/kneaddata/AF1_paired_2.fastq

Abundance Estimation with Bracken

module purge && \
module load all gencore/3 && \
module load bracken/3.1

bracken -d /scratch/Reference_Genomes/Public/Metagenomic/kraken2/bacteria \
  -i analysis/kraken2/AF1_profile.txt \
  -o analysis/bracken/AF1_S.txt \
  -r 100 -l S -t 10 > analysis/bracken/AF1_log.txt

Diversity Analysis

Load modules:

module load all gencore/3
module load R/4.3.1
module load biopython/1.85

Alpha Diversity

python \
  /scratch/Reference_Genomes/Public/Metagenomic/KrakenTools/DiversityTools/alpha_diversity.py \
  -f analysis/bracken/AF1_S.txt \
  -a Sh \
  > analysis/alpha_diversity/AF1_Sh.txt

Beta Diversity

python \
  /scratch/Reference_Genomes/Public/Metagenomic/KrakenTools/DiversityTools/beta_diversity.py \
  -i analysis/kraken2/*profile.txt --type kreport2 -l S \
  > analysis/beta_diversity/merged_beta_div.txt

Diversity Plots

Set up the plots directory and merge diversity outputs:

mkdir -p diversity_plots/
cd diversity_plots/
cp analysis/kraken2/*_profile.txt .
cp analysis/bracken/*_Sh.txt .

python /scratch/Reference_Genomes/Public/Metagenomic/KrakenTools/DiversityTools/beta_diversity.py \
  -i *_profile.txt --type kreport2 -l S > merged_beta_div.txt

(echo -e "Sample\tDiversity"; \
  awk 'FNR==1{f=FILENAME; sub(/^.*\//,"",f); sub("_Sh.txt","",f)} \
       /Shannon/{print f"\t"$NF}' *_Sh.txt) > merged_alpha_div.txt

rm *_profile.txt
rm *_Sh.txt

Add group annotations to the alpha diversity table:

Rscript -e '
  df <- read.table("merged_alpha_div.txt", header=TRUE, stringsAsFactors=FALSE)
  df$Group <- ifelse(grepl("^AF", df$Sample), "Filter",
                ifelse(grepl("^C", df$Sample), "Control", "Other"))
  write.table(df, "merged_alpha_div_with_group.txt", sep="\t", quote=FALSE, row.names=FALSE)
'

Generate a Shannon diversity boxplot with Wilcoxon test:

Rscript -e '
  df <- read.table("merged_alpha_div_with_group.txt", header=TRUE, stringsAsFactors=FALSE)
  ymax <- max(df$Diversity) * 1.15
  png("diversity_boxplot_stats.png", width=800, height=600, res=120)
  boxplot(Diversity ~ Group, data=df, col=c("steelblue","orange"),
          ylab="Shannon Diversity", main="Shannon Diversity by Group",
          ylim=c(min(df$Diversity), ymax))
  stripchart(Diversity ~ Group, data=df, vertical=TRUE, method="jitter",
             pch=16, col="black", add=TRUE)
  w <- wilcox.test(Diversity ~ Group, data=df)
  p_txt <- paste0("Wilcoxon p = ", signif(w$p.value, 3))
  text(x=1.5, y=ymax, labels=p_txt)
  dev.off()
'

Clean the beta diversity matrix:

Rscript -e '
  args <- commandArgs(trailingOnly=TRUE)
  infile <- args[1]; outfile <- args[2]
  lines <- readLines(infile)
  samp_lines <- grep("^#", lines, value=TRUE)
  samps <- sub("^#\\d+\\s+","", samp_lines)
  samps <- sub("_profile.txt.*","", samps)
  start <- grep("^x\\s", lines)
  mat <- read.table(text=lines[start:length(lines)], header=TRUE,
                    stringsAsFactors=FALSE, check.names=FALSE)
  mat <- mat[, -1]
  mat[mat=="x.xxx"] <- NA
  mat <- as.matrix(sapply(mat, as.numeric))
  rownames(mat) <- samps; colnames(mat) <- samps
  mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
  diag(mat) <- 0
  write.table(mat, file=outfile, sep="\t", quote=FALSE, col.names=NA)
' merged_beta_div.txt cleaned_beta_matrix.txt

Generate a PCoA plot of beta diversity:

Rscript -e '
  library(vegan)
  mat <- as.matrix(read.table("cleaned_beta_matrix.txt", header=TRUE,
                               check.names=FALSE, row.names=1))
  d <- as.dist(mat)
  p <- cmdscale(d, k=2, eig=TRUE)
  scores <- as.data.frame(p$points)
  scores$Sample <- rownames(scores)
  scores$Group <- ifelse(grepl("^AF", scores$Sample), "AirFilter",
                   ifelse(grepl("^C", scores$Sample), "Control", "Other"))
  ad <- adonis2(d ~ Group, data=scores)
  pval <- ad$"Pr(>F)"[1]
  var_expl <- round(p$eig / sum(p$eig) * 100, 1)
  png("beta_pcoa.png", width=800, height=600, res=120)
  par(mar=c(5,5,4,2))
  cols <- c(AirFilter="steelblue", Control="orange", Other="gray")
  plot(scores$V1, scores$V2,
       xlab=paste0("PCoA1 (", var_expl[1], "%)"),
       ylab=paste0("PCoA2 (", var_expl[2], "%)"),
       pch=16, col=cols[scores$Group],
       main="PCoA of beta diversity")
  legend("topright", legend=unique(scores$Group),
         col=cols[unique(scores$Group)], pch=16, bty="n")
  text(x=min(scores$V1), y=max(scores$V2),
       labels=paste0("PERMANOVA p = ", signif(pval, 3)), adj=c(0,1))
  dev.off()
'