De Novo Transcriptome Assembly & Annotation

De novo assembly, quality assessment, functional annotation, and expression quantification of a transcriptome from paired-end short reads (Illumina), run on locally installed HPC modules.

This workshop uses publicly available RNA-Seq data from the non-model prokaryote Staphylococcus aureus grown under three carbon sources (glucose, fructose, pyruvate; two replicates each). No reference genome is assumed: the transcriptome is built directly from the reads with Trinity, assessed, annotated against homology databases, and then used as a reference for abundance estimation.

Scope is limited to a single, guided end-to-end pass on this dataset. It does not attempt to compare assemblers, tune parameters per species, or cover every downstream analysis. The aim is a reference workflow you can adapt to your own data.

Note

The analysis runs on the NYUAD High Performance Computing (HPC) cluster, but every step works on any standalone machine (server, laptop, desktop) provided the software stack below is installed. We recommend conda for installing and maintaining bioinformatics software.

Pipeline overview

De novo transcriptome assembly and annotation pipeline

The workflow. Reads are assembled once with Trinity into a reference transcriptome; that assembly is then assessed (read mapping, Nx/ExN50, BUSCO, QUAST), annotated (TransDecoder ORFs + Trinotate homology search), and finally quantified per sample (RSEM) for differential expression (DESeq2). The assembly is the hub: every downstream step consumes it.

Note

Drop your own workflow image at _images/denovo-workflow.png (or change the path in the figure directive). If you have sphinx-design or Graphviz enabled, this is also a natural place for a generated diagram.

Steps

  1. Assemble the reads into a reference transcriptome with Trinity.

  2. Assess the assembly: read mapping rate, contig Nx / ExN50, BUSCO completeness, and QUAST mis-assembly metrics.

  3. Predict ORFs with TransDecoder and annotate with Trinotate (BLAST/DIAMOND vs. SwissProt, PFAM via HMMER, EggNOG, TmHMM, SignalP).

  4. Quantify expression per sample with RSEM, build expression matrices, and run differential expression with DESeq2.

  5. Visualize alignments and annotations in IGV and TrinotateWeb.

Environment

All commands assume the following modules are loaded. Exact names and versions depend on your HPC installation; substitute the versions available to you.

module purge
module load all
module load gencore/2
module load trinity/2.15.1
module load salmon/0.14.2
module load numpy/1.26.4
module load bowtie2/2.4.5
module load samtools/1.10
module load hisat2/2.2.1

Note

Different stages load different stacks (BUSCO, QUAST, and Trinotate each need their own environment). Those are given inline in the relevant sections below rather than all up front, because they conflict if loaded together. Parameters can change between tool versions, so verify flags against the version you actually have.

The tools used across the workshop, with upstream documentation:

Inputs

Before running the pipeline you need:

  • Paired-end RNA-Seq reads in FASTQ format. This dataset has 6 samples (12 files), 3 conditions × 2 replicates. Public accessions: SRR28281136SRR28281141 from the SRA (https://www.ncbi.nlm.nih.gov/sra).

  • A samples / conditions file (cond.txt), tab-delimited, mapping each replicate to its read pair (format given in Running the assembly).

  • A pre-generated assembly (trinity_assembly.Trinity.fasta) shipped with the data. Assembly is CPU/memory/time intensive, so for the workshop it is provided ready-made; the commands to regenerate it are shown anyway.

  • A pre-initialized Trinotate database (myTrinotate.sqlite), also shipped with the data so you do not have to download and build the homology databases during the session.

All other files in the pipeline are produced by the commands below.

File formats you will encounter (assumed familiar): FASTA, FASTQ, SAM, BAM.

Getting the data

We use the NYUAD HPC cluster. Connect, move to your scratch space, and copy the workshop data.

Connecting from macOS / Linux

  1. Open Terminal and run ssh NetID@jubail.abudhabi.nyu.edu. Enter your NYU password when prompted.

  2. Move to your scratch directory: cd $SCRATCH.

  3. Create and enter a working directory: mkdir de_novo_transcriptome && cd de_novo_transcriptome.

  4. Copy the data: cp -r /scratch/gencore/nd48/workshop_de_novo_trans/data .

Connecting from Windows

  1. Open PuTTY and set Host name = jubail.abudhabi.nyu.edu, Port = 22, then click Open.

  2. Enter your NetID and password when prompted.

  3. Move to your scratch directory: cd $SCRATCH.

  4. Create and enter a working directory: mkdir de_novo_transcriptome && cd de_novo_transcriptome.

  5. Copy the data: cp -r /scratch/gencore/nd48/workshop_de_novo_trans/data .

The data

The datasets are publicly available sequencing reads in FASTQ format, downloadable from the SRA under accessions SRR28281136, SRR28281137, SRR28281138, SRR28281139, SRR28281140, and SRR28281141.

There are 6 samples organized as paired-end sequences (12 files), spanning 3 conditions with 2 replicates each. The organism is Staphylococcus aureus, and the conditions represent bacterial growth under three carbon sources: glucose, fructose, and pyruvate.

The scripts

The data folder ships with several shell scripts:

  • assembly.sh — assembles the transcriptome with Trinity.

  • qc.sh — assesses the assembly using read alignment, BUSCO, and QUAST.

  • annotate.sh — runs TransDecoder and Trinotate for annotation.

  • quantify.sh — quantifies expression and runs DGE (RSEM + DESeq2).

  • de_novo.sh — combines all of the above into a single submission.

Note

To run the complete analysis as one job, submit de_novo.sh. The individual scripts exist so each stage can be run and inspected on its own.

Assembly with Trinity

Most sequencing analyses begin with quality checking and trimming to remove low-quality bases and adapter contamination. That step is not covered here; see the quality-checking section of the Variant Detection & Annotation workshop (https://github.com/nizardrou/Variant-Detection-and-Annotation-workshop).

We use Trinity for assembly. There are alternatives, but Trinity produces reliable assemblies across organism sizes and ships a rich set of helper scripts for post-processing. The golden rule still applies: no single tool is best for every dataset, so for real projects run several assemblers and compare.

Note

The data folder already contains a pre-built assembly, trinity_assembly.Trinity.fasta. Assembly takes 2+ hours and is resource-heavy, so the workshop skips the run itself and uses the pre-built FASTA for all downstream steps. The Trinity commands are shown (and are commented out in de_novo.sh) so you can reproduce them later.

SLURM arguments

The first lines of the script are SLURM directives telling the scheduler where and how to run the job. They are ignored if you are not submitting with sbatch, so you can leave them in place when running locally.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --time=04:00:00
#SBATCH --mem 100000

De novo assembly is resource intensive, so we request 100 GB of memory and 28 CPUs. Larger datasets and larger transcriptomes need more. A useful rule of thumb is at least 1 GB of memory per 1 million reads; for very large read sets, run digital normalization before assembling (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Insilico-Normalization).

Loading the software stack

These module load lines make the tools available. They are specific to the NYUAD HPC; adapt them to your environment.

module purge
module load all
module load gencore/2
module load trinity/2.15.1
module load salmon/0.14.2
module load numpy/1.26.4
module load bowtie2/2.4.5
module load samtools/1.10
module load hisat2/2.2.1

Running the assembly

The command below runs the Trinity assembly: input reads are FASTQ (--seqType fq), memory capped at 100 GB, 28 CPUs, output prefixed trinity_assembly, and samples described in the tab-delimited cond.txt.

Trinity \
    --seqType fq \
    --samples_file cond.txt \
    --CPU 28 \
    --output trinity_assembly \
    --max_memory 100G

cond.txt is tab-delimited, one row per replicate, in the column order:

CONDITION  CONDITION_REPLICATE_NUMBER  READ1_FASTQ_FILE_NAME  READ2_FASTQ_FILE_NAME

For this dataset it looks like:

fructose   fructose_rep1   fru_1_1.fastq   fru_1_2.fastq
fructose   fructose_rep2   fru_2_1.fastq   fru_2_2.fastq
glucose    glucose_rep1    glu_1_1.fastq   glu_1_2.fastq
glucose    glucose_rep2    glu_2_1.fastq   glu_2_2.fastq
pyruvate   pyruvate_rep1   pyr_1_1.fastq   pyr_1_2.fastq
pyruvate   pyruvate_rep2   pyr_2_1.fastq   pyr_2_2.fastq

Specifying samples this way pays off later: the same cond.txt is reused for per-sample abundance estimation after the assembly is built.

Submitting the assembly

Submit the script to SLURM:

sbatch assembly.sh

Or run it directly on the command line:

chmod 755 assembly.sh
./assembly.sh

Assessing the assembly quality

Before annotating or quantifying, confirm the assembly is sound. We use three complementary angles: how well the reads map back, length/expression statistics, and gene-content completeness. The QC steps live in qc.sh.

Assess using read mapping

Ideally the reads that produced the assembly should map back to it at a high rate. A low rate can signal insufficient coverage, contamination, poor quality, a complex transcriptome the assembler struggled with, or partially spliced genes pointing to an extraction-protocol issue.

Use a splice-aware aligner — here HISAT2 with SAMtools.

  1. Index the transcriptome:

    hisat2-build -p 28 trinity_assembly.Trinity.fasta trinity_assembly
    
  2. Align the reads:

    hisat2 \
        -x trinity_assembly \
        -p 28 \
        -1 fru_1_1.fastq,fru_2_1.fastq,glu_1_1.fastq,glu_2_1.fastq,pyr_1_1.fastq,pyr_2_1.fastq \
        -2 fru_1_2.fastq,fru_2_2.fastq,glu_1_2.fastq,glu_2_2.fastq,pyr_1_2.fastq,pyr_2_2.fastq \
        -S aln.sam
    
  3. Convert to BAM, coordinate-sort, index, and gather alignment stats:

    samtools view -@ 28 -b -o aln.bam aln.sam
    samtools sort -@ 28 -o aln.sorted.bam aln.bam
    samtools index -@ 28 aln.sorted.bam
    samtools flagstat aln.sorted.bam > stats.txt
    

Look at stats.txt for the overall alignment rate. Aim for >80%; this assembly reaches ~93%, meaning most reads are represented in the assembly.

61353459 reads; of these:
  61353459 (100.00%) were paired; of these:
    10998445 (17.93%) aligned concordantly 0 times
    36304307 (59.17%) aligned concordantly exactly 1 time
    14050707 (22.90%) aligned concordantly >1 times
    ----
    10998445 pairs aligned concordantly 0 times; of these:
      1097373 (9.98%) aligned discordantly 1 time
    ----
    9901072 pairs aligned 0 times concordantly or discordantly; of these:
      19802144 mates make up the pairs; of these:
        8727873 (44.08%) aligned 0 times
        7094790 (35.83%) aligned exactly 1 time
        3979481 (20.10%) aligned >1 times
92.89% overall alignment rate

Dedicated alignment-QC tools (e.g. Qualimap) give a more detailed breakdown, but this is enough for an initial check.

Contig Nx and ExN50 statistics

Trinity ships helper scripts for length statistics. N50 is the contig length at which 50% of the total assembly length is contained in contigs of that length or longer — a contiguity indicator, where higher may mean a better assembly. It says nothing about correctness, so never use it alone.

TrinityStats.pl trinity_assembly.Trinity.fasta > assembly_stats.txt
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':       3040
Total trinity transcripts:   3794
Percent GC: 34.52

########################################
Stats based on ALL transcript contigs:
########################################

     Contig N10: 14232
     Contig N20: 10849
     Contig N30: 8233
     Contig N40: 5784
     Contig N50: 4271

     Median contig length: 310
     Average contig: 1122.70
     Total assembled bases: 4259519

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

     Contig N10: 14232
     Contig N20: 10599
     Contig N30: 7935
     Contig N40: 5575
     Contig N50: 4175

     Median contig length: 302
     Average contig: 1097.71
     Total assembled bases: 3337029

We read the gene/transcript counts and total size alongside N50. At ~3K genes we are in the expected ballpark for this organism. The ~4.2 Mb of assembled bases is larger than expected, which may reflect background genomic DNA contamination or captured plasmids.

A more transcriptome-appropriate statistic is the Ex90N50 and Ex90 gene count: N50 computed over only the most highly expressed genes that account for 90% of normalized expression. See the Trinity wiki (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Contig-Nx-and-ExN50-stats).

Assessing completeness with BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses an assembly by searching for conserved single-copy core genes. It works on eukaryotes and prokaryotes; higher scores mean a more complete assembly. Unlike the read- and length-based checks above, BUSCO evaluates assembly content.

Load its environment:

module purge
module load all
module load gencore/2
export PATH="/scratch/gencore/.eb/2.0/software/augustus/3.4.0/bin:$PATH"
export AUGUSTUS_CONFIG_PATH="/scratch/gencore/.eb/2.0/software/augustus/3.4.0/config/"
module load busco/5.2.0

Run BUSCO in transcriptome mode with automatic prokaryote lineage selection:

busco \
    -i trinity_assembly.Trinity.fasta \
    --out BUSCO \
    -m tran \
    --auto-lineage-prok \
    -c 28

Results are written to the BUSCO folder.

Detecting mis-assemblies with QUAST

QUAST (Quality Assessment Tool for Genome Assemblies) adds mis-assembly detection on top of the length statistics. Mis-assemblies are contigs stitched together that should not be — often around repeats or low-complexity regions, or where coverage is too low. They are found by mapping reads back and looking for regions of zero or near-zero coverage: an assembled region with no read support should not exist. QUAST can also wrap ab initio gene finding and BUSCO, so it makes sense to run them together.

QUAST takes a single read pair, not a comma-separated list, so concatenate all read 1s and all read 2s first:

cat *_1.fastq > read1.fastq
cat *_2.fastq > read2.fastq

Load the QUAST environment:

module purge
module load gencore
module load Miniconda3/4.7.10
source activate /scratch/gencore/conda3/envs/quast_env
export PATH="/scratch/gencore/software/genemarks_t:$PATH"

Run QUAST:

quast.py \
    -t 28 \
    -f \
    -b \
    -o QUAST \
    --rna-finding \
    --bam aln.sorted.bam \
    -1 read1.fastq \
    -2 read2.fastq \
    trinity_assembly.Trinity.fasta

Flags:

  • -t 28 — number of CPUs.

  • -f — enable gene finding (GeneMark; the prokaryote default we want).

  • -b — run BUSCO.

  • -o QUAST — output folder.

  • --rna-finding — detect ribosomal RNA genes.

  • --bam aln.sorted.bam — the sorted BAM from the read-mapping step, used for structural-variation detection and coverage histograms.

  • -1 read1.fastq -2 read2.fastq — the concatenated reads, aligned to the assembly to detect mis-assemblies and coverage.

  • trinity_assembly.Trinity.fasta — the assembly.

Output is written to the QUAST folder.

Annotation with Trinotate

With QC passed, annotate the assembly. Broadly there are two approaches:

  1. Ab initio gene finding — predicting genes from sequence alone (AUGUSTUS, GeneMark, Glimmer); this is what BUSCO and QUAST used above.

  2. Homology-based searching — what Trinotate uses, comparing against known databases (SwissProt/UniProt, PFAM, Gene Ontology, EggNOG).

Trinotate does not predict gene locations; it annotates the genes you give it, in FASTA format. Here the assembly already is that gene set, so we supply it directly. The annotation steps are in annotate.sh.

Note

As of March 2024, Trinotate is no longer under active development.

Load the Trinotate environment and make a safety copy of the assembly:

module purge
module load gencore
module load Miniconda3/4.7.10
source activate /scratch/gencore/conda3/envs/Trinotate/
export PATH="/scratch/gencore/software/Trinotate-Trinotate-v4.0.2:$PATH"
export TRINOTATE_DATA_DIR=/scratch/gencore/trinotate_addons/databases/4.0.2

cp trinity_assembly.Trinity.fasta transcripts.fasta

Predict ORFs with TransDecoder

Identify the longest open reading frames in the transcripts:

TransDecoder.LongOrfs -t transcripts.fasta

These ORFs are inferred from sequence alone. To add supporting evidence, search the longest ORFs against SwissProt with BLASTP and against PFAM with HMMER before the final prediction:

blastp \
    -query transcripts.fasta.transdecoder_dir/longest_orfs.pep \
    -db /scratch/gencore/trinotate_addons/databases/4.0.2/uniprot_sprot.pep \
    -max_target_seqs 1 \
    -outfmt 6 \
    -evalue 1e-5 \
    -num_threads 28 > blastp.outfmt6

hmmscan \
    --cpu 28 \
    --domtblout pfam.domtblout \
    /scratch/gencore/trinotate_addons/databases/4.0.2/Pfam-A.hmm \
    transcripts.fasta.transdecoder_dir/longest_orfs.pep

Re-run prediction, retaining ORFs with homology hits:

TransDecoder.Predict \
    -t transcripts.fasta \
    --retain_pfam_hits pfam.domtblout \
    --retain_blastp_hits blastp.outfmt6

This produces transcripts.fasta.transdecoder.{bed,cds,gff3,pep}.

Initialize and load the Trinotate database

The data folder already includes a built myTrinotate.sqlite. You only need to create it once; the command below is shown for reference (it downloads and builds PFAM, UniProt/SwissProt, EggNOG, etc.) and can be reused across projects.

Trinotate --create \
    --db myTrinotate.sqlite \
    --trinotate_data_dir /scratch/gencore/trinotate_addons/databases/4.0.2 \
    --use_diamond

Import the transcripts and their predicted protein translations:

Trinotate \
    --db myTrinotate.sqlite \
    --init \
    --gene_trans_map trinity_assembly.Trinity.fasta.gene_trans_map \
    --transcript_fasta transcripts.fasta \
    --transdecoder_pep transcripts.fasta.transdecoder.pep

Run Trinotate

Trinotate --db myTrinotate.sqlite --CPU 28 \
    --transcript_fasta transcripts.fasta \
    --transdecoder_pep transcripts.fasta.transdecoder.pep \
    --trinotate_data_dir /scratch/gencore/trinotate_addons/databases/4.0.2/ \
    --run "swissprot_blastp swissprot_blastx pfam tmhmmv2 EggnogMapper" \
    --use_diamond

Flags:

  • --CPU 28 — use 28 CPUs.

  • --transcript_fasta transcripts.fasta — the assembly as input.

  • --transdecoder_pep ... — the TransDecoder ORFs.

  • --trinotate_data_dir ... — the database directory used at init.

  • --run "..." — the analysis modules to run.

  • --use_diamond — use DIAMOND instead of NCBI BLAST, which is much faster.

Results load into the SQLite database automatically.

SignalP for Gram-positive bacteria

SignalP6 predicts signal peptides and cleavage sites. Trinotate runs it in eukaryotic mode by default, which is wrong for our Gram-positive bacterium. Run it independently in gram+ mode, then load the results:

/scratch/gencore/trinotate_addons/signalp-4.1/signalp \
    -f short \
    -t gram+ \
    -n signalp.out \
    transcripts.fasta.transdecoder.pep

Trinotate --db myTrinotate.sqlite --LOAD_signalp signalp.out

Generate the annotation report

Finally, export a tab-delimited report of all annotations.

Warning

In the Trinotate version used here, report generation produces an empty report (this may be specific to our installation). The workaround is to generate the report with an earlier Trinotate version. Switch environments, then run the report command:

conda deactivate
module purge
module load gencore/1
module load gencore_rnaseq/1.0
module load gencore_annotation
module load gencore_dev
module load gencore_trinity/1.0

Trinotate myTrinotate.sqlite report > trinotate_annotation_report.xls
import_transcript_names.pl myTrinotate.sqlite trinotate_annotation_report.xls

This is exactly why a conda-managed stack is valuable: switching versions is a module swap, not a reinstall.

Open trinotate_annotation_report.xls in a spreadsheet to inspect the annotations.

Expression quantification

The assembly is built, assessed, and annotated. The final stage quantifies expression per sample and compares conditions — essentially a reference-based RNA-Seq analysis using our own assembly as the reference. Trinity ships helper scripts that integrate cleanly with the outputs so far, so we stay within Trinity; quantify.sh holds these steps.

Trinity supports several differential-expression packages:

Warning

This dataset has only two replicates per condition, not three. With duplicates you can compute fold changes but cannot perform robust statistical testing of differential expression. Treat the DGE output here as illustrative.

Switch back to the Trinity stack (now also loading RSEM):

module purge
module load all
module load gencore/2
module load trinity/2.15.1
module load salmon/0.14.2
module load numpy/1.26.4
module load bowtie2/2.4.5
module load samtools/1.10
module load hisat2/2.2.1
module load rsem/1.3.3

Estimate abundance per sample

A counts matrix records how much of each transcript each sample expresses. There are alignment-based methods (RSEM, HISAT2, Bowtie2) and alignment-free, k-mer methods (Kallisto, Salmon); a benchmark is here (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4869-5). We use RSEM with Bowtie2 via Trinity’s wrapper:

align_and_estimate_abundance.pl \
    --transcripts trinity_assembly.Trinity.fasta \
    --seqType fq \
    --samples_file cond.txt \
    --est_method RSEM \
    --thread_count 28 \
    --gene_trans_map trinity_assembly.Trinity.fasta.gene_trans_map \
    --prep_reference \
    --aln_method bowtie2

--prep_reference indexes the assembly with Bowtie2 first; skip it if the assembly is already indexed. Output is per sample — one folder per replicate. For example:

ls -1 fructose_rep1/
bowtie2.bam
bowtie2.bam.for_rsem.bam
bowtie2.bam.ok
RSEM.genes.results
RSEM.isoforms.results
RSEM.isoforms.results.ok
RSEM.stat

The two files of interest are RSEM.genes.results and RSEM.isoforms.results — quantification at gene and transcript level.

Build expression matrices

Combine the per-replicate results into experiment-wide matrices with abundance_estimates_to_matrix.pl:

abundance_estimates_to_matrix.pl \
    --est_method RSEM \
    --gene_trans_map trinity_assembly.Trinity.fasta.gene_trans_map \
    --name_sample_by_basedir \
    --out_prefix RSEM.out \
    fructose_rep1/RSEM.isoforms.results \
    fructose_rep2/RSEM.isoforms.results \
    glucose_rep1/RSEM.isoforms.results \
    glucose_rep2/RSEM.isoforms.results \
    pyruvate_rep1/RSEM.isoforms.results \
    pyruvate_rep2/RSEM.isoforms.results

Note

Supply the isoform results, not the gene results — this step computes both gene- and isoform-level matrices in one pass, and supplying the gene files instead raises an error.

This produces:

RSEM.out.gene.counts.matrix
RSEM.out.gene.TMM.EXPR.matrix
RSEM.out.gene.TPM.not_cross_norm
RSEM.out.gene.TPM.not_cross_norm.runTMM.R
RSEM.out.gene.TPM.not_cross_norm.TMM_info.txt
RSEM.out.isoform.counts.matrix
RSEM.out.isoform.TMM.EXPR.matrix
RSEM.out.isoform.TPM.not_cross_norm
RSEM.out.isoform.TPM.not_cross_norm.runTMM.R
RSEM.out.isoform.TPM.not_cross_norm.TMM_info.txt

Differential expression with DESeq2

Run DGE with Trinity’s run_DE_analysis.pl, using DESeq2:

module load bioconductor-deseq2/1.32.0

run_DE_analysis.pl \
    --matrix RSEM.out.gene.counts.matrix \
    --method DESeq2 \
    --samples_file dge.txt \
    --output deseq2 \
    --contrasts comp.txt

Results are written to the deseq2 folder.

Note

As an alternative to the command line, you can use the NYUAD NASQAR2 instance (http://nasqar2.abudhabi.nyu.edu/) and its DESeq2 app (http://nasqar2.abudhabi.nyu.edu/deseq2shiny/). For more downstream options within Trinity (e.g. extracting and clustering differentially expressed transcripts), see https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression.

Visualization

With the results generated, you can bring everything together visually:

This concludes the workshop. For questions or comments, raise an issue on the workshop repository.