Pre-processing and QC

 

Quality trimming using trimmomatic

It is always a good idea to perform quality assessment of your raw fastq files before assembling your reads. In addition, there are other tools that perform read error correction. All these steps can result in a noticeable improvement in the resulting assembly, as well as an improvement in computational run time and resource requirements. The data that we are assembling today have already been quality trimmed and filtered. If however you are looking to perform that yourself (using trimmomatic for instance), then the command below is a typical example.

(For more information on trimmomatic quality filtering, please refer to the trimmomatic homepage).

If you are working on the NYU Abu Dhabi HPC (dalma), we suggest using the predefined QC/QT workflow in BioSAILs, which is available at,

For paired-end and single-end trimming respectively.

And then you can run the workflow as follows,

  1. cd into the directory containing your samples.
  2. copy one of the above workflows to your samples directory.
  3. Load the following modules (gencore AND gencore_dev).
  4. Run the following 2 commands

The same idea applies to any other workflow, and will work straight out of the box if your samples are found in their own individual directories, and are prefixed with “Sample_”.

For more on the workflows and BioSAILs, go to the BioSAILs homepage.

FastQC

FastQC allows you to inspect various quality metrics that can inform your quality trimming decisions. We recommend running FastQC before and after you perform your quality trimming. This will allow you to assess and inspect the effects of quality trimming, as well as how much of your raw data (reads) where lost during the quality trimming stage.

Below is an example of FastQC on the read1 file.

Working with multiple samples

In cases where you have QC’ed multiple samples, it can become very difficult to visualize all the QC reports separately. Imagine you have sequenced 6 samples (paired end reads), perform FastQC, followed by Trimmomatic, and then FastQC again.

This means,

  • Raw FastQC: 6 x 2 = 12
  • Trimmed FastQC: 6 x 2 = 12
  • If QC’ing mateless reads (i.e. where only R1’s or R2’s survived): 6 x 2 = 12
  • At least 24 QC reports, and possibly 36!

It then makes more sense to aggregate the reports and visualize them together, how?

Using MultiQC

All you have to do it point multiqc to the top level directory, and it will automatically scan the directory structure (or tree) and consolidate all the reports into one interactive HTML report.

Suppose you have a directory in your scratch folder /scratch/$USER/my_project/, which is where you have performed your analysis. Then all you have to do,

MultiQC can detect and parse a lot of different kinds of reports (e.g. PICARD output, SnpEff, Samtools, FastQC among others), for more information, please refer to the software homepage above.

Again, all these steps are automated in QC-QT BioSAILs workflow.