Individual Commands

Assembly

Below are the steps, and commands that we will use during the assembly stage. For Abyss, we need to specify the kmer size, and although there are automated ways of achieving this, for the purpose of this tutorial, we suggest you pick 1 kmer sizes

AbySS assemblies (e.g. k45 & k63)

K=45

1.

And the complete SLURM (abyssk45_SLURM.sh) script looks something like this,

K=63

And the complete SLURM (abyssk63_SLURM.sh) script looks something like this,

Parameters

-j Number of threads (CPUs) to use.

k= Size of kmer to assemble.

lib= Name(s) of sequenced libraries. In this case we only have 1 paired end library, in case we had additional libraries (say 1 more mate pair library) we will name it here e.g. lib=’pe1 mp1′

pe1= Full path to the reads specified in the libraries above (in the lib= part), read1 first followed by read2’s (space separated).

se= Path to single end (unmated) reads separated by spaces in case of multiple file.

–directory Path to output directory.

SPADES

Spades uses an ‘auto’ kmer selection mode (although kmer sizes can be specified as well). Below is the command for assembling using Spades.

Spades kmer=auto

And the complete SLURM script for SPADES

Parameters

-o Path to output directory

–careful Tries to reduce the number of mis-matches and short indels (Insertions/deletions).

-1 Path to forward reads (Read1’s) file.

-2 Path to reverse reads (Read2’s) file.

-s Path to single end files (unmated).

-t Number of threads to use (CPUs).

-m Upper memory limit (in Gb).

Note that setting the memory in the SLURM part of the script should match what we tell SPADES using the -m parameter. The same applies for the number of CPUs -t argument.

Assembly improvement – GapCloser

Next, we will use the reads and attempt to close “fill in” some of the resulting gaps. This step takes into account the length of the reads and their insert sizes (in case of paired reads). Like always, there are multiple tools that have been developed to accomplish this task, and although some will perform better (in certain circumstances) than others, here we will be focusing on SOAPdenovo’s GapCloser.

First, we need to prepare a configuration file with some basic information about our assemblies and sequenced library.

Let’s call it gapclose.config and this is what it looks like.

In this config file, we specify the average insert size of the paired end library (don’t worry, the algorithm has leniency in the estimation), whether the sequences should be reversed (0 for our paired end reads i.e. no), and finally the path to our read 1’s and read 2’s.

We can then run the command like so,

For the SPADES assembly,

For the AbySS k45 assembly,

For the AbySS k63 assembly,

Parameters

-a Assembly filename (input).

-b Configuration filename.

-o Path and name of gap_closed output file.

-l Maximum read length.

-t Number of threads (CPUs) to use.

At the end of these commands, you will get a brief report mentioning how many gaps were closed.

Again, these commands will need to be embedded within SLURM scripts similar to the assembly examples above.

Assembly Evaluation

Now that we have our assemblies, we need to evaluate and assess the 3 different assemblies that we produced. For this we will use Quast. Quast can take as input multiple fasta files of scaffolds (or contigs), in addition if you have a reference (which we don’t) and a GFF annotation file of that reference (this could be a closely related species), then you can provide that as well.

Quast will look for length distributions of scaffolds/contigs, gaps, alignments and mis-assemblies.

The command is as follows,

Parameters

-o Output folder path

-t Number of threads to use.

-s Assemblies that you want to assess, here we can supply multiple assemblies (in our case 3) separated by spaces.

-l Labels to use foreach assembly (the order must match the order which was used in the -s argument above).

-f Enable gene prediction mode.

-1 Path to forward reads (Read1’s).

-2 Path to reverse reads (Read2’s).

Once Quast completes, all the output will be available in the “QUAST” older. The output contains plain text, PDF, and HTML reports. Now let’s see which assembly we pick!