Alignment

IntroSeqAlign2018 – Presentation

Once data are in a FASTQ format the first step of any NGS analysis is to align the short reads against the reference genome. This module describes how to map short DNA sequence reads, assess the quality of the alignment and prepare to visualize the mapping of the reads.

Required Modules:

Overview:

  • Align reads to reference
  • Sort sam file (output from alignment) and convert to bam
  • Alignment Metrics
  • Mark duplicates
  • Prepare reference dictionary, fasta index, and bam index

1) The Burroughs Wheeler Transform

2) Performing a read alignment using Illumina data

We will use the BWA MEM algorithm to align input reads to your reference genome. We use BWA MEM because it is recommended in the Broads best practices and because it has been found to produce better results for variant calling. Note that BWA MEM is recommended for longer reads, ie. 75bp and up.

Alternative aligners such as Bowtie2 may be used.

Note: Most aligners require an indexed reference sequence as input. Reference index files for the sample data have been prebuilt and are available in:

/scratch/work/cgsb/gencore/data/variant_calling/ref/prebuilt/

If required, index files can be built from a reference sequence (in FASTA format) using the following command:

bwa index <reference.fasta>

Using the reference sequence in the sample dataset, we can build the index files using the following command:

bwa index ./ref/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna

If executed correctly, you should see the following output:

Let’s take a look at the output using ls -l

We can see 5 new files, all having the same basename as the original reference sequence file. These are the index files required by BWA.

Note: If the reference is greater than 2GB, you need to specify a different algorithm when building the BWA index, as follows:

bwa index -a bwtsw <reference.fasta>

Once we have the reference index, we can proceed to the alignment step. We run BWA as follows:

Command explained:

bwa mem Invoke the bwa mem algorithm

-M This flag tells bwa to consider split reads as secondary, required for GATK variant calling

-R <readgroup_info> Provide the readgroup as a string. The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag.

<ref> The name of your reference sequence. Note that all index files must be present in the same directory and have the same basename as the reference sequence

<reads_1.fastq>, <reads_2.fastq> Your input reads. In this case, mates of a paired end library

<output.sam> The name of your outut file

Put it altogether:

If everything worked, you should have a new aligned_reads.sam file.

2) Sort sam and convert to bam

The algorithms used in downsteam steps require the data to be sorted by coordinate and in bam format in order to be processed. We use Picard Tools and issue a single command to both sort the sam file produced in step 1 and output the resulting sorted data in bam format:

java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

If this executed correctly, you should see something like the folloing:

Let’s take a look at the files before and after this step to see what happened. We will use samtools to view the sam/bam files.

Let’s take a look at the first few lines of the original file. We’ll use the samtools view command to view the sam file, and pipe the output to head -5 to show us only the ‘head’ of the file (in this case, the first 5 lines).

samtools view aligned_reads.sam | head -5

Output:

Let’s compare this initial alignment file to the new sorted file:

samtools view sorted_reads.bam | head -5

Output:

Is the output consistent with what we expect?

3) Alignment Metrics

Let’s compute some statistics to see how well our reads aligned to the reference genome. We’ll use samtools flagstat for this.

Output:

Hint: Save these metrics to a text file by piping the output to a new file

4) Mark Duplicates

During the sequencing process, the same DNA fragments may be sequenced several times. These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant. For example, duplicates can arise during sample preparation e.g. library construction using PCR. Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

We use Picard Tools to locate and tag duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.

Note that this step does not remove the duplicate reads, but rather flags them as such in the read’s SAM record. We’ll take a look at how this is done shortly. Downstream GATK tools will ignore reads flagged as duplicates by default.

Note: Duplicate marking should not be applied to amplicon sequencing or other data types where reads start and stop at the same positions by design.

If this executed correctly, you should see something like the following:

These stats are broken down in the metrics.txt file:

Let’s take a look at the bam file before and after the Mark Duplicates step to see how reads are flagged as duplicates.

Refresher: The second column in a SAM file is known as the bitwise flag. This flag allows for the storage of lots of information in a highly efficient format. Let’s look at the first read in sorted_reads.bam:

Question: What is the bitwise flag value for this read?

(Answer: 161)

Question: What does this value represent? http://broadinstitute.github.io/picard/explain-flags.html

(Answer: read paired, mate reverse strand, second in pair)

Note: “read is PCR or optical duplicate” is also stored in this flag

Let’s look at the read on line 58, before and after marking duplicates:

5) Prepare indexed bam file

We use samtools to build the bam index:

We should have 3 new files:

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.dict – GATK reference dictionary

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.fai – fasta Index

dedup_reads.bam.bai – bam index