Alignment

IntroSeqAlign – Presentation

Once data are in a FASTQ format the first step of any NGS analysis is to align the short reads against the reference genome. This module describes how to map short DNA sequence reads, assess the quality of the alignment and prepare to visualize the mapping of the reads.

Required Modules:

Overview:

  • Align reads to reference
  • Sort sam file (output from alignment) and convert to bam
  • Alignment Metrics
  • Mark duplicates
  • Prepare reference dictionary, fasta index, and bam index

1) The Burroughs Wheeler Transform

2) Performing a read alignment using Illumina data

We will use the BWA MEM algorithm to align input reads to your reference genome. We use BWA MEM because it is recommended in the Broads best practices and because it has been found to produce better results for variant calling. Note that BWA MEM is recommended for longer reads, ie. 75bp and up.

Alternative aligners such as Bowtie2 may be used.

Note: Most aligners require an indexed reference sequence as input. Reference index files for the sample data have been prebuilt and are available in:

/scratch/work/cgsb/gencore/data/variant_calling/ref/prebuilt/

If required, index files can be built from a reference sequence (in FASTA format) using the following command:

bwa index <reference.fasta>

Using the reference sequence in the sample dataset, we can build the index files using the following command:

bwa index ./ref/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna

If executed correctly, you should see the following output:

[bwa_index] Pack FASTA... 0.95 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=128888334, availableWord=21068624
[BWTIncConstructFromPacked] 10 iterations done. 34753182 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 64202446 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 90372990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 113629422 characters processed.
[bwt_gen] Finished constructing BWT in 48 iterations.
[bwa_index] 33.57 seconds elapse.
[bwa_index] Update BWT... 0.68 sec
[bwa_index] Pack forward-only FASTA... 0.60 sec
[bwa_index] Construct SA from BWT and Occ... 11.97 sec
[main] Version: 0.7.8-r455
[main] CMD: bwa index GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
[main] Real time: 48.246 sec; CPU: 47.800 sec

Let’s take a look at the output using ls -l

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.amb
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.ann
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.bwt
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.pac
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.sa

We can see 5 new files, all having the same basename as the original reference sequence file. These are the index files required by BWA.

Note: If the reference is greater than 2GB, you need to specify a different algorithm when building the BWA index, as follows:

bwa index -a bwtsw <reference.fasta>

Once we have the reference index, we can proceed to the alignment step. We run BWA as follows:

bwa mem -M -R <ref> <reads_1.fastq> <reads_2.fastq> > <output.sam>

Command explained:

bwa mem Invoke the bwa mem algorithm

-M This flag tells bwa to consider split reads as secondary, required for GATK variant calling

-R <readgroup_info> Provide the readgroup as a string. The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag.

<ref> The name of your reference sequence. Note that all index files must be present in the same directory and have the same basename as the reference sequence

<reads_1.fastq>, <reads_2.fastq> Your input reads. In this case, mates of a paired end library

<output.sam> The name of your outut file

Put it altogether:

bwa mem -M -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' ./ref/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna reads_1.fastq reads_2.fastq > aligned_reads.sam

If everything worked, you should have a new aligned_reads.sam file.

2) Sort sam and convert to bam

The algorithms used in downsteam steps require the data to be sorted by coordinate and in bam format in order to be processed. We use Picard Tools and issue a single command to both sort the sam file produced in step 1 and output the resulting sorted data in bam format:

java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

If this executed correctly, you should see something like the folloing:

[Wed Dec 07 11:38:40 EST 2016] picard.sam.SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Dec 07 11:38:40 EST 2016] Executing as mk5636@compute-0-0.local on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
INFO 2016-12-07 11:39:05 SortSam Finished reading inputs, merging and writing to output now.
[Wed Dec 07 11:39:23 EST 2016] picard.sam.SortSam done. Elapsed time: 0.72 minutes.
Runtime.totalMemory()=1988100096

Let’s take a look at the files before and after this step to see what happened. We will use samtools to view the sam/bam files.

Let’s take a look at the first few lines of the original file. We’ll use the samtools view command to view the sam file, and pipe the output to head -5 to show us only the ‘head’ of the file (in this case, the first 5 lines).

samtools view aligned_reads.sam | head -5

Output:

HS2000-940_146:5:1101:1161:63226        73      NC_000020.11    23775298        60      78M22S  =       23775298        0       CTGNTAGCCCTGCTGAATCTCCCTCCTGACCCAACTCCCTCNTNNNNNNNGCTGGGTGACTGCTGNCNNCACNGGCTGTGNNNNNNNNNNNNNCAGCTGG   ?@@#4ADDDFDFFHIGGFCFHCHFGIHGCGHEHHEHD3?BH#0#######--5CEECG=?AEEHE###################################   NM:i:13 MD:Z:3G37C1C0T0A0C0T0C0T15T1C0T3T5      AS:i:52 XS:i:0  RG:Z:sample_1
HS2000-940_146:5:1101:1161:63226        133     NC_000020.11    23775298        0       *       =       23775298        0       NNCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNCNAAAGGAGCCTGGGT   ####################################################################################################   AS:i:0  XS:i:0  RG:Z:sample_1
HS2000-940_146:5:1101:1262:12434        99      NC_000020.11    23843774        60      100M    =       23843977        258     ATCAATGGTGTTTCTTTGCCAAGCTTCCTTAGTCGCCTTTAATCGGGAAAAGGTCTTCATTCTTTCTTGTCTTTGTTACCCTGTCATTTTTGAAGATAAC   ?@@BDDFFFFHHHGHIIIHJEGHIIIGIJICHIGIIGIJIDHHGJJJ:;8EFH=CFHGGHIIIJJHHGBEHFFFFEEDCCCCEDCCADDEDD(5>5>@5@   NM:i:0  MD:Z:100        AS:i:100        XS:i:0  RG:Z:sample_1
HS2000-940_146:5:1101:1262:12434        147     NC_000020.11    23843977        60      55M45S  =       23843774        -258    GTACATCATTCTGGGAGGCCAGGATACCATTGTCCAATTGCNNNNGGATGTTAATNNNNNNNNNNNNNNNNNNNNANNNNCTTCCNNNNNNNNCCACTCT   #CDDCCC@38C<DDDDBBCDC;5ADDDCCACAADCC?=5,,####>CGHDC;;--####################1####FC<24########FDDD=B=   NM:i:4  MD:Z:41T0G0A0T10        AS:i:47 XS:i:28 RG:Z:sample_1
HS2000-940_146:5:1101:1295:90112        83      NC_000020.11    23920564        60      100M    =       23920284        -380    GCAAGTGAATGCTCTTTCCCACAGCAAAGGATTAACTGATTTCTGCTACTTGTGGCTCAGAGGCCAGGGACACTTGACCTGTCCTAGGAAGGCTGTCACC   #@DDDD@><DDDEEAC?FFFHHEC=HECEAIGGE@IIGGHB;HAFD??F?IIJIEGGCIIJJJIHCIIIIIHIIHECIJIIHIJJJIFGHHFFDFFF@C@   NM:i:0  MD:Z:100        AS:i:100        XS:i:0  RG:Z:sample_1

Let’s compare this initial alignment file to the new sorted file:

samtools view sorted_reads.bam | head -5

Output:

HS2000-940_146:5:2109:14063:29918       161     NC_000020.11    64145   4       54S46M  =       23724989        23660944        TTCCAATCCATTCCATTCCATCACACTGCATTCCATTCCATTCCAATCCCCTCAACTCCACTCCACTCCACTCCATTCCACTCCAATCAATTCCATTGCA   @CCFFFFFHGDHHJIJJJJJIJIFHHGCGIHHIJJJGHIHIIIJJIHIGGGHIJJE:FFHIGIDHJGIGGIJJ@;CDHGGEIHHHEHF;CCB>;;3;>;>   MD:Z:6G27C8T2   RG:Z:sample_1   NM:i:3  AS:i:33 XS:i:31
HS2000-940_146:5:2110:1521:37886        163     NC_000020.11    1217420 0       55S21M24S       =       1217591 271     GACAGTTCTGAAGAGAGCAGGGGTTCTTCCAGCATTGCATTTGAGCTCCGAAAATGGACAGACTGCCTCCTCAAGTCGGTCCTTGACCTCCGTGCACCCT   ?7:DDD:B,+ADD43C?BF++<2<):**11*1:;C*0?0B?F>GCDBF30'-'-8@8..@1@E@;37@)?76@###########################   MD:Z:21 RG:Z:sample_1   NM:i:0  AS:i:21 XS:i:34
HS2000-940_146:5:2110:1521:37886        83      NC_000020.11    1217591 0       100M    =       1217420 -271    ATATCCAGACAAACAGGGTCTGGAGTAGACCTCCAGCAAATTCCAACAGACCTGCAGCTGAGGGTCCTGACTGTTAGAAGGAAAACTAACAAACAGAAAG   CA@>9CA;5?@>6;;..1;77;3=77)75CCF=)>HG@>BB3GEIGFB??BHDIIIGBHGFC:13@ACF9CAA,@F?EDA4IGGGAGHHHGHEFFFD?@@   MD:Z:3C4G17G13C59       RG:Z:sample_1   NM:i:4  AS:i:81 XS:i:81
HS2000-940_146:5:1102:10582:53061       113     NC_000020.11    1544082 22      100M    =       23852837        22308739        TTCTGTTGATTTGGGGTGGAGAGTTCTGTAGAGGTCTGTTAGGTCTGCTTGGTCCAGAGCTGAGCTCAAATCCTGAATATCCTTGTTAATTTTCTGTCTC   ###CCA:(>(;?:8;A>A>D@>D@=3=>7=@72@@AED@7BFCF?EIIF?0;IGAFD9HFD9D9@>E?9+<HCIGHEA?C3DDFDGBFD?C?DDAAD?;;   MD:Z:32T4A26T4G30       RG:Z:sample_1   NM:i:4  AS:i:80 XS:i:70
HS2000-940_146:5:1112:8371:47601        99      NC_000020.11    2502086 10      100M    =       2502334 348     AACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGACCTGATGGAGCTGAAAACCAAGGCACAAGAACTACGTGATGAATACACAAGCCTCAGTAGCCGA   CCCFFFFFHHHHHIJIIJJJJGJJIHHIIJJJJIJGIJJIGGHGHHGGIIJJJJJJJGHGGGIIJIIGHHHGEDFEEFEEEEEEDDDBDDDCC>CCDDBB   MD:Z:33A22T5G12C4G19    RG:Z:sample_1   NM:i:5  AS:i:75 XS:i:79

Is the output consistent with what we expect?

3) Alignment Metrics

Let’s compute some statistics to see how well our reads aligned to the reference genome. We’ll use samtools flagstat for this.

samtools flagstat aligned_reads.sam

Output:

194492 + 0 in total (QC-passed reads + QC-failed reads)
80 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
193804 + 0 mapped (99.65% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
190812 + 0 properly paired (98.15% : N/A)
193108 + 0 with itself and mate mapped
616 + 0 singletons (0.32% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Hint: Save these metrics to a text file by piping the output to a new file

samtools flagstat aligned_reads.sam > alignment_metrics.txt

4) Mark Duplicates

During the sequencing process, the same DNA fragments may be sequenced several times. These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant. For example, duplicates can arise during sample preparation e.g. library construction using PCR. Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

We use Picard Tools to locate and tag duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.

Note that this step does not remove the duplicate reads, but rather flags them as such in the read’s SAM record. We’ll take a look at how this is done shortly. Downstream GATK tools will ignore reads flagged as duplicates by default.

Note: Duplicate marking should not be applied to amplicon sequencing or other data types where reads start and stop at the same positions by design.

java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt

If this executed correctly, you should see something like the following:

[Mon Dec 19 17:29:19 EST 2016] Executing as mk5636@compute-13-30.local on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
INFO    2016-12-19 17:29:19     MarkDuplicates  Start of doWork freeMemory: 797279896; totalMemory: 798490624; maxMemory: 11276386304
INFO    2016-12-19 17:29:19     MarkDuplicates  Reading input file and constructing read end information.
INFO    2016-12-19 17:29:19     MarkDuplicates  Will retain up to 43370716 data points before spilling to disk.
INFO    2016-12-19 17:29:22     MarkDuplicates  Read 194420 records. 0 pairs never matched.
INFO    2016-12-19 17:29:22     MarkDuplicates  After buildSortedReadEndLists freeMemory: 540268072; totalMemory: 909639680; maxMemory: 11276386304
INFO    2016-12-19 17:29:22     MarkDuplicates  Will retain up to 352387072 duplicate indices before spilling to disk.
INFO    2016-12-19 17:29:23     MarkDuplicates  Traversing read pair information and detecting duplicates.
INFO    2016-12-19 17:29:23     MarkDuplicates  Traversing fragment information and detecting duplicates.
INFO    2016-12-19 17:29:23     MarkDuplicates  Sorting list of duplicate records.
INFO    2016-12-19 17:29:24     MarkDuplicates  After generateDuplicateIndexes freeMemory: 906398800; totalMemory: 3729260544; maxMemory: 11276386304
INFO    2016-12-19 17:29:24     MarkDuplicates  Marking 15269 records as duplicates.
INFO    2016-12-19 17:29:24     MarkDuplicates  Found 31 optical duplicate clusters.
INFO    2016-12-19 17:29:27     MarkDuplicates  Before output close freeMemory: 3757499808; totalMemory: 3762290688; maxMemory: 11276386304
INFO    2016-12-19 17:29:27     MarkDuplicates  After output close freeMemory: 3753828760; totalMemory: 3758620672; maxMemory: 11276386304
[Mon Dec 19 17:29:27 EST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=3758620672

These stats are broken down in the metrics.txt file:

LIBRARY     UNPAIRED_READS_EXAMINED        READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIREDD_READ_DUPLICATES    READ_PAIR_DUPLICATE    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
sample_1    616                            96554                   688                165                         7552                    31                              0.078818             586767

Let’s take a look at the bam file before and after the Mark Duplicates step to see how reads are flagged as duplicates.

Refresher: The second column in a SAM file is known as the bitwise flag. This flag allows for the storage of lots of information in a highly efficient format. Let’s look at the first read in sorted_reads.bam:

samtools view sorted_reads.bam | head -1
HS2000-940_146:5:2109:14063:29918       161     NC_000020.11    64145   4       54S46M  =       23724989        23660944        TTCCAATCCATTCCATTCCATCACACTGCATTCCATTCCATTCCAATCCCCTCAACTCCACTCCACTCCACTCCATTCCACTCCAATCAATTCCATTGCA   @CCFFFFFHGDHHJIJJJJJIJIFHHGCGIHHIJJJGHIHIIIJJIHIGGGHIJJE:FFHIGIDHJGIGGIJJ@;CDHGGEIHHHEHF;CCB>;;3;>;>   MD:Z:6G27C8T2   RG:Z:sample_1   NM:i:3  AS:i:33 XS:i:31

Question: What is the bitwise flag value for this read?

(Answer: 161)

Question: What does this value represent? http://broadinstitute.github.io/picard/explain-flags.html

(Answer: read paired, mate reverse strand, second in pair)

Note: “read is PCR or optical duplicate” is also stored in this flag

Let’s look at the read on line 58, before and after marking duplicates:

samtools view sorted_reads.bam | head -58 | tail -1
HS2000-1010_101:7:2306:21017:95177      81      NC_000020.11    8563852 0       48S48M4S        =       23923303        15359405      CTACAATGAGAATCACAAAACACTGTTCAAATAAATCAAAGAAGATACAAACAAATGGGAAAACATCCCATGCTCATGGATAGGAGGAATCAATATCATT     #CDCDEFFFDDDBEA>CA;=A@HD=7;GGEFC@D@IHHCGFB>IGCHHG>GIGJHHEGEJHGIJJJJJIEGCJIJJJJIGGEIIJJJHGHHHDFFFFCCC   MD:Z:10A26A10   RG:Z:sample_1   NM:i:2  AS:i:38 XS:i:37
samtools view dedup_reads.bam | head -58 | tail -1
HS2000-1010_101:7:2306:21017:95177      1105    NC_000020.11    8563852 0       48S48M4S        =       23923303        15359405      CTACAATGAGAATCACAAAACACTGTTCAAATAAATCAAAGAAGATACAAACAAATGGGAAAACATCCCATGCTCATGGATAGGAGGAATCAATATCATT     #CDCDEFFFDDDBEA>CA;=A@HD=7;GGEFC@D@IHHCGFB>IGCHHG>GIGJHHEGEJHGIJJJJJIEGCJIJJJJIGGEIIJJJHGHHHDFFFFCCC   MD:Z:10A26A10   PG:Z:MarkDuplicates     RG:Z:sample_1   NM:i:2AS:i:38  XS:i:37

5) Prepare indexed bam file

We use samtools to build the bam index:

samtools index dedup_reads.bam

We should have 3 new files:

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.dict – GATK reference dictionary

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.fai – fasta Index

dedup_reads.bam.bai – bam index