Pre-Processing

Raw data (typically FASTQ files) are not immediately usable for variant discovery analysis. The first phase of the workflow includes the pre-processing steps that are necessary to get your data from raw FASTQ files to an analysis-ready BAM file.

Overview:

  • Align reads to reference
  • Sort sam file (output from alignment) and convert to bam
  • Alignment Metrics
  • Mark duplicates
  • Prepare reference dictionary, fasta index, and bam index
  • Recalibrate base quality scores

Setting up your environment on Prince

For users working on the Prince HPC in NYC:

1) Enter an interactive Slurm session (a generic example follows this list)

2) Load modules

3) Get sample dataset
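A generic interactive session request on a Slurm cluster looks something like the following (the core, memory, and time values are placeholders to adjust for your allocation; module names and the dataset location are site-specific and not shown here):

srun -c 4 --mem=8GB -t 2:00:00 --pty /bin/bash

The same pattern applies on Dalma below.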

Setting up your environment on Dalma

For users working on the Dalma HPC at NYUAD:

1) Enter an interactive Slurm session

2) Load gencore and variant detection modules

3) Get sample dataset

1) Alignment

We will use the BWA MEM algorithm to align input reads to your reference genome. We use BWA MEM because it is recommended in the Broad's best practices and because it has been found to produce better results for variant calling. Note that BWA MEM is recommended for longer reads, i.e. 75 bp and up.

Alternative aligners such as Bowtie2 may be used.

Note: Aligners typically require an indexed reference sequence as input.

If required, index files can be built from a reference sequence (in FASTA format) using the following command:

bwa index <reference.fasta>

Using the reference sequence in the sample dataset, we can build the index files using the following command:

bwa index GCF_000001405.33_GRCh38.p7_chr20_genomic.fna

If executed correctly, you should see the following output:

Let’s take a look at the output using ls -l

We can see 5 new files (.amb, .ann, .bwt, .pac, and .sa), all having the same basename as the original reference sequence file. These are the index files required by BWA.

Note: If the reference is greater than 2GB, you need to specify a different algorithm when building the BWA index, as follows:

bwa index -a bwtsw <reference.fasta>

Once we have the reference index, we can proceed to the alignment step. We run BWA as follows:
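The general shape of the command, assembled from the options explained below, is:

bwa mem -M -R '<readgroup_info>' <ref> <reads_1.fastq> <reads_2.fastq> > <output.sam>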

Command explained:

bwa mem Invoke the bwa mem algorithm

-M This flag tells bwa to mark shorter split hits as secondary, which is required for compatibility with Picard and GATK downstream

-R <readgroup_info> Provide the readgroup as a string. The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag.

<ref> The name of your reference sequence. Note that all index files must be present in the same directory and have the same basename as the reference sequence

<reads_1.fastq>, <reads_2.fastq> Your input reads. In this case, mates of a paired end library

<output.sam> The name of your output file

Put it all together:
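A concrete version using the sample reference and the output file name expected below (the read group fields and the FASTQ file names here are illustrative placeholders; substitute your own):

bwa mem -M -R '@RG\tID:sample_1\tLB:lib_1\tPL:ILLUMINA\tSM:sample_1' GCF_000001405.33_GRCh38.p7_chr20_genomic.fna <reads_1.fastq> <reads_2.fastq> > aligned_reads.sam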

If everything worked, you should have a new aligned_reads.sam file.

2) Sort sam and convert to bam

The algorithms used in downstream steps require the data to be sorted by coordinate and in bam format in order to be processed. We use Picard Tools and issue a single command to both sort the sam file produced in step 1 and output the resulting sorted data in bam format:

java -jar $PICARD_JAR SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

If this executed correctly, you should see something like the following:

Let’s take a look at the files before and after this step to see what happened. We will use samtools to view the sam/bam files.

Let’s take a look at the first few lines of the original file. We’ll use the samtools view command to view the sam file, and pipe the output to head -5 to show us only the ‘head’ of the file (in this case, the first 5 lines).

samtools view aligned_reads.sam | head -5

More information about samtools in the manual: http://www.htslib.org/doc/samtools.html

Output:

Let’s compare this initial alignment file to the new sorted file:

samtools view sorted_reads.bam | head -5

Output:

Is the output consistent with what we expect?

3) Alignment Metrics

Let’s compute some statistics to see how well our reads aligned to the reference genome. We’ll use samtools flagstat for this.

samtools flagstat aligned_reads.sam

Output:

Hint: Save these metrics to a text file by redirecting the output to a new file
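For example (the output file name is arbitrary):

samtools flagstat aligned_reads.sam > alignment_metrics.txt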

4) Mark Duplicates

During the sequencing process, the same DNA fragment may be sequenced several times. These duplicate reads are not informative and cannot be counted as additional evidence for or against a putative variant. Duplicates can arise during sample preparation, e.g. during library construction using PCR; without this step, you risk over-representing regions that were preferentially amplified during PCR. Duplicate reads can also result from a single amplification cluster that the optical sensor of the sequencing instrument incorrectly detects as multiple clusters. These duplication artifacts are referred to as optical duplicates.

We use Picard Tools to locate and tag duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.

Note that this step does not remove the duplicate reads, but rather flags them as such in the read’s SAM record. We’ll take a look at how this is done shortly. Downstream GATK tools will ignore reads flagged as duplicates by default.

Note: Duplicate marking should not be applied to amplicon sequencing or other data types where reads start and stop at the same positions by design.
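Using the file names from the earlier steps (sorted_reads.bam in; dedup_reads.bam and metrics.txt out), the Picard invocation looks something like:

java -jar $PICARD_JAR MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt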

If this executed correctly, you should see something like the following:

These stats are broken down in the metrics.txt file:

Let’s take a look at the bam file before and after the Mark Duplicates step to see how reads are flagged as duplicates.

Refresher: The second column in a SAM file is known as the bitwise flag. This flag allows for the storage of lots of information in a highly efficient format. Let’s look at the first read in sorted_reads.bam:
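For example:

samtools view sorted_reads.bam | head -1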

Question: What is the bitwise flag value for this read?

(Answer: 161)

Question: What does this value represent? http://broadinstitute.github.io/picard/explain-flags.html

(Answer: read paired, mate reverse strand, second in pair)

Note: “read is PCR or optical duplicate” is also stored in this flag

Let’s look at this read before and after marking duplicates: HS2000-1010_101:8:2205:14144:55120
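One way to do this is to filter each file on the read name, e.g.:

samtools view sorted_reads.bam | grep HS2000-1010_101:8:2205:14144:55120

samtools view dedup_reads.bam | grep HS2000-1010_101:8:2205:14144:55120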

5) Prepare reference dictionary, fasta index, and bam index

In order to run GATK, we need to build a reference dictionary, fasta index, and a bam index.

We use Picard Tools to build the reference dictionary for GATK:
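Using the sample reference, and naming the output to match the file listing below, the command looks something like:

java -jar $PICARD_JAR CreateSequenceDictionary REFERENCE=GCF_000001405.33_GRCh38.p7_chr20_genomic.fna OUTPUT=GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.dict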

We use samtools to build the fasta index:
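For example:

samtools faidx GCF_000001405.33_GRCh38.p7_chr20_genomic.fna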

We use samtools to build the bam index:
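For example:

samtools index dedup_reads.bam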

We should have 3 new files:

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.dict – GATK reference dictionary

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.fai – FASTA index

dedup_reads.bam.bai – bam index

6) Base Quality Score Recalibration

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we’re not sure we actually read that A correctly, and it could actually be something else. So we won’t trust it as much as other base calls that have higher qualities. In other words, we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site. [https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS]

Refresher: What are quality scores?

  • Per-base estimates of error emitted by the sequencer
  • Expresses the level of confidence for each base called
  • Use standard Phred scores: Q20 is a general cutoff for high quality and represents 99% certainty that a base was called correctly
  • 99% certainty means 1 out of 100 bases is expected to be wrong. Consider a small dataset of 1M reads with a read length of 50, i.e. 50M bases. At 99% confidence, that means roughly 500,000 possible erroneous base calls.
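For reference, the Phred scale defines Q = -10 * log10(P), where P is the probability that the base call is wrong; equivalently P = 10^(-Q/10), so Q20 corresponds to P = 0.01, the 1-in-100 error rate used above.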

The image below shows an example of the average quality score at each position in the read, for all reads in a library (output from FastQC)

The image below shows individual quality scores (blue bars) for each position in a single read. The horizontal blue line represents the Q20 phred score value.

Quality scores emitted by sequencing machines are biased and inaccurate

Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls. [https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS]

How BQSR works

  1. You provide GATK Base Recalibrator with a set of known variants.

  2. GATK Base Recalibrator analyzes all reads looking for mismatches between the read and reference, skipping those positions which are included in the set of known variants (from step 1).

  3. GATK Base Recalibrator computes statistics on the mismatches (identified in step 2) based on the reported quality score, the position in the read, and the sequencing context (e.g. the preceding and current nucleotide).

  4. Based on the statistics computed in step 3, an empirical quality score is assigned to each mismatch, overwriting the original reported quality score.

As an example, before calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a rate of 1 in 100, so they are really Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the ends of the reads than at the beginning. Also, mismatches are strongly associated with sequencing context: the dinucleotide AC is often of much lower quality than TG.

[http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr]

Note that this step requires a ‘truth’ or ‘known’ set of variants. For this example we will be using the gold set from the 1000 Genomes Project, provided in the sample dataset as 1000G_omni2.5.hg38.vcf.gz. An index for the VCF is required as well and is also provided (1000G_omni2.5.hg38.vcf.gz.tbi). If you need to build an index for your VCF file, you can build one easily using the TABIX program, like so:
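tabix -p vcf 1000G_omni2.5.hg38.vcf.gz

(The -p vcf option selects tabix’s VCF preset; the command writes the .tbi index alongside the bgzip-compressed VCF.)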

Step 1: Analyze Covariation
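A sketch assuming a GATK3-style invocation ($GATK_JAR and the recal_data.table output name are assumptions; the reference, BAM, and known-sites files come from the steps above):

java -jar $GATK_JAR -T BaseRecalibrator -R GCF_000001405.33_GRCh38.p7_chr20_genomic.fna -I dedup_reads.bam -knownSites 1000G_omni2.5.hg38.vcf.gz -o recal_data.table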

If executed correctly, you should see something like this:

Step 2: Apply BQSR

This step applies the recalibration computed in Step 1 to the bam file.
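Continuing the GATK3-style sketch from Step 1 (same assumptions about $GATK_JAR and recal_data.table):

java -jar $GATK_JAR -T PrintReads -R GCF_000001405.33_GRCh38.p7_chr20_genomic.fna -I dedup_reads.bam -BQSR recal_data.table -o recal_reads.bam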

If everything worked, you should see something like this:

The output of this step, recal_reads.bam, is our analysis-ready dataset that we will provide to the variant calling tool in the next step of the analysis.

Supplementary material: What to do if you don’t have a set of known variants?

BQSR is an optional but highly recommended step in variant calling analysis. In the event you are working with an organism for which there is no known set of variants available, it is possible to produce a set of known variants for use in this step, although it does require some additional processing steps.

This procedure is known as bootstrapping and entails calling variants without running BQSR, filtering those variants to obtain a high confidence set of variants, and then using these variants as input for the BQSR step. This process can be repeated until convergence.
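A minimal sketch of one bootstrapping round, again assuming GATK3-style tools (the filter expression and output file names are illustrative only):

java -jar $GATK_JAR -T HaplotypeCaller -R GCF_000001405.33_GRCh38.p7_chr20_genomic.fna -I dedup_reads.bam -o bootstrap_raw_variants.vcf

java -jar $GATK_JAR -T VariantFiltration -R GCF_000001405.33_GRCh38.p7_chr20_genomic.fna -V bootstrap_raw_variants.vcf --filterExpression "QD < 2.0" --filterName lowQD -o bootstrap_filtered_variants.vcf

The variants that pass filtering can then be supplied as -knownSites to BaseRecalibrator in place of the 1000 Genomes set, and the cycle repeated until the recalibration results stabilize.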