Variant Discovery

Once you have a pre-processed, analysis-ready bam file, you can begin the variant discovery process.

Overview

  • Call variants

  • Filter variants

  • Annotation

  • Visualization

1) Call Variants

We use the GATK HaplotypeCaller to perform variant calling. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This step is designed to maximize sensitivity in order to minimize false negatives, i.e. failing to identify real variants.

If everything worked you should see something like the following:

And have two new output files:

raw_variants.vcf – all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

raw_variants.vcf.idx – index file

Extract SNPs and Indels

This step separates SNPs and Indels so they can be processed and analyzed independently.

Prince

Dalma

You should see something like this for each command:

And have four new output files in total after running both commands:

raw_snps.vcf and raw_indels.vcf – VCF files containing only snps and only indels

raw_snps.vcf.idx and raw_indels.vcf.idx– index files

2) Filter Variants

The first step is designed to maximize sensitivity and is thus very lenient in calling variants. This is good because it minimizes the chance of missing real variants, but it means that we need to filter the raw callset produced in order to reduce the amount of false positives, which can be quite large. This is an important step in order to obtain the the highest-quality call set possible.

We apply the recommended hard filters for SNPs and Indels.

Prince

SNPs

You should get two new files: filtered_snps.vcfand filtered_snps.vcf.idx

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however, they will be marked as ‘*_filter’ based on which filter the SNP failed, while SNPs which passed the filter will be marked as ‘PASS’

Indels

You should get two new files: filtered_indels.vcfand filtered_indels.vcf.idx

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however, they will be marked as ‘*_filter’ based on which filter the indel failed, while Indels which passed the filter will be marked as ‘PASS’

Dalma

SNPs

You should get two new files: filtered_snps.vcfand filtered_snps.vcf.idx

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however, they will be marked as ‘*_filter’ based on which filter the SNP failed, while SNPs which passed the filter will be marked as ‘PASS’

Indels

You should get two new files: filtered_indels.vcfand filtered_indels.vcf.idx

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however, they will be marked as ‘*_filter’ based on which filter the indel failed, while Indels which passed the filter will be marked as ‘PASS’

3) Annotation

We will use the program SnpEff. It annotates and predicts the effects of variants on genes (such as amino acid changes).

SnpEff has pre-built databases for thousands of genomes. We will be using the pre-built database GRCh38.p2.RefSeq.

Caveat:

Although we are using the NCBI RefSeq SnpEff Database, there is still an incompatibility in the chromosome names (ie. in our data, chromosome 20 is named NC_000020.11 whereas in the prebuilt SnpEff database, chromosome 20 is named ’20’). To get around this, we will do a little magic to transform ‘NC_000020.11′ to ’20’.

First let’s make a copy of filtered_snps.vcf:

We will use vim to edit our new file:

Use this snippet to replace all instanced of ‘NC_000020.11′ with ’20’:

Save changes and quit:

We’re now ready to run SnpEff:

Upon completion, you should see three new output files:

filtered_snps.ann.vcf – An annotated VCF file
snpEff_genes.txt – A tab-delimited file summarizing the genes found to contain variants
snpEff_summary.html – An HTML summary report

Locating and downloading the SnpEff database for your organism:

To locate and download the SnpEff database for your organism, execute the following command:

For example, if you are working with Arabidopsis thaliana:

Output:

Select the reference/version you are working with (for example: athalianaTair10), then:

4) Visualization (IGV)

Let’s fire up IGV, load the Human_GRCh38 genome, and load our dedup.bam and filtered_snps.vcf files. Let’s look at:

1) Examples of high confidence SNPs in our filtered_snps.vcf file vs SNPs in our alignment not called by GATK

2) Examples of Homozygous variants vs Heterozygous variants.

3) Examples of SNPs filtered out by variant filtering (23, 703, 878)

4) Examples of SNPs in the cystatin C gene (CST3, variants in this gene are associated with Alzheimer’s)