Before we get started, let’s highlight some guidelines and considerations for ChIP-seq data generation (sequencing) and analysis QC.
Sequencing approach & QC
- Effective analysis of ChIP-seq data requires sufficient coverage by sequence reads (sequencing depth). The required depth depends mainly on the size of the genome and on the number and size of the protein's binding sites.
- For mammalian transcription factors (TFs) and chromatin modifications such as enhancer-associated histone marks, 20 million reads are adequate (4 million reads for worm and fly TFs).
- Proteins with more binding sites (e.g., RNA Pol II) or broader binding domains need more reads, up to 60 million for mammalian ChIP-seq.
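The depth guidelines above can be captured in a small lookup, which is handy when checking many samples at once. This is only a sketch: the category labels (`"mammal"`, `"worm_fly"`, `"tf"`, `"broad"`) and the function names are hypothetical and chosen for illustration; the read counts are the ones quoted in the list above.

```python
# Read-depth guidelines quoted in the text (reads per sample).
# The category labels are illustrative, not a standard vocabulary.
RECOMMENDED_READS = {
    ("mammal", "tf"): 20_000_000,     # mammalian TFs, enhancer-associated marks
    ("mammal", "broad"): 60_000_000,  # many sites / broad domains (upper bound)
    ("worm_fly", "tf"): 4_000_000,    # worm and fly TFs
}


def recommended_reads(genome_class, target_class):
    """Return the guideline read count for a genome/target combination."""
    key = (genome_class.lower(), target_class.lower())
    if key not in RECOMMENDED_READS:
        raise KeyError(f"no guideline recorded for {key}")
    return RECOMMENDED_READS[key]


def depth_shortfall(n_reads, genome_class, target_class):
    """Return how many reads a sample is short of the guideline (0 if none)."""
    return max(0, recommended_reads(genome_class, target_class) - n_reads)


if __name__ == "__main__":
    # A mammalian TF sample sequenced to 15 million reads.
    print(depth_shortfall(15_000_000, "mammal", "tf"))
```

Treat the returned numbers as rough targets rather than hard cut-offs, since the appropriate depth also depends on the number and size of the binding sites.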
Sequencing depth rules of thumb:
- >10M reads for narrow peaks
- >20M reads for broad peaks
- Long & paired-end reads useful but not essential
- Replicates are a good idea but, unlike in RNA-Seq, more than 2 replicates do not significantly increase the number of targets identified.
- Control samples should be sequenced significantly deeper than the ChIP samples in TF experiments and in experiments involving diffuse broad-domain chromatin data.
- This ensures sufficient coverage of a substantial portion of the genome and non-repetitive autosomal DNA regions.
- Always check the quality of the raw sequenced reads to determine the appropriate QC/QT steps. For example, marking duplicates rather than filtering them out is the better approach in ChIP-seq experiments. However, if a substantial fraction of the raw reads are flagged as PCR duplicates (judging by the duplication levels in the FastQC report), it is probably a good idea to filter them out (Hint: Picard tools MarkDuplicates can achieve this).
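Once duplicates have been marked, the fraction of flagged reads can be estimated directly from the alignment records. The sketch below assumes SAM-formatted text lines (for example, the output of `samtools view`) in which duplicates already carry the SAM duplicate flag (0x400). The function name is hypothetical, and the result is a simple read-level fraction, not a reproduction of the metrics file that Picard itself writes.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def duplicate_fraction(sam_lines):
    """Fraction of primary mapped reads that carry the duplicate flag.

    `sam_lines` is any iterable of SAM text lines; header lines
    (starting with '@') and blank lines are skipped.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        # Count each read once: ignore unmapped, secondary and
        # supplementary records.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if flag & DUPLICATE:
            duplicates += 1
    return duplicates / total if total else 0.0
```

What counts as a "substantial" fraction is left to the analyst here, since the text does not give a numeric cut-off.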
Read alignment
When aligning to a reference genome, we are interested in the uniquely mapped reads. Again, a good rate of uniquely mapped reads is ~70% (of quality-trimmed reads, not raw reads) for mammalian genomes (human, mouse); however, this varies between genomes. Less than 50% might be a cause for concern.
A low percentage of uniquely mapped reads might point towards high amplification (PCR) and/or sequencing bias (optical duplicates). In this case, revisit the QC reports.
Peak callers tend to ignore multi-mapping reads during peak calling.
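The mapping-rate thresholds above are easy to check programmatically once the read counts are known. The following is a minimal sketch: the function name and the status labels (`"ok"`, `"borderline"`, `"concern"`) are hypothetical, and the 70% and 50% defaults are the mammalian figures quoted above, so they would need adjusting for other genomes.

```python
def unique_mapping_report(unique_reads, trimmed_reads, good=70.0, concern=50.0):
    """Return (percent uniquely mapped, status label).

    The percentage is taken relative to quality-trimmed reads, not raw
    reads. The default thresholds are the mammalian guidelines from the
    text; the status labels are illustrative only.
    """
    if trimmed_reads <= 0:
        raise ValueError("trimmed_reads must be positive")
    percent = 100.0 * unique_reads / trimmed_reads
    if percent >= good:
        status = "ok"
    elif percent >= concern:
        status = "borderline"
    else:
        status = "concern"
    return percent, status


if __name__ == "__main__":
    # 9 million uniquely mapped reads out of 20 million trimmed reads.
    print(unique_mapping_report(9_000_000, 20_000_000))
```

A sample that falls into the lowest category is a prompt to revisit the QC reports rather than an automatic reason to discard the data.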