FastQ Format – NGS Analysis

The official documentation for FastQ format can be found here.

This is the most widely used format in sequence analysis as well as what is generally delivered from a sequencer. Many analysis tools require this format because it contains much more information than FastA.

The format is similar to fasta though there are differences in syntax as well as integration of quality scores. Each sequence requires at least 4 lines:

The first line is the sequence header which starts with an ‘@’ (not a ‘>’!).
- Everything from the leading ‘@’ to the first whitespace character is considered the sequence identifier.
- Everything after the first space is considered the sequence description
The second line is the sequence.
The third line starts with ‘+’ and can have the same sequence identifier appended (but usually doesn’t anymore).
The fourth line are the quality scores

The FastQ sequence identifier generally adheres to a particular format, all of which is information related to the sequencer and its position on the flowcell. The sequence description also follows a particular format and holds information regarding sample information.

What software use FastQ?

Nearly everything works with this format. Some common examples are:

Aligners
- Bowtie, Tophat2
Assemblers
- Velvet, Spades
QC tools
- Trimmomatic, FastQC

I think it’s a shorter list to tell you what does not work with FastQ files. Please note that there are tools available to convert FastQ to FastA in the event that FastQ is incompatible with the tool you’re using:

Blast
Multiple Sequence Aligners
Any reference sequence

How are these files generated?

Sequencers generate this format by default.
This can also be generated from a few different file formats (BAM, SFF, HDF5), though they all were some form of FastQ at some point.

Let’s grab one!


cp /scratch/courses/HITS-2018/file_formats/208_1_merged.fastq $SCRATCH/file_formats/

Check the size of this. What program would you use to view it?