The official documentation for FastQ format can be found here.
This is the most widely used format in sequence analysis as well as what is generally delivered from a sequencer. Many analysis tools require this format because it contains much more information than FastA.
The format is similar to fasta though there are differences in syntax as well as integration of quality scores. Each sequence requires at least 4 lines:
- The first line is the sequence header which starts with an ‘@’ (not a ‘>’!).
- Everything from the leading ‘@’ to the first whitespace character is considered the sequence identifier.
- Everything after the first space is considered the sequence description
- The second line is the sequence.
- The third line starts with ‘+’ and can have the same sequence identifier appended (but usually doesn’t anymore).
- The fourth line are the quality scores
The FastQ sequence identifier generally adheres to a particular format, all of which is information related to the sequencer and its position on the flowcell. The sequence description also follows a particular format and holds information regarding sample information.
What software use FastQ?
Nearly everything works with this format. Some common examples are:
- Bowtie, Tophat2
- Velvet, Spades
- QC tools
- Trimmomatic, FastQC
I think it’s a shorter list to tell you what does not work with FastQ files. Please note that there are tools available to convert FastQ to FastA in the event that FastQ is incompatible with the tool you’re using:
- Multiple Sequence Aligners
- Any reference sequence
How are these files generated?
- Sequencers generate this format by default.
- This can also be generated from a few different file formats (BAM, SFF, HDF5), though they all were some form of FastQ at some point.
Let’s grab one!
cp /scratch/courses/HITS-2018/file_formats/208_1_merged.fastq $SCRATCH/file_formats/
Check the size of this. What program would you use to view it?