The official FastA documentation can be found here
FastA is the most basic format for reporting a sequence and is accepted by almost all sequence analysis programs. It contains only a sequence name, a description of the sequence (metadata, sequencer info, annotations, etc.), and the sequence itself, which can be either nucleic acids or amino acids as long as it adheres to the format.
Each sequence consists of at least two lines:
- The first is the sequence header, which always starts with a ‘>’.
  - Everything between the leading ‘>’ and the first whitespace is considered the sequence identifier. Everything after that is considered the sequence description (this can be metadata, machine serial number, read orientation, etc.).
- The second is the sequence itself.
  - Note that the sequence can span multiple lines, depending on its length.
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
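The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an established parser: the function names `parse_fasta` and `_make_record` are hypothetical, and the sequence lines in the example are made up, since only the header line is shown above.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Identifier: everything up to the first whitespace.
    # Description: everything after it (may be empty).
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FastA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


example = [
    ">Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02",
    "ACGTACGTAC",  # made-up sequence lines, for illustration only
    "GTTAGC",
]
records = list(parse_fasta(example))
identifier, description, sequence = records[0]
print(identifier)   # Chr1
print(sequence)     # ACGTACGTACGTTAGC
```

The two sequence lines are joined into one string, which is why wrapping a long sequence across lines does not change its content.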
Software that uses FastA format
In most cases throughout this workshop you will encounter this format when using a reference sequence. Database query tools like BLAST and multiple-sequence alignment algorithms expect FastA input. Also, when you download reference genomes, they are delivered in this format.
How are these files generated?
- Some older NGS sequencers report sequences in this format. Sanger sequencing also delivers in this format.
- Most sequence databases offer their sequences for download in FastA format.
- FastA can also be generated from a FastQ file.
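As a rough illustration of the last point, the sketch below converts FastQ records to FastA by keeping the header and sequence and dropping the quality information. It assumes the common four-line FastQ layout (header starting with ‘@’, sequence, ‘+’ separator, quality string) with unwrapped sequences; the function name `fastq_to_fasta` and the example reads are hypothetical.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FastA lines from four-line FastQ records.

    Assumption: every record is exactly four non-blank lines
    (header, sequence, '+' separator, quality string).
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    non_blank = (line for line in stripped if line)
    for index, line in enumerate(non_blank):
        position = index % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"record line {index + 1}: expected '@' header")
            yield ">" + line[1:]  # swap the '@' for '>'
        elif position == 1:
            yield line  # the sequence is copied unchanged
        # positions 2 ('+' separator) and 3 (quality string) are dropped


fastq_example = [
    "@read1 example description",
    "ACGT",
    "+",
    "IIII",
    "@read2",
    "TTGCA",
    "+",
    "IIIII",
]
fasta_lines = list(fastq_to_fasta(fastq_example))
print("\n".join(fasta_lines))
```

The conversion is one-way: the quality scores are discarded, so a FastQ file cannot be rebuilt from the resulting FastA.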
Let’s grab one!
Generally you will download a reference genome. You can find it here: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/sequence/genomic/c_elegans.WS236.genomic.fa.gz
Download it onto the cluster into a new folder in your scratch space called file_formats. Unzip it and look at the size. What command would you use to open it?
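Once the file is on the cluster, a quick sanity check is to list the sequences it contains and their lengths. The sketch below is one possible way to do that in Python; the function name `fasta_lengths` is hypothetical, and it reads either a plain or a gzip-compressed file line by line, so the whole genome is never held in memory.

```python
import gzip
from typing import Dict


def fasta_lengths(path: str) -> Dict[str, int]:
    """Map each sequence identifier in a FastA file to its sequence length."""
    # Pick the opener from the file extension (an assumption of this sketch).
    opener = gzip.open if path.endswith(".gz") else open
    lengths: Dict[str, int] = {}
    current = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split(None, 1)
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


# Example call, assuming the file was downloaded into the current folder:
# for name, length in fasta_lengths("c_elegans.WS236.genomic.fa.gz").items():
#     print(name, length)
```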