FastA Format

The official FastA documentation can be found here

FastA is the most basic format for reporting a sequence and is accepted by almost all sequence analysis programs. It contains only a sequence name, a description of the sequence (metadata, sequencer info, annotations, etc.), and the sequence itself. The sequence can be either nucleic acids or amino acids, as long as it adheres to the format.

Each sequence consists of at least two lines:

  1. The first is the sequence header, which always starts with a ‘>’
    • Everything from the beginning ‘>’ to the first whitespace is considered the sequence identifier. Everything after that is considered the sequence description (this can be metadata, machine serial number, read orientation, etc.)
  2. The sequence itself
    • Note that the sequence can span multiple lines, depending on the length of the sequence.

Software that uses FastA format

In most cases throughout this workshop you will encounter this format when using a reference sequence. Database query tools such as BLAST and multiple-sequence alignment programs typically take FastA as input. Reference genomes that you download are also delivered in this format.

How are these files generated?

  • Some older NGS sequencers report sequences in this format. Sanger sequencing also delivers in this format.
  • Most sequence databases store sequences in FastA format, and these are available for download.
  • FastA can also be generated from a FastQ file.
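As a rough illustration of the last point, the sketch below converts FastQ records to FastA by keeping the header and sequence and dropping the quality information. It assumes the common four-line FastQ layout (an '@' header, the sequence, a '+' separator line, and the quality string) and will not handle sequences wrapped over several lines. The function name `fastq_to_fasta` and the toy reads are invented for this example.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FastA lines for a four-line-per-record FastQ input."""
    record = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, plus, _quality = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("input does not look like four-line FastQ")
            # Swap the leading '@' for '>' and discard the quality line.
            yield ">" + header[1:]
            yield sequence
            record = []


# Toy FastQ input invented for illustration.
fastq_example = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"

fasta_lines = list(fastq_to_fasta(fastq_example.splitlines()))
print("\n".join(fasta_lines))
```

The conversion only goes one way: the quality scores are lost, so a FastQ file cannot be rebuilt from the resulting FastA.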

Let’s grab one!

Generally you will download a reference genome. For this exercise we will use the C. elegans genome (WormBase release WS236), which you can find here: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/sequence/genomic/c_elegans.WS236.genomic.fa.gz

Download it onto the cluster into a new folder called file_formats in your scratch space. Unzip it and look at its size. What command would you use to open it?
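Once the file is on disk, one possible sanity check is to count how many records it holds and how long the sequences are in total. The sketch below is one way to do that in Python, reading line by line so the whole genome never has to sit in memory. The function name `summarize_fasta` is hypothetical, and treating any path ending in .gz as gzip-compressed is an assumption of this example.

```python
import gzip
from typing import Tuple


def summarize_fasta(path: str) -> Tuple[int, int]:
    """Return (number of records, total sequence length) for a FastA file."""
    # Assumption: a '.gz' suffix means the file is gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    n_records = 0
    n_residues = 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_records += 1  # every header line starts a new record
            else:
                n_residues += len(line)  # sequence lines add to the total
    return n_records, n_residues


# Example call, assuming the downloaded file is in the current directory:
# n_records, n_residues = summarize_fasta("c_elegans.WS236.genomic.fa.gz")
```

Because the function accepts both compressed and uncompressed files, you can run it before and after unzipping and check that the counts agree.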