
FASTA (Sequence Format)


In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near-universal standard in the field of bioinformatics.

In the original format, a sequence was represented as a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. This was mostly to allow for pre-allocation of fixed line sizes in software: at the time, most users relied on Digital Equipment Corporation (DEC) VT220 (or compatible) terminals, which could display 80 or 132 characters per line.

The first line in a FASTA file started either with a “>” (greater-than) symbol or, less frequently, a “;” (semicolon), and was taken as a comment. Any further lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly became used to hold a summary description of the sequence, often starting with a unique library accession number, and with time it has become commonplace to always use “>” for the first line and to not use “;” comments (which would otherwise be ignored). The description line is therefore distinguished from the sequence data by the “>” symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
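The layout described above (one “>” description line followed by short sequence lines) can be illustrated with a minimal Python sketch. The function name `format_fasta` and the default width of 70 characters are assumptions made for this example; the format itself only recommends that lines stay shorter than 80 characters.

```python
def format_fasta(description: str, sequence: str, width: int = 70) -> str:
    """Return one FASTA record: a '>' description line, then wrapped sequence lines.

    `format_fasta` and the default `width` are illustrative choices,
    not part of any standard library or of the FASTA package.
    """
    if width < 1:
        raise ValueError("width must be positive")
    # The description has to fit on a single line, so collapse any line breaks.
    header = ">" + " ".join(description.split())
    # Cut the sequence into fixed-width chunks (the last chunk may be shorter).
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


# Hypothetical record: a made-up identifier and a 150-residue dummy sequence.
record = format_fasta("seq1 hypothetical example protein", "MKT" * 50)
print(record, end="")
```

With the default width, the 150-letter sequence is written as two lines of 70 characters and one of 10, all under the recommended limit.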

Following the initial line (used for a unique description of the sequence) was the actual sequence itself, written as a string of standard one-letter codes. Anything other than a valid character would be ignored (including spaces, tabs, asterisks, etc.). It was also common to end the sequence with an “*” (asterisk) character (in analogy with its use in PIR formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence.
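As a rough illustration of these reading rules, the following Python sketch parses FASTA text into (description, sequence) pairs. It is a simplification under stated assumptions, not the behaviour of any particular tool: `parse_fasta` is a hypothetical name, lines starting with “;” are skipped entirely (following the modern convention rather than treating a leading “;” line as the description), and "valid characters" are taken to be ASCII letters plus the “-” gap symbol.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description: Optional[str] = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            # Assumption: skip blank lines and old-style ';' comment lines.
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(parts)
            description = line[1:].strip()
            parts = []
        elif description is not None:
            # Keep ASCII letters and '-'; drop spaces, tabs, digits, '*', etc.
            parts.append("".join(
                ch for ch in line if (ch.isascii() and ch.isalpha()) or ch == "-"
            ))
    if description is not None:
        yield description, "".join(parts)


# Made-up input exercising a comment, embedded spaces, a trailing '*',
# and a blank line between the description and the sequence.
example = """>seq1 hypothetical example record
; an old-style comment line
ACGT ACGT
acgtn*
>seq2

MKV-LL*
"""

for description, sequence in parse_fasta(example.splitlines()):
    print(description, sequence)
```

The sketch keeps the original letter case; mapping to upper case is a separate normalisation step.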

Many popular bioinformatics tools accept input in FASTA format because it is easy to read, modify and analyse.

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:

  • lower-case letters are accepted and are mapped into upper-case
  • a single hyphen or dash can be used to represent a gap of indeterminate length
  • in amino acid sequences, U and * are acceptable letters
  • any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue)
  • blank lines are not allowed in the middle of FASTA input
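The rules in this list could be applied to a single sequence string roughly as follows. This is a sketch under assumptions: `normalize_sequence` is a hypothetical helper, the two letter sets are an illustrative subset of the IUPAC codes (the 16 nucleotide codes, and the 20 standard amino acids plus B, Z and X) rather than an authoritative list, and digits are removed rather than replaced, which is only one of the two options the list allows.

```python
# Illustrative alphabets (assumption: not an exhaustive or official list).
NUCLEOTIDE_CODES = set("ACGTURYKMSWBDHVN")
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTVWXYZ")


def normalize_sequence(sequence: str, protein: bool = False) -> str:
    """Upper-case a sequence, drop whitespace and digits, and check its letters."""
    if protein:
        # U and '*' are additionally accepted in amino acid sequences.
        allowed = AMINO_ACID_CODES | {"U", "*"}
    else:
        allowed = set(NUCLEOTIDE_CODES)
    allowed.add("-")  # a hyphen stands for a gap of indeterminate length

    result = []
    for ch in sequence.upper():  # lower-case letters are mapped to upper case
        if ch.isspace() or ch.isdigit():
            # Assumption: digits (e.g. position counters) are removed,
            # not replaced by N or X.
            continue
        if ch not in allowed:
            raise ValueError(f"unexpected character {ch!r} in sequence")
        result.append(ch)
    return "".join(result)


print(normalize_sequence("acgt 12 nn-"))
print(normalize_sequence("mkU*", protein=True))
```

Raising an error on unknown characters is a design choice for this example; a more lenient tool might instead silently drop them or substitute N or X.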

There is no standard file extension for a text file containing FASTA formatted sequences. Some examples of widely used file extensions are ‘.fasta’, ‘.fa’, ‘.fna’, ‘.faa’, ‘.frn’ or simply ‘.txt’.
