One of the core issues of bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some simple human-readable formats have over time attained the status of de facto standards. A ubiquitous example of this is the ‘FASTA sequence file format’, originally invented by Bill Pearson as an input format for his FASTA suite of tools. Over time, this format has evolved by consensus; however, in the absence of an explicit standard, some parsers fail to cope with very long ‘>’ title lines or with very long sequences that are not line wrapped. There is also no standardization for record identifiers.
In the area of DNA sequencing, the FASTQ file format has emerged as another de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score associated with each nucleotide in a sequence. This is a very minimal representation of a sequencing read. No doubt because of this simplicity, the FASTQ format has become widely used as an interchange file format.
It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA formatted sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer. The Sanger FASTQ format is useful both for raw sequencing reads and for post-processed assemblies, where higher qualities occur.
FASTQ acts as a common file format for sharing sequencing read data, combining both the sequence and an associated per-base quality score, despite lacking any formal definition to date and existing in at least three incompatible variants.
A FASTQ file normally uses four lines per sequence.
- Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as there are letters in the sequence. Each symbol is the PHRED quality score of the corresponding base, encoded as a single ASCII character (for space-efficient encoding).
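The four-line layout described above can be sketched as a small reader. This is a minimal illustration that assumes unwrapped records (exactly four lines each); the names `FastqRecord` and `read_fastq` are hypothetical and not part of any existing library.

```python
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    title: str      # identifier plus optional description, without the '@'
    sequence: str   # raw sequence letters (Line 2)
    quality: str    # encoded quality string (Line 4)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-lines-per-record FASTQ stream."""
    while True:
        title = handle.readline()
        if not title:
            return  # end of input
        title = title.rstrip("\r\n")
        if not title:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")

        if not title.startswith("@"):
            raise ValueError(f"Line 1 should start with '@': {title!r}")
        if not plus.startswith("+"):
            raise ValueError(f"Line 3 should start with '+': {plus!r}")
        # The '+' line may optionally repeat the title from Line 1.
        if len(plus) > 1 and plus[1:] != title[1:]:
            raise ValueError("Title repeated on the '+' line does not match")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ")

        yield FastqRecord(title[1:], sequence, quality)
```

For example, `list(read_fastq(open("reads.fastq")))` would load every record from a file named `reads.fastq` (a placeholder name). Wrapped sequence or quality lines are deliberately not handled here.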
The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped across several lines, but this is generally discouraged due to the unfortunate choice of “@” and “+” as markers: both characters can also occur within the quality string, which makes the record boundaries ambiguous for a parser.
There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.
The “Sanger” FASTQ format uses the standard Phred formula for quality calculation. Quality values are converted to a single character using the ASCII table. As this table starts with 32 non-printing characters, each quality value Q is represented by the character whose ASCII code is Q+A, where the offset A is at least 33 (the Sanger variant uses 33), so that every score maps to a printable character. On Illumina sequencing platforms, the quality scores are first written to binary base call (BCL) files, which are later converted to FASTQ files using the bcl2fastq tool.
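The encoding can be illustrated with a short sketch. It assumes the usual Phred definition Q = -10 log10(p), where p is the estimated probability that a base call is wrong, and the Sanger offset of 33; the function names are hypothetical and chosen only for this example.

```python
import math
from typing import Iterable, List

SANGER_OFFSET = 33  # offset A for the Sanger variant (assumption stated above)


def error_probability_to_phred(p: float) -> int:
    """Phred score for an error probability p, rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def phred_to_error_probability(q: int) -> float:
    """Inverse of the Phred formula: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def encode_quality(scores: Iterable[int], offset: int = SANGER_OFFSET) -> str:
    """Map each score Q to the character with ASCII code Q + offset."""
    return "".join(chr(q + offset) for q in scores)


def decode_quality(quality: str, offset: int = SANGER_OFFSET) -> List[int]:
    """Recover the numeric scores from an encoded quality string."""
    return [ord(ch) - offset for ch in quality]
```

With these definitions, `encode_quality([0, 40])` gives `"!I"`, and `decode_quality("!I")` returns `[0, 40]`. Passing a different `offset` shows why variants with different offsets are incompatible: the same character decodes to a different score.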