One of the core problems in bioinformatics is dealing with poorly defined or ambiguous file formats. Some simple human-readable formats have over time attained the status of de facto standards. A ubiquitous example is the ‘FASTA sequence file format’, originally invented by Bill Pearson as an input format for his FASTA suite of tools. This format has evolved by consensus; however, in the absence of an explicit standard, some parsers fail to cope with very long ‘>’ title lines or with very long sequences that are not line wrapped. There is also no standardization of record identifiers.
In the area of DNA sequencing, the FASTQ file format has emerged as another de facto common format for data exchange between tools. It provides a simple extension to the FASTA format: the ability to store a numeric quality score for each nucleotide in a sequence. This is a very minimal representation of a sequencing read.
The Solexa FASTQ variant was designed to store both a sequence and its associated quality values, such as those produced by sequencing instruments. It makes no provision for adding comments after the sequence field.
In 2004, Solexa introduced its own version of the FASTQ format, which is incompatible with the original yet cannot readily be told apart from it by inspection. Although the FASTQ format records only a single quality score per letter, Solexa also produced other files with quality scores for all four bases, and an alternative logarithmic mapping was used in order to represent low-quality information more fully. In 2006, Solexa was acquired by Illumina, which continued to use this FASTQ variant. The OBF projects (and others, such as MAQ) refer to this as the Solexa FASTQ variant, format name ‘fastq-solexa’.
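The difference between the two logarithmic mappings can be sketched in a few lines of Python. The sketch assumes the usual definitions, which are not spelled out in the text above: a Phred score is -10·log10(p) for an estimated error probability p, whereas a Solexa score is computed from the odds p/(1-p). The function names are illustrative only.

```python
import math


def phred_quality(p_error: float) -> float:
    """Phred score for an error probability: -10 * log10(p)."""
    return -10 * math.log10(p_error)


def solexa_quality(p_error: float) -> float:
    """Solexa score: -10 * log10(p / (1 - p)), written as 10 * log10((1 - p) / p)."""
    return 10 * math.log10((1 - p_error) / p_error)


# Compare the two scales from a good base call down to a very poor one.
for p in (0.001, 0.1, 0.5, 0.75):
    print(f"p={p}: Phred {phred_quality(p):.2f}, Solexa {solexa_quality(p):.2f}")
```

For small error probabilities the two scores are almost identical. They diverge for poor base calls, where the Solexa score keeps falling and becomes negative once p exceeds 0.5, which is how the variant resolves low-quality information more finely.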
Although Solexa/Illumina read files look very similar to standard FASTQ files, their quality values are scaled and encoded differently. As a rule of thumb, if the quality string contains a character with an ASCII code above 90, the file is probably in the Solexa/Illumina format.
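This rule of thumb can be written as a small heuristic. The sketch below is an illustration, not a reliable detector: the function name `guess_quality_offset` is hypothetical, the threshold of 90 is the one quoted above, and the lower bound of 59 is the smallest ASCII code the Solexa/Illumina encoding can produce. It also assumes that the only alternative is the standard FASTQ encoding with an ASCII offset of 33.

```python
from typing import Iterable, Optional


def guess_quality_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) from quality strings; None if undecided."""
    codes = [ord(char) for line in quality_lines for char in line]
    if not codes:
        return None  # nothing to inspect
    if min(codes) < 59:
        # Characters below ASCII 59 cannot occur in the Solexa/Illumina encoding.
        return 33
    if max(codes) > 90:
        # Rule of thumb from the text: codes above 90 suggest Solexa/Illumina.
        return 64
    return None  # every character is valid in both encodings


print(guess_quality_offset(["IIII!!II"]))
print(guess_quality_offset(["hhhhhhXX"]))
```

Because many characters are valid in both encodings, a short or uniformly high-quality file can leave the question undecided, so it is safer to inspect many records before trusting the guess.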
A Solexa FASTQ file normally uses four lines per sequence record:
- Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
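The four-line layout described above can be read with a short parser. This is a minimal sketch that assumes every record occupies exactly four lines, with no line wrapping of the sequence or quality string; `FastqRecord` and `read_four_line_fastq` are illustrative names rather than part of any existing library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    title: str      # identifier plus optional description, without the '@'
    sequence: str   # raw sequence letters
    quality: str    # encoded quality string, one symbol per letter


def read_four_line_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file that uses exactly four lines per record."""
    lines = (line.rstrip("\r\n") for line in handle)
    for title_line in lines:
        if not title_line:
            continue  # tolerate blank lines between records
        sequence = next(lines, None)
        plus_line = next(lines, None)
        quality = next(lines, None)
        if sequence is None or plus_line is None or quality is None:
            raise ValueError("truncated record at end of file")
        if not title_line.startswith("@"):
            raise ValueError(f"expected '@' title line, got {title_line!r}")
        if not plus_line.startswith("+"):
            raise ValueError(f"expected '+' line, got {plus_line!r}")
        if plus_line[1:] and plus_line[1:] != title_line[1:]:
            raise ValueError("title repeated on the '+' line does not match")
        if len(quality) != len(sequence):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(title_line[1:], sequence, quality)


demo = io.StringIO("@read1 demo\nACGT\n+\nhhhh\n@read2\nGGCA\n+read2\nhhXX\n")
for record in read_four_line_fastq(demo):
    print(record.title, record.sequence, record.quality)
```

Counting lines rather than searching for ‘@’ matters here, because ‘@’ (ASCII 64) is itself a valid quality symbol and can therefore appear at the start of a quality line.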
The Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126, although raw read data is expected to contain only Solexa scores from -5 to 40. From Illumina 1.3 up to, but not including, Illumina 1.8, the format instead encoded a Phred quality score from 0 to 62 using ASCII 64 to 126.
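Both ranges correspond to an ASCII offset of 64, so decoding amounts to subtracting 64 from each character code. The sketch below decodes a Solexa/Illumina 1.0 quality string and converts the resulting Solexa scores to Phred scores. The conversion formula, 10·log10(10^(Q/10) + 1), is an assumption that follows from the standard definitions of the two scales and is not stated in the text; the function names are illustrative.

```python
import math
from typing import List

SOLEXA_OFFSET = 64  # ASCII 59..126 maps to scores -5..62


def decode_solexa_quality(quality_string: str) -> List[int]:
    """Decode a Solexa/Illumina 1.0 quality string into integer Solexa scores."""
    scores = [ord(char) - SOLEXA_OFFSET for char in quality_string]
    for char, score in zip(quality_string, scores):
        if not -5 <= score <= 62:
            raise ValueError(f"character {char!r} is outside the Solexa range")
    return scores


def solexa_to_phred(solexa_score: float) -> float:
    """Convert a Solexa score to the equivalent Phred score (assumed formula)."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)


solexa_scores = decode_solexa_quality(";@Th")
print(solexa_scores)
print([round(solexa_to_phred(score), 2) for score in solexa_scores])
```

For Illumina 1.3 to 1.7 files the same subtraction already yields Phred scores, so the conversion step would be skipped and the valid range would start at 0 rather than -5.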