The Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to increase reproducibility and facilitate new discoveries through data analysis. It accepts data from all kinds of sequencing projects, including clinically important studies that involve human subjects. Data from such studies often have controlled access via dbGaP (the database of Genotypes and Phenotypes).
The preservation of experimental data is an important part of the scientific record, and increasing numbers of journals and funding agencies require that next-generation sequence data be deposited in the SRA. The Sequence Read Archive is a bioinformatics database that provides a public repository for DNA sequencing data, especially the “short reads” generated by high-throughput sequencing, which are typically fewer than 1,000 base pairs in length. It was originally called the Short Read Archive; the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads.
The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). All metadata is exchanged daily among these three databases, so it can be accessed and retrieved through any of them. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data.
SRA works as a core infrastructure for sharing pre-publication sequence data, with the aim of making sequence data available to the research community. The database also stores alignment information in the form of read placements on a reference sequence.
The archive was established by the National Center for Biotechnology Information (NCBI) in 2007 to provide a repository for data produced by RNA-Seq and ChIP-Seq studies as well as large-scale studies including the Human Microbiome Project and the 1000 Genomes Project. The volume of data deposited in the SRA has grown rapidly since 2008. As of September 2010, 65% of the SRA was human genomic sequence, with another 16% relating to human metagenome sequence reads. As of June 2011, most SRA sequence data had been produced by Illumina’s Genome Analyzer, and the data contained within the SRA had passed 100 terabases of DNA in volume.
- Archives raw sequencing data and alignment information from high-throughput sequencing platforms, including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, Complete Genomics, and Pacific Biosciences SMRT.
- Makes sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets.
- Uses XML schemas to describe the metadata that accompanies submitted data.
- Uses the NCBI SRA Toolkit for storing and exchanging all next-generation sequence data.
- Supports all commonly used data file formats.
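
The SRA Toolkit named in the list above is normally driven from the command line. As a minimal sketch of how a download might be scripted, the Python below builds and runs the toolkit's `prefetch` and `fasterq-dump` commands for a single run. Several details are assumptions not taken from the text above: the run accession format (prefix `SRR`, `ERR` or `DRR` followed by digits, corresponding to NCBI, EBI and DDBJ), the example accession `SRR000001`, the helper names `build_commands` and `download_run`, and the output directory `sra_data`. It also assumes the toolkit is installed and on `PATH`.

```python
import re
import shutil
import subprocess

# Assumed run-accession pattern: SRR (NCBI), ERR (EBI) or DRR (DDBJ) plus digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def build_commands(accession, outdir="sra_data"):
    """Return the SRA Toolkit command lines for one run accession.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    if not RUN_ACCESSION.fullmatch(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return [
        # Fetch the archived run with the toolkit's default settings.
        ["prefetch", accession],
        # Convert the run to FASTQ files written under outdir.
        ["fasterq-dump", accession, "--outdir", outdir, "--split-files"],
    ]


def download_run(accession, outdir="sra_data"):
    """Run the commands in order, stopping at the first failure."""
    for tool in ("prefetch", "fasterq-dump"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH; is the SRA Toolkit installed?")
    for cmd in build_commands(accession, outdir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    download_run("SRR000001")  # example accession, chosen for illustration
```

Keeping command construction separate from execution means the argument lists can be inspected or logged without contacting the archive. Controlled-access data distributed through dbGaP requires additional authorization that this sketch does not cover.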