GenBank is a comprehensive database of publicly available nucleotide sequences, and their associated protein sequence information, for more than 300,000 organisms named at the genus level or lower. It serves as a public repository for molecular sequence data from a range of sources. NCBI builds GenBank primarily from sequence data submitted by authors at individual laboratories and from bulk submissions of expressed sequence tag (EST), genome survey sequence (GSS), whole genome shotgun, and other high-throughput data from sequencing centers and large-scale sequencing projects.
In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), publication information is recorded in the GenBank data file. The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses.
A major component of NCBI’s mission is to provide access to a variety of databases and software for the scientific and medical communities. GenBank is one of these databases. GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL), and the DNA Databank of Japan (DDBJ) together form the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC archives and makes publicly available more than 80 million individual molecular sequences, including mRNA sequences, genomic survey sequences, and ribosomal RNA gene clusters. Data are exchanged daily among the INSDC partners (GenBank, EMBL, and DDBJ) to keep the molecular sequence data contributed and used by the scientific community consistent and complete. GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
Features annotated in a GenBank record commonly describe regions of a sequence that:
- perform a biological function
- affect or are the result of the expression of a biological function
- interact with other molecules
- affect replication of a sequence
- affect or are the result of recombination of different sequences
- are a recognizable repeated unit
- have secondary or tertiary structure
- exhibit variation, or have been revised or corrected
- Third Party Annotation
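As a rough illustration of how such features appear in practice, the sketch below lists the feature keys and locations found in the FEATURES table of a GenBank flat-file record. The record fragment `SAMPLE_FEATURES` and the helper `feature_keys` are hypothetical names invented for this example, and the parser assumes the usual flat-file layout in which a feature key line is indented by five spaces and qualifier lines are indented further. It ignores locations that wrap onto a second line, so it should be read as a minimal sketch rather than a full parser.

```python
# Hypothetical, abbreviated record fragment used only for illustration.
SAMPLE_FEATURES = """\
FEATURES             Location/Qualifiers
     source          1..120
                     /organism="Example organism"
     gene            10..90
                     /gene="exA"
     CDS             10..90
                     /gene="exA"
                     /product="example protein"
     repeat_region   100..120
ORIGIN
//
"""


def feature_keys(flatfile_text):
    """Return (key, location) pairs from the FEATURES table of one record."""
    pairs = []
    in_features = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # reached the next top-level section, e.g. ORIGIN
        # A feature key line has exactly five leading spaces; qualifier
        # lines are indented further and are skipped here.
        if len(line) > 5 and line[:5] == "     " and line[5] != " ":
            parts = line.split(None, 1)
            location = parts[1].strip() if len(parts) > 1 else ""
            pairs.append((parts[0], location))
    return pairs
```

Calling `feature_keys(SAMPLE_FEATURES)` yields one pair per annotated feature, which can then be counted or filtered by key (for example, keeping only `CDS` entries).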
The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. As GenBank continues to grow beyond a predominantly biomedical resource and is incorporated into non-biomedical research inquiries, it will be necessary to consider ways to link additional electronic indices associated with non-biomedical biological literature.
Reasons for GenBank's popularity include:
- Linking molecular sequences and scientific literature
- Links to cited publication in GenBank record
- Journals linked with largest number of GenBank sequences
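The link between a sequence record and its cited publications can be followed directly from the flat file, where each REFERENCE block may carry a PUBMED line. The sketch below pulls those identifiers out of a record and turns them into PubMed URLs. The record fragment, its authors, and the PubMed ID `1234567` are placeholders invented for illustration, and `pubmed_ids` and `pubmed_url` are hypothetical helper names, not part of any NCBI tool.

```python
import re

# Hypothetical REFERENCE blocks; the names and the PubMed ID are placeholders.
SAMPLE_RECORD = """\
REFERENCE   1  (bases 1 to 120)
  AUTHORS   Doe,J. and Roe,R.
  TITLE     A placeholder title for illustration
  JOURNAL   Example J. 1 (1), 1-10 (2000)
   PUBMED   1234567
REFERENCE   2  (bases 1 to 120)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JAN-2000) Example Institute
"""

# Matches an indented "PUBMED <digits>" line anywhere in the record.
PUBMED_LINE = re.compile(r"^ +PUBMED +(\d+) *$", re.MULTILINE)


def pubmed_ids(flatfile_text):
    """Return the PubMed IDs cited in a flat-file record, in order."""
    return PUBMED_LINE.findall(flatfile_text)


def pubmed_url(pmid):
    """Build the PubMed page URL for one identifier."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
```

References without a PUBMED line, such as direct submissions, are simply skipped, so the result lists only the publications that can be linked.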
There are several ways to search and retrieve data from GenBank.
- Search GenBank for sequence identifiers and annotations with Entrez Nucleotide
- Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool). BLAST searches CoreNucleotide, dbEST, and dbGSS independently; see BLAST info for more information about the numerous BLAST databases
- Search, link, and download sequences programmatically using NCBI e-utilities
- Download GenBank in the ASN.1 and flat file formats from NCBI’s anonymous FTP site
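For the programmatic route, a minimal sketch of building E-utilities requests is shown below. It uses the ESearch and EFetch endpoints with the `nuccore` database. The helper names (`esearch_url`, `efetch_url`, `fetch_text`) and the default `retmax` are assumptions made for this example, and real use should follow NCBI's usage guidelines on request rates and API keys.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=5):
    """Build an ESearch URL that looks up nucleotide records matching a term."""
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids):
    """Build an EFetch URL that returns records in GenBank flat-file format."""
    params = {
        "db": "nuccore",
        "id": ",".join(ids),
        "rettype": "gb",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_text(url, timeout=30):
    """Download a URL and return the response body as text (network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A typical session would call `fetch_text(esearch_url("Homo sapiens[Organism]"))` to obtain matching identifiers, then pass a few of them to `efetch_url` and download the flat-file text. Keeping URL construction separate from the download step makes the request parameters easy to inspect before any network call is made.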