GenBank format is a flat-file format that stores a sequence together with its annotation. The annotation section starts with a line beginning with the word “LOCUS”; the sequence itself follows the ORIGIN line, and the record ends with a line containing only “//”. The GenBank format for protein sequences is called GenPept. GenBank (for nucleotides) and GenPept are essentially the same format.
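Because every record runs from a LOCUS line to a “//” terminator, a file holding several records can be split on those markers. The following is a minimal sketch of that idea, not a full parser; the function name `split_records` and the two sample records are made up for illustration.

```python
def split_records(text):
    """Split GenBank flat-file text into records.

    A record starts at a line beginning with LOCUS and ends at a line
    containing only '//'. The terminator itself is not kept.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        elif current or line.startswith("LOCUS"):
            # Ignore anything that appears before the first LOCUS line.
            current.append(line)
    return records


# Hypothetical two-record file, invented for this example.
sample = """LOCUS       AB000001     12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record one.
ORIGIN
        1 acgtacgtac gt
//
LOCUS       AB000002      8 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record two.
ORIGIN
        1 ttgacctg
//
"""

records = split_records(sample)
print(len(records))
```

Each element of `records` can then be handed to whatever reads the individual sections.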
The GenBank, EMBL, and DDBJ nucleic acid sequence data banks have, from their inception, used tables of sites and features to describe the roles and locations of higher-order sequence domains and elements within the genome of an organism. When a record is read, every section before FEATURES is stored in the metadata attribute of the sequence object or its subclass. Each section header and its content are stored in metadata as a key-value pair. The value for REFERENCE is stored as a list, because a single GenBank record often contains several REFERENCE sections.
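The key-value layout described above can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation of any particular library: the function name `parse_header` and the sample record are hypothetical, and indented sub-keywords such as AUTHORS or TITLE are simply folded into the text of their parent section.

```python
def parse_header(record):
    """Collect the sections before FEATURES into a metadata dict.

    Top-level keywords start in the first column; indented lines are
    treated as continuations of the section above them.
    """
    sections = []  # (keyword, [text chunks]) in file order
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            break
        if line and not line[0].isspace():
            keyword, _, rest = line.partition(" ")
            sections.append((keyword, [rest.strip()]))
        elif sections and line.strip():
            sections[-1][1].append(line.strip())

    metadata = {}
    for keyword, chunks in sections:
        value = " ".join(chunks)
        if keyword == "REFERENCE":
            # A record may hold several REFERENCE sections, so keep a list.
            metadata.setdefault("REFERENCE", []).append(value)
        else:
            metadata[keyword] = value
    return metadata


# Hypothetical record, invented for this example.
sample = """LOCUS       AB000001     12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   AB000001
VERSION     AB000001.1
REFERENCE   1  (bases 1 to 12)
  AUTHORS   Doe,J.
  TITLE     A made-up title
REFERENCE   2  (bases 1 to 12)
  AUTHORS   Roe,R.
  TITLE     Direct Submission
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
"""

metadata = parse_header(sample)
print(metadata["ACCESSION"])
print(len(metadata["REFERENCE"]))
```

A real reader would normally split REFERENCE and SOURCE further into their sub-keywords; the flat strings here only show where each section ends up.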
The overall goal of the feature table design is to provide an extensive vocabulary for describing features, within a flexible framework for manipulating them. The annotation section contains the following main elements:
- Definition: A brief description of the sequence, including information such as the source organism, the gene or protein name, or some description of the sequence’s function (if the sequence is non-coding). If the sequence has a coding region (CDS), the description may be followed by a completeness qualifier, such as “complete cds”.
- Accession: The unique identifier for a sequence record.
- Version: A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
- GI: “GenInfo Identifier” sequence identification number, in this case, for the nucleotide sequence.
- Keywords: Word or phrase describing the sequence.
- Source: Free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type.
  - Organism: The formal scientific name for the source organism (genus and species, where appropriate) and its lineage.
- Reference: Publications by the authors of the sequence that discuss the data reported in the record.
  - Authors: List of authors in the order in which they appear in the cited article.
  - Title: Title of the published work or tentative title of an unpublished work.
- Features: Information about genes and gene products, as well as regions of biological significance reported in the sequence. These can include regions of the sequence that code for proteins and RNA molecules, as well as a number of other features.
  - Source: Mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number.
  - Taxon: A stable unique identification number for the taxon of the source organism.
  - protein_id: A protein sequence identification number, similar to the Version number of a nucleotide sequence. Protein IDs consist of three letters followed by five digits, a dot, and a version number.
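- Origin: Marks the start of the sequence data. The sequence is printed in blocks of 10 residues separated by spaces, six blocks (60 residues) per line, and each line begins with the position number of its first residue.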
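Two of the elements above have a layout regular enough to check mechanically: the sequence lines that follow ORIGIN, and the protein_id pattern. The sketch below assumes each sequence line starts with a position number followed by space-separated residue blocks; `read_origin` and `PROTEIN_ID` are illustrative names, and the regular expression encodes only the pattern stated in this list, so identifiers with other layouts would need a broader one.

```python
import re


def read_origin(record):
    """Return the sequence stored between the ORIGIN line and '//'."""
    residues, in_origin = [], False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.strip() == "//":
            break
        elif in_origin:
            # Drop the leading position number, keep the residue blocks.
            residues.extend(line.split()[1:])
    return "".join(residues)


# protein_id layout as described in the list above:
# three letters, five digits, a dot, and a version number.
PROTEIN_ID = re.compile(r"^[A-Z]{3}\d{5}\.\d+$")

# Illustrative sequence section: one full line of six blocks, one partial line.
sample = """ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag ac
//
"""

sequence = read_origin(sample)
print(len(sequence))
print(bool(PROTEIN_ID.match("AAA98665.1")))
```

Joining the blocks after discarding the first token on each line recovers the sequence without relying on exact column widths.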