Types of Databases
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experimental technologies, and computational analysis. A biological database is a large, organized body of persistent data, usually associated with software designed to update, query, and retrieve components of the data stored within the system.
For researchers to benefit from the data stored in a database, two additional requirements must be met:
- Easy access to the information
- A method for extracting only the information needed to answer a specific biological question
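The second requirement, extracting only the information needed, can be illustrated with a minimal Python sketch. It assumes a tiny in-memory collection; the `Record` type, the `query` helper, and all record values (accessions, gene names, sequences) are made up for illustration and do not correspond to entries in any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    accession: str  # hypothetical identifier, not a real accession number
    organism: str
    gene: str
    sequence: str


# Made-up records, for illustration only.
RECORDS = [
    Record("EX0001", "Mus musculus", "geneA", "ATGGCC"),
    Record("EX0002", "Danio rerio", "geneA", "ATGGCA"),
    Record("EX0003", "Mus musculus", "geneB", "ATGTTT"),
]


def query(records, **criteria):
    """Return only the records whose fields match every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# "Which entries describe geneA in the mouse?"
hits = query(RECORDS, organism="Mus musculus", gene="geneA")
```

Real databases expose far richer query interfaces, but the principle is the same: the researcher states criteria, and the system returns only the matching subset rather than the whole collection.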
According to the 2014 Molecular Biology Database Collection published in the journal Nucleic Acids Research, a total of 1,552 databases are publicly accessible online.
They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities between biological sequences and structures.
Primary databases are also called archival databases. They are populated with experimentally derived data such as nucleotide sequences, protein sequences, or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed; they form part of the scientific record.
- ENA, GenBank and DDBJ
- ArrayExpress Archive and GEO
- Protein Data Bank (PDB)
- Swiss-Prot and PIR for protein sequences
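The archival behaviour described above (an entry receives an accession number and is then never changed) can be sketched in Python as an append-only store of immutable records. This is only an illustration of the idea: the `PrimaryArchive` class, the `EX` accession prefix, and the example sequence are hypothetical, and real archives have their own accession formats and submission procedures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ArchivedEntry:
    accession: str
    data: str


class PrimaryArchive:
    """Toy append-only archive: entries can be added and read, never edited."""

    def __init__(self):
        self._entries = {}
        self._next_id = 1

    def submit(self, data: str) -> str:
        # Assign a new accession (hypothetical "EX" + six digits) and store the entry.
        accession = f"EX{self._next_id:06d}"
        self._next_id += 1
        self._entries[accession] = ArchivedEntry(accession, data)
        return accession

    def get(self, accession: str) -> ArchivedEntry:
        return self._entries[accession]


archive = PrimaryArchive()
acc = archive.submit("ATGGCCATTGTAATGGGCCGC")  # made-up sequence
entry = archive.get(acc)
# Attempting `entry.data = "..."` raises dataclasses.FrozenInstanceError.
```

The class deliberately offers no update or delete method, so a submitted record stays part of the stored record exactly as it was deposited.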
Secondary databases have data derived from the results of analysing primary data. They often draw upon information from many sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
- UniProt Knowledgebase
Species-specific databases are available for some species, mainly those that are often used in research (model organisms).
- Mouse Genome Informatics for the laboratory mouse
- the Rat Genome Database for Rattus
- ZFIN for Danio rerio (zebrafish)
There are also specialized databases that cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project specialize in a particular organism or a particular type of data.
Many data resources have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL) and even more from careful manual analysis (Swiss-Prot).
Biological databases can be broadly classified into sequence, structure, and functional databases. Nucleic acid and protein sequences are stored in sequence databases, while structure databases store solved structures of RNA and proteins. Functional databases provide information on the physiological role of gene products, for example enzyme activities, mutant phenotypes, or biological pathways.
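As a rough illustration of this three-way classification, the following Python sketch maps the kinds of content named in the paragraph to their database class. The `DatabaseClass` enum, the `classify` function, and the exact wording of the content labels are assumptions made for this example, not an established vocabulary.

```python
from enum import Enum


class DatabaseClass(Enum):
    SEQUENCE = "sequence"
    STRUCTURE = "structure"
    FUNCTIONAL = "functional"


# Content types taken from the paragraph above; the label strings are illustrative.
CONTENT_TO_CLASS = {
    "nucleic acid sequence": DatabaseClass.SEQUENCE,
    "protein sequence": DatabaseClass.SEQUENCE,
    "rna structure": DatabaseClass.STRUCTURE,
    "protein structure": DatabaseClass.STRUCTURE,
    "enzyme activity": DatabaseClass.FUNCTIONAL,
    "mutant phenotype": DatabaseClass.FUNCTIONAL,
    "biological pathway": DatabaseClass.FUNCTIONAL,
}


def classify(content_type: str) -> DatabaseClass:
    """Return the database class that stores the given kind of content."""
    key = content_type.strip().lower()
    if key not in CONTENT_TO_CLASS:
        raise ValueError(f"unknown content type: {content_type!r}")
    return CONTENT_TO_CLASS[key]
```

For instance, `classify("Protein sequence")` returns `DatabaseClass.SEQUENCE`, while `classify("mutant phenotype")` returns `DatabaseClass.FUNCTIONAL`. In practice a single resource may hold several of these content types at once.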
Main sequence databases:
Main protein databases:
- Entrez Protein
- ENSEMBL (Human, mouse and others)
- SGD (Yeast)
- TAIR (Arabidopsis)
- Web of Science