BioWarehouse, part of the Bio-SPICE project, is an open-source software environment for integrating a set of biological databases into a single physical database management system for data management, mining, and exploration in bioinformatics. It consists of:
- A relational database schema that models important bioinformatics data types
- Instances of the schema can be hosted on either the Oracle or MySQL database management system
- A collection of loader programs that populate the warehouse with data from public biological databases
- The loader programs transform the syntax of the source databases into relational form, and map the diverse semantics of the source databases onto the common semantics of the BioWarehouse schema
BioWarehouse is populated by loader programs that translate the flat-file representation of a source database into the warehouse schema. A loader is provided for each source database supported by BioWarehouse. Once loaded into a BioWarehouse instance running on, for example, MySQL, a set of source databases can be queried together.
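The flat-file-to-relational step can be illustrated with a minimal sketch. It is not one of the actual BioWarehouse loaders (those are written in C and Java): the record layout below is a toy ENZYME-like format, the `reaction` table and its columns are hypothetical, and SQLite stands in for MySQL or Oracle so the example is self-contained.

```python
import sqlite3

# Toy flat-file text: two-letter line codes, "//" terminates a record.
# The layout is ENZYME-like but simplified for illustration.
FLAT_FILE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
//
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
//
"""

def parse_records(text):
    """Yield one dict per record, mapping line code -> joined values."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
            continue
        code, _, value = line.partition("   ")
        # Repeated line codes are concatenated into a single value.
        record[code] = (record.get(code, "") + " " + value.strip()).strip()

def load(conn, text, dataset_name):
    """Insert parsed records into a hypothetical 'reaction' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reaction "
        "(ec_number TEXT, name TEXT, dataset TEXT)"
    )
    rows = [
        (r["ID"], r["DE"].rstrip("."), dataset_name)
        for r in parse_records(text)
    ]
    conn.executemany("INSERT INTO reaction VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL/Oracle instance
loaded = load(conn, FLAT_FILE, "ENZYME")
names = [
    row[0]
    for row in conn.execute("SELECT name FROM reaction ORDER BY ec_number")
]
print(loaded, names)
```

Tagging each row with the name of its source is one simple way to keep provenance once several databases share the same tables.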
Some loaders are specific to a data format rather than to a single source database. For each loader, there are two pieces of documentation:
- how to build and run the loader
- a “manual” for developers describing the details of the loader implementation and schema mappings.
BioWarehouse is an open-source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. It currently supports the integration of a pathway-centric set of databases, including ENZYME, KEGG, and BioCyc, as well as the UniProt, GenBank, NCBI Taxonomy, and CMR databases and the Gene Ontology. Loader tools, written in C and Java, parse these databases and load them into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics data types:
- biochemical reactions
- chemical compounds
- metabolic pathways
- nucleic acid sequences
- features on protein and nucleic-acid sequences
- organism taxonomies
- controlled vocabularies
BioWarehouse can be used to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. Such an analysis found that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier to metabolic engineering.
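The shape of such a query can be sketched as follows. The table names (`enzyme_activity`, `protein_sequence`), the columns, and the data are all invented for illustration and do not reflect the real BioWarehouse schema or the published result. SQLite is used in place of MySQL or Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables; EC identifiers are placeholders, not real data.
CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
CREATE TABLE protein_sequence (protein_id TEXT, ec_number TEXT);

INSERT INTO enzyme_activity VALUES ('EC-1'), ('EC-2'), ('EC-3'), ('EC-4');
INSERT INTO protein_sequence VALUES
    ('P1', 'EC-1'), ('P2', 'EC-1'), ('P3', 'EC-3');
""")

# LEFT JOIN keeps every enzyme activity; activities with no matching
# sequence show up with a NULL on the sequence side.
total, missing = conn.execute("""
SELECT COUNT(*),
       SUM(CASE WHEN s.ec_number IS NULL THEN 1 ELSE 0 END)
FROM enzyme_activity a
LEFT JOIN (SELECT DISTINCT ec_number FROM protein_sequence) s
       ON s.ec_number = a.ec_number
""").fetchone()

fraction_without_sequence = missing / total
print(f"{missing} of {total} activities lack a sequence")
```

The `DISTINCT` subquery prevents an activity with several sequences from being counted more than once in the total.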
The first component of BioWarehouse is a set of relational database schema definitions that model many bioinformatics data types. The schema is stored in a format that can be automatically converted to both an Oracle schema and a MySQL schema. The data types covered by the BioWarehouse schema include:
- Genes and proteins
- Pathways, reactions, and small molecules
- Sequences and sequence features
- Controlled vocabularies
- Gene expression data
- Protein expression data
- Flow cytometry data
- Organisms and taxonomic relationships
- Results of computations, such as sequence matches
An important aspect of the BioWarehouse approach is that data of the same type from different source databases is loaded into the same BioWarehouse tables.
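A minimal sketch of this design, assuming a hypothetical `protein` table whose rows carry a reference to a hypothetical `dataset` table recording where each row came from. The names, columns, and rows are illustrative only, and SQLite stands in for MySQL or Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: one shared table per data type, plus a table
-- naming the source database each row was loaded from.
CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT,
    dataset_id INTEGER REFERENCES dataset(dataset_id)
);

INSERT INTO dataset VALUES (1, 'UniProt'), (2, 'BioCyc');
-- Illustrative rows, not real database content.
INSERT INTO protein (name, dataset_id) VALUES
    ('hexokinase', 1), ('hexokinase', 2), ('enolase', 2);
""")

# One query spans every source because all proteins share one table.
counts = dict(conn.execute("""
SELECT d.name, COUNT(*)
FROM protein p JOIN dataset d ON d.dataset_id = p.dataset_id
GROUP BY d.name
"""))

# Protein names that appear in more than one source database.
shared = [row[0] for row in conn.execute("""
SELECT name FROM protein
GROUP BY name
HAVING COUNT(DISTINCT dataset_id) > 1
""")]

print(counts, shared)
```

Because the source is just another column, restricting a query to one database or comparing databases needs only a `WHERE` or `GROUP BY` clause instead of a separate schema per source.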