The Database of Genotypes and Phenotypes (dbGaP) is a National Institutes of Health (NIH)-sponsored repository charged with archiving, curating, and distributing information produced by studies investigating the interaction of genotype and phenotype in humans. dbGaP assigns stable, unique identifiers to studies and to subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data.
It was launched in response to the development of the NIH’s genome-wide association study (GWAS) policy and provides unprecedented access to very large genetic and phenotypic datasets funded by the NIH and other agencies worldwide. Scientists from the global research community may access all public data and may apply for access to controlled-access data.
The information in dbGaP includes individual-level molecular and phenotype data, analysis results, medical images, general information about each study, and documents that contextualize phenotypic variables, such as research protocols and questionnaires. Submitted data undergo quality control and curation by dbGaP staff before being released to the public.
Information about submitted studies, summary-level data, and documents related to studies can be accessed freely on the dbGaP website. Individual-level data can be accessed only after approval of a Controlled Access application that states the research objectives and demonstrates the ability to adequately protect the data. Public summary data from dbGaP can also be accessed without restriction through the Phenotype-Genotype Integrator (PheGenI) tool.
Variables are created from the columns of a dataset; each variable and dataset is accessioned using the general dbGaP format ph(v|t)######.v#.p#. A variable’s version (v#) changes when either its data values or its entry in the data dictionary changes. A dataset’s version changes when a variable within the dataset is added, updated, or deleted. For both variables and datasets, the participant set (p#) is inherited from the study to which they belong. Variables, and sometimes datasets, are linked to the appropriate sections of documents.
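The accession format described above can be illustrated with a short parser. This is a minimal sketch, not dbGaP's own tooling: the function name `parse_phenotype_accession` is hypothetical, the pattern assumes exactly six digits (read from the six `#` characters in the format), and the mapping of `v` to "variable" and `t` to "dataset" is inferred from the surrounding text. The accession used in the example is made up.

```python
import re

# Assumed reading of ph(v|t)######.v#.p#: six digits, then version and participant set.
ACCESSION_RE = re.compile(
    r"ph(?P<kind>[vt])(?P<number>\d{6})\.v(?P<version>\d+)\.p(?P<participant_set>\d+)"
)

# Inferred from the text: 'v' marks a variable, 't' a dataset (table).
KINDS = {"v": "variable", "t": "dataset"}


def parse_phenotype_accession(accession):
    """Split a variable or dataset accession into its components."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a variable or dataset accession: {accession!r}")
    return {
        "kind": KINDS[match["kind"]],
        "number": int(match["number"]),
        "version": int(match["version"]),
        "participant_set": int(match["participant_set"]),
    }


# Hypothetical accession, for illustration only.
print(parse_phenotype_accession("phv000123.v2.p1"))
```

Using `fullmatch` rather than `match` rejects strings with trailing text, so a study-level or genotype accession passed by mistake raises an error instead of being partially parsed.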
Individual-level phenotype data are available only through the dbGaP Authorized Access System.
Genotype data in dbGaP consist of individual-level genotypes and aggregated summaries, both of which are distributed through the dbGaP Authorized Access System. The types of data available include DNA variations, SNP assays, DNA methylation (epigenomics), copy number variation, and genomic/exomic sequencing. RNA data types such as expression arrays, RNA-seq, and eQTL results are also available.
Genotype data are accessioned by data type, using the general dbGaP accession format ph(g|e|a)######.v#, where ‘g’ denotes GWAS, ‘e’ expression, and ‘a’ analysis. A new version of genotype data is triggered by the addition or withdrawal of samples, a change in sample consent status, or an error correction.
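Because each change to a genotype data set yields a new version number, a user holding several accessions may want to keep only the most recent version of each object. The sketch below shows one way to do that under the stated format; the function names and the example accessions are hypothetical, and the six-digit assumption is again read from the `######` placeholder.

```python
import re

# Assumed reading of ph(g|e|a)######.v#: data-type letter, six digits, version.
GENOTYPE_RE = re.compile(r"(?P<base>ph(?P<type>[gea])\d{6})\.v(?P<version>\d+)")

# Letter codes as given in the text.
DATA_TYPES = {"g": "GWAS", "e": "expression", "a": "analysis"}


def data_type(accession):
    """Return the data type named by the accession's letter code."""
    match = GENOTYPE_RE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a genotype accession: {accession!r}")
    return DATA_TYPES[match["type"]]


def latest_versions(accessions):
    """Map each base accession to its highest-versioned full accession."""
    best = {}  # base accession -> (version, full accession)
    for accession in accessions:
        match = GENOTYPE_RE.fullmatch(accession)
        if match is None:
            raise ValueError(f"not a genotype accession: {accession!r}")
        version = int(match["version"])
        base = match["base"]
        if base not in best or version > best[base][0]:
            best[base] = (version, accession)
    return {base: full for base, (_, full) in best.items()}


# Hypothetical accessions, for illustration only.
print(data_type("pha000042.v2"))
print(latest_versions(["phg000001.v1", "phg000001.v3", "pha000042.v2"]))
```

Versions are compared as integers rather than strings, so that `v10` sorts after `v9`.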
All publicly released dbGaP studies can be queried from the search box at the top of the dbGaP homepage. Queries can be as simple as a single keyword of interest (‘cancer’) or more complex, combining search fields and Boolean operators (‘cholesterol[variable] AND phs000001’). The ‘Advanced Search’ page helps build more complex queries through a web form. As with all other NCBI resources, searches in dbGaP are performed using the Entrez search and retrieval system.
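Since dbGaP searches run on Entrez, the same query strings can in principle be sent through NCBI's E-utilities `esearch` endpoint. The sketch below only builds the request URL and does not contact the server. It assumes that dbGaP is exposed under the Entrez database name `gap`; the helper name `build_dbgap_search_url` is hypothetical, and the default `retmax` is an arbitrary choice.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_dbgap_search_url(term, retmax=20):
    """Build an esearch URL for an Entrez query against dbGaP."""
    params = {
        "db": "gap",        # assumption: Entrez database name for dbGaP
        "term": term,       # the same query syntax as the website search box
        "retmax": retmax,   # maximum number of identifiers to return
        "retmode": "json",
    }
    # urlencode percent-escapes brackets and encodes spaces as '+'.
    return ESEARCH_URL + "?" + urlencode(params)


print(build_dbgap_search_url("cholesterol[variable] AND phs000001"))
```

The resulting URL can be fetched with any HTTP client; NCBI's usage guidelines on request rates apply to automated queries.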
Once a search query has been executed and results returned, clicking on an item’s name or accession leads to a page with more detailed information about that object. This information is particularly useful to users who want to learn more about a study before deciding whether to apply for Authorized Access.