The primary hindrances to accessing genomics and bioinformatics data are the volume of information present and the level of background knowledge needed to navigate data repositories. As a result, much bioinformatics data remains inaccessible to average end users such as biologists or physicians, or those users spend inordinate amounts of time navigating these resources.
GeneDig is a new tool that has been developed as a platform for entry into publicly available genomic data with broad expansion possibilities. The genomes of any sequenced organism can be searched or browsed. The genomics navigator takes advantage of recent developments in web technologies such as HTML5 to provide the most robust and smooth user experience possible.
Previous genome browsers have avoided these technologies for the sake of compatibility. As the adoption rate of modern web browsers increases, however, it is believed that leveraging these new capabilities delivers a better user experience and can attract a larger base of users who would otherwise be discouraged from accessing genomic data.
This tool also provides built-in state management to allow users to save the exact state of the genome navigator and webpage at any time by copying the uniform resource locator so that anyone with the necessary permissions may access it at a later date.
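The idea of keeping the full viewer state in the URL can be illustrated with a minimal Python sketch. It is not GeneDig's actual implementation: the state fields (`chrom`, `start`, `end`, `tracks`), the function names, and the `example.org` address are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def state_to_url(base_url, state):
    """Serialize a (hypothetical) viewer state into a shareable URL."""
    params = {
        "chrom": state["chrom"],
        "start": state["start"],
        "end": state["end"],
        # Track names are joined into one comma-separated parameter.
        "tracks": ",".join(state["tracks"]),
    }
    return f"{base_url}?{urlencode(params)}"


def url_to_state(url):
    """Rebuild the viewer state from a URL produced by state_to_url."""
    query = parse_qs(urlparse(url).query)
    # parse_qs drops empty values, so a missing key means "no tracks".
    tracks = query.get("tracks", [""])[0]
    return {
        "chrom": query["chrom"][0],
        "start": int(query["start"][0]),
        "end": int(query["end"][0]),
        "tracks": tracks.split(",") if tracks else [],
    }


# Example: a view of a region on chr7 with two tracks switched on.
view = {"chrom": "chr7", "start": 5527000, "end": 5563000,
        "tracks": ["genes", "variants"]}
shared_link = state_to_url("https://example.org/viewer", view)
restored = url_to_state(shared_link)
```

Because everything needed to rebuild the view lives in the query string, copying the link is enough to share it. Any permission check would happen on the server when the link is opened.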
During an NGS experiment, the nucleotide sequences stored in the raw FASTQ files, or “sequence reads”, need to be mapped or aligned to the reference genome to determine where these sequences originated. Therefore, a reference genome in FASTA format is needed against which to align our sequences.
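To make the two file formats concrete, here is a minimal Python sketch that reads a FASTA reference and FASTQ reads, then looks up where a read matches exactly. The helper names are hypothetical, and the exact-match lookup is a toy stand-in only: real aligners such as BWA or Bowtie2 use genome indexes and tolerate mismatches and gaps.

```python
import io


def read_fasta(handle):
    """Return {sequence_name: sequence} from a FASTA text handle."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word after '>' as the name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def read_fastq(handle):
    """Yield (read_name, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def locate_read(reference, read_sequence):
    """Toy 'alignment': first exact match as (name, 0-based offset), else None."""
    for name, sequence in reference.items():
        offset = sequence.find(read_sequence)
        if offset != -1:
            return name, offset
    return None


# Tiny in-memory example instead of real files.
reference = read_fasta(io.StringIO(">chr1 demo\nACGT\nTTGA\n"))
for read_name, seq, _qual in read_fastq(io.StringIO("@r1\nGTTT\n+\nIIII\n")):
    hit = locate_read(reference, seq)
```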
In addition, many NGS methods require knowing where known genes or exons are present on the genome in order to quantify the number of reads aligning to different genome features, such as exons, introns, transcription start sites, etc. These analyses require reference data containing specific information about genomic coordinates of various genomic “features”, such as gene annotation files.
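As a rough illustration of how annotation coordinates are used for quantification, the sketch below parses GTF-style lines (nine tab-separated columns, 1-based inclusive coordinates) and counts how many aligned reads overlap the exons of each gene. The function names and the `(chrom, start, end)` alignment tuples are assumptions made for this example. Dedicated tools such as featureCounts or HTSeq handle strandedness, multi-mapping reads, and efficiency, none of which this sketch attempts.

```python
from collections import Counter


def parse_gtf_line(line):
    """Parse one GTF line into a dict (coordinates are 1-based, inclusive)."""
    fields = line.rstrip("\n").split("\t")
    attributes = {}
    # Column 9 looks like: gene_id "X"; transcript_id "Y";
    for item in fields[8].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "chrom": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": attributes,
    }


def count_reads_per_gene(gtf_lines, alignments, feature_type="exon"):
    """Count reads overlapping features of the given type, grouped by gene_id.

    `alignments` is an iterable of (chrom, start, end) tuples using the same
    1-based inclusive convention as the GTF (an assumption of this sketch).
    A read that overlaps several exons of one gene is counted once for it.
    """
    features = [
        parse_gtf_line(line)
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    ]
    features = [f for f in features if f["feature"] == feature_type]

    counts = Counter()
    for chrom, start, end in alignments:
        genes_hit = set()
        for f in features:
            overlaps = start <= f["end"] and end >= f["start"]
            if f["chrom"] == chrom and overlaps:
                genes_hit.add(f["attributes"].get("gene_id"))
        for gene in genes_hit:
            counts[gene] += 1
    return counts
```

The linear scan over all features keeps the example short. For real datasets an interval index (for example, features sorted by position and searched by bisection) would be the usual choice.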
To download reference data or access the genome, there are a few different sources available:
- General biological databases: Ensembl, NCBI, and UCSC
- Organism-specific biological databases: Wormbase, Flybase, etc.
- Reference data collections: Illumina’s iGenomes, one location to access genome reference data from Ensembl, UCSC and NCBI
- Local access: shared databases on FAS Odyssey cluster or HMS O2 cluster with access to genome reference data from Ensembl, UCSC and NCBI
General biological databases
Biological databases for gene expression data store genome assemblies and provide annotations regarding where the genes, transcripts, and other genomic features are located on the genome.
Genome assemblies give us the nucleotide sequence of the reference genome. The current human genome build is GRCh38/hg38, which was released in 2013 and is maintained by the Genome Reference Consortium (GRC). The biological databases usually add updated versions as soon as a stable release is available, and they also provide access to archived versions.