Genome Research requires that the data from a publication be readily available to the broader community: in publicly held databases when such databases exist, and otherwise at the Genome Research Web site and, if desired, at the author’s Web site.
There are many different algorithms for searching sequence databases, but BLAST algorithms are some of the most popular, because of their speed. The key to BLAST’s speed is its use of local alignments that serve as seeds for more extensive alignments. In fact, BLAST is an acronym for Basic Local Alignment Search Tool.
BLAST searches begin with a query sequence that will be matched against sequence databases specified by the user. As the algorithms work through the data, they compute the probability that each potential match may have arisen by chance alone, which would not be consistent with an evolutionary relationship. BLAST algorithms begin by breaking down the query sequence into a series of short overlapping “words” and assigning numerical values to the words. Words above a threshold value for statistical significance are then used to search databases. The default word size for BLASTN is 28 nucleotides. Because there are only four possible nucleotides in DNA, a sequence of this length would be expected to occur randomly once in every 4^28, or roughly 10^17, nucleotides, which is far longer than any genome. The default word size for BLASTP is three amino acids. Because proteins contain 20 different amino acids, a tripeptide sequence would be expected to arise randomly once in every 20^3, or 8000, tripeptides, which is longer than any protein.
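The word-size arithmetic above can be checked with a short calculation. This is a minimal sketch that assumes every letter of the alphabet is equally likely and independent of its neighbors, which real sequences only approximate; the function name `expected_spacing` is chosen here for illustration.

```python
def expected_spacing(alphabet_size: int, word_size: int) -> int:
    """Number of distinct words of this size, i.e. the expected spacing
    between chance occurrences of one specific word (uniform model)."""
    return alphabet_size ** word_size


blastn = expected_spacing(4, 28)   # 4**28 = 72057594037927936, about 7.2e16
blastp = expected_spacing(20, 3)   # 20**3 = 8000

print(f"BLASTN default word: 1 in {blastn:.1e}")
print(f"BLASTP default word: 1 in {blastp}")
```

The nucleotide figure comes out slightly below 10^17, so "10^17" should be read as an order-of-magnitude estimate.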
- Query sequence is broken into words that will act as seeds in alignments.
- BLAST searches for matches in target entries in the Database.
- If a target entry has two or more matches to words from the query, the alignment is extended in both directions looking for additional similarity.
BLASTN and BLASTP use a rolling window to break down a query sequence into words and word synonyms that form a search set. At least two words or synonyms in the search set must match a target sequence in the database for that sequence to be reported in the results.
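The rolling window and the two-match rule can be sketched as follows. This is a simplified illustration, not BLAST's implementation: it matches exact words only and ignores word synonyms, scoring, and the position of matches relative to each other. The function names and the example sequences are made up for this sketch.

```python
def words(sequence: str, w: int) -> list[str]:
    """Slide a window of width w along the sequence, one position at a time."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def shared_words(query: str, target: str, w: int) -> list[str]:
    """Query words that also occur somewhere in the target (exact matches only)."""
    target_words = set(words(target, w))
    return [word for word in words(query, w) if word in target_words]


def is_candidate(query: str, target: str, w: int) -> bool:
    """Apply the 'at least two matching words' rule described above."""
    return len(shared_words(query, target, w)) >= 2


print(words("MKVLAT", 3))                     # ['MKV', 'KVL', 'VLA', 'LAT']
print(shared_words("MKVLAT", "AAKVLAG", 3))   # ['KVL', 'VLA']
print(is_candidate("MKVLAT", "AAKVLAG", 3))   # True
```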
SSAHA (Sequence Search and Alignment by Hashing Algorithm) is an algorithm for performing fast searches on genome databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the “hits” for each k-tuple in the query sequence and then performing a sort on the results. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
Constructing the Hash Table
The first stage of the algorithm is to convert the sequence database into a hash table. The hash table is stored in memory as two data structures: a list of positions L and an array A of pointers into L.
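One way to picture the position list L and the pointer array A is the sketch below. It is an illustration under stated assumptions rather than the SSAHA source code: the integer encoding (A=0, C=1, G=2, T=3), the function names, and the extra final entry in A (so that each k-tuple's positions are the slice `L[A[key]:A[key + 1]]`) are choices made here. Sequences are assumed to contain only A, C, G, and T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit code per base


def encode(ktuple: str) -> int:
    """Map a k-tuple to an integer in the range 0 .. 4**k - 1."""
    value = 0
    for base in ktuple:
        value = value * 4 + CODE[base]
    return value


def build_hash_table(sequences: list[str], k: int):
    """Return (A, L): L lists (sequence index, offset) grouped by k-tuple,
    and A[key] points at the first entry in L for that k-tuple."""
    n_keys = 4 ** k
    counts = [0] * n_keys
    hits = []
    for seq_index, seq in enumerate(sequences):
        # Consecutive, non-overlapping k-tuples: offsets 0, k, 2k, ...
        for offset in range(0, len(seq) - k + 1, k):
            key = encode(seq[offset:offset + k])
            hits.append((key, seq_index, offset))
            counts[key] += 1

    A = [0] * (n_keys + 1)
    for key in range(n_keys):
        A[key + 1] = A[key] + counts[key]

    L = [None] * len(hits)
    cursor = A[:n_keys]                    # next free slot for each key
    for key, seq_index, offset in hits:
        L[cursor[key]] = (seq_index, offset)
        cursor[key] += 1
    return A, L


A, L = build_hash_table(["ACGTAC"], k=2)
key = encode("AC")
print(L[A[key]:A[key + 1]])   # [(0, 0), (0, 4)]
```

Storing all positions in one flat list with an index array keeps lookups to two array reads per k-tuple, at the cost of an array with 4^k entries.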
To search for a query sequence Q, we proceed base by base along Q from base 0 to base n−k, where n is the length of Q.
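The base-by-base scan along Q, followed by a sort of the collected hits, might look like the sketch below. To keep it short, a Python dictionary stands in for the hash table, and each hit is recorded as (sequence index, shift, offset), where shift is the database offset minus the query position; these names and the run-counting step are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby


def build_table(sequences: list[str], k: int) -> dict:
    """Index the non-overlapping k-tuples of each database sequence."""
    table = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(0, len(seq) - k + 1, k):
            table[seq[j:j + k]].append((i, j))
    return table


def search(query: str, table: dict, k: int) -> list[tuple]:
    """Collect hits for every k-tuple of the query, then sort them."""
    hits = []
    n = len(query)
    for t in range(n - k + 1):                 # base 0 to base n-k
        for i, j in table.get(query[t:t + k], ()):
            hits.append((i, j - t, j))         # (index, shift, offset)
    hits.sort()                                # equal index and shift become adjacent
    return hits


def best_run(hits: list[tuple]):
    """Longest group of hits sharing the same sequence index and shift."""
    runs = [(len(list(group)), key)
            for key, group in groupby(hits, key=lambda h: h[:2])]
    return max(runs) if runs else None


table = build_table(["GGACGTACGTTT"], k=2)
hits = search("ACGTAC", table, k=2)
print(best_run(hits))   # (3, (0, 2)): three hits in sequence 0 at shift 2
```

Hits that belong to one ungapped match share the same shift, which is why sorting brings them together.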
Clearly, we also require storage for the query sequences, although this can be kept to a minimum by loading in query sequences from disk in batches and using 2-bits-per-base encoding. In practice, we have found housekeeping functions, temporary storage, and the like add ∼10%–20% to the total RAM usage.
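The 2-bits-per-base encoding mentioned above can be illustrated as follows. This is a minimal sketch assuming the code A=0, C=1, G=2, T=3 and four bases per byte; it does not handle ambiguity codes such as N, and the function names are chosen for this example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit code per base
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))


packed = pack("ACGTTGCA")
print(len(packed))          # 2 bytes for 8 bases
print(unpack(packed, 8))    # ACGTTGCA
```

The sequence length has to be stored separately, because padding bits in the last byte are indistinguishable from the base A.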
The CPU time required for a search may be divided into two portions: the time T_hash required to generate the hash table and the time T_search required for the search itself.
It is easy to see that the SSAHA algorithm will under no circumstances detect a match of less than k consecutive matching base pairs between query and subject, and almost as easy to see that, in fact, we require 2k−1 consecutive matching bases to guarantee that the algorithm will register a hit at some point in the matching region. In comparison, with their default settings FASTA, BLAST, and MegaBLAST require at least 6, 12, and 30 base pairs, respectively, to register a match.
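The 2k−1 bound can be checked by brute force. The sketch below assumes, as described earlier, that the database is indexed by non-overlapping k-tuples starting at offsets 0, k, 2k, and so on; a matching region is detected only if it fully contains at least one of those indexed k-tuples. The function names are illustrative.

```python
def contains_indexed_ktuple(start: int, length: int, k: int) -> bool:
    """True if the region [start, start + length) fully contains a k-tuple
    beginning at a multiple of k."""
    first = -(-start // k) * k        # first multiple of k at or after start
    return first + k <= start + length


def min_guaranteed_length(k: int) -> int:
    """Shortest match length that is detected for every possible start."""
    m = k
    while not all(contains_indexed_ktuple(s, m, k) for s in range(k)):
        m += 1
    return m


for k in (4, 8, 12):
    print(k, min_guaranteed_length(k))   # 4 7, 8 15, 12 23
```

The worst case is a match that starts one base after an indexed k-tuple boundary: it must run k−1 bases to reach the next boundary and then k more to cover a full k-tuple.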
Our company, BioinfoLytics, is a project affiliated with BioCode. It covers many topics in genomics and proteomics and their analysis with a range of tools, including Sequence Alignment & Analysis, Bioinformatics Scripting & Software Development, Phylogenetic and Phylogenomic Analysis, Functional Analysis, Biological Data Analysis & Visualization, Custom Analysis, Biological Database Analysis, Molecular Docking, Protein Structure Prediction, and Molecular Dynamics. These services are intended to help BioCode learners develop their interests, meet their requirements, and obtain their desired results. We provide a platform where one can find opportunities to learn, have research projects analyzed, and get help and knowledge in molecular, computational, and analytical biology.
We offer a “Genome Database Searching” service to customers who want to study genome database searching, with the aim of supporting high-quality research and advancing science in the domain of genome analysis.