Algorithms for sequence database searching must meet three requirements. The first criterion is sensitivity, which refers to the ability to find as many correct hits as possible. It is measured by the extent to which correctly identified members of the same sequence family are included in the results. These correct hits are considered “true positives” in the database searching exercise. The second criterion is selectivity, also called specificity, which refers to the ability to exclude incorrect hits. These incorrect hits are unrelated sequences mistakenly identified in database searching and are considered “false positives.” The third criterion is speed, which is the time it takes to get results from database searches. Depending on the size of the database, speed can sometimes be the primary concern. Ideally, one wants the greatest sensitivity, selectivity, and speed in database searches, but in practice a compromise among the three criteria often has to be made.
In database searching, as in many other areas of bioinformatics, there are two fundamental types of algorithms. One is the exhaustive type, which uses a rigorous algorithm to find the best or exact solution to a particular problem by examining all mathematical combinations. Dynamic programming is an example of the exhaustive method and is computationally very intensive. The other is the heuristic type, which is a computational strategy for finding an empirical or near-optimal solution by using rules of thumb. Essentially, this type of algorithm takes shortcuts by reducing the search space according to some criteria.
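The sensitivity and selectivity criteria described above can be made concrete with a small calculation. The sketch below assumes the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function name `search_metrics` and the toy sequence identifiers are hypothetical and only serve as illustration.

```python
def search_metrics(reported, true_family, database):
    """Return (sensitivity, specificity) for one database search.

    reported    -- identifiers the search returned as hits
    true_family -- identifiers that really belong to the query's family
    database    -- all identifiers in the searched database
    """
    reported, true_family, database = set(reported), set(true_family), set(database)

    tp = len(reported & true_family)             # correct hits (true positives)
    fp = len(reported - true_family)             # unrelated sequences reported (false positives)
    fn = len(true_family - reported)             # family members that were missed
    tn = len(database - reported - true_family)  # unrelated sequences correctly excluded

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy data (assumed for illustration): 10 database entries, 4 true family members.
database = {f"seq{i}" for i in range(1, 11)}
true_family = {"seq1", "seq2", "seq3", "seq4"}
reported = {"seq1", "seq2", "seq3", "seq7"}  # one member missed, one false positive

sens, spec = search_metrics(reported, true_family, database)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this toy case the search finds three of the four family members and reports one unrelated sequence, so neither criterion is perfect, which is the kind of trade-off described above.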
Database searching is the matching of query nucleotide or protein sequences against database sequences. To do this, the query sequence is aligned with each database sequence to detect similarity between them. In this sense, database searching applies knowledge gained from previous biological experiments to the gene discovery problem.
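As a minimal sketch of this idea, the example below scores a query against every entry of a toy database with a Smith-Waterman-style local alignment, which is an exhaustive dynamic programming method, and ranks the entries by score. The scoring values (match 2, mismatch -1, linear gap -2), the function names, and the sequences are assumptions chosen for illustration, not parameters of any particular search program.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every database sequence; highest score first."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Toy nucleotide database (hypothetical sequences).
database = {
    "hit_exact": "TTACGTACGTAA",    # contains the query exactly
    "hit_partial": "GGACGTTCGTCC",  # contains the query with one substitution
    "unrelated": "GGGGCCCCGGGG",
}
results = search_database("ACGTACGT", database)
for score, name in results:
    print(name, score)
```

Because every cell of the matrix is filled for every database entry, the cost grows with the product of query length and total database length, which is why heuristic programs restrict the search space before aligning.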
The most obvious first step in the analysis of any new sequence is to compare it with sequence databases to find homologues. These searches can now be performed almost anywhere and on almost any computer. In addition, many web servers offer searches in which one can paste a sequence into a form and receive the results interactively.
There are many methods for sequence searching. By far the best known is the BLAST suite of programs. Versions can easily be obtained to run locally, and many web pages allow one to compare a protein or DNA sequence against a multitude of gene and protein sequence databases, for example:
- National Center for Biotechnology Information (USA) Searches
- European Bioinformatics Institute (UK) Searches
- BLAST search through SBASE (domain database; ICGEB, Trieste)
Other methods for comparing a single sequence to a database include:
- The FASTA suite (William Pearson, University of Virginia, USA)
- SCANPS (Geoff Barton, European Bioinformatics Institute, UK)
- BLITZ (Compugen’s fast Smith-Waterman search)