Alignment algorithms, both global and local, are fundamentally similar and only differ in the optimization strategy used in aligning similar residues. Both types of algorithms can be based on one of the three methods: the dot matrix method, the dynamic programming method and the word method.
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming.
These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.
Word methods identify a series of short, non-overlapping subsequences (“words”) in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found through a number of web portals, such as EMBL FASTA and NCBI BLAST.
Working of BLAST Algorithm:
- Query sequence is taken and analyzed for low complex regions. Low complexity regions are regions which contain less information or variations like AAAAAAAA or ATATATATA etc.
- These low complex regions are masked with alphabet s like X or N
- List of words of a certain word size is made. Usually the word size is 3 for proteins and 11 for DNA
- Scores are calculated for each pair of words(query sequence word and database word) using substitution scoring matrices (like PAM or BLOSUM),and only the high scoring words i.e. above a threshold value or a cutoff score is taken for further alignment. A cutoff score is selected to reduce the number of matches so as to decrease the computation time.
- This scoring and checking is repeated for all the words in the query sequence.
- The remaining high-scoring words are organized into efficient search tree and rapidly compared to the database sequence. This is done to find out the exact matches.
- If an exact or good match is found then an alignment is extended in both directions from the position where the exact match occurred
- High scoring pairs (HSP) which have scores greater than a threshold are taken for consideration.
- Significance of the HSP score is calculated.