WATER is a tool for local pairwise sequence alignment using a modified Smith-Waterman algorithm for speed enhancements. Sequence pairs should be provided in either GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or UniProtKB/Swiss-Prot format. The gap insertion penalty, gap extension penalty and substitution matrix used to calculate the alignments are specified. The output is a standard EMBOSS alignment file.
Smith-Waterman algorithm is the variation of Needleman–Wunsch algorithm and it is a dynamic programming method for local alignment. WATER can find similarity for both nucleotides and for proteins. The main restriction with WATER is that it can find and align only local best matching areas.
The Smith-Waterman algorithm is a member of the class of algorithms that can calculate the best score and local alignment in the order of mn steps, where n and m are the lengths of the two sequences. These dynamic programming algorithms were first developed for protein sequence comparison by Smith and Waterman, though similar methods were independently devised during the late 1960’s and early 1970’s for use in the fields of speech processing and computer science.
Dynamic programming methods ensure the optimal local alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. WATER finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.
Smith-Waterman algorithm Determine the substitution matrix and the gap penalty scheme, initialize the scoring matrix, core each element from left to right, top to bottom in the matrix, considering the outcomes of substitutions (diagonal scores) or adding gaps and then starting at the element with the highest score, traceback based on the source of each score recursively, until 0 is encountered.
An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the ‘gap open’ penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the ‘gap extension’ penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.
- There are two ways to compute a penalty for a gap of n positions : gap opening penalty + (n – 1) * gap extension penalty gap penalty
- + n * gap length penalty
The two methods are basically equivalent. The first way is used by EMBOSS and WU-BLAST. The second way is used by NCBI-BLAST, GCG, Staden and CLUSTAL. Fasta used it for a long time the first way, but Prof. Pearson decided recently to shift to the second.