The algorithm is an iterative method in which all possible pairs of amino acids (one from each string) are set up in a 2D matrix and alignments are represented as pathways through this array. The optimum alignment is the path (or paths) connecting maximum scoring values. This approach is an example of dynamic programming. It is a global optimization process which yields a solution to the problem of pairwise alignment, meaning that we are interested in finding the best fit between only two strings. If alignment of more than two strings is of interest, the problem can be broken down into a cascade of pairwise alignments and thus solved. It must extend from the beginning to the end of both sequences to achieve the highest total score. In other words, the alignment path has to go from the bottom right corner of the matrix to the top left corner. This strategy is only suitable for aligning two closely related sequences that are of the same length. For divergent sequences or sequences with different domain structures, the approach does not produce optimal alignment.
Start the first string in the top of the third column and start the other string at the start of the third row. Fill out the rest of the column and row headers.
Next, decide how to score each individual pair of letters.
The letters may match, mismatch, or be matched to a gap (a deletion or insertion indel)):
- Match: The two letters at the current index are the same.
- Mismatch: The two letters at the current index are different.
- Indel (INsertion or DELetion): The best alignment involves one letter aligning to a gap in the other string.
Each of these scenarios is assigned a score and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores.
- Match: +1
- Mismatch or Indel: −1
Start with a zero in the second row, second column. Move through the cells row by row, calculating the score for each cell. The score is calculated by comparing the scores of the cells neighboring to the left, top or top-left (diagonal) of the cell and adding the appropriate score for match, mismatch or indel. Calculate the candidate scores for each of the three possibilities:
- The path from the top or left cell represents an indel pairing, so take the scores of the left and the top cell, and add the score for indel to each of them.
- The diagonal path represents a match/mismatch, so take the score of the top-left diagonal cell and add the score for match if the corresponding bases (letters) in the row and column are matching or the score for mismatch if they do not.
The resulting score for the cell is the highest of the three candidate scores. Given there is no ‘top’ or ‘top-left’ cells for the second row only the existing cell to the left can be used to calculate the score of each cell. Hence −1 is added for each shift to the right as this shows an indel from the previous score.
Mark a path from the cell on the bottom right back to the cell on the top left by following the direction of the arrows. From this path, the sequence is constructed by these rules:
- A diagonal arrow represents a match or mismatch, so the letter of the column and the letter of the row of the origin cell will align.
- A horizontal or vertical arrow represents an indel. Horizontal arrows will align a gap (“-“) to the letter of the row (the “side” sequence), vertical arrows will align a gap to the letter of the column (the “top” sequence).
- If there are multiple arrows to choose from, they represent a branching of the alignments. If two or more branches all belong to paths from the bottom right to the top left cell, they are equally viable alignments.