Bioinformatics Sequence Alignment

Global/Local/Pairwise Alignment

Pinterest LinkedIn Tumblr

Pairwise Sequence Alignment is used to identify regions of similarity that may infer functional, structural or evolutionary relationships between two biological sequences (protein or nucleic acid).

There are two types of pairwise alignments,

  1. A local alignment is an alignment of two sub-regions of a pair of sequences. This type of alignment is appropriate when aligning two segments of genomic DNA that may have local regions of similarity embedded in a background of a non-homologous sequence and are of variable length.
  2. A global alignment is a sequence alignment over the entire length of two or more nucleic acid or protein sequences. In a global alignment, the sequences are assumed to be homologous along their entire length.

In order to align a pair of sequences, a scoring system is required to score matches and mismatches. The scoring system can be as simple as “+1” for a match and “-1” for a mismatch between the pair of sequences at any given site of comparison. However substitutions, insertions and deletions occur at different rates over evolutionary time. This variation in rates is the result of a large number of factors, including the mutation process, genetic drift and natural selection. PAM and BLOSUM matrices are used most commonly.

Algorithms for pairwise alignments

Once a scoring system has been chosen, we need an algorithm to find the optimal alignment of two sequences. This is done by inserting gaps in order to maximize the alignment score. If the sequences are related along their entire sequence, a global alignment is appropriate. However, if the relatedness of the sequences is unknown or they are expected to share only small regions of similarity, (such as a common domain) then a local alignment is more appropriate.

An efficient algorithm for global alignment was described by Needleman and Wunsch 1970, and their algorithms were later extended by Gotoh 1982 to model gaps more accurately. For local alignments, the Smith-Waterman algorithm is the most commonly used. 

The Needleman-Wunsch ( i.e Needle (EMBOSS), Stretcher (EMBOSS) ) and Smith-Waterman algorithms ( i.e Water (EMBOSS), Matcher (EMBOSS) and LALIGN ), types of Dynamic programing, will always find the best alignment between two sequences, whether or not they are evolutionarily related.

Steps in Dynamic programing are;

1.Initialization of the matrix with the scores possible
2.Matrix filling

3.Trace back the residues for appropriate alignment

Initialization Step

First row and first column of the matrix can be initially filled with 0. If the gap score is assumed, the gap score can be added to the previous cell of the row or column.

Matrix Fill Step

It starts from the upper left hand corner of the matrix. To find the maximum score of each cell, it is required to know the neighbouring scores (diagonal, left and right) of the current position. From the assumed values, add the match or mismatch (assumed) score to the diagonal value. Similarly add the gap score to the other neighbouring values. Thus, we can obtain three different values, from that take the maximum among them and fill the ith and jth position with the score obtained.

Trace back Step

The final step for the appropriate alignment is trace backing, firstly, one needs to find out the maximum score obtained in the entire matrix for the local alignment of the sequences. It is possible that the maximum scores can be present in more than one cell, in that case there may be possibility of two or more alignments, and the best alignment by scoring it (match =5, mismatch=-1, gap=-2) which may be user defined. By continuing the trace back step by this method, one would reach the 0th row, 0th column. 

Write A Comment