Bioinformatics Sequence Alignment

Gap Penalties in Sequence Alignment

Sequence alignment is central to any analysis of evolutionary relationships and to extracting functional, and even tertiary structure, information from a protein's amino acid sequence. Because related proteins conserve a certain number of their amino acid residues, the simplest way to assess the relationship between two sequences is to count the identical and similar amino acids. This is done by sequence alignment. The numbers of identical and similar residues are then compared to the total number of amino acids in the protein, which gives the percentage of sequence identity and the percentage of sequence similarity. Similar residues are those with similar chemical characteristics, such as the positively charged Lys and Arg or the hydrophobic Leu and Val. Substituting an amino acid with a chemically equivalent one often has no dramatic effect on the structure or function of the protein. To count identities and similarities consistently, rules describing how an alignment is to be performed have been developed.
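The counting described above can be sketched in a few lines of Python. This is only an illustration: the similarity groups, the function names and the example sequences are assumptions made here, and counting identical residues as part of the similarity percentage is one common convention rather than the only one. Real tools usually derive similarity from a substitution matrix instead of fixed groups.

```python
# Hypothetical similarity groups, chosen only for illustration.
SIMILAR_GROUPS = [
    set("KRH"),   # positively charged (e.g. Lys, Arg)
    set("DE"),    # negatively charged
    set("LIVM"),  # hydrophobic (e.g. Leu, Val)
    set("FWY"),   # aromatic
    set("ST"),    # hydroxyl-containing
    set("NQ"),    # amide-containing
]


def are_similar(a, b):
    """True if two different residues fall into the same group above."""
    return a != b and any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(seq1, seq2):
    """Return (% identity, % similarity) for two equal-length, ungapped sequences.

    Assumption: similarity counts identical plus similar positions.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(seq1, seq2))
    identical = sum(a == b for a, b in pairs)
    similar = sum(are_similar(a, b) for a, b in pairs)
    total = len(pairs)
    return 100.0 * identical / total, 100.0 * (identical + similar) / total


# Made-up example sequences: 3 identical and 2 similar positions out of 6.
identity, similarity = identity_and_similarity("MKLVDE", "MRLIDQ")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

In the made-up pair, K/R and V/I count as similar, while E/Q does not under these particular groups.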

Additional factors to take into account when analyzing sequences are insertions and deletions. When comparing the sequences of members of a protein family, it is quite common to find extra residues (insertions) or missing residues (deletions) at some positions in some of the sequences. Sometimes even larger parts, or a whole domain, may be inserted into or deleted from a protein. Depending on how we handle these insertions and deletions, different sequence alignments may be generated. 

In an alignment, the two sequences are written one above the other. The amino acid residues that are identical in the two sequences are marked in the third row by their names, while positions where the residues differ are marked by x. Certain positions are marked by a dash. The percentage of identity for the alignment is simply the number of identical positions expressed as a percentage of the total number of residues.

Score: S = number of matches – number of mismatches
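As a rough sketch of the marking and scoring just described, the Python snippet below builds the third row for two already aligned rows and computes S. The function names and the example rows are invented for illustration, and treating gap positions as neither a match nor a mismatch is an assumption of this sketch.

```python
def mark_alignment(row1, row2, gap="-"):
    """Build the third row: residue name if identical, 'x' if different, '-' at a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    marks = []
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            marks.append(gap)   # position involves a gap
        elif a == b:
            marks.append(a)     # identical residues
        else:
            marks.append("x")   # different residues
    return "".join(marks)


def simple_score(row1, row2, gap="-"):
    """Return (matches, mismatches, S) with S = matches - mismatches.

    Assumption: gap positions count as neither a match nor a mismatch.
    """
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            continue
        if a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, matches - mismatches


# Made-up aligned rows; the first one contains a gap.
row1 = "MK-LVDE"
row2 = "MKALIDE"
print(row1)
print(row2)
print(mark_alignment(row1, row2))
print(simple_score(row1, row2))
```

For these rows there are five matches and one mismatch, so S = 4.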

To account for an insertion or deletion we use a gap (marked by a dash in the first sequence). A gap in one of the sequences simply means that one or more amino acid residues have been deleted from that sequence; equivalently, we could say that there is an insertion in the second sequence. Introducing gaps raises several questions: 

  • How many gaps can we introduce? 
  • How to decide where to place them? 
  • How long can they be? 

A badly placed gap may result in a totally meaningless model. Normally, when we run sequence alignment software, we notice that the number of gaps is limited. 

What limits them are gap penalties. Each time the program introduces a gap, it incurs a penalty, which reduces the total score of the alignment. The gap penalty is a parameter that can be changed each time an alignment is run; changing it affects the number of gaps, their length and their position in the sequence alignment.
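To make the effect of the gap penalty concrete, here is a minimal sketch of a global alignment by dynamic programming (the Needleman-Wunsch approach). It assumes a linear gap penalty, where every gap position costs the same, and a plain match/mismatch score; the function name, parameter names, default values and example sequences are all chosen here for illustration. Real alignment programs typically use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap_penalty=2):
    """Globally align s1 and s2; return (aligned_s1, aligned_s2, score).

    Assumption: linear gap penalty, i.e. each gap position costs gap_penalty.
    """
    n, m = len(s1), len(s2)
    # score[i][j] is the best score for aligning s1[:i] with s2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap_penalty * i
    for j in range(1, m + 1):
        score[0][j] = -gap_penalty * j

    def pair(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] - gap_penalty,     # gap in s2
                score[i][j - 1] - gap_penalty,     # gap in s1
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), score[n][m]


# Made-up sequences: the same alignment run with two different gap penalties.
for penalty in (0, 10):
    top, bottom, total = global_align("MKLVDE", "KLVDEM", gap_penalty=penalty)
    print(f"gap penalty {penalty}: score {total}")
    print(top)
    print(bottom)
```

With these inputs, a gap penalty of 0 produces an alignment containing two gap characters, while a penalty of 10 produces an alignment with none, because any gapped alignment then scores worse than aligning the sequences position by position.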

