Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities. Base pair probabilities are estimated from a distribution similar to the Boltzmann distribution. The partition function is calculated using a dynamic programming approach.
It performs statistically significantly better than the leading alignment programs Probcons, MAFFT and MUSCLE on the BAliBASE, HOMSTRAD and OXBENCH benchmarks. Probalign's improvements are largest on datasets containing N/C terminal extensions and on datasets with long sequences of heterogeneous length. On heterogeneous length datasets containing repeats, Probalign alignment accuracy is 10% and 15% higher than that of the other three methods when the standard deviation of length is at least 300 and 400, respectively.
The maximum expected accuracy optimization criterion for multiple sequence alignment uses pairwise posterior probabilities of residues to align sequences. The partition function methodology is one way of estimating these probabilities.
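The pairwise case of this criterion can be illustrated with a small dynamic program. The following is a minimal sketch, not Probalign's implementation: it assumes a precomputed posterior matrix `post` (a hypothetical input name) in which `post[i, j]` is the probability that residue i of the first sequence aligns with residue j of the second, and it returns the unnormalized sum of posteriors over the chosen pairs. Some formulations divide this sum by the length of the shorter sequence.

```python
import numpy as np

def mea_alignment(post):
    """Return (score, pairs) maximizing the summed posterior of aligned pairs.

    post[i, j] is assumed to be the posterior probability that residue i of
    sequence x is aligned with residue j of sequence y (illustrative input).
    """
    m, n = post.shape
    # best[i, j] = maximum summed posterior over alignments of x[:i] and y[:j]
    best = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i, j] = max(
                best[i - 1, j - 1] + post[i - 1, j - 1],  # align x_i with y_j
                best[i - 1, j],                            # x_i opposite a gap
                best[i, j - 1],                            # y_j opposite a gap
            )

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if best[i, j] == best[i - 1, j - 1] + post[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i, j] == best[i - 1, j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return float(best[m, n]), pairs
```

Gaps carry no penalty in this recursion because the objective rewards only aligned residue pairs; gap costs have already influenced the posteriors themselves.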
Probalign is freely available, comes with gap penalties optimized for standard protein and RNA alignment benchmarks, and is considerably faster than earlier ones. In practice, Probalign outperforms existing programs by large margins when the data contains sequences of varying lengths. It is therefore particularly suitable for protein and RNA datasets in which sequence length variation is high.
The Probalign web server, also called eProbalign, provides a useful tool for eliminating poorly aligned columns. Probalign introduced a number of new approaches for constructing a multiple alignment with posterior probabilities for all pairs of sequences. It first performs a probabilistic consistency transformation to improve posterior probabilities with the help of a third sequence. It then adapts three standard approaches in multiple sequence alignment, namely construction of a guide-tree, progressive alignment, and iterative refinement, to the expected accuracy alignment approach.
The guide-tree construction is similar to UPGMA except that expected accuracies are used to measure distance between clusters. Profile-profile alignment, another standard technique in multiple sequence alignment, is extended to incorporate expected accuracies, which facilitates the progressive and iterative alignment strategies.
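The consistency transformation mentioned above can be sketched as a sum of matrix products. This is a hedged illustration that assumes the commonly described form in which the new posterior for a pair (x, y) is the average, over every sequence z in the set, of the product of the x-z and z-y posterior matrices, with the identity matrix used when z is x or y. The text does not give the exact formula, so the averaging, the dictionary layout `post[(a, b)]`, and the function name are all assumptions.

```python
import numpy as np

def consistency_transform(post):
    """Apply one round of a probabilistic consistency transformation.

    post maps an ordered pair of sequence ids (a, b) to a posterior matrix
    of shape (len_a, len_b). Every pair of sequences is assumed to appear
    exactly once, in either orientation (hypothetical data layout).
    """
    # Recover each sequence length from the matrix shapes.
    lengths = {}
    for (a, b), p in post.items():
        lengths[a], lengths[b] = p.shape
    seqs = sorted(lengths)

    def lookup(a, b):
        if a == b:
            # A sequence aligns to itself residue by residue.
            return np.eye(lengths[a])
        return post[(a, b)] if (a, b) in post else post[(b, a)].T

    new_post = {}
    for (x, y) in post:
        # Route the x-y evidence through every sequence z, including x and y.
        total = sum(lookup(x, z) @ lookup(z, y) for z in seqs)
        new_post[(x, y)] = total / len(seqs)
    return new_post
```

In this sketch, a residue pair whose direct posterior is low can still gain support when both residues align confidently to the same residue of a third sequence.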
Posterior probabilities for expected accuracy sequence alignment
The expected accuracy of an alignment is based upon the posterior probabilities of aligning residues in two sequences. To generate the ensemble of alignments from which these probabilities are derived, Probcons uses hidden Markov models while Probalign uses the partition function of sequence alignments.
In the hidden Markov model, the terms δ and ε denote the transition probabilities for gap opening and gap extension, respectively. The probability of a sequence alignment under this model is well-defined, and the alignment with the highest probability can be found with the Viterbi algorithm.
Amino acid scoring matrices that are normally used for sequence alignment are represented as log-odds scoring matrices as defined by Dayhoff. The commonly used sum-of-pairs score of an alignment is defined as the sum of the scores of residue-residue pairs and residue-gap pairs under an affine penalty scheme. The alignment partition function can be computed using recursions similar to the Needleman-Wunsch dynamic programming algorithm.
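To make the recursion concrete, the following is a minimal sketch of a partition function computation and the posteriors derived from it. It deliberately simplifies the scheme described above: it uses a linear gap penalty rather than an affine one, so a single matrix suffices and adjacent insertions and deletions in different orders are counted as distinct alignments. The function name, the `score` callback, the `gap` value, and the temperature-like parameter `T` are illustrative assumptions, not Probalign's actual parameters.

```python
import math

def partition_posteriors(x, y, score, gap=-1.0, T=1.0):
    """Return (Z, post) for sequences x and y under a linear gap penalty.

    Z is the sum of exp(alignment_score / T) over all global alignments;
    post[i][j] is the total weight of alignments pairing x[i] with y[j],
    divided by Z.
    """
    m, n = len(x), len(y)
    wg = math.exp(gap / T)  # Boltzmann weight of one gap column
    w = [[math.exp(score(a, b) / T) for b in y] for a in x]

    # Forward pass: fwd[i][j] sums the weights of all alignments of x[:i], y[:j].
    fwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    fwd[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                fwd[i][j] += fwd[i - 1][j - 1] * w[i - 1][j - 1]
            if i > 0:
                fwd[i][j] += fwd[i - 1][j] * wg
            if j > 0:
                fwd[i][j] += fwd[i][j - 1] * wg

    # Backward pass: bwd[i][j] sums the weights of all alignments of x[i:], y[j:].
    bwd = [[0.0] * (n + 1) for _ in range(m + 1)]
    bwd[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i < m and j < n:
                bwd[i][j] += bwd[i + 1][j + 1] * w[i][j]
            if i < m:
                bwd[i][j] += bwd[i + 1][j] * wg
            if j < n:
                bwd[i][j] += bwd[i][j + 1] * wg

    Z = fwd[m][n]
    # Weight of everything before the pair, the pair itself, and everything after.
    post = [[fwd[i][j] * w[i][j] * bwd[i + 1][j + 1] / Z for j in range(n)]
            for i in range(m)]
    return Z, post
```

The structure mirrors Needleman-Wunsch with the maximum replaced by a sum and the added scores replaced by multiplied exponential weights. An affine version would keep separate matrices for the match and the two gap states.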