# Statistical Parameters in Sequence Alignment

The distribution of optimal local alignment scores of random sequences plays an important role in evaluating the statistical significance of sequence alignments. These scores are well described by an extreme-value distribution. The distribution’s parameters depend on the scoring system used and the random letter frequencies; in general they cannot be derived analytically and must instead be estimated by curve fitting.
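As a rough illustration of estimating extreme-value parameters from a sample of scores, the sketch below fits a Gumbel distribution by the method of moments, a simpler stand-in for curve fitting proper. The function names, the synthetic sample, and the values μ = 30 and λ = 0.3 are assumptions made for this example; in practice the scores would come from aligning random sequences.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate (mu, lam) of a Gumbel distribution by the method of moments.

    Uses mean = mu + gamma / lam and variance = pi^2 / (6 * lam^2).
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6))
    mu = mean - EULER_GAMMA / lam
    return mu, lam


def sample_gumbel(mu, lam, rng):
    """Draw one Gumbel variate by inverting F(x) = exp(-exp(-lam * (x - mu)))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return mu - math.log(-math.log(u)) / lam


# Synthetic "optimal scores" standing in for scores of random alignments.
rng = random.Random(0)
scores = [sample_gumbel(30.0, 0.3, rng) for _ in range(20000)]
mu_hat, lam_hat = fit_gumbel_moments(scores)
print(f"mu ~ {mu_hat:.2f}, lambda ~ {lam_hat:.3f}")
```

The method of moments is sensitive to outliers in the upper tail, so a maximum-likelihood or least-squares fit may be preferable for real score samples.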

If a dynamic programming computation has probabilistically independent inputs, its successive states form a Markov chain. Thus, if the number of states is not too large, a “Markov computation” produces their distribution. Moreover, the concept of “islands” allows many statistical approximation problems in local alignment to be transformed into combinatorial problems.

Two popular measures for scoring entire multiple alignments are the sum of pairs (SP) score and the column score (CS). Both can, however, only be used if a reference alignment of the same sequences is available. The SP score is the proportion of residue pairs that are aligned identically in the test and the reference alignments, whereas the CS score is the fraction of columns (positions) that are aligned identically. The APDB (Analyze alignments with PDB) quality measure evaluates an alignment by using available tertiary structures of the aligned sequences. The more recently introduced multiple overlap score (MOS) does not need a reference alignment. The MOS searches for identically aligned regions across many alignments of the same sequences and presumes that the alignment with the highest number of residues in such regions also has the highest quality.
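The SP and CS definitions above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes alignments are given as lists of equal-length gapped strings with the sequences in the same order, that `-` marks a gap, and that both scores are normalized by the reference alignment. All function names are hypothetical.

```python
from itertools import combinations


def column_indices(alignment, gap="-"):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, char in enumerate(col):
            if char == gap:
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in columns:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_score(test_aln, ref_aln):
    """Fraction of reference residue pairs that are also aligned in the test."""
    ref_pairs = aligned_pairs(column_indices(ref_aln))
    test_pairs = aligned_pairs(column_indices(test_aln))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def column_score(test_aln, ref_aln):
    """Fraction of reference columns reproduced exactly in the test."""
    ref_cols = column_indices(ref_aln)
    test_cols = set(column_indices(test_aln))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


reference = ["AC-GT", "ACAGT"]
candidate = ["ACG-T", "ACAGT"]
print(sp_score(candidate, reference))      # 0.75
print(column_score(candidate, reference))  # 0.6
```

The sketch does not handle a reference without any aligned residue pairs, which would cause a division by zero in `sp_score`.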

The BLAST E-value is the number of hits of similar or better quality (score) that are expected to be found just by chance. An E-value of 10 means that up to 10 such hits can be expected by chance in a random database of the same size. The E-value can be used as a first quality filter for BLAST search results: only hits with an E-value equal to or better (lower) than the threshold given by the E-value option are reported. BLAST results are sorted by E-value by default (best hit in the first line).

The E-value (expectation value) is derived from the bit score and adjusted for the size of the sequence database, so it depends on the database that was searched. Since large databases increase the chance of false positive hits, the E-value corrects for this higher chance; it is a correction for multiple comparisons. This means that the same hit receives a better (lower) E-value when found in a smaller database.

Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda.
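The way K and lambda turn a raw score into a bit score and an E-value can be sketched with the standard Karlin-Altschul relations, E = K·m·n·e^(−λS) and S′ = (λS − ln K) / ln 2, so that E = m·n·2^(−S′). The parameter values below are placeholders chosen for illustration, and the plain query and database lengths are a simplification: BLAST itself applies length corrections that are not modeled here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits: E = K * m * n * exp(-lam * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value computed from the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder parameters (assumed for illustration only).
LAM, K = 0.267, 0.041
raw = 80
bits = bit_score(raw, LAM, K)
small_db = e_value(raw, LAM, K, query_len=300, db_len=1e6)
large_db = e_value(raw, LAM, K, query_len=300, db_len=1e9)
print(f"bit score: {bits:.1f}")
print(f"E-value, small database: {small_db:.3g}")
print(f"E-value, large database: {large_db:.3g}")
```

Because the bit score already absorbs K and lambda, two bit scores can be compared directly even when they come from different scoring systems, whereas the E-value additionally scales linearly with the search space m·n.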

The BLAST P-value is the probability that a chance alignment with a given score or a better score occurs in a database search.
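If the number of chance hits is assumed to follow a Poisson distribution with mean E, the P-value follows from the E-value as P = 1 − e^(−E). The helper below is a small sketch of that conversion; its name is hypothetical.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-30, 0.01, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For E below roughly 0.01 the two values are practically identical, which is one reason the E-value is usually reported on its own.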

Users mainly use the E-value as a cutoff to define a BLAST “hit”. However, depending on the application, this cutoff ranges from 1e-2 to 1e-30.

Z-values give the distance between the actual alignment score and the mean of the scores obtained for randomized sequences, expressed in multiples of the standard deviation of the randomized scores.
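A Z-value of this kind can be estimated by shuffling one sequence repeatedly and rescoring it. The sketch below is illustrative only: it assumes a simple Smith-Waterman scorer with match/mismatch scores and a linear gap penalty instead of a substitution matrix with affine gaps, and the sequence, the number of shuffles, and all function names are made up for the example.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, shuffles=200, seed=0):
    """Z-value of the real score against scores of shuffled copies of b."""
    rng = random.Random(seed)
    actual = local_alignment_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        random_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.fmean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("randomized scores have zero variance")
    return (actual - mean) / sd


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(z_score(seq, seq))
```

Shuffling keeps the letter composition of the sequence, so the Z-value measures how much of the score is due to residue order rather than composition alone.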