The statistical models used to correct for homoplasy are called substitution models or evolutionary models. For constructing DNA phylogenies, a number of nucleotide substitution models are available. These models differ in how multiple substitutions at each nucleotide position are treated. The caveat of using these models is that if too many multiple substitutions have occurred at a particular position, which is often true for very divergent sequences, the position may become saturated. This means that the evolutionary divergence is beyond the ability of the statistical models to correct. In this case, true evolutionary distances cannot be derived. Therefore, only reasonably similar sequences should be used in phylogenetic comparisons.
In the Jukes and Cantor (1969) model, the rate of nucleotide substitution is the same for all pairs of the four nucleotides A, T, C, and G. The multiple hit correction equation for this model produces a maximum likelihood estimate of the number of nucleotide substitutions between two sequences. The model assumes equal substitution rates among sites and equal nucleotide frequencies, and it does not correct for the higher rate of transitions compared with transversions.
The Jukes and Cantor model computes the probability of substitution from one state to another (the model was originally formulated for nucleotides, but the states can equally be codons or amino acids). From this model we can also derive a formula for computing the distance between two sequences. The main idea behind the model is the assumption that the probability of changing from one state to any different state is always equal, so that all possible nucleotide substitutions occur at the same rate α per unit time. We also assume that the different sites evolve independently of one another.
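As a minimal sketch of the substitution probabilities these assumptions imply, the Python snippet below uses the standard closed-form solution of the Jukes–Cantor model: after time t, a site shows the same base with probability 1/4 + (3/4)e^(−4αt) and any one specific other base with probability 1/4 − (1/4)e^(−4αt). The function name jc69_transition_matrix and the A, C, G, T ordering are illustrative choices and do not come from the original description.

```python
import math

NUCLEOTIDES = "ACGT"  # row/column order of the matrix (an arbitrary choice)


def jc69_transition_matrix(alpha, t):
    """Return the 4x4 matrix of Jukes-Cantor substitution probabilities.

    alpha: rate of each specific substitution (e.g. A -> C) per unit time.
    t:     elapsed time, in the same units as alpha.
    Entry [i][j] is the probability that a site in state i is in state j
    after time t.
    """
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay  # site shows the same base after time t
    p_diff = 0.25 - 0.25 * decay  # site shows one specific other base
    return [[p_same if i == j else p_diff for j in range(4)]
            for i in range(4)]


if __name__ == "__main__":
    for row_label, row in zip(NUCLEOTIDES, jc69_transition_matrix(0.1, 2.0)):
        print(row_label, " ".join(f"{p:.4f}" for p in row))
```

Every row sums to 1, the matrix is the identity at t = 0, and all entries approach 1/4 as t grows, which is the saturation behaviour described earlier.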
Worked example for the Jukes–Cantor model
The evolutionary distance, including hidden (multiple) changes, is derived from the observed distance by applying a logarithmic correction:
dAB = −(3/4) ln[1 − (4/3) pAB]
where dAB is the evolutionary distance between sequences A and B and pAB is the observed sequence distance, measured as the proportion of substitutions over the entire length of the alignment. For example, if an alignment of sequences A and B is twenty nucleotides long and six pairs are found to be different, the sequences differ by 30%, or have an observed distance of 0.3. Correcting for multiple substitutions with the Jukes–Cantor model, the corrected evolutionary distance is:
dAB = −(3/4) ln[1 − (4/3)(0.3)] = −(3/4) ln(0.6) ≈ 0.38
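The calculation above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names jc69_distance and observed_distance are hypothetical, and the helper assumes the two sequences are already aligned, of equal length, and free of gaps.

```python
import math


def observed_distance(seq_a, seq_b):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length, non-empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def jc69_distance(p_observed):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - (4/3) p)."""
    if not 0.0 <= p_observed < 0.75:
        # At p = 0.75 the logarithm diverges; beyond it, it is undefined.
        raise ValueError("p_observed must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p_observed)


# Worked example from the text: 6 differences in a 20-nucleotide alignment.
p_ab = 6 / 20
print(round(jc69_distance(p_ab), 2))  # 0.38
```

In practice, observed_distance would supply the value of pAB from a pairwise alignment, and jc69_distance would then be applied to every pair of sequences to fill a distance matrix.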
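The Jukes–Cantor model can only handle reasonably closely related sequences. According to the above equation, the corrected distance increases faster than the observed distance as the observed distance increases. For distantly related sequences, the correction can become too large to be reliable. If two DNA sequences are only 25% identical, which is the level expected between two unrelated sequences with equal base frequencies, pAB is 0.75. The argument of the logarithm, 1 − (4/3)(0.75), is then zero, so the corrected distance becomes infinitely large; for pAB greater than 0.75 the logarithm is undefined.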
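To illustrate how quickly the correction grows near saturation, the following Python sketch tabulates the corrected distance for several observed distances. Returning math.inf once pAB reaches 0.75 is a convention assumed for this example only; the function and constant names are likewise hypothetical.

```python
import math

SATURATION = 0.75  # observed distance at which the correction diverges


def jc69_distance(p_observed):
    """Jukes-Cantor distance, with math.inf at or beyond saturation.

    Returning math.inf for p >= 0.75 is a choice made for this sketch;
    strictly, the formula is undefined for p > 0.75.
    """
    if not 0.0 <= p_observed <= 1.0:
        raise ValueError("p_observed must lie between 0 and 1")
    log_argument = 1.0 - (4.0 / 3.0) * p_observed
    if p_observed >= SATURATION or log_argument <= 0.0:
        return math.inf
    return -0.75 * math.log(log_argument)


for p in (0.1, 0.3, 0.5, 0.7, 0.74, 0.75):
    d = jc69_distance(p)
    print(f"p = {p:.2f}  d = {d:.3f}  correction = {d - p:.3f}")
```

The printed correction (d minus p) is small for similar sequences and grows without bound as p approaches 0.75, which is why small sampling errors in p translate into large errors in d for divergent sequences.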