A substitution model describes the process by which a sequence of symbols changes into another set of traits. For example, in cladistics, each position in the sequence might correspond to a property of a species that can be either present or absent.
Substitution models are used, for example, for constructing evolutionary trees in phylogenetics or cladistics, and simulating sequences to test other methods and algorithms.
Most substitution models used to date are neutral, independent, finite-sites models. Neutral means that selection does not operate on the substitutions, so they are unconstrained. Independent means that a change at one site does not affect the probability of a change at another site. Finite sites means that the number of sites is finite, so a single site can change more than once over the course of evolution.
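The three assumptions above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the function name `evolve`, the per-step substitution probability `mu`, the sequence length, and the number of steps are all hypothetical choices made for illustration, and every site draws its replacement uniformly from the other three nucleotides.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve(seq, steps, mu, rng):
    """Evolve `seq` for `steps` time steps; return (new_seq, hits_per_site).

    Neutral: every substitution is accepted, with no selection step.
    Independent: each site is tested on its own, ignoring its neighbours.
    Finite sites: the same position can be hit more than once.
    """
    sites = list(seq)
    hits = [0] * len(sites)
    for _ in range(steps):
        for i, base in enumerate(sites):
            if rng.random() < mu:
                # Replace the base with one of the three other nucleotides.
                sites[i] = rng.choice([b for b in NUCLEOTIDES if b != base])
                hits[i] += 1
    return "".join(sites), hits

rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(NUCLEOTIDES) for _ in range(200))
descendant, hits = evolve(ancestor, steps=100, mu=0.01, rng=rng)

observed = sum(a != b for a, b in zip(ancestor, descendant))
total_events = sum(hits)
multiply_hit = sum(h > 1 for h in hits)

print("substitution events:", total_events)
print("observed differences:", observed)
print("sites hit more than once:", multiply_hit)
```

Because a site can be hit repeatedly, the number of observed differences between ancestor and descendant can never exceed the number of substitution events that actually occurred, and is usually smaller.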
An evolutionary model of sequence data is a model of nucleotide or amino-acid substitution and the consequent divergence of sequences. Evolutionary (substitution) models play an important role in the analysis of molecular sequence data. They reduce the complexity of the biological mutation process to simpler patterns that can be described and predicted using a small number of parameters. Substitution models attempt to predict the rate of substitution for nucleotides or amino acids at a given site, as well as the distribution of substitutions across the entire sequence. Variation in the substitution rate across the sites of a sequence is called rate heterogeneity.
Multiple alignment is followed by the selection of an appropriate evolutionary model. There are many such models. All statistical models are based on certain assumptions. One assumption is that each position in the nucleic acid or protein evolves independently. In reality, that is not the case; there are hot spots of mutation, and also some mutations are more tolerated than others.
The simplest way to determine divergence is to count the number of substitutions. However, this simplistic approach has caveats. For example, an observed substitution (e.g. A→G) may not reflect a single event, but may have passed through an intermediate state (e.g. A→T→G). Likewise, the absence of a substitution at a position may mean that an original substitution was later reversed (reverse mutation), restoring the original residue (e.g. A→G→A). Substitution models are statistical models intended to correct for these biases. Note that these methods rest on general mathematical and statistical principles that carry their own assumptions.
The simplest substitution model for nucleotides is the Jukes–Cantor (JC) one-parameter model, which assumes that all nucleotides occur in equal frequency (25%) and are substituted with equal probability. This model requires a single rate parameter. However, it is well known that transition mutations are more common than transversion mutations. Kimura’s two-parameter model accounts for this by assigning separate rates to transitions and transversions, and so requires two rate parameters. Like the Jukes–Cantor model, Kimura’s model also assumes that all nucleotides occur in equal frequency (25%).
There are other, more complex models of nucleotide substitution, such as the Felsenstein model and the Hasegawa–Kishino–Yano (HKY) model, which assume that nucleotides occur at different frequencies and that transitions and transversions occur at different rates. The general time reversible (GTR) model, also known as the general reversible (REV) model, is more complex still: it assumes a different rate of substitution for each pair of nucleotides, in addition to different frequencies of occurrence of the nucleotides.
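The correction for hidden and reversed substitutions can be made concrete with the standard distance formulas associated with the Jukes–Cantor and Kimura two-parameter models. The code below is a minimal sketch, not a reference implementation: the function names and the two example sequences are hypothetical, and sites containing gaps or ambiguous characters are simply skipped, which is an assumption of this example. Here p is the proportion of differing sites, P the proportion of transitions, and Q the proportion of transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_differences(seq1, seq2):
    """Return (transition_fraction, transversion_fraction) for two aligned
    sequences, skipping any site that is not A, C, G or T in both."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared

def jc69_distance(p):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def k80_distance(P, Q):
    """Kimura two-parameter distance:
    d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(a * math.sqrt(b))

# Hypothetical example sequences: one transition (G/A), one transversion (A/C).
seq1 = "ACGTACGTAC"
seq2 = "ACATACGTCC"
P, Q = count_differences(seq1, seq2)
p = P + Q
print("raw proportion of differences:", p)
print("JC distance:", jc69_distance(p))
print("K80 distance:", k80_distance(P, Q))
```

Both corrected distances exceed the raw proportion of differences whenever any differences are observed, reflecting substitutions that simple counting misses. When transversions are exactly twice as frequent as transitions, as the Jukes–Cantor model expects, the two formulas give the same value.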
For these models, the nucleotide frequencies are estimated from the observed frequencies in the alignment.
Amino acid substitution models include the Dayhoff model (PAM), the Bishop–Friday model, the Jones–Taylor–Thornton (JTT) model, the Whelan and Goldman (WAG) model, and the Le and Gascuel (LG) model. The simplest is the Bishop–Friday model, which assumes that all amino acids occur at equal frequency and that all substitutions occur at the same rate. All the other models assume different amino-acid frequencies and different substitution rates, which are determined empirically.
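Estimating nucleotide frequencies from an alignment, and using them in a model with unequal frequencies and separate transition and transversion rates, might look roughly like the sketch below. It builds an unnormalised HKY-style rate matrix in which the rate from nucleotide i to nucleotide j is proportional to the frequency of j, multiplied by a transition/transversion ratio `kappa` for transitions. The function names, the toy alignment, and the value of `kappa` are hypothetical; in practice `kappa` would be estimated from the data rather than fixed by hand.

```python
from collections import Counter

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def base_frequencies(alignment):
    """Observed frequencies of A, C, G, T over all sequences,
    ignoring gaps and ambiguous characters."""
    counts = Counter()
    for seq in alignment:
        counts.update(ch for ch in seq.upper() if ch in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment contains no nucleotides")
    return {b: counts[b] / total for b in BASES}

def hky_rate_matrix(freqs, kappa):
    """Unnormalised HKY-style rate matrix as a dict of dicts.

    Off-diagonal entry q[i][j] is kappa * freqs[j] for a transition and
    freqs[j] for a transversion; each diagonal entry makes its row sum to 0.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                scale = kappa if (i, j) in TRANSITIONS else 1.0
                row[j] = scale * freqs[j]
        row[i] = -sum(row.values())  # only off-diagonals are in row so far
        q[i] = row
    return q

# Hypothetical toy alignment of three sequences.
alignment = ["ACGTACGT", "ACGTAC-T", "ACGAACGT"]
freqs = base_frequencies(alignment)
q = hky_rate_matrix(freqs, kappa=2.0)  # kappa = 2.0 is an arbitrary choice

for i in BASES:
    print(i, [round(q[i][j], 3) for j in BASES])
```

Setting `kappa` to 1 and all four frequencies to 0.25 reduces this matrix to the equal-rates form assumed by the Jukes–Cantor model.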
Software such as MEGA version 5 can report the substitution model used for a particular data set.