Gene Prediction Using Markov Models and Hidden Markov Models
Markov models and HMMs can be very helpful in providing ﬁner statistical description of a gene. A Markov model describes the probability of the distribution of nucleotides in a DNA sequence, in which the conditional probability of a particular sequence position depends on k previous positions. In this case, k is the order of a Markov model. A zero-order Markov model assumes each base occurs independently with a given probability. This is often the case for noncoding sequences. A ﬁrst-order Markov model assumes that the occurrence of a base depends on the base preceding it. A second-order model looks at the preceding two bases to determine which base follows, which is more characteristic of codons in a coding sequence. The use of Markov models in gene ﬁnding exploits the fact that oligonucleotide distributions in the coding regions are different from those for the noncoding regions. These can be represented with various orders of Markov models. Since a ﬁxed-order Markov chain describes the probability of a particular nucleotide that depends on previous k nucleotides, the longer the oligomer unit, the more non randomness can be described for the coding region. Therefore, the higher the order of a Markov model, the more accurately it can predict a gene. Because a protein-encoding gene is composed of nucleotides in triplets as codons, more effective Markov models are built in sets of three nucleotides, describing nonrandom distributions of trimmers or hexamers and so on. The parameters of a Markov model have to be trained using a set of sequences with known gene locations. Once the parameters of the model are established, it can be used to compute the nonrandom distributions of trimers or hexamers in a new sequence to ﬁnd regions that are compatible with the statistical proﬁles in the learning set. Statistical analyses have shown that pairs of codons tend to correlate. The frequency of six unique nucleotides appearing together in a coding region is much higher than by random chance. Therefore, a ﬁfth-order Markov model, which calculates the probability of hexamer bases, can detect nucleotide correlations found in coding regions more accurately and is in fact most often used.
A potential problem of using a ﬁfth-order Markov chain is that if there are not enough hexamers, which happens in short gene sequences, the method’s efﬁcacy may be limited. To cope with this limitation, a variable-length Markov model, called an interpolated Markov model (IMM), has been developed. The IMM method samples the largest number of sequence patterns with k ranging from 1 to 8 (dimers to ninemers) and uses a weighting scheme, placing less weight on rare k-mers and more weight on more frequent k-mers. The probability of the ﬁnal model is the sum of probabilities of all weighted k-mers. In other words, this method has more ﬂexibility in using Markov models depending on the amount of data available. Higher-order models are used when there is a sufﬁcient amount of data and lower-order models are used when the amount of data is smaller. It has been shown that the gene content and length distribution of prokaryotic genes can be either typical or atypical. Typical genes are in the range of 100 to 500 amino acids with a nucleotide distribution typical of the organism. The typical genes tend to escape detection using the typical gene model. This means that, to make the algorithm capable of fully describing all genes in a genome, more than one Markov model is needed. To combine different Markov models that represent typical and a typical nucleotide distributions creates an HMM prediction algorithm.
Tools using HMMs
- Glimmer (Gene Locator and Interpolated Markov Modeler)