Profiles are position-specific score matrices derived from multiple sequence alignments. They are typically used for protein classification, where an unknown amino acid sequence is compared to a set of profiles characterizing known protein families. Such comparisons between profiles and anonymous sequences are often more sensitive than pairwise sequence comparisons. Profiles can be rewritten as profile hidden Markov models. Hidden Markov models are based on Markov chains, which consist of a number of states and the probabilities of transitions between them.
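To make the idea of a position-specific score matrix concrete, here is a minimal sketch that derives log-odds scores from a toy alignment and scores two sequences against them. The alignment, the uniform background frequencies, the pseudocount of 1, and the function names `build_pssm` and `score` are all assumptions chosen for illustration; DNA is used instead of protein only to keep the alphabet small.

```python
import math

# Toy DNA alignment (hypothetical data); each string is one aligned sequence.
alignment = ["ACGT", "ACGA", "ATGT", "ACGT"]
alphabet = "ACGT"
background = {a: 0.25 for a in alphabet}  # assumed uniform background frequencies
pseudocount = 1.0                         # assumed; avoids log(0) for unseen residues


def build_pssm(alignment, alphabet, background, pseudocount):
    """Return one dict of log2-odds scores per alignment column."""
    pssm = []
    for col in range(len(alignment[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in alignment:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {a: math.log2((counts[a] / total) / background[a]) for a in alphabet}
        )
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


pssm = build_pssm(alignment, alphabet, background, pseudocount)
consensus_score = score(pssm, "ACGT")  # resembles the alignment
unrelated_score = score(pssm, "TTTT")  # does not
print(consensus_score, unrelated_score)
```

A sequence that resembles the alignment receives a higher score than an unrelated one, which is the basis for classifying an unknown sequence against a set of family profiles.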
In hidden Markov models (HMMs), the states of the Markov chain are said to be hidden and to emit observed symbols, for example the nucleotides of a DNA sequence. Given such data and an HMM, there are three classical problems that need to be solved for HMMs to be useful in biological sequence analysis: the scoring, the detection (decoding), and the training problem. Efficient solutions to these problems have led to many applications of HMMs in biology, including homology detection, for which profiles were originally designed, and also extremely rapid multiple sequence alignment.
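The distinction between hidden states and emitted symbols can be illustrated with a small generative sketch. The two states (`AT_rich`, `GC_rich`) and every probability below are invented for illustration and are not taken from any real model.

```python
import random

# Hypothetical two-state HMM over DNA; all numbers are assumptions.
states = ["AT_rich", "GC_rich"]
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
emit = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def sample(dist, rng):
    """Draw one key from a {key: probability} dict."""
    r = rng.random()
    cumulative = 0.0
    for key, p in dist.items():
        cumulative += p
        if r < cumulative:
            return key
    return key  # guard against floating-point rounding


def generate(length, rng):
    """Return (hidden state path, emitted sequence) of the given length."""
    path, symbols = [], []
    state = sample(start, rng)
    for _ in range(length):
        path.append(state)
        symbols.append(sample(emit[state], rng))
        state = sample(trans[state], rng)
    return path, "".join(symbols)


rng = random.Random(42)  # fixed seed so the run is repeatable
hidden_path, observed = generate(12, rng)
print(observed)
print(hidden_path)
```

Only `observed` would be available in a real analysis; `hidden_path` is what the scoring, decoding, and training problems reason about.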
The following information is stored in any generalized profile:
- Each position is called a match state. A score for every residue is defined at every match state, just as in a position-specific score matrix (PSSM).
- Each match state can be skipped in the alignment by what is called a deletion state, which receives a position-dependent penalty.
- Insertions of variable length are possible between any two adjacent match (or deletion) states. These insertion states are given a position-dependent penalty that might also depend upon the inserted residues.
- Every possible transition between any two states (match, delete or insert) receives a position-dependent penalty. This is primarily to model the cost of opening and closing a gap.
- A few additional parameters allow fine-tuning of the behavior at the extremities of the alignment, which can be forced to be 'local' or 'global' at either end of the profile and of the sequence.
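The items above can be sketched as a small data structure. This is one possible layout under stated assumptions, not the format of any particular profile library: the class names, the field names, the single-letter state codes `"M"`, `"D"`, `"I"`, and the convention that a missing transition costs 0 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ProfilePosition:
    match_scores: Dict[str, float]  # score per residue at this match state
    delete_penalty: float           # cost of skipping this match state
    insert_penalty: float           # cost per residue inserted after this position
    # Penalty for leaving this position via (from_state, to_state), e.g. ("M", "I").
    transition_penalties: Dict[Tuple[str, str], float] = field(default_factory=dict)


@dataclass
class GeneralizedProfile:
    positions: List[ProfilePosition]
    # (start, end) flags: True means 'local', False means 'global'.
    profile_ends_local: Tuple[bool, bool] = (True, True)
    sequence_ends_local: Tuple[bool, bool] = (True, True)


def score_path(profile, path):
    """Score a given alignment path of (state, position index, residue) steps."""
    total = 0.0
    previous = None  # (state, ProfilePosition) of the preceding step
    for state, index, residue in path:
        position = profile.positions[index]
        if previous is not None:
            prev_state, prev_position = previous
            total -= prev_position.transition_penalties.get((prev_state, state), 0.0)
        if state == "M":
            total += position.match_scores[residue]
        elif state == "D":
            total -= position.delete_penalty
        elif state == "I":
            total -= position.insert_penalty
        else:
            raise ValueError(f"unknown state {state!r}")
        previous = (state, position)
    return total


profile = GeneralizedProfile(
    positions=[
        ProfilePosition(
            match_scores={"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
            delete_penalty=4.0,
            insert_penalty=1.0,
            transition_penalties={("M", "I"): 3.0, ("I", "M"): 0.5},
        ),
        ProfilePosition(
            match_scores={"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
            delete_penalty=4.0,
            insert_penalty=1.0,
        ),
    ]
)

# Match A at position 0, insert a C after it, then match G at position 1.
example_path = [("M", 0, "A"), ("I", 0, "C"), ("M", 1, "G")]
example_score = score_path(profile, example_path)
print(example_score)
```

The sketch only scores a path that is already given; finding the best path through such a profile requires a dynamic-programming alignment, which is not shown here.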
The Hidden Markov Model (HMM) method is a mathematical approach to solving three types of problems:
- Given the model, find the probability of the observations.
- Given the model and the observations, find the most likely state transition trajectory.
- Given the observations, adjust the model's parameters to maximize the probability of the observations.
For each of these problems, algorithms have been developed:
- Forward (or Forward-Backward)
- Viterbi
- Baum-Welch (and the Segmental K-means alternative)
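The first two problems are commonly solved with the forward algorithm and the Viterbi algorithm, respectively. The sketch below implements both for a toy two-state HMM; the state names `H` and `L` and all probabilities are assumptions made up for illustration. It works with raw probabilities, which is adequate only for short sequences (longer ones would normally use log probabilities or scaling), and it leaves out Baum-Welch training for brevity.

```python
# Hypothetical two-state HMM over DNA; all numbers are assumptions.
states = ("H", "L")
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs):
    """Probability of the observations, summed over all state paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {
            s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


def viterbi(obs):
    """Most likely state path and its joint probability with the observations."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        pointers, new_v = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: v[r] * trans[r][s])
            pointers[s] = best_prev
            new_v[s] = v[best_prev] * trans[best_prev][s] * emit[s][symbol]
        backpointers.append(pointers)
        v = new_v
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])  # follow pointers back to the start
    path.reverse()
    return path, v[last]


observations = "GCA"
likelihood = forward(observations)
best_path, best_prob = viterbi(observations)
print(likelihood, best_path, best_prob)
```

Both functions run in time proportional to the sequence length times the square of the number of states, in contrast to enumerating every possible state path.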
The HMM method has traditionally been used in signal processing, speech recognition, and, more recently, bioinformatics. It can be applied to pattern recognition problems wherever a model can be assumed to produce a sequence of observations. In bioinformatics, it has been used in sequence alignment, in silico gene detection, structure prediction, literature data mining, and so on. Hidden Markov Models (HMMs) have recently become important and popular among bioinformatics researchers, and many software tools are based on them.