A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to share an evolutionary relationship, meaning that the sequences are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences’ shared evolutionary origins.
An MSA is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
The term multiple sequence alignment also refers to the process of aligning such a sequence set. MSAs require more sophisticated methods than pairwise alignment because they are computationally more complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization, because identifying the optimal alignment of more than a few sequences of moderate length is prohibitively expensive to compute.
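The cost of exact alignment can be illustrated with a little arithmetic. The sketch below assumes the standard dynamic-programming formulation, in which aligning N sequences of length L fills an N-dimensional lattice of (L + 1)^N cells, each with up to 2^N - 1 predecessor cells. The function names are illustrative and do not come from any alignment program.

```python
def dp_lattice_cells(num_sequences: int, length: int) -> int:
    """Number of cells in the exact dynamic-programming lattice."""
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences: int) -> int:
    """Each cell can be reached from up to 2^N - 1 neighbouring cells."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    # Show how quickly the lattice grows for sequences of length 100.
    for n in (2, 3, 5, 10):
        print(n, dp_lattice_cells(n, 100), predecessors_per_cell(n))
```

For two sequences of length 100 the lattice has about ten thousand cells; for ten such sequences it has more than 10^20, which is why heuristics are preferred.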
A general approach when calculating multiple sequence alignments is to use graphs to identify all of the different alignments. When finding alignments through a graph, a complete alignment is represented in a weighted graph that contains a set of vertices and a set of edges. Each edge has a weight, based on a chosen heuristic, that helps to score each alignment or subset of the original graph.
When determining the best-suited alignment for each MSA, a trace is usually generated. A trace is a set of realized (that is, corresponding and aligned) vertices, and it has a specific weight based on the edges that are selected between corresponding vertices.
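The idea of a weighted trace can be sketched in a few lines. The representation below is an assumption made for illustration: a vertex is a (sequence index, position) pair, an edge joins two vertices from different sequences with the lower sequence index first, and the weight of a trace is the sum of its edge weights. The conflict check covers only edges between the same pair of sequences, which is a necessary condition for a trace to be realizable by an alignment, not a sufficient one.

```python
from itertools import combinations

# Hypothetical representation: (sequence index, position in that sequence).
Vertex = tuple[int, int]
Edge = tuple[Vertex, Vertex]


def trace_weight(weights: dict[Edge, float], trace: list[Edge]) -> float:
    """Weight of a trace: the sum of the weights of its selected edges."""
    return sum(weights[edge] for edge in trace)


def has_pairwise_conflict(trace: list[Edge]) -> bool:
    """True if two edges between the same two sequences cross or share a position."""
    for (u1, v1), (u2, v2) in combinations(trace, 2):
        if (u1[0], v1[0]) != (u2[0], v2[0]):
            continue  # edges join different sequence pairs; not checked here
        # Positions must increase together in both sequences.
        if (u1[1] - u2[1]) * (v1[1] - v2[1]) <= 0:
            return True
    return False


if __name__ == "__main__":
    # Invented weights for three candidate edges between sequences 0 and 1.
    weights = {
        ((0, 0), (1, 0)): 2.0,
        ((0, 1), (1, 2)): 1.5,
        ((0, 2), (1, 1)): 3.0,
    }
    edges = list(weights)
    print(trace_weight(weights, edges[:2]), has_pairwise_conflict(edges[:2]))
    print(has_pairwise_conflict(edges[1:]))
```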
Analysis and evaluation calculation
Evaluation is performed using three procedures: similarity calculation, distance calculation, and cluster validation. The percent identity (ID) score is used to calculate the similarity between two sequences in the aligned matrices generated in the ‘alignment analyses’ step.
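A percent identity score can be computed directly from two rows of an alignment. The sketch below assumes that gaps are written as "-" and that only columns in which both sequences have a residue count toward the denominator; other conventions (for example, dividing by the full alignment length) are also in use, so this is one possible definition rather than the one used in the analysis described here.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity between two equal-length rows of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))
```

A distance can then be derived from the similarity, for example as 100 minus the percent identity.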
Cluster validity indices are designed to evaluate how well the MSA or PSA alignment results fit the real protein family divisions. A suitable index should not be too sensitive to noise, as the Dunn and Dunn-like indices are, and should not add a burden to the calculation, for example by requiring a representative point for each cluster, as many indices do.
If a sequence alignment method produces well-clustered results, the silhouette value will be near 1.0; if the results are poorly clustered, the value will be near -1. A higher silhouette value means that intra-distances (distances within the same class) are much smaller than inter-distances (distances between different classes), which indicates that the partitioning is a good one.
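The silhouette value can be computed from a pairwise distance matrix and a cluster label for each sequence. The sketch below follows the usual definition, s = (b - a) / max(a, b), where a is the mean distance to members of the same cluster and b is the smallest mean distance to any other cluster; the function name and the input layout are assumptions made for illustration.

```python
def silhouette_score(dist: list[list[float]], labels: list[int]) -> float:
    """Mean silhouette value over all items, given pairwise distances."""
    clusters: dict[int, list[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean intra-cluster distance; b: nearest mean inter-cluster distance.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


if __name__ == "__main__":
    # Invented example: four items at positions 0, 1, 10 and 11 on a line.
    pos = [0, 1, 10, 11]
    dist = [[abs(p - q) for q in pos] for p in pos]
    print(silhouette_score(dist, [0, 0, 1, 1]))  # well separated: close to 1
    print(silhouette_score(dist, [0, 1, 0, 1]))  # mixed up: negative
```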
The RS score is used to measure the dissimilarity of clusters. The values of RS range from 0 to 1. A higher RS value means better clustering. It is calculated as:
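Assuming that RS here denotes the R-squared index commonly used in cluster validation, it is the share of the total sum of squares that is explained by the clustering: RS = (SS_total - SS_within) / SS_total. The minimal sketch below further assumes that each item is represented by a numeric feature vector; the function name and data layout are illustrative.

```python
def rs_index(points: list[list[float]], labels: list[int]) -> float:
    """R-squared index: (SS_total - SS_within) / SS_total."""

    def sum_of_squares(rows: list[list[float]]) -> float:
        # Squared distances of all rows to their centroid.
        dims = len(rows[0])
        centroid = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
        return sum((r[d] - centroid[d]) ** 2 for r in rows for d in range(dims))

    ss_total = sum_of_squares(points)
    if ss_total == 0:
        raise ValueError("all points are identical; RS is undefined")

    clusters: dict[int, list[list[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ss_within = sum(sum_of_squares(rows) for rows in clusters.values())
    return (ss_total - ss_within) / ss_total


if __name__ == "__main__":
    # Invented one-dimensional example with two tight, well-separated groups.
    points = [[0.0], [1.0], [10.0], [11.0]]
    print(rs_index(points, [0, 0, 1, 1]))
```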
For each benchmark group, the cluster validity results of the different alignment methods, calculated on the 10 re-sampled datasets, are compared using the t test. A higher p-value means that no significant difference in performance was found between two alignment methods, while a smaller p-value means that the difference between them is significant.
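The comparison of two methods can be sketched as follows. The text does not say which variant of the t test is used; the example assumes Welch's unequal-variance form and computes only the t statistic and its degrees of freedom with the standard library. The scores in the usage section are invented for illustration. To obtain a p-value, the statistic is compared against the t distribution, for instance with `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Invented validity scores for two methods on 10 re-sampled datasets.
    method_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
    method_b = [0.74, 0.77, 0.73, 0.76, 0.75, 0.72, 0.78, 0.74, 0.76, 0.75]
    print(welch_t(method_a, method_b))
```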
After building a multiple sequence alignment (MSA) with any of the available programs, we can consider, for each position (column) in the alignment, that the residues (amino acids) in that column are homologs, meaning that they share a common evolutionary history. If we are 100% sure that our sequences are orthologs, then the residues in each column are orthologous to one another. Depending on the number of gaps in each column, we can infer other evolutionary events, such as insertions (few residues, many gaps) and deletions (few gaps, many residues). Conserved blocks are those that look very well aligned (few gaps, few inconsistencies); to be more confident about them, we can use programs such as GBlocks, trimAl or BMGE, which detect and remove poorly aligned or misaligned columns from the alignment.
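The column-wise reasoning above can be illustrated with a simple gap count. The sketch below computes the fraction of gaps in each column and drops the columns that exceed a threshold. It is a deliberately naive filter with an assumed default threshold of 0.5, not the algorithm used by GBlocks, trimAl or BMGE, and the function names are illustrative.

```python
def column_gap_fractions(alignment: list[str], gap: str = "-") -> list[float]:
    """Fraction of gap characters in each column of an alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    # zip(*alignment) iterates over the alignment column by column.
    return [sum(ch == gap for ch in col) / n for col in zip(*alignment)]


def filter_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[str]:
    """Keep only the columns whose gap fraction is at most the threshold."""
    fractions = column_gap_fractions(alignment)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return ["".join(row[i] for i in keep) for row in alignment]


if __name__ == "__main__":
    # Invented toy alignment; the third column is mostly gaps.
    alignment = ["AC-GT", "AC-GT", "ACTG-", "A--GT"]
    print(column_gap_fractions(alignment))
    print(filter_columns(alignment))
```

A column with many gaps and few residues, such as the third one here, is the pattern described above as a likely insertion.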
- Clustal Omega
- EMBOSS Cons