Multiple sequence alignments are central to many areas of bioinformatics and evolutionary biology. They are used not only in phylogenetic analyses of biological sequences but also in many other bioinformatics applications, such as homology modeling, database searches and motif finding. Recently, techniques based on multiple sequence alignment have been integrated into high-throughput pipelines such as genome annotation and large-scale phylogenetic analysis. In all these applications, the reliability and accuracy of the analyses depend critically on the quality of the underlying alignments.
TrimAl is a new tool for automated alignment trimming. Its speed and its ability to adjust parameters automatically, so as to optimize the phylogenetic signal-to-noise ratio for different families, make it especially suited to large-scale phylogenomic analyses involving thousands of large multiple sequence alignments. TrimAl is implemented in C++.
TrimAl automates the removal of spurious sequences or poorly aligned regions from a multiple sequence alignment. It can consider several parameters, alone or in combination, to select the most reliable positions in the alignment. These include the proportion of sequences with a gap, the level of residue similarity and, if several alignments for the same set of sequences are provided, the consistency level of columns among alignments. Moreover, the user can manually select a set of columns to be removed from the alignment.
Additionally, TrimAl implements a series of automated algorithms that choose thresholds based on the characteristics of each alignment, so that the signal-to-noise ratio after the trimming phase is optimized. Moreover, the user can remove spurious sequences from the alignment before applying any other method to improve the alignment’s quality.
TrimAl's additional features include retrieving the complementary alignment (the columns that were trimmed), computing statistics from the alignment, selecting the output file format, and producing a summary of the trimming in HTML format, among many other options.
TrimAl is being developed by the Comparative Genomics Group at the Centre for Genomic Regulation (CRG) in Barcelona, Spain.
TrimAl reads and writes protein or nucleotide alignments in several standard formats. It starts by reading all columns in an alignment and computing a score (Sx) for each of them. This score can be a gap score (Sg), a similarity score (Ss) or a consistency score (Sc). The score for each column can be computed from that column alone or, if a window size w is specified, as the average value over the w columns around the position considered.
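The window averaging described above can be sketched in a few lines of Python. This is an illustration only, not TrimAl's C++ code: the function name `windowed_scores` is hypothetical, and the sketch assumes that w counts the columns taken on each side of the position and that the window is simply truncated at the ends of the alignment. The text does not specify either detail.

```python
def windowed_scores(scores, w=0):
    """Smooth per-column scores by averaging over neighbouring columns.

    Assumption (not stated in the text): the window covers w columns on
    each side of the position, and is truncated at the alignment ends.
    With w = 0 every score is returned unchanged.
    """
    if w < 0:
        raise ValueError("window size must be non-negative")
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo = max(0, i - w)          # first column inside the window
        hi = min(n, i + w + 1)      # one past the last column inside the window
        window = scores[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Example: a poorly scoring column between two well scoring ones.
print(windowed_scores([1.0, 0.0, 1.0], w=1))
```

Averaging over a window makes the score of a column depend on its neighbourhood, so an isolated good column inside a poorly aligned region scores lower than it would on its own.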
The gap score (Sg) for a column is the fraction of sequences without a gap in that position. The residue similarity score (Ss) is built from mean distance (MD) scores, which use the distance between pairs of residues as defined by a given scoring matrix. Finally, the consistency score (Sc) can only be computed when more than one alignment for the same set of sequences is provided.
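As a concrete illustration of the gap score definition, the following Python sketch computes Sg for every column of a small alignment. It is not taken from TrimAl: the function name `gap_scores` is hypothetical, the alignment is represented as a plain list of equal-length strings, and "-" is assumed to be the only gap character.

```python
def gap_scores(alignment, gap_chars="-"):
    """Return Sg per column: the fraction of sequences without a gap.

    `alignment` is a list of aligned sequences (equal-length strings).
    Treating only "-" as a gap is an assumption of this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    scores = []
    for col in range(length):
        # Count the sequences that carry a residue (not a gap) in this column.
        non_gaps = sum(1 for seq in alignment if seq[col] not in gap_chars)
        scores.append(non_gaps / n_seqs)
    return scores


# Toy alignment of three sequences and four columns.
toy = ["MK-L", "MKAL", "M--L"]
print(gap_scores(toy))
```

In the toy alignment, the first and last columns contain no gaps and so score 1, while the third column has a residue in only one of three sequences and scores 1/3. A column with a low Sg is a natural candidate for removal when a gap threshold is applied.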