The two older programs COILS and PairCoil/MultiCoil are significantly outperformed by two recent developments, Marcoil, a program built on hidden Markov models, and PCOILS, a new COILS version that takes profiles as input, and to a lesser extent by a PairCoil update, PairCoil2. Overall, Marcoil performs slightly better than PCOILS on the reference database and is considerably faster, but it is prone to false positives on highly charged sequences, whereas the weighting option of PCOILS allows such sequences to be identified.
The alpha-helical coiled coil is a simple structural motif found at high frequency in proteins of all organisms. Many coiled coils mediate oligomerization or protein–protein interaction, and the motif is important to the structure and function of certain classes of fibrous structural proteins, motor proteins, transcription factors and membrane fusion proteins. Prediction of coiled coils in proteins can be used to identify putative oligomerization domains, to postulate functional mechanisms and to map sequence onto structure at a high level of detail. Moreover, such predictions are a necessary first step in understanding coiled–coil interactions. Thus, efficient and highly accurate methods for predicting coiled coils are important for annotating the data that result from genome sequencing projects. Sequence-based methods for predicting coiled coils, such as COILS, Paircoil, MultiCoil and MARCOIL, have been quite successful.
An initial coiled–coil database was constructed from sequences known to contain coiled coils, using information from structure and the literature. The coiled–coil regions were defined and annotated with the appropriate heptad register for the myosins, tropomyosins and paramyosins, for intermediate filament coiled coils, and for viral coat proteins, laminins, fibrinogens, heat shock factors and flagellins. In addition, a number of coiled–coil sequences that did not fit into these categories were detected and annotated by SOCKET from the 2002 version of the Protein Data Bank (PDB). The final coiled–coil database was generated from this initial database by adding homologous sequences from the NCBI NR database. PSI-BLAST was run for four iterations, using an E-value cutoff of 10. The BLAST sequence alignments were used to define the coiled–coil regions and assign heptad registers, which were verified with Paircoil. Where the Paircoil-derived and alignment-derived registers disagreed, assignments were made manually. The database was filtered to 90% sequence identity with CD-HIT. Seven residues were removed from each side of a skip in the heptad register or a gap in the alignment to avoid introducing non-coiled–coil residues into the database. To further eliminate possible non-coiled–coil residues, seven residues were removed on each side of proline residues, and from the beginning and end of each coiled–coil region in all sequences. All regions with at least 28 contiguous coiled–coil residues were included in the final coiled–coil database of 1371 protein chains, containing 95 517 coiled–coil residues.
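The trimming and length filter described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the function name `trim_coiled_coil_regions`, the half-open `(start, end)` region format and the choice to discard the proline itself along with its flanking residues are assumptions made for illustration, and the handling of heptad skips and alignment gaps is omitted.

```python
from typing import List, Tuple


def trim_coiled_coil_regions(
    sequence: str,
    regions: List[Tuple[int, int]],
    margin: int = 7,
    min_length: int = 28,
) -> List[Tuple[int, int]]:
    """Return trimmed coiled-coil regions as half-open (start, end) pairs.

    Hypothetical helper: regions are 0-based, half-open annotations on
    `sequence`; the margin and minimum length follow the text (7 and 28).
    """
    keep = [False] * len(sequence)

    # Drop `margin` residues from the beginning and end of each region.
    for start, end in regions:
        for i in range(start + margin, end - margin):
            keep[i] = True

    # Drop `margin` residues on each side of every proline
    # (assumption: the proline itself is dropped as well).
    for pos, residue in enumerate(sequence):
        if residue == "P":
            lo = max(0, pos - margin)
            hi = min(len(sequence), pos + margin + 1)
            for i in range(lo, hi):
                keep[i] = False

    # Collect contiguous runs of kept residues that are long enough.
    kept: List[Tuple[int, int]] = []
    run_start = None
    for i, flag in enumerate(keep + [False]):  # sentinel closes the last run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_length:
                kept.append((run_start, i))
            run_start = None
    return kept
```

For an 80-residue annotated region with a single proline at position 30, this sketch discards the short N-terminal fragment and keeps only the C-terminal stretch, since only the latter still has at least 28 contiguous residues after trimming.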
PCOILS incorporates the new data as training data and was extended to allow window sizes of both 21 and 28 residues. The algorithm runs in time linear in the length of the input sequence. Confidence is reported as a P-score, which measures the percentage of non-coiled–coil residues in PDB-minus that score better than a given PCOILS raw score. The score distribution of PDB-minus is closely approximated by a Gaussian, so the P-score is calculated as the area under this curve to the right of the raw score. PCOILS performs extremely well in leave-family-out cross-validation on the coiled–coil database. For each cross-validation run, the sequences in one coiled–coil family were placed in the test set, along with half of the sequences in PDB-minus, selected randomly. The remaining coiled coils were used to train PCOILS. Although P-scores and specificity both reflect false-positive rates, P-scores are defined using all of PDB-minus, whereas specificity is evaluated during testing. Performance on the cross-validation test is very similar with the 21-residue window.
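As an illustration of the two procedures described above, the following Python sketch computes a P-score as the upper-tail area of a Gaussian fitted to background scores and generates leave-family-out splits. It is a minimal sketch under stated assumptions, not the published implementation: the names `fit_gaussian`, `p_score` and `leave_family_out` are hypothetical, the P-score is returned as a fraction rather than a percentage, and the text does not say how the other half of PDB-minus is used, so the sketch simply leaves it out.

```python
import math
import random
from statistics import fmean, pstdev
from typing import Dict, Iterator, List, Sequence, Tuple


def fit_gaussian(background_scores: Sequence[float]) -> Tuple[float, float]:
    """Fit a Gaussian (mean, standard deviation) to background raw scores."""
    return fmean(background_scores), pstdev(background_scores)


def p_score(raw_score: float, mean: float, std: float) -> float:
    """Area under the fitted Gaussian to the right of raw_score (0 to 1)."""
    z = (raw_score - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def leave_family_out(
    families: Dict[str, List[str]],
    negatives: Sequence[str],
    seed: int = 0,
) -> Iterator[Tuple[str, List[str], List[str], List[str]]]:
    """Yield (held_out_family, train_positives, test_positives, test_negatives).

    Assumption: `families` maps a family name to its sequence identifiers and
    `negatives` lists the background (PDB-minus) sequence identifiers.
    """
    rng = random.Random(seed)
    for held_out in sorted(families):
        # Train on every coiled-coil family except the held-out one.
        train_positives = [
            seq_id
            for name in sorted(families)
            if name != held_out
            for seq_id in families[name]
        ]
        # Test on the held-out family plus a random half of the negatives.
        test_negatives = rng.sample(list(negatives), len(negatives) // 2)
        yield held_out, train_positives, list(families[held_out]), test_negatives
```

Under this sketch, a raw score equal to the background mean receives a P-score of 0.5, and the P-score decreases toward 0 as the raw score moves into the right tail of the background distribution.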