MARCOIL is a hidden MARkov model-based program that predicts the existence and location of potential coiled-coil domains in protein sequences.
The utility of a prediction program is often best judged by the experience gathered in using it, and we would welcome comments or suggestions for improvements.
MARCOIL was tested under UNIX and Linux and can be compiled using the included makefile, which calls the gcc compiler. A file with Instructions for Installation and Usage is included in the distribution and can be read online.
Large-scale sequence data require methods for the automated annotation of protein domains. Many of the predictive methods are based either on a Position Specific Scoring Matrix (PSSM) of fixed length or on a window-less Hidden Markov Model (HMM). The performance of the two approaches is tested for Coiled-Coil Domains (CCDs). The prediction of CCDs is used frequently, and its optimization seems worthwhile.
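To make the idea of a fixed-length PSSM concrete, the following is a minimal sketch of window-based scoring over heptad registers. It is not the MARCOIL or PSSM implementation evaluated here: the function name `window_scores`, the window length of 28, and the propensity table format are assumptions chosen for illustration, and any propensity values passed in are hypothetical.

```python
import math

HEPTAD = "abcdefg"  # the seven positions of a coiled-coil heptad repeat


def window_scores(seq, propensity, window=28):
    """Return, for each residue, the best score of any window covering it.

    propensity maps (amino_acid, heptad_position) -> positive float.
    Pairs missing from the table are treated as neutral (1.0).
    The window length of 28 is an assumed default, not a value from the text.
    """
    n = len(seq)
    best = [float("-inf")] * n
    if n < window:
        return best  # no full window fits, so nothing is scored
    for start in range(n - window + 1):
        for register in range(7):
            # Mean log-propensity = log of the geometric mean over the window.
            total = 0.0
            for k in range(window):
                position = HEPTAD[(register + k) % 7]
                total += math.log(propensity.get((seq[start + k], position), 1.0))
            score = total / window
            # Every residue in the window inherits the score if it is the best so far.
            for i in range(start, start + window):
                if score > best[i]:
                    best[i] = score
    return best
```

The sketch shows the limitation that a window-less HMM is meant to avoid: a sequence shorter than the window receives no score at all, and every residue's score depends on the chosen window length.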
A cross-validated study suggests that MARCOIL improves predictions compared to the traditional PSSM algorithm, especially for some protein families and for short CCDs. Potential confounding factors such as differences in the dimension of parameter space and in the parameter values were avoided by using the same amino acid propensities and by keeping the transition probabilities of the HMM constant during cross-validation.
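As an illustration of how a window-less HMM assigns a per-residue probability without a fixed window, here is a minimal sketch of forward-backward posterior decoding. The two-state toy model used in the usage note is an assumption for illustration only and is not the MARCOIL state architecture; the function name `posteriors` and all parameter values are hypothetical.

```python
def posteriors(obs, init, trans, emit):
    """Posterior state probabilities for each position of obs.

    init[s]     : initial probability of state s
    trans[r][s] : transition probability from state r to state s
    emit[s]     : dict mapping symbol -> emission probability in state s
    All emission probabilities are assumed to be strictly positive.
    """
    n, k = len(obs), len(init)

    # Forward pass, rescaled at each position to avoid numerical underflow.
    fwd = []
    for i, symbol in enumerate(obs):
        if i == 0:
            row = [init[s] * emit[s][symbol] for s in range(k)]
        else:
            row = [
                sum(fwd[i - 1][r] * trans[r][s] for r in range(k)) * emit[s][symbol]
                for s in range(k)
            ]
        scale = sum(row)
        fwd.append([v / scale for v in row])

    # Backward pass, also rescaled; scaling cancels in the final normalization.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        row = [
            sum(trans[s][t] * emit[t][obs[i + 1]] * bwd[i + 1][t] for t in range(k))
            for s in range(k)
        ]
        scale = sum(row)
        bwd[i] = [v / scale for v in row]

    # Combine and normalize so that each position sums to one.
    result = []
    for i in range(n):
        row = [fwd[i][s] * bwd[i][s] for s in range(k)]
        total = sum(row)
        result.append([v / total for v in row])
    return result
```

With a toy model in which state 1 favors leucine, for example `posteriors("LLLLAAAA", [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [{"L": 0.2, "A": 0.8}, {"L": 0.8, "A": 0.2}])`, the posterior for state 1 is high at the start of the sequence and low at the end. Holding `trans` fixed while re-estimating only `emit` would correspond to the kind of control described above, under the assumptions of this sketch.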
The coiled-coil is a widespread protein structural motif known to have a stabilization function and to be involved in key interactions in cells and organisms. PS-COILS has been introduced to define a baseline approach for benchmarking new coiled-coil predictors. A new version of MARCOIL has been designed that can exploit evolutionary information in the form of sequence profiles. We show that the methods trained on sequence profiles perform better than the same methods trained and tested only on single sequences. Furthermore, a new structurally annotated and freely available dataset of coiled-coil structures has been created.
The only publicly available annotated dataset created for developing a predictor is the MARCOIL dataset of protein sequences. However, the MARCOIL authors themselves stated that the coiled-coil annotations in their database are not reliable. We generated our dataset of experimentally determined coiled-coil structures following the suggestion and considering only the intersection between the SCOP coiled-coil class and the output of the SOCKET program. Sequences shorter than 30 residues or with coiled-coil domains shorter than 9 residues were excluded. This lower limit was chosen because 9-residue domains are the shortest ones classified by MARCOIL. More specifically, among the protein chains that are labeled with a coiled-coil domain in SCOP, the subset for which SOCKET found at least one coiled-coil segment in that domain was retained. The final annotation of each coiled-coil segment is the one indicated by the SOCKET program. Furthermore, in order to test the different methods on a blind set, we selected a subset of 50 non-identical protein chains (S50) with sequence identity.
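The selection rules described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Chain` record, its field names, and the 0-based half-open coordinates are hypothetical, and the sketch assumes that a chain is discarded when any SOCKET segment inside its SCOP domain is shorter than 9 residues, which is one reading of the exclusion rule. It does not call SCOP or SOCKET; their annotations are taken as given inputs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MIN_SEQUENCE_LENGTH = 30  # sequences shorter than this are excluded
MIN_DOMAIN_LENGTH = 9     # coiled-coil domains shorter than this are excluded

Segment = Tuple[int, int]  # (start, end), 0-based, end exclusive (assumed convention)


@dataclass
class Chain:
    """Hypothetical record holding pre-computed annotations for one protein chain."""
    chain_id: str
    sequence: str
    scop_domain: Optional[Segment]  # coiled-coil domain from SCOP, or None
    socket_segments: List[Segment] = field(default_factory=list)


def select_chains(chains):
    """Keep chains supported by both SCOP and SOCKET; annotate with SOCKET segments."""
    kept = []
    for chain in chains:
        if chain.scop_domain is None:
            continue  # not labeled as coiled-coil in SCOP
        if len(chain.sequence) < MIN_SEQUENCE_LENGTH:
            continue  # sequence too short
        dom_start, dom_end = chain.scop_domain
        inside = [
            (s, e) for (s, e) in chain.socket_segments
            if s >= dom_start and e <= dom_end
        ]
        if not inside:
            continue  # SOCKET found no segment within the SCOP domain
        if any(e - s < MIN_DOMAIN_LENGTH for (s, e) in inside):
            continue  # assumed rule: any too-short domain excludes the chain
        kept.append((chain.chain_id, inside))
    return kept
```

The returned annotation uses the SOCKET segments rather than the SCOP domain boundaries, matching the statement that the final annotation is the one indicated by SOCKET.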