Coiled coils are protein structural domains that mediate a wide range of biological interactions, and their reliable annotation is therefore crucial for studies of protein structure and function.
DeepCoil is a new neural network-based tool for detecting coiled-coil domains in protein sequences. DeepCoil significantly outperformed current state-of-the-art tools, such as PCOILS and Marcoil, in the prediction of both canonical and non-canonical coiled coils. Furthermore, in a scan of the human genome with DeepCoil, we detected many coiled-coil domains that remain undetected by other methods. This higher sensitivity makes DeepCoil a method of choice for accurate genome-wide detection of coiled-coil domains. DeepCoil is written in Python and uses the Keras machine learning library.
The dataset (structures and the corresponding sequences with per-residue annotations of the coiled-coil domains) was generated by running SOCKET (cutoff 7.4 Å) on crystallographic structures (biological assemblies) obtained from the PDB and clustered to a maximum pairwise sequence identity of 50% with BLASTClust. To increase the number of positive examples, i.e. structures containing coiled-coil domains, we preferentially selected structures with coiled-coil domains from each cluster. To ensure the quality of the dataset, it was filtered again with CD-HIT, an accurate tool for sequence clustering, to 50% sequence identity; low-resolution structures and sequences longer than 500 amino acid residues were also removed, resulting in a final set of 21 138 entries, of which 2125 contained at least one canonical or non-canonical coiled-coil segment. For each entry, a position-specific scoring matrix (PSSM) was generated by searching the nr90 database with PSI-BLAST.
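The cluster-based selection of positive examples can be sketched as follows. This is an illustrative re-implementation of the selection logic only, not the DeepCoil code; the function name, the entry IDs and the data layout are assumptions.

```python
# Hypothetical sketch: from each sequence cluster, preferentially keep
# an entry that contains a coiled-coil segment (per SOCKET annotation).
def select_representatives(clusters, has_coiled_coil):
    """clusters: list of lists of entry IDs; has_coiled_coil: ID -> bool."""
    selected = []
    for cluster in clusters:
        positives = [e for e in cluster if has_coiled_coil[e]]
        # Prefer a coiled-coil-containing structure if the cluster has one.
        selected.append(positives[0] if positives else cluster[0])
    return selected

clusters = [["1abc", "2def"], ["3ghi"], ["4jkl", "5mno"]]
cc = {"1abc": False, "2def": True, "3ghi": False, "4jkl": False, "5mno": False}
print(select_representatives(clusters, cc))  # ['2def', '3ghi', '4jkl']
```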
The nr90 database was generated from the NCBI non-redundant protein sequence database (nr) using MMseqs2, a tool for fast and sensitive clustering of large datasets. Since coiled coils generally contain a high proportion of low-complexity regions, we did not filter out low-complexity sequences from nr90, although such filtering is standard practice when creating reduced databases. Entries shorter than 500 residues in the final set were randomly zero-padded on either the left or the right side to a constant length of 500 and one-hot encoded, yielding 500 × 20 matrices (500 residues × 20 amino-acid types). Information stored in the PSSMs was zero-padded using the same procedure and encoded by transforming the values with a sigmoid function, likewise yielding matrices of size 500 × 20.
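The encoding step can be sketched with NumPy as below. This is a minimal illustration of the described procedure (random left/right zero-padding to length 500, one-hot encoding, sigmoid transform of PSSM scores), not the DeepCoil source; function names are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino-acid types
MAX_LEN = 500

def encode_sequence(seq, rng=np.random.default_rng()):
    """One-hot encode a sequence, zero-padded (randomly left or right)
    to a constant length of 500, yielding a 500 x 20 matrix."""
    one_hot = np.zeros((MAX_LEN, len(AMINO_ACIDS)))
    pad_left = rng.integers(0, MAX_LEN - len(seq) + 1)  # random split of padding
    for i, aa in enumerate(seq):
        one_hot[pad_left + i, AMINO_ACIDS.index(aa)] = 1.0
    return one_hot

def encode_pssm(pssm_scores):
    """Squash raw PSSM log-odds scores into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.asarray(pssm_scores, dtype=float)))
```

For DeepCoil_PSSM, the 500 × 20 sequence matrix and the 500 × 20 sigmoid-transformed PSSM matrix are concatenated along the feature axis into the 500 × 40 input described below.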
The DeepCoil neural network was implemented in Keras. It consists of two stacked convolutional layers, with 64 filters each, that scan the sequence with window sizes of 28 (first layer) and 21 (second layer). The convolutional layers are followed by a densely connected layer of 128 neurons and the output layer. ReLU activation functions were used for all layers except the output layer, where softmax was used. During training, two dropout layers (probabilities 0.5 and 0.25, respectively) were added after the two convolutional layers to avoid overfitting. Training was performed for 100 epochs with the Adam optimizer, categorical cross-entropy as the loss function, and a batch size of 64. From each cross-validation (CV) round, the best model (according to the F1-score) was selected, and the resulting five models were used to build the final ensemble predictor. This procedure was used to train two variants of the predictor: DeepCoil_SEQ, which uses only sequence data, and DeepCoil_PSSM, which uses sequence as well as profile data. DeepCoil_SEQ and DeepCoil_PSSM were trained on 500 × 20 matrices (encoded sequences) and 500 × 40 matrices (encoded sequences and PSSMs), respectively. In total, 10 438 structures containing 4140 uninterrupted coiled-coil regions were used to train DeepCoil.
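The described architecture can be sketched in Keras as follows. This is a minimal sketch based only on the hyperparameters stated above (two stacked 1-D convolutions with 64 filters and window sizes 28 and 21, dropout of 0.5 and 0.25 after them, a 128-neuron dense layer, softmax output); padding mode and other unstated details are assumptions, so this is not the exact DeepCoil implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_deepcoil_like(n_features=20, n_classes=2):
    """Sketch of a DeepCoil-like per-residue classifier.
    n_features = 20 for sequence-only input, 40 for sequence + PSSM."""
    model = keras.Sequential([
        layers.Input(shape=(500, n_features)),
        layers.Conv1D(64, 28, padding="same", activation="relu"),  # window 28
        layers.Dropout(0.5),
        layers.Conv1D(64, 21, padding="same", activation="relu"),  # window 21
        layers.Dropout(0.25),
        layers.Dense(128, activation="relu"),      # applied per position
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

With "same" padding the dense layers are applied position-wise, so the model emits a softmax over classes for each of the 500 residue positions, matching the per-residue nature of the coiled-coil annotation.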