Genome sequencing and structural genomics initiatives have produced a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimentally derived functional annotations. Computational approaches that can predict protein function are therefore essential for bridging this widening annotation gap.
PIRSF is a resource that classifies full-length proteins. It applies a set of rules to define primary and curated clusters, drawing also on textual information (protein names, literature) and parent-child relationships. These clusters (named superfamilies) are further divided into those with full-length similarity (that is, a common domain architecture) and those sharing an ancestral domain. PIRSF covers more than two-thirds of the protein sequence space.
Sequence annotations describe regions or sites of interest in the protein sequence, such as post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics.
Family and superfamily classification also serves as the basis for rule-based procedures that provide rich automatic functional annotation among homologous sequences and perform integrity checks. By combining classification information with sequence patterns or profiles, rules have been defined to predict position-specific sequence features such as active sites, binding sites, modification sites, and sequence motifs. Family-specific patterns for such features are derived from alignments of closely related sequences, some of which have experimentally determined properties. Studying proteins at the domain level, in turn, allows more accurate functional inference and is useful for predicting the function of novel domain combinations that may give rise to new protein functions.
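The rule-based idea described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the procedure used by PIRSF or any other resource: a rule fires only when a protein belongs to the required family and its sequence matches a pattern. The class `SiteRule`, the function `predict_sites`, and the family identifier `FAM_PLOOP_EXAMPLE` are hypothetical names introduced for illustration. The pattern is the Walker A (P-loop) motif, `[AG]-x(4)-G-K-[ST]`, written as a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRule:
    """A hypothetical position-specific rule: family membership plus a pattern."""
    family_id: str  # family the protein must belong to (made-up identifier)
    feature: str    # feature type to annotate when the rule fires
    pattern: str    # regular expression over one-letter amino acid codes


# Walker A / P-loop motif [AG]-x(4)-G-K-[ST]; the family ID is illustrative only.
RULES = [SiteRule("FAM_PLOOP_EXAMPLE", "Nucleotide binding", r"[AG].{4}GK[ST]")]


def predict_sites(sequence, family_ids, rules=RULES):
    """Return (feature, start, end) tuples, 1-based inclusive, for rules that apply."""
    hits = []
    for rule in rules:
        if rule.family_id not in family_ids:
            continue  # classification gate: skip rules defined for other families
        for match in re.finditer(rule.pattern, sequence):
            hits.append((rule.feature, match.start() + 1, match.end()))
    return hits


seq = "MSAGESGAGKTLLAQ"  # toy sequence, not a real protein
print(predict_sites(seq, {"FAM_PLOOP_EXAMPLE"}))
print(predict_sites(seq, {"SOME_OTHER_FAMILY"}))
```

Gating the pattern on family membership is what keeps a short motif from being reported in unrelated proteins where it occurs by chance.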
The process of functional annotation involves assessing available evidence and reaching a conclusion about what we think the protein is doing in the cell and why.
- Functional annotations should only be as specific as the supporting evidence allows.
- All evidence that led to an annotation conclusion must be stored.
- In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
Basic set of protein annotations
- Protein name: descriptive common name for the protein, e.g. “ribokinase”
- Gene symbol: mnemonic abbreviation for the gene, e.g. “recA”
- EC number: only applicable to enzymes, e.g. 2.7.1.15 (ribokinase)
- Role: what the protein is doing in the cell and why e.g. “amino acid biosynthesis”
- Supporting evidence: accession numbers of BER and HMM matches, TMHMM, SignalP and LipoP results, and any other information used to make the annotation
- Unique identifier: e.g. locus IDs
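The basic annotation set listed above maps naturally onto a small record type. The following is a minimal sketch, assuming a Python representation. The class names `Evidence` and `ProteinAnnotation`, the `problems` method, the locus ID and the evidence accession are all hypothetical placeholders, and the role text is illustrative only. The EC check is a loose format test (four dot-separated fields, with `-` allowed where the evidence does not support a more specific class); it does not validate against the enzyme nomenclature itself.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Loose format check only: four dot-separated fields, where later fields may
# be "-" when the evidence supports only a partial classification ("2.7.1.-").
EC_PATTERN = re.compile(r"^\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)$")


@dataclass
class Evidence:
    source: str     # kind of evidence, e.g. "HMM", "BER", "SignalP"
    accession: str  # accession or result identifier (placeholder values below)


@dataclass
class ProteinAnnotation:
    locus_id: str
    protein_name: str
    role: str
    gene_symbol: Optional[str] = None
    ec_number: Optional[str] = None  # None for non-enzymes
    evidence: List[Evidence] = field(default_factory=list)

    def problems(self):
        """Return human-readable problems; an empty list means the record passes."""
        found = []
        if self.ec_number is not None and not EC_PATTERN.match(self.ec_number):
            found.append(f"malformed EC number: {self.ec_number}")
        if not self.evidence:
            found.append("no supporting evidence stored")
        return found


record = ProteinAnnotation(
    locus_id="LOCUS_0001",                 # placeholder locus ID
    protein_name="ribokinase",
    role="ribose utilization",             # illustrative role text
    gene_symbol="rbsK",
    ec_number="2.7.1.15",
    evidence=[Evidence("HMM", "HMM_PLACEHOLDER_1")],  # placeholder accession
)
print(record.problems())
```

Keeping the evidence list on the record itself, and flagging records that lack it, reflects the principle that every annotation conclusion should be traceable to what supported it.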
Annotation of proteins provides the following useful information:
- Initiator methionine
- Transit peptide
- Topological domain
- Calcium binding
- Zinc finger
- DNA binding
- Nucleotide binding
- Coiled coil
- Compositional bias
- Active site
- Metal binding
- Binding site
Amino acid modifications
- Non-standard residue
- Modified residue
- Disulfide bond
- Alternative sequence
- Natural variant
- Sequence uncertainty
- Sequence conflict
- Non-adjacent residues
- Non-terminal residue
- Beta strand
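Feature types such as those listed above are usually recorded together with the sequence positions they cover. The sketch below shows one possible representation, assuming 1-based inclusive coordinates. The class `SequenceFeature`, the helper `features_at`, the toy sequence and the feature positions are all hypothetical and not taken from any real entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceFeature:
    feature_type: str  # e.g. "Active site", "Zinc finger", "Beta strand"
    start: int         # 1-based, inclusive (an assumption of this sketch)
    end: int           # 1-based, inclusive
    note: str = ""

    def residues(self, sequence):
        """Return the residues that the feature covers."""
        if not (1 <= self.start <= self.end <= len(sequence)):
            raise ValueError(
                f"{self.feature_type} {self.start}-{self.end} is outside the sequence"
            )
        return sequence[self.start - 1:self.end]


def features_at(position, features):
    """List the types of all features covering a given 1-based position."""
    return [f.feature_type for f in features if f.start <= position <= f.end]


sequence = "MKCLLVAHHGSCE"  # toy sequence, not a real protein
features = [
    SequenceFeature("Initiator methionine", 1, 1),
    SequenceFeature("Beta strand", 3, 7),
    SequenceFeature("Active site", 8, 8),
]
print(features[1].residues(sequence))
print(features_at(8, features))
```

Features that link non-adjacent positions, such as disulfide bonds, would need a pair of positions rather than a range; that case is left out of this sketch.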