The CCDS project is a collaborative effort to identify a core set of human and mouse protein-coding regions that are consistently annotated and of high quality. The resource can be searched for information on identified human protein-coding sequences. The CCDS set includes coding regions that are annotated as full-length (with an initiating ATG and a valid stop codon), can be translated from the genome without frameshifts, and use consensus splice sites. The long-term goal is to support convergence towards a standard set of gene annotations. The number and type of quality tests performed may be expanded in the future; they currently include analyses to identify putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts, and protein homology.
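The inclusion criteria above can be illustrated with a minimal Python sketch. It is not the project's actual test suite: the function names are hypothetical, a CDS length divisible by three is used as a simple stand-in for "no frameshifts", and only the most common GT...AG intron boundary is treated as a consensus splice site.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_full_length_cds(cds: str) -> bool:
    """Check that a CDS starts with ATG, ends with a stop codon,
    has a length divisible by three, and has no internal stop codon."""
    cds = cds.upper()
    # Need at least a start and a stop codon; a length that is not a
    # multiple of three is used here as a proxy for a frameshift.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # A stop codon before the last codon would truncate the protein.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def has_consensus_splice_sites(intron: str) -> bool:
    """Check for GT...AG intron boundaries (simplifying assumption:
    other accepted boundary pairs are not considered)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


print(is_full_length_cds("ATGAAATAA"))
print(has_consensus_splice_sites("GTAAGTTTTCAG"))
```

Keeping each criterion in its own function makes it straightforward to add further tests later, which matches the statement that the set of quality tests may grow.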
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, they use different methods, which can result in similar but not identical representations of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates manual review of protein annotations that are inconsistent between sites, as well as of annotations for which new evidence suggests a revision is needed.
A CCDS report page includes links to genome browsers in the ‘Chromosomal Locations’ section that display the genomic span of the coding sequence (‘Genome Browser links’) or the genomic span of individual coding exons.
The primary data represented by a CCDS ID are the chromosome coordinates of the annotated protein-coding exons and the nucleotide and conceptually translated protein sequence obtained from those coordinates.
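As a rough illustration of how the nucleotide and protein sequences follow from exon coordinates, the sketch below joins exon slices of a chromosome sequence and translates the result with the standard genetic code. The function names and the toy sequence are hypothetical, and the coordinate convention (0-based, half-open intervals) is an assumption made for this example rather than a statement about how CCDS stores its data.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def cds_from_exons(chrom_seq, exons, strand="+"):
    """Join coding-exon slices into one CDS string.

    exons: iterable of (start, end) pairs, assumed 0-based and half-open.
    For the minus strand the joined sequence is reverse-complemented.
    """
    seq = "".join(chrom_seq[start:end] for start, end in sorted(exons)).upper()
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def translate(cds):
    """Conceptually translate a CDS; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


# Toy chromosome: two coding exons separated by a GT...AG intron.
chrom = "CC" + "ATGAAA" + "GTAAGTCAG" + "GGGTAA" + "CC"
cds = cds_from_exons(chrom, [(2, 8), (17, 23)])
print(cds, translate(cds))
```

Because both sequences are derived from the coordinates, two annotations with identical coding-exon coordinates on the same genome assembly necessarily yield the same CDS and protein.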
- human protein annotation
- human genome annotation
- human protein coding sequence prediction
- automated annotation methods
- quality assessment
- manual curation
- identification of loci for which additional experimental validation is needed
- large-scale epigenomic studies
- exome projects and exon array design
Annotation of genes on the human genome is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical. The human genome sequence is now sufficiently stable to start identifying those gene placements that are identical, and to make those data public and supported as a core set by the three major public human genome browsers.
The CCDS set is built by consensus among the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), and the University of California at Santa Cruz (UCSC). Communication among the CCDS collaborating groups is an ongoing activity that will resolve differences and identify refinements between CCDS update cycles. All changes to existing CCDS genes are done by collaboration agreement; no single group will change the set unilaterally.
The general process flow for defining the CCDS gene set includes:
- compare genome annotation results
- identify annotated coding regions that have identical location coordinates on the genome
- evaluate the quality of those coding regions
- remove lower-quality CDSs from the core set pending additional review among the collaborating groups
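The process flow above can be sketched in Python as a set intersection followed by a quality filter. This is only an illustration under stated assumptions: the `CDS` record, both function names, and the example quality check are hypothetical, and the real pipeline applies a broader set of tests and collaborative review.

```python
from typing import Callable, Iterable, NamedTuple, Set, Tuple


class CDS(NamedTuple):
    """A coding region identified by its location (hypothetical record)."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # coding-exon (start, end) pairs


def identical_cds(set_a: Iterable[CDS], set_b: Iterable[CDS]) -> Set[CDS]:
    """Coding regions annotated with identical coordinates in both sets."""
    return set(set_a) & set(set_b)


def split_by_quality(
    candidates: Iterable[CDS], passes_qc: Callable[[CDS], bool]
) -> Tuple[Set[CDS], Set[CDS]]:
    """Return (accepted, held_for_review) according to a quality check."""
    candidates = set(candidates)
    accepted = {cds for cds in candidates if passes_qc(cds)}
    return accepted, candidates - accepted


# Two toy annotation sets that agree on one CDS and differ on another.
annot_a = [CDS("chr1", "+", ((100, 200), (300, 400))), CDS("chr2", "-", ((50, 80),))]
annot_b = [CDS("chr1", "+", ((100, 200), (300, 400))), CDS("chr2", "-", ((50, 83),))]

shared = identical_cds(annot_a, annot_b)
# Example check only: total coding length must be a multiple of three.
accepted, held = split_by_quality(
    shared, lambda c: sum(end - start for start, end in c.exons) % 3 == 0
)
print(len(shared), len(accepted), len(held))
```

Using hashable tuples for the coordinates lets exact agreement be tested by plain set intersection; regions that fail the check are returned separately rather than discarded, mirroring removal "pending additional review".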