UniProt provides a Proteomes database of protein sets from organisms whose genomes have been completely sequenced. A proteome is the set of proteins thought to be expressed by an organism. The majority of UniProt proteomes are based on the translation of a completely sequenced genome submitted to the International Nucleotide Sequence Database Consortium (INSDC), and include sequences that derive from extrachromosomal elements such as plasmids or organellar genomes in organisms where these occur.
Each proteome is assigned a unique proteome identifier (UPID), which identifies the set of proteins that form the proteome in the Proteomes database. Most UPIDs consist of the characters “UP” followed by 9 digits, and a UPID can be used to cite a UniProt proteome. A UniProtKB entry can be linked to one or more UPIDs. The Proteomes subsection of the Names and taxonomy section is present for entries that are part of a proteome, i.e. of a set of proteins thought to be expressed by an organism whose genome has been completely sequenced. For each proteome, the information displayed in this subsection consists of the proteome ID (UPID) and a component name.
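The identifier shape described above can be checked mechanically. The following is a minimal sketch, assuming the common form of “UP” followed by exactly 9 digits; since the text says only that UPIDs mostly follow this form, a non-match does not prove an identifier is invalid. The function names `is_upid` and `find_upids` are hypothetical helpers, not part of any UniProt tool.

```python
import re
from typing import List

# Assumed pattern: "UP" followed by exactly 9 digits (the common UPID shape).
UPID_PATTERN = re.compile(r"UP\d{9}")


def is_upid(identifier: str) -> bool:
    """Return True if the whole string matches the common UPID shape."""
    return UPID_PATTERN.fullmatch(identifier) is not None


def find_upids(text: str) -> List[str]:
    """Return all substrings of text that look like UPIDs, in order."""
    # Word boundaries avoid matching inside longer alphanumeric tokens.
    return re.findall(r"\bUP\d{9}\b", text)
```

For example, `is_upid("UP000005640")` holds, while a shortened string such as `"UP5640"` is rejected.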
UniProt proteomes may include both manually reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries. The proportion of reviewed entries varies between proteomes and is greater for the proteomes of intensively curated model organisms. Complementary pipelines for the import of protein sequences have been developed in collaboration with Ensembl for vertebrate species, Ensembl Genomes for non-vertebrate species, WormBase ParaSite for parasitic nematodes and VectorBase for pathogen vector genomes. In addition, a new pipeline imports selected non-redundant genomes annotated by NCBI RefSeq. These sources provide proteome sequences for a number of key genomes of special interest where the INSDC submission lacks gene model annotation.
The Proteomes portal offers protein sequence sets obtained from the translation of completely sequenced genomes. Published genomes from NCBI Genome are brought into UniProt if they satisfy the following criteria:
- The genome is annotated and a set of coding sequences is available
- The number of predicted coding sequences falls within a statistically significant range of published proteomes from neighbouring species
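The two criteria above can be sketched as a simple filter. The text does not say how the “statistically significant range” is computed, so the rule used here (a z-score threshold against the coding-sequence counts of neighbouring species) is purely an illustrative assumption, as are the names `GenomeCandidate`, `cds_count_in_range`, `passes_import_criteria` and the default `max_z`. It is not UniProt's actual procedure.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Sequence


@dataclass
class GenomeCandidate:
    """Hypothetical record for a published genome under consideration."""
    name: str
    is_annotated: bool
    cds_count: int  # number of predicted coding sequences; 0 if none available


def cds_count_in_range(cds_count: int,
                       neighbour_counts: Sequence[int],
                       max_z: float = 2.0) -> bool:
    """Assumed rule: the count lies within max_z sample standard deviations
    of the mean count observed in neighbouring species."""
    if len(neighbour_counts) < 2:
        return False  # too few neighbours to estimate a spread
    mu = mean(neighbour_counts)
    sigma = stdev(neighbour_counts)
    if sigma == 0:
        return cds_count == mu
    return abs(cds_count - mu) / sigma <= max_z


def passes_import_criteria(candidate: GenomeCandidate,
                           neighbour_counts: Sequence[int],
                           max_z: float = 2.0) -> bool:
    """Apply both criteria: annotation with coding sequences, then count range."""
    has_cds = candidate.is_annotated and candidate.cds_count > 0
    return has_cds and cds_count_in_range(candidate.cds_count,
                                          neighbour_counts, max_z)
```

Under this assumed rule, a candidate with far fewer predicted coding sequences than its neighbours would be excluded even if it is annotated.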
All proteomes generated in this manner go through our Proteomes redundancy reduction pipeline. Proteomes can be retrieved via the Proteomes section of the UniProt website, which provides download links for various formats.
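Besides the website's download links, the sequences of a proteome can also be retrieved programmatically. The sketch below builds a download URL for a given UPID; the endpoint `https://rest.uniprot.org/uniprotkb/stream` and the `proteome:` query field are assumptions based on the public UniProt REST API and are not stated in the text above, so they should be checked against the current API documentation. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: streaming endpoint of the public UniProt REST API.
UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def proteome_download_url(upid: str, fmt: str = "fasta") -> str:
    """Build a URL requesting all UniProtKB entries linked to one proteome."""
    params = urlencode({"query": f"proteome:{upid}", "format": fmt})
    return f"{UNIPROT_STREAM_URL}?{params}"


def fetch_proteome(upid: str, fmt: str = "fasta") -> str:
    """Download the proteome as text (requires network access)."""
    with urlopen(proteome_download_url(upid, fmt)) as response:
        return response.read().decode("utf-8")
```

Keeping URL construction separate from the network call makes the former easy to inspect and test without contacting the server.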
With the remarkable increase in the number of completely sequenced genomes, and thus in the number of proteomes, it is important to organize these data so that users can navigate the growing number of available proteome sequences effectively. To meet this challenge, UniProt defines a set of “reference proteomes”, which serve as “landmarks” in proteome space.