The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB, including isoforms) and selected UniProt Archive (UniParc) records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
In biological databases, redundant protein sequences slow down sequence similarity searches and make the interpretation and analysis of results difficult. Clustering the protein sequence space by sequence similarity organizes all sequences into manageable datasets and reduces sampling bias and the overrepresentation of sequences.
UniRef provides three clustering levels at decreasing sequence identity thresholds (100% > 90% > 50%), which give its component databases their names: UniRef100, UniRef90 and UniRef50. Together they form a non-redundant reference database. Unlike in UniParc, sequence fragments are merged in UniRef, in the following way:
The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues from any organism into a single UniRef entry, showing the sequence of a representative protein, the accession numbers of all the merged entries and links to the corresponding UniProtKB and UniParc records.
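The merging rule described for UniRef100 can be illustrated with a minimal sketch. It assumes that a "sub-fragment" means an exact contiguous substring of a longer sequence and that the longest sequence is shown as the representative; the function name, the record format and the entry layout are hypothetical and are not taken from the UniRef pipeline.

```python
MIN_LENGTH = 11  # residues; shorter sequences are left out, as stated in the text


def merge_identical_and_fragments(records):
    """Group (accession, sequence) pairs into UniRef100-style entries.

    A sequence joins an existing entry when it is identical to, or an exact
    contiguous sub-fragment of, that entry's representative sequence
    (an assumption about what "sub-fragment" means).
    """
    eligible = [(acc, seq) for acc, seq in records if len(seq) >= MIN_LENGTH]
    # Longest first, so a fragment is always processed after its parent.
    eligible.sort(key=lambda record: len(record[1]), reverse=True)

    entries = []  # each entry: {"representative": str, "members": [accession, ...]}
    for acc, seq in eligible:
        for entry in entries:
            if seq in entry["representative"]:
                entry["members"].append(acc)
                break
        else:
            # No existing representative contains this sequence: new entry.
            entries.append({"representative": seq, "members": [acc]})
    return entries


# Hypothetical accessions and sequences, for illustration only.
example = [
    ("ACC1", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("ACC2", "MKTAYIAKQRQISFVKSHFSRQ"),  # identical to ACC1
    ("ACC3", "AKQRQISFVKSH"),            # 12-residue sub-fragment of ACC1
    ("ACC4", "MKT"),                     # shorter than 11 residues, skipped
    ("ACC5", "GGGGGGGGGGGGGGG"),         # unrelated sequence
]
entries = merge_identical_and_fragments(example)
```

In this sketch each entry keeps the accessions of all merged members, mirroring the way a UniRef100 entry lists the accession numbers of the records it combines. The nested loop is quadratic and is only meant to show the rule, not to scale.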
UniRef90 is built by clustering UniRef100 sequences with 11 or more residues using the MMseqs2 algorithm, such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster (the seed sequence).
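The membership criterion can be written as a simple two-threshold check. The sketch below assumes that identity is the fraction of identical positions within the aligned region and that overlap is the fraction of the longest sequence covered by the alignment; the text does not spell out these definitions, and the function and parameter names are hypothetical. It does not call MMseqs2.

```python
def meets_cluster_criteria(n_identical, aligned_length, seed_length,
                           min_identity=0.90, min_overlap=0.80):
    """Return True if an alignment to the seed satisfies both thresholds.

    n_identical    -- identical positions in the alignment
    aligned_length -- number of seed residues covered by the alignment
    seed_length    -- length of the longest sequence in the cluster
    """
    if aligned_length <= 0 or seed_length <= 0:
        return False
    identity = n_identical / aligned_length   # assumed definition of identity
    overlap = aligned_length / seed_length    # assumed definition of overlap
    return identity >= min_identity and overlap >= min_overlap


# 95 identical positions over 100 aligned residues of a 120-residue seed:
# identity 0.95, overlap about 0.83, so both thresholds are met.
print(meets_cluster_criteria(95, 100, 120))
```

The identity threshold is a parameter so that the same check can be reused at a different resolution, for example `min_identity=0.50`.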
UniRef50 is built by clustering UniRef90 seed sequences such that each cluster is composed of sequences that have at least 50% sequence identity to, and 80% overlap with, the longest sequence in the cluster. Before 2013 there was no overlap threshold, so clusters were more heterogeneous in length.
UniRef90 and UniRef50 reduce the database size by approximately 58% and 79%, respectively, which allows significantly faster sequence similarity searches.
The seed sequence is the longest member of a cluster. However, the longest sequence is not always the most informative: other cluster members often carry more biologically relevant information.
All the proteins in a cluster are therefore ranked with the following priority to facilitate the selection of a biologically relevant representative for the cluster:
- quality of the entry
- annotation score
- length of the sequence
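The ranking above amounts to a lexicographic comparison, which can be sketched with a tuple sort key. This is an illustration only: the `Member` fields are hypothetical, "quality of the entry" is reduced to a reviewed/unreviewed flag as an assumption, and higher annotation scores and longer sequences are assumed to rank higher.

```python
from dataclasses import dataclass


@dataclass
class Member:
    accession: str
    reviewed: bool         # stand-in for "quality of the entry" (assumption)
    annotation_score: int  # higher is assumed to be better
    length: int            # sequence length in residues


def pick_representative(members):
    """Return the member that ranks highest on quality, then score, then length."""
    return max(members, key=lambda m: (m.reviewed, m.annotation_score, m.length))


# Hypothetical cluster: the seed is the longest member, but a shorter,
# reviewed member is chosen as the representative.
cluster = [
    Member("ACC1", reviewed=False, annotation_score=1, length=512),
    Member("ACC2", reviewed=True, annotation_score=5, length=498),
    Member("ACC3", reviewed=True, annotation_score=3, length=505),
]
seed = max(cluster, key=lambda m: m.length)
representative = pick_representative(cluster)
print(seed.accession, representative.accession)
```

Python compares tuples element by element, so a later criterion only matters when all earlier ones are tied, which matches a priority-ordered ranking.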
Uses of UniRef
- Speeding up sequence similarity searches
- Reducing sampling bias in homology searches by providing a more even coverage of sequence space
- Using the clusters for protein family prediction
- Using the clusters to check the consistency of UniProtKB annotations
- Using the clusters to annotate other sequence databases
- Database coverage
- size reduction
- cluster distribution