CD-HIT, which stands for Cluster Database at High Identity with Tolerance, is a widely used program for clustering and comparing protein or nucleotide sequences. It was originally developed by Dr. Weizhong Li in Dr. Adam Godzik’s lab at the Burnham Institute.
CD-HIT is very fast and can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks, and it helps in understanding the structure of the data and correcting bias within a dataset.
The CD-HIT package can perform jobs such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), and generating protein families.
CD-HIT clusters proteins that meet a similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format. The output is a fasta file of representative sequences and a text file listing the clusters.
CD-HIT-EST clusters nucleotide sequences that meet a similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format. The output is a fasta file of representative sequences and a text file listing the clusters.
Multiple CD-HIT runs. Proteins are first clustered at a high identity (such as 90%), and the resulting non-redundant sequences are then clustered at a lower identity (such as 60%). A third clustering step can be performed at an even lower identity. A multi-step run is more efficient and more accurate than a single run.
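A multi-step run can be sketched as a chain of `cd-hit` invocations in which each step reads the representatives written by the previous one. The sketch below only builds the argument lists and does not execute anything. The `-i`, `-o`, `-c` and `-n` options are cd-hit's input, output, identity threshold and word size; the word-size ranges follow the guidance in the CD-HIT user guide and should be checked against the installed version. The function names and the `.nrNN` output naming are assumptions made for illustration.

```python
def word_size_for(identity):
    """Word size (-n) suggested for a protein identity threshold.

    Ranges as given in the CD-HIT user guide; verify for your version.
    """
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("no word size suggested below 0.4 in this sketch")


def multi_step_commands(fasta, thresholds):
    """Build one cd-hit command per threshold, highest identity first.

    Each step takes the representative file of the previous step as input.
    Output names (<fasta>.nr90, <fasta>.nr60, ...) are hypothetical.
    """
    commands = []
    current = fasta
    for identity in sorted(thresholds, reverse=True):
        output = f"{fasta}.nr{round(identity * 100)}"
        commands.append([
            "cd-hit",
            "-i", current,
            "-o", output,
            "-c", str(identity),
            "-n", str(word_size_for(identity)),
        ])
        current = output
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` on a machine where `cd-hit` is installed.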
CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a certain threshold. The inputs are two protein datasets (db1, db2) in fasta format, and the outputs are two files: a fasta file of proteins in db2 that are not similar to db1, and a text file that lists similar sequences between db1 and db2.
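The two outputs of such a comparison can be illustrated with a toy version in pure Python. This is a sketch of the idea only, not CD-HIT-2D's implementation: the identity measure is a deliberately crude ungapped, left-aligned position count standing in for a real alignment, and the function names and the dictionary inputs (sequence name to sequence) are assumptions.

```python
def ungapped_identity(a, b):
    """Toy identity: matching positions divided by the shorter length.

    No gaps and no alignment search; a stand-in for a real alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def compare_2d(db1, db2, threshold):
    """Split db2 into sequences that are novel and sequences that match db1.

    Returns (novel, matches): `novel` maps db2 names to sequences with no
    db1 hit at the threshold, `matches` maps db2 names to the first db1
    name that reaches the threshold.
    """
    novel = {}
    matches = {}
    for name2, seq2 in db2.items():
        hit = next(
            (name1 for name1, seq1 in db1.items()
             if ungapped_identity(seq1, seq2) >= threshold),
            None,
        )
        if hit is None:
            novel[name2] = seq2
        else:
            matches[name2] = hit
    return novel, matches
```

Here `novel` plays the role of the fasta file of db2 proteins not similar to db1, and `matches` plays the role of the text file listing similar sequences.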
Although cd-hit is very fast, clustering is still very computationally intensive. The program (cd-hit) takes a fasta format sequence database as input and produces a set of ‘non-redundant’ (nr) representative sequences as output. In addition, cd-hit outputs a cluster file that documents the sequence ‘groupies’ for each nr representative. The idea is to reduce the overall size of the database without losing sequence information, by removing only ‘redundant’ (highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit produces a set of closely related protein families from a given fasta sequence database.
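To make the cluster file concrete, the following sketch reads it into a list of clusters. It assumes the commonly seen `.clstr` layout, in which a `>Cluster N` header is followed by one line per member, the representative is marked with `*`, and the other members carry an `at NN%` identity. The sample text and the function name are hypothetical, and the layout should be checked against real output before relying on it.

```python
import re

# Matches the member part of a line such as "1\t118aa, >seqB... at 95.00%".
MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s+.+)$")

# Hypothetical sample in the assumed .clstr layout.
SAMPLE = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 95.00%\n"
    ">Cluster 1\n"
    "0\t80aa, >seqC... *\n"
)


def parse_clstr(text):
    """Return a list of {'representative': name, 'members': [names]} dicts."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append({"representative": None, "members": []})
            continue
        match = MEMBER.search(line)
        if match is None or not clusters:
            continue  # ignore lines that do not fit the assumed layout
        name, tag = match.groups()
        clusters[-1]["members"].append(name)
        if tag == "*":
            clusters[-1]["representative"] = name
    return clusters
```

Calling `parse_clstr(SAMPLE)` groups `seqA` and `seqB` under the representative `seqA` and leaves `seqC` in a cluster of its own.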
CD-HIT uses a ‘longest sequence first’ list removal algorithm to remove sequences above a certain identity threshold. Additionally, the algorithm uses a very fast heuristic to find high identity segments between sequences, and so can avoid many costly full alignments. With recent developments, the cd-hit package offers new programs for DNA sequence clustering and for comparing two databases. It also has many new options for clustering control.
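The ‘longest sequence first’ idea, combined with a cheap filter that skips expensive comparisons, can be sketched as follows. This is a toy illustration under stated assumptions, not CD-HIT's code: the comparison is an ungapped, left-aligned position count rather than a real alignment, and the filter is a simple shared-word count chosen for illustration (each mismatch can destroy at most `k` overlapping words, so a sequence that meets the threshold must still share a minimum number of words with the representative). All function names and defaults are hypothetical.

```python
import math


def words(seq, k):
    """Set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def min_matches(length, threshold):
    """Fewest matching positions that still reach the identity threshold."""
    return math.ceil(threshold * length - 1e-9)


def passes_word_filter(rep_words, seq, threshold, k):
    """Cheap check: could seq possibly reach the threshold against this rep?"""
    windows = len(seq) - k + 1
    max_mismatches = len(seq) - min_matches(len(seq), threshold)
    needed = windows - k * max_mismatches
    shared = sum(seq[i:i + k] in rep_words for i in range(windows))
    return shared >= needed


def is_similar(rep, seq, threshold):
    """Toy stand-in for an alignment: ungapped, left-aligned matches."""
    matches = sum(x == y for x, y in zip(rep, seq))
    return matches >= min_matches(len(seq), threshold)


def greedy_cluster(seqs, threshold=0.9, k=2):
    """Longest-first greedy clustering; returns {representative: [members]}."""
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters = {}
    rep_words = {}
    for name in order:
        seq = seqs[name]
        for rep in clusters:
            if not passes_word_filter(rep_words[rep], seq, threshold, k):
                continue  # too few shared words: skip the costly comparison
            if is_similar(seqs[rep], seq, threshold):
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar: start a new cluster.
            clusters[name] = [name]
            rep_words[name] = words(seq, k)
    return clusters
```

Because sequences are processed from longest to shortest, every representative is at least as long as the members assigned to it, and a sequence is compared in full only against representatives that survive the word filter.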