# Sequence Identity

The tendency to use the terms homology, similarity, and identity interchangeably persists in comparative biology. In immunology, conflating these three concepts complicates the precise definition of the self–nonself dichotomy.

Sequence identity is the number of characters that match exactly between two sequences. Under this definition, gaps are not counted and the measurement is relative to the shorter of the two sequences. The term is also used for the degree of similarity between two or more nucleotide sequences, generally in the form “percentage of nucleotide sequence identity”.

The percentage identity for two sequences can take many different values, depending on the denominator chosen.
For example, the number of identities can be divided by:

1. length of shortest sequence.
2. length of alignment.
3. mean length of sequence.
4. number of non-gap positions.
5. number of equivalenced positions excluding overhangs.
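
The effect of the denominator can be illustrated with a minimal Python sketch. It assumes the alignment is given as two equal-length strings with `-` as the gap character; the function name `percent_identities` and the result keys are hypothetical. The fifth denominator is computed under one possible reading of "excluding overhangs" (terminal gaps are trimmed, internal gaps are kept), which is an assumption rather than a standard definition.

```python
def percent_identities(aln_a: str, aln_b: str, gap: str = "-") -> dict[str, float]:
    """Percent identity of one pairwise alignment under five denominators."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned rows must have the same length")

    # Drop columns that are gaps in both rows (possible in rows cut from an MSA).
    cols = [(a, b) for a, b in zip(aln_a, aln_b) if not (a == gap and b == gap)]

    # Columns holding a residue in row A / row B.
    idx_a = [i for i, (a, _) in enumerate(cols) if a != gap]
    idx_b = [i for i, (_, b) in enumerate(cols) if b != gap]
    if not idx_a or not idx_b:
        raise ValueError("each row must contain at least one residue")

    # Both-gap columns are gone, so equal characters are always residues.
    identities = sum(1 for a, b in cols if a == b)
    non_gap = sum(1 for a, b in cols if a != gap and b != gap)

    # Assumed reading of "excluding overhangs": keep the columns between the
    # point where both sequences have started and the point where the first
    # one ends, so terminal gaps are trimmed but internal gaps still count.
    start = max(idx_a[0], idx_b[0])
    end = min(idx_a[-1], idx_b[-1])
    overlap = max(0, end - start + 1)

    denominators = {
        "shortest_sequence": min(len(idx_a), len(idx_b)),
        "alignment_length": len(cols),
        "mean_sequence_length": (len(idx_a) + len(idx_b)) / 2,
        "non_gap_positions": non_gap,
        "overlap_without_overhangs": overlap,
    }
    return {
        name: 100.0 * identities / d if d else 0.0
        for name, d in denominators.items()
    }


# Illustrative alignment (made up for this example).
result = percent_identities("ACGTAC-GT--", "--GTTCAGTAA")
for name, value in result.items():
    print(f"{name}: {value:.1f}%")
```

For this toy alignment, which has 5 identical columns, the reported identity ranges from about 45% (alignment length) to about 83% (non-gap positions), so a percentage is only meaningful alongside its denominator.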

More generally, sequence identity is the extent to which two (nucleotide or amino acid) sequences have the same residues at the same positions in an alignment, often expressed as a percentage.

Sequence identity is a way to measure the similarity between two sequences. For sequencing data, it is often thought of as the opposite of the sequencing error rate. When we say “the sequence divergence between two species is ABC” or “the sequencing error rate is XYZ”, we assume everyone knows how to compute identity. In fact, there is more than one way to compute it.

In gap-excluded identity, all gapped columns are excluded from the alignment, so the identity equals “#matches / (#matches + #mismatches)”. An obvious problem with this definition is that it ignores gaps entirely: an alignment with long insertions or deletions can still score as highly as one with none. Nevertheless, it is a commonly used definition.
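
A small Python sketch of gap-excluded identity, assuming the alignment is supplied as two equal-length strings with `-` marking gaps (the function name `gap_excluded_identity` is hypothetical). The second call shows the weakness of the definition: a 4-base gap leaves the score unchanged.

```python
def gap_excluded_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Return #matches / (#matches + #mismatches), skipping gapped columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned rows must have the same length")

    matches = 0
    mismatches = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # gapped column: ignored by this definition
        if a == b:
            matches += 1
        else:
            mismatches += 1

    total = matches + mismatches
    if total == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / total


# Illustrative alignments (made up for this example).
ungapped = gap_excluded_identity("ACGTACGT", "ACGTACGT")
with_gap = gap_excluded_identity("ACGT----ACGT", "ACGTTTTTACGT")
print(ungapped, with_gap)  # both are 1.0 even though the second has a 4-base gap
```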

## BLAST identity

BLAST identity is defined as the number of matching bases over the number of alignment columns. In a SAM file, the number of columns can be calculated by summing over the lengths of M/I/D CIGAR operators. The number of matching bases equals the column length minus the NM tag.
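
The SAM-based calculation can be sketched in Python as follows. The sketch assumes the CIGAR string and the NM value have already been extracted from the SAM record; the function name `blast_identity` is hypothetical. Counting the `=` and `X` operators as alignment columns alongside `M` is an addition of this sketch, made so that extended CIGAR strings are handled as well.

```python
import re

# One CIGAR operation: a length followed by an operator code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def blast_identity(cigar: str, nm: int) -> float:
    """BLAST identity = matching bases / alignment columns."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")

    # Alignment columns: M, I and D as described in the text; '=' and 'X'
    # are included as an assumption, to cover extended CIGAR strings.
    columns = sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        raise ValueError("CIGAR string contains no alignment columns")

    # NM is the edit distance: mismatches plus inserted and deleted bases.
    matches = columns - nm
    return matches / columns


# Illustrative values: 103 columns, edit distance 5, so 98 / 103.
print(f"{blast_identity('50M2I30M1D20M', 5):.4f}")
```

Soft-clipped (`S`) and hard-clipped (`H`) bases are not part of the alignment and therefore do not contribute columns.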

Interestingly, high sequence identity is a better indicator of functional similarity than a significant E-value. If two proteins share more than 70% sequence identity, they have a 90% or greater chance of sharing the same biological process at GO index levels 1–8.

Closely related species are expected to have a higher percent identity for a given sequence than would more distantly related species, and thus percent identity to a degree reflects relatedness. The percent identity of genomic DNA sequence, intron and exon sequence, and amino acid sequence between humans and other species varies by species type, with chimpanzees having the highest percent identity with humans of all species in each category.

Using nucleotide or protein sequences as queries, scientists can search public or customized nucleotide or protein databases using web‐based BLAST or UNIX‐based BLAST. UNIX‐based BLAST is less user‐friendly, but it is more flexible and more powerful because any sequence file can be used as a subject database. Web‐based BLAST generally provides more graphical results than UNIX‐based BLAST and is easier for beginners to understand. BLAST searches provide a quick assessment of sequence identities.
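
As a rough illustration of reading identities from a command-line search, the following Python sketch parses one line of BLAST+ tabular output. It assumes the search was run with the default 12-column tabular format (`-outfmt 6`); the function name `parse_blast_tabular_line` and the sample hit are hypothetical and made up for this example.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_TABULAR_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_blast_tabular_line(line: str) -> dict:
    """Parse one tab-separated hit line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST_TABULAR_FIELDS):
        raise ValueError(f"expected {len(BLAST_TABULAR_FIELDS)} columns, got {len(values)}")

    hit = dict(zip(BLAST_TABULAR_FIELDS, values))
    for key in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
        hit[key] = int(hit[key])
    for key in ("pident", "evalue", "bitscore"):
        hit[key] = float(hit[key])
    return hit


# Made-up hit line, for illustration only.
example = "query1\tsubjectA\t97.50\t200\t5\t0\t1\t200\t1001\t1200\t1e-95\t342"
hit = parse_blast_tabular_line(example)
print(hit["sseqid"], hit["pident"], hit["length"])
```

The `pident` column is the percentage of identical positions over the alignment length reported in the `length` column.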