CD-HIT sequence identity

USEARCH version 6 and later use the BLAST definition of sequence identity. Earlier versions supported several definitions and used the CD-HIT definition by default. The CD-HIT definition is:

CD-HIT identity = (number of alignment columns containing matching letters) / (length of shorter sequence)

This definition has a number of problems in practice. Notably, gaps do not count as differences. In extreme cases, 100% identity can be reported for an alignment with an arbitrary number of gaps.
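
As an illustration of the extreme case, the following Python sketch computes the CD-HIT identity of a made-up gapped alignment. The function name cd_hit_identity, the use of "-" as the gap symbol, and the example sequences are assumptions chosen for illustration; they are not taken from USEARCH or CD-HIT.

```python
def cd_hit_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matching columns divided by the length of the shorter ungapped sequence."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as an identity only if both rows hold the same letter.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    # Sequence lengths are measured after removing gap symbols.
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    return matches / shorter


# Hypothetical alignment: every letter of the shorter sequence matches,
# but four gap columns are needed to make that possible.
row_a = "ACGT----ACGT"
row_b = "ACGTTTTTACGT"
identity = cd_hit_identity(row_a, row_b)
print(f"CD-HIT identity: {identity:.0%}")  # prints "CD-HIT identity: 100%"
```

Lengthening the gap in this example leaves the result at 100%, because neither the number of matching columns nor the length of the shorter sequence changes.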

The CD-HIT definition is more sensitive to the choice of alignment parameters than the BLAST definition. Reduced gap penalties tend to produce "gappier" alignments, in which added gaps make more identities possible, especially with nucleotide sequences. Under the CD-HIT definition, extra identities always increase the reported identity, because the denominator (the length of the shorter sequence) is fixed and gaps do not count as differences. Consider this simple example:

   GATTTACATT
   |||  |  ||
   GATACATCTT

This alignment has identity 6/10 = 60% according to both BLAST and CD-HIT. Now suppose gap penalties are reduced to allow the following alignment, which has two more identities:

   GATTTACA--TT
   ||  ||||  ||
   GA--TACATCTT

Now the BLAST identity is 8/12 = 67% and the CD-HIT identity is 8/10 = 80%. The CD-HIT identity rises by 20 percentage points, while the BLAST identity rises by only about 7.
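
The arithmetic for the two alignments can be checked with a short Python sketch. The function names and the "-" gap symbol are assumptions for illustration. BLAST identity is taken here as matching columns divided by the total number of alignment columns, which is the denominator used in the figures above.

```python
def count_matches(row_a: str, row_b: str, gap: str = "-") -> int:
    """Number of alignment columns where both rows hold the same letter."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)


def blast_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matches divided by the number of alignment columns (gaps included)."""
    return count_matches(row_a, row_b, gap) / len(row_a)


def cd_hit_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matches divided by the length of the shorter ungapped sequence."""
    shorter = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return count_matches(row_a, row_b, gap) / shorter


# The two alignments from the example above.
alignments = {
    "ungapped": ("GATTTACATT", "GATACATCTT"),
    "gapped": ("GATTTACA--TT", "GA--TACATCTT"),
}

results = {}
for name, (a, b) in alignments.items():
    results[name] = (blast_identity(a, b), cd_hit_identity(a, b))
    print(f"{name}: BLAST {results[name][0]:.0%}, CD-HIT {results[name][1]:.0%}")
# prints:
# ungapped: BLAST 60%, CD-HIT 60%
# gapped: BLAST 67%, CD-HIT 80%
```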

The CD-HIT definition tends to produce fewer clusters (larger average size) at a given identity threshold, but these tend to be of lower quality.