USEARCH v11

Common scenario where MCC fails

See also
 
Matthews Correlation Coefficient (MCC)
  Comments on Westcott & Schloss 2017

Suppose there are three strains A, B and C with 16S similarities AB=99%, BC=98% and AC=96% as shown below.

[Figure: strains A, B and C with pairwise 16S identities AB=99%, BC=98%, AC=96%]

According to Westcott and Schloss's definition, A and B should belong to the same OTU, and B and C should belong to the same OTU, because both of these pairs have >97% identity. But then A and C would also belong to the same OTU, which is invalid by their definition because this pair has <97% identity. Thus, in this example, it is not possible to make valid OTUs by their definition.
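The argument can be checked by brute force. The following is a minimal sketch, not code from USEARCH or from Westcott and Schloss: it lists all five ways to partition three strains and reports the pairs that contradict the 97% rule. The names `IDENTITY`, `PARTITIONS` and `violations` are hypothetical, and treating exactly 97% as "different" is an assumption that does not affect this example.

```python
# Pairwise 16S identities (%) from the example above.
IDENTITY = {("A", "B"): 99.0, ("B", "C"): 98.0, ("A", "C"): 96.0}
THRESHOLD = 97.0  # assumption: strictly greater than 97% means "same OTU"

# All five ways to partition three strains into OTUs.
PARTITIONS = [
    [{"A", "B", "C"}],
    [{"A", "B"}, {"C"}],
    [{"A"}, {"B", "C"}],
    [{"A", "C"}, {"B"}],
    [{"A"}, {"B"}, {"C"}],
]


def same_otu(partition, x, y):
    """True if strains x and y were placed in the same OTU."""
    return any(x in otu and y in otu for otu in partition)


def violations(partition):
    """Pairs whose OTU membership contradicts the identity threshold."""
    bad = []
    for (x, y), ident in IDENTITY.items():
        if same_otu(partition, x, y) != (ident > THRESHOLD):
            bad.append(x + y)
    return bad


for p in PARTITIONS:
    label = "|".join(sorted("".join(sorted(otu)) for otu in p))
    print(label, "violates:", violations(p))
```

Every partition reports at least one violating pair, which is the sense in which no valid set of OTUs exists for this triplet.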

This situation is common in practice, even if the sequences are error-free. A well-known example is Lactobacillus, where the 16S sequence variation between species is low and there are many pairs of species with identity >97%. I found 47 "impossible" triplets like the example above when clustering V4 sequences for 37 different Lactobacillus species.
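One plausible way to count such triplets is sketched below. It is not necessarily the procedure used to obtain the count above. The function name `impossible_triplets` and the input layout (a dict keyed by `frozenset` pairs) are assumptions made for illustration. It relies on the fact that any two pairs drawn from the same three sequences share a member, so a triplet is "impossible" exactly when two of its three pairs exceed the threshold and the third does not.

```python
from itertools import combinations


def impossible_triplets(names, identity, threshold=97.0):
    """Return triplets where exactly two of the three pairs exceed the threshold.

    names:    list of sequence labels
    identity: dict mapping frozenset((x, y)) -> percent identity (hypothetical layout)
    """
    found = []
    for trio in combinations(names, 3):
        above = sum(
            1
            for x, y in combinations(trio, 2)
            if identity[frozenset((x, y))] > threshold
        )
        if above == 2:
            found.append(trio)
    return found


# The three-strain example from the text.
example_identity = {
    frozenset(("A", "B")): 99.0,
    frozenset(("B", "C")): 98.0,
    frozenset(("A", "C")): 96.0,
}
print(impossible_triplets(["A", "B", "C"], example_identity))
```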

If a correct solution by Westcott and Schloss's definition is impossible, then a perfect MCC score cannot be achieved by any algorithm. W&S implicitly argue that even though correct OTUs by their definition cannot be constructed in practice, it is nevertheless self-evident, or at least uncontroversial, that maximizing MCC gives the best OTUs. I disagree. In fact, as the table below shows, MCC fails completely for two of the possible assignments (naive code would crash due to division by zero), and of the two assignments that are tied for the top score, one is clearly better because it puts the highest identity pair in the same species.

[Table: MCC scores for four possible OTU assignments of A, B and C]

Assignments #1, #2 and #3 are all reasonable, but the MCC score is undefined for two of them (#1 and #2) because the denominator is zero.

Assignments #3 and #4 have the same MCC score (0.5). However, #3 is clearly better than #4 because AB have higher identity than BC, and AB are therefore more likely to belong to the same species.
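The scores discussed above can be reproduced with a short sketch. It assumes the usual pairwise confusion matrix for OTU quality (a pair counts as a true positive if it is in the same OTU and has >97% identity, and so on) and the standard MCC formula. The helper names `confusion`, `mcc` and `score` are hypothetical, and the partition labels such as `AB|C` are descriptive only; they are not mapped to the assignment numbers in the table.

```python
import math

# Pairwise 16S identities (%) from the three-strain example.
IDENTITY = {("A", "B"): 99.0, ("B", "C"): 98.0, ("A", "C"): 96.0}
THRESHOLD = 97.0  # assumption: strictly greater than 97% means "should be together"


def confusion(partition):
    """Count pairwise TP, FP, TN, FN for a list of OTUs (sets of strain names)."""
    tp = fp = tn = fn = 0
    for (x, y), ident in IDENTITY.items():
        together = any(x in otu and y in otu for otu in partition)
        similar = ident > THRESHOLD
        if together and similar:
            tp += 1
        elif together:
            fp += 1
        elif similar:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient, or None when the denominator is zero."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return None  # undefined; a naive implementation would divide by zero here
    return (tp * tn - fp * fn) / math.sqrt(denom)


def score(label):
    """Score a partition written like 'AB|C' (hypothetical notation)."""
    partition = [set(group) for group in label.split("|")]
    return mcc(*confusion(partition))


for label in ["ABC", "A|B|C", "AB|C", "A|BC", "AC|B"]:
    print(label, score(label))
```

Under these assumptions, putting all three strains in one OTU or each in its own OTU gives an undefined score, while `AB|C` and `A|BC` both score 0.5, so MCC alone cannot separate the two tied assignments.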

Assignment #3 is the best according to the UPARSE definition, which I believe gives better OTUs than MCC.