Mapping reads to OTUs

The otutab command maps a read to an OTU by finding the OTU sequence with highest identity above a given threshold (usually 97%). If there is a tie, the tie is broken by choosing the first OTU in database file order.
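The rule described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the otutab implementation: the function name assign_by_identity, the (name, identity) hit list and the inclusive 0.97 default are assumptions made for the example.

```python
from typing import List, Optional, Tuple

def assign_by_identity(hits: List[Tuple[str, float]],
                       threshold: float = 0.97) -> Optional[str]:
    """Return the OTU with highest identity at or above the threshold.

    hits is assumed to be listed in database file order, so using a
    strict '>' comparison keeps the first OTU when identities tie.
    """
    best_name = None
    best_identity = 0.0
    for name, identity in hits:
        if identity < threshold:
            continue  # below threshold: not a candidate
        if best_name is None or identity > best_identity:
            best_name, best_identity = name, identity
    return best_name  # None means the read maps to no OTU

# Hypothetical hits for one read, in database order.
example_hits = [("Otu1", 0.98), ("Otu2", 0.98), ("Otu3", 0.95)]
chosen = assign_by_identity(example_hits)
```

In this example Otu1 and Otu2 tie, so the first one in database order is chosen.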

A read sequence can match two or more OTUs with >=97% identity. It has been suggested (Ye et al. 2015) that the read should be assigned to the OTU with highest abundance rather than highest identity. I disagree, because high identity is a better signal that the sequence is from the same species.

Suppose a sequence (S) matches two OTU sequences: A with 97% identity and B with 98% identity. The unique sequence A has abundance 1,000 and B has abundance 100. Should we assign S to A or B? A has higher abundance but lower identity; B has higher identity but lower abundance.
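To make the two competing rules concrete, here is a small Python sketch applied to the numbers in this example. The dictionary layout and the function names pick_by_identity and pick_by_abundance are hypothetical, and the abundance rule is only a simple reading of the suggestion discussed above, not code from any published tool.

```python
# Candidate OTUs for read S, using the values from the example.
candidates = {
    "A": {"identity": 0.97, "abundance": 1000},
    "B": {"identity": 0.98, "abundance": 100},
}

THRESHOLD = 0.97  # assumed inclusive identity threshold

def eligible(cands):
    """Keep only OTUs that match at or above the threshold."""
    return {n: c for n, c in cands.items() if c["identity"] >= THRESHOLD}

def pick_by_identity(cands):
    """Assign to the eligible OTU with the highest identity."""
    ok = eligible(cands)
    return max(ok, key=lambda n: ok[n]["identity"])

def pick_by_abundance(cands):
    """Assign to the eligible OTU with the highest abundance."""
    ok = eligible(cands)
    return max(ok, key=lambda n: ok[n]["abundance"])

by_identity = pick_by_identity(candidates)    # "B"
by_abundance = pick_by_abundance(candidates)  # "A"
```

The two rules disagree on this read, which is exactly the situation the cases below try to resolve.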

There are three possible reasons why S does not exactly match an OTU sequence: 1. it is a correct biological sequence, 2. it has sequence errors due to PCR or sequencing, or 3. it is chimeric.

(1) S is a correct biological sequence.
Here, either S is a paralog of A or B derived from the same genome, or S is from a different species.

(1a) If S is a paralog, we would prefer to assign it to the same OTU as the other paralog(s) from the species. This is more likely to be B, because paralogs in a given species almost always have higher identity to each other than to genes in another species. (The same argument applies to intra-species variation.)

(1b) If S belongs to a different species, then we are lumping two species into the same OTU and there is no reason to prefer A or B.

Conclusion: if S is a correct biological sequence, it is better to choose the OTU with highest identity, because that OTU is the most likely to belong to the same species. We should therefore assign S to B.

(2) S has sequencing errors.
Here, either S is a bad read of A or B, or S is a bad read of a correct biological sequence whose identity to an existing OTU is above the threshold, so it does not have its own OTU.

(2a) If we know that S is a bad read of an OTU sequence then we should again choose the highest identity match because this is much more likely to be the correct sequence. Suppose S is a bad read of A. Adding errors to A will probably reduce identity to both A and B. A bad read of A which has higher identity to B must have base call errors that reproduce letters in B by chance; this is very unlikely.
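The claim that a bad read of A rarely ends up closer to B can be checked with a rough simulation. The sketch below is a toy model under stated assumptions: random sequences of length 250, A and B differing at 5 positions, 4 substitution errors per read, no indels, and identity computed position by position. All names and parameter values are hypothetical and chosen only for illustration.

```python
import random

BASES = "ACGT"

def mutate(seq, positions, rng):
    """Substitute a different base at each of the given positions."""
    s = list(seq)
    for p in positions:
        s[p] = rng.choice([b for b in BASES if b != s[p]])
    return "".join(s)

def identity(x, y):
    """Fraction of matching positions (equal-length sequences, no gaps)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def fraction_closer_to_b(length=250, n_diffs=5, n_errors=4,
                         trials=2000, seed=1):
    """Fraction of simulated bad reads of A with higher identity to B."""
    rng = random.Random(seed)
    a = "".join(rng.choice(BASES) for _ in range(length))
    # B is A with n_diffs substitutions (about 98% identity here).
    b = mutate(a, rng.sample(range(length), n_diffs), rng)
    closer = 0
    for _ in range(trials):
        # A bad read of A: n_errors substitutions at random positions.
        read = mutate(a, rng.sample(range(length), n_errors), rng)
        if identity(read, b) > identity(read, a):
            closer += 1
    return closer / trials

frac = fraction_closer_to_b()
```

Under these assumptions a read is closer to B only if several errors land on the few positions where A and B differ and also reproduce B's letters, so the fraction should be very small.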

(2b) If S is a bad read of a biological sequence which is not A or B then this case is similar to (1) and we should therefore prefer the highest identity match.

Conclusion: if S has sequencing errors, the OTU with highest identity is much more likely to be the correct sequence, so we should assign S to B.

(3) S is an undetected chimera.
This scenario is less common than (1) or (2) because chimeras are rare as a fraction of the reads. If S is chimeric, there is no reason to prefer A or B.