calc_lcr_probs command

Calculates lowest common rank (LCR) probabilities from a distance matrix for sequences with taxonomy annotations.
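As an illustration of the underlying idea, the sketch below computes the lowest common rank for a pair of taxonomy annotations, taking it to be the lowest rank at which both sequences carry the same taxon name. This is a hypothetical helper, not the usearch implementation. The rank letters and their order in RANKS, the comma-separated letter:name annotation layout, and the function names parse_tax and lowest_common_rank are all assumptions made for the example.

```python
# Rank letters from highest to lowest rank (assumed order: domain, kingdom,
# phylum, class, order, family, genus, species).
RANKS = "dkpcofgs"


def parse_tax(annotation):
    """Parse a string such as 'd:Bacteria,p:Firmicutes' into {rank_letter: name}."""
    taxa = {}
    for field in annotation.strip(";").split(","):
        rank, _, name = field.partition(":")
        if rank and name:
            taxa[rank] = name
    return taxa


def lowest_common_rank(tax_a, tax_b):
    """Return the letter of the lowest rank shared by both annotations, or None."""
    a, b = parse_tax(tax_a), parse_tax(tax_b)
    lcr = None
    for rank in RANKS:
        if rank in a and rank in b:
            if a[rank] != b[rank]:
                break  # names diverge here, so no lower rank can be shared
            lcr = rank
    return lcr


ecoli = "d:Bacteria,p:Proteobacteria,f:Enterobacteriaceae,g:Escherichia,s:Escherichia_coli"
salmonella = "d:Bacteria,p:Proteobacteria,f:Enterobacteriaceae,g:Salmonella,s:Salmonella_enterica"
print(lowest_common_rank(ecoli, salmonella))
```

In this example the two annotations agree down to family and differ at genus, so the pair's lowest common rank is family.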

The distance matrix can be calculated using the calc_distmx command.

Output is written to a tabbed text file specified by the -tabbedout option.

Weighting
By default, probabilities are calculated assuming that pairs of sequences are selected by choosing an entry in the distance matrix at random with uniform probability. This method can have substantial taxonomic bias because reference databases often contain highly over-represented species and genera. For example, common pathogens such as E. coli may have many sequences while other species or genera have only a single sequence, and many unnamed genera are absent altogether.

This bias can be mitigated by weighting. For example, if weighting is done at genus rank, sequences are selected by first choosing a genus with uniform probability (with replacement), then choosing a sequence from that genus (again with uniform probability and with replacement). Weighting is specified by the -weight_rank option; the value of the option is a single letter representing the rank, e.g. s for species or g for genus (see taxonomy annotations for valid letters).
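The two-stage selection used for genus weighting can be sketched as follows. This is a minimal illustration of the sampling scheme described above, not the usearch code. The sequence labels, the toy database and the function names group_by_genus, sample_weighted and selection_probability are hypothetical, and the sketch covers only the selection of a single sequence, since how selections are combined into pairs is not specified here.

```python
import random
from collections import defaultdict


def group_by_genus(seq_to_genus):
    """Map each genus name to the list of sequence labels assigned to it."""
    genera = defaultdict(list)
    for seq_id, genus in seq_to_genus.items():
        genera[genus].append(seq_id)
    return dict(genera)


def sample_weighted(genera, rng):
    """Choose a genus uniformly, then a sequence uniformly within that genus."""
    genus = rng.choice(sorted(genera))
    return rng.choice(genera[genus])


def selection_probability(genera, seq_id):
    """Probability that one weighted draw returns seq_id."""
    for members in genera.values():
        if seq_id in members:
            return 1.0 / (len(genera) * len(members))
    raise KeyError(seq_id)


# Toy database: one heavily over-represented genus and two singletons.
seq_to_genus = {f"ecoli_{i}": "Escherichia" for i in range(1, 99)}
seq_to_genus["bacillus_1"] = "Bacillus"
seq_to_genus["vibrio_1"] = "Vibrio"

genera = group_by_genus(seq_to_genus)
rng = random.Random(0)
draws = [sample_weighted(genera, rng) for _ in range(3000)]
ecoli_fraction = sum(d.startswith("ecoli_") for d in draws) / len(draws)
print(f"Fraction of weighted draws from Escherichia: {ecoli_fraction:.2f}")
```

With uniform selection over the 100 sequences, Escherichia would account for 98% of draws. With genus weighting each of the three genera is chosen about one third of the time, so a singleton genus contributes as much as the over-represented one.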

Example

usearch -calc_distmx tax_16s.fa -maxdist 0.2 -termdist 0.3 -tabbedout distmx_16s.txt

usearch -calc_lcr_probs distmx_16s.txt -weight_rank g -tabbedout lcr_16s.txt

References (please cite)
R.C. Edgar (2018), Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences, PeerJ 6:e4652
  • Cross-validation by identity, novel benchmark strategy enabling realistic accuracy estimates
  • Genus accuracy of best methods is 50% on V4 sequences
  • Recent algorithms do not improve on RDP Classifier or SINTAX

R.C. Edgar (2018), Taxonomy annotation and guide tree errors in 16S rRNA databases, PeerJ 6:e5030
  • Approx. one in five SILVA and Greengenes taxonomy annotations are wrong
  • SILVA and Greengenes trees have pervasive conflicts with type strain taxonomies

R.C. Edgar (2018), Updating the 97% identity threshold for 16S ribosomal RNA OTUs, Bioinformatics 34(14):2371-2375
  • Standard 97% OTU identity threshold is too low
  • Optimal OTU threshold is 99% for full-length 16S, 100% for V4