calc_lcr_probs command

Calculates lowest common rank (LCR) probabilities from a distance matrix for sequences with taxonomy annotations.

The distance matrix can be calculated using the calc_distmx command.

Output is written to a tabbed text file specified by the -tabbedout option. Example output file.

By default, probabilities are calculated assuming that pairs of sequences are selected by choosing an entry in the distance matrix at random with uniform probability. This method can have substantial taxonomic bias because reference databases often contain highly over-represented species and genera. For example, common pathogens such as E. coli may have many sequences while other species or genera have only a single sequence, and many unnamed genera are absent altogether. This bias can be mitigated by weighting. For example, if weighting is done at genus rank, sequences are selected by first choosing a genus with uniform probability (with replacement), then choosing a sequence from that genus (again with uniform probability and with replacement). Weighting is specified by the -weight_rank option; its value is a single letter representing the rank, e.g. s for species or g for genus (see taxonomy annotations for valid letters).
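
The effect of genus weighting on selection probabilities can be illustrated with a small Python sketch. This is not the usearch implementation. It simplifies the problem to drawing a single sequence rather than a pair from the distance matrix. The function names and the toy database (eight Escherichia sequences, one Bacillus, one Listeria) are invented for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def genus_probs_uniform(seq_to_genus):
    """Probability that a uniformly chosen sequence belongs to each genus."""
    n = len(seq_to_genus)
    probs = defaultdict(Fraction)  # Fraction() is zero
    for genus in seq_to_genus.values():
        probs[genus] += Fraction(1, n)
    return dict(probs)


def seq_probs_genus_weighted(seq_to_genus):
    """Probability of drawing each sequence when a genus is chosen
    uniformly first, then a sequence uniformly within that genus."""
    by_genus = defaultdict(list)
    for seq, genus in seq_to_genus.items():
        by_genus[genus].append(seq)
    num_genera = len(by_genus)
    return {
        seq: Fraction(1, num_genera * len(members))
        for members in by_genus.values()
        for seq in members
    }


# Hypothetical toy database: one heavily over-represented genus.
example = {f"eco{i}": "Escherichia" for i in range(8)}
example["bac1"] = "Bacillus"
example["lis1"] = "Listeria"

uniform = genus_probs_uniform(example)
weighted = seq_probs_genus_weighted(example)
```

In this toy database, uniform selection draws an Escherichia sequence with probability 8/10, whereas genus weighting gives each of the three genera probability 1/3, so each individual Escherichia sequence is drawn with probability 1/24 and each singleton with probability 1/3.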


usearch -calc_distmx tax_16s.fa -maxdist 0.2 -termdist 0.3 -tabbedout distmx_16s.txt

usearch -calc_lcr_probs distmx_16s.txt -weight_rank g -tabbedout lcr_16s.txt

