USEARCH v11

OTU importance

See also
  Machine learning
  K-fold cross-validation
  otutab_forest_kfold command
  otutab_select command
  Random forest parameter file

An OTU is informative if its count or frequency can be used effectively in a rule that sorts samples into a given set of metadata categories, such as healthy / sick or day / night.

Usually, an OTU is informative because it has a higher frequency in one category than in the others, as shown in the figure below. Such cases can be found using the otutab_select command.

With a random forest classifier, a so-called importance value in the range 0 (not informative) to 1 (maximally informative) is calculated for each OTU. Random forests can discover more complicated rules than the simple frequency sort assumed by otutab_select. To extract the OTU importance values from a random forest parameter file and sort them in order of decreasing importance, you can use:

grep -w "^varimp" forest.txt | cut -f3,5 | sort -rgk2
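As a variation on the command above, the sketch below keeps only OTUs whose importance reaches a chosen cutoff. It assumes nothing about the parameter file beyond what the command above implies: tab-separated lines starting with varimp, with the OTU name in field 3 and the importance in field 5. The file forest_demo.txt, its placeholder fields, the OTU names and the cutoff of 0.1 are all made up for illustration.

```shell
# Hypothetical stand-in for a random forest parameter file.
# Assumed layout (inferred from the cut -f3,5 command, not from a format spec):
# field 1 = "varimp", field 3 = OTU name, field 5 = importance.
# Fields 2 and 4 are placeholders.
printf 'varimp\t0\tOtu1\tx\t0.02\n'  > forest_demo.txt
printf 'varimp\t1\tOtu7\tx\t0.31\n' >> forest_demo.txt
printf 'varimp\t2\tOtu3\tx\t0.15\n' >> forest_demo.txt
printf 'other\t0\tOtu9\tx\t0.99\n'  >> forest_demo.txt

# Keep varimp lines with importance >= min, print name and importance,
# then sort by decreasing importance (general numeric sort, as above).
awk -F'\t' -v min=0.1 \
    '$1 == "varimp" && ($5 + 0) >= (min + 0) { print $3 "\t" $5 }' \
    forest_demo.txt | sort -k2,2gr > top_otus.txt

cat top_otus.txt
```

With a real parameter file, replace forest_demo.txt with its name and pick a cutoff that suits your data; the value 0.1 has no special meaning.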

If an OTU is found to be informative by a random forest classifier but not by the otutab_select command, this suggests that the rules incorporating this OTU are more complicated than the typical form "if count is high, sample is in category A, otherwise in a different category".
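One way to list such OTUs is a set difference between two lists of OTU names, sketched below. Both input files are hypothetical: forest_otus.txt stands for OTUs you judged informative from the random forest importance values, and select_otus.txt for OTUs reported by otutab_select, each with one name per line. How you build these lists from the actual outputs is not shown and depends on their formats.

```shell
# Hypothetical inputs, one OTU name per line (names are made up).
# Sort in the C locale so that sort and comm agree on ordering.
printf 'Otu7\nOtu3\nOtu12\n' | LC_ALL=C sort > forest_otus.txt
printf 'Otu3\nOtu5\n'        | LC_ALL=C sort > select_otus.txt

# comm -23 prints lines that appear only in the first file,
# i.e. OTUs informative for the forest but not found by otutab_select.
LC_ALL=C comm -23 forest_otus.txt select_otus.txt > forest_only.txt

cat forest_only.txt
```

The OTUs in forest_only.txt are candidates for closer inspection, since their contribution probably depends on a rule more involved than a simple frequency cutoff.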