The UPARSE-OTU algorithm constructs a set of OTU representative sequences from
NGS amplicon reads. It is implemented in the
cluster_otus command. Reads should be pre-processed to overlap paired reads (if
appropriate), strip barcodes, perform quality filtering and
global trimming. Post-processing is
needed to map reads to OTUs and construct an OTU table. See
UPARSE pipeline for detailed discussion of
practical issues. This page describes the OTU clustering algorithm itself.
Assigning reads to OTUs is a separate task
which is not addressed by UPARSE-OTU. See otutab
command and defining OTUs for more details.
Input to UPARSE-OTU is a set of sequences. Each sequence is marked with an
integer value indicating its abundance. In practice, the abundance is usually
the number of reads having a given unique sequence, but it could also be the
predicted abundance of an amplicon after a denoising step.
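
As an illustration of this input format, the following minimal sketch counts how often each unique read sequence occurs and pairs each sequence with that count. The function name dereplicate and the toy reads are hypothetical and are not part of USEARCH; sorting by decreasing abundance is only a convenience here.

```python
from collections import Counter


def dereplicate(reads):
    """Return (sequence, abundance) pairs, most abundant first.

    Hypothetical helper for illustration only; it is not a USEARCH command.
    """
    counts = Counter(reads)
    # Sort by abundance (descending), then by sequence so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
    for sequence, abundance in dereplicate(toy_reads):
        print(sequence, abundance)
```

Each returned pair corresponds to one input sequence annotated with its integer abundance, which is the form of input described above.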
The goal of UPARSE-OTU is to identify a set of OTU representative sequences
(a subset of the input sequences) satisfying the following criteria.
1. All pairs of OTU sequences should have <97% pair-wise sequence identity.
2. An OTU sequence should be the most abundant within
a 97% neighborhood.
3. Chimeric sequences should be discarded.
4. All non-chimeric input sequences should match at
least one OTU with ≥ 97% identity.
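
Criteria 1 and 4 can be checked mechanically once an identity measure is chosen. The sketch below assumes equal-length (globally trimmed) sequences and uses the fraction of matching positions as a simplified stand-in for alignment-based identity; the names identity and check_otus are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of matching positions (assumption: equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def check_otus(otus, non_chimeric_inputs, threshold=0.97):
    """Return (criterion_1_ok, criterion_4_ok) for a candidate OTU set."""
    # Criterion 1: every pair of OTU sequences is below the threshold.
    distinct = all(identity(a, b) < threshold for a, b in combinations(otus, 2))
    # Criterion 4: every non-chimeric input matches at least one OTU.
    covered = all(
        any(identity(s, o) >= threshold for o in otus)
        for s in non_chimeric_inputs
    )
    return distinct, covered
```

Criteria 2 and 3 are not covered by this check, since they depend on abundances and on a chimera model.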
UPARSE-OTU uses a greedy algorithm to find a
biologically relevant solution, as follows. Since high-abundance reads are more
likely to be correct amplicon sequences, and hence are more likely to be true
biological sequences, UPARSE-OTU considers input sequences in order of
decreasing abundance. This means that OTU centroids tend to be selected from the
more abundant reads, and hence are more likely to be correct biological sequences.
Each input sequence is compared to the
current OTU database, and a maximum parsimony model of the sequence is found
using UPARSE-REF (figure below). There are
three cases. (a) The UPARSE-REF model is ≥ 97% identical to an existing OTU, (b)
the model is chimeric, or (c) the model is <97% identical to any existing OTU.
In case (a), the input sequence becomes a member of the OTU. In case (b), the
input sequence is discarded. In case (c), the input sequence is added to the
database and becomes the representative sequence (centroid) of a new OTU.
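
The three cases can be sketched as a short greedy loop. This is a simplified illustration under stated assumptions, not the cluster_otus implementation: sequences are assumed to have equal length, identity is the fraction of matching positions rather than an alignment-based score, and best_chimera_identity is a hypothetical brute-force stand-in for UPARSE-REF that only tries two-parent models with a single crossover.

```python
def identity(a, b):
    """Fraction of matching positions (assumption: equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_chimera_identity(query, otus):
    """Best identity of any two-parent model A[:k] + B[k:] built from OTUs.

    Hypothetical stand-in for UPARSE-REF, for illustration only.
    """
    best = 0.0
    for i, a in enumerate(otus):
        for j, b in enumerate(otus):
            if i == j:
                continue
            for k in range(1, len(query)):
                best = max(best, identity(query, a[:k] + b[k:]))
    return best


def greedy_otus(uniques, threshold=0.97):
    """Cluster (sequence, abundance) pairs; return (otus, members, discarded)."""
    otus = []       # centroid sequences, in order of creation
    members = {}    # centroid -> list of (sequence, abundance)
    discarded = []  # sequences classified as chimeric
    # Consider input sequences in order of decreasing abundance.
    for seq, size in sorted(uniques, key=lambda u: -u[1]):
        best = max(otus, key=lambda o: identity(seq, o), default=None)
        if best is not None and identity(seq, best) >= threshold:
            members[best].append((seq, size))  # case (a): join existing OTU
        elif len(otus) >= 2 and best_chimera_identity(seq, otus) >= threshold:
            discarded.append(seq)              # case (b): chimeric, discard
        else:
            otus.append(seq)                   # case (c): new OTU centroid
            members[seq] = [(seq, size)]
    return otus, members, discarded
```

Because a sequence becomes a centroid only when it is below the threshold against every existing OTU, the centroids produced by this sketch are mutually below the threshold under the same identity measure. The brute-force chimera search is quadratic in the number of OTUs and is meant only to make the control flow concrete.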
Edgar, R.C. (2013) UPARSE: Highly accurate OTU sequences from microbial amplicon reads,
Nature Methods [Pubmed:23955772,