Comparing diversity between sample groups

Diversity metrics can be compared between groups by amplicon sequencing
It is not possible to measure alpha diversity of a single sample. However, diversity metrics can nevertheless be compared between samples because the errors and biases are mostly systematic, i.e., they occur at similar rates in all samples.

For an analogy, suppose we have a crooked piece of wood. We ask three-year-old Charlie to draw marks on the wood using the length of his finger (which is roughly one inch long) as the distance between marks. Charlie does his best, but the marks are not very evenly spaced. We write numbers 1, 2, 3... at each mark, giving us an inaccurate ruler. Take twenty men and twenty women and measure their heights using this ruler. We do not have a good measurement of the height of each person in inches, but if Bob is taller than Alice, we can trust that we will get a larger number for his height.

Using Charlie's ruler, the measured height is a monotonically increasing function of the true height. This means that despite the low accuracy of our measuring device, we can make a valid statistical test of the hypothesis that men tend to be taller than women because our measurement errors are systematic. With the abundance of a given OTU measured by amplicon sequencing, we have a similar situation, though the relationship is not strictly monotonic because of fluctuations. Also, the abundances of different OTUs are not comparable because a different inaccurate ruler is being used in each case -- in this analogy, Charlie's ruler is used to measure OTU1, Deborah's to measure OTU2, Eric's to measure OTU3 and so on, and we can't examine the rulers (i.e., we can't correct for their biases). However, despite these complications, it is still possible to make valid comparisons.
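The ruler analogy can be sketched in a few lines of Python. This is only an illustration: the mark positions and the test lengths are made-up numbers, and ruler_reading is a hypothetical helper, not part of any tool. It shows that an unevenly marked ruler gives wrong values but never reverses the order of two lengths.

```python
import bisect

# Hypothetical positions (in true inches) of Charlie's unevenly spaced marks.
# Mark number k is drawn at marks[k]; the spacing is only roughly one inch.
marks = [0.0, 1.3, 2.1, 3.4, 4.0, 5.2, 6.1, 7.5, 8.0, 9.3, 10.1]

def ruler_reading(true_length):
    """Return the number of the last mark at or below true_length."""
    return bisect.bisect_right(marks, true_length) - 1

# Made-up true lengths, listed in increasing order.
true_lengths = [0.5, 2.0, 3.9, 6.0, 9.9]
readings = [ruler_reading(x) for x in true_lengths]

for true_length, reading in zip(true_lengths, readings):
    print(f"true length {true_length:4.1f} in -> ruler reading {reading}")

# The readings are inaccurate, but they never decrease as true length grows.
print("order preserved:", readings == sorted(readings))
```

Because the readings are whole numbers, two nearby lengths can receive the same reading, so the order is preserved only in the sense that a longer object never gets a smaller number.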

Suppose the number of spurious OTUs is roughly constant, say 100 per sample, keeping in mind that we can't estimate this number. Now, if sample X has 2,000 OTUs and sample Y has 1,000 OTUs, then we can tentatively conclude that X has higher richness than Y. For any given pair, this conclusion is not reliable because we don't know the number of spurious OTUs or how much this number fluctuates between samples. However, if we have twenty samples in group A (say, gut samples from obese patients) and twenty samples in group B (say, gut samples from lean controls), and we find that samples in group A tend to have higher richness than samples in group B, then this conclusion can be statistically significant.
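As a rough sketch of such a group comparison, the Python below runs a simple permutation test on observed richness. Everything here is an assumption made for illustration: the simulated richness values, the spread of the spurious OTU count, and the choice of a permutation test on group means (a rank-based test would be another reasonable choice). The function permutation_p_value is a hypothetical helper written for this example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Simulated (not real) data: true richness plus roughly 100 spurious OTUs.
sim = random.Random(1)

def simulated_observed_richness(true_mean):
    true_richness = sim.gauss(true_mean, 200)
    spurious = sim.gauss(100, 20)  # unknown in practice, similar in all samples
    return round(true_richness + spurious)

group_a = [simulated_observed_richness(1200) for _ in range(20)]
group_b = [simulated_observed_richness(1000) for _ in range(20)]

p_value = permutation_p_value(group_a, group_b)
print(f"mean A = {mean(group_a):.0f}, mean B = {mean(group_b):.0f}, p = {p_value:.4f}")
```

The spurious OTUs inflate every sample by a similar amount, so they shift both group means without creating a difference between the groups.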

Similarly, abundance bias affects the count or frequency of a given OTU in the same way in all samples. For example, if OTU1 has one copy of the 16S gene per cell and OTU6 has six copies, then the count of OTU1 will be approximately six times too low compared to OTU6 in all samples. Thus, if the ratio count(OTU1) / count(OTU6) is higher in sample X than sample Y, this could be meaningful even though the numerical value of the ratio is wrong by a factor of six in both samples. The key point here is that the unknown factor of six is the same in both samples. Thus, a difference in a frequency-based metric such as Shannon entropy can be statistically significant between groups, even if the numerical value of the metric cannot be measured for an individual sample.
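The arithmetic can be checked with a small Python sketch. The cell numbers are invented, and the model is deliberately idealized: it assumes read counts are exactly proportional to the number of 16S gene copies and ignores every other source of bias and noise.

```python
from fractions import Fraction

# 16S gene copies per cell, as in the example in the text.
copies_per_cell = {"OTU1": 1, "OTU6": 6}

# Hypothetical true cell numbers in two samples.
cells = {
    "X": {"OTU1": 300, "OTU6": 100},
    "Y": {"OTU1": 100, "OTU6": 100},
}

def idealized_counts(sample_cells):
    """Counts proportional to gene copies (idealized, no noise)."""
    return {otu: n * copies_per_cell[otu] for otu, n in sample_cells.items()}

def ratio(values):
    return Fraction(values["OTU1"], values["OTU6"])

true_ratio = {s: ratio(cells[s]) for s in cells}
observed_ratio = {s: ratio(idealized_counts(cells[s])) for s in cells}

for s in cells:
    print(f"sample {s}: true ratio {true_ratio[s]}, observed ratio {observed_ratio[s]}")

# Each observed ratio is wrong by the same factor of six...
print("bias factor:", {s: true_ratio[s] / observed_ratio[s] for s in cells})
# ...so the comparison between samples is unaffected.
print("true X/Y:", true_ratio["X"] / true_ratio["Y"])
print("observed X/Y:", observed_ratio["X"] / observed_ratio["Y"])
```

In practice the bias factor is unknown, which is why only the comparison between samples, not the ratio itself, can be interpreted.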

Interpreting a change in number of OTUs (observed richness)
A change in the observed number of OTUs can be due entirely to a change in shape of the distribution. This is a subtle effect of incomplete sampling (I am writing a paper which explores this issue in detail). So even if there are no errors, a change in the observed number of OTUs does not necessarily imply a change in true richness of the sample.
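One way to see this effect is to compute the expected number of observed OTUs for two communities with the same true richness but different abundance distributions. The sketch below assumes error-free reads drawn independently according to the OTU frequencies, so an OTU with frequency p is missed by all N reads with probability (1 - p)^N. The two distributions (perfectly even, and frequencies proportional to 1/rank) and the read depth are arbitrary choices for illustration.

```python
def expected_observed_otus(frequencies, n_reads):
    """Expected number of OTUs seen at least once in n_reads independent reads."""
    return sum(1.0 - (1.0 - p) ** n_reads for p in frequencies)

n_otus = 1000   # true richness, identical in both communities
n_reads = 2000  # sequencing depth, identical in both communities

# Community 1: all OTUs equally abundant.
even = [1.0 / n_otus] * n_otus

# Community 2: skewed abundances, frequency proportional to 1/rank.
weights = [1.0 / rank for rank in range(1, n_otus + 1)]
total_weight = sum(weights)
skewed = [w / total_weight for w in weights]

observed_even = expected_observed_otus(even, n_reads)
observed_skewed = expected_observed_otus(skewed, n_reads)

print(f"true richness in both communities: {n_otus}")
print(f"expected observed OTUs, even community:   {observed_even:.0f}")
print(f"expected observed OTUs, skewed community: {observed_skewed:.0f}")
```

Both communities contain exactly 1,000 OTUs, yet the skewed one is expected to show fewer of them at the same depth because more of its OTUs are too rare to be sampled.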

Normalization can distort alpha diversity metric comparisons
Normalization effects can be misleading. For example, consider two samples A and B which have identical composition and frequencies (i.e., the same alpha diversity) except that A has an opportunistic infection by E. coli while B does not. The E. coli OTU is large, containing 80% of the reads in A. Now, if the OTU table is normalized or rarefied to the same number of reads per sample, the effect will be to reduce the size of all the other OTUs in A by a factor of 5 compared with B, despite the fact that the two samples have the same frequencies (and perhaps the same number of cells) for those OTUs. Many smaller OTUs may disappear from A compared to B for this reason, despite the fact that they have similar numbers of reads in both samples before normalization.
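The factor of 5 can be reproduced with a toy OTU table. The counts below are invented, and the normalization used here is plain proportional scaling with rounding, chosen because it is deterministic; rarefaction by random subsampling would give similar results on average but is not what this sketch implements.

```python
# Hypothetical read counts for the OTUs shared by both samples.
shared_otus = {"OTU1": 500, "OTU2": 300, "OTU3": 150, "OTU4": 48, "OTU5": 2}

sample_b = dict(shared_otus)
sample_a = dict(shared_otus)
# E. coli makes up 80% of the reads in A: 4000 of 5000.
sample_a["E_coli"] = 4000

def normalize(counts, target_reads):
    """Scale counts to target_reads per sample and round to whole reads."""
    total = sum(counts.values())
    return {otu: round(n * target_reads / total) for otu, n in counts.items()}

target = 1000
norm_a = normalize(sample_a, target)
norm_b = normalize(sample_b, target)

print(f"{'OTU':8s} {'A':>6s} {'B':>6s}")
for otu in shared_otus:
    print(f"{otu:8s} {norm_a[otu]:6d} {norm_b[otu]:6d}")

lost_in_a = [otu for otu in shared_otus if norm_a[otu] == 0 and norm_b[otu] > 0]
print("OTUs present in B but lost from A after normalization:", lost_in_a)
```

Before normalization the shared OTUs have identical counts in A and B; afterwards they are five times smaller in A, and the smallest one drops out of A entirely.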