Expected errors predicted by Phred (Q) scores

See also
fastq_filter command
Quality scores
Average Q is a bad idea!
Setting an e.e. threshold for long reads
Edgar & Flyvbjerg 2014 paper on expected errors and Bayesian assembly

Phred scores
Next-generation reads specify a Phred score for each base, also known as a Quality or Q score. The Q score is an integer, typically in the range 2 to 40. Q indicates the probability that the base call is incorrect (P_e). For example, Q=2 means that the error probability is 63%, so the machine is reporting that the base is more likely to be wrong than right, while Q=20 corresponds to an error probability of 1%.

Expected number of errors
Assume that the Phred scores are good, in other words assume that the machine makes good estimates of the error probabilities. Now we can measure the overall quality of a read by calculating the expected number of errors in the read. The expected value of a quantity under a probability distribution is a technical term in statistics, so let me explain what that means in this case.

Take the error probabilities according to the Q scores in a given read, and imagine a very large number of reads containing errors that occur with those probabilities. For example, if the first base in the read has Q=20, then 1% of the reads in our collection will have errors in that position. Some of the reads will have no errors, some will have one error, and so on. The Q scores can't tell us exactly how many errors there are in a given read. However, the Q scores can tell us what the average number of errors would be in that large collection, and this is the definition of the expected number of errors. Because we're taking an average, the value is real rather than integer, and can be less than one.

Calculating the expected number of errors and most probable number of errors
It turns out that it is easy to calculate the expected number of errors: it is the sum of the error probabilities. The most probable number of errors (E*) is also easy to calculate. First calculate E = expected errors = sum P_e. Then round down to the nearest integer, and this is the most probable number of errors. In symbols: E* = floor(E). For example, if E < 1, then the most probable number of errors is zero. Proofs of these results are given in Edgar & Flybjerg (2014).

Quality filtering by expected errors
Small E means high quality, large E means low quality. We can therefore filter low quality reads by setting a maximum number of expected errors (E_max) and discarding reads with E > E_max. This is done by running the fastq_filter command.

Choosing the maximum expected error threshold
A natural choice is E_max = 1, because the most probable number of errors of a filtered read is zero. Though of course, we expect some of them to have one or more errors (this can't be entirely avoided, however stringent the filter, because Q scores are probabilities, not certainties). If you want to filter more stringently, you might choose something like E_max = 0.5 or E_max = 0.25.

Overlapping paired reads
If you have paired reads that overlap, then these can be exploited to improve the error filtering. The paired read merger (assembler) should report re-calculated Phred scores in the overlapping region to reflect that we have two independent observations of the same bases. If the forward and reverse read agree on a base, then our confidence that the base is correct should increase. The Q score should be higher, corresponding to a lower error probability. Conversely, if the base calls disagree, then the re-calculated Phred score should be lower than both the original base calls. The Q score in the merged read (contig) should be calculated as the posterior probability according to Bayes theorem where the P_e's from the forward and reverse read are the prior probabilities.

Quality filtering by maximum expected errors should be performed as a post-processing step after reads have been merged in order to take advantage of the re-calculated Phred scores, which should be better predictions of the true error rates.

It is important to use the USEARCH fastq_mergepairs command for read merging because according to my tests, other read mergers do not correctly calculate the posterior Q scores, including PANDAseq, COPE, PEAR, fastq-join and FLASH. Some of these mergers also have high rates of false positive and incorrect merges, including especially PANDAseq and the make.contigs() command in mothur.

OK, the concept is nice, but how well does this work in practice?
For Illumina reads, this approach works very well, because the Illumina Q scores are pretty good (much better than 454). The accuracy of the Q scores varies with the sequencing machine, base-calling software and probably other factors such as reagents. Amplicon sequencing tends to have less accurate Q scores than shotgun, so you can't assume that results on a PhiX library will be similar to results for amplicon reads. The Q scores may systematically under- or over-estimate the error probabilities in different runs. Errors do have some tendency to correlate, i.e. to cluster in certain reads, so they're not perfectly independent as assumed by the theory. Also, Q scores can only predict sequencer error per se, they cannot predict errors introduced before sequencing, e.g. during PCR. But despite these caveats, the approach works well in practice, better than anything else I've tested.

Long reads
See this FAQ: Should I increase the expected error threshold for long reads?