
Filtering artifacts by a merge length range

See also
  fastq_mergepairs command
  fastq_mergepairs options

With next-generation amplicon sequencing, PCR often creates artifacts that are not correctly constructed, i.e., that do not contain the intended full-length biological sequence. Such artifacts can arise from primers binding to a different region of the genome, secondary structure formation in the reaction, and so on. These amplicons are usually shorter or longer than expected and can therefore be filtered by setting a length range for the merged read. This is supported by the -fastq_minmergelen and -fastq_maxmergelen options.
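The effect of a merge length range can be illustrated with a minimal Python sketch. This is not how fastq_mergepairs is implemented; it only mimics the outcome of applying a minimum and maximum length to reads that have already been merged. The function name, the read labels, and the example sequences are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def filter_by_merge_length(
    merged: Iterable[Tuple[str, str]],
    min_len: int,
    max_len: int,
) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs whose length lies in [min_len, max_len]."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    for label, seq in merged:
        # Reads outside the range are treated as likely artifacts and dropped.
        if min_len <= len(seq) <= max_len:
            yield label, seq


# Hypothetical merged reads: one of plausible length, one too short, one too long.
reads = [
    ("read1", "A" * 253),
    ("read2", "A" * 120),
    ("read3", "A" * 410),
]

kept = [label for label, _ in filter_by_merge_length(reads, 230, 270)]
print(kept)
```

Both bounds are inclusive in this sketch; whether that matches the exact boundary behavior of the USEARCH options is an assumption.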

Length range for 16S V4
Currently, a popular method is 2 x 250 reads of the 16S V4 hypervariable region. This region has a well-conserved length, unlike other 16S hypervariable regions. You can therefore set a narrow length range to exclude artifacts. With the typical primers V4F (GTGCCAGCMGCCGCGGTAA) and V4R (GGACTACHVGGGTWTCTAAT), I set a length range of 230 to 270. These values exceed the known variation in length to allow for novel outliers.

How to determine the length range for a primer pair
The search_pcr command can generate amplicon sequences given primer sequences and a database of known genes (or genomes). The length range of the predicted amplicons can then be determined with the fastx_info command. Be careful to include or exclude the primer sequences in the total length depending on whether they will appear in the reads (typically they do, but library preparation protocols vary widely). I recommend using a range that exceeds the measured minimum and maximum because there may be novel outliers which are not in the database.
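As a rough illustration of this procedure, the Python sketch below reads predicted amplicons from FASTA text, measures the shortest and longest sequence, and widens the range by a margin. It stands in for the length summary only and does not reproduce search_pcr or fastx_info. The function names, the default margin of 10, and the primer_adjust parameter (the number of bases to add or subtract when the predicted amplicons and the reads differ in whether they contain the primers) are assumptions made for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def fasta_lengths(handle: TextIO) -> Iterator[int]:
    """Yield the length of each sequence in FASTA-formatted text."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            # Sequences may be wrapped over several lines.
            length += len(line)
    if length is not None:
        yield length


def padded_length_range(
    handle: TextIO, margin: int = 10, primer_adjust: int = 0
) -> Tuple[int, int]:
    """Return (min, max) amplicon length, shifted by primer_adjust and
    widened by margin on each side to leave room for novel outliers."""
    lengths = list(fasta_lengths(handle))
    if not lengths:
        raise ValueError("no sequences found")
    low = min(lengths) + primer_adjust - margin
    high = max(lengths) + primer_adjust + margin
    return max(low, 1), high


# Hypothetical predicted amplicons of length 250, 253 and 255.
demo = io.StringIO(
    ">a\n" + "A" * 250 + "\n"
    ">b\n" + "C" * 200 + "\n" + "C" * 53 + "\n"
    ">c\n" + "G" * 255 + "\n"
)
print(padded_length_range(demo))
```

The resulting pair can then be used as the minimum and maximum merge length. How large the margin should be is a judgment call that depends on how complete the reference database is.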