Introduction to paired read merging
Reviewing a fastq_mergepairs report to check for problems
Using the tabbedout file to investigate merging problems
Validating merged reads to check for problems
Filtering artifacts by setting a merge length range
Long overlaps are not needed so 2 x 250 can do better than V4
Trouble-shooting fastq_mergepairs problems
Staggered read pairs
Quality filtering while merging (not recommended)
Strategies for dealing with low-quality reverse reads (R2s)
2 x 250 reads with long overlap, e.g. 16S V4
2 x 300 reads with short overlap, e.g. 16S V3-V5
The fastq_mergepairs command merges (assembles) paired-end reads to create consensus sequences and, optionally, consensus quality scores. This command has many features and options, so I recommend spending some time browsing the documentation to get familiar with the capabilities of fastq_mergepairs and the issues that arise in read merging.
In the examples below, the forward read FASTQs have "R1" in the filename and the reverse FASTQs have "R2" as this is the convention currently used by Illumina.
The simplest way to use fastq_mergepairs is to specify the forward and reverse FASTQ filenames and an output FASTQ filename.
usearch -fastq_mergepairs SampleA_R1.fastq -reverse SampleA_R2.fastq -fastqout merged.fq
Automatic R2 filename
If the -reverse option is omitted, the reverse FASTQ filename is constructed by replacing R1 with R2. The following command line is equivalent to the example above.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq
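To make the naming rule concrete, here is a minimal shell sketch of the R1-to-R2 substitution described above. It is not usearch's own code: the helper name r2_name is hypothetical, and the sketch assumes that only the first occurrence of R1 in the name is replaced, which the description above does not specify.

```shell
# Hypothetical helper: derive the reverse-read filename from the forward one
# by replacing "R1" with "R2".
# Assumption: only the first "R1" in the name is replaced.
r2_name() {
    printf '%s\n' "$1" | sed 's/R1/R2/'
}

r2_name SampleA_R1.fastq   # prints SampleA_R2.fastq
```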
Merging multiple FASTQ file pairs in a single command
You can specify two or more FASTQ filenames following -fastq_mergepairs. In the following example, SampleA and SampleB are both merged. The R2 filenames are constructed automatically as explained above, or can be given explicitly using the -reverse option.
usearch -fastq_mergepairs SampleA_R1.fastq SampleB_R1.fastq -fastqout merged.fq
Using shell wildcards to merge multiple FASTQ file pairs in a single command
You can use shell wildcards (* and ?) to give a pattern that matches the FASTQ files you want to merge. For example, this will merge every file pair in the current directory whose forward-read filename contains R1 and ends in .fastq:
usearch -fastq_mergepairs *R1*.fastq -fastqout merged.fq
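Because the shell, not usearch, expands the wildcard, it can help to preview what the pattern matches before merging. The sketch below is an illustrative pre-check, not part of usearch: the function name check_pairs is hypothetical, and the R2 name is derived by replacing the first R1 in each name, which is an assumption about the naming rule.

```shell
# Hypothetical pre-check: list each R1 file given on the command line and
# flag any whose R2 partner is missing. Returns non-zero if a file is missing.
check_pairs() {
    missing=0
    for r1 in "$@"; do
        # A pattern that matches nothing is passed through unchanged by the shell.
        [ -e "$r1" ] || { echo "no match: $r1" >&2; missing=1; continue; }
        # Assumption: the R2 name replaces the first "R1" in the R1 name.
        r2=$(printf '%s\n' "$r1" | sed 's/R1/R2/')
        if [ -f "$r2" ]; then
            echo "pair: $r1 $r2"
        else
            echo "missing R2 for $r1" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (run in the directory holding the FASTQ files):
#   check_pairs *R1*.fastq && usearch -fastq_mergepairs *R1*.fastq -fastqout merged.fq
```

With this guard the merge only starts when every matched forward file has a reverse partner.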
Adding sample identifiers to read labels
If multiple samples are combined into a single file as shown in some of the above examples, then you lose track of which read came from which sample. This is addressed by adding a sample identifier to each read label. The simplest method is to use the -sample option, e.g.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq -sample SampleA
The string sample=SampleA; will be added at the end of the read label.
Getting the sample identifier from the FASTQ filename
FASTQ filenames are often based on the sample identifier, e.g. SampleA_R1.fastq. If you specify -relabel @ then fastq_mergepairs gets the sample identifier from the FASTQ filename by truncating at the first underscore (_) or period (.). A period and the read number are added after the sample identifier to make the new read label, which replaces the original label. This differs from the -sample option, which adds the sample= annotation at the end of the label. The usearch_global command understands both of these methods for putting sample identifiers into read labels.
usearch -fastq_mergepairs SampleA_R1.fastq -fastqout merged.fq -relabel @
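The truncation rule can be mimicked in the shell to see which identifier a given filename would yield. This is a sketch of the rule as described above, not usearch's implementation: the helper name sample_id is hypothetical, and stripping a leading directory before truncating is an assumption.

```shell
# Hypothetical helper mirroring the rule described above: keep the part of
# the filename before the first underscore (_) or period (.).
sample_id() {
    name=${1##*/}                    # assumption: ignore any leading directory
    printf '%s\n' "${name%%[_.]*}"   # truncate at the first _ or .
}

sample=$(sample_id SampleA_R1.fastq)
echo "$sample"      # prints SampleA
echo "$sample.1"    # identifier, period, read number: prints SampleA.1
```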
Merging multiple files with sample identifiers
By using wildcards and the -relabel @ option you can merge multiple files and add sample identifiers to the read labels, for example:
usearch -fastq_mergepairs *R1*.fastq -fastqout merged.fq -relabel @
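When many files are matched by a wildcard, it can be worth previewing the identifier each file would receive, since two files whose names share the text before the first underscore or period would end up with the same sample identifier. The sketch below applies the truncation rule described above to each filename; the function name preview_samples is hypothetical and the code is an illustration, not usearch's own logic.

```shell
# Hypothetical preview: print "file<TAB>sample" for each file given,
# using the truncate-at-first-underscore-or-period rule.
preview_samples() {
    for f in "$@"; do
        name=${f##*/}    # assumption: ignore any leading directory
        printf '%s\t%s\n' "$f" "${name%%[_.]*}"
    done
}

# Usage:
#   preview_samples *R1*.fastq
# List identifiers shared by more than one file:
#   preview_samples *R1*.fastq | cut -f2 | sort | uniq -d
```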