Getting started
Basic usage
To run AdapterRemoval on single-end FASTQ data, specify the location of the FASTQ file(s) using the --in-file1 command-line option. Use the --out-prefix option to specify a prefix for the names of the resulting FASTQ and report files:
adapterremoval3 \
--in-file1 reads_1.fastq.gz \
--out-prefix output
When run in this manner, AdapterRemoval will generate the files output.fastq.gz containing the trimmed reads, output.html containing a human-readable QC report, and output.json containing the QC report in JSON format.
To run AdapterRemoval on paired-end FASTQ data, specify the location of the mate 1 and mate 2 FASTQ files using the --in-file1 and --in-file2 command-line options. Overlapping reads may optionally be merged using the --merge option:
adapterremoval3 \
--in-file1 reads_1.fastq.gz \
--in-file2 reads_2.fastq.gz \
--out-prefix output \
--merge
When run in this manner, AdapterRemoval will save the QC reports as described above, while FASTQ reads will be saved to output.r1.fastq.gz, output.r2.fastq.gz, output.merged.fastq.gz, and output.singleton.fastq.gz. The singleton file contains reads from pairs in which one of the reads was discarded.
If you prefer, you can manually specify just the output files you are interested in, via --out-file1, --out-file2, and other options starting with --out. See the AdapterRemoval manpage for a complete list of input and output options.
By default, AdapterRemoval will attempt to infer the adapter sequences present in the data automatically. See A note on specifying adapters below for more information.
Output formats
AdapterRemoval supports writing processed reads as uncompressed or compressed FASTQ files (.fastq, .fq, .fastq.gz, and .fq.gz), as uncompressed or compressed SAM files (.sam and .sam.gz), and as uncompressed or compressed BAM files (.bam).
Formats are specified either using the --out-format option in conjunction with the --out-prefix option, or by specifying one of the above extensions when using one of the explicit --out options. If AdapterRemoval does not recognize the extension, then the format specified via --out-format is used, defaulting to gzip compressed FASTQ.
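The extension-based selection described above can be sketched in Python as follows. This is an illustrative model only, not AdapterRemoval's implementation: the function name infer_format and the format labels are assumptions made for this example, and the extension table simply mirrors the list given above.

```python
# Hypothetical mapping from the file extensions listed above to format
# labels; the labels themselves are placeholders chosen for this sketch.
KNOWN_EXTENSIONS = {
    ".fastq": "fastq",
    ".fq": "fastq",
    ".fastq.gz": "fastq.gz",
    ".fq.gz": "fastq.gz",
    ".sam": "sam",
    ".sam.gz": "sam.gz",
    ".bam": "bam",
}


def infer_format(filename, out_format="fastq.gz"):
    """Return the format implied by the extension, else `out_format`.

    `out_format` stands in for the value given via --out-format, which
    defaults to gzip-compressed FASTQ.
    """
    for extension, label in KNOWN_EXTENSIONS.items():
        if filename.endswith(extension):
            return label
    # Unrecognized extension: fall back to the requested/default format
    return out_format
```

Under these assumptions, infer_format("reads.sam") selects SAM output regardless of the fallback, while a name such as "reads.txt" uses whatever fallback format was requested.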
By default, the uncompressed version of the format specified via --out-format is used for records written to STDOUT (see below), under the assumption that the output is to be processed by another program, but this may be overridden using the --stdout-format option.
See the Input and output page for more information about files generated by AdapterRemoval.
Standard input (STDIN), standard output (STDOUT)
Reading from STDIN and writing to STDOUT can be accomplished either by using the special /dev/stdin and /dev/stdout files as input or output, or by using the filename - for either:
some-command | adapterremoval3 --in-file1 - --out-file1 - | some-other-command
Output written to STDOUT can be freely interleaved, as described in the Interleaved input and output section below.
Disabling output files
Output that has not been enabled by setting the --out-prefix option (everything but discarded reads) or by setting the --out option for that specific output type (e.g. --out-singleton) is still processed, but is then discarded before the compression step. This allows statistics to be collected for all results, even if they are not saved.
AdapterRemoval additionally recognizes the special path /dev/null, and will discard the output for any --out option for which this path is used. That means that you can use the --out-prefix option to specify a default output path, and then disable the read types that you do not care about.
This means that instead of
adapterremoval3 \
--in-file1 reads_1.fastq \
--in-file2 reads_2.fastq \
--out-html output.html \
--out-json output.json \
--out-file1 output.r1.fastq.gz \
--out-file2 output.r2.fastq.gz \
--out-merged output.merged.fastq.gz
you could write
adapterremoval3 \
--in-file1 reads_1.fastq \
--in-file2 reads_2.fastq \
--out-prefix output \
--out-singleton /dev/null
Multiple input FASTQ files
More than one input file may be specified after the --in-file1 and --in-file2 options. Files are processed in the specified order, as if they had been concatenated using cat or zcat:
adapterremoval3 \
--in-file1 reads_1a.fastq reads_1b.fastq reads_1c.fastq
adapterremoval3 \
--in-file1 reads_1a.fastq reads_1b.fastq reads_1c.fastq \
--in-file2 reads_2a.fastq reads_2b.fastq reads_2c.fastq
The number of files provided for --in-file1 and --in-file2 need not match, as long as the total number of reads and the read order are the same.
Interleaved input and output
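To illustrate why the number of files does not matter, the following Python sketch reads several files in order as if they had been concatenated, and counts the resulting reads. It is a simplified model under stated assumptions (four lines per FASTQ record, gzip detected from the file's first two bytes); the helper names are hypothetical and do not correspond to anything in AdapterRemoval.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file, based on its magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_lines(paths):
    """Yield lines from each file in the given order, like cat or zcat."""
    for path in paths:
        with open_maybe_gzip(path) as handle:
            yield from handle


def count_reads(paths):
    """Count FASTQ records, assuming exactly four lines per record."""
    return sum(1 for _ in read_lines(paths)) // 4
```

With this model, three mate 1 files and a single mate 2 file are compatible whenever count_reads gives the same total for both lists and the reads appear in the same order.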
AdapterRemoval is able to read and write paired-end reads stored in a single, so-called interleaved FASTQ file (one pair at a time, first mate 1, then mate 2). This is accomplished by specifying the location of the file using --in-file1 and also setting the --interleaved command-line option:
adapterremoval3 \
--interleaved \
--in-file1 interleaved.fastq \
--out-prefix output_interleaved
Other than taking just a single input file, this mode operates almost exactly like paired-end trimming (as described above); the mode differs only in that paired reads are not written to separate 'r1' and 'r2' files, but are instead written to a single file. The location of this file is controlled using the --out-file1 option.
Enabling either reading or writing of interleaved FASTQ files, but not both, can be accomplished by using either of the --interleaved-input and --interleaved-output options, both of which are enabled by the --interleaved option.
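The interleaved layout (one pair at a time, first mate 1, then mate 2) can be sketched in Python as follows. The functions interleave and deinterleave are hypothetical helpers written for this illustration, operating on already-parsed records rather than on FASTQ files.

```python
def interleave(mate1_records, mate2_records):
    """Yield records one pair at a time: first mate 1, then mate 2."""
    mate1_records = list(mate1_records)
    mate2_records = list(mate2_records)
    if len(mate1_records) != len(mate2_records):
        raise ValueError("mate 1 and mate 2 must contain the same number of reads")
    for read_1, read_2 in zip(mate1_records, mate2_records):
        yield read_1
        yield read_2


def deinterleave(records):
    """Split an interleaved sequence back into (mate 1, mate 2) lists."""
    records = list(records)
    if len(records) % 2:
        raise ValueError("interleaved input must contain an even number of reads")
    # Even positions hold mate 1 reads, odd positions hold mate 2 reads
    return records[0::2], records[1::2]
```

Reading interleaved input corresponds to deinterleave, and writing interleaved output corresponds to interleave; the two can be used independently, mirroring the separate --interleaved-input and --interleaved-output options.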
Alternatively, you can specify the same output file for multiple output types, in order to write all of those reads to a single file in interleaved mode:
adapterremoval3 \
--in-file1 input_1.fastq.gz \
--in-file2 input_2.fastq.gz \
--out-file1 output_interleaved.fastq.gz \
--out-file2 output_interleaved.fastq.gz
The ability to interleave output extends to all output types, except for the two reports (--out-json and --out-html), and one could for example write both discarded and singleton reads to the same file (output_interleaved.discarded.fastq.gz) using the following command:
adapterremoval3 \
--in-file1 input_1.fastq.gz \
--in-file2 input_2.fastq.gz \
--out-prefix output_interleaved \
--out-discarded output_interleaved.discarded.fastq.gz \
--out-singleton output_interleaved.discarded.fastq.gz
Different quality score encodings
By default, AdapterRemoval expects the quality scores in FASTQ reads to be Phred+33 encoded, meaning that the error probabilities are encoded as round('!' - 10 * log10(p)). Most data will be encoded using Phred+33, but Phred+64 and 'Solexa' encoded quality scores are also supported. These are selected by specifying the --quality-format command-line option (specifying either '33', '64', or 'solexa'):
adapterremoval3 \
--quality-format 64 \
--in-file1 reads_q64.fastq \
--out-prefix output_phred_64
Output is always saved as Phred+33. See this Wikipedia article for a detailed overview of Phred encoding schemes currently and previously in use.
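As a worked example of the encoding above, the following Python sketch converts between error probabilities and Phred+33 characters, and re-encodes a Phred+64 quality string as Phred+33. The function names are chosen for this illustration only, and Solexa scores are deliberately left out since they use a different formula.

```python
import math

PHRED33_OFFSET = ord("!")  # 33
PHRED64_OFFSET = ord("@")  # 64


def encode_phred33(p):
    """Encode an error probability as a Phred+33 character."""
    return chr(round(PHRED33_OFFSET - 10 * math.log10(p)))


def decode_phred33(char):
    """Return the error probability encoded by a Phred+33 character."""
    return 10 ** (-(ord(char) - PHRED33_OFFSET) / 10)


def phred64_to_phred33(qualities):
    """Re-encode a Phred+64 quality string as Phred+33."""
    return "".join(
        chr(ord(char) - PHRED64_OFFSET + PHRED33_OFFSET) for char in qualities
    )
```

For instance, an error probability of 0.001 corresponds to a quality score of 30 and is written as '?' in Phred+33, while the Phred+64 character 'h' (quality score 40) becomes 'I' after re-encoding.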
Demultiplexing
AdapterRemoval supports simultaneous demultiplexing and adapter trimming; demultiplexing is carried out using a simple comparison between the specified barcode (a sequence of A, C, G, and T) and the first N bases of the mate 1 read, where N is the length of the barcode. Demultiplexing of double-indexed sequences is also supported, in which case two barcodes must be specified for each sample. The first barcode is then compared to the first N_1 bases of the mate 1 read, and the second barcode is compared to the first N_2 bases of the mate 2 read. By default, this comparison requires a perfect match. Reads identified as containing a specific barcode (or pair of barcodes) are then trimmed using adapter sequences including the barcode(s) as necessary. Reads for which no barcode (or pair of barcodes) matched are written to a separate file or a pair of files (for paired-end reads).
Demultiplexing is enabled by creating a table of barcodes, the first column of which specifies the sample name (using characters a-z, A-Z, 0-9, or _) and the second and (optional) third columns of which specify the barcode sequences expected at the 5' termini of mate 1 and mate 2 reads, respectively.
For example, a table of barcodes from a double-indexed run might be as follows (see examples/barcodes.txt):
$ cat barcodes.txt
sample_1 ATGCGGA TGAATCT
sample_2 ATGGATT ATAGTGA
sample_7 CAAAACT TCGCTGC
AdapterRemoval is invoked with the --barcode-table option, specifying the path to this table:
adapterremoval3 \
--in-file1 demux_1.fastq \
--in-file2 demux_2.fastq \
--out-prefix output_demux \
--barcode-table barcodes.txt
This generates a set of output files for each sample specified in the barcode table, using output_demux as the prefix for output filenames, followed by a dot and the sample name, followed by a dot and the default name for a given file type. The reports generated by AdapterRemoval contain information about the number of reads identified for each sample and (in the JSON file) detailed per-sample quality metrics.
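The naming scheme described above can be sketched in Python as follows. The function demux_filenames is a hypothetical helper, and the default names passed to it are assumptions based on the paired-end filenames shown earlier in this guide; the exact set of files produced depends on the options used.

```python
def demux_filenames(prefix, samples, default_names):
    """Build per-sample filenames as prefix + "." + sample + "." + default name.

    `default_names` holds the default name for each file type, for example
    "r1.fastq.gz" and "r2.fastq.gz" (assumed here for illustration).
    """
    return {
        sample: [f"{prefix}.{sample}.{name}" for name in default_names]
        for sample in samples
    }
```

Under these assumptions, demux_filenames("output_demux", ["sample_1"], ["r1.fastq.gz", "r2.fastq.gz"]) yields output_demux.sample_1.r1.fastq.gz and output_demux.sample_1.r2.fastq.gz for sample_1.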
The maximum number of mismatches allowed when comparing barcodes is controlled using the options --barcode-mm, --barcode-mm-r1, and --barcode-mm-r2, which specify the maximum total number of mismatches, and the maximum number of mismatches for the mate 1 and mate 2 barcodes, respectively. Thus, if mm_1(i) and mm_2(i) represent the number of mismatches observed for barcode pair i for a given pair of reads, these options require that
mm_1(i) <= --barcode-mm-r1
mm_2(i) <= --barcode-mm-r2
mm_1(i) + mm_2(i) <= --barcode-mm
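The three conditions can be expressed as a small Python sketch. This is not AdapterRemoval's implementation: the function names are hypothetical, all three limits are passed explicitly rather than modelling the tool's defaults, and counting bases missing from a too-short read as mismatches is an assumption made for this example.

```python
def count_mismatches(barcode, read):
    """Count mismatches between a barcode and the first len(barcode) bases."""
    prefix = read[: len(barcode)]
    differing = sum(a != b for a, b in zip(barcode, prefix))
    # Assumption: bases missing from a too-short read count as mismatches
    return differing + len(barcode) - len(prefix)


def pair_passes(barcode_1, barcode_2, read_1, read_2, max_mm, max_mm_r1, max_mm_r2):
    """Check mm_1 <= max_mm_r1, mm_2 <= max_mm_r2, and mm_1 + mm_2 <= max_mm."""
    mm_1 = count_mismatches(barcode_1, read_1)
    mm_2 = count_mismatches(barcode_2, read_2)
    return mm_1 <= max_mm_r1 and mm_2 <= max_mm_r2 and mm_1 + mm_2 <= max_mm
```

Note that the total limit can reject a pair even when each mate is within its own limit: with one mismatch in each barcode, per-mate limits of 1 and a total limit of 1, the pair fails because the combined count is 2.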
If the --demultiplex-only option is used, then no trimming/processing is performed after the demultiplexing step:
adapterremoval3 \
--in-file1 demux_1.fastq \
--in-file2 demux_2.fastq \
--out-prefix output_only_demux \
--barcode-table barcodes.txt \
--demultiplex-only
Warning
Output produced in --demultiplex-only mode may still contain adapter sequences, and for paired reads/double indexed data these adapters will be prefixed by the barcode sequence(s). Downstream trimming will therefore have to account for these extra sequences.
Quality reports and identifying adapter sequences
AdapterRemoval generates a detailed report of input and output data as part of its operation. This report can additionally be generated without performing read processing, in which case statistics are only provided for the "raw" input data.
By default, quality reports are written to prefix.html and prefix.json, if the --out-prefix option is used. This can also be controlled by the --out-json and --out-html options. This means that we can simply omit other output files to generate only the reports:
adapterremoval3 \
--in-file1 reads_1.fastq \
--in-file2 reads_2.fastq \
--out-json my_report.json \
--out-html my_report.html
AdapterRemoval can also generate a report without performing any processing of the input, using the --report-only option, which greatly increases throughput. When run without read processing enabled, AdapterRemoval attempts to infer a consensus adapter sequence, based on fragments identified as belonging to the adapters through pairwise alignments of the reads (PE mode only).
Since only the reports are generated in this mode, we can use the --out-prefix option to simplify the command:
adapterremoval3 \
--report-only \
--in-file1 reads_1.fastq \
--in-file2 reads_2.fastq \
--out-prefix my_report
The inferred consensus sequences are compared to those specified using the --adapter1 and --adapter2 command-line options, if specified, or to the best matching adapter otherwise (see below). Pipes (|) indicate matches between the provided sequences and the consensus sequence, and asterisks (*) indicate the presence of unspecified bases (Ns).
The best practice is to compare the consensus with published Illumina or BGI/MGI adapter sequences and pick out the best matches. The built-in list of adapters can be viewed by using the --adapter-database option (see below). However, on occasion there may be consistent differences between the published sequences and the observed adapter sequences, in which case you should prefer the observed sequence.
A note on specifying adapters
By default, AdapterRemoval will attempt to identify the type of adapter sequences present in the input data, based on a database of adapter sequences included with AdapterRemoval. The selected adapter sequences (if any) will be listed in the resulting QC reports.
If AdapterRemoval cannot identify any potential adapter sequences in the input, then AdapterRemoval will either assume that the data contains no adapters (in single-end mode), or perform adapter trimming based on the pair-wise alignment of the input reads (paired-end mode). This behavior is controlled via the --adapter-selection and --adapter-fallback options.
You can use the --adapter-database option to list the known adapter sequences, in either tsv or json format:
adapterremoval3 --adapter-database tsv
This database can be extended by combining --adapter-selection auto with the options --adapter1 and --adapter2, or with --adapter-table.
Manually specifying adapters
Adapter sequences may also be set explicitly via the --adapter1 and --adapter2 options, should you be aware of the exact sequences. Adapter sequences are specified in the read orientation when using the --adapter1 and --adapter2 command-line options, directly corresponding to the sequence that is observed in the FASTQ files produced by the base calling software.
In other words, if we were processing data generated using Illumina TruSeq adapters, then the TruSeq read 1 adapter should be found in files passed to --in-file1:
$ grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" file1.fastq
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAAGAAT
CTGGAGTTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
GGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGCAAATTGAAAACAC
And the TruSeq read 2 adapter should be found in files passed to --in-file2:
$ grep "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" file2.fastq
CAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAGAAAAACATCTTG
GAACTCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAATAGA
GAACTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAACATAAGACCTA
How much of these adapter sequences can be found in your input (if any) depends on the read length and the size of the DNA fragments sequenced. AdapterRemoval is designed to detect even short adapter fragments.
To manually set these adapters, use the command-line options --adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT.
Tip
An N in an adapter sequence is treated as a wildcard. An N will align against any other base, including other Ns, but does not affect the score of the resulting alignment and is not counted for the purpose of filters such as --min-overlap.
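The wildcard behavior described in this tip can be modelled with a short Python sketch. It is a simplified, ungapped comparison written for illustration, not AdapterRemoval's alignment code: the function name is hypothetical, and skipping positions where either sequence has an N is an assumption based on the description above.

```python
def compare_with_wildcards(adapter, fragment):
    """Compare two equal-length sequences, treating N as a wildcard.

    Returns (matches, mismatches, counted), where `counted` is the number
    of positions that would count towards an overlap-length filter.
    """
    if len(adapter) != len(fragment):
        raise ValueError("sequences must have the same length")
    matches = mismatches = 0
    for base_a, base_b in zip(adapter, fragment):
        if base_a == "N" or base_b == "N":
            # Wildcard position: aligns with anything, contributes nothing
            continue
        if base_a == base_b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, matches + mismatches
```

In this model, comparing AGNTC against AGGTC gives four matches over four counted positions, since the position holding the N is ignored rather than treated as a match.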
Tip
It is generally not worthwhile to specify more than the first ~30 bp of the adapter sequences to be trimmed. Doing so does not notably improve sensitivity or specificity, but does result in a lower throughput.
Trimming paired-end data with multiple adapter pairs
It is possible to provide multiple, different sets of adapters for trimming, in which case AdapterRemoval will select the single best match for each read (pair), and trim that adapter or adapter pair from the read or read pair.
Adapters must be written in a one- or two-column table, for SE and PE trimming, respectively. Columns can be separated by any whitespace. For example, to specify both the recommended Illumina TruSeq and the recommended BGISeq adapters, one might save the following text in the file adapters.txt:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG
This file is then specified using the --adapter-table option:
adapterremoval3 \
--in-file1 reads_1.fastq \
--in-file2 reads_2.fastq \
--out-prefix output_multi \
--adapter-table adapters.txt
Pairs of adapters are used exactly as written, and the resulting QC reports list how frequently each adapter or each pair of adapters was used.
Note that throughput decreases proportionally to the number of adapters, and it is therefore not recommended to use this functionality unless strictly necessary. When adapters differ only after the first N bases, for example due to an embedded barcode, then it is typically better to specify the common part of the adapter sequences with --adapter1 and (optionally) --adapter2, instead of specifying multiple, different adapter pairs in a table.