AdapterRemoval manpage

Synopsis

adapterremoval3 [options...] --in-file1 <filenames> [--in-file2 <filenames>]

Description

adapterremoval3 removes residual adapter sequences from single-end (SE) or paired-end (PE) FASTQ reads, optionally trimming Ns and low quality bases and/or merging overlapping paired-end mates into one read. Low quality reads are filtered based on the resulting length and the number of ambiguous nucleotides (N) present following trimming. These operations may be combined with simultaneous demultiplexing using 5' barcode sequences. Reports containing statistics and plots in HTML and JSON format are generated after each run.

Alternatively, adapterremoval3 may perform a report-only analysis of the input data, including the reconstruction of consensus adapter sequences from paired-end data.

If you use this program, please cite the paper:

Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 12;9(1):88

http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2

For detailed documentation, please see

http://adapterremoval.readthedocs.io/

Options

-h, --help: Display summary of command-line options.

-v, --version: Print the version string.

--licenses: Print licenses for this software as well as for dependencies used by AdapterRemoval.

--threads n: Maximum number of threads. Defaults to 2.

--simd name: Selects the preferred SIMD instruction. Possible values are none, SSE2, AVX2, AVX512, and NEON. Options may be unavailable depending on the current system and depending on the compiler used to build AdapterRemoval. By default, AdapterRemoval will attempt to select the most appropriate instruction set for the input data.

Input files

--in-file1 filename [filenames...], --in-file2 filename [filenames...]: Read FASTQ mate 1 and mate 2 reads from one or more files, either uncompressed or gzip compressed. The --interleaved and --interleaved-input options may be used to enable reading of interleaved reads from the files specified with --in-file1, in which case --in-file2 is not used.

--head n: If set, AdapterRemoval will process only the first N reads/pairs of reads in the input, in single-end and paired-end mode, respectively. Accepts suffixes K (thousands), M (millions), and G (billions). By default, all data is processed.

Output file options

Output files for AdapterRemoval may be specified either via the --out-prefix option, which assigns default filenames to output, and/or via the individual --out options to set or override the output filename for a given output type. The same filename may be used for multiple --out options, in which case all of those reads are written to that file in input order.

Use - or /dev/stdin to read from standard input, use - or /dev/stdout to write to standard output, and use /dev/null to disable the generation of those types of reads. Statistics are still collected for read types written to /dev/null, but the data itself is not serialized or compressed, in order to save time.

--out-prefix path: Prefix for the output files for which no filename was set using the corresponding options below, except for --out-discarded which is not saved by default. If this option is not used, then files for which no --out option was set will not be saved.

--out-file1 filename, --out-file2 filename: Output files containing trimmed mate 1 reads and mate 2 reads. If interleaved output is enabled, via the --interleaved or the --interleaved-output options, then only --out-file1 is used, and this file will contain both mate 1 and mate 2 reads.

--out-merged filename: When used with --merge, this file contains overlapping mate-pairs which have been merged into a single read. Setting this option in demultiplexing mode overrides --out-prefix for just this read type.

--out-singleton filename: Output file containing orphaned paired reads for which the mate has been discarded. Setting this option in demultiplexing mode overrides --out-prefix for just this read type.

--out-unidentified1 filename, --out-unidentified2 filename: In demultiplexing mode, reads that could not be assigned to a single sample are written to these files. In interleaved mode, both mate 1 and mate 2 reads are written to --out-unidentified1.

--out-discarded filename: Contains reads discarded due to the --min-length, --max-length, --max-ns, --min-mean-quality, or --min-complexity options. This option is not enabled by setting --out-prefix.

--out-json filename, --out-html filename: Reports in JSON/HTML format containing information about the parameters used in the run, overall statistics on the reads before and after trimming, demultiplexing statistics, and the results of any analyses carried out during the run. Analyses include insert inference of sizes, duplication levels, and any inferred consensus adapter sequences.

FASTQ options

--quality-format name: The Phred quality score encoding used in input reads - either 64 for Phred+64 (Illumina 1.3+ and 1.5+) or 33 for Phred+33 (Illumina 1.8+, BGISeq, and more). In addition, the value solexa may be used to specify reads with Solexa encoded scores. The sam format may be used for Phred+33 encoded data with quality scores higher than that normally produced by sequencing machines. Defaults to 33.

--mate-separator separator: Character separating the mate number (1 or 2) from the read name in FASTQ records, such as @my-read/1 and @my-read/2. This is typically either / or .. By default, AdapterRemoval will attempt to infer this from the input data.

--normalize-mate-separator [value]: Replace the mate separator in FASTQ reads with the specified character (/ if no character is specified). If reads do not contain mate numbers, these are added. If strip, the mate separator is stripped from FASTQ reads. By default, mate separator is not normalized.

--interleaved-input: Enable reading of interleaved FASTQ reads from the files specified with --in-file1. Defaults to off.

--interleaved-output: Write paired-end reads to the file specified by --out-file1, interleaving mate 1 and mate 2 reads. Defaults to off.

--interleaved: Enables --interleaved-input and --interleaved-output. Defaults to off.

--mask-degenerate-bases: Mask degenerate/ambiguous IUPAC encoded bases (B/D/H/K/M/N/R/S/V/W/Y) in the input by replacing them with an N; if this option is not used, AdapterRemoval will abort upon encountering degenerate bases in the input.

--convert-uracils: Convert uracils (U) to thymine (T) in input reads; if this option is not used, AdapterRemoval will abort upon encountering uracils in the input.

Output compression options

--out-format name: Selects the default output format for files: fastq for uncompressed FASTQ reads, fastq.gz for gzip compressed FASTQ reads, sam for uncompressed SAM records, sam.gz for gzip compressed SAM records, bam for BGZF compressed BAM records, and ubam for uncompressed BAM records. Setting a --out option overrides this option based on the extension used (except .ubam). Defaults to fastq.gz.

--stdout-format name: Selects the output format for data written to STDOUT, i.e. when writing to - or /dev/stdout; choices are the same as for --out-format. By default, the uncompressed version of the current --out-format is used.

--read-group [n, ...]: Add read-group (RG) information to SAM/BAM output. Tags can either be specified as individual arguments, i.e. --read-group ID:foo SM:bar, or as a string containing tags separated by tabs. Normally the latter can be written as --read-group $'ID:DS-1\tSM:TK-421\tPL:ILLUMINA' on the command-line. If the ID tag is not provided, the default ID 1 will be used.

--compression-level n: Sets the compression level for compressed output. Valid values are 0 to 12: Level 0 is uncompressed but includes gzip headers/checksums, level 1 is streamed for FASTQ and SAM output, which may be required for compatibility in rare cases. FASTQ and SAM output with compression levels 2 to 12, and BAM output, is block compressed using the gzip compatible BGZF format. Lower this value to 1 for a 50-100% increase in throughput, at the cost of 10-20% larger output files. Defaults to compression level 4.

Adapter selection

--adapter1 adapter: Adapter sequence expected to be found in mate 1 reads, specified in read direction. For a detailed description of how to provide the appropriate adapter sequences, see the "Adapters" section of the online documentation. Default is AGATCGGAAGAGCACACGTCTGAACTCCAGTCA, intended for Illumina TruSeq and similar data.

--adapter2 adapter: Adapter sequence expected to be found in mate 2 reads, specified in read direction. For a detailed description of how to provide the appropriate adapter sequences, see the "Adapters" section of the online documentation. Default is AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT, intended for Illumina TruSeq and similar data.

--adapter-table filename: Read one or more adapter sequences from a table. The first two columns (separated by whitespace) of each line in the file are expected to correspond to values passed to --adapter1 and --adapter2. In single-end mode, only column one is required. Lines starting with # are ignored. When multiple rows are found in the table, AdapterRemoval will try each adapter (pair), and select the best aligning adapters for each FASTQ read processed.

--adapter-selection strategy: How to select the adapter sequences to trim: If auto, attempt to determine adapter sequences from the input data; if manual, use the user-defined adapter sequences; if undefined, trim based on overlap analyses (PE only) and/or 5' barcodes (SE if mate 2 barcodes are provided); and if none, assume that the data contains no adapter sequences. Possible values are auto, manual, undefined, and none. Defaults to auto, unless --adapter1, --adapter2, or --adapter-table are used, in which case the default is manual.

--adapter-fallback strategy: If --adapter-selection auto is used and no adapter sequences could be identified, either abort the program, or fall back to one of the other possible --adapter-selection options. Possible values are undefined, none, and abort. Defaults to undefined.

--adapter-database format: Output the adapters used for automatic adapter selection to STDOUT, either as a tsv table or in json format. If an adapter 2 sequence is not listed, then AdapterRemoval uses the same sequence for both adapter 1 and adapter 2.

FASTQ processing options

--min-overlap length: The minimum amount of bases that must overlap, not counting ambiguous nucleotides (Ns), for a single-end or paired-end alignment to be considered valid. For paired-end mode, this is important when no adapters were specified or detected, to prevent trimming of short palindromes. Defaults to 1 in single-end mode, and --merge-threshold in paired-end mode.

--mismatch-rate rate: Max error-rate allowed when aligning reads and/or adapters. The default value of 0.1667 corresponds to approximately 1 error every 6 bases.

--shift n: To allow for missing bases in the 5' end of the read, the program can let the alignment slip --shift bases in the 5' end. This corresponds to starting the alignment maximum --shift nucleotides into read 2 (for paired-end) or the adapter (for single-end). The default is 2.

--merge: In paired-end mode, merge overlapping mates into a single read and recalculate the quality scores of overlapping bases. The overlap needs to be at least --merge-threshold nucleotides, with a maximum number of mismatches determined by --mismatch-rate. This option has no effect in single-end mode.

--merge-threshold length: The minimum overlap between mate 1 and mate 2 before the reads are merged into one, potentially longer sequence, when merging is enabled. Does not include ambiguous nucleotides (Ns). Default is 11 bases.

--merge-strategy name: Determines how to assign quality scores to matching/mismatching bases during read merging. The maximum strategy uses Q=max(Q1, Q2) for matches while the additive strategy uses Q = Q1 + Q2. Both strategies use Q = abs(Q1 - Q2) for mismatches, and picks the highest quality base, unless the qualities are the same in which case N is used. Possible values are maximum, and additive. Defaults to maximum.

--merge-quality-max phred: Sets the maximum Phred score for re-calculated quality scores when read merging is enabled with the additive merging strategy. The value must be in the range 0 to 93, corresponding to Phred+33 encoded values of ! to ~. Defaults to 41.

--prefix-read1 X: Adds the specified prefix to the names of mate 1 reads. Defaults to no prefix.

--prefix-read2 X: Adds the specified prefix to the names of mate 2 reads. Defaults to no prefix.

--prefix-merged X: Adds the specified prefix to merged read names. Defaults to no prefix.

Quality trimming options

--pre-trim3p n [n]: Trim the 3' of reads by a fixed amount after demultiplexing but before removing adapters. Specify one value to trim mate 1 and mate 2 reads the same amount, or two values separated by a space to trim each mate a different amount. Off by default.

--post-trim5p n [n]: Trim the 5' of reads by a fixed amount after removing adapters, but before carrying out quality based trimming. See --pre-trim3p.

--post-trim3p n [n]: Trim the 3' of reads by a fixed amount after removing adapters, but before carrying out quality based trimming. See --pre-trim3p.

--quality-trimming method: The method used for performing quality trimming: none to disable quality trimming, mott to enable trimming using the modified Mott's algorithm, window to perform window-based quality trimming, and per-base to perform base-by-base trimming of low-quality bases and Ns (if enabled). Defaults to Mott's algorithm.

--trim-mott-rate rate: The inclusive threshold value used when trimming low-quality bases using the modified Mott's algorithm. A value of zero disables trimming. Defaults to 0.05.

--trim-windows size: Specifies the initial size of the dynamic window, when window-based quality trimming is enabled via --quality-trimming window. Trimming is performed using a sliding window-based approach inspired by sickle. See the "Window-based quality trimming" section of the manual page for a description of this algorithm. Defaults to 0.1.

--trim-min-quality minimum: Inclusive minimum quality used when trimming low-quality bases with --quality-trimming options window and per-base. Defaults to 2.

--pre-trim-polyx [nucleotides...]: Enable trimming of poly-X tails prior to read alignment and adapter trimming. Zero or more nucleotides (A, C, G, T) may be specified, separated by spaces, with zero nucleotides corresponding to all of A, C, G, and T. Defaults to no trimming.

--post-trim-polyx [nucleotides...]: Enable trimming of poly-X tails after read alignment and adapter trimming/merging, but before trimming of low-quality bases. Merged reads are not trimmed by this option, since both ends are derived from the 5' of reads. See --pre-trim-polyx. Off by default.

--trim-polyx-threshold n: The minimum length of a poly-X tail, when either --pre-trim-polyx or --post-trim-polyx is enabled. Defaults to 10 nucleotides.

--preserve5p: If set, bases at the 5' will not be trimmed by mott, window, or per-base trimming, except if the entire read consists of low-quality bases. Merged reads will not be quality trimmed when this option is enabled due to the 3' ends being located inside the reads or overlapping the 5' of the source sequences.

Filtering options

--max-ns n: Discard reads containing more than --max-ns ambiguous bases (N) after trimming. Default is no maximum.

--max-ns-fraction n: Discard reads where the number of ambiguous bases (N) divided by the read-length after trimming is greater than the specified value. Default is no maximum.

--min-length length: Reads shorter than this length are discarded following trimming. Defaults to 15.

--max-length length: Reads longer than this length are discarded following trimming. Defaults to no maximum.

--min-mean-quality X: Reads with a mean quality score less than this value following trimming are discarded. The value must be in the range 0 to 93, corresponding to Phred+33 encoded values of '!' to '~'. Defaults to no minimum.

--min-complexity X: Reads with a sequence quality less than this value after trimming are discarded. Complexity is measured as the fraction of positions that differ from the previous position, not counting ambiguous bases (N). A suggested value is 0.3. Defaults to no minimum.

Demultiplexing options

--barcode-table filename: Perform demultiplexing using table of one or two fixed-length barcodes for SE or PE reads. The table is expected to contain 2 or 3 white-space separated columns, the first of which represent the name of a given sample, and the second and third of which represent the mate 1 and (optionally) the mate 2 barcode sequence. For a detailed description, see the "Demultiplexing" section of the online documentation.

--multiple-barcodes: Allow for more than one barcode (pair) for each sample. If this option is not specified, AdapterRemoval will abort if multiple barcodes/barcode pairs identify the same sample.

--barcode-orientation orientation: Process barcode sequences in both the barcode1-insert-barcode2 (forward) orientation and barcode2-insert-barcode1 (reverse) orientation. If forward or reverse is used, the barcode in the barcode table are assumed to be in that orientation, and the opposite sequence is generated. If explicit is used, the barcode table is expected to contain a 4th column specifying the orientation (forward or reverse for each barcode), and only that orientation is used. Default is unspecified.

--normalize-orientation: Reverse complement merged reads found to be in the reverse orientation, based on barcode orientation.

--barcode-mm n: Maximum total number of mismatches allowed, when counting mismatches in both the mate 1 and the mate 2 barcode for paired reads.

--barcode-mm-r1 n: Maximum number of mismatches allowed for the mate 1 barcode; if not set, this value is equal to the --barcode-mm value; cannot be higher than the --barcode-mm value.

--barcode-mm-r2 n: Maximum number of mismatches allowed for the mate 2 barcode; if not set, this value is equal to the --barcode-mm value; cannot be higher than the --barcode-mm value.

--demultiplex-only: Only carry out demultiplexing using the list of barcodes supplied with --barcode-table. No other processing is done.

Reporting options

--report-only: Write a report of the input data without performing any processing of the FASTQ reads. In addition, attempt to build a consensus adapter sequence from fully overlapping pairs of paired-end reads. The minimum overlap is controlled by --merge-threshold and the result will be compared with the values set using --adapter1 and --adapter2. Default is off.

--report-title title: Set the title used in the HTML report. Defaults to AdapterRemoval v3.0.1.

--report-sample-rate x: The fraction of reads to sample when generating base quality/composition curves for trimming reports. Using all data (--report-sample-rate 1.0) results in an 10-30% decrease in throughput and is typically not necessary, except for tiny datasets. Default is 0.10.

--report-duplication [n]: FastQC based duplicate detection, based on the frequency of the first N unique sequences observed. If the option is used without an explicit value, the FastQC default of 100k unique reads is used; a value of 0 disables the analysis. Accepts suffixes K, M, and G. Default is 0.

Logging options

--log-level name: The minimum severity of messages to be written to stderr. Possible values are debug, info, warning, and error. Default is info.

--log-colors name: Enable/disable the use of colors when writing log messages. If set to auto, colors will only be enabled if STDERR is a terminal and the NO_COLORS environmental variable is not set. Possible values are auto, always, and never. Defaults to auto.

--log-progress name: Specify the type of progress report used. If set to auto, then a spinner will be used if STDERR is a terminal and the NO_COLORS environmental variable is not set, otherwise a log line will be written for every 1 million records processed. Possible values are auto, spin, log, and never. Default is auto.

Window-based quality trimming

AdapterRemoval implements a sliding window-based approach to quality-based base trimming inspired by sickle. If --trim-windows is greater than or equal to 1, that number is used as the window size for all reads. If --trim-windows is a number greater than or equal to 0 and less than 1, then that number is multiplied by the length of individual reads to determine the window size. If the window length is zero or is greater than the current read length, then the read length is used instead.
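The window size rules in the preceding paragraph can be sketched in Python as follows. The function name is hypothetical, and truncating the fractional product to an integer is an assumption, since the rounding mode is not stated.

```python
def window_length(trim_windows, read_length):
    """Determine the window size for one read from --trim-windows."""
    if trim_windows < 0:
        raise ValueError("--trim-windows must not be negative")
    if trim_windows >= 1:
        size = int(trim_windows)
    else:
        # Assumption: the fractional window size is truncated.
        size = int(trim_windows * read_length)
    if size == 0 or size > read_length:
        size = read_length
    return size


print(window_length(0.1, 100))  # 10
print(window_length(0.1, 5))    # 5
print(window_length(10, 8))     # 8
```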

Reads are trimmed as follows for a given window size:

The new 5' is determined by locating the first window where both the average quality and the quality of the first base in the window are greater than --trim-min-quality.

The new 3' is located by sliding the first window right, until the average quality becomes less than or equal to --trim-min-quality. The new 3' is placed at the last base in that window where the quality is greater than or equal to --trim-min-quality.

If no 5' position could be determined, the read is discarded.

Exit status

AdapterRemoval exits with status 0 if the program ran successfully, and with a non-zero exit code if any errors were encountered. Output from AdapterRemoval should not be used if the program returned a non-zero exit code.

Reporting bugs

Please report any bugs using the AdapterRemoval issue-tracker:

https://github.com/MikkelSchubert/adapterremoval/issues

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.