#################
 Getting started
#################

*************
 Basic usage
*************

To run AdapterRemoval on single-end FASTQ data, specify the location of FASTQ file(s) using the ``--in-file1`` command-line option. Use the ``--out-prefix`` option to specify a prefix for the names of the resulting FASTQ and report files:

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq.gz \
        --out-prefix output

When run in this manner, AdapterRemoval will generate the files ``output.fastq.gz`` containing the trimmed reads, ``output.html`` containing a human-readable QC report, and ``output.json`` containing the QC report in JSON_ format.

To run AdapterRemoval on paired-end FASTQ data, specify the location of the mate 1 and mate 2 FASTQ files using the ``--in-file1`` and ``--in-file2`` command-line options. Overlapping reads may optionally be merged using the ``--merge`` option:

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq.gz \
        --in-file2 reads_2.fastq.gz \
        --out-prefix output \
        --merge

When run in this manner, AdapterRemoval will save the QC reports as described above, while FASTQ reads will be saved to ``output.r1.fastq.gz``, ``output.r2.fastq.gz``, ``output.merged.fastq.gz``, and ``output.singleton.fastq.gz``. The ``singleton`` file contains reads from pairs in which one of the reads was discarded.

If you prefer, you can manually specify just the output files you are interested in, via ``--out-file1``, ``--out-file2``, and other options starting with ``--out``. See the :doc:`manpage` for a complete list of input and output options.

By default, AdapterRemoval will attempt to infer the adapter sequences present in the data automatically. See :ref:`specifying_adapters` below for more information.

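
The paired-end invocation described above is easy to script when several samples share one naming scheme. The following is a minimal dry-run sketch that only prints the commands instead of running them; the sample names and the ``<sample>_1.fastq.gz`` / ``<sample>_2.fastq.gz`` file layout are assumptions made for illustration, and only options already introduced in this section are used:

.. code-block:: sh

    #!/bin/sh
    # Print (but do not run) a paired-end adapterremoval3 command for one sample.
    # The <sample>_1.fastq.gz / <sample>_2.fastq.gz naming scheme is hypothetical.
    print_command() {
        sample="$1"
        printf 'adapterremoval3 --in-file1 %s_1.fastq.gz --in-file2 %s_2.fastq.gz --out-prefix %s --merge\n' \
            "$sample" "$sample" "$sample"
    }

    # Hypothetical sample names; replace with your own.
    for sample in sample_a sample_b; do
        print_command "$sample"
    done

Once the printed commands look right, they can be executed by piping the output of the script to ``sh``.
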

****************
 Output formats
****************

AdapterRemoval supports writing processed reads as compressed or uncompressed FASTQ files (``.fastq``, ``.fq``, ``.fastq.gz``, and ``.fq.gz``), SAM files (``.sam`` and ``.sam.gz``), and BAM files (``.bam``).

Formats are specified either using the ``--out-format`` option in conjunction with the ``--out-prefix`` option, or by specifying one of the above extensions when using one of the explicit ``--out`` options. If AdapterRemoval does not recognize the extension, then the format specified via ``--out-format`` is used, defaulting to gzip-compressed FASTQ.

By default, the uncompressed version of the format specified via ``--out-format`` is used for records written to STDOUT (see below), under the assumption that the output is to be processed by another program. This may be overridden using the ``--stdout-format`` option.

See the :doc:`input_and_output` page for more information about files generated by AdapterRemoval.

**************************************************
 Standard input (STDIN), standard output (STDOUT)
**************************************************

Reading from STDIN and writing to STDOUT can be accomplished *either* by using the special ``/dev/stdin`` and ``/dev/stdout`` files as input or output, or by using the filename ``-`` for either:

.. code-block:: console

    some-command | adapterremoval3 --in-file1 - --out-file1 - | some-other-command

Output written to STDOUT can be freely interleaved, as described in the *Interleaved input and output* section below.

************************
 Disabling output files
************************

Output that has not been enabled by setting the ``--out-prefix`` option (which enables everything but discarded reads) or by setting the ``--out`` option for that specific output type (e.g. ``--out-singleton``) is still processed, but is then discarded before the compression step.
This allows statistics to be collected for all results, even if they are not saved.

AdapterRemoval additionally recognizes the special path ``/dev/null``, and will discard the output of any type whose ``--out`` option is set to this path. This means that you can use the ``--out-prefix`` option to specify a default output path, and then disable the read types that you do not care about. That is, instead of

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq \
        --in-file2 reads_2.fastq \
        --out-html output.html \
        --out-json output.json \
        --out-file1 output.r1.fastq.gz \
        --out-file2 output.r2.fastq.gz \
        --out-merged output.merged.fastq.gz

you could write

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq \
        --in-file2 reads_2.fastq \
        --out-prefix output \
        --out-singleton /dev/null

****************************
 Multiple input FASTQ files
****************************

More than one input file may be specified after the ``--in-file1`` and ``--in-file2`` options. Files are processed in the specified order, as if they had been concatenated using ``cat`` or ``zcat``:

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1a.fastq reads_1b.fastq reads_1c.fastq

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1a.fastq reads_1b.fastq reads_1c.fastq \
        --in-file2 reads_2a.fastq reads_2b.fastq reads_2c.fastq

The number of files provided for ``--in-file1`` and ``--in-file2`` need not match, as long as the total number of reads and the read order are the same.

******************************
 Interleaved input and output
******************************

AdapterRemoval is able to read and write paired-end reads stored in a single, so-called interleaved FASTQ file (one pair at a time, first mate 1, then mate 2). This is accomplished by specifying the location of the file using ``--in-file1`` and *also* setting the ``--interleaved`` command-line option:

.. code-block:: console

    adapterremoval3 \
        --interleaved \
        --in-file1 interleaved.fastq \
        --out-prefix output_interleaved

Other than taking just a single input file, this mode operates almost exactly like paired-end trimming (as described above); the mode differs only in that paired reads are not written to an 'r1' and an 'r2' file, but are instead written to a single file. The location of this file is controlled using the ``--out-file1`` option.

To enable only reading or only writing of interleaved FASTQ files, use the ``--interleaved-input`` or the ``--interleaved-output`` option, respectively; the ``--interleaved`` option enables both.

Alternatively, you can specify the same output file for multiple output types, in order to write all of those reads to a single file in interleaved mode:

.. code-block:: console

    adapterremoval3 \
        --in-file1 input_1.fastq.gz \
        --in-file2 input_2.fastq.gz \
        --out-file1 output_interleaved.fastq.gz \
        --out-file2 output_interleaved.fastq.gz

The ability to interleave output extends to all output types, except for the two reports (``--out-json`` and ``--out-html``). One could, for example, write both discarded and singleton reads to the same file (``output_interleaved.discarded.fastq.gz``) using the following command:

.. code-block:: console

    adapterremoval3 \
        --in-file1 input_1.fastq.gz \
        --in-file2 input_2.fastq.gz \
        --out-prefix output_interleaved \
        --out-discarded output_interleaved.discarded.fastq.gz \
        --out-singleton output_interleaved.discarded.fastq.gz

***********************************
 Different quality score encodings
***********************************

By default, AdapterRemoval expects the quality scores in FASTQ reads to be Phred+33 encoded, meaning that the error probabilities are encoded as ``round('!' - 10 * log10(p))``. Most data will be encoded using Phred+33, but Phred+64 and 'Solexa' encoded quality scores are also supported.
These are selected using the ``--quality-format`` command-line option (specifying either '33', '64', or 'solexa'):

.. code-block:: console

    adapterremoval3 \
        --quality-format 64 \
        --in-file1 reads_q64.fastq \
        --out-prefix output_phred_64

Output is always saved as Phred+33. See `this Wikipedia article`_ for a detailed overview of Phred encoding schemes currently and previously in use.

****************
 Demultiplexing
****************

AdapterRemoval supports simultaneous demultiplexing and adapter trimming; demultiplexing is carried out using a simple comparison between the specified barcode (a sequence of A, C, G, and T) and the first N bases of the mate 1 read, where N is the length of the barcode. Demultiplexing of double-indexed sequences is also supported, in which case two barcodes must be specified for each sample. The first barcode is then compared to the first ``N_1`` bases of the mate 1 read, and the second barcode is compared to the first ``N_2`` bases of the mate 2 read. By default, this comparison requires a perfect match. Reads identified as containing a specific barcode or pair of barcodes are then trimmed using adapter sequences including the barcode(s) as necessary. Reads for which no barcode (or pair of barcodes) matched are written to a separate file or a pair of files (for paired-end reads).

Demultiplexing is enabled by creating a table of barcodes. The first column specifies the sample name (using characters a-z, A-Z, 0-9, or _), and the second and (optional) third columns specify the barcode sequences expected at the 5' termini of mate 1 and mate 2 reads, respectively.

For example, a table of barcodes from a double-indexed run might be as follows (see examples/barcodes.txt):

.. code-block:: console

    $ cat barcodes.txt
    sample_1 ATGCGGA TGAATCT
    sample_2 ATGGATT ATAGTGA
    sample_7 CAAAACT TCGCTGC

AdapterRemoval is invoked with the ``--barcode-table`` option, specifying the path to this table:

.. code-block:: console

    adapterremoval3 \
        --in-file1 demux_1.fastq \
        --in-file2 demux_2.fastq \
        --out-prefix output_demux \
        --barcode-table barcodes.txt

This generates a set of output files for each sample specified in the barcode table, using ``output_demux`` as the prefix for output filenames, followed by a dot and the sample name, followed by a dot and the default name for a given file type. The reports generated by AdapterRemoval contain information about the number of reads identified for each sample and (in the JSON file) detailed per-sample quality metrics.

The maximum number of mismatches allowed when comparing barcodes is controlled using the options ``--barcode-mm``, ``--barcode-mm-r1``, and ``--barcode-mm-r2``, which specify the maximum total number of mismatches, and the maximum number of mismatches for the mate 1 and mate 2 barcodes, respectively. Thus, if mm_1(i) and mm_2(i) represent the number of mismatches observed for barcode-pair i for a given pair of reads, these options require that

1. mm_1(i) <= ``--barcode-mm-r1``
2. mm_2(i) <= ``--barcode-mm-r2``
3. mm_1(i) + mm_2(i) <= ``--barcode-mm``

If the ``--demultiplex-only`` option is used, then no trimming/processing is performed after the demultiplexing step:

.. code-block:: console

    adapterremoval3 \
        --in-file1 demux_1.fastq \
        --in-file2 demux_2.fastq \
        --out-prefix output_only_demux \
        --barcode-table barcodes.txt \
        --demultiplex-only

.. warning::

    Output produced in ``--demultiplex-only`` mode may still contain adapter sequences, and for paired reads/double indexed data these adapters will be prefixed by the barcode sequence(s). Downstream trimming will therefore have to account for these extra sequences.

.. _s_identifying_adapters:

***************************************************
 Quality reports and identifying adapter sequences
***************************************************

AdapterRemoval generates a detailed report of input and output data, as part of its operation.
This report can additionally be generated without performing read processing, in which case statistics are only provided for the "raw" input data.

By default, quality reports are written to ``prefix.html`` and ``prefix.json``, if the ``--out-prefix`` option is used. This can also be controlled by the ``--out-json`` and ``--out-html`` options. This means that we can simply omit other output files to generate only the reports:

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq \
        --in-file2 reads_2.fastq \
        --out-json my_report.json \
        --out-html my_report.html

AdapterRemoval can also generate a report without performing any processing of the input, using the ``--report-only`` option, which greatly increases throughput. When run without read processing enabled, AdapterRemoval attempts to infer a consensus adapter sequence, based on fragments identified as belonging to the adapters through pairwise alignments of the reads (PE mode only).

Since only the reports are generated in this mode, we can use the ``--out-prefix`` option to simplify the command:

.. code-block:: console

    adapterremoval3 \
        --report-only \
        --in-file1 reads_1.fastq \
        --in-file2 reads_2.fastq \
        --out-prefix my_report

The inferred consensus sequences are compared to the sequences specified using the ``--adapter1`` and ``--adapter2`` command-line options, if any, or to the best matching adapter otherwise (see below). Pipes (``|``) indicate matches between the provided sequences and the consensus sequence, and ``*`` indicates the presence of unspecified bases (Ns).

The best practice is to compare the consensus with published `Illumina <illumina_truseq_adapters_>`_ or `BGI/MGI`_ adapter sequences and pick out the best matches. The built-in list of adapters can be viewed by using the ``--adapter-database`` option (see below). However, on occasion there may be consistent differences between the published sequences and the observed adapter sequences, in which case you should prefer the observed sequence.

.. _specifying_adapters:

*******************************
 A note on specifying adapters
*******************************

By default, AdapterRemoval will attempt to identify the type of adapter sequences present in the input data, based on a database of adapter sequences included with AdapterRemoval. The selected adapter sequences (if any) will be listed in the resulting QC reports.

If AdapterRemoval cannot identify any potential adapter sequences in the input, then AdapterRemoval will either assume that the data contains no adapters (in single-end mode), or perform adapter trimming based on the pair-wise alignment of the input reads (paired-end mode). This behavior is controlled via the ``--adapter-selection`` and ``--adapter-fallback`` options.

You can use the ``--adapter-database`` option to list the known adapter sequences, in either ``tsv`` or ``json`` format:

.. code-block:: console

    adapterremoval3 --adapter-database tsv

This database can be extended by combining ``--adapter-selection auto`` with the options ``--adapter1`` and ``--adapter2``, or with ``--adapter-table``.

Manually specifying adapters
============================

Adapter sequences may also be set explicitly via the ``--adapter1`` and ``--adapter2`` options, should you be aware of the exact sequences.

Adapter sequences are specified in the read orientation when using the ``--adapter1`` and ``--adapter2`` command-line options, directly corresponding to the sequence that is observed in the FASTQ files produced by the base calling software.

In other words, if we were processing data generated using `Illumina TruSeq adapters <illumina_truseq_adapters_>`_, then the TruSeq read 1 adapter should be found in files passed to ``--in-file1``:

.. code-block:: console

    $ grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" file1.fastq
    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAAGAAT
    CTGGAGTTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
    GGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGAATCTCGTATGCCGTCTTCTGCTTGCAAATTGAAAACAC

And the TruSeq read 2 adapter should be found in files passed to ``--in-file2``:

.. code-block:: console

    $ grep "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" file2.fastq
    CAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAGAAAAACATCTTG
    GAACTCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAAAAATAGA
    GAACTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTCAAAAACATAAGACCTA

How much of these adapter sequences can be found in your input (if anything) depends on the read length and the size of the DNA fragments sequenced. AdapterRemoval is designed to detect even short adapter fragments.

To manually set these adapters, use the command-line options ``--adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT``.

.. tip::

    An ``N`` in an adapter sequence is treated as a wildcard. An ``N`` will align against any other base, including another ``N``, but does not affect the score of the resulting alignment and is not counted for the purpose of filters such as ``--min-overlap``.

.. tip::

    It is generally not worthwhile to specify more than the first ~30 bp of the adapter sequences to be trimmed. Doing so does not notably improve sensitivity or specificity, but does result in a lower throughput.

Trimming paired-end data with multiple adapter pairs
====================================================

It is possible to provide multiple, different sets of adapters for trimming, in which case AdapterRemoval will select the single best match for each read (pair), and trim that adapter or adapter pair from the read or read pair.
Adapters must be written in a one- or two-column table, for SE and PE trimming, respectively. Columns can be separated with any whitespace.

For example, to specify both the recommended Illumina TruSeq and the recommended BGISeq adapters, one might save the following text in the file ``adapters.txt``:

.. code-block:: text

    AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG

This file is then specified using the ``--adapter-table`` option:

.. code-block:: console

    adapterremoval3 \
        --in-file1 reads_1.fastq \
        --in-file2 reads_2.fastq \
        --out-prefix output_multi \
        --adapter-table adapters.txt

Pairs of adapters are used exactly as written, and the resulting QC reports list how frequently each adapter or each pair of adapters was used.

Note that throughput decreases proportionally to the number of adapters, and it is therefore *not* recommended to use this functionality unless strictly necessary. When adapters differ only after the first N bases, for example due to an embedded barcode, then it is typically better to specify the common part of the adapter sequences with ``--adapter1`` and (optionally) ``--adapter2``, instead of specifying multiple, different adapter pairs in a table.

.. _bgi/mgi: https://en.mgitech.cn/Download/download_file/id/71

.. _illumina_truseq_adapters: https://emea.support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html

.. _json: https://www.json.org/

.. _this wikipedia article: https://en.wikipedia.org/wiki/FASTQ_format#Encoding