.. _reference: ======================= Tag reference ======================= The :ref:`reference` lists all the :term:`sections` and :term:`directives` that may appear in a benchmark configuration file. .. _title: title ===== String (required) The :ref:`title` section contains the title. For example:: title : >- Evaluate variant caller performance .. _description: description =========== String (required) The :ref:`description` section contains a description of the experiment. For example:: description: >- This experiment calls variants on the Platinum data set using a variety of variant callers. Performance is evaluated through comparison to the GiaB Standard of Truth. .. _tags: tags ==== List of strings (required) The :ref:`tags` section contains a list of strings that can be used to categorize the experiment. For example:: tags: - Variant calling - Genome In a Bottle - Platinum data .. _setup: setup ===== Collection of directives (required) .. glossary:: tools (required) A list of tools that should be executed on the input data, for example:: tools: - weCall - platypus Each tool can be parameterized in a separate :ref:`tool_task` section. metrics (required) A list of metrics that should be executed on the output of the tools, for example:: metrics: - bcftools_stats - rtg_vcfeval Each tool can be parameterized in a separate :ref:`metric_task` section. collate (optional) A list of collation tasks that should be executed on the tool output. See :ref:`collation` for an example and :ref:`collate-task`. split (optional) A list of split tasks that should be executed on the tool output. See :ref:`splitting` for an example. ignore (optional) A list of input/tool/metric names that should be ignored. only_collate (optional) Flag. Do not compute any metrics on the output of tool tasks, but only on the output after collation. collate_metrics (optional) A list of metrics that will only be applied to the output of collation tasks. split_metrics (optional) A list of metrics that will only be applied to the output of split tasks. external (optional) Add external data to the daisy. Metrics are computed on the external data alongside the output by the tools run by the benchmark system. This section must contain an :ref:`input` section and an `output` directive:: setup: tools: - bwa_mem metrics: - samtools_stats external: input: file: "/other_data/2016_sample*.bam" regex: 2016_(sample.*).bam output: result.bam add_glob: .bai The functionaly of the external section is similar to the :ref:`run_tool_identity` and in fact uses the same implementation and accepts the same options. The difference is that the :term:`external` directive permits computing metrics on data created by the pipeline alongside external data, while the :ref:`run_tool_identity` assumes that only metrics will be applied. export (optional) A list of tasks which output should be exported. By default, the output of the :term:`tools`, :term:`collate` and the :term:`split` tasks will be exported. To only export the output of the :term:`tools` section, use:: setup: export: - tools Additionally, there are directives for determining the name of output files when exporting tool output data. Currently defined is: .. glossary:: prefix (optional) add a prefix to exported output files. For more information, see :ref:`exporting`. replication (optional) If set and larger than 1, each tool will be run multiple times with identical input and parameterization. .. _input: input ===== Collection of directives (required) The input section contains the filenames for the input data. Files are labeled according to slots defined by a tool, for example:: input: reference_fasta: hg19.fa bam: - individual1.bam - individual2.bam Filenames can contain keywords such as :term:`glob` or :term:`find` to refer to a group of files. Each file encountered by a :term:`glob` or :term:`find` expression will be added as an item to the list of files to be processed and thus each file will instantiate a separate task. Multiple glob statements can be separated by a comma (``,``):: input: reference_fasta: hg19.fa bam: individual*.bam, sample*.bam In addition, the :ref:`input` section may contain additional directives. .. glossary:: regex (optional) A string containg a regular expression to extract a name from a filename. The regular expression should contain at least one ``()``-group. For example:: input: reference_fasta: hg19.fa bam: - family1_individual1.bam - family2_individual2.bam regex: (\S+)_(\S+).bam alias: \2 will set the aliases ``indivial1`` and ``individual2`` ignoring the family. alias (optional) A pattern that can be used to build a name from the regular expression given by :term:`regex`. The default is to concatenate all groups in the regular expression separated by an underscore (``_``). See :term:`regex`. groupby (optional) Either ``option`` or ``label``. This option determines how input files should be grouped. The default is ``option``, so that files will be grouped across labels. For example:: input: reference_fasta: hg19.fa bam: pair1: - individual1.bam - individual2.bam pair2: - individual1.bam - individual2.bam will result in the following pairs:: {"reference_fasta": "hg19.fa", "pair1": {"bam": ("individual1.bam", "individual2.bam")}} {"reference_fasta": "hg19.fa", "pair2": {"bam": ("individual3.bam", "individual4.bam")}} while :: input: pair1: reference_fasta: hg19.fa bam: - individual1.bam - individual2.bam pair2: reference_fasta: hg38.fa bam: - individual3.bam - individual4.bam groupby: label will result in:: {"pair1": {"reference_fasta": "hg19.fa", "bam": ("individual1.bam", "individual2.bam")}} {"pair2": {"reference_fasta": "hg38.fa", "bam": ("individual3.bam", "individual4.bam")}} group_regex (optional) A regular expression used to group input files. For example, if you are interested in calling variants inside families and the files are named ``family-sample.bam``, use:: input: bam: *.bam group_regex: (\S+)-(\S+).bam group_alias: \1 group_alias (optional) String used to build an alias for a group. See :term:`group_regex`. ignore (optional) Ignore a particular tool or metric. This directives accepts a list of patterns:: ignore: - gatk_haplotype_caller_WES_NA12891_remapped_dedup replicate_alias (optional) When replicates are set, provide a format for a replicate id. The default is to output the replicate number after the tool name, for example: ``tool_1_input.dir``. If set to ``replicate_alias: replicate=\1``, the directory will be called ``tool_replicate=1_input.dir``. .. _tool_task: tool-task ========= A tool task paramaterizes a tool further. For example:: setup: tools: - weCall weCall: options: =regions=1 will run weCall only on chromosome 1. The benchmark system allows the user the specify multiple alternative configurations of a tool. Thus, if given a list of configurations, all of these will be run alongside each other. For example, the following will run the tool ``weCall`` twice, once on chromosome 1 and once on chromosome 2:: weCall: options: - =regions=1 - =regions=2 The system requires unique names for each task. By default, these will be created through hashing the options. To define names explicitely to facilitate further analysis, use the :term:`alias=` directive. Instead of setting them explicitely, aliases can be derived automatically from option names using the :term:`regex` and :term:`alias` directives. Option strings can be created programmatically with the :term:`generate=` directive. The full list of directives is below: .. glossary:: prefix= (optional) Shared list of values for a particular option. Default options that are common to all tasks can be specified with the :term:`prefix=` directive:: weCall: options: - prefix==jobThreads=10 - =regions=1 - =regions=2 alias= (optional) Set an explicit alias for an option:: weCall: options: - prefix==jobThreads=10 - alias=chr1; =regions=1 - alias=chr2; =regions=2 regex (optional) Regular expression to derive a name using the options submitted to the task:: weCall: options: - prefix==jobThreads=10 - =regions=1 - =regions=2 regex: =regions=(\S+) alias: chr\1 alias (optional) String used to build a name from the parts extracted by a regular expression (see :term:`regex`). generate= (optional) Generator expression to create a list of options automatically:: weCall: options: - prefix==jobThreads=10 - generate=["alias=chr{}; {}".format(x, x) for x in [1, 2]] ignore (optional) Ignore a particular tool or metric. This directives accepts a list of patterns. Any task matching that contains any of the strings in the list will be ignored:: ignore: - gatk_haplotype_caller_WES_NA12891_remapped_dedup task_specific: (optional) Apply task specific options to a particular command. This directive accepts a collection of patterns and appropriate parameters. For example, to apply additional filters to metrics compute on freebayes output, use:: task_specific: freebayes.*: filter_exclude: "FORMAT/GT == '.' || DP < 5 || QUAL < 20 || N_ALT >= 2" .. _metric_task: metric-task =========== A tool task paramaterizes a metric further. This section is identical to :ref:`tool_task`. .. _collate_task: collate-task ============ A collate task describes how output data should be grouped. See :ref:`collation` for an example. .. glossary:: regex_in (required) regular expression that determines how files should be grouped. pattern_out (required) output pattern. If all files should be merged, this will simply be the filename used by the preceeding tools, for example:: regex_in: (\S+).dir/result.vcf.gz pattern_out: result.vcf.gz runner (required) name of the tool to be run for combining multiple files into one. .. _split: split-task ========== A split task describes how output data should be split before computing metrics. See :ref:`splitting` for an example. .. glossary:: runner (required) name of the tool to be run for combining multiple files into one. .. _database: database ======== Collection of directives (optional) This section contains directives with database connection details. .. glossary:: url (optional) Database URL. See `here `_ for a list of accepted formats. The system is currently tested with sqlite and postgres. schema (optional) Database schema to use for data tables. If not given or the database does not support schemas, the data tables will sit alongside the meta tables in the database. .. _cluster: cluster ======= Collection of directives (optional) A collection of options to specify cluster parameters. Typically, parameters are set with either defaults hardcoded or in a user-specific configuration file. If there are experiment specific options, they can also be specified in the :file:`benchmark.yml` file. .. glossary:: queue (optional) The cluster queue. priority (optional) The job priority. This should be a negative number. num_jobs (optional) Number of jobs to submit in parallel to the queueing system. memory_resource (optional) Name of the memory resource memory_default (optional) Default amount of memory to allocate parallel_environment (optional) Name of the parallel environment to use for multi-threaded applications. options (optional) Generic options to use for job submissions.