Tag reference

The Tag reference lists all the sections and directives that may appear in a benchmark configuration file.

title

String (required)

The title section contains the title. For example:

title : >-
  Evaluate variant caller performance

description

String (required)

The description section contains a description of the experiment. For example:

description: >-
   This experiment calls variants on the Platinum data set using a
   variety of variant callers. Performance is evaluated through
   comparison to the GiaB Standard of Truth.

tags

List of strings (required)

The tags section contains a list of strings that can be used to categorize the experiment. For example:

tags:
  - Variant calling
  - Genome In a Bottle
  - Platinum data

setup

Collection of directives (required)

tools

(required) A list of tools that should be executed on the input data, for example:

tools:
   - weCall
   - platypus

Each tool can be parameterized in a separate tool-task section.

metrics

(required) A list of metrics that should be executed on the output of the tools, for example:

metrics:
   - bcftools_stats
   - rtg_vcfeval

Each tool can be parameterized in a separate metric-task section.

collate
(optional) A list of collation tasks that should be executed on the tool output. See Collation for an example and collate-task.
split
(optional) A list of split tasks that should be executed on the tool output. See Splitting for an example.
ignore
(optional) A list of input/tool/metric names that should be ignored.
only_collate

(optional)

Flag. Do not compute any metrics on the output of tool tasks, but only on the output after collation.

collate_metrics

(optional)

A list of metrics that will only be applied to the output of collation tasks.

split_metrics

(optional)

A list of metrics that will only be applied to the output of split tasks.

external

(optional)

Add external data to the daisy. Metrics are computed on the external data alongside the output by the tools run by the benchmark system. This section must contain an input section and an output directive:

 setup:
   tools:
     - bwa_mem
   metrics:
     - samtools_stats
   external:
     input:
       file: "/other_data/2016_sample*.bam"
       regex: 2016_(sample.*).bam
     output: result.bam
     add_glob: .bai

The functionaly of the external section is similar to the
:ref:`run_tool_identity` and in fact uses the same
implementation and accepts the same options.  The difference is
that the :term:`external` directive permits computing metrics
on data created by the pipeline alongside external data, while
the :ref:`run_tool_identity` assumes that only metrics will be
applied.
export

(optional)

A list of tasks which output should be exported. By default, the output of the tools, collate and the split tasks will be exported. To only export the output of the tools section, use:

setup:
  export:
    - tools

Additionally, there are directives for determining the name of output files when exporting tool output data. Currently defined is:

prefix
(optional) add a prefix to exported output files.

For more information, see Exporting.

lication (optional)

If set and larger than 1, each tool will be run multiple times with identical input and parameterization.

input

Collection of directives (required)

The input section contains the filenames for the input data. Files are labeled according to slots defined by a tool, for example:

input:
   reference_fasta: hg19.fa
   bam:
      - individual1.bam
      - individual2.bam

Filenames can contain keywords such as glob or find to refer to a group of files. Each file encountered by a glob or find expression will be added as an item to the list of files to be processed and thus each file will instantiate a separate task. Multiple glob statements can be separated by a comma (,):

input:
   reference_fasta: hg19.fa
   bam: individual*.bam, sample*.bam

In addition, the input section may contain additional directives.

regex

(optional)

A string containg a regular expression to extract a name from a filename. The regular expression should contain at least one ()-group. For example:

input:
   reference_fasta: hg19.fa
   bam:
     - family1_individual1.bam
     - family2_individual2.bam
   regex: (\S+)_(\S+).bam
   alias: \2

will set the aliases indivial1 and individual2 ignoring the family.

alias

(optional)

A pattern that can be used to build a name from the regular expression given by regex. The default is to concatenate all groups in the regular expression separated by an underscore (_). See regex.
groupby

(optional)

Either option or label. This option determines how input files should be grouped. The default is option, so that files will be grouped across labels. For example:

input:
  reference_fasta: hg19.fa
  bam:
    pair1:
      - individual1.bam
      - individual2.bam
    pair2:
      - individual1.bam
      - individual2.bam

will result in the following pairs:

{"reference_fasta": "hg19.fa", "pair1": {"bam": ("individual1.bam", "individual2.bam")}}
{"reference_fasta": "hg19.fa", "pair2": {"bam": ("individual3.bam", "individual4.bam")}}

while

input:
  pair1:
    reference_fasta: hg19.fa
    bam:
      - individual1.bam
      - individual2.bam
  pair2:
    reference_fasta: hg38.fa
    bam:
      - individual3.bam
      - individual4.bam
  groupby: label

will result in:

{"pair1": {"reference_fasta": "hg19.fa", "bam": ("individual1.bam", "individual2.bam")}}
{"pair2": {"reference_fasta": "hg38.fa", "bam": ("individual3.bam", "individual4.bam")}}
group_regex

(optional)

A regular expression used to group input files. For example, if you are interested in calling variants inside families and the files are named family-sample.bam, use:

input:
   bam: *.bam
   group_regex: (\S+)-(\S+).bam
   group_alias: \1
group_alias

(optional)

String used to build an alias for a group. See group_regex.

ignore

(optional)

Ignore a particular tool or metric. This directives accepts a list of patterns:

ignore:
  - gatk_haplotype_caller_WES_NA12891_remapped_dedup
plicate_alias

(optional)

When replicates are set, provide a format for a replicate id. The default is to output the replicate number after the tool name, for example: tool_1_input.dir. If set to replicate_alias: replicate=\1, the directory will be called tool_replicate=1_input.dir.

tool-task

A tool task paramaterizes a tool further. For example:

setup:
  tools:
    - weCall

weCall:
  options: =regions=1

will run weCall only on chromosome 1.

The benchmark system allows the user the specify multiple alternative configurations of a tool. Thus, if given a list of configurations, all of these will be run alongside each other. For example, the following will run the tool weCall twice, once on chromosome 1 and once on chromosome 2:

weCall:
  options:
     - =regions=1
     - =regions=2

The system requires unique names for each task. By default, these will be created through hashing the options. To define names explicitely to facilitate further analysis, use the alias= directive. Instead of setting them explicitely, aliases can be derived automatically from option names using the regex and alias directives.

Option strings can be created programmatically with the generate= directive. The full list of directives is below:

prefix=

(optional) Shared list of values for a particular option.

Default options that are common to all tasks can be specified with the prefix= directive:

weCall:
  options:
     - prefix==jobThreads=10
     - =regions=1
     - =regions=2
alias=

(optional) Set an explicit alias for an option:

weCall:
  options:
     - prefix==jobThreads=10
     - alias=chr1; =regions=1
     - alias=chr2; =regions=2
regex

(optional) Regular expression to derive a name using the options submitted to the task:

weCall:
  options:
     - prefix==jobThreads=10
     - =regions=1
     - =regions=2
  regex: =regions=(\S+)
  alias: chr\1
alias
(optional) String used to build a name from the parts extracted by a regular expression (see regex).
generate=

(optional) Generator expression to create a list of options automatically:

weCall:
  options:
    - prefix==jobThreads=10
    - generate=["alias=chr{}; {}".format(x, x) for x in [1, 2]]
ignore

(optional) Ignore a particular tool or metric. This directives accepts a list of patterns. Any task matching that contains any of the strings in the list will be ignored:

ignore:
  - gatk_haplotype_caller_WES_NA12891_remapped_dedup
task_specific:

(optional) Apply task specific options to a particular command. This directive accepts a collection of patterns and appropriate parameters. For example, to apply additional filters to metrics compute on freebayes output, use:

task_specific:
  freebayes.*:
     filter_exclude: "FORMAT/GT == '.' || DP < 5 || QUAL < 20 || N_ALT >= 2"

metric-task

A tool task paramaterizes a metric further. This section is identical to tool-task.

collate-task

A collate task describes how output data should be grouped. See Collation for an example.

regex_in
(required) regular expression that determines how files should be grouped.
pattern_out

(required) output pattern. If all files should be merged, this will simply be the filename used by the preceeding tools, for example:

regex_in: (S+).dir/result.vcf.gz pattern_out: result.vcf.gz

runner
(required) name of the tool to be run for combining multiple files into one.

split-task

A split task describes how output data should be split before computing metrics. See Splitting for an example.

runner
(required) name of the tool to be run for combining multiple files into one.

database

Collection of directives (optional)

This section contains directives with database connection details.

url (optional)
Database URL. See here for a list of accepted formats. The system is currently tested with sqlite and postgres.
schema (optional)
Database schema to use for data tables. If not given or the database does not support schemas, the data tables will sit alongside the meta tables in the database.

cluster

Collection of directives (optional)

A collection of options to specify cluster parameters. Typically, parameters are set with either defaults hardcoded or in a user-specific configuration file. If there are experiment specific options, they can also be specified in the benchmark.yml file.

queue
(optional) The cluster queue.
priority
(optional) The job priority. This should be a negative number.
num_jobs
(optional) Number of jobs to submit in parallel to the queueing system.
memory_resource
(optional) Name of the memory resource
memory_default
(optional) Default amount of memory to allocate
parallel_environment
(optional) Name of the parallel environment to use for multi-threaded applications.
options
(optional) Generic options to use for job submissions.