Tag reference¶
The Tag reference lists all the sections and directives that may appear in a benchmark configuration file.
title¶
String (required)
The title section contains the title. For example:
title: >-
    Evaluate variant caller performance
description¶
String (required)
The description section contains a description of the experiment. For example:
description: >-
    This experiment calls variants on the Platinum data set using a
    variety of variant callers. Performance is evaluated through
    comparison to the GiaB Standard of Truth.
tags¶
List of strings (required)
The tags section contains a list of strings that can be used to categorize the experiment. For example:
tags:
- Variant calling
- Genome In a Bottle
- Platinum data
setup¶
Collection of directives (required)
- tools
(required) A list of tools that should be executed on the input data, for example:
tools:
- weCall
- platypus
Each tool can be parameterized in a separate tool-task section.
- metrics
(required) A list of metrics that should be executed on the output of the tools, for example:
metrics:
- bcftools_stats
- rtg_vcfeval
Each metric can be parameterized in a separate metric-task section.
- collate
- (optional) A list of collation tasks that should be executed on the tool output. See Collation for an example and collate-task.
- split
- (optional) A list of split tasks that should be executed on the tool output. See Splitting for an example.
- ignore
- (optional) A list of input/tool/metric names that should be ignored.
- only_collate
(optional)
Flag. Do not compute any metrics on the output of tool tasks, but only on the output after collation.
- collate_metrics
(optional)
A list of metrics that will only be applied to the output of collation tasks.
- split_metrics
(optional)
A list of metrics that will only be applied to the output of split tasks.
- external
(optional)
Add external data to the daisy. Metrics are computed on the external data alongside the output of the tools run by the benchmark system. This section must contain an input section and an output directive:
setup:
    tools:
    - bwa_mem
    metrics:
    - samtools_stats
    external:
        input:
            file: "/other_data/2016_sample*.bam"
            regex: 2016_(sample.*).bam
        output: result.bam
        add_glob: .bai
The functionality of the external section is similar to run_tool_identity; in fact, it uses the same implementation and accepts the same options. The difference is that the external directive permits computing metrics on data created by the pipeline alongside external data, while run_tool_identity assumes that only metrics will be applied.
- export
(optional)
A list of tasks whose output should be exported. By default, the output of the tools, collate and split tasks will be exported. To export only the output of the tools section, use:
setup:
    export:
    - tools
Additionally, there are directives for determining the name of output files when exporting tool output data. Currently defined is:
- prefix
- (optional) add a prefix to exported output files.
For more information, see Exporting.
- replication (optional)
If set and larger than 1, each tool will be run multiple times with identical input and parameterization.
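The required entries described so far can be checked mechanically. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary; the function name `missing_required` is hypothetical and not part of the benchmark system.

```python
# Entries marked "(required)" in this reference up to this point.
REQUIRED_SECTIONS = ("title", "description", "tags", "setup")
REQUIRED_SETUP_DIRECTIVES = ("tools", "metrics")


def missing_required(config):
    """Return the names of required sections/directives absent from `config`."""
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    setup = config.get("setup") or {}
    missing.extend(
        "setup." + name for name in REQUIRED_SETUP_DIRECTIVES if name not in setup
    )
    return missing


# Hypothetical, deliberately incomplete configuration.
config = {
    "title": "Evaluate variant caller performance",
    "tags": ["Variant calling"],
    "setup": {"tools": ["weCall", "platypus"]},
}
print(missing_required(config))
```

Here the check reports the missing description section and the missing metrics directive of the setup section.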
input¶
Collection of directives (required)
The input section contains the filenames for the input data. Files are labeled according to slots defined by a tool, for example:
input:
    reference_fasta: hg19.fa
    bam:
    - individual1.bam
    - individual2.bam
Filenames can contain keywords such as glob or find to
refer to a group of files. Each file encountered by a glob or
find expression will be added as an item to the list of files
to be processed and thus each file will instantiate a separate task.
Multiple glob statements can be separated by a comma (,):
input:
    reference_fasta: hg19.fa
    bam: individual*.bam, sample*.bam
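How a comma-separated list of glob patterns turns into one task per file can be sketched as follows. This is an illustration only: it matches patterns against an in-memory list of names with the standard `fnmatch` module rather than the file system, and the function name `expand_patterns` is hypothetical.

```python
from fnmatch import fnmatchcase


def expand_patterns(spec, available):
    """Expand a comma-separated list of glob patterns against `available`.

    Returns the matching names in sorted order, without duplicates.
    """
    patterns = [p.strip() for p in spec.split(",") if p.strip()]
    matched = {name for name in available
               if any(fnmatchcase(name, p) for p in patterns)}
    return sorted(matched)


# Hypothetical directory content.
files = ["individual1.bam", "individual2.bam", "sample_a.bam", "hg19.fa"]
for bam in expand_patterns("individual*.bam, sample*.bam", files):
    # Each expanded file would instantiate its own task.
    print({"reference_fasta": "hg19.fa", "bam": bam})
```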
In addition, the input section may contain additional directives.
- regex
(optional)
A string containing a regular expression to extract a name from a filename. The regular expression should contain at least one ()-group. For example:
input:
    reference_fasta: hg19.fa
    bam:
    - family1_individual1.bam
    - family2_individual2.bam
    regex: (\S+)_(\S+).bam
    alias: \2
will set the aliases individual1 and individual2, ignoring the family.
- alias
(optional)
- groupby
(optional)
Either option or label. This option determines how input files should be grouped. The default is option, so that files will be grouped across labels. For example:
input:
    reference_fasta: hg19.fa
    bam:
        pair1:
        - individual1.bam
        - individual2.bam
        pair2:
        - individual3.bam
        - individual4.bam
will result in the following pairs:
{"reference_fasta": "hg19.fa", "pair1": {"bam": ("individual1.bam", "individual2.bam")}}
{"reference_fasta": "hg19.fa", "pair2": {"bam": ("individual3.bam", "individual4.bam")}}
while
input:
    pair1:
        reference_fasta: hg19.fa
        bam:
        - individual1.bam
        - individual2.bam
    pair2:
        reference_fasta: hg38.fa
        bam:
        - individual3.bam
        - individual4.bam
    groupby: label
will result in:
{"pair1": {"reference_fasta": "hg19.fa", "bam": ("individual1.bam", "individual2.bam")}}
{"pair2": {"reference_fasta": "hg38.fa", "bam": ("individual3.bam", "individual4.bam")}}
- group_regex
(optional)
A regular expression used to group input files. For example, if you are interested in calling variants inside families and the files are named family-sample.bam, use:
input:
    bam: "*.bam"
    group_regex: (\S+)-(\S+).bam
    group_alias: \1
- group_alias
(optional)
String used to build an alias for a group. See group_regex.
- ignore
(optional)
Ignore a particular tool or metric. This directive accepts a list of patterns:
ignore:
- gatk_haplotype_caller_WES_NA12891_remapped_dedup
- replicate_alias
(optional)
When replicates are set, provide a format for a replicate id. The default is to output the replicate number after the tool name, for example: tool_1_input.dir. If set to replicate_alias: replicate=\1, the directory will be called tool_replicate=1_input.dir.
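The regex/alias and group_regex/group_alias pairs both follow the same idea: match a filename against a regular expression and build a name from the captured groups. The sketch below illustrates this with Python's `re` module. It is not the benchmark system's implementation; the function names and the family file names in the second call are hypothetical.

```python
import re
from collections import defaultdict


def derive_alias(filename, regex, alias):
    """Match `regex` against `filename` and fill in the `alias` template."""
    match = re.search(regex, filename)
    if match is None:
        raise ValueError("{!r} does not match {!r}".format(filename, regex))
    # Backreferences in the template are replaced by the captured groups.
    return match.expand(alias)


def group_files(filenames, group_regex, group_alias):
    """Group filenames by the alias built from `group_regex`/`group_alias`."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[derive_alias(filename, group_regex, group_alias)].append(filename)
    return dict(groups)


# regex/alias: name each file after the individual, ignoring the family.
print(derive_alias("family1_individual1.bam", r"(\S+)_(\S+).bam", r"\2"))

# group_regex/group_alias: collect the files of each family.
print(group_files(["smith-father.bam", "smith-child.bam", "jones-child.bam"],
                  r"(\S+)-(\S+).bam", r"\1"))
```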
tool-task¶
A tool task parameterizes a tool further. For example:
setup:
    tools:
    - weCall
weCall:
    options: =regions=1
will run weCall only on chromosome 1.
The benchmark system allows the user to specify multiple alternative
configurations of a tool. If given a list of configurations, all
of these will be run alongside each other. For example, the following
will run the tool weCall twice, once on chromosome 1 and once on
chromosome 2:
weCall:
    options:
    - =regions=1
    - =regions=2
The system requires unique names for each task. By default, these will be created by hashing the options. To define names explicitly and so facilitate further analysis, use the alias= directive. Instead of setting them explicitly, aliases can also be derived automatically from option names using the regex and alias directives.
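To illustrate why hashing yields unique but hard-to-read task names, here is a minimal sketch. The hash function, digest length and name layout are assumptions made for illustration; the scheme actually used by the benchmark system is not specified here.

```python
import hashlib


def task_name(tool, options):
    """Build a task name from the tool name and a short hash of its options."""
    # Assumption: SHA-256 truncated to 8 hex digits, joined with an underscore.
    digest = hashlib.sha256(options.encode("utf-8")).hexdigest()
    return "{}_{}".format(tool, digest[:8])


names = [task_name("weCall", opt) for opt in ("=regions=1", "=regions=2")]
print(names)
```

The names are stable for identical options and differ between configurations, but they say nothing about the options themselves, which is the motivation for explicit aliases.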
Option strings can be created programmatically with the generate= directive. The full list of directives is below:
- prefix=
(optional) Shared list of values for a particular option.
Default options that are common to all tasks can be specified with the prefix= directive:
weCall:
    options:
    - prefix==jobThreads=10
    - =regions=1
    - =regions=2
- alias=
(optional) Set an explicit alias for an option:
weCall:
    options:
    - prefix==jobThreads=10
    - alias=chr1; =regions=1
    - alias=chr2; =regions=2
- regex
(optional) Regular expression to derive a name using the options submitted to the task:
weCall:
    options:
    - prefix==jobThreads=10
    - =regions=1
    - =regions=2
    regex: =regions=(\S+)
    alias: chr\1
- alias
- (optional) String used to build a name from the parts extracted by a regular expression (see regex).
- generate=
(optional) Generator expression to create a list of options automatically:
weCall:
    options:
    - prefix==jobThreads=10
    - generate=["alias=chr{}; {}".format(x, x) for x in [1, 2]]
- ignore
(optional) Ignore a particular tool or metric. This directive accepts a list of patterns. Any task whose name contains any of the strings in the list will be ignored:
ignore:
- gatk_haplotype_caller_WES_NA12891_remapped_dedup
- task_specific
(optional) Apply task-specific options to a particular command. This directive accepts a collection of patterns and appropriate parameters. For example, to apply additional filters to metrics computed on freebayes output, use:
task_specific:
    freebayes.*:
        filter_exclude: "FORMAT/GT == '.' || DP < 5 || QUAL < 20 || N_ALT >= 2"
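The interplay of the prefix=, alias=, regex and alias directives can be sketched in Python. The parsing rules below (prefix entries are prepended to every other entry, and `alias=NAME; OPTIONS` separates the name from the options at the semicolon) are assumptions inferred from the examples above, and `expand_options` is a hypothetical helper, not the system's own code.

```python
import re


def expand_options(options, regex=None, alias=None):
    """Return a list of (name, option_string) pairs, one per task.

    Assumed semantics: `prefix=` entries are shared by all tasks,
    `alias=NAME; OPTIONS` names a task explicitly, and otherwise the name
    is derived with `regex`/`alias` when both are given (else it is None).
    """
    prefixes = [o[len("prefix="):] for o in options if o.startswith("prefix=")]
    tasks = []
    for entry in options:
        if entry.startswith("prefix="):
            continue
        name = None
        if entry.startswith("alias="):
            name, _, entry = entry[len("alias="):].partition(";")
            name, entry = name.strip(), entry.strip()
        elif regex and alias:
            match = re.search(regex, entry)
            if match:
                name = match.expand(alias)
        tasks.append((name, " ".join(prefixes + [entry])))
    return tasks


explicit = expand_options(
    ["prefix==jobThreads=10", "alias=chr1; =regions=1", "alias=chr2; =regions=2"])
derived = expand_options(
    ["prefix==jobThreads=10", "=regions=1", "=regions=2"],
    regex=r"=regions=(\S+)", alias=r"chr\1")
print(explicit)
print(derived)
```

Under these assumptions, explicit aliases and aliases derived via regex produce the same two tasks, named chr1 and chr2.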
collate-task¶
A collate task describes how output data should be grouped. See Collation for an example.
- regex_in
- (required) regular expression that determines how files should be grouped.
- pattern_out
(required) output pattern. If all files should be merged, this will simply be the filename used by the preceding tools, for example:
regex_in: (\S+).dir/result.vcf.gz
pattern_out: result.vcf.gz
- runner
- (required) name of the tool to be run for combining multiple files into one.
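One way to picture how regex_in and pattern_out define groups is that every file matching regex_in is assigned the output name built from pattern_out, and files sharing an output name are combined by the runner. The sketch below follows that reading. It is an assumption about the semantics rather than the system's implementation, and the directory names are hypothetical.

```python
import re
from collections import defaultdict


def collate_groups(paths, regex_in, pattern_out):
    """Map each output name to the input files that would be merged into it."""
    groups = defaultdict(list)
    for path in paths:
        match = re.search(regex_in, path)
        if match:
            # Backreferences in pattern_out, if any, are filled from regex_in.
            groups[match.expand(pattern_out)].append(path)
    return dict(groups)


# Hypothetical tool output directories.
paths = ["weCall_chr1.dir/result.vcf.gz", "weCall_chr2.dir/result.vcf.gz"]
print(collate_groups(paths, r"(\S+).dir/result.vcf.gz", "result.vcf.gz"))
```

Because pattern_out contains no backreference here, all matching files fall into a single group, which corresponds to merging everything into one result.vcf.gz.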
split-task¶
A split task describes how output data should be split before computing metrics. See Splitting for an example.
- runner
- (required) name of the tool to be run for splitting a file into multiple files.
database¶
Collection of directives (optional)
This section contains directives with database connection details.
- url (optional)
- Database URL. See here for a list of accepted formats. The system is currently tested with sqlite and postgres.
- schema (optional)
- Database schema to use for data tables. If not given or the database does not support schemas, the data tables will sit alongside the meta tables in the database.
cluster¶
Collection of directives (optional)
A collection of options to specify cluster parameters. Typically,
parameters are set either through hard-coded defaults or in a
user-specific configuration file. Experiment-specific options
can also be specified in the benchmark.yml file.
- queue
- (optional) The cluster queue.
- priority
- (optional) The job priority. This should be a negative number.
- num_jobs
- (optional) Number of jobs to submit in parallel to the queueing system.
- memory_resource
- (optional) Name of the memory resource.
- memory_default
- (optional) Default amount of memory to allocate.
- parallel_environment
- (optional) Name of the parallel environment to use for multi-threaded applications.
- options
- (optional) Generic options to use for job submissions.