Configuring a benchmark¶
Benchmark reads its configuration from configuration files in yaml format. The following configuration files are read:
- A default configuration file within the package, which sets default values for a variety of options.
- A global configuration file in the user's home directory called ~/.genomics_benchmark.yml. This is a good place to put site-wide options, for example to control which cluster queue to use.
- A specific configuration file in the current directory called benchmark.yml, which describes the workflow.
A configuration setting in a later file overrides an earlier one.
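The layering can be pictured as a key-by-key merge. The sketch below is illustrative only: merge_configs is a hypothetical helper, not part of the Benchmark package.

```python
# Illustrative sketch of layered configuration: later files override
# earlier ones, key by key at the top level.

def merge_configs(*layers):
    """Merge configuration dictionaries; later layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

package_defaults = {"queue": "main.q", "num_jobs": 100}  # package file
site_config = {"queue": "site.q"}       # ~/.genomics_benchmark.yml
local_config = {"num_jobs": 10}         # ./benchmark.yml

config = merge_configs(package_defaults, site_config, local_config)
print(config)  # {'queue': 'site.q', 'num_jobs': 10}
```

The queue comes from the site-wide file, while the local benchmark.yml wins for num_jobs.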
The benchmark configuration file¶
A benchmark configuration has the following mandatory sections:
title: >-
A title describing the benchmark set-up
description: >-
A verbose, high-level description of the benchmark - its objective
and rationale, the data sets, methods and metrics.
tags:
- A tag describing the benchmark
- Another tag describing the benchmark
- ...
setup:
# list of tools
tools:
- tool1
- tool2
# list of metrics
metrics:
- metric1
- metric2
# input data sets
input:
reference: ref.fa
bam:
- A.bam
- B.bam
# set options for tool1
tool1:
options: --regions=chr1
The labels in the input section need to correspond to the slots defined by the expected attribute of the ToolRunner classes listed in the setup section. For example, variant callers are derived from a class VariantCaller that defines the common data types for all callers:
class VariantCaller(ToolRunner):
expected = ["reference", "bam"]
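The correspondence between input labels and runner slots can be checked mechanically. In this sketch, ToolRunner is a stand-in stub; the real base class lives in the task library.

```python
class ToolRunner:
    # stand-in stub for the library base class (illustrative only)
    expected = []

class VariantCaller(ToolRunner):
    # the slots every variant caller receives from the input section
    expected = ["reference", "bam"]

# the keys of the benchmark file's input section must match the slots
input_section = {"reference": "ref.fa", "bam": ["A.bam", "B.bam"]}
print(set(input_section) == set(VariantCaller.expected))  # True
```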
Based on the benchmark configuration file, a workflow is created that will instantiate all possible combinations of tools, metrics and data sets and execute them. The rules for building combinations are as follows:
- Each tool is combined with each input data set.
- If an input data slot contains a list of values, each tool will be run on each data set.
- Each combination/tool will be run against each metric.
In the example above, the following will be executed:
tool1 X metric1 X ref.fa + A.bam
tool1 X metric1 X ref.fa + B.bam
tool1 X metric2 X ref.fa + A.bam
tool1 X metric2 X ref.fa + B.bam
tool2 X metric1 X ref.fa + A.bam
tool2 X metric1 X ref.fa + B.bam
tool2 X metric2 X ref.fa + A.bam
tool2 X metric2 X ref.fa + B.bam
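The expansion above is a plain Cartesian product. A minimal sketch, using the names from the example (the expansion code itself is illustrative, not the workflow engine):

```python
import itertools

# tools, metrics and the list-valued bam slot from the example above
tools = ["tool1", "tool2"]
metrics = ["metric1", "metric2"]
bams = ["A.bam", "B.bam"]

# each tool x each metric x each value of the list-valued input slot
runs = [(tool, metric, "ref.fa", bam)
        for tool, metric, bam in itertools.product(tools, metrics, bams)]

for tool, metric, ref, bam in runs:
    print(tool, "X", metric, "X", ref, "+", bam)
print(len(runs))  # 8
```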
It is possible to group input sets. Variant callers typically accept several .bam files for joint calling. To implement this, group bam files in an additional level:
bam:
  - pedigree1:
      - A.bam
      - B.bam
  - pedigree2:
      - C.bam
      - D.bam
This will result in the following combinations:
tool1 X metric1 X ref.fa + (A.bam + B.bam)
tool1 X metric1 X ref.fa + (C.bam + D.bam)
tool1 X metric2 X ref.fa + (A.bam + B.bam)
tool1 X metric2 X ref.fa + (C.bam + D.bam)
tool2 X metric1 X ref.fa + (A.bam + B.bam)
tool2 X metric1 X ref.fa + (C.bam + D.bam)
tool2 X metric2 X ref.fa + (A.bam + B.bam)
tool2 X metric2 X ref.fa + (C.bam + D.bam)
For this mechanism to work, the tool needs to be aware that it might receive a single file or multiple files. The method resolve_argument() helps here. In the example below, the tool expects a comma-separated list of input files:
def run(self, outfile, params):
bam = resolve_argument(params.bam, sep=",")
retval = P.run("{params.path} "
"--inputs {bam} "
"> {outfile} ")
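resolve_argument comes from the task library; a minimal sketch of the behaviour relied on above might look as follows (assumed semantics, not the library implementation):

```python
def resolve_argument(value, sep=","):
    # Sketch of the assumed behaviour: join a grouped (list-valued)
    # input into one separator-delimited string; pass a single
    # filename through unchanged.
    if isinstance(value, (list, tuple)):
        return sep.join(value)
    return value

print(resolve_argument(["A.bam", "B.bam"]))  # A.bam,B.bam
print(resolve_argument("C.bam"))             # C.bam
```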
Tool/metric configuration¶
Tools and metrics can receive optional (or required) configuration arguments. These are grouped into sections within the configuration file named after the tool or metric:
tool1:
options: --region=chr1
metric1:
reference: ref.fa
This will provide the option --region=chr1 when running tool1 and the parameter reference to metric1. Note that the tool and metric runners need to be aware of these options. See more about writing tools and metrics in Task Library.
Multiple versions can be specified to provide an additional level of combinations. For example:
tool1:
options:
- --region=chr1
- --region=chr2
metric1:
reference:
- ref.fa
- other_ref.fa
will run tool1 with the options --region=chr1 and --region=chr2 and metric1 with two different reference data sets. Shared options can be specified using the prefix special command:
tool1:
options:
- prefix=--verbose
- --region=chr1
- --region=chr2
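Assuming the prefix entry is prepended to every remaining option, the configuration above would expand as in this sketch (expand_options is a hypothetical helper, not the library implementation):

```python
def expand_options(options):
    # Hypothetical expansion of a "prefix=" entry: the shared flag is
    # prepended to each of the remaining option values (assumed
    # semantics).
    prefix = ""
    expanded = []
    for entry in options:
        if entry.startswith("prefix="):
            prefix = entry[len("prefix="):]
        else:
            expanded.append((prefix + " " + entry).strip())
    return expanded

print(expand_options(["prefix=--verbose", "--region=chr1", "--region=chr2"]))
# ['--verbose --region=chr1', '--verbose --region=chr2']
```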
By default, tools and metrics are expected to be found on the user's PATH. To run a particular version of a tool, use the path configuration value:
weCall:
path: /path/to/weCall/bin/weCall
Note that this setting can also be multiplexed. To run several versions of a tool in one benchmark, provide a list of paths:
weCall:
path:
- /path/to/weCall-old/bin/weCall
- /path/to/weCall-new/bin/weCall
Note that this assumes that the executables are entirely self-contained and automatically pick up references relative to their location.
Automatic file expansion¶
To help with the combinatorics, the benchmark file is aware of glob and find expressions. For example:
input:
file: find /data/library -name "*.bam"
This will execute the unix find command and add all files found to the benchmark.
Filenames containing a * are interpreted as glob expressions:
input:
file: /data/library/1000Genomes/LowCovChr20BAMs/CEU_chr20/NA127*.bam
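The two expansion rules can be sketched as follows. expand_value is a hypothetical helper mirroring the behaviour described above, not the actual implementation:

```python
import glob
import shlex
import subprocess

def expand_value(value):
    # Hypothetical helper mirroring the expansion rules: a value
    # starting with "find " is executed as a command, a value
    # containing "*" is treated as a glob expression, and anything
    # else is passed through as a single file.
    if value.startswith("find "):
        result = subprocess.run(shlex.split(value), capture_output=True,
                                text=True, check=True)
        return result.stdout.split()
    if "*" in value:
        return sorted(glob.glob(value))
    return [value]
```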
Collation¶
Occasionally, tools need to be run individually, but metrics are computed on an aggregation of the tool output. For example, you might want to call variants across a population, but then compute allele frequencies on the aggregate VCF. For such a workflow, define a collate task:
setup:
tools:
- weCall
collate:
- mergegvcf_agg
metrics:
- bcftools_stats
input:
reference_fasta: /data/library/reference/hs37d5/hs37d5.fa
bam: sample*.bam
regex: (\S+).bam
mergegvcf_agg:
regex_in: (\S+).dir/result.vcf.gz
pattern_out: result.vcf.gz
runner: illumina_agg
illumina_agg:
reference_fasta: /data/library/reference/hs37d5/hs37d5.fa
The workflow above will run weCall on all bam files matching the glob expression. The output will then be submitted to a collate task called mergegvcf_agg. The task describes how input files should be grouped (regex_in and pattern_out) and which tool should be used for merging (runner). The tool (illumina_agg) is then configured in a separate section.
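The grouping step can be pictured as follows. collate_inputs is a hypothetical helper; the real grouping logic lives in the task library:

```python
import re

def collate_inputs(filenames, regex_in, pattern_out):
    # Hypothetical sketch: every tool output matching regex_in is
    # collected, and the runner merges the group into a single file
    # named by pattern_out.
    grouped = [f for f in filenames if re.match(regex_in, f)]
    return grouped, pattern_out

files = ["weCall_sample1.dir/result.vcf.gz",
         "weCall_sample2.dir/result.vcf.gz"]
grouped, merged = collate_inputs(files, r"(\S+).dir/result.vcf.gz",
                                 "result.vcf.gz")
print(len(grouped), "inputs ->", merged)  # 2 inputs -> result.vcf.gz
```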
Splitting¶
The output of tools may be split in order to compute metrics on parts of the output separately. For example, the following will split the output by chromosome and then apply all metrics on both the original output and all the split files:
setup:
...
split:
- split_by_chrom
split_by_chrom:
runner: vcf_by_chromosome
Exporting¶
The benchmark system can export tool data for further use. To export, simply type:
benchmark run make export
This will move all output files into a directory called export.dir and place symbolic links in the pipeline directories to preserve workflow state. Files in the directory export.dir will be renamed to label them according to the experiment. For example, weCall_NA12878.dir/result.vcf.gz will become export.dir/weCall_NA12878.vcf.gz.
The export target is a convenience function to collect all the tool data computed in an experiment if the tool data is of further interest, for example for additional processing in other benchmarks.
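The renaming rule follows the pattern of the example above and can be expressed as a small sketch (export_name is a hypothetical helper, not the implementation):

```python
import re

def export_name(path):
    # Hypothetical sketch of the export renaming: flatten
    # "<label>.dir/result.<ext>" into "export.dir/<label>.<ext>".
    match = re.match(r"(.+)\.dir/result\.(.+)$", path)
    label, ext = match.groups()
    return "export.dir/{}.{}".format(label, ext)

print(export_name("weCall_NA12878.dir/result.vcf.gz"))
# export.dir/weCall_NA12878.vcf.gz
```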
Global configuration¶
Below are the configuration values for interfacing Benchmark with the system.
Cluster¶
Options to configure the behaviour of jobs running on the cluster are in the section cluster. The default values are:
cluster:
queue: main.q
priority: -1
num_jobs: 100
memory_resource: h_vmem
memory_default: 4G
options: ""
parallel_environment: smp
Note that some cluster options can be overridden at the command line. For example, --cluster-queue=slow.q will send jobs to slow.q. The option --local will run jobs without the queuing system.
Database¶
Database access is implemented through setting a database URL. The default is:
database:
url: postgres://localhost:5432/Benchmark
With postgresql it is possible to use schemas to organise metric tables. To use a schema, add:
database:
url: postgresql://andreas@trafalgar.camdc.genomicsplc.com/benchmark
schema: cnv