Configuring a benchmark

Benchmark reads its configuration from files in YAML format. The following configuration files are read, in this order:

  1. A default configuration file within the package which sets default values for a variety of options.
  2. A global configuration file in the user’s home directory called ~/.genomics_benchmark.yml. This is a good place to put site-wide options, for example which cluster queue to use.
  3. A specific configuration file in the current directory called benchmark.yml which describes the workflow.

A configuration setting in a later file overrides an earlier one.
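
The override order can be pictured as a layered merge. The sketch below is not Benchmark's actual loader: it assumes the three files have already been parsed into dictionaries, and it assumes that nested sections are merged key by key rather than replaced wholesale. The function name merge_config and the sample values are hypothetical.

```python
from functools import reduce


def merge_config(base, override):
    """Return a new dict in which values from `override` win over `base`.

    Nested dictionaries are merged recursively (an assumption of this sketch).
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of the three files, in the order they are read.
package_defaults = {"cluster": {"queue": "main.q", "memory_default": "4G"}}
home_config = {"cluster": {"queue": "slow.q"}}
local_config = {"title": "My benchmark"}

config = reduce(merge_config, [package_defaults, home_config, local_config])
print(config)
# → {'cluster': {'queue': 'slow.q', 'memory_default': '4G'}, 'title': 'My benchmark'}
```

Under these assumptions the queue set in the home directory replaces the packaged default, while the untouched memory_default survives.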

The benchmark configuration file

A benchmark configuration has the following mandatory sections:

title : >-
  A title describing the benchmark set-up

description: >-
  A verbose, high-level description of the benchmark - its objective
  and rationale, the data sets, methods and metrics.

tags:
  - A tag describing the benchmark
  - Another tag describing the benchmark
  - ...

setup:
  # list of tools
  tools:
    - tool1
    - tool2

  # list of metrics
  metrics:
    - metric1
    - metric2

# input data sets
input:

  reference: ref.fa

  bam:
    - A.bam
    - B.bam

# set options for tool1
tool1:
   options: --region=chr1

The labels in the input section need to correspond to the slots that each tool listed in the setup section declares in the expected attribute of its ToolRunner. For example, variant callers are derived from a class VariantCaller that defines the common data types for all callers:

class VariantCaller(ToolRunner):
    expected = ["reference", "bam"]
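
To illustrate the correspondence between input labels and expected slots, here is a minimal sketch. The ToolRunner class below is only a stand-in for the real base class, and the helper missing_slots is hypothetical; it is not part of Benchmark.

```python
class ToolRunner:
    # Stand-in for the real base class: slots a runner expects in `input`.
    expected = []


class VariantCaller(ToolRunner):
    expected = ["reference", "bam"]


def missing_slots(runner_class, input_section):
    """Return the expected slots that have no matching label in `input`."""
    return [slot for slot in runner_class.expected if slot not in input_section]


input_section = {"reference": "ref.fa", "bam": ["A.bam", "B.bam"]}
print(missing_slots(VariantCaller, input_section))       # → []
print(missing_slots(VariantCaller, {"bam": ["A.bam"]}))  # → ['reference']
```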

Based on the benchmark configuration file, a workflow is created that will instantiate all possible combinations of tools, metrics and data sets and execute them. The rules for building combinations are as follows:

  1. Each tool is combined with each input data set.
  2. If an input data slot contains a list of values, each tool will be run on each data set.
  3. Each combination of tool and data set will be run against each metric.

In the example above, the following will be executed:

tool1 X metric1 X ref.fa + A.bam
tool1 X metric1 X ref.fa + B.bam
tool1 X metric2 X ref.fa + A.bam
tool1 X metric2 X ref.fa + B.bam
tool2 X metric1 X ref.fa + A.bam
tool2 X metric1 X ref.fa + B.bam
tool2 X metric2 X ref.fa + A.bam
tool2 X metric2 X ref.fa + B.bam
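
The listing above is a Cartesian product of tools, metrics and data sets. The following sketch reproduces it with itertools.product; it only illustrates the combination rules and is not how Benchmark builds its workflow internally. The helper expand_inputs is a hypothetical name.

```python
from itertools import product

tools = ["tool1", "tool2"]
metrics = ["metric1", "metric2"]
inputs = {"reference": "ref.fa", "bam": ["A.bam", "B.bam"]}


def expand_inputs(inputs):
    """Yield one dict per data set, expanding list-valued slots."""
    labels = list(inputs)
    # Wrap scalar values in a one-element list so all slots are treated alike.
    choices = [v if isinstance(v, list) else [v] for v in inputs.values()]
    for combination in product(*choices):
        yield dict(zip(labels, combination))


runs = list(product(tools, metrics, expand_inputs(inputs)))
for tool, metric, data in runs:
    print(f"{tool} X {metric} X {data['reference']} + {data['bam']}")
```

This prints the same eight lines as the listing, in the same order.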

It is possible to group input sets. Variant callers typically accept several .bam files for joint calling. To implement this, group bam files in an additional level:

bam:
  - pedigree1
    - A.bam
    - B.bam
  - pedigree2
    - C.bam
    - D.bam

This will result in the following combinations:

tool1 X metric1 X ref.fa + (A.bam + B.bam)
tool1 X metric1 X ref.fa + (C.bam + D.bam)
tool1 X metric2 X ref.fa + (A.bam + B.bam)
tool1 X metric2 X ref.fa + (C.bam + D.bam)
tool2 X metric1 X ref.fa + (A.bam + B.bam)
tool2 X metric1 X ref.fa + (C.bam + D.bam)
tool2 X metric2 X ref.fa + (A.bam + B.bam)
tool2 X metric2 X ref.fa + (C.bam + D.bam)

For this mechanism to work, the tool needs to be aware that it might receive either a single file or multiple files. The method resolve_argument() helps here. In the example below, the tool expects a comma-separated list of input files:

def run(self, outfile, params):
    bam = resolve_argument(params.bam, sep=",")
    retval = P.run("{params.path} "
                   "--inputs {bam} "
                   "> {outfile} ")
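
The behaviour of resolve_argument() described here can be approximated as follows. This is a stand-in written from the description in the text, not the library's implementation; the name resolve_argument_sketch is hypothetical and the real function may handle more cases.

```python
def resolve_argument_sketch(argument, sep=","):
    """Join a list or tuple of values with `sep`; return a single value unchanged."""
    if isinstance(argument, (list, tuple)):
        return sep.join(argument)
    return argument


print(resolve_argument_sketch("A.bam"))             # → A.bam
print(resolve_argument_sketch(["A.bam", "B.bam"]))  # → A.bam,B.bam
```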

Tool/metric configuration

Tools and metrics can receive optional (or required) configuration arguments. These are grouped into sections of the configuration file, each named after the tool or metric it configures:

tool1:
   options: --region=chr1

metric1:
   reference: ref.fa

This will provide the option --region=chr1 when running tool1 and the parameter reference to metric1. Note that the tool and metric runners need to be aware of these options. See more about writing tools and metrics in Task Library.

Multiple versions can be specified to provide an additional level of combinations. For example:

tool1:
   options:
     - --region=chr1
     - --region=chr2

metric1:
   reference:
     - ref.fa
     - other_ref.fa

will run tool1 with the options --region=chr1 and --region=chr2 and metric1 with two different reference data sets. Options shared by all variants can be specified using the special prefix entry:

tool1:
   options:
     - prefix=--verbose
     - --region=chr1
     - --region=chr2
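
How the prefix entry combines with the other options is not spelled out here. One plausible reading, assumed in the sketch below, is that the prefix is prepended to every other entry in the list. The function expand_options is hypothetical and only illustrates that reading.

```python
def expand_options(options, marker="prefix="):
    """Prepend every `prefix=` entry to each remaining option string."""
    prefixes = [o[len(marker):] for o in options if o.startswith(marker)]
    variants = [o for o in options if not o.startswith(marker)]
    return [" ".join(prefixes + [variant]) for variant in variants]


print(expand_options(["prefix=--verbose", "--region=chr1", "--region=chr2"]))
# → ['--verbose --region=chr1', '--verbose --region=chr2']
```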

By default, tools and metrics are expected to be found on the user’s PATH. To run a particular version of a tool, use the path configuration value:

weCall:
   path: /path/to/weCall/bin/weCall

Note that this can also be multiplexed. To run several versions of a tool in a benchmark, list their paths:

weCall:
   path:
      - /path/to/weCall-old/bin/weCall
      - /path/to/weCall-new/bin/weCall

Note that this assumes that the executables are entirely self-contained and automatically pick up references relative to their location.

Automatic file expansion

To help with the combinatorics, the benchmark file is aware of glob and find expressions. For example:

input:
   file: find /data/library -name "*.bam"

This will execute the Unix find command and enter all files that have been found into the benchmark.

Filenames containing a * are interpreted as glob expressions:

input:
   file: /data/library/1000Genomes/LowCovChr20BAMs/CEU_chr20/NA127*.bam
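
The two expansion rules can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Benchmark's own code: the function expand_files is hypothetical, and the rules for recognising a find command (a value starting with "find ") and the sorting of the results are assumptions.

```python
import glob
import shlex
import subprocess


def expand_files(value):
    """Expand a `find` command or a glob expression into a sorted list of files."""
    if value.startswith("find "):
        # Run the command without a shell; shlex handles the quoted pattern.
        output = subprocess.run(
            shlex.split(value), capture_output=True, text=True, check=True
        ).stdout
        return sorted(output.splitlines())
    if "*" in value:
        return sorted(glob.glob(value))
    # Plain file names are passed through unchanged.
    return [value]


print(expand_files("ref.fa"))  # → ['ref.fa']
```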

Collation

Occasionally, tools need to be run individually, but metrics are computed on an aggregation of the tool output. For example, you might want to call variants across a population, but then compute allele frequencies on the aggregate VCF. For such a workflow, define a collate task:

setup:

  tools:
    - weCall
  collate:
    - mergegvcf_agg
  metrics:
    - bcftools_stats

input:
  reference_fasta: /data/library/reference/hs37d5/hs37d5.fa
  bam: sample*.bam
  regex: (\S+).bam

mergegvcf_agg:
  regex_in: (\S+).dir/result.vcf.gz
  pattern_out: result.vcf.gz
  runner: illumina_agg

illumina_agg:
  reference_fasta: /data/library/reference/hs37d5/hs37d5.fa

The workflow above will run weCall on all bam files matching the glob expression. The output will then be submitted to a collate task called mergegvcf_agg. The task describes how input files should be grouped (regex_in and pattern_out) and which tool should be used for merging (runner). The tool (illumina_agg) is then configured in a separate section.
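
The grouping step can be pictured with the sketch below. It assumes that regex_in selects the tool outputs taking part in the collation and that all inputs mapping to the same pattern_out name end up in one group; back-references such as \1 in pattern_out would then split the inputs into several groups. The function collate and the sample file names are hypothetical, and the real task may behave differently.

```python
import re
from collections import defaultdict


def collate(filenames, regex_in, pattern_out):
    """Group input files by the output name derived from `pattern_out`."""
    groups = defaultdict(list)
    pattern = re.compile(regex_in)
    for filename in filenames:
        match = pattern.search(filename)
        if match is None:
            continue  # files that do not match take no part in the collation
        # expand() substitutes back-references such as \1, if there are any.
        groups[match.expand(pattern_out)].append(filename)
    return dict(groups)


outputs = ["sample1.dir/result.vcf.gz", "sample2.dir/result.vcf.gz"]
print(collate(outputs, r"(\S+).dir/result.vcf.gz", "result.vcf.gz"))
# → {'result.vcf.gz': ['sample1.dir/result.vcf.gz', 'sample2.dir/result.vcf.gz']}
```

Because pattern_out contains no back-reference in this configuration, every tool output lands in a single group and is merged into one result.vcf.gz.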

Splitting

The output of tools may be split in order to compute metrics on parts of the output separately. For example, the following will split the output by chromosome and then apply all metrics on both the original output and all the split files:

setup:
  ...

  split:
    - split_by_chrom

split_by_chrom:
  runner: vcf_by_chromosome

Exporting

The benchmark system can export tool data for further use. To export, simply type:

benchmark run make export

This will move all output files into a directory called export.dir and place symbolic links into the pipeline directories to preserve workflow state.

Files in the directory export.dir will be renamed to label them according to the experiment. For example, weCall_NA12878.dir/result.vcf.gz will become export.dir/weCall_NA12878.vcf.gz.
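
The renaming rule in this example can be written down as a small function. This sketch is inferred from the single example above; the function export_name is hypothetical, and it assumes that outputs live in a directory ending in .dir and that everything after the first dot of the file name is kept as the suffix.

```python
import posixpath


def export_name(path, export_dir="export.dir"):
    """Map `<label>.dir/result.<suffix>` to `<export_dir>/<label>.<suffix>`."""
    task_dir, filename = posixpath.split(path)
    label = task_dir[: -len(".dir")] if task_dir.endswith(".dir") else task_dir
    # Keep everything after the first dot of the file name as the suffix.
    suffix = filename.split(".", 1)[1]
    return posixpath.join(export_dir, f"{label}.{suffix}")


print(export_name("weCall_NA12878.dir/result.vcf.gz"))
# → export.dir/weCall_NA12878.vcf.gz
```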

The export target is a convenience function to collect all the tool data computed in an experiment if the tool data is of further interest, for example for additional processing in other benchmarks.

Global configuration

Below are the configuration values for interfacing Benchmark with the system.

Cluster

Options to configure behaviour for running jobs on the cluster are in the section cluster. The default values are:

cluster:
   queue: main.q
   priority: -1
   num_jobs: 100
   memory_resource: h_vmem
   memory_default: 4G
   options: ""
   parallel_environment: smp

Note that some cluster options can be overridden at the command line. For example, --cluster-queue=slow.q will send jobs to slow.q. The option --local will run jobs without the queuing system.

Database

Database access is implemented through setting a database URL. The default is:

database:
   url: postgres://localhost:5432/Benchmark

With PostgreSQL it is possible to use schemas to organise metric tables. To use a schema, use:

database:
  url: postgresql://andreas@trafalgar.camdc.genomicsplc.com/benchmark
  schema: cnv