.. _configuration: ======================= Configuring a benchmark ======================= Benchmark reads its configuration from configuration files in :term:`yaml` format. The following configuration files are read: 1. A default configuration file within the package which sets default values for a variety of options. 2. A global configuration file in the user's home directory called :file:`~/.genomics_benchmark.yml`. This is a good place to put site-wide options for example to control which cluster queue to use. 3. A specific configuration file in the current directory called :file:`benchmark.yml` which describes the workflow. A configuration setting in a later file overrides an earlier one. The benchmark configuration file ================================ A benchmark configuration has the following mandatory sections: .. code-block:: yaml title : >- A title describing the benchmark set-up description: >- A verbose, high-level description of the benchmark - its objective and rationale, the data sets, methods and metrics. tags: - A tag describing the benchmark - Another tag describing the benchmark - ... setup: # list of tools tools: - tool1 - tool2 # list of metrics metrics: - metric1 - metric2 # input data sets input: reference: ref.fa bam: - A.bam - B.bam # set options for tool1 tool1: options: --regions=chr1 The labels in the ``input`` section need to correspond to the slots defined by the :attr:`expected` attribute in the :class:`.ToolRunner` that have been listed in the setup section. For example, variant callers are derived from a class :class:`.VariantCaller` that defines the common data types for all callers:: class VariantCaller(ToolRunner): expected = ["reference", "bam"] Based on the benchmark configuration file, a workflow is created that will instantiate all possible combinations of tools, metrics and data sets and execute them. The rules for building combinations are as follows: 1. Each tool is combined with each input data set. 2. If an input data slot contains a list of values, each tool will be run on each data set. 3. Each combination/tool will be run against each metric. In the example above, the following will be executed:: tool1 X metric1 X ref.fa + A.bam tool1 X metric1 X ref.fa + B.bam tool1 X metric2 X ref.fa + A.bam tool1 X metric2 X ref.fa + B.bam tool2 X metric1 X ref.fa + A.bam tool2 X metric1 X ref.fa + B.bam tool2 X metric2 X ref.fa + A.bam tool2 X metric2 X ref.fa + B.bam It is possible to group input sets. Variant callers typically accept several .bam files for joint calling. To implement this, group bam files in an additional level: .. code-block:: yaml bam: - pedigree1 - A.bam - B.bam - pedigree2 - C.bam - D.bam This will result in the following combinations:: tool1 X metric1 X ref.fa + (A.bam + B.bam) tool1 X metric1 X ref.fa + (C.bam + D.bam) tool1 X metric2 X ref.fa + (A.bam + B.bam) tool1 X metric2 X ref.fa + (C.bam + D.bam) tool2 X metric1 X ref.fa + (A.bam + B.bam) tool2 X metric1 X ref.fa + (C.bam + D.bam) tool2 X metric2 X ref.fa + (A.bam + B.bam) tool2 X metric2 X ref.fa + (C.bam + D.bam) For this mechanism to work, the :term:`tool` needs to be aware that it might receive a single or multiple files. The method :func:`.resolve_argument` helps here. In the example below, the tool expects a `,` separated list of input files:: def run(self, outfile, params): bam = resolve_argument(params.bam, sep=",") retval = P.run("{params.path} " "--inputs {bam} " "> {outfile} ") Tool/metric configuration ========================= Tools and metrics can receive optional (or required) configuration arguments in their own sections. The configuration options are grouped into sections within the configuration file named according to the metric or tool: .. code-block:: yaml tool1: options: --region=chr1 metric1: reference: ref.fa This will provide the option ``--verbose`` when running `tool1` and the parameter ``reference`` to `metric1`. Note that the tool and metric runner need to be aware of these options. See more about writing tools and metrics in :ref:`tasklibrary`. Multiple versions can be specified to provide an additional level of combinations. For example: .. code-block:: yaml tool1: options: - --region=chr1 - --region=chr2 metric1: reference: - ref.fa - other_ref.fa will run `tool1` with options ``--region1`` and ``--region2`` and `metric1` with two different reference data sets. Shared options can be specified using the ``prefix`` special command. .. code-block:: yaml tool1: options: - prefix=--verbose - --region=chr1 - --region=chr2 By default, tools and metrics are expected to reside in the user's :envvar:`PATH` variable. To run a particular version of a tool, use the `path` configuration value: .. code-block:: yaml weCall: path: /path/to/weCall/bin/weCall Note that this can also be multiplexed. To run several versions of a tool in a benchmark, type: .. code-block:: yaml weCall: path: - /path/to/weCall-old/bin/weCall - /path/to/weCall-new/bin/weCall Note that this assumes that the executables are entirely self-contained and automatically pick up references relative to their location. Automatic file expansion ======================== To help with the combinatorics, the benchmark file is aware of glob and find expressions. For example: .. code-block:: yaml input: file: find /data/library -name "*.bam" Will execute the unix ``find`` command and enter all files that have been found into the daisy. Filenames containing a `*` are interpreted as glob expressions: .. code-block:: yaml input: file: /data/library/1000Genomes/LowCovChr20BAMs/CEU_chr20/NA127*.bam .. _collation: Collation ========= Occasionally, tools need to be run individually, but metrics are computed on an aggregation of the tool output. For example, you might want to call variants across a population, but then compute allele frequencies on the aggregate VCF. For such a workflow, define a :ref:`collate` task:: setup: tools: - weCall collate: - mergegvcf_agg metrics: - bcftools_stats input: reference_fasta: /data/library/reference/hs37d5/hs37d5.fa bam: sample*.bam regex: (\S+).bam mergegvcf_agg: regex_in: (\S+).dir/result.vcf.gz pattern_out: result.vcf.gz runner: illumina_agg illumina_agg: reference_fasta: /data/library/reference/hs37d5/hs37d5.fa The workflow above will run weCall on all bam files matching the glob expression. The output will then be submitted to a collate task called ``mergegvcf_agg``. The task describes how input files should be grouped (``regex_in`` and ``pattern_out``) and which tool should be used for merging (``runner``). The tool (``illumina_agg``) is then configured in a separate section. .. _splitting: Splitting ========= The output of tools may be split in order to compute metrics on parts of the output separately. For example, the following will split the output by chromosome and then apply all metrics on both the original output and all the split files:: setup: ... split: - split_by_chrom split_by_chrom: runner: vcf_by_chromosome .. _exporting: Exporting ========= The benchmark system can export tool data for further use. To export, simply type:: benchmark run make export This will move all output files into a directory called :file:`export.dir` and place symbolic links into the pipeline directives to preserve workflow state. Files in the directory :file:`export.dir` will be renamed to label them according to the experiment. For example, :file:`weCall_NA12878.dir/result.vcf.gz` will become :file:`export.dir/weCall_NA12878.vcf.gz`. The :term:`export` target is a convenience function to collect all the tool data computed in an experiment if the tool data is of further interest, for example for additional processing in other benchmarks. Global configuration ==================== Below is a configuration values for interfacing Benchmark with the system. Cluster ------- Options to configure behaviour for running jobs on the cluster are in the section ``cluster``. The default values are:: cluster: queue: main.q priority: -1 num_jobs: 100 memory_resource: h_vmem memory_default: 4G options: "" parallel_environment: smp Note that some cluster options can be overridden at the command line. For example, ``--cluster-queue=slow.q`` will send jobs to ``slow.q``. The options ``--local`` will run jobs without the queuing system. Database -------- Database access is implemented through setting a database URL. The default is: .. code-block:: yaml database: url: postgres://localhost:5432/Benchmark With postgresql_ it is possible to use schema to organise metric tables. To use a schema, use: .. code-block:: yaml database: url: postgresql://andreas@trafalgar.camdc.genomicsplc.com/benchmark schema: cnv