Arvados integration

This section describes the various ways that the benchmark system can work with Arvados Keep and Crunch.

Note

This functionality is experimental.

Note

The commands below assume that the Arvados CLI tools and Python SDK have been installed and that a working environment has been set up. You will also need git access.

The benefit of the benchmark system is that commands added to the Task Library can be executed on data from Keep or from the local file system, and can run either on a local machine or on a cluster. The system takes care of mounting and unmounting the data as well as job submission and execution.

Running jobs locally

To recap, this is how you execute a task with run-task:

benchmark run-task \
  --task=metric_filestat \
  --input-file=/etc/passwd \
  --output-file=test.out \

This will execute the metric task filestat. On a cluster, the task will be submitted to the cluster's queue; see clustersetup for how to set this up.

Running a task with data from Keep

To run a single task using data in Keep, prefix the file location with arv=. The following command will run the metric filestat on the file hs37d5.fa and write the results to test.out:

benchmark run-task \
  --task=metric_filestat \
  --input-file=arv=by_id/cce431bdd40f4428666e45bd59c7d41b+6278/hs37d5.fa \
  --output-file=test.out

The same could be achieved by manually mounting Arvados Keep and using the file system path in its mounted location:

mkdir -p /tmp/keep
arv-mount /tmp/keep
benchmark run-task \
  --task=metric_filestat \
  --input-file=/tmp/keep/by_id/cce431bdd40f4428666e45bd59c7d41b+6278/hs37d5.fa \
  --output-file=test.out
fusermount -u /tmp/keep

In fact, this is exactly what happens behind the scenes.
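
Once Keep is mounted, a collection can be browsed like any other directory. For example, to list the contents of the collection used above:

ls /tmp/keep/by_id/cce431bdd40f4428666e45bd59c7d41b+6278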

run-task is cluster-enabled. If a cluster is detected, the job will be submitted to the appropriate queue. See the command line options of run-task for how to control cluster submission.
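
The available options, including those that control cluster submission, should be listed by the tool's own help output (assuming run-task follows the usual --help convention):

benchmark run-task --help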

The command above does not upload the results back to Arvados Keep; that happens in the next step.

Running a task as an arvados job

Running a task through arv-crunch-job will upload the results to Keep and at the same time link the output to the input and to the command executed. This is done by specifying the arvados engine option:

benchmark run-task \
   --engine=arvados \
   --task=metric_filestat \
   --input-file=arv=by_id/cce431bdd40f4428666e45bd59c7d41b+6278/hs37d5.fa \
   --output-file=stat.tsv

This will create a job in the Arvados system, but run it locally. As before, if a cluster is detected, the job will be submitted to it. The UUIDs of the generated output and log files will be echoed at the end of the script. Through the job, input and output are linked.
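
The echoed UUID can then be used to retrieve the results from Keep, for example with arv-get from the Arvados Python SDK (the placeholder below stands for the collection UUID printed by the script):

arv-get <output-collection-uuid>/stat.tsv .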

[Figure: arvados_input_output.png, showing input and output linked through the job]

Note

Note that we are referring to the data set by UUID. This is necessary for the Arvados job runner to identify files within Keep so that it can maintain the link between input and output.
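
To check which files a collection contains, it can be listed with arv-ls from the Arvados Python SDK, here using the portable data hash from the examples above:

arv-ls cce431bdd40f4428666e45bd59c7d41b+6278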

The input files need not come from Keep. This method can also be used to upload data generated from a local file to Keep:

benchmark run-task \
   --engine=arvados \
   --task=metric_filestat \
   --input-file=/etc/passwd \
   --output-file=stat.tsv

The provenance graph will link the uploaded file to the command that created it:

[Figure: arvados_only_output.png, showing the uploaded file linked to the command that created it]
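
For comparison, a local file can also be uploaded to Keep manually with arv-put, which prints the locator of the new collection. Unlike a run with --engine=arvados, however, this records no provenance link between the data and any command:

arv-put /etc/passwd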

At the moment, the execution uses a locally installed benchmark. As a consequence, the version of the code is not known to Keep. In the future, it will be possible to specify a Docker image hosted within Keep. This should then permit full provenance tracking, including the version of the code and tools.

Running a workflow with files from Arvados Keep

It is possible to run workflows on the local system with the benchmark workflow system while using files within Keep. Files that reside within Keep are identified by the arv= prefix. Save the following in a file called benchmark.yml:

title : >-
  Simple Variant-calling + QC workflow

description: >-
  A simple workflow calling variants in a set of BAM files
  and running bcftools stats for QC.

tags:
  - SNV calling

setup:

  tools:
    - weCall

  metrics:
    - bcftools_stats

database:
  url: sqlite:///./csvdb

input:
  reference_fasta: arv=by_id/303fdf14a0443728c5b2e3e0ab25a155+251771/Sequence/WholeGenomeFasta/genome.fa

  bam: arv=by_id/17322baa637c0b5152c8eef24a6c2b52+2559856/*.bam

  regex: ([^/ ,]+).bam

weCall:

  options: --numberOfJobs 8 --allowMNPCalls 0 --recalibrateBaseQs 1 --regions=chr1:30000000-31000000

bcftools_stats:

  options: --fasta-ref arv=by_id/303fdf14a0443728c5b2e3e0ab25a155+251771/Sequence/WholeGenomeFasta/genome.fa --apply-filters "PASS,."

To run the workflow, type:

benchmark run -v 5 -p 10 make all

This will execute the workflow on the local system using data sets stored within Keep.

Note that it is possible to mix files located on the local file system with files residing in Keep. As before, specifying --engine=arvados will run the workflow as a crunch script and upload the data to Keep:

benchmark run --engine=arvados -v 5 -p 10 make all

The dependencies between input and output are stored in Keep:

[Figure: arvados_workflow.png, showing the dependencies between input and output stored in Keep]

To upload metrics to a database, use the target upload instead of all:

benchmark run --engine=arvados -v 5 -p 10 make upload

Note

When using an SQLite database, make sure to use an absolute path name, as the crunch script will run the benchmark workflow inside a temporary local directory.
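
For example, the database section above could point at an absolute location instead; the path below is illustrative, and note the extra slash that SQLAlchemy-style URLs use for absolute paths:

database:
  url: sqlite:////data/myproject/csvdb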

The UUIDs of the user, the job, and the output will be added to the arvados_job table in the database.
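
These records can be inspected directly, for example with the sqlite3 command line tool, assuming the csvdb file from the configuration above:

sqlite3 csvdb "SELECT * FROM arvados_job;"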

Often a workflow is run and updated iteratively, and only the final output should be uploaded to Keep. To achieve this, run:

benchmark run -v 5 -p 10 make all
...
benchmark run --engine=arvados -v 5 -p 10 make keep

The first statement runs the workflow locally. Once you are satisfied with the results, the second statement uploads the data into Keep without re-running any computations, unless there are files that are out of date.

To upload to Keep while simultaneously uploading the metric data to the database, use:

benchmark run -v 5 -p 10 make all
...
benchmark run --engine=arvados -v 5 -p 10 make keep-and-load

This makes sure that the UUIDs in Keep and in the database are synchronized.

Running jobs through Crunch

Not implemented yet, as Crunch is not fully installed in London. This should be straightforward using the CLI interface to daisy.

Notes

arv-crunch-job

Benchmark tasks can be run using Arvados’ run-command utility. Create the following script and call it run.job:

{
  "script": "run-command",
  "script_version": "master",
  "repository": "arvados",
  "script_parameters": {
    "command": [
      "benchmark",
      "run-task",
      "-v 5",
      "--task=metric_filestat",
      "--input-file=arv=by_id/cce431bdd40f4428666e45bd59c7d41b+6278/hs37d5.fa",
      "--output-file=stat.tsv"
    ]
  }
}
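
Since the whole file is passed on the command line, it is worth confirming that it is valid JSON first, for example with Python's built-in json.tool module:

python -m json.tool run.job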

Then run this script using the following command:

arv-crunch-job --job="$(cat run.job)"

This happens under the hood when --engine=arvados is set in benchmark run-task.

Todo

  • Upload the benchmark.yml file when running a workflow.
  • Implement use of a Docker image for code and config files from the benchmark library.
  • Link to collections in Keep within postgres.
    • Add output_uuid; this requires setting the field from CollectionWriter within run-command, or querying the Arvados system with the job_uuid after the crunch job has finished.
    • Add collection ids to input files?
  • Add metadata/description to new collections created by benchmark jobs: when it was run, by whom; take text from the daisy.yml file?
  • Run workflows inside Arvados Crunch.
Glossary

UUID
  Unique identifier for an object in Keep.
Arvados
  The Arvados Project is dedicated to building a new generation of open source distributed computing software for bioinformatics, data science, and production analysis using massive data sets.
Keep
  Arvados Keep is the data storage component of Arvados.
Crunch
  Arvados Crunch is the execution framework of Arvados.