========================
Variant caller benchmark
========================

Inside daisy's :ref:`task library` are some common tools used for
variant calling. The following guide shows how to install daisy and
run a variant-caller benchmark on publicly available data. It starts
from scratch, though you will have to have installed a conda_
environment.

Installation
============

We start by creating a conda environment and installing daisy::

    conda create -y -n daisy python=3.6 matplotlib
    pip install cgatcore
    pip install cgat-daisy
    conda install ruamel_yaml pysam

.. note::

   We use a `pip install` until conda packages is available and are up-to-date.

Next, we install a few variant callers (freebayes_, octopus_ and
bcftools_) and variant metrics that we want to test::

    conda install -y -c bioconda freebayes octopus bcftools samtools bedtools

All of these have already been wrapped in daisy's :ref:`tasklibrary`
and are ready to be used. Next, we download the aligned exome
sequencing data of the NA12878. We will only use chromosome 20 to
avoid downloading the whole file (26Gb)::

    samtools view -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120117_ceu_trio_b37_decoy/CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam 20 | samtools view -bS > NA12878.bam
    samtools index NA12878.bam

For variant calling, we will also need the reference genome sequence::

    wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
    gunzip hs37d5.fa

Running the benchmarks
======================

Create a benchmark.yml file::

    title : >-
      Benchmarking variant callers on NA12878 exome data

    description: >-
      This benchmark calls short variants on NA12878 exome data
      evaluates the results by comparison against a truth data set
      (Genomes in a Bottle, NA12878).

    tags:
      - SNV calling
      - NA12878

    database:
      # we will uload results to a local sqlite database
      url: sqlite:///./csvdb

    setup:
      suffix: vcf.gz

      tools:
	- bcftools
	- freebayes
	- octopus

      metrics:
	- bcftools_stats

    input:
      reference_fasta: hs37d5.fa

      bam:
	- NA12878.bam

    bcftools:
      options: --format-fields GQ,GP --multiallelic-caller

    bcftools_stats:
      options: --fasta-ref hs37d5.fa --apply-filters "PASS,."
      # apply a hard filter to freebayes output
      task_specific:
	freebayes.*:
	  filter_exclude: "FORMAT/GT == '.' || INFO/DP < 5 || QUAL < 20"

Now run the benchmark::

   daisy run -v 5 make all

You can now upload the results to the database::

   daisy run -v 5 make upload

This will organize the metric data into an sqlite database. To get the
number of variants called, you can query the database::

   sqlite3 -header -csv csvdb "select i.tool_name, m.key, m.value
   	   FROM bcftools_stats_summary_numbers AS m, instance AS i
	   WHERE i.id = m.instance_id AND key='number_of_SNPs' "

which will produce the following output:

.. csv-table::
   :header: tool_name,key,value
  
   bcftools,number_of_SNPs,14454
   freebayes,number_of_SNPs,2308
   octopus,number_of_SNPs,9386

Note that such tables can be easily obtained within pandas and used
for plotting. For example, the following small python snippet::

    import pandas
    import sqlalchemy
    import matplotlib.pyplot as plt

    database = sqlalchemy.create_engine("sqlite:///./csvdb")
    df = pandas.read_sql(
	"SELECT i.tool_name, m.key, m.value "
	"FROM bcftools_stats_summary_numbers AS m, instance AS i "
	"WHERE i.id = m.instance_id AND key='number_of_SNPs' ", database).set_index("tool_name")

    df.plot.bar()
    plt.tight_layout()
    plt.savefig("number_variants.png")

will create the following figure:

.. image:: number_variants.png

Adding another metric
=====================

For a proper variant caller comparison, we should compare against a
gold standard of variant calls for our data set. This is available from the
NIST/Genome in a bottle initiative::

   wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
   wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.tbi
   wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed

Because this is an exome data set, we restrict the high-confidence regions to captured regions::

    bedtools genomecov -ibam NA12878.bam -bg | awk '$4 >= 10 ' | bedtools merge -d 10 -i stdin | bgzip > high_coverage_regions.bed.gz
    bedtools intersect -a HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed -b high_coverage_regions.bed.gz | bedtools sort | bgzip > callable_regions.bed.gz
    tabix -p bed callable_regions.bed.gz

For the comparison, we will use the vcfeval tool from RealTimeGenomics::

   conda install -c bioconda rtg-tools

The toolkit requires its specially formatted reference sequence::

    rtg RTG_MEM=16G format -o hs37d5.sdf hs37d5.fa

Now we can amend our :file:`benchmark.yml` file by adding the rtg_vcfeval
metric to the :term:`metrics` section::

  metrics:
    - bcftools_stats
    - rtg_vcfeval

The RTG_ vcfeval tool requires a bit of configuration, so we add the following to :file:`benchmark.yml`::

    rtg_vcfeval:
      path: rtg RTG_MEM=16G
      map_unknown_genotypes_to_reference: 1
      reference_sdf: hs37d5.sdf
      reference_vcf: HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz
      callable_bed: callable_regions.bed.gz
      options: --sample=HG001,NA12878 --ref-overlap

We re-run our benchmark::

   daisy run -v 5 make all

Note that the variant callers are not re-run, but only additional
metrics are computed. Behind the scenes, daisy builds a ruffus_
workflow which means only tasks that are not up-to-date will be
executed. After uploading::

   daisy run -v 5 make all

We now have false positive rates and false negative rates in our table::

  s3 csvdb "select i.tool_name, m.* from rtg_vcfeval AS m, instance AS i where i.id = m.instance_id "

.. csv-table::
   :header: tool_name,threshold,true_positive_baseline,true_positive_count,false_positive_count,false_negative_count,false_discovery_rate,false_negative_rate,f_measure,instance_id
  
   bcftools,12.000,1542,1542,63,103,0.0393,0.0626,0.9489,4
   bcftools,None,1544,1544,67,101,0.0416,0.0614,0.9484,4
   freebayes,None,1607,1586,167,38,0.0953,0.0231,0.9394,5
   octopus,5.000,1518,1518,25,127,0.0162,0.0772,0.9523,6
   octopus,None,1518,1518,26,127,0.0168,0.0772,0.952,6

Next steps
==========

The following are some advanced features not covered in this tutorial:

* Running on multiple data sets within the same benchmark.

* Aggregation of data sets, for example for trio analysis.

* Running time-series analysis for monitoring tool performance.


.. _freebayes: https://github.com/ekg/freebayes
.. _bcftools: https://samtools.github.io/bcftools/bcftools.html
.. _octopus: https://github.com/luntergroup/octopus