Genome Sequencing for Mosquitoes

Who am I?

I’m Alistair Miles and I work for the Oxford University Big Data Institute, but am also affiliated with the Wellcome Sanger Institute. I lead the malaria vector (mosquito) genomics programme within the malaria genomic epidemiology network, an international network of researchers and malaria control professionals developing new technologies based on genome sequencing to aid in the effort towards malaria elimination. I also have a technical role as Head of Epidemiological Informatics for the Centre for Genomics and Global Health, which means I have some oversight and responsibility for computing and software architecture and direction within our teams at Oxford and Sanger.

What problem am I trying to solve?

Malaria is still a major cause of mortality, particularly in sub-Saharan Africa. Research has shown that the best way to reduce malaria is to control the mosquitoes that transmit it between people. Unfortunately, mosquito populations are becoming resistant to the insecticides used to control them. New mosquito control tools are needed, as are new systems for mosquito population surveillance/monitoring, to help inform and adapt control strategies in response to mosquito evolution. We have established a project, the Anopheles gambiae 1000 Genomes Project, to perform an initial survey of mosquito genetic diversity by sequencing the whole genomes of approximately 3,000 mosquitoes collected from field sites across 18 African countries. We are currently working to scale up our sequencing operations to be able to sequence ~10,000 mosquitoes per year, and to integrate genome sequencing into regular mosquito monitoring programmes across Africa and Southeast Asia.

How does Dask help?

Whole genome sequence data is a relatively large-scale data resource, requiring specialised processing and analysis to extract key information, e.g., identifying genes involved in the evolution of insecticide resistance. We use conventional bioinformatic approaches for the initial phases of data processing (alignment, variant calling, phasing), but beyond that point we switch to interactive and exploratory analysis using Jupyter notebooks.

Making interactive analysis of large-scale data work is obviously a challenge, because inefficient code and/or inefficient use of computational resources vastly increases the time taken for any computation, destroying the ability of an analyst to explore many different possibilities within a dataset. Dask helps by providing an easy-to-use framework for parallelising computations, either across multiple cores on a single workstation, or across multiple nodes in a cluster. We have built a software package called scikit-allel to help with our genetic analyses, and use Dask within that package to parallelise a number of commonly used computations.
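To make that concrete, here is a minimal sketch of the kind of computation we parallelise this way (my own toy example, not scikit-allel’s actual implementation): per-variant alternate allele frequencies over a genotype array of shape (variants, samples, ploidy). The sizes and random data are placeholders for a real call set.

```python
import dask.array as da

n_variants, n_samples, ploidy = 100_000, 1_000, 2

# Simulated diploid genotype calls: 0 = reference allele, 1 = alternate.
genotypes = da.random.randint(
    0, 2, size=(n_variants, n_samples, ploidy),
    chunks=(10_000, n_samples, ploidy),
)

# Count alternate alleles per variant, then divide by total allele count.
alt_counts = genotypes.sum(axis=(1, 2))
alt_freq = alt_counts / (n_samples * ploidy)

# Nothing has run yet; compute() executes the chunks in parallel.
result = alt_freq.compute()
```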

Why did I choose Dask?

Normally the transition from a serial (i.e., single-core) implementation of any given computation to a parallel (multi-core) implementation requires the code to be completely rewritten, because parallel frameworks usually offer a completely different API, and managing complex parallel workflows is a significant challenge.

Originally Dask was appealing because it provided a familiar API, with the dask.array package following the numpy API (which we were already using) relatively closely. Dask also handled all the complexity of constructing and running complex, multi-step computational workflows.
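As a toy illustration of that familiarity (my example, not from our code base), the same NumPy-style expression works unchanged on a dask array, with execution deferred until compute() is called:

```python
import numpy as np
import dask.array as da

x_np = np.random.random((10_000, 1_000))
r_np = (x_np - x_np.mean(axis=0)).std()            # eager NumPy

x_da = da.from_array(x_np, chunks=(1_000, 1_000))  # same data, chunked
r_da = (x_da - x_da.mean(axis=0)).std()            # same expression, lazy

print(np.allclose(r_np, r_da.compute()))           # True
```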

Today, we’re also interested in the flexibility Dask offers: we can initially parallelise over multiple cores in a single computer via multi-threading, then switch to running on a multi-node cluster with relatively little change in our code. Computations can thus be scaled up or down with great convenience. When we first started using Dask we were focused on making effective use of multiple threads on a single computer; now, as the data grows, we are moving data and computation into a cloud setting and looking to make use of Dask via Kubernetes.
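A rough sketch of what that switch looks like in practice, under the assumption of a running dask.distributed cluster (the scheduler address below is a placeholder; on Kubernetes a helper such as dask-kubernetes would typically stand the cluster up):

```python
import dask.array as da

x = da.random.random((1_000_000, 100), chunks=(100_000, 100))
total = (x ** 2).sum()

# Single workstation: run on the local threaded scheduler.
print(total.compute(scheduler="threads"))

# Multi-node: point a distributed client at a cluster scheduler; once the
# client exists, compute() calls are dispatched to the cluster instead.
from dask.distributed import Client
client = Client("tcp://scheduler.example.com:8786")  # hypothetical address
print(total.compute())
```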

Pain points?

Initially, when we started using Dask in 2015, we hit a few bugs, and some of the error messages generated by Dask were very cryptic, so debugging some problems was hard. However, the stability of the code base, the user documentation, and the error messages have all improved a lot recently, and the sustained investment in Dask is clearly adding a lot of value for users.

It is still difficult to think about how to code up parallel operations over multidimensional arrays where one or more dimensions are dropped by the function being mapped over the data, but there is some inherent complexity there, so there is probably not much Dask can do to help.
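For the curious, this is the kind of situation meant; with dask.array’s map_blocks you have to declare explicitly which axis disappears (example mine):

```python
import dask.array as da

# (variants, samples, ploidy) genotype calls, as in our data.
g = da.random.randint(0, 2, size=(100_000, 500, 2),
                      chunks=(10_000, 500, 2))

def count_alt(block):
    # Collapse the ploidy axis: each input block is 3-d, output is 2-d.
    return block.sum(axis=2)

# drop_axis tells Dask that axis 2 of the input is gone from the output,
# so it can work out the chunk structure of the result.
alt = da.map_blocks(count_alt, g, drop_axis=2, dtype=g.dtype)
print(alt.shape)  # (100000, 500)
```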

The Dask code base itself is tidy and consistent, but quite hard to get into in order to understand and debug issues. Again, Dask is handling a lot of inherent complexity, so maybe not much can be done.

Technology I use around Dask

We are currently working on deploying both JupyterHub and Dask on top of Kubernetes in the cloud, following the approach taken in the Pangeo project. We use Dask primarily through the scikit-allel package. We also use Dask together with the Zarr array storage library; in fact, the original motivation for writing Zarr was to provide a storage library that enabled Dask to efficiently parallelise I/O-bound computations.
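A minimal sketch of that pairing (the path and sizes are illustrative, not our real data): a chunked dask array is persisted to Zarr, then read back lazily with dask chunks aligned to the Zarr chunks, so I/O parallelises naturally:

```python
import dask.array as da

x = da.random.random((1_000_000, 100), chunks=(100_000, 100))
da.to_zarr(x, "example_calls.zarr")     # one Zarr chunk per Dask chunk

y = da.from_zarr("example_calls.zarr")  # lazy, chunk-aligned reads
print(y.chunks)
print(y[:10, :5].compute())             # only the chunks needed are read
```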

Anything else to know?

Our analysis code is still quite heterogeneous, with some code making use of a bespoke approach to out-of-core computing which we developed prior to being aware of Dask, and the remainder using Dask. This is just a legacy of timing, with some work having started before we knew about Dask. With the stability and maturity of Dask now, I am very happy to push towards full adoption.

One cognitive shift this requires is for users to get used to lazy (deferred) computation. This can be a stumbling block to start with, but it is worth the effort of learning because it gives the user the ability to run larger computations. I have therefore been thinking about writing a blog post to communicate the message that we are moving towards adopting Dask wherever possible, and to give an introduction to the lazy coding style, with examples from our domain (population genomics). There are also still quite a few functions in scikit-allel that could be parallelised via Dask but haven’t yet been, so I still have an aspiration to work on that. I am not sure when I’ll get to these, but hopefully this conveys the intention to adopt Dask more widely and also to help train people in our immediate community to use it.
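The lazy style in a nutshell (a toy example of my own): each expression below merely extends a task graph, and no data is touched until compute() is called:

```python
import dask.array as da

x = da.ones((1_000_000,), chunks=100_000)
y = (x + 1) ** 2     # instant: just records work in the task graph
z = y.sum()          # still lazy; z is a recipe for a scalar

print(z)             # prints a dask array description, not a number
print(z.compute())   # now the graph runs in parallel -> 4000000.0
```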