Sidewalk Labs: Civic Modeling

Who am I?

I’m Brett Naul. I work at Sidewalk Labs.

What problem am I trying to solve?

My team @ Sidewalk (“Model Lab”) uses machine learning models to study human travel behavior in cities and produce high-fidelity simulations of the travel patterns/volumes in a metro area. Our process has three main steps:

  • Construct a “synthetic population” from census data and other sources of demographic information; this population is statistically representative of the true population but contains no actual identifiable individuals.
  • Train machine learning models on anonymous mobile location data to understand behavioral patterns in the region (what times do people go to lunch, what factors affect an individual’s likelihood to use public transit, etc.).
  • For each person in the synthetic population, generate predictions from these models and combine the resulting activities into a single model of all the activity in a region.

For more information see our blog post Introducing Replica: A Next-Generation Urban Planning Tool.
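
To make the shape of that pipeline concrete, here is a minimal, self-contained sketch; every function, column, and rule below is a hypothetical placeholder rather than our actual models:

```python
import numpy as np
import pandas as pd

def build_synthetic_population(n_people: int) -> pd.DataFrame:
    """Step 1 (toy version): a synthetic population of individuals.
    Here it is just random draws; in reality it is fit to census marginals."""
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "person_id": np.arange(n_people),
        "age": rng.integers(18, 80, n_people),
        "has_transit_pass": rng.random(n_people) < 0.3,
    })

def predict_activities(person: pd.Series) -> pd.DataFrame:
    """Steps 2-3 (toy version): apply a "trained" behavior model to one person.
    The rule here is trivial; the real models are learned from anonymous
    mobile location data."""
    mode = "transit" if person["has_transit_pass"] else "car"
    return pd.DataFrame({
        "person_id": [person["person_id"]] * 2,
        "activity": ["work", "lunch"],
        "mode": [mode, mode],
    })

# Combine per-person predictions into a single table of regional activity.
population = build_synthetic_population(1_000)
activities = pd.concat(
    (predict_activities(p) for _, p in population.iterrows()), ignore_index=True
)
```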

How Dask Helps

Generating activities for millions of synthetic individuals is extremely computationally intensive; even with, for example, a 96-core instance, simulating a single day in a large region initially took days. It was important to us to be able to run a new simulation from scratch overnight, and scaling to hundreds/thousands of cores across many workers with Dask let us accomplish our goal.
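
As a rough illustration of what that fan-out looks like (the scheduler address, chunking, and simulation function here are assumptions, not our production code), dask.distributed lets one task per chunk be spread across every worker in the cluster:

```python
import numpy as np
import pandas as pd
from dask.distributed import Client

def simulate_chunk(people: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for simulating one slice of the synthetic
    population (in reality this calls the trained behavior models)."""
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "person_id": people["person_id"],
        "n_activities": rng.integers(1, 6, len(people)),
    })

if __name__ == "__main__":
    # Connect to an existing scheduler (e.g. one running on Kubernetes);
    # Client() with no arguments would start a local cluster instead.
    client = Client("tcp://localhost:8786")

    population = pd.DataFrame({"person_id": np.arange(1_000_000)})
    chunk_size = 1_000
    chunks = [
        population.iloc[i : i + chunk_size]
        for i in range(0, len(population), chunk_size)
    ]

    # One task per chunk; the scheduler spreads them across all workers.
    futures = client.map(simulate_chunk, chunks)
    activities = pd.concat(client.gather(futures), ignore_index=True)
```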

Why we chose Dask originally, and how these reasons have changed over time

Our code consists of a mixture of legacy research-quality code and newer production-quality code (mostly Python). Before I started we were using Google Cloud Dataflow (Python 2 only, massively scalable but generally an astronomical pain to work with / debug) and multiprocessing (something like ~96 cores max).

Dask let us scale beyond a single machine with only minimal changes to our data pipeline. If we had been starting from scratch I think it’s likely we would have gone in a different direction (something like C++ or Go microservices, especially since we have strong Google ties), but from my perspective as a hybrid infrastructure engineer/data scientist, having all of our models in Python makes it easy to experiment and debug statistical issues.
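
For a flavor of why the changes were minimal (this is an illustrative toy, not our pipeline code), the multiprocessing-style map translates almost one-for-one into a Dask client map:

```python
from multiprocessing import Pool

from dask.distributed import Client

def simulate_chunk(chunk):
    # Hypothetical stand-in for one unit of simulation work.
    return sum(chunk)

chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]

if __name__ == "__main__":
    # Before: limited to the cores of a single machine.
    with Pool(processes=8) as pool:
        results = pool.map(simulate_chunk, chunks)

    # After: the same map, spread across however many workers the scheduler
    # knows about (the address here is hypothetical).
    client = Client("tcp://localhost:8786")
    results = client.gather(client.map(simulate_chunk, chunks))
```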

Some of the pain points of using Dask for our problem

There is a lot of specialized Dask knowledge that only I possess, for example:

  • Which formats we can use to serialize data so that it can be reloaded efficiently: sometimes we can use Parquet, other times it should be CSVs so we can easily chunk them dynamically at runtime

  • Sometimes we load data on the client and scatter it to the workers, and other times we load chunks directly on the workers (a sketch of both patterns follows this list)

  • The debugging process is sufficiently more complicated than for local code that it’s harder for other people to help resolve issues that occur on workers

  • The scheduler has been the source of most of our scaling issues: when the number of tasks/chunks of data gets too large, the scheduler tends to fall over silently in some way.

Some of these failures might be due to Kubernetes (if we run out of RAM, we don’t see an OOM error; the pod just disappears and the job restarts). We had to do some hand-tuning of things like timeouts to make things more stable, and there was quite a bit of trial and error to get to a relatively reliable state.

  • This has more to do with our deploy process, but we would sometimes end up in situations where the scheduler and workers were running different dask/distributed versions, and things would crash when tasks were submitted but not when the connection was made, which made it take a while to diagnose (plus the error tends to be something inscrutable like KeyError: … that others besides me would have no idea how to interpret)
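
To make the loading patterns from the first two bullets concrete, and to show the kind of guard that would have caught the version mismatch earlier, here is a rough sketch; the scheduler address, file paths, and timeout value are assumptions rather than our real configuration:

```python
import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

# The kind of hand-tuning mentioned above (the value is illustrative): give
# slow connections more time before the comms layer gives up.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

client = Client("tcp://localhost:8786")  # hypothetical scheduler address

# Fail fast if the client, scheduler, and workers are running mismatched
# dask/distributed versions, instead of hitting an inscrutable KeyError at
# task-submission time.
client.get_versions(check=True)

# Pattern 1: load on the client and scatter to the workers; reasonable when
# the data comfortably fits in the client's memory.
chunks = [pd.read_csv(f"people-{i}.csv") for i in range(10)]  # hypothetical files
futures = client.scatter(chunks)

# Pattern 2: have the workers read chunks directly (e.g. from Parquet), so
# nothing heavy ever passes through the client.
people = dd.read_parquet("gs://example-bucket/synthetic-population/")  # hypothetical path
```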

Some of the technology that we use around Dask

  • Google Kubernetes Engine: lots of worker instances (usually 16 cores each), 1 scheduler, 1 job runner client (plus some other microservices)
  • Make + Helm
  • For debugging/monitoring I usually kubectl port-forward to 8786 and 8787 and watch the dashboard/submit tasks manually. The dashboard is not very reliable over port-forward when there are lots of workers (for some reason the websocket connection dies repeatedly), but just reconnecting to the pod and refreshing always does the trick.
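
A minimal version of that debugging loop might look like the following (the pod name and addresses are assumptions):

```python
# In one terminal (pod name is hypothetical):
#   kubectl port-forward <scheduler-pod> 8786:8786 8787:8787
# The dashboard is then at http://localhost:8787, and tasks can be submitted
# by hand from a local Python session:
from dask.distributed import Client

client = Client("tcp://localhost:8786")
print(client.dashboard_link)

# Smoke-test the cluster with a trivial task.
future = client.submit(lambda x: x + 1, 41)
print(future.result())  # 42
```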