Distributed and parallel training


Callbacks and helper functions to train in parallel or use distributed training


When using multiple GPUs, you will most probably want to fit using distributed training. See examples/distrib.py for a complete example. To use distributed training, there are only two required steps:

  1. Add with learn.distrib_ctx(): before your learn.fit call
  2. Run your training script with python -m fastai.launch scriptname.py ...args...

After fastai.launch you can add --gpus 0,1, for instance, to use only GPUs 0 and 1.
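
For concreteness, here is a minimal sketch of such a training script. The dataset, architecture and hyperparameters are illustrative choices, not requirements; only the rank0_first and distrib_ctx lines are specific to distributed training.

# train_imagewoof.py -- an illustrative sketch; run with: python -m fastai.launch train_imagewoof.py
from fastai.vision.all import *
from fastai.distributed import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)   # download/uncompress once, on rank 0 first
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(224), bs=64)
learn = cnn_learner(dls, resnet34, metrics=accuracy)

with learn.distrib_ctx():                             # wrap the fit call in the distributed context
    learn.fit_one_cycle(5, 1e-3)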

If you’re using untar_data, or may be downloading or uncompressing data or models as part of your script, you should wrap that code with rank0_first, which forces that step to occur first just once on the master process, prior to the remaining processes running it in parallel. E.g. instead of:

path = untar_data(URLs.IMAGEWOOF_320)

…you instead use:

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)

See below for details on the full API and underlying helper functions, if needed — however, note that you will not need anything except the above unless you need to change how the distributed training is implemented.

Parallel

DataParallel.reset[source]

DataParallel.reset()

Patch required reset call into DataParallel

class ParallelTrainer[source]

ParallelTrainer(device_ids) :: Callback

Wrap a model in DataParallel automatically

Learner.to_parallel[source]

Learner.to_parallel(device_ids=None)

Add ParallelTrainer callback to a Learner

Learner.detach_parallel[source]

Learner.detach_parallel()

Remove ParallelTrainer callback from a Learner

Learner.parallel_ctx[source]

Learner.parallel_ctx(device_ids=None)

A context manager to adapt a learner to train in data parallel mode.
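
A rough usage sketch, assuming learn is an existing Learner and the machine has (at least) GPUs 0 and 1:

with learn.parallel_ctx(device_ids=[0, 1]):   # model is wrapped in DataParallel inside the context
    learn.fit_one_cycle(3, 1e-3)
# on exit, the ParallelTrainer callback is detached again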

Distributed

Helper functions

DistributedDataParallel.reset[source]

DistributedDataParallel.reset()

Patch required reset call into DistributedDataParallel

setup_distrib[source]

setup_distrib(gpu=None)

Setup this process to participate in distributed training

teardown_distrib[source]

teardown_distrib()

Free distributed training resources
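
These helpers are normally called for you by distrib_ctx (see below). A sketch of using them by hand, assuming the script was started with python -m fastai.launch so that RANK and WORLD_SIZE are already in the environment, and assuming process i should use GPU i:

import os
from fastai.distributed import *

gpu = setup_distrib(os.environ.get('RANK'))   # pin this process to a GPU and join the process group
try:
    ...                                       # build DataLoaders/Learner and train here
finally:
    teardown_distrib()                        # destroy the process group when done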

DataLoader

class DistributedDL[source]

DistributedDL(dl, rank=None, world_size=None) :: TfmdDL

A TfmdDL which splits a batch into equal size pieces for each worker

dl = TfmdDL(list(range(50)), bs=12, num_workers=2)
for i in range(4):
    dl1 = DistributedDL(dl, i, 4)
    test_eq(list(dl1), (torch.arange(i*13, i*13+12)%50, torch.tensor([i*13+12])%50))

class DistributedTrainer[source]

DistributedTrainer(cuda_id=0, sync_bn=True) :: Callback

Wrap model in DistributedDataParallel and dls in DistributedDL

Learner.to_distributed[source]

Learner.to_distributed(cuda_id, sync_bn=True)

Add DistributedTrainer to a learner
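
A sketch of the attach/detach pair used directly, instead of the distrib_ctx context manager; rank_distrib is the fastai helper that reads this process's rank from the environment, and using it as the GPU id assumes one GPU per process in rank order:

learn.to_distributed(rank_distrib())   # wrap the model and dls for this process's GPU
learn.fit_one_cycle(5, 1e-3)
learn.detach_distributed()             # back to plain single-process training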

Learner.detach_distributed[source]

Learner.detach_distributed()

Remove DistributedTrainer from a learner

distrib_ctx context manager

Learner.distrib_ctx[source]

Learner.distrib_ctx(cuda_id=None, sync_bn=True)

A context manager to adapt a learner to train in distributed data parallel mode.

distrib_ctx prepares a learner to train in distributed data parallel mode. It assumes the distributed training environment variables have all been set up properly, as they are when the script is launched by python -m fastai.launch.

Typical usage:

with learn.distrib_ctx(): learn.fit(.....)

Inside the context, a DistributedTrainer callback and DistributedDL data loaders are attached to the learner, so the learn.fit(.....) call runs distributed training. Upon exiting the context, the DistributedTrainer and DistributedDL are removed and any locally created distributed process group is destroyed. The process remains attached to its GPU, however.
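
Both arguments in the signature above can also be passed explicitly; for instance, a sketch that takes the GPU id from the process rank and turns off batchnorm synchronization across processes:

with learn.distrib_ctx(cuda_id=rank_distrib(), sync_bn=False):
    learn.fit(5, 1e-3)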

rank0_first[source]

rank0_first(func, *args, **kwargs)

Execute func in the Rank-0 process first, then in other ranks in parallel.

rank0_first calls func() in the rank-0 process first, then in parallel on the remaining ranks, when running in distributed training mode. In single-process, non-distributed training mode, func() is called only once, as expected.

One application of rank0_first() is to make fresh downloads via untar_data safe in distributed training scripts launched by python -m fastai.launch <script>:

path = untar_data(URLs.IMDB)

becomes:

path = rank0_first(lambda: untar_data(URLs.IMDB))

Some learner factory methods may use untar_data to download pretrained models:

learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

becomes:

learn = rank0_first(lambda: text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy))

Otherwise, multiple processes will download at the same time and corrupt the data.

