DATA PARALLELISM TRAINING

In Common Distributed Parallel Strategies, we introduced the characteristics of data parallelism.

OneFlow provides the oneflow.nn.parallel.DistributedDataParallel module together with a launcher (oneflow.distributed.launch), which allow users to run data parallel training with almost no changes to a single-node script.

A quick start of OneFlow's data parallelism training:

  wget https://docs.oneflow.org/master/code/parallelism/ddp_train.py # Download the script
  python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py # data parallel training

Out:

  50/500 loss:0.004111831542104483
  50/500 loss:0.00025336415274068713
  ...
  500/500 loss:6.184563972055912e-11
  500/500 loss:4.547473508864641e-12
  w:tensor([[2.0000],
          [3.0000]], device='cuda:1', dtype=oneflow.float32,
         grad_fn=<accumulate_grad>)
  w:tensor([[2.0000],
          [3.0000]], device='cuda:0', dtype=oneflow.float32,
         grad_fn=<accumulate_grad>)

Codes

The complete code of the running script above is as follows:

  import oneflow as flow
  from oneflow.nn.parallel import DistributedDataParallel as ddp

  # Toy training data: each of the two ranks trains on one (x, y) pair.
  train_x = [
      flow.tensor([[1, 2], [2, 3]], dtype=flow.float32),
      flow.tensor([[4, 6], [3, 1]], dtype=flow.float32),
  ]
  train_y = [
      flow.tensor([[8], [13]], dtype=flow.float32),
      flow.tensor([[26], [9]], dtype=flow.float32),
  ]

  class Model(flow.nn.Module):
      def __init__(self):
          super().__init__()
          self.lr = 0.01
          self.iter_count = 500
          self.w = flow.nn.Parameter(flow.tensor([[0], [0]], dtype=flow.float32))

      def forward(self, x):
          x = flow.matmul(x, self.w)
          return x

  m = Model().to("cuda")
  m = ddp(m)  # wrap the module so gradients are synchronized across processes
  loss = flow.nn.MSELoss(reduction="sum")
  optimizer = flow.optim.SGD(m.parameters(), m.lr)

  for i in range(0, m.iter_count):
      # Each process selects the shard of data that belongs to its own rank.
      rank = flow.env.get_rank()
      x = train_x[rank].to("cuda")
      y = train_y[rank].to("cuda")

      y_pred = m(x)
      l = loss(y_pred, y)

      if (i + 1) % 50 == 0:
          print(f"{i+1}/{m.iter_count} loss:{l}")

      optimizer.zero_grad()
      l.backward()
      optimizer.step()

  print(f"\nw:{m.w}")

There are only two differences between the data parallelism training code and a single-machine, single-GPU script:

  • Use DistributedDataParallel to wrap the module object (m = ddp(m)).
  • Use get_rank to obtain the rank of the current process and distribute the data accordingly, so that each process trains on its own shard; see the sketch after this list.
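
In the script above there are exactly as many data shards as ranks, so indexing train_x by rank is enough. For a longer list of samples, a common pattern is to keep every world_size-th sample on each rank. The sketch below illustrates this; it assumes flow.env.get_world_size() is available alongside get_rank(), and the helper name shard_for_this_rank is only illustrative.

  import oneflow as flow

  def shard_for_this_rank(samples):
      # Keep a disjoint, strided shard of the samples on each process.
      # Assumes flow.env.get_world_size() exists next to get_rank().
      rank = flow.env.get_rank()              # index of the current process
      world_size = flow.env.get_world_size()  # total number of processes
      return samples[rank::world_size]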

Then run the script with the launcher and leave everything else to OneFlow, which makes distributed training as simple as single-machine, single-GPU training:

  python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py
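
As a quick sanity check, a script as small as the one below (check_launch.py is only a hypothetical file name) can be started with the same command to confirm that the launcher spawns one process per GPU; it also assumes flow.env.get_world_size() is available.

  # check_launch.py (hypothetical file name), launched with:
  #   python3 -m oneflow.distributed.launch --nproc_per_node 2 ./check_launch.py
  import oneflow as flow

  # Each spawned process prints its own rank and the total process count.
  print(f"rank {flow.env.get_rank()} / world size {flow.env.get_world_size()}")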

DistributedSampler

In the example above, the data is distributed manually in order to highlight DistributedDataParallel. In practical applications, you can simply use DistributedSampler together with data parallelism.

With DistributedSampler, a Dataloader is instantiated in each process, and each Dataloader instance loads a portion of the complete dataset, so the data distribution is completed automatically.
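
As a rough sketch of what this looks like, the snippet below gathers the toy data from the example above into a single dataset and lets the sampler split it across processes. It assumes OneFlow mirrors the PyTorch-style data API (oneflow.utils.data.DataLoader, TensorDataset, and distributed.DistributedSampler); check the API reference for the exact import paths.

  import oneflow as flow
  from oneflow.utils.data import DataLoader, TensorDataset
  from oneflow.utils.data.distributed import DistributedSampler

  # The same toy data as above, gathered into one dataset instead of
  # being split by hand.
  x = flow.tensor([[1, 2], [2, 3], [4, 6], [3, 1]], dtype=flow.float32)
  y = flow.tensor([[8], [13], [26], [9]], dtype=flow.float32)
  dataset = TensorDataset(x, y)

  # The sampler hands each process a disjoint subset of indices, so the
  # Dataloader instance on every rank loads a different part of the data.
  sampler = DistributedSampler(dataset)
  loader = DataLoader(dataset, batch_size=1, sampler=sampler)

  for batch_x, batch_y in loader:
      batch_x = batch_x.to("cuda")
      batch_y = batch_y.to("cuda")
      # forward / loss / backward with the ddp-wrapped module goes here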
