CONSISTENT VIEW

The concept of consistent view in OneFlow is introduced to simplify distributed training. In short, the cluster is abstracted as a “Super Computing Device” under OneFlow consistent view.

Instead of caring about the details of computing and communication in a cluster, users can program like on a single node, and OneFlow can train the model in a distributed way.

(Figure: consistent view)

OneFlow’s consistent view relies on several important concepts: Placement, SBP and SBP Signature.

Placement

In consistent view, a OneFlow Tensor has a placement attribute, which specifies the physical devices the Tensor is placed on.

OneFlow automatically numbers the devices in the cluster. For example, if a cluster has four hosts and each host has eight cards, then the four hosts correspond to IDs 0, 1, 2 and 3, and the cards on each host are numbered 0 to 7. To place a Tensor on the first four cards of machine 0, simply configure: placement("cuda", {0: [0, 1, 2, 3]}).
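The numbering scheme above can be sketched in a few lines of plain Python. This is not the OneFlow API, just an illustration of how a `{host_id: [card_ids]}` configuration maps to globally numbered devices; the helper name and the eight-cards-per-host constant are assumptions taken from the example in the text.

```python
CARDS_PER_HOST = 8  # assumption from the example: each host has eight cards

def global_device_ids(placement_ranks):
    """Flatten {host_id: [card_ids]} into global device IDs,
    numbering devices host by host (illustrative helper, not OneFlow API)."""
    ids = []
    for host_id, card_ids in sorted(placement_ranks.items()):
        for card_id in card_ids:
            ids.append(host_id * CARDS_PER_HOST + card_id)
    return ids

# The first four cards on machine 0:
print(global_device_ids({0: [0, 1, 2, 3]}))  # [0, 1, 2, 3]
# The first two cards on machines 0 and 1:
print(global_device_ids({0: [0, 1], 1: [0, 1]}))  # [0, 1, 8, 9]
```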

Placement makes it easy for OneFlow to support pipeline parallelism, and we will see examples of placement in other articles on this topic.

SBP

SBP is a concept unique to OneFlow. It describes the mapping from data seen from the “Super Computing Device” perspective to data on the real physical devices in a cluster, and its name combines the initials of three words: split, broadcast, partial.

In detail:

  • split means that the physical Tensors are obtained by splitting the logical Tensor along a certain dimension. An axis parameter indicates the dimension of the split; concatenating the physical Tensors along this dimension restores the logical Tensor.
  • broadcast means that each physical Tensor is an exact copy of the logical Tensor.
  • partial means that the physical Tensor has the same shape as the logical Tensor, but each value in it is only a part of the value at the corresponding position in the logical Tensor. Adding up the physical Tensors element-wise restores the logical Tensor. Besides sum, operations such as min and max are also available for partial.
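The three mappings above can be sketched with nested lists standing in for Tensors. This is a minimal plain-Python illustration, not OneFlow code; all function names here are hypothetical.

```python
def split0(t, n):
    """split(0): cut the logical tensor into n row blocks."""
    rows = len(t) // n
    return [t[i * rows:(i + 1) * rows] for i in range(n)]

def broadcast(t, n):
    """broadcast: every device holds a full copy of the logical tensor."""
    return [t for _ in range(n)]

def restore_split0(parts):
    """Concatenating split(0) parts restores the logical tensor."""
    return [row for part in parts for row in part]

def restore_partial_sum(parts):
    """Adding partial sum parts element-wise restores the logical tensor."""
    return [[sum(p[i][j] for p in parts) for j in range(len(parts[0][0]))]
            for i in range(len(parts[0]))]

logical = [[1, 2], [3, 4]]
assert restore_split0(split0(logical, 2)) == logical
assert broadcast(logical, 2)[0] == logical
# Two partial sum physical tensors that add up to the logical tensor:
assert restore_partial_sum([[[1, 0], [2, 1]], [[0, 2], [1, 3]]]) == logical
```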

The figures below show some examples of SBP, including split(0), split(1), broadcast and partial sum.

(Figure: SBP examples)

When you create a Consistent Tensor, you can specify its SBP. An example will be given in the next article: Consistent Tensor.

SBP Signature

SBP describes the mapping between the data under the consistent view and the data on the physical devices. When doing distributed training, OneFlow distributes the data to the physical devices and computes the results according to the SBP attributes of the data.

For an isolated Tensor, we can set its SBP attributes at will. However, for an operator with input and output data, we cannot set the SBP attributes of its inputs and outputs arbitrarily, because an arbitrary combination may not conform to the algorithm of the operator under the consistent view.

Let us discuss this problem with the example of matrix multiplication, looking at which combinations of input and output SBP are legal and which are illegal in a distributed system with two devices.

Suppose, from the consistent view, that a matrix $x$ with the shape $(m, k)$ is multiplied by a matrix $w$ with the shape $(k, n)$ to get $y$; the shape of $y$ must be $(m, n)$.

According to the rules of matrix multiplication, we can split the matrix $x$ along dimension 0 into two matrices $x_0$ and $x_1$, with the shapes $(m_0, k)$ and $(m_1, k)$ respectively:

Device 1:

$$x_0 \times w = y_0$$

Device 2:

$$x_1 \times w = y_1$$

It is easy to see the relationship among the physical Tensors $x_0$, $x_1$ and the Tensor $x$ under the consistent view, as well as the relationship between $y_0$, $y_1$ and the consistent view data $y$:

$$x = concat(x_0, x_1)$$

$$y = concat(y_0, y_1)$$

Note: The concat above represents a concatenate operation.

In this way, by distributing the data to the physical devices, it is possible to execute the operation and get the correct result from the consistent view. The long story above is, when described in SBP, surprisingly simple:

$x$ is split(0), $w$ is broadcast, and $y$ is split(0).
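This combination can be checked numerically. The following plain-Python sketch (illustrative only, not OneFlow code) splits $x$ into two row blocks, multiplies each block by a full copy of $w$ on its own "device", and confirms that concatenating the partial outputs reproduces the logical product.

```python
def matmul(a, b):
    """Plain-Python matrix product, standing in for the matmul operator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # logical x, shape (4, 2)
w = [[1, 1], [0, 2]]                   # logical w, shape (2, 2)

x0, x1 = x[:2], x[2:]                  # split(0): row blocks of x
y0 = matmul(x0, w)                     # device 1: x0 × w (w is broadcast)
y1 = matmul(x1, w)                     # device 2: x1 × w
assert y0 + y1 == matmul(x, w)         # concat(y0, y1) equals logical y
```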

We can see that for matrix multiplication, the SBP of its inputs and output combined in the above way is legal. And it is not the only valid SBP combination for matrix multiplication; others include:

$x$ is broadcast, $w$ is split(1), and $y$ is split(1).

Or:

$x$ is split(1), $w$ is split(0), and $y$ is partial sum.
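The partial sum combination can also be verified with a small sketch (again plain Python, not OneFlow code): each "device" multiplies a column block of $x$ by the matching row block of $w$, producing a full-shape partial result, and the element-wise sum of the two partial results equals the logical $y$.

```python
def matmul(a, b):
    """Plain-Python matrix product, standing in for the matmul operator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]       # logical x, shape (2, 4)
w = [[1, 0], [2, 1], [0, 3], [1, 1]]   # logical w, shape (4, 2)

x0 = [row[:2] for row in x]            # split(1): first two columns of x
x1 = [row[2:] for row in x]            # split(1): last two columns of x
w0, w1 = w[:2], w[2:]                  # split(0): row blocks of w

p0 = matmul(x0, w0)                    # device 1: a partial sum result
p1 = matmul(x1, w1)                    # device 2: a partial sum result
y = [[p0[i][j] + p1[i][j] for j in range(2)] for i in range(2)]
assert y == matmul(x, w)               # partial sums add up to logical y
```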

While we showed multiple valid SBP combinations above, not all SBP combinations are valid. For example, for matrix multiplication, if $x$ and $w$ are both split(0), then the devices would hold:

$$x_0: (m_0, k) \qquad w_0: (k_0, n)$$

$$x_1: (m_1, k) \qquad w_1: (k_1, n)$$

Because the shapes of $x_0$ and $w_0$ do not meet the requirements of matrix multiplication, it is impossible to compute the matrix multiplication on the physical devices. We can say that the combination of $x$ as split(0) and $w$ as split(0) is illegal.

We define a specific, valid SBP combination of the inputs and outputs of an operator, as shown above, as an SBP Signature of the operator.

Every operator in OneFlow presets all of its possible SBP Signatures according to its operation rules. The user only needs to set the placement and SBP attributes of the data; OneFlow selects an appropriate SBP Signature automatically, and the selection process is transparent to the user.
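As a rough illustration of what "presetting SBP Signatures" means, the sketch below hand-writes the valid signatures derived for matrix multiplication in this article as a lookup table, and infers the output SBP from the input SBPs. This is only a toy model of the idea, not OneFlow's actual implementation; all names are hypothetical.

```python
# (SBP of x, SBP of w) -> SBP of y, for y = x × w, as derived above.
MATMUL_SIGNATURES = {
    ("split(0)", "broadcast"): "split(0)",
    ("broadcast", "split(1)"): "split(1)",
    ("split(1)", "split(0)"): "partial_sum",
}

def infer_output_sbp(sbp_x, sbp_w):
    """Return y's SBP if the input combination is a valid signature,
    otherwise reject it as illegal (toy model, not the OneFlow API)."""
    key = (sbp_x, sbp_w)
    if key not in MATMUL_SIGNATURES:
        raise ValueError(f"illegal SBP combination: {key}")
    return MATMUL_SIGNATURES[key]

assert infer_output_sbp("split(0)", "broadcast") == "split(0)"
# Both inputs split(0) is illegal for matmul, as shown above:
try:
    infer_output_sbp("split(0)", "split(0)")
except ValueError:
    pass
```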

Conclusion

Placement, SBP and SBP Signature are the important guarantees behind OneFlow's distributed consistent view, which makes distributed training in OneFlow as simple as training on a single machine with a single card.

In the next article Consistent Tensor, we’ll show you an example of programming under the consistent view.
