AUTOGRAD

The training process of a neural network is powered by the backpropagation algorithm. During backpropagation, the parameters are updated using the gradients of the loss function with respect to those parameters.

OneFlow provides an autograd engine that automatically calculates the gradients of the parameters in a neural network.

We will first introduce the basic concepts of computation graphs, which are helpful for understanding the common settings and limitations of OneFlow's automatic differentiation. Then we will introduce OneFlow's commonly used automatic differentiation interfaces.

Computation Graph

Computation graphs are composed of tensors and operators. The following code builds a simple one:

    import oneflow as flow

    def loss(y_pred, y):
        return flow.sum(1/2*(y_pred-y)**2)

    x = flow.ones(1, 5)  # input
    w = flow.randn(5, 3, requires_grad=True)
    b = flow.randn(1, 3, requires_grad=True)
    z = flow.matmul(x, w) + b

    y = flow.zeros(1, 3)  # label
    l = loss(z, y)

Corresponding computation graph:

[figure: computation graph of the code above, with leaf nodes x, w, b, y and root node loss]

In a computation graph, nodes that only have outputs and no inputs are called leaf nodes, such as x, w, b, and y; nodes that only have inputs and no outputs are called root nodes, such as loss.

During backpropagation, the gradients of l with respect to w and b are required in order to update w and b. Therefore, requires_grad needs to be set to True when creating them.
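The leaf/non-leaf distinction can be inspected directly. The following is a small sketch, assuming OneFlow exposes the is_leaf attribute on tensors in the same way as PyTorch:

    # Leaf nodes are created directly by the user rather than produced by operators.
    print(x.is_leaf, w.is_leaf, b.is_leaf, y.is_leaf)  # True True True True
    print(z.is_leaf, l.is_leaf)                        # False False

    # Only w and b were created with requires_grad=True, so only their .grad
    # will be populated during backpropagation.
    print(w.requires_grad, b.requires_grad)            # True True
    print(x.requires_grad, y.requires_grad)            # False False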

Automatic Gradient

backward() and Gradient

During backpropagation, we need the gradients of l with respect to w and b, i.e. $\frac{\partial l}{\partial w}$ and $\frac{\partial l}{\partial b}$. We only need to call the backward() method of l, and OneFlow will automatically calculate the gradients and store them in w.grad and b.grad.

    l.backward()
    print(w.grad)
    print(b.grad)

Output:

    tensor([[0.9397, 2.5428, 2.5377],
            [0.9397, 2.5428, 2.5377],
            [0.9397, 2.5428, 2.5377],
            [0.9397, 2.5428, 2.5377],
            [0.9397, 2.5428, 2.5377]], dtype=oneflow.float32)
    tensor([[0.9397, 2.5428, 2.5377]], dtype=oneflow.float32)
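These gradients are what the training loop uses to update the parameters. Below is a minimal sketch of a hand-written gradient-descent step, assuming OneFlow supports PyTorch-style in-place sub_ inside no_grad(); the learning rate lr is a value chosen only for illustration, and in practice an optimizer such as oneflow.optim.SGD would normally be used instead:

    lr = 0.1  # hypothetical learning rate, for illustration only

    # The parameter update itself must not be traced by autograd,
    # so it is wrapped in no_grad().
    with flow.no_grad():
        w.sub_(lr * w.grad)
        b.sub_(lr * b.grad)

    # Clear the accumulated gradients before the next forward/backward pass.
    w.grad.zeros_()
    b.grad.zeros_()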

Gradient for Non-leaf Nodes

By default, only the gradients of leaf nodes with requires_grad=True are retained. The grad of a non-leaf node is automatically freed during the call to backward() and cannot be viewed afterwards.

Tensor.retain_grad() can be called to retain and view the grad of a non-leaf node:

    from math import pi

    n1 = flow.tensor(pi/2, requires_grad=True)
    n2 = flow.sin(n1)
    n2.retain_grad()
    n3 = flow.pow(n2, 2)
    n3.backward()
    print(n1.grad)
    print(n2.grad)

Using the code above, we get $\frac{\partial n_3}{\partial n_1} = 2\sin(n_1)\cos(n_1)$, which is approximately 0 at $n_1 = \pi/2$, and $\frac{\partial n_3}{\partial n_2} = 2\sin(n_1) = 2$.

Output:

    tensor(-8.7423e-08, dtype=oneflow.float32)
    tensor(2., dtype=oneflow.float32)
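For comparison, if retain_grad() is not called, the gradient of the non-leaf node is freed, and we expect n2.grad to be None in the sketch below (assuming OneFlow behaves like PyTorch here):

    n1 = flow.tensor(pi/2, requires_grad=True)
    n2 = flow.sin(n1)        # non-leaf node; retain_grad() is NOT called
    n3 = flow.pow(n2, 2)
    n3.backward()
    print(n1.grad)           # the leaf gradient is still available
    print(n2.grad)           # expected to be None: the non-leaf grad was not retained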

Call backward() Multiple Times on a Computation Graph

By default, we can only call backward() once for each computation graph. For example, the following code will raise an error:

    n1 = flow.tensor(10., requires_grad=True)
    n2 = flow.pow(n1, 2)
    n2.backward()
    n2.backward()

Error message:

Maybe you try to backward through the node a second time. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

If we need to call backward() multiple times on the same computation graph, retain_graph needs to be True.

    n1 = flow.tensor(10., requires_grad=True)
    n2 = flow.pow(n1, 2)
    n2.backward(retain_graph=True)
    print(n1.grad)
    n2.backward()
    print(n1.grad)

Output:

    tensor(20., dtype=oneflow.float32)
    tensor(40., dtype=oneflow.float32)

The above output shows that OneFlow accumulates the gradients computed by successive calls to backward(). The gradient can be cleared by calling zeros_():

    n1 = flow.tensor(10., requires_grad=True)
    n2 = flow.pow(n1, 2)
    n2.backward(retain_graph=True)
    print(n1.grad)
    n1.grad.zeros_()
    n2.backward()
    print(n1.grad)

Output:

    tensor(20., dtype=oneflow.float32)
    tensor(20., dtype=oneflow.float32)
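Gradient accumulation is not specific to calling backward() twice on the same graph: gradients from separate forward passes also accumulate into the same .grad until it is cleared. A sketch of this behavior (the expected values are noted in comments):

    n1 = flow.tensor(10., requires_grad=True)

    # First forward/backward pass: d(n1**2)/d(n1) = 2 * 10 = 20
    flow.pow(n1, 2).backward()
    print(n1.grad)  # expected: tensor(20., dtype=oneflow.float32)

    # Second, independent forward/backward pass: d(3*n1)/d(n1) = 3 is added on top
    (3 * n1).backward()
    print(n1.grad)  # expected: tensor(23., dtype=oneflow.float32), i.e. 20 + 3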

Disabling Gradient Calculation

By default, OneFlow traces Tensors with requires_grad=True and calculates their gradients. However, in some cases we do not need OneFlow to keep tracing gradients, for example when we only want the forward pass for inference. In such cases we can use oneflow.no_grad() or oneflow.Tensor.detach() to disable gradient tracking.

    z = flow.matmul(x, w) + b
    print(z.requires_grad)

    with flow.no_grad():
        z = flow.matmul(x, w) + b
    print(z.requires_grad)

Output:

    True
    False

Alternatively, detach() returns a tensor that is excluded from gradient tracking:

    z_det = z.detach()
    print(z_det.requires_grad)

Output:

    False

Gradients for Non-Scalar Outputs

Usually, we call backward() on a scalar loss.

However, if the loss is a non-scalar tensor, calling backward() on it will raise an error:

    x = flow.randn(1, 2, requires_grad=True)
    y = 3*x + 1
    y.backward()

Error message:

Check failed: IsScalarTensor(*outputs.at(i)) Grad can be implicitly created only for scalar outputs

We can get the gradient after reducing y to a scalar with y.sum():

    x = flow.randn(1, 2, requires_grad=True)
    y = 3*x + 1
    y = y.sum()
    y.backward()
    print(x.grad)

Output:

    tensor([[3., 3.]], dtype=oneflow.float32)

Please refer to the "Further Reading" section below for an analysis of the cause of this error and of how the fix works.

Further Reading

There are two elements $x_1$ and $x_2$ in Tensor x, and two elements $y_1$ and $y_2$ in Tensor y. The relationship between them is:

$$y_1 = 3x_1 + 1$$

$$y_2 = 3x_2 + 1$$

When we call y.backward(), we are asking for

$$\texttt{x.grad} = \left[ \frac{\partial \mathbf{y}}{\partial x_1}, \frac{\partial \mathbf{y}}{\partial x_2} \right]$$

Because y is a vector rather than a scalar, $\frac{\partial \mathbf{y}}{\partial x_1}$ and $\frac{\partial \mathbf{y}}{\partial x_2}$ are not well-defined scalars, so of course an error is reported. In fact, when the user calls y.backward(), the result desired is usually:

$$\texttt{x.grad} = \left[ \frac{\partial y_1}{\partial x_1}, \frac{\partial y_2}{\partial x_2} \right]$$

After calling sum() on y:

$$y = y_1 + y_2 = (3x_1 + 1) + (3x_2 + 1)$$

y is now a scalar, so when backward() is called, the gradients with respect to $x_1$ and $x_2$ can be calculated:

$$\frac{\partial y}{\partial x_1} = 3$$

$$\frac{\partial y}{\partial x_2} = 3$$

In addition to using sum(), the Vector-Jacobian Product (VJP) is a more general method for calculating the gradient of a non-scalar root node. Using the above example, OneFlow generates the Jacobian matrix according to the computation graph during backpropagation:

$$J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\[4pt] \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}$$

To calculate the VJP, a vector $\mathbf{v}$ with the same size as $\mathbf{y}$ needs to be provided:

$$\mathbf{v}^{\top} J = \begin{bmatrix} v_1 & v_2 \end{bmatrix} \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\[4pt] \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} v_1 \frac{\partial y_1}{\partial x_1} + v_2 \frac{\partial y_2}{\partial x_1} & v_1 \frac{\partial y_1}{\partial x_2} + v_2 \frac{\partial y_2}{\partial x_2} \end{bmatrix}$$

If the vector $\mathbf{v}$ happens to be the gradient from the upper layer in backpropagation, the result of the VJP is exactly the gradient required by the current layer.

backward() can accept a tensor as a parameter; that parameter is the $\mathbf{v}$ in the VJP. Therefore, we can also calculate the gradient of a non-scalar tensor as follows:

    x = flow.randn(1, 2, requires_grad=True)
    y = 3*x + 1
    y.backward(flow.ones_like(y))
    print(x.grad)

Output:

    tensor([[3., 3.]], dtype=oneflow.float32)
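Passing flow.ones_like(y) reproduces the effect of sum(). A different vector $\mathbf{v}$ weights each output's contribution to the gradient; for example, the sketch below uses $\mathbf{v} = [1, 2]$ and is expected to yield $\left[ v_1 \frac{\partial y_1}{\partial x_1}, v_2 \frac{\partial y_2}{\partial x_2} \right] = [3, 6]$:

    x = flow.randn(1, 2, requires_grad=True)
    y = 3*x + 1

    # v has the same shape as y and weights the two outputs differently in the VJP.
    v = flow.tensor([[1., 2.]])
    y.backward(v)
    print(x.grad)  # expected: tensor([[3., 6.]], dtype=oneflow.float32)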
