gradient – Symbolic Differentiation

Symbolic gradient is usually computed from gradient.grad(), which offers a more convenient syntax for the common case of wanting the gradient of some scalar cost with respect to some input expressions. The grad_sources_inputs() function does the underlying work, and is more flexible, but is also more awkward to use when gradient.grad() can do the job.
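
For example, a minimal sketch of the common case (the variable names here are illustrative, not part of the API):

import theano
import theano.tensor as T

x = T.dscalar('x')
cost = x ** 2                           # scalar cost
g = theano.gradient.grad(cost, wrt=x)   # symbolic gradient d(cost)/dx

f = theano.function([x], g)
print(f(3.0))                           # 6.0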

Gradient related functions

Driver for gradient calculations.

  • exception theano.gradient.DisconnectedInputError[source]
  • Raised when grad is asked to compute the gradient with respect to a disconnected input and disconnected_inputs='raise'.
  • class theano.gradient.DisconnectedType[source]
  • A type indicating that a variable is the result of taking the gradient of c with respect to x when c is not a function of x. A symbolic placeholder for 0, conveying the extra information that this gradient is 0 because it is disconnected.
  • exception theano.gradient.GradientError(arg, err_pos, shape, val1, val2, abs_err, rel_err, abs_tol, rel_tol)[source]
  • This error is raised when a calculated gradient is incorrect.
  • theano.gradient.Lop(f, wrt, eval_points, consider_constant=None, disconnected_inputs='raise')[source]
  • Computes the L operation on f with respect to wrt at eval_points.

Mathematically this stands for the Jacobian of f with respect to wrt, left-multiplied by the eval points.

Parameters:

  • f (Variable or list of Variables) – f stands for the output of the computational graph to which you want to apply the L operator
  • wrt (Variable or list of Variables) – variables for which you compute the L operator of the expression described by f
  • eval_points (Variable or list of Variables) – evaluation points for each of the variables in f

Returns: Symbolic expression such that L_op[j] = sum_i (d f[i] / d wrt[j]) eval_point[i], where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the list. If f is a list/tuple, then return a list/tuple with the results.

Return type: Variable or list/tuple of Variables depending on the type of f.
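
As an illustrative sketch (assuming an elementwise f so the Jacobian is diagonal; the names are not part of the API):

import theano
import theano.tensor as T

x = T.dvector('x')
f = x ** 2                              # elementwise, so d f[i] / d x[j] = 2*x[i] if i == j else 0
v = T.dvector('v')                      # evaluation point, same shape as f

lop = theano.gradient.Lop(f, x, v)      # sum_i v[i] * d f[i] / d x[j]  ->  2 * x * v
fn = theano.function([x, v], lop)
print(fn([1., 2., 3.], [1., 1., 1.]))   # approximately [2., 4., 6.]
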
  • exception theano.gradient.NullTypeGradError[source]
  • Raised when grad encounters a NullType.
  • theano.gradient.Rop(f, wrt, eval_points, disconnected_outputs='raise', return_disconnected='zero')[source]
  • Computes the R operation on f with respect to wrt at eval_points.

Mathematically this stands for the Jacobian of f with respect to wrt, right-multiplied by the eval points.

Parameters:

  • f (Variable or list of Variables) – f stands for the output of the computational graph to which you want to apply the R operator
  • wrt (Variable or list of Variables) – variables for which you compute the R operator of the expression described by f
  • eval_points (Variable or list of Variables) – evaluation points for each of the variables in wrt
  • disconnected_outputs (str) – Defines the behaviour if some of the variables in f have no dependency on any of the variables in wrt (or if all links are non-differentiable). The possible values are:

    • ‘ignore’: considers that the gradient on these parameters is zero.
    • ‘warn’: consider the gradient zero, and print a warning.
    • ‘raise’: raise DisconnectedInputError.
  • return_disconnected ({'zero', 'None', 'Disconnected'}) –
    • ‘zero’ : If wrt[i] is disconnected, return value i will be wrt[i].zeros_like()
    • ‘None’ : If wrt[i] is disconnected, return value i will be None
    • ‘Disconnected’ : returns variables of type DisconnectedType

Returns: Symbolic expression such that Rop[i] = sum_j (d f[i] / d wrt[j]) eval_point[j], where the indices in that expression are magic multidimensional indices that specify both the position within a list and all coordinates of the tensor element in the list. If f is a list/tuple, then return a list/tuple with the results.

Return type: Variable or list/tuple of Variables depending on the type of f.
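
A corresponding sketch for the R operator (again with an elementwise f; the names are illustrative):

import theano
import theano.tensor as T

x = T.dvector('x')
f = T.tanh(x)                           # elementwise; d f[i] / d x[j] = (1 - tanh(x[i])**2) if i == j else 0
v = T.dvector('v')                      # evaluation point, same shape as wrt (here x)

rop = theano.gradient.Rop(f, x, v)      # sum_j (d f[i] / d x[j]) * v[j]  ->  (1 - tanh(x)**2) * v
fn = theano.function([x, v], rop)
print(fn([0., 1.], [1., 1.]))           # approximately [1.0, 0.42]
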
  • theano.gradient.consider_constant(x)[source]
  • DEPRECATED: use zero_grad() or disconnected_grad() instead.

Consider an expression constant when computing gradients.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. In other words, the gradient of the expression is truncated to 0.

Parameters: x – A Theano expression whose gradient should be truncated.
Returns: The expression is returned unmodified, but its gradient is now truncated to 0.

New in version 0.7.

  • theano.gradient.disconnected_grad(x)[source]
  • Consider an expression constant when computing gradients.

The gradient will effectively not be backpropagated through it.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will not be backpropagated through. This is effectively equivalent to truncating the gradient expression to 0, but is executed faster than zero_grad(), which still has to go through the underlying computational graph related to the expression.

Parameters: x (Variable) – A Theano expression whose gradient should not be backpropagated through.
Returns: An expression equivalent to x, with its gradient now effectively truncated to 0.
Return type: Variable
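
For example (a minimal sketch; the names are illustrative):

import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2
cost = x + theano.gradient.disconnected_grad(y)   # treat y as a constant w.r.t. x
g = theano.gradient.grad(cost, x)                 # only the direct x term contributes
print(theano.function([x], g)(3.0))               # 1.0 (it would be 7.0 without disconnected_grad)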

  • theano.gradient.format_as(use_list, use_tuple, outputs)[source]
  • Formats the outputs according to the flags use_list and use_tuple.

If use_list is True, outputs is returned as a list (if outputs is not a list or a tuple then it is converted into a one-element list). If use_tuple is True, outputs is returned as a tuple (if outputs is not a list or a tuple then it is converted into a one-element tuple). Otherwise (if both flags are False), outputs is returned.

  • theano.gradient.grad(cost, wrt, consider_constant=None, disconnected_inputs='raise', add_names=True, known_grads=None, return_disconnected='zero', null_gradients='raise')[source]
  • Return symbolic gradients of one cost with respect to one or more variables.

For more information about how automatic differentiation works in Theano, see gradient. For information on how to implement the gradient of a certain Op, see grad().

Parameters:

  • cost (Variable scalar (0-dimensional) tensor variable or None) – Value that we are differentiating (that we want the gradient of). May be None if known_grads is provided.
  • wrt (Variable or list of Variables) – Term[s] with respect to which we want gradients
  • consider_constant (list of variables) – Expressions not to backpropagate through
  • disconnected_inputs ({'ignore', 'warn', 'raise'}) – Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:

    • ‘ignore’: considers that the gradient on these parameters is zero.
    • ‘warn’: consider the gradient zero, and print a warning.
    • ‘raise’: raise DisconnectedInputError.
  • add_names (bool) – If True, variables generated by grad will be named (d<cost.name>/d<wrt.name>) provided that both cost and wrt have names
  • known_grads (OrderedDict, optional) – An ordered dictionary mapping variables to their gradients. This is useful in the case where you know the gradient on some variables but do not know the original cost.
  • return_disconnected ({'zero', 'None', 'Disconnected'}) –
    • ‘zero’ : If wrt[i] is disconnected, return value i will be wrt[i].zeros_like()
    • ‘None’ : If wrt[i] is disconnected, return value i will be None
    • ‘Disconnected’ : returns variables of type DisconnectedType
  • null_gradients ({'raise', 'return'}) – Defines the behaviour if some of the variables in wrt have a null gradient. The possible values are:

    • ‘raise’ : raise a NullTypeGradError exception
    • ‘return’ : return the null gradients

Returns: Symbolic expression of the gradient of cost with respect to each of the wrt terms. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned.

Return type: Variable or list/tuple of Variables (matches wrt)
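
As a sketch of the known_grads mechanism (an illustrative example, not taken from the original documentation):

from collections import OrderedDict

import theano
import theano.tensor as T

x = T.dvector('x')
y = x ** 2
g_y = T.dvector('g_y')                  # suppose the gradient of some unseen cost w.r.t. y is given

# No cost is needed: backpropagation starts from the supplied gradient on y.
g_x = theano.gradient.grad(cost=None, wrt=x, known_grads=OrderedDict([(y, g_y)]))
fn = theano.function([x, g_y], g_x)
print(fn([1., 2.], [1., 1.]))           # [2., 4.]  (chain rule: g_y * dy/dx)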

  • theano.gradient.grad_clip(x, lower_bound, upper_bound)[source]
  • This op is a view in the forward pass, but clips the gradient during backpropagation.

This is an elemwise operation.

Parameters:

  • x – The variable whose gradient is to be clipped
  • lower_bound – The lower bound of the gradient value
  • upper_bound – The upper bound of the gradient value.

Examples

>>> x = theano.tensor.scalar()
>>> z = theano.tensor.grad(grad_clip(x, -1, 1)**2, x)
>>> z2 = theano.tensor.grad(x**2, x)
>>> f = theano.function([x], outputs=[z, z2])
>>> print(f(2.0))
[array(1.0), array(4.0)]

Note

We register an optimization in tensor/opt.py that removes the GradClip op, so it has zero cost in the forward pass and only does work in the gradient.

  • theano.gradient.grad_not_implemented(op, x_pos, x, comment='')[source]
  • Return an un-computable symbolic variable of type x.type.

If any call to tensor.grad results in an expression containing this un-computable variable, an exception (NotImplementedError) will be raised indicating that the gradient on the x_pos'th input of op has not been implemented. Likewise if any call to theano.function involves this variable.

Optionally adds a comment to the exception explaining why this gradient is not implemented.
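
A sketch of typical usage inside a custom Op's grad method. The Op below is hypothetical and shown only to illustrate where grad_not_implemented fits; it is not part of the library:

import theano
import theano.tensor as T
from theano.gof import Op, Apply

class ScaleByFirst(Op):
    """Hypothetical Op computing a * b, with only the gradient w.r.t. a implemented."""
    __props__ = ()

    def make_node(self, a, b):
        a = T.as_tensor_variable(a)
        b = T.as_tensor_variable(b)
        return Apply(self, [a, b], [a.type()])

    def perform(self, node, inputs, output_storage):
        a, b = inputs
        output_storage[0][0] = a * b

    def grad(self, inputs, output_grads):
        a, b = inputs
        (g_out,) = output_grads
        # Gradient w.r.t. a is implemented; w.r.t. b it is declared not implemented.
        return [g_out * b,
                theano.gradient.grad_not_implemented(
                    self, 1, b, comment='gradient w.r.t. the second input is not implemented')]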

  • theano.gradient.grad_scale(x, multiplier)[source]
  • This op scales the gradient in backpropagation (a negative multiplier reverses it).

Parameters:

  • x – The variable whose gradient is to be scaled
  • multiplier – Scale of the gradient

Examples

>>> x = theano.tensor.fscalar()
>>> fx = theano.tensor.sin(x)
>>> fp = theano.tensor.grad(fx, wrt=x)
>>> fprime = theano.function([x], fp)
>>> print(fprime(2))
-0.416...
>>> f_inverse = grad_scale(fx, -1.)
>>> fpp = theano.tensor.grad(f_inverse, wrt=x)
>>> fpprime = theano.function([x], fpp)
>>> print(fpprime(2))
0.416...
  • theano.gradient.grad_undefined(op, x_pos, x, comment='')[source]
  • Return an un-computable symbolic variable of type x.type.

If any call to tensor.grad results in an expression containing this un-computable variable, an exception (GradUndefinedError) will be raised indicating that the gradient on the x_pos'th input of op is mathematically undefined. Likewise if any call to theano.function involves this variable.

Optionally adds a comment to the exception explaining why this gradient is not defined.

  • theano.gradient.hessian(cost, wrt, consider_constant=None, disconnected_inputs='raise')[source]

Parameters:

  • cost (Scalar (0-dimensional) variable.) –
  • wrt (Vector (1-dimensional tensor) Variable or list of vectors (1-dimensional tensors) Variables) –
  • consider_constant – a list of expressions not to backpropagate through
  • disconnected_inputs (string) – Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:

    • ‘ignore’: considers that the gradient on these parameters is zero.
    • ‘warn’: consider the gradient zero, and print a warning.
    • ‘raise’: raise an exception.

Returns: The Hessian of the cost with respect to (elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of the same type as wrt: a list/tuple or TensorVariable in all cases.

Return type: Variable or list/tuple of Variables
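
For example (a minimal sketch with illustrative names):

import theano
import theano.tensor as T

x = T.dvector('x')
cost = T.sum(x ** 2)                        # scalar cost of a vector input
H = theano.gradient.hessian(cost, wrt=x)    # symbolic (len(x), len(x)) Hessian
fn = theano.function([x], H)
print(fn([1., 2.]))                         # [[2., 0.], [0., 2.]]
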
  • theano.gradient.jacobian(expression, wrt, consider_constant=None, disconnected_inputs='raise')[source]
  • Compute the full Jacobian, row by row.

Parameters:

  • expression (Vector (1-dimensional) Variable) – Values that we are differentiating (that we want the Jacobian of)
  • wrt (Variable or list of Variables) – Term[s] with respect to which we compute the Jacobian
  • consider_constant (list of variables) – Expressions not to backpropagate through
  • disconnected_inputs (string) – Defines the behaviour if some of the variables in wrt are not part of the computational graph computing cost (or if all links are non-differentiable). The possible values are:

    • ‘ignore’: considers that the gradient on these parameters is zero.
    • ‘warn’: consider the gradient zero, and print a warning.
    • ‘raise’: raise an exception.

Returns: The Jacobian of expression with respect to (elements of) wrt. If an element of wrt is not differentiable with respect to the output, then a zero variable is returned. The return value is of the same type as wrt: a list/tuple or TensorVariable in all cases.

Return type: Variable or list/tuple of Variables (depending upon wrt)
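
For example (a minimal sketch; the names are illustrative):

import theano
import theano.tensor as T

x = T.dvector('x')
y = x ** 3                                  # elementwise, so the Jacobian is diagonal
J = theano.gradient.jacobian(y, wrt=x)
fn = theano.function([x], J)
print(fn([1., 2.]))                         # [[3., 0.], [0., 12.]]
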
  • class theano.gradient.numeric_grad(f, pt, eps=None, out_type=None)[source]
  • Compute the numeric derivative of a scalar-valued function at a particular point.

    • static abs_rel_err(a, b)[source]
    • Return absolute and relative error between a and b.

The relative error is a small number when a and b are close, relative to how big they are.

Formulas used:

abs_err = abs(a - b)

rel_err = abs_err / max(abs(a) + abs(b), 1e-8)

The denominator is clipped at 1e-8 to avoid dividing by 0 when a and b are both close to 0.

The tuple (abs_err, rel_err) is returned.
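
Restating the formulas above with plain numpy (purely illustrative, not the library code):

import numpy as np

a, b = 1.0, 1.001
abs_err = np.abs(a - b)                                        # ~0.001
rel_err = abs_err / np.maximum(np.abs(a) + np.abs(b), 1e-8)    # ~0.0005
print(abs_err, rel_err)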

  • abs_rel_errors(g_pt)[source]
  • Return the abs and rel error of gradient estimate g_pt

g_pt must be a list of ndarrays of the same length as self.gf, otherwise a ValueError is raised.

Corresponding ndarrays in g_pt and self.gf must have the same shape, or a ValueError is raised.

  • max_err(g_pt, abs_tol, rel_tol)[source]
  • Find the biggest error between g_pt and self.gf.

What is measured is the violation of relative and absolute errors, with respect to the provided tolerances (abs_tol, rel_tol). A value > 1 means both tolerances are exceeded.

Return the argmax of min(abs_err / abs_tol, rel_err / rel_tol) over g_pt, as well as abs_err and rel_err at this point.

  • theano.gradient.subgraph_grad(wrt, end, start=None, cost=None, details=False)[source]
  • With respect to wrt, computes gradients of cost and/or from existing start gradients, up to the end variables of a symbolic digraph. In other words, computes gradients for a subgraph of the symbolic Theano function. Ignores all disconnected inputs.

This can be useful when one needs to perform gradient descent iteratively (e.g. one layer at a time in an MLP), or when a particular operation is not differentiable in Theano (e.g. stochastic sampling from a multinomial). In the latter case, the gradient of the non-differentiable process could be approximated by a user-defined formula, which could be calculated using the gradients of a cost with respect to samples (0s and 1s). These gradients are obtained by performing a subgraph_grad from the cost or previously known gradients (start) up to the outputs of the stochastic process (end). A dictionary mapping gradients obtained from the user-defined differentiation of the process to variables could then be fed into another subgraph_grad as start, with any other cost (e.g. weight decay).

In an MLP, we could use subgraph_grad to iteratively backpropagate:

import numpy as np
import theano

x, t = theano.tensor.fvector('x'), theano.tensor.fvector('t')
w1 = theano.shared(np.random.randn(3, 4))
w2 = theano.shared(np.random.randn(4, 2))
a1 = theano.tensor.tanh(theano.tensor.dot(x, w1))
a2 = theano.tensor.tanh(theano.tensor.dot(a1, w2))
cost2 = theano.tensor.sqr(a2 - t).sum()
cost2 += theano.tensor.sqr(w2.sum())
cost1 = theano.tensor.sqr(w1.sum())

params = [[w2], [w1]]
costs = [cost2, cost1]
grad_ends = [[a1], [x]]

next_grad = None
param_grads = []
for i in range(2):
    param_grad, next_grad = theano.subgraph_grad(
        wrt=params[i], end=grad_ends[i],
        start=next_grad, cost=costs[i]
    )
    next_grad = dict(zip(grad_ends[i], next_grad))
    param_grads.extend(param_grad)

Parameters:

  • wrt (list of variables) – Gradients are computed with respect to wrt.
  • end (list of variables) – Theano variables at which to end gradient descent (they are considered constant in theano.grad). For convenience, the gradients with respect to these variables are also returned.
  • start (dictionary of variables) – If not None, a dictionary mapping variables to their gradients. This is useful when the gradients on some variables are known. These are used to compute the gradients backwards up to the variables in end (they are used as known_grads in theano.grad).
  • cost (Variable scalar (0-dimensional) variable) – Additional costs for which to compute the gradients. For example, these could be weight decay, an l1 constraint, MSE, NLL, etc. May optionally be None if start is provided.

Warning

If the gradient of cost with respect to any of the start variables is already part of the start dictionary, then it may be counted twice with respect to wrt and end.

  • details (bool) – When True, additionally returns the list of gradients from start and of cost, respectively, with respect to wrt (not end).

Returns: Lists of gradients with respect to wrt and end, respectively.

Return type: Tuple of 2 or 4 Lists of Variables

New in version 0.7.

  • theano.gradient.undefined_grad(x)[source]
  • Consider the gradient of this variable undefined.

This will generate an error message if its gradient is taken.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, an error message will be generated specifying that such a gradient is not defined.

Parameters: x (Variable) – A Theano expression whose gradient should be undefined.
Returns: An expression equivalent to x, with its gradient undefined.
Return type: Variable
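
For example (an illustrative sketch):

import theano
import theano.tensor as T

x = T.dscalar('x')
y = theano.gradient.undefined_grad(x)
# Taking this gradient raises an error (a NullTypeGradError under grad's
# default null_gradients='raise'), because the gradient of y is declared undefined.
g = theano.gradient.grad(y, x)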

  • theano.gradient.verify_grad(fun, pt, n_tests=2, rng=None, eps=None, out_type=None, abs_tol=None, rel_tol=None, mode=None, cast_to_output_type=False, no_debug_ref=True)[source]
  • Test a gradient by Finite Difference Method. Raise error on failure.

Raises an Exception if the difference between the analytic gradient and numerical gradient (computed through the Finite Difference Method) of a random projection of the fun's output to a scalar exceeds the given tolerance.

Examples

>>> verify_grad(theano.tensor.tanh,
...             (np.asarray([[2, 3, 4], [-1, 3.3, 9.9]]),),
...             rng=np.random)

Parameters:

  • fun (a Python function) – fun takes Theano variables as inputs, and returns a Theano variable. For instance, an Op instance with a single output.
  • pt (list of numpy.ndarrays) – Input values, points where the gradient is estimated. These arrays must be either float16, float32, or float64 arrays.
  • n_tests (int) – number of times to run the test
  • rng (numpy.random.RandomState, optional) – random number generator used to sample the output random projection u; we test the gradient of sum(u * fun) at pt
  • eps (float, optional) – step size used in the Finite Difference Method (the default None is type-dependent). Raising the value of eps can raise or lower the absolute and relative errors of the verification depending on the Op. Raising eps does not lower the verification quality for linear operations. It is better to raise eps than to raise abs_tol or rel_tol.
  • out_type (string) – dtype of output, if complex (i.e., ‘complex32’ or ‘complex64’)
  • abs_tol (float) – absolute tolerance used as threshold for gradient comparison
  • rel_tol (float) – relative tolerance used as threshold for gradient comparison
  • cast_to_output_type (bool) – if the output is float32 and cast_to_output_type is True, cast the random projection to float32. Otherwise it is float64. float16 is not handled here.
  • no_debug_ref (bool) – Don’t use DebugMode for the numerical gradient function.

Note

This function does not support multiple outputs. In tests/test_scan.py there is an experimental verify_grad that covers that case as well by using random projections.

  • theano.gradient.zero_grad(x)[source]
  • Consider an expression constant when computing gradients.

The expression itself is unaffected, but when its gradient is computed, or the gradient of another expression that this expression is a subexpression of, it will be backpropagated through with a value of zero. In other words, the gradient of the expression is truncated to 0.

Parameters: x (Variable) – A Theano expression whose gradient should be truncated.
Returns: An expression equivalent to x, with its gradient truncated to 0.
Return type: Variable
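
For example (a minimal sketch; compare with disconnected_grad() above):

import theano
import theano.tensor as T

x = T.dscalar('x')
cost = x + theano.gradient.zero_grad(x ** 2)   # the x**2 branch backpropagates a zero gradient
g = theano.gradient.grad(cost, x)
print(theano.function([x], g)(3.0))            # 1.0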

List of Implemented R op

See the gradient tutorial for the R op documentation.

  • list of ops that support R-op:
      • with test [most are in tensor/tests/test_rop.py]
        • SpecifyShape
        • MaxAndArgmax
        • Subtensor
        • IncSubtensor (set_subtensor too)
        • Alloc
        • Dot
        • Elemwise
        • Sum
        • Softmax
        • Shape
        • Join
        • Rebroadcast
        • Reshape
        • Flatten
        • DimShuffle
        • Scan [In scan_module/tests/test_scan.test_rop]
      • without test
        • Split
        • ARange
        • ScalarFromTensor
        • AdvancedSubtensor1
        • AdvancedIncSubtensor1
        • AdvancedIncSubtensor

Partial list of ops without support for R-op:

  • All sparse ops
  • All linear algebra ops.
  • PermuteRowElements
  • Tile
  • AdvancedSubtensor
  • TensorDot
  • Outer
  • Prod
  • MulwithoutZeros
  • ProdWithoutZeros
  • CAReduce (for max, … done for the MaxAndArgmax op)
  • MaxAndArgmax (only for matrix on axis 0 or 1)