Making arithmetic Ops on double

Now that we have a double type, we have yet to use it to perform computations. We’ll start by defining multiplication.

Op’s contract

An Op is any object which inherits from gof.Op. It has to define the following methods.

  • make_node(*inputs)
  • This method is responsible for creating output Variables of a suitable symbolic Type to serve as the outputs of this Op’s application. The Variables found in *inputs must be operated on using Theano’s symbolic language to compute the symbolic output Variables. This method should put these outputs into an Apply instance, and return the Apply instance.

This method creates an Apply node representing the application of the Op on the inputs provided. If the Op cannot be applied to these inputs, it must raise an appropriate exception.

The inputs of the Apply instance returned by this call must be ordered correctly: a subsequent self.make_node(*apply.inputs) must produce something equivalent to the first apply.

  • perform(node, inputs, output_storage)
  • This method computes the function associated with this Op. node is an Apply node created by the Op’s make_node method. inputs is a list of references to data to operate on using non-symbolic statements (i.e., statements in Python or Numpy). output_storage is a list of storage cells where the outputs of the computation must be put.

More specifically:

  • node: This is a reference to an Apply node which was previously obtained via the Op’s make_node method. It is typically not used in simple Ops, but it contains symbolic information that could be required for complex Ops.

  • inputs: This is a list of data from which the values stored in output_storage are to be computed using non-symbolic language.

  • output_storage: This is a list of storage cells where the output is to be stored. A storage cell is a one-element list. It is forbidden to change the length of the list(s) contained in output_storage. There is one storage cell for each output of the Op.

    The data put in output_storage must match the type of the symbolic output. This is a situation where the node argument can come in handy.

A function Mode may allow output_storage elements to persist between evaluations, or it may reset output_storage cells to hold a value of None. It can also pre-allocate some memory for the Op to use. This feature can allow perform to reuse memory between calls, for example. If there is something preallocated in output_storage, it will be of the correct dtype, but can have the wrong shape and have any stride pattern.

This method must be determined by the inputs: that is, if it is evaluated once on inputs A and returned B, then whenever inputs C, equal to A, are presented again, outputs equal to B must be returned again.

You must be careful about aliasing outputs to inputs, and making modifications to any of the inputs. See Views and inplace operations before writing a perform implementation that does either of these things.

Instead of (or in addition to) perform(), you can also provide a C implementation. For more details, refer to the documentation for Op.
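The storage-cell protocol that perform must follow can be sketched in plain Python, without Theano. The names and the addition are illustrative only; a real perform receives data filtered by the Op's input Types:

```python
# Illustrative sketch of the perform() storage-cell protocol.
# Each output cell is a one-element list; perform writes into cell[0].

def perform(node, inputs, output_storage):
    # Hypothetical elementwise addition over two numeric inputs.
    x, y = inputs
    z = output_storage[0]        # the one-element storage cell
    z[0] = x + y                 # write the result *into* the cell

# Simulate one evaluation the way a compiled function would drive it:
storage = [[None]]               # one cell per output, reset to None
perform(None, [3.0, 4.0], storage)
print(storage[0][0])             # → 7.0
```

Note that perform mutates the cell rather than returning a value; this is what lets the runtime share and recycle the storage between calls.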

  • __eq__(other)
  • other is also an Op.

Returning True here is a promise to the optimization system that the other Op will produce exactly the same graph effects (from perform) as this one, given identical inputs. This means it will produce the same output values, it will destroy the same inputs (same destroy_map), and will alias outputs to the same inputs (same view_map). For more details, see Views and inplace operations.

Note

If you set __props__, this will be automatically generated.

  • __hash__()
  • If two Op instances compare equal, then they must return the same hash value.

Equally important, this hash value must not change during the lifetime of self. Op instances should be immutable in this sense.

Note

If you set __props__, this will be automatically generated.

Optional methods or attributes

  • __props__
  • Default: Undefined

Must be a tuple. Lists the names of the attributes which influence the computation performed. This will also enable the automatic generation of appropriate __eq__, __hash__ and __str__ methods. Should be set to () if you have no attributes that are relevant to the computation, in order to generate the methods.

New in version 0.7.
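The behaviour __props__ enables can be sketched in plain Python: equality, hashing and the string form are all derived from the listed attribute values. This is an illustrative analogue, not Theano's actual implementation:

```python
import operator

# Sketch: how __props__-style __eq__/__hash__/__str__ can be derived.
class PropsOp:
    __props__ = ("name", "fn")

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def _props(self):
        # The tuple of attribute values that define this Op's identity.
        return tuple(getattr(self, p) for p in self.__props__)

    def __eq__(self, other):
        return type(self) == type(other) and self._props() == other._props()

    def __hash__(self):
        return hash((type(self), self._props()))

    def __str__(self):
        return "%s{%s}" % (type(self).__name__,
                           ", ".join("%s=%r" % (p, getattr(self, p))
                                     for p in self.__props__))

a = PropsOp("add", operator.add)
b = PropsOp("add", operator.add)
print(a == b)              # True: same props, so the Ops compare equal
print(hash(a) == hash(b))  # True: equal Ops hash identically
```

Because equality is defined purely by the props tuple, the optimization system can safely merge two applications of equal Ops.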

  • default_output
  • Default: None

If this member variable is an integer, then the default implementation of __call__ will return node.outputs[self.default_output], where node was returned by make_node. Otherwise, the entire list of outputs will be returned, unless it is of length 1, where the single element will be returned by itself.
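The selection logic just described can be sketched as a small helper; the function name is hypothetical and only illustrates the rule:

```python
def select_outputs(outputs, default_output=None):
    # Mirrors the described __call__ behaviour: an integer default_output
    # picks one output; otherwise a single output is unwrapped, and
    # multiple outputs are returned as the full list.
    if default_output is not None:
        return outputs[default_output]
    return outputs[0] if len(outputs) == 1 else outputs

print(select_outputs(["y"]))           # 'y'  (single output, unwrapped)
print(select_outputs(["a", "b"], 1))   # 'b'  (default_output=1)
print(select_outputs(["a", "b"]))      # ['a', 'b']  (full list)
```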

  • make_thunk(node, storage_map, compute_map, no_recycling, impl=None)
  • This function must return a thunk, that is, a zero-argument function that encapsulates the computation to be performed by this op on the arguments of the node.

Parameters:

  • node – Apply instance. The node for which a thunk is requested.
  • storage_map – dict of lists. This maps variables to one-element lists holding the variable’s current value. The one-element list acts as a pointer to the value and allows sharing that “pointer” with other nodes and instances.
  • compute_map – dict of lists. This maps variables to one-element lists holding booleans. If the value is 0 then the variable has not been computed and the value should not be considered valid. If the value is 1 the variable has been computed and the value is valid. If the value is 2 the variable has been garbage-collected and is no longer valid, but shouldn’t be required anymore for this call.
  • no_recycling – WRITEME
  • impl – None, ‘c’ or ‘py’. Which implementation to use.

The returned function must ensure that it sets the computed variables as computed in the compute_map.

Defining this function removes the requirement for perform() or C code, as you will define the thunk for the computation yourself.
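The thunk contract can be sketched in plain Python. Here the maps are simplified stand-ins for Theano's storage_map and compute_map, and the multiplication is a hypothetical computation:

```python
# Sketch: a make_thunk-style closure over simplified storage/compute maps.
def make_thunk(inputs, output, storage_map, compute_map):
    # inputs/output are variable names (stand-ins for Theano Variables).
    def thunk():
        x, y = (storage_map[v][0] for v in inputs)
        storage_map[output][0] = x * y     # hypothetical mul computation
        compute_map[output][0] = 1         # mark the output as computed
    return thunk

storage_map = {"x": [2.0], "y": [3.0], "z": [None]}
compute_map = {"z": [0]}
thunk = make_thunk(["x", "y"], "z", storage_map, compute_map)
thunk()                                    # zero-argument call
print(storage_map["z"][0], compute_map["z"][0])  # 6.0 1
```

The key point is that the thunk takes no arguments: everything it needs is captured through the shared one-element lists, which is also how results are communicated back.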

  • __call__(*inputs, **kwargs)
  • By default this is a convenience function which calls make_node() with the supplied arguments and returns the result indexed by default_output. This can be overridden by subclasses to do anything else, but must return either a theano Variable or a list of Variables.

If you feel the need to override __call__ to change the graph based on the arguments, you should instead create a function that will use your Op and build the graphs that you want, and call that instead of the Op instance directly.

  • infer_shape(node, shapes)
  • This function is needed for shape optimization. shapes is a list with one tuple for each input of the Apply node (which corresponds to the inputs of the op). Each tuple contains as many elements as the number of dimensions of the corresponding input. The value of each element is the shape (number of items) along the corresponding dimension of that specific input.

While this might sound complicated, it is nothing more than the shape of each input as symbolic variables (one per dimension).

The function should return a list with one tuple for each output. Each tuple should contain the corresponding output’s computed shape.

Implementing this method will allow Theano to compute the output’s shape without computing the output itself, potentially sparing you a costly recomputation.
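For an elementwise binary op like the mul defined below, infer_shape just forwards an input's shape. A sketch, with plain shape tuples standing in for Theano's symbolic shapes:

```python
def infer_shape(node, shapes):
    # Hypothetical elementwise binary op: both inputs share a shape,
    # and the single output has that same shape.
    x_shape, y_shape = shapes
    assert x_shape == y_shape
    # One tuple per output, here a single output:
    return [x_shape]

print(infer_shape(None, [(3, 4), (3, 4)]))  # [(3, 4)]
```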

  • flops(inputs, outputs)
  • This method is only used to have more information printed by the memory profiler. It makes it print the mega flops and giga flops per second for each apply node. It takes as inputs two lists: one for the inputs and one for the outputs. They contain tuples that are the shapes of the corresponding inputs/outputs.
  • __str__()
  • This allows you to specify a more informative string representation of your Op. If an Op has parameters, it is highly recommended to have the __str__ method include the name of the op and the Op’s parameters’ values.

Note

If you set __props__, this will be automatically generated. You can still override it for custom output.

  • do_constant_folding(node)
  • Default: Return True

By default, when optimizations are enabled, we remove during function compilation Apply nodes whose inputs are all constants. We replace the Apply node with a Theano constant variable. This way, the Apply node is not executed at each function call. If you want to force the execution of an op during the function call, make do_constant_folding return False.

As done in the Alloc op, you can return False only in some cases byanalyzing the graph from the node parameter.

  • debug_perform(node, inputs, output_storage)
  • Undefined by default.

If you define this function then it will be used instead of C code or perform() to do the computation while debugging (currently DebugMode, but others may also use it in the future). It has the same signature and contract as perform().

This enables ops whose normal behaviour causes trouble with DebugMode to adopt a different one when run under that mode. If your op doesn’t have any problems, don’t implement this.

If you want your op to work with gradient.grad() you also need to implement the functions described below.

Gradient

These are the functions required to work with gradient.grad().

  • grad(inputs, output_gradients)
  • If the Op being defined is differentiable, its gradient may be specified symbolically in this method. Both inputs and output_gradients are lists of symbolic Theano Variables and those must be operated on using Theano’s symbolic language. The grad method must return a list containing one Variable for each input. Each returned Variable represents the gradient with respect to that input computed based on the symbolic gradients with respect to each output.

If the output is not differentiable with respect to an input then this method should be defined to return a variable of type NullType for that input. Likewise, if you have not implemented the grad computation for some input, you may return a variable of type NullType for that input. theano.gradient contains convenience methods that can construct the variable for you: theano.gradient.grad_undefined() and theano.gradient.grad_not_implemented(), respectively.

If an element of output_gradients is of type theano.gradient.DisconnectedType, it means that the cost is not a function of this output. If any of the op’s inputs participate in the computation of only disconnected outputs, then Op.grad should return DisconnectedType variables for those inputs.

If the grad method is not defined, then Theano assumes it has been forgotten. Symbolic differentiation will fail on a graph that includes this Op.

It must be understood that the Op’s grad method is not meant to return the gradient of the Op’s output. theano.tensor.grad computes gradients; Op.grad is a helper function that computes terms that appear in gradients.

If an Op has a single vector-valued output y and a single vector-valued input x, then the grad method will be passed x and a second vector z. Define J to be the Jacobian of y with respect to x. The Op’s grad method should return dot(J.T, z). When theano.tensor.grad calls the grad method, it will set z to be the gradient of the cost C with respect to y. If this op is the only op that acts on x, then dot(J.T, z) is the gradient of C with respect to x. If there are other ops that act on x, theano.tensor.grad will have to add up the terms of x’s gradient contributed by the other ops’ grad methods.

In practice, an op’s input and output are rarely implemented as single vectors. Even if an op’s output consists of a list containing a scalar, a sparse matrix, and a 4D tensor, you can think of these objects as being formed by rearranging a vector. Likewise for the input. In this view, the values computed by the grad method still represent a Jacobian-vector product.

In practice, it is probably not a good idea to explicitly construct the Jacobian, which might be very large and very sparse. However, the returned value should be equal to the Jacobian-vector product.
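For elementwise multiplication, z = x * y, this product is easy to write down without building a Jacobian: dC/dx = gz * y and dC/dy = gz * x. A numeric sketch, with plain Python floats standing in for symbolic Variables:

```python
# Sketch: the grad() of a hypothetical elementwise mul returns the
# Jacobian-transpose times the output gradient, i.e. [gz * y, gz * x].
def mul_grad(inputs, output_gradients):
    x, y = inputs
    gz, = output_gradients
    return [gz * y, gz * x]

# Numeric sanity check against a finite difference, for C = (x * y)**2,
# where the incoming output gradient is gz = dC/dz = 2 * z:
x, y = 3.0, 5.0
gz = 2.0 * (x * y)
gx, gy = mul_grad([x, y], [gz])

eps = 1e-6
fd_gx = (((x + eps) * y) ** 2 - (x * y) ** 2) / eps
print(abs(gx - fd_gx) < 1e-3)  # True
```

Note that mul_grad never sees the cost C; it only combines the incoming gz with its own local derivatives, which is exactly the chain-rule contract described above.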

So long as you implement this product correctly, you need not understand what theano.tensor.grad is doing, but for the curious the mathematical justification is as follows:

In essence, the grad method must simply implement, through symbolic Variables and operations, the chain rule of differential calculus. The chain rule is the mathematical procedure that allows one to calculate the total derivative \frac{d C}{d x} of the final scalar symbolic Variable C with respect to a primitive symbolic Variable x found in the list inputs. The grad method does this using output_gradients, which provides the total derivative \frac{d C}{d f} of C with respect to a symbolic Variable that is returned by the Op (this is provided in output_gradients), as well as the knowledge of the total derivative \frac{d f}{d x} of the latter with respect to the primitive Variable (this has to be computed).

In mathematics, the total derivative of a scalar variable (C) with respect to a vector of scalar variables (x), i.e. the gradient, is customarily represented as the row vector of the partial derivatives, whereas the total derivative of a vector of scalar variables (f) with respect to another (x) is customarily represented by the matrix of the partial derivatives, i.e. the Jacobian matrix. In this convenient setting, the chain rule instructs that the gradient of the final scalar variable C with respect to the primitive scalar variables in x through those in f is simply given by the matrix product: \frac{d C}{d x} = \frac{d C}{d f} \cdot \frac{d f}{d x}.

Here, the chain rule must be implemented in a similar but slightly more complex setting: Theano provides in the list output_gradients one gradient for each of the Variables returned by the Op. Where f is one such particular Variable, the corresponding gradient found in output_gradients and representing \frac{d C}{d f} is provided with a shape similar to f and thus not necessarily as a row vector of scalars. Furthermore, for each Variable x of the Op’s list of input variables inputs, the returned gradient representing \frac{d C}{d x} must have a shape similar to that of Variable x.

If the output list of the op is [f_1, ..., f_n], then the list output_gradients is [grad_{f_1}(C), grad_{f_2}(C), ..., grad_{f_n}(C)]. If inputs consists of the list [x_1, ..., x_m], then Op.grad should return the list [grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)], where (grad_{y}(Z))_i = \frac{\partial Z}{\partial y_i} (and i can stand for multiple dimensions).

In other words, grad() does not return \frac{d f_i}{d x_j}, but instead the appropriate dot product specified by the chain rule: \frac{d C}{d x_j} = \frac{d C}{d f_i} \cdot \frac{d f_i}{d x_j}. Both the partial differentiation and the multiplication have to be performed by grad().

Theano currently imposes the following constraints on the values returned by the grad method:

  • They must be Variable instances.
  • When they are types that have dtypes, they must never have an integer dtype. The output gradients passed to Op.grad will also obey these constraints.

Integers are a tricky subject. Integers are the main reason for having DisconnectedType, NullType or zero gradient. When you have an integer as an argument to your grad method, recall the definition of a derivative to help you decide what value to return:

\frac{d f}{d x} = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon.

Suppose your function f has an integer-valued output. For most functions you’re likely to implement in theano, this means your gradient should be zero, because f(x+epsilon) = f(x) for almost all x. (The only other option is that the gradient could be undefined, if your function is discontinuous everywhere, like the rational indicator function.)

Suppose your function f has an integer-valued input. This is a little trickier, because you need to think about what you mean mathematically when you make a variable integer-valued in theano. Most of the time in machine learning we mean “f is a function of a real-valued x, but we are only going to pass in integer values of x”. In this case, f(x+epsilon) exists, so the gradient through f should be the same whether x is an integer or a floating point variable. Sometimes what we mean is “f is a function of an integer-valued x, and f is only defined where x is an integer.” Since f(x+epsilon) doesn’t exist, the gradient is undefined. Finally, many times in theano, integer-valued inputs don’t actually affect the elements of the output, only its shape.

If your function f has both an integer-valued input and an integer-valued output, then both rules have to be combined:

  • If f is defined at (x+epsilon), then the input gradient is defined. Since f(x+epsilon) would be equal to f(x) almost everywhere, the gradient should be 0 (first rule).
  • If f is only defined where x is an integer, then the gradient is undefined, regardless of what the gradient with respect to the output is. Examples:

    • f(x,y) = dot product between x and y. x and y are integers.
    • Since the output is also an integer, f is a step function. Its gradient is zero almost everywhere, so Op.grad should return zeros in the shape of x and y.
    • f(x,y) = dot product between x and y. x is floating point and y is an integer.
    • In this case the output is floating point. It doesn’t matter that y is an integer. We consider f to still be defined at f(x,y+epsilon). The gradient is exactly the same as if y were floating point.
    • f(x,y) = argmax of x along axis y.
    • The gradient with respect to y is undefined, because f(x,y) is not defined for floating point y. How could you take an argmax along a fractional axis? The gradient with respect to x is 0, because f(x+epsilon, y) = f(x) almost everywhere.
    • f(x,y) = a vector with y elements, each of which taking on the value x
    • The grad method should return DisconnectedType()() for y, because the elements of f don’t depend on y. Only the shape of f depends on y. You probably also want to implement a connection_pattern method to encode this.
    • f(x) = int(x) converts float x into an int. g(y) = float(y) converts an integer y into a float.
    • If the final cost C = 0.5 * g(y) = 0.5 * g(f(x)), then the gradient with respect to y will be 0.5, even if y is an integer. However, the gradient with respect to x will be 0, because the output of f is integer-valued.
  • connection_pattern(node):
  • Sometimes needed for proper operation of gradient.grad().

Returns a list of lists of bools.

Op.connection_pattern[input_idx][output_idx] is true if the elements of inputs[input_idx] have an effect on the elements of outputs[output_idx].

The node parameter is needed to determine the number of inputs. Some ops such as Subtensor take a variable number of inputs.

If no connection_pattern is specified, gradient.grad will assume that all inputs have some elements connected to some elements of all outputs.

This method conveys two pieces of information that are otherwise not part of the theano graph:

  • Which of the op’s inputs are truly ancestors of each of the op’s outputs. Suppose an op has two inputs, x and y, and outputs f(x) and g(y). y is not really an ancestor of f, but it appears to be so in the theano graph.
  • Whether the actual elements of each input/output are relevant to a computation. For example, the shape op does not read its input’s elements, only its shape metadata. d shape(x) / dx should thus raise a disconnected input exception (if these exceptions are enabled). As another example, the elements of the Alloc op’s outputs are not affected by the shape arguments to the Alloc op.

Failing to implement this function for an op that needs it can result in two types of incorrect behavior:

  • gradient.grad erroneously raising a TypeError reporting that a gradient is undefined.

  • gradient.grad failing to raise a ValueError reporting that an input is disconnected.

Even if connection_pattern is not implemented correctly, if gradient.grad returns an expression, that expression will be numerically correct.
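The Alloc example above illustrates the shape of the return value: only the value input affects the output's elements, not the shape arguments. A sketch, where the node stand-in only carries an input count:

```python
def connection_pattern(node):
    # Hypothetical Alloc-like op: inputs are [value, *shape_args] and
    # there is one output. Only `value` is connected to the output's
    # elements; the shape arguments are not.
    n_inputs = len(node.inputs)
    return [[True]] + [[False]] * (n_inputs - 1)

class FakeNode:
    # Stand-ins for the Apply node's input Variables.
    inputs = ["value", "dim0", "dim1"]

print(connection_pattern(FakeNode()))  # [[True], [False], [False]]
```

The node argument matters here because the number of shape arguments, and hence the number of rows in the pattern, varies per application.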
  • R_op(inputs, eval_points)
  • Optional, to work with gradient.R_op().

This function implements the application of the R-operator on the function represented by your op. Let us assume that function is f, with input x; applying the R-operator means computing the Jacobian of f and right-multiplying it by v, the evaluation point, namely: \frac{\partial f}{\partial x} v.

inputs are the symbolic variables corresponding to the value of the input where you want to evaluate the jacobian, and eval_points are the symbolic variables corresponding to the value you want to right-multiply the jacobian with.

The same conventions as for the grad method hold. If your op is not differentiable, you can return None. Note that in contrast to the method grad(), for R_op() you need to return the same number of outputs as there are outputs of the op. You can think of it in the following terms. You have all your inputs concatenated into a single vector x. You do the same with the evaluation points (which are as many as the inputs and of the same shape) and obtain another vector v. For each output, you reshape it into a vector, compute the jacobian of that vector with respect to x and multiply it by v. As a last step you reshape each of these vectors you obtained for each output (that have the same shape as the outputs) back to their corresponding shapes and return them as the output of the R_op() method.
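For elementwise mul, z = x * y, the product rule gives the Jacobian-vector product directly: y * vx + x * vy. A numeric sketch with plain floats standing in for symbolic Variables:

```python
# Sketch: R_op of a hypothetical elementwise mul.  For z = x * y the
# JVP is (dz/dx) * vx + (dz/dy) * vy = y * vx + x * vy.
def mul_R_op(inputs, eval_points):
    x, y = inputs
    vx, vy = eval_points
    return [y * vx + x * vy]        # one entry per output of the op

x, y, vx, vy = 3.0, 5.0, 0.5, 2.0
jvp, = mul_R_op([x, y], [vx, vy])

# Finite-difference check: z(x + eps*vx, y + eps*vy) ~ z(x, y) + eps*jvp
eps = 1e-6
fd = ((x + eps * vx) * (y + eps * vy) - x * y) / eps
print(abs(jvp - fd) < 1e-3)  # True
```

Where grad() computes a Jacobian-transpose product (one term per input), R_op() computes a Jacobian product (one term per output), which is why the lengths of the returned lists differ.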

See the list of ops with R_op support.

Defining an Op: mul

We’ll define multiplication as a binary operation, even though a multiplication Op could take an arbitrary number of arguments.

First, we’ll instantiate a mul Op:

    from theano import gof
    mul = gof.Op()

make_node

This function must take as many arguments as the operation we are defining is supposed to take as inputs; in this example that would be two. This function ensures that both inputs have the double type. Since multiplying two doubles yields a double, this function makes an Apply node with an output Variable of type double.

    def make_node(x, y):
        if x.type != double or y.type != double:
            raise TypeError('mul only works on doubles')
        return gof.Apply(mul, [x, y], [double()])
    mul.make_node = make_node

The first two lines make sure that both inputs are Variables of the double type that we created in the previous section. We would not want to multiply two arbitrary types; it would not make much sense (and we’d be screwed when we implement this in C!)

The last line is the meat of the definition. There we create an Apply node representing the application of Op mul to inputs x and y, giving a Variable instance of type double as the output.

Note

Theano relies on the fact that if you call the make_node method of Apply’s first argument on the inputs passed as the Apply’s second argument, the call will not fail and the returned Apply instance will be equivalent. This is how graphs are copied.

perform

This code actually computes the function. In our example, the data in inputs will be instances of Python’s built-in type float because this is the type that double.filter() will always return, per our own definition. output_storage will contain a single storage cell for the multiplication’s variable.

    def perform(node, inputs, output_storage):
        x, y = inputs[0], inputs[1]
        z = output_storage[0]
        z[0] = x * y
    mul.perform = perform

Here, z is a list of one element. By default, z == [None].

Note

It is possible that z does not contain None. If it contains anything else, Theano guarantees that whatever it contains is what perform put there the last time it was called with this particular storage. Furthermore, Theano gives you permission to do whatever you want with z’s contents, chiefly reusing it or the memory allocated for it. More information can be found in the Op documentation.

Warning

We gave z the Theano type double in make_node, which means that a Python float must be put there. You should not put, say, an int in z[0] because Theano assumes Ops handle typing properly.

Trying out our new Op

In the following code, we use our new Op:

    >>> import theano
    >>> x, y = double('x'), double('y')
    >>> z = mul(x, y)
    >>> f = theano.function([x, y], z)
    >>> f(5, 6)
    30.0
    >>> f(5.6, 6.7)
    37.519999999999996

Note that there is an implicit call to double.filter() on each argument, so if we give integers as inputs they are magically cast to the right type. Now, what if we try this?

    >>> x = double('x')
    >>> z = mul(x, 2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/u/breuleuo/hg/theano/theano/gof/op.py", line 207, in __call__
      File "<stdin>", line 2, in make_node
    AttributeError: 'int' object has no attribute 'type'

Automatic Constant Wrapping

Well, OK. We’d like our Op to be a bit more flexible. This can be done by modifying make_node to accept Python int or float as x and/or y:

    def make_node(x, y):
        if isinstance(x, (int, float)):
            x = gof.Constant(double, x)
        if isinstance(y, (int, float)):
            y = gof.Constant(double, y)
        if x.type != double or y.type != double:
            raise TypeError('mul only works on doubles')
        return gof.Apply(mul, [x, y], [double()])
    mul.make_node = make_node

Whenever we pass a Python int or float instead of a Variable as x or y, make_node will convert it to a Constant for us. gof.Constant is a Variable whose value we know statically.

    >>> import numpy
    >>> x = double('x')
    >>> z = mul(x, 2)
    >>> f = theano.function([x], z)
    >>> f(10)
    20.0
    >>> numpy.allclose(f(3.4), 6.8)
    True

Now the code works the way we want it to.

Note

Most Theano Ops follow this convention of up-casting literal make_node arguments to Constants. This makes typing expressions more natural. If you do not want a constant somewhere in your graph, you have to pass a Variable (like double('x') here).

Final version

The above example is pedagogical. When you define other basic arithmetic operations add, sub and div, code for make_node can be shared between these Ops. Here is a revised implementation of these four arithmetic operators:

    from theano import gof

    class BinaryDoubleOp(gof.Op):

        __props__ = ("name", "fn")

        def __init__(self, name, fn):
            self.name = name
            self.fn = fn

        def make_node(self, x, y):
            if isinstance(x, (int, float)):
                x = gof.Constant(double, x)
            if isinstance(y, (int, float)):
                y = gof.Constant(double, y)
            if x.type != double or y.type != double:
                raise TypeError('%s only works on doubles' % self.name)
            return gof.Apply(self, [x, y], [double()])

        def perform(self, node, inp, out):
            x, y = inp
            z, = out
            z[0] = self.fn(x, y)

        def __str__(self):
            return self.name

    add = BinaryDoubleOp(name='add',
                         fn=lambda x, y: x + y)

    sub = BinaryDoubleOp(name='sub',
                         fn=lambda x, y: x - y)

    mul = BinaryDoubleOp(name='mul',
                         fn=lambda x, y: x * y)

    div = BinaryDoubleOp(name='div',
                         fn=lambda x, y: x / y)

Instead of working directly on an instance of Op, we create a subclass of Op that we can parametrize. All the operations we define are binary. They all work on two inputs with type double. They all return a single Variable of type double. Therefore, make_node does the same thing for all these operations, except for the Op reference self passed as the first argument to Apply. We define perform using the function fn passed in the constructor.

This design is a flexible way to define basic operations without duplicating code. The same way a Type subclass represents a set of structurally similar types (see previous section), an Op subclass represents a set of structurally similar operations: operations that have the same input/output types, operations that only differ in one small detail, etc. If you see common patterns in several Ops that you want to define, it can be a good idea to abstract out what you can. Remember that an Op is just an object which satisfies the contract described above on this page and that you should use all the tools at your disposal to create these objects as efficiently as possible.

Exercise: Make a generic DoubleOp, where the number of arguments can also be given as a parameter.