Extending Theano with a C Op

Extending Theano with a C Op

This tutorial covers how to extend Theano with an op that offers a Cimplementation. It does not cover ops that run on a GPU but it does introducemany elements and concepts which are relevant for GPU ops. This tutorial isaimed at individuals who already know how to extend Theano (see tutorialCreating a new Op: Python implementation) by adding a new op with a Python implementationand will only cover the additional knowledge required to also produce opswith C implementations.

Providing a Theano op with a C implementation requires to interact withPython’s C-API and Numpy’s C-API. Thus, the first step of this tutorial is tointroduce both and highlight their features which are most relevant to thetask of implementing a C op. This tutorial then introduces the most importantmethods that the op needs to implement in order to provide a usable Cimplementation. Finally, it shows how to combine these elements to write asimple C op for performing the simple task of multiplying every element in avector by a scalar.

Python C-API

Python provides a C-API to allows the manipulation of python objects from Ccode. In this API, all variables that represent Python objects are of typePyObject . All objects have a pointer to their type object and a referencecount field (that is shared with the python side). Most python methods havean equivalent C function that can be called on the PyObject pointer.

As such, manipulating a PyObject instance is often straight-forward but itis important to properly manage its reference count. Failing to do so canlead to undesired behavior in the C code.

Reference counting

Reference counting is a mechanism for keeping track, for an object, ofthe number of references to it held by other entities. This mechanism is oftenused for purposes of garbage collecting because it allows to easily see ifan object is still being used by other entities. When the reference countfor an object drops to 0, it means it is not used by anyone any longer and canbe safely deleted.

PyObjects implement reference counting and the Python C-API defines a numberof macros to help manage those reference counts. The definition of thesemacros can be found here : Python C-API Reference Counting. Listed below are thetwo macros most often used in Theano C ops.

void Py_XINCREF(PyObject *o)
Increments the reference count of object o. Without effect if theobject is NULL.

void Py_XDECREF(PyObject *o)
Decrements the reference count of object o. If the reference countreaches 0, it will trigger a call of the object’s deallocation function.Without effect if the object is NULL.

The general principle, in the reference counting paradigm, is that the ownerof a reference to an object is responsible for disposing properly of it.This can be done by decrementing the reference count once the reference is nolonger used or by transfering ownership; passing on the reference to a newowner which becomes responsible for it.

Some functions return “borrowed references”; this means that they return areference to an object without transfering ownership of the reference to thecaller of the function. This means that if you call a function which returns aborrowed reference, you do not have the burden of properly disposing of thatreference. You should not call Py_XDECREF() on a borrowed reference.

Correctly managing the reference counts is important as failing to do so canlead to issues ranging from memory leaks to segmentation faults.

NumPy C-API

The NumPy library provides a C-API to allow users to create, access andmanipulate NumPy arrays from within their own C routines. NumPy’s ndarraysare used extensively inside Theano and so extending Theano with a C op willrequire interaction with the NumPy C-API.

This sections covers the API’s elements that are often required to write codefor a Theano C op. The full documentation for the API can be found here :NumPy C-API.

NumPy data types

To allow portability between platforms, the NumPy C-API defines its own datatypes which should be used whenever you are manipulating a NumPy array’sinternal data. The data types most commonly used to implement C ops are thefollowing : npy_int{8,16,32,64}, npy_uint{8,16,32,64} andnpy_float{32,64}.

You should use these data types when manipulating a NumPy array’s internaldata instead of C primitives because the size of the memory representationfor C primitives can vary between platforms. For instance, a C long can berepresented in memory with 4 bytes but it can also be represented with 8.On the other hand, the in-memory size of NumPy data types remains constantacross platforms. Using them will make your code simpler and more portable.

The full list of defined data types can be found here :NumPy C-API data types.

NumPy ndarrays

In the NumPy C-API, NumPy arrays are represented as instances of thePyArrayObject class which is a descendant of the PyObject class. This meansthat, as for any other Python object that you manipulate from C code, youneed to appropriatedly manage the reference counts of PyArrayObject instances.

Unlike in a standard multidimensionnal C array, a NumPy array’s internal datarepresentation does not have to occupy a continuous region in memory. In fact,it can be C-contiguous, F-contiguous or non-contiguous. C-contiguous meansthat the data is not only contiguous in memory but also that it is organizedsuch that the index of the latest dimension changes the fastest. If thefollowing array

x = [[1, 2, 3],
     [4, 5, 6]]

is C-contiguous, it means that, in memory, the six values contained in thearray x are stored in the order [1, 2, 3, 4, 5, 6] (the first value isx[0,0], the second value is x[0,1], the third value is x[0,2], the,fourth value is x[1,0], etc). F-contiguous (or Fortran Contiguous) alsomeans that the data is contiguous but that it is organized such that the indexof the latest dimension changes the slowest. If the array x isF-contiguous, it means that, in memory, the values appear in the order[1, 4, 2, 5, 3, 6] (the first value is x[0,0], the second value isx[1,0], the third value is x[0,1], etc).

Finally, the internal data can be non-contiguous. In this case, it occupiesa non-contiguous region in memory but it is still stored in an organizedfashion : the distance between the element x[i,j] and the elementx[i+1,j] of the array is constant over all valid values of i andj, just as the distance between the element x[i,j] and the elementx[i,j+1] of the array is constant over all valid values of i and j.This distance between consecutive elements of an array over a given dimension,is called the stride of that dimension.

Accessing NumPy ndarrays’ data and properties

The following macros serve to access various attributes of NumPy ndarrays.

void PyArray_DATA(PyArrayObject arr)
Returns a pointer to the first element of the array’s data. The returnedpointer must be cast to a pointer of the proper Numpy C-API data typebefore use.

int PyArray_NDIM(PyArrayObject* arr)
Returns the number of dimensions in the the array pointed by arr

npy_intp PyArray_DIMS(PyArrayObject arr)
Returns a pointer on the first element of arr‘s internal arraydescribing its dimensions. This internal array contains as many elementsas the array arr has dimensions.

The macro PyArray_SHAPE() is a synonym of PyArray_DIMS() : it hasthe same effect and is used in an identical way.

npy_intp PyArray_STRIDES(PyArrayObject arr)
Returns a pointer on the first element of arr‘s internal arraydescribing the stride for each of its dimension. This array has as manyelements as the number of dimensions in arr. In this array, thestrides are expressed in number of bytes.

PyArray_Descr PyArray_DESCR(PyArrayObject arr)
Returns a reference to the object representing the dtype of the array.

The macro PyArray_DTYPE() is a synonym of the PyArray_DESCR() : ithas the same effect and is used in an identical way.

Note:This is a borrowed reference so you do not need to decrement itsreference count once you are done with it.

int PyArray_TYPE(PyArrayObject* arr)
Returns the typenumber for the elements of the array. Like the dtype, thetypenumber is a descriptor for the type of the data in the array. However,the two are not synonyms and, as such, cannot be used in place of theother.

npy_intp PyArray_SIZE(PyArrayObject* arr)
Returns to total number of elements in the array

bool PyArray_CHKFLAGS(PyArrayObject* arr, flags)
Returns true if the array has the specified flags. The variable flagshould either be a NumPy array flag or an integer obtained by applyingbitwise or to an ensemble of flags.

The flags that can be used in with this macro are :NPY_ARRAY_C_CONTIGUOUS, NPY_ARRAY_F_CONTIGUOUS, NPY_ARRAY_OWNDATA,NPY_ARRAY_ALIGNED, NPY_ARRAY_WRITEABLE, NPY_ARRAY_UPDATEIFCOPY.

Creating NumPy ndarrays

The following functions allow the creation and copy of NumPy arrays :

PyObject PyArray_EMPTY(int nd, npy_intp dims, typenum dtype,
int fortran)
Constructs a new ndarray with the number of dimensions specified bynd, shape specified by dims and data type specified by dtype.If fortran is equal to 0, the data is organized in a C-contiguouslayout, otherwise it is organized in a F-contiguous layout. The arrayelements are not initialized in any way.

The function PyArray_Empty() performs the same function as the macroPyArray_EMPTY() but the data type is given as a pointer to aPyArray_Descr object instead of a typenum.

PyObject PyArray_ZEROS(int nd, npy_intp dims, typenum dtype,
int fortran)
Constructs a new ndarray with the number of dimensions specified bynd, shape specified by dims and data type specified by dtype.If fortran is equal to 0, the data is organized in a C-contiguouslayout, otherwise it is organized in a F-contiguous layout. Every elementin the array is initialized to 0.

The function PyArray_Zeros() performs the same function as the macroPyArray_ZEROS() but the data type is given as a pointer to aPyArray_Descr object instead of a typenum.

PyArrayObject PyArray_GETCONTIGUOUS(PyObject op)
Returns a C-contiguous and well-behaved copy of the array op. If op isalready C-contiguous and well-behaved, this function simply returns anew reference to op.

Methods the C Op needs to define

There is a key difference between an op defining a Python implementation forits computation and defining a C implementation. In the case of a Pythonimplementation, the op defines a function perform() which executes therequired Python code to realize the op. In the case of a C implementation,however, the op does not define a function that will execute the C code; itinstead defines functions that will return the C code to the caller.

This is because calling C code from Python code comes with a significantoverhead. If every op was responsible for executing its own C code, everytime a Theano function was called, this overhead would occur as many timesas the number of ops with C implementations in the function’s computationalgraph.

To maximize performance, Theano instead requires the C ops to simply returnthe code needed for their execution and takes upon itself the task oforganizing, linking and compiling the code from the various ops. Through this,Theano is able to minimize the number of times C code is called from Pythoncode.

The following is a very simple example to illustrate how it’s possible toobtain performance gains with this process. Suppose you need to execute,from Python code, 10 different ops, each one having a C implementation. Ifeach op was responsible for executing its own C code, the overhead ofcalling C code from Python code would occur 10 times. Consider now the casewhere the ops instead return the C code for their execution. You could getthe C code from each op and then define your own C module that would callthe C code from each op in succession. In this case, the overhead would onlyoccur once; when calling your custom module itself.

Moreover, the fact that Theano itself takes care of compiling the C code,instead of the individual ops, allows Theano to easily cache the compiled Ccode. This allows for faster compilation times.

See Implementing the arithmetic Ops in C for the full documentation of the various methods of theclass Op that are related to the C implementation. Of particular interest are:

The methods Op.c_libraries() and Op.c_lib_dirs() to allowyour op to use external libraries.
The method Op.c_code_cleanup() to specify how the op shouldclean up what it has allocated during its execution.
The methods Op.c_init_code() and Op.c_init_code_apply()to specify code that should be executed once when the module isinitialized, before anything else is executed.
The methods Op.c_compile_args() andOp.c_no_compile_args() to specify requirements regarding howthe op’s C code should be compiled.

This section describes the methods Op.c_code(),Op.c_support_code(), Op.c_support_code_apply() andOp.c_code_cache_version() because they are the ones that are mostcommonly used.

ccode(_node, name, input_names, output_names, sub)
This method returns a string containing the C code to perform thecomputation required by this op.

The node argument is an Apply node representing anapplication of the current Op on a list of inputs, producing a list ofoutputs.

input_names is a sequence of strings which contains as many stringsas the op has inputs. Each string contains the name of the C variableto which the corresponding input has been assigned. For example, the nameof the C variable representing the first input of the op is given byinput_names[0]. You should therefore use this name in yourC code to interact with that variable. output_names is usedidentically to input_names, but for the op’s outputs.

Finally, sub is a dictionary of extras parameters to the c_codemethod. Among other things, it contains sub['fail'] which is a stringof C code that you should include in your C code (after ensuring that aPython exception is set) if it needs to raise an exception. Ex:

c_code = """
    PyErr_Format(PyExc_ValueError, "X does not have the right value");
    %(fail)s;
""" % {'fail' : sub['fail']}

to raise a ValueError Python exception with the specified message.The function PyErr_Format() supports string formatting so it ispossible to tailor the error message to the specifics of the errorthat occured. If PyErr_Format() is called with more than twoarguments, the subsequent arguments are used to format the error messagewith the same behavior as the function PyString_FromFormat(). The% characters in the format characters need to be escaped since the Ccode itself is defined in a string which undergoes string formatting.

c_code = """
    PyErr_Format(PyExc_ValueError,
                 "X==%%i but it should be greater than 0", X);
    %(fail)s;
""" % {'fail' : sub['fail']}

Note:Your C code should not return the output of the computation butrather put the results in the C variables whose names are contained inthe output_names.

c_support_code()
Returns a string containing some support C code for this op. This codewill be included at the global scope level and can be used to definefunctions and structs that will be used by every apply of this op.

csupport_code_apply(_node, name)
Returns a string containing some support C code for this op. This codewill be included at the global scope level and can be used to definefunctions and structs that will be used by this op. The difference betweenthis method and c_support_code() is that the C code specified inc_support_code_apply() should be specific to each apply of the Op,while c_support_code() is for support code that is not specific toeach apply.

Both c_support_code() and c_support_code_apply () are necessarybecause a Theano op can be used more than once in a given Theanofunction. For example, an op that adds two matrices could be used at somepoint in the Theano function to add matrices of integers and, at anotherpoint, to add matrices of doubles. Because the dtype of the inputs andoutputs can change between different applies of the op, any support codethat relies on a certain dtype is specific to a given apply of the op andshould therefore be defined in c_support_code_apply().

c_code_cache_version()
Returns a tuple of integers representing the version of the C code in thisop. Ex : (1, 4, 0) for version 1.4.0

This tuple is used by Theano to cache the compiled C code for this op. Assuch, the return value MUST BE CHANGED every time the C code is alteredor else Theano will disregard the change in the code and simply load aprevious version of the op from the cache. If you want to avoid caching ofthe C code of this op, return an empty tuple or do not implement thismethod.

Note:Theano can handle tuples of any hashable objects as return valuesfor this function but, for greater readability and easier management,this function should return a tuple of integers as previouslydescribed.

Important restrictions when implementing an Op

There are some important restrictions to remember when implementing an Op.Unless your Op correctly defines a view_map attribute, the perform and c_code must notproduce outputs whose memory is aliased to any input (technically, if changing theoutput could change the input object in some sense, they are aliased).Unless your Op correctly defines a destroy_map attribute, perform and c_code mustnot modify any of the inputs.

TODO: EXPLAIN DESTROYMAP and VIEWMAP BETTER AND GIVE EXAMPLE.

When developing an Op, you should run computations in DebugMode, by usingargument mode='DebugMode' to theano.function. DebugMode isslow, but it can catch many common violations of the Op contract.

TODO: Like what? How? Talk about Python vs. C too.

DebugMode is no silver bullet though.For example, if you modify an Op self.* during any ofmake_node, perform, or c_code, you are probably doing somethingwrong but DebugMode will not detect this.

TODO: jpt: I don’t understand the following sentence.

Ops and Types should usually be considered immutable – you shoulddefinitely not make a change that would have an impact on eq,hash, or the mathematical value that would be computed by performor c_code.

Simple C Op example

In this section, we put together the concepts that were covered in thistutorial to generate an op which multiplies every element in a vectorby a scalar and returns the resulting vector. This is intended to be a simpleexample so the methods c_support_code() and c_support_code_apply() arenot used because they are not required.

In the C code below notice how the reference count on the output variable ismanaged. Also take note of how the new variables required for the op’scomputation are declared in a new scope to avoid cross-initialization errors.

Also, in the C code, it is very important to properly validate the inputsand outputs storage. Theano guarantees that the inputs exist and have theright number of dimensions but it does not guarantee their exact shape. Forinstance, if an op computes the sum of two vectors, it needs to validate thatits two inputs have the same shape. In our case, we do not need to validatethe exact shapes of the inputs because we don’t have a need that they matchin any way.

For the outputs, things are a little bit more subtle. Theano does notguarantee that they have been allocated but it does guarantee that, if theyhave been allocated, they have the right number of dimension. Again, Theanooffers no guarantee on the exact shapes. This means that, in our example, weneed to validate that the output storage has been allocated and has the sameshape as our vector input. If it is not the case, we allocate a new outputstorage with the right shape and number of dimensions.

import numpy
import theano
from theano import gof
import theano.tensor as T
 
class VectorTimesScalar(gof.Op):
    __props__ = ()
 
    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1-d vector')
        if y.type.ndim != 0:
            raise TypeError('y must be a scalar')
 
        # Create an output variable of the same type as x
        output_var = x.type()
 
        return gof.Apply(self, [x, y], [output_var])
 
    def c_code_cache_version(self):
        return (1, 0)
 
    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out
 
        # Extract the dtypes of the inputs and outputs storage to
        # be able to declare pointers for those dtypes in the C
        # code.
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype
 
        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize
 
        fail = sub['fail']
 
        c_code = """
        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s ||
            PyArray_DIMS(%(x)s)[0] != PyArray_DIMS(%(z)s)[0])
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                PyArray_DIMS(%(x)s),
                                                PyArray_TYPE(%(x)s),
                                                0);
 
            if (!%(z)s) {
                %(fail)s;
            }
        }
 
        // Perform the vector multiplication by a scalar
        {
            /* The declaration of the following variables is done in a new
            scope to prevent cross initialization errors */
            npy_%(dtype_x)s* x_data_ptr =
                            (npy_%(dtype_x)s*)PyArray_DATA(%(x)s);
            npy_%(dtype_z)s* z_data_ptr =
                            (npy_%(dtype_z)s*)PyArray_DATA(%(z)s);
            npy_%(dtype_y)s y_value =
                            ((npy_%(dtype_y)s*)PyArray_DATA(%(y)s))[0];
            int x_stride = PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s;
            int z_stride = PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s;
            int x_dim = PyArray_DIMS(%(x)s)[0];
 
            for(int i=0; i < x_dim; i++)
            {
                z_data_ptr[i * z_stride] = (x_data_ptr[i * x_stride] *
                                            y_value);
            }
        }
        """
 
        return c_code % locals()

The c_code method accepts variable names as arguments (name, inp,out, sub) and returns a C code fragment that computes the expressionoutput. In case of error, the %(fail)s statement cleans up and returnsproperly.

More complex C Op example

This section introduces a new example, slightly more complex than the previousone, with an op to perform an element-wise multiplication between the elementsof two vectors. This new example differs from the previous one in its useof the methods csupport_code() and c_support_code_apply() (it doesnot _need to use them but it does so to explain their use) and its capacityto support inputs of different dtypes.

Recall the method c_support_code() is meant to produce code that willbe used for every apply of the op. This means that the C code in thismethod must be valid in every setting your op supports. If the op is meantto supports inputs of various dtypes, the C code in this method should begeneric enough to work with every supported dtype. If the op operates oninputs that can be vectors or matrices, the C code in this method shouldbe able to accomodate both kinds of inputs.

In our example, the method c_support_code() is used to declare a Cfunction to validate that two vectors have the same shape. Because ourop only supports vectors as inputs, this function is allowed to relyon its inputs being vectors. However, our op should support multipledtypes so this function cannot rely on a specific dtype in its inputs.

The method c_support_code_apply(), on the other hand, is allowedto depend on the inputs to the op because it is apply-specific. Therefore, weuse it to define a function to perform the multiplication between two vectors.Variables or functions defined in the method c_support_code_apply() willbe included at the global scale for every apply of the Op. Because of this,the names of those variables and functions should include the name of the op,like in the example. Otherwise, using the op twice in the same graph will giverise to conflicts as some elements will be declared more than once.

The last interesting difference occurs in the c_code() method. Because thedtype of the output is variable and not guaranteed to be the same as any ofthe inputs (because of the upcast in the method make_node()), the typenumof the output has to be obtained in the Python code and then included in theC code.

class VectorTimesVector(gof.Op):
    __props__ = ()
 
    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1-d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1-d vector')
 
        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()
 
        return gof.Apply(self, [x, y], [output_var])
 
    def c_code_cache_version(self):
        return (1, 0, 2)
 
    def c_support_code(self):
        c_support_code = """
        bool vector_same_shape(PyArrayObject* arr1,
            PyArrayObject* arr2)
        {
            return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
        }
        """
 
        return c_support_code
 
    def c_support_code_apply(self, node, name):
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype
 
        c_support_code = """
        void vector_elemwise_mult_%(name)s(npy_%(dtype_x)s* x_ptr,
            int x_str, npy_%(dtype_y)s* y_ptr, int y_str,
            npy_%(dtype_z)s* z_ptr, int z_str, int nbElements)
        {
            for (int i=0; i < nbElements; i++){
                z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
            }
        }
        """
 
        return c_support_code % locals()
 
    def c_code(self, node, name, inp, out, sub):
        x, y = inp
        z, = out
 
        dtype_x = node.inputs[0].dtype
        dtype_y = node.inputs[1].dtype
        dtype_z = node.outputs[0].dtype
 
        itemsize_x = numpy.dtype(dtype_x).itemsize
        itemsize_y = numpy.dtype(dtype_y).itemsize
        itemsize_z = numpy.dtype(dtype_z).itemsize
 
        typenum_z = numpy.dtype(dtype_z).num
 
        fail = sub['fail']
 
        c_code = """
        // Validate that the inputs have the same shape
        if ( !vector_same_shape(%(x)s, %(y)s))
        {
            PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                        "x.shape[0] and y.shape[0] should match but "
                        "x.shape[0] == %%i and y.shape[0] == %%i",
                        PyArray_DIMS(%(x)s)[0], PyArray_DIMS(%(y)s)[0]);
            %(fail)s;
        }
 
        // Validate that the output storage exists and has the same
        // dimension as x.
        if (NULL == %(z)s || !(vector_same_shape(%(x)s, %(z)s)))
        {
            /* Reference received to invalid output variable.
            Decrease received reference's ref count and allocate new
            output variable */
            Py_XDECREF(%(z)s);
            %(z)s = (PyArrayObject*)PyArray_EMPTY(1,
                                                PyArray_DIMS(%(x)s),
                                                %(typenum_z)s,
                                                0);
 
            if (!%(z)s) {
                %(fail)s;
            }
        }
 
        // Perform the vector elemwise multiplication
        vector_elemwise_mult_%(name)s(
                                (npy_%(dtype_x)s*)PyArray_DATA(%(x)s),
                                PyArray_STRIDES(%(x)s)[0] / %(itemsize_x)s,
                                (npy_%(dtype_y)s*)PyArray_DATA(%(y)s),
                                PyArray_STRIDES(%(y)s)[0] / %(itemsize_y)s,
                                (npy_%(dtype_z)s*)PyArray_DATA(%(z)s),
                                PyArray_STRIDES(%(z)s)[0] / %(itemsize_z)s,
                                PyArray_DIMS(%(x)s)[0]);
        """
 
        return c_code % locals()

Alternate way of defining C Ops

The two previous examples have covered the standard way of implementing C Opsin Theano by inheriting from the class Op. This process is mostlysimple but it still involves defining many methods as well as mixing, in thesame file, both Python and C code which tends to make the result lessreadable.

To help with this, Theano defines a class, COp, from which new C opscan inherit. The class COp aims to simplify the process of implementingC ops by doing the following :

It allows you to define the C implementation of your op in a distinctC code file. This makes it easier to keep your Python and C codereadable and well indented.
It can automatically handle all the methods that return C code,in addition to Op.c_code_cache_version() based on theprovided external C implementation.

To illustrate how much simpler the class COp makes the process of defininga new op with a C implementation, let’s revisit the second example of thistutorial, the VectorTimesVector op. In that example, we implemented an opto perform the task of element-wise vector-vector multiplication. The twofollowing blocks of code illustrate what the op would look like if it wasimplemented using the COp class.

The new op is defined inside a Python file with the following code :

import theano
from theano import gof
 
class VectorTimesVector(gof.COp):
    __props__ = ()
 
    func_file = "./vectorTimesVector.c"
    func_name = "APPLY_SPECIFIC(vector_times_vector)"
 
    def __init__(self):
        super(VectorTimesVector, self).__init__(self.func_file,
                                                self.func_name)
 
    def make_node(self, x, y):
        # Validate the inputs' type
        if x.type.ndim != 1:
            raise TypeError('x must be a 1-d vector')
        if y.type.ndim != 1:
            raise TypeError('y must be a 1-d vector')
 
        # Create an output variable of the same type as x
        output_var = theano.tensor.TensorType(
                        dtype=theano.scalar.upcast(x.dtype, y.dtype),
                        broadcastable=[False])()
 
        return gof.Apply(self, [x, y], [output_var])

And the following is the C implementation of the op, defined in an externalC file named vectorTimesVector.c :

#section support_code
 
// Support code function
bool vector_same_shape(PyArrayObject* arr1, PyArrayObject* arr2)
{
    return (PyArray_DIMS(arr1)[0] == PyArray_DIMS(arr2)[0]);
}
 
 
#section support_code_apply
 
// Apply-specific support function
void APPLY_SPECIFIC(vector_elemwise_mult)(
    DTYPE_INPUT_0* x_ptr, int x_str,
    DTYPE_INPUT_1* y_ptr, int y_str,
    DTYPE_OUTPUT_0* z_ptr, int z_str, int nbElements)
{
    for (int i=0; i < nbElements; i++){
        z_ptr[i * z_str] = x_ptr[i * x_str] * y_ptr[i * y_str];
    }
}
 
// Apply-specific main function
int APPLY_SPECIFIC(vector_times_vector)(PyArrayObject* input0,
                                        PyArrayObject* input1,
                                        PyArrayObject** output0)
{
    // Validate that the inputs have the same shape
    if ( !vector_same_shape(input0, input1))
    {
        PyErr_Format(PyExc_ValueError, "Shape mismatch : "
                    "input0.shape[0] and input1.shape[0] should "
                    "match but x.shape[0] == %i and "
                    "y.shape[0] == %i",
                    PyArray_DIMS(input0)[0], PyArray_DIMS(input1)[0]);
        return 1;
    }
 
    // Validate that the output storage exists and has the same
    // dimension as x.
    if (NULL == *output0 || !(vector_same_shape(input0, *output0)))
    {
        /* Reference received to invalid output variable.
        Decrease received reference's ref count and allocate new
        output variable */
        Py_XDECREF(*output0);
        *output0 = (PyArrayObject*)PyArray_EMPTY(1,
                                                PyArray_DIMS(input0),
                                                TYPENUM_OUTPUT_0,
                                                0);
 
        if (!*output0) {
            PyErr_Format(PyExc_ValueError,
                        "Could not allocate output storage");
            return 1;
        }
    }
 
    // Perform the actual vector-vector multiplication
    APPLY_SPECIFIC(vector_elemwise_mult)(
                            (DTYPE_INPUT_0*)PyArray_DATA(input0),
                            PyArray_STRIDES(input0)[0] / ITEMSIZE_INPUT_0,
                            (DTYPE_INPUT_1*)PyArray_DATA(input1),
                            PyArray_STRIDES(input1)[0] / ITEMSIZE_INPUT_1,
                            (DTYPE_OUTPUT_0*)PyArray_DATA(*output0),
                            PyArray_STRIDES(*output0)[0] / ITEMSIZE_OUTPUT_0,
                            PyArray_DIMS(input0)[0]);
 
    return 0;
}

As you can see from this example, the Python and C implementations are nicelydecoupled which makes them much more readable than when they were intertwinedin the same file and the C code contained string formatting markers.

Now that we have motivated the COp class, we can have a more precise look atwhat it does for us. For this, we go through the various elements that make upthis new version of the VectorTimesVector op :

Parent class : instead of inheriting from the class Op,VectorTimesVector inherits from the class COp.
Constructor : in our new op, the init() method has animportant use; to inform the constructor of the COp classof the location, on the filesystem of the C implementation ofthis op. To do this, it gives a list of file paths containingthe C code for this op. To auto-generate the c_code methodwith a function call you can specify the function name as thesecond parameter. The paths should be given as a relativepath from the folder where the descendant of the COp classis defined.
make_node() : the make_node() method is absolutelyidentical to the one in our old example. Using the COpclass doesn’t change anything here.
External C code : the external C code implements the variousfunctions associated with the op. Writing this C codeinvolves a few subtleties which deserve their own respectivesections.

Main function

If you pass a function name to the init() method of theCOp class, it must respect the following constraints:

It must return an int. The value of that int indicates whetherthe op could perform its task or not. A value of 0 indicatessuccess while any non-zero value will interrupt the executionof the Theano function. When returning non-zero the functionmust set a python exception indicating the details of theproblem.
It must receive one argument for each input to the op followedby one pointer to an argument for each output of the op. Thetypes for the argument is dependant on the Types (that istheano Types) of your inputs and outputs.
You can sepcify the number of inputs and outputs for your opby setting the copnum_inputs and cop_num_outputs_attributes on your op. The main function will always becalled with that number of arguments, using NULL to fill infor missing values at the end. This can be used if your ophas a variable number of inputs or outputs, but with a fixedmaximum.

For example, the main C function of an op that takes two TensorTypes(which has PyArrayObject * as its C type) as inputs and returnsboth their sum and the difference between them would have fourparameters (two for the op’s inputs and two for its outputs) and it’ssignature would look something like this :

int sumAndDiffOfScalars(PyArrayObject* in0, PyArrayObject* in1,
                        PyArrayObject** out0, PyArrayObject** out1)

Macros

For certain section tags, your C code can benefit from a number ofpre-defined macros. These section tags have no macros: init_code,support_code. All other tags will have the support macrosdiscussed below.

APPLY_SPECIFIC(str) which will automatically append a nameunique to the Apply node that applies the Op at the endof the provided str. The use of this macro is discussedfuther below.

For every input which has a dtype attribute (this meansTensors, and equivalent types on GPU), the following macros will bedefined unless your Op class has an Op.check_input attributedefined to False. In these descrptions ‘i’ refers to the position(indexed from 0) in the input array.

DTYPEINPUT{i} : NumPy dtype of the data in the array.This is the variable type corresponding to the NumPy dtype, not thestring representation of the NumPy dtype. For instance, if the op’sfirst input is a float32 ndarray, then the macro DTYPE_INPUT_0corresponds to npy_float32 and can directly be used to declare anew variable of the same dtype as the data in the array :

DTYPE_INPUT_0 myVar = someValue;

TYPENUMINPUT{i} : Typenum of the data in the array
ITEMSIZEINPUT{i} : Size, in bytes, of the elements inthe array.

In the same way, the macros DTYPEOUTPUT{i},ITEMSIZEOUTPUT{i} and TYPENUMOUTPUT{i} are defined forevery output ‘i’ of the op.

In addition to these macros, the init_code_struct, code, andcode_cleanup section tags also have the following macros:

FAIL : Code to insert at error points. A python exceptionshould be set prior to this code. An invocation look like this:

if (error) {
  // Set python exception
  FAIL
}

You can add a semicolon after the macro if it makes your editorhappy.

PARAMS : Name of the params variable for this node. (onlyfor Ops which have params, which is discussed elsewhere)

Finally the tag code and codecleanup have macros topass the inputs and output names. These are name INPUT{i} andOUTPUT{i} where _i is the 0-based index position in the inputand output arrays respectively.

Support code

Certain section are limited in what you can place in them due tosemantic and syntactic restrictions of the C++ language. Most ofthese restrictions apply to the tags that end in _struct.

When we defined the VectorTimesVector op without using the COpclass, we had to make a distinction between two types of support_code: the support code that was apply-specific and the support code thatwasn’t. The apply-specific code was defined in thec_support_code_apply() method and the elements defined in thatcode (global variables and functions) had to include the name of theApply node in their own names to avoid conflicts between the differentversions of the apply-specific code. The code that wasn’tapply-specific was simply defined in the c_support_code() method.

To make indentifiers that include the Apply node name use theAPPLY_SPECIFIC(str) macro. In the above example, this macro isused when defining the functions vector_elemwise_mult() andvector_times_vector() as well as when calling functionvector_elemwise_mult() from inside vector_times_vector().

When using the COp class, we still have to make the distinctionbetween C code for each of the methods of a C class. These sections ofcode are separated by #section <tag> markers. The tag determinesthe name of the method this C code applies to with the rule that<tag> applies to c<tag>_. Unknown tags are an error and will bereported. Duplicate tags will be merged together in the order theappear in the C files.

The rules for knowing if where a piece of code should be put can besometimes tricky. The key thing to remember is that things that canbe shared between instances of the op should be apply-agnostic and gointo a section which does not end in _apply or _struct. Thedistinction of _apply and _struct mostly hinghes on how youwant to manange the lifetime of the object. Note that to use anapply-specific object, you have to be in a apply-specific section, sosome portions of the code that might seem apply-agnostic may still beapply-specific because of the data they use (this does not includearguments).

In the above example, the function vector_same_shape() isapply-agnostic because it uses none of the macros defined by the classCOp and it doesn’t rely on any apply-specific code. The functionvector_elemwise_mult() is apply-specific because it uses themacros defined by COp. Finally, the functionvector_times_vector() is apply-specific because it uses those samemacros and also because it calls vector_elemwise_mult() which isan apply-specific function.

Using GDB to debug Op’s C code

When debugging C code, it can be useful to use GDB for code compiledby Theano.

For this, you must enable this Theano: cmodule.remove_gxx_opt=True.For the GPU, you must add in this second flag nvcc.flags=-g (it slowdown computation on the GPU, but it is enabled by default on the CPU).

Then you must start Python inside GDB and in it start your Pythonprocess (e.g. theano-nose):

$gdb python
(gdb)r bin/theano-nose theano/

Quick guide to GDB.

Final Note

This tutorial focuses on providing C implementations to ops that manipulateTheano tensors. For more information about other Theano types, you can referto the section Alternate Theano Types.