Utility functions

Optimization

  • theano.gpuarray.opt_util.alpha_merge(cls, alpha_in, beta_in)[source]
  • Decorator to merge multiplication by a scalar on the output.

This will find a pattern of scal * <yourop>(some, params, alpha, beta) and update it so that the scalar multiplication happens as part of your op.

The op needs to accept an alpha and a beta scalar which act this way:

  out = Op() * alpha + out_like * beta

Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from BLAS.

The decorated function must have this signature:

  maker(node, *inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

  def maker(node, *inputs):
      return node.op(*inputs)

Parameters:

  • cls (op class) – The class of the op you want to merge
  • alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
  • beta_in (int) – The input index for the beta scalar for your op (in node.inputs).

Returns: an unregistered local optimizer that has the same name as the decorated function.

Return type: local optimizer

Notes

This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.
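
As an illustrative sketch only (the indices below assume GpuGemm takes its inputs as (C, alpha, A, B, beta); check your op's actual input layout):

  from theano.gpuarray.blas import GpuGemm
  from theano.gpuarray.opt_util import alpha_merge

  # Assumed layout: alpha at index 1 and beta at index 4 of GpuGemm.
  @alpha_merge(GpuGemm, alpha_in=1, beta_in=4)
  def local_gemm_alpha_merge(node, *inputs):
      # Rebuild the same op on the new inputs; the outer scalar
      # multiplication has already been folded into alpha.
      return node.op(*inputs)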

  • theano.gpuarray.opt_util.find_node(v, cls, ignore_clients=False)[source]
  • Find the node that has an op of type cls in v.

This digs through possibly redundant transfers to find the node that has the type cls. If ignore_clients is False (the default) it will only dig through nodes that have a single client, to avoid duplicating computations.

Parameters:

  • v – The variable to dig through
  • cls (Op class) – The type of the node we are looking for
  • ignore_clients (bool, optional) – Whether to ignore multiple clients or not.
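
A hypothetical use inside a local optimizer (the op class used here is only illustrative):

  from theano.gpuarray.blas import GpuGemm
  from theano.gpuarray.opt_util import find_node

  def gemm_behind(var):
      # Look behind possible transfers for a GpuGemm node producing
      # `var`; find_node returns None if there is no such node.
      return find_node(var, GpuGemm)
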
  • theano.gpuarray.opt_util.grab_cpu_scalar(v, nd)[source]
  • Get a scalar variable value from the tree at v.

This function will dig through transfers and dimshuffles to get the constant value. If no such constant is found, it returns None.

Parameters:

  • v – Theano variable to extract the constant value from.
  • nd (int) – Expected number of dimensions for the variable (for broadcasted constants).
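
For example, a hypothetical helper that tries to recover a constant feeding a node's second input (the input index and ndim used here are illustrative):

  from theano.gpuarray.opt_util import grab_cpu_scalar

  def constant_scale_of(node):
      # Dig through DimShuffles and transfers; returns the constant
      # value, or None if the input is not a reachable constant.
      return grab_cpu_scalar(node.inputs[1], nd=node.outputs[0].ndim)
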
  • theano.gpuarray.opt_util.inplace_allocempty(op, idx)[source]
  • Wrapper to make an inplace optimization that deals with AllocEmpty.

This will duplicate the alloc input if it has more than one client to allow the op to work on it inplace.

The decorated function must have this signature:

  maker(node, inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. You should also switch the op to work inplace. The inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

  def maker(node, inputs):
      return [node.op.__class__(inplace=True)(*inputs)]

Parameters:

  • op (op class) – The op class to look for to make inplace
  • idx (int) – The index of the (possibly) AllocEmpty input (in node.inputs).

Returns: an unregistered inplace local optimizer that has the same name as the decorated function.

Return type: local optimizer
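
A sketch of typical usage (assuming, for illustration only, that the destination buffer of GpuGemm is its first input):

  from theano.gpuarray.blas import GpuGemm
  from theano.gpuarray.opt_util import inplace_allocempty

  @inplace_allocempty(GpuGemm, 0)
  def local_inplace_gemm(node, inputs):
      # Rebuild the op with inplace=True; the AllocEmpty input has
      # already been duplicated if it had more than one client.
      return [node.op.__class__(inplace=True)(*inputs)]
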
  • theano.gpuarray.opt_util.is_equal(var, val)[source]
  • Returns True if var is always equal to val.

This will only return True if the variable will always be equal to the value. If it might not be true in some cases then it returns False.

Parameters:

  • var – Variable to compare
  • val – Python value
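
A quick illustration with plain Theano variables:

  import theano.tensor as T
  from theano.gpuarray.opt_util import is_equal

  c = T.constant(1.0)
  x = T.scalar('x')
  is_equal(c, 1.0)  # True: the constant always equals the value
  is_equal(x, 1.0)  # False: the value of x is unknown at compile time
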
  • theano.gpuarray.opt_util.output_merge(cls, alpha_in, beta_in, out_in)[source]
  • Decorator to merge addition by a value on the output.

This will find a pattern of val + <yourop>(some, params, alpha, beta, out_like) and update it so that the addition happens as part of your op.

The op needs to accept an alpha and a beta scalar which act this way:

  out = Op() * alpha + out_like * beta

Where out_like is a buffer that has the same size as the output and gets added to the “real” output of the operation. An example of an operation that respects this pattern is GEMM from BLAS.

The decorated function must have this signature:

  maker(node, *inputs)

The node argument you receive is the original apply node that contains your op. You should use it to grab relevant properties for your op so that the new version performs the same computation. The *inputs parameter contains the new inputs for your op. You MUST use those inputs instead of the ones on node. Note that this function can be as simple as:

  def maker(node, *inputs):
      return node.op(*inputs)

Parameters:

  • cls (op class) – The class of the op you want to merge
  • alpha_in (int) – The input index for the alpha scalar for your op (in node.inputs).
  • beta_in (int) – The input index for the beta scalar for your op (in node.inputs).
  • out_in (int) – The input index for the out_like input for your op (in node.inputs).

Returns: an unregistered local optimizer that has the same name as the decorated function.

Return type: local optimizer

Notes

This was factored out since the code to deal with intervening transfers and correctness in the presence of different values of alpha and beta scaling factors is not trivial.

This also correctly handles the case where the added value is broadcasted (by not performing the replacement).
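
Mirroring the alpha_merge sketch above (again assuming GpuGemm's inputs are laid out as (C, alpha, A, B, beta), with the out_like buffer at index 0):

  from theano.gpuarray.blas import GpuGemm
  from theano.gpuarray.opt_util import output_merge

  @output_merge(GpuGemm, alpha_in=1, beta_in=4, out_in=0)
  def local_gemm_output_merge(node, *inputs):
      # The value added on the output has been folded into out_like/beta.
      return node.op(*inputs)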

  • theano.gpuarray.opt_util.pad_dims(input, leftdims, rightdims)[source]
  • Reshapes the input to a tensor of (leftdims + rightdims) dimensions.

This helper function is used to convert pooling inputs with arbitrary non-pooling dimensions to the correct number of dimensions for the GPU pooling ops.

This reduces or expands the number of non-pooling dimensions of the input to exactly leftdims, by adding extra dimensions on the left or by combining some existing dimensions on the left of the input.

Use unpad_dims to reshape back to the original dimensions.

Examples

Given input of shape (3, 5, 7), pad_dims(input, 2, 2) adds a singleton dimension and reshapes to (1, 3, 5, 7). Given that output from pad_dims, unpad_dims(output, input, 2, 2) reshapes back to (3, 5, 7).

Given input of shape (3, 5, 7, 9), pad_dims(input, 2, 2) does not reshape and returns output with shape (3, 5, 7, 9).

Given input of shape (3, 5, 7, 9, 11), pad_dims(input, 2, 2) combines the first two dimensions and reshapes to (15, 7, 9, 11).

Given input of shape (3, 5, 7, 9), pad_dims(input, 2, 3) adds a singleton dimension and reshapes to (1, 3, 5, 7, 9).

  • theano.gpuarray.opt_util.unpad_dims(output, input, leftdims, rightdims)[source]
  • Reshapes the output after pad_dims.

This reverts the padding by pad_dims.

Kernel generation

  • theano.gpuarray.kernel_codegen.code_version(version)[source]
  • Decorator to support version-based cache mechanism.
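
A minimal sketch, assuming the decorator simply tags the code-generating function with the given version tuple for the compilation cache:

  from theano.gpuarray.kernel_codegen import code_version

  @code_version((1, 0))  # bump the tuple whenever the generated code changes
  def gen_copy_body():
      return "y[i] = x[i];"
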
  • theano.gpuarray.kernel_codegen.inline_reduce(N, buf, pos, count, manner_fn)[source]
  • Return C++ code for a function that reduces a contiguous buffer.

Parameters:

  • N – Length of the buffer.
  • buf – buffer pointer.
  • pos – Index of executing thread.
  • count – Number of executing threads.
  • manner_fn – A function that accepts strings of arguments a and b, and returns C code for their reduction, e.g.

      return "%(a)s + %(b)s"

    for a sum reduction.

Notes

buf should be in gpu shared memory, as we access it many times.

This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.
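
An illustrative call; the string arguments are C expressions pasted verbatim into the generated kernel, and the particular names used here are assumptions:

  from theano.gpuarray.kernel_codegen import inline_reduce

  # Code for a block-wide sum reduction of a shared buffer `buf`.
  sum_code = inline_reduce('N', 'buf', 'threadIdx.x', 'blockDim.x',
                           lambda a, b: "%s + %s" % (a, b))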

  • theano.gpuarray.kernel_codegen.inline_reduce_fixed_shared(N, buf, x, stride_x, load_x, pos, count, manner_fn, manner_init, b='', stride_b='', load_b='', dtype='float32')[source]
  • Return C++ code for a function that reduces a contiguous buffer.

This function leaves the answer in position 0 of the buffer. The rest of the buffer is trashed by this function.

Parameters:

  • N – Length of the buffer.
  • buf – Buffer pointer of size warpSize * sizeof(dtype).
  • x – Input data.
  • stride_x – Input data stride.
  • load_x – Wrapper to read from x.
  • pos – Index of executing thread.
  • count – Number of executing threads.
  • manner_fn – A function that accepts strings of arguments a and b, and returns C code for their reduction, e.g.

      return "%(a)s + %(b)s"

    for a sum reduction.

  • manner_init – A function that accepts a string argument a and returns C code for its initialization.
  • b – Optional, pointer to the bias.
  • stride_b – Optional, the stride of b if b is provided.
  • load_b – Optional, wrapper to read from b if b is provided.
  • dtype – Optional, the dtype of the output.

Notes

buf should be in gpu shared memory, as we access it many times.

  • theano.gpuarray.kernel_codegen.inline_softmax(N, buf, buf2, threadPos, threadCount, dtype='float32')[source]
  • Generate code for a softmax.

On entry, buf and buf2 must contain two identical copies of the input to softmax.

After the code returns, buf contains the softmax and buf2 contains the un-normalized softmax.

Parameters:

  • N – Length of the buffer.
  • threadPos – Index of executing thread.
  • threadCount – Number of executing threads.
  • dtype – Dtype of the softmax’s output.

Notes

buf and buf2 should be in gpu shared memory, as we access them many times.

We use __i as an int variable in a loop.
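
An illustrative call (the buffer and thread-index names are assumptions; they are inserted verbatim into the generated code):

  from theano.gpuarray.kernel_codegen import inline_softmax

  softmax_code = inline_softmax('N', 'buf', 'buf2',
                                'threadIdx.x', 'blockDim.x',
                                dtype='float32')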

  • theano.gpuarray.kernel_codegen.inline_softmax_fixed_shared(N, buf, x, stride_x, load_x, sm, sm_stride, write_sm, threadPos, threadCount, b='', stride_b='', load_b='', dtype='float32')[source]
  • Generate code to perform softmax with a fixed amount of shared memory.

On entry, buf is assumed to be empty.

On exit, buf[0] contains the softmax and buf2 contains the un-normalized softmax.

Parameters:

  • N – Length of the buffer, at least warpSize (32).
  • buf – A shared memory buffer of size warpSize * sizeof(dtype).
  • x – A ptr to the gpu memory where the row is stored.
  • stride_x – The stride between each element in x.
  • load_x – Wrapper to read from x.
  • sm – A ptr to the gpu memory to store the result.
  • sm_stride – The stride between each sm element.
  • write_sm – Wrapper before writing to sm.
  • threadPos – Index of executing thread.
  • threadCount – Number of executing threads.
  • b – Optional, pointer to the bias.
  • stride_b – Optional, the stride of b if b is provided.
  • load_b – Optional, wrapper to read from b if b is provided.
  • dtype – Optional, the dtype of the softmax’s output if not float32.

Notes

buf should be in gpu shared memory, as we access it many times.

We use tx as an int variable in a loop.

  • theano.gpuarray.kernel_codegen.nvcc_kernel(name, params, body)[source]
  • Return the C code of a kernel function.

Parameters:

  • params – The parameters to the function as one or more strings.
  • body – The [nested] list of statements for the body of the function. These will be separated by ‘;’ characters.
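
A small illustrative use; the kernel name, parameters and statements below are made up:

  from theano.gpuarray.kernel_codegen import nvcc_kernel

  src = nvcc_kernel(
      'k_copy',
      params=['unsigned int n', 'const float *x', 'float *y'],
      body=['int i = blockIdx.x * blockDim.x + threadIdx.x',
            'if (i < n) { y[i] = x[i]; }'])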

float16

  • theano.gpuarray.fp16_help.load_w(dtype)[source]
  • Return the function name to load data.

This should be used like this:

  code = '%s(ival)' % (load_w(input_type),)
  • theano.gpuarray.fp16_help.work_dtype(dtype)[source]
  • Return the data type for working memory.
  • theano.gpuarray.fp16_help.write_w(dtype)[source]
  • Return the function name to write data.

This should be used like this:

  code = 'res = %s(oval)' % (write_w(output_type),)
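
Putting the three helpers together, a sketch of making a kernel statement float16-safe (the variable names are illustrative; for non-float16 dtypes the wrappers are typically empty strings and the working dtype is unchanged):

  from theano.gpuarray.fp16_help import load_w, work_dtype, write_w

  dtype = 'float16'
  wtype = work_dtype(dtype)  # dtype used for intermediate computation
  load = load_w(dtype)       # wrapper applied when reading an element
  write = write_w(dtype)     # wrapper applied when writing an element

  # C statement that loads one element, doubles it and stores it back.
  code = "y[i] = %s(2.0f * %s(x[i]));" % (write, load)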