Graph optimization

In this section we will define a couple optimizations on doubles.

Todo

This tutorial goes way too far under the hood, for someone who just wantsto add yet another pattern to the libraries in tensor.opt for example.

We need another tutorial that covers the decorator syntax, and explains howto register your optimization right away. That’s what you need to getgoing.

Later, the rest is more useful for when that decorator syntax type thingdoesn’t work. (There are optimizations that don’t fit that model).

Note

The optimization tag cxx_only is used for optimizations that insertOps which have no Python implementation (so they only have C code).Optimizations with this tag are skipped when there is no C++ compileravailable.

Global and local optimizations

First, let’s lay out the way optimizations work in Theano. There aretwo types of optimizations: global optimizations and local_optimizations. A global optimization takes a FunctionGraph object (aFunctionGraph is a wrapper around a whole computation graph, you can see itsdocumentation for more details) and navigates through itin a suitable way, replacing some Variables by others in the process. Alocal optimization, on the other hand, is defined as a function on a_single Apply node and must return either False (to mean thatnothing is to be done) or a list of new Variables that we would like toreplace the node’s outputs with. A Navigator is a special kindof global optimization which navigates the computation graph in somefashion (in topological order, reverse-topological order, randomorder, etc.) and applies one or more local optimizations at each step.

Optimizations which are holistic, meaning that they must take intoaccount dependencies that might be all over the graph, should beglobal. Optimizations that can be done with a narrow perspective arebetter defined as local optimizations. The majority of optimizationswe want to define are local.

Global optimization

A global optimization (or optimizer) is an object which defines the followingmethods:

  • class Optimizer
    • apply(fgraph)
    • This method takes a FunctionGraph object which contains the computation graphand does modifications in line with what the optimization is meantto do. This is one of the main methods of the optimizer.

    • addrequirements(_fgraph)

    • This method takes a FunctionGraph object and adds features to it. These features are “plugins” that are neededfor the apply method to do its job properly.

    • optimize(fgraph)

    • This is the interface function called by Theano.

Default: this is defined by Optimizer as add_requirement(fgraph); apply(fgraph).

See the section about FunctionGraph to understand how to define thesemethods.

Local optimization

A local optimization is an object which defines the following methods:

  • class LocalOptimizer
    • transform(node)
    • This method takes an Apply node and returns either False tosignify that no changes are to be done or a list of Variables whichmatches the length of the node’s outputs list. When theLocalOptimizer is applied by a Navigator, the outputs of the nodepassed as argument to the LocalOptimizer will be replaced by thelist returned.

One simplification rule

For starters, let’s define the following simplification:

\frac{xy}{y} = x

We will implement it in three ways: using a global optimization, alocal optimization with a Navigator and then using the PatternSubfacility.

Global optimization

Here is the code for a global optimization implementing thesimplification described above:

  1. import theano
  2. from theano import gof
  3. from theano.gof import toolbox
  4.  
  5. class Simplify(gof.Optimizer):
  6. def add_requirements(self, fgraph):
  7. fgraph.attach_feature(toolbox.ReplaceValidate())
  8. def apply(self, fgraph):
  9. for node in fgraph.toposort():
  10. if node.op == true_div:
  11. x, y = node.inputs
  12. z = node.outputs[0]
  13. if x.owner and x.owner.op == mul:
  14. a, b = x.owner.inputs
  15. if y == a:
  16. fgraph.replace_validate(z, b)
  17. elif y == b:
  18. fgraph.replace_validate(z, a)
  19.  
  20. simplify = Simplify()

Todo

What is add_requirements? Why would we know to do this? Are there otherrequirements we might want to know about?

Here’s how it works: first, in add_requirements, we add theReplaceValidate FunctionGraph Features located intoolbox – [doc TODO]. This feature adds the replace_validatemethod to fgraph, which is an enhanced version of replace thatdoes additional checks to ensure that we are not messing up thecomputation graph (note: if ReplaceValidate was already added byanother optimizer, extend will do nothing). In a nutshell,toolbox.ReplaceValidate grants access to fgraph.replace_validate,and fgraph.replace_validate allows us to replace a Variable withanother while respecting certain validation constraints. You canbrowse the list of FunctionGraph Feature List and see if some ofthem might be useful to write optimizations with. For example, as anexercise, try to rewrite Simplify using NodeFinder. (Hint: youwant to use the method it publishes instead of the call to toposort!)

Then, in apply we do the actual job of simplification. We start byiterating through the graph in topological order. For each nodeencountered, we check if it’s a div node. If not, we have nothingto do here. If so, we put in x, y and z the numerator,denominator and quotient (output) of the division.The simplification only occurs when the numerator is a multiplication,so we check for that. If the numerator is a multiplication we put thetwo operands in a and b, sowe can now say that z == (a*b)/y. If y==a then z==b and ify==b then z==a. When either case happens then we can replacez by either a or b using fgraph.replace_validate - else we donothing. You might want to check the documentation about Variableand Apply to get a better understanding of thepointer-following game you need to get ahold of the nodes of interestfor the simplification (x, y, z, a, b, etc.).

Test time:

  1. >>> from theano.scalar import float64, add, mul, true_div
  2. >>> x = float64('x')
  3. >>> y = float64('y')
  4. >>> z = float64('z')
  5. >>> a = add(z, mul(true_div(mul(y, x), y), true_div(z, x)))
  6. >>> e = gof.FunctionGraph([x, y, z], [a])
  7. >>> e
  8. [add(z, mul(true_div(mul(y, x), y), true_div(z, x)))]
  9. >>> simplify.optimize(e)
  10. >>> e
  11. [add(z, mul(x, true_div(z, x)))]

Cool! It seems to work. You can check what happens if you put manyinstances of \frac{xy}{y} in the graph. Note that it sometimeswon’t work for reasons that have nothing to do with the quality of theoptimization you wrote. For example, consider the following:

  1. >>> x = float64('x')
  2. >>> y = float64('y')
  3. >>> z = float64('z')
  4. >>> a = true_div(mul(add(y, z), x), add(y, z))
  5. >>> e = gof.FunctionGraph([x, y, z], [a])
  6. >>> e
  7. [true_div(mul(add(y, z), x), add(y, z))]
  8. >>> simplify.optimize(e)
  9. >>> e
  10. [true_div(mul(add(y, z), x), add(y, z))]

Nothing happened here. The reason is: add(y, z) != add(y, z). That is the case for efficiency reasons. To fix this problem wefirst need to merge the parts of the graph that represent the samecomputation, using the MergeOptimizer defined intheano.gof.opt.

  1. >>> from theano.gof.opt import MergeOptimizer
  2. >>> MergeOptimizer().optimize(e)
  3. (0, ..., None, None, {}, 1, 0)
  4. >>> e
  5. [true_div(mul(*1 -> add(y, z), x), *1)]
  6. >>> simplify.optimize(e)
  7. >>> e
  8. [x]

Once the merge is done, both occurrences of add(y, z) arecollapsed into a single one and is used as an input in twoplaces. Note that add(x, y) and add(y, x) are still consideredto be different because Theano has no clue that add iscommutative. You may write your own global optimizer to identifycomputations that are identical with full knowledge of the rules ofarithmetics that your Ops implement. Theano might provide facilitiesfor this somewhere in the future.

Note

FunctionGraph is a Theano structure intended for the optimizationphase. It is used internally by function and is rarelyexposed to the end user. You can use it to test out optimizations,etc. if you are comfortable with it, but it is recommended to usethe function frontend and to interface optimizations withoptdb (we’ll see how to do that soon).

Local optimization

The local version of the above code would be the following:

  1. class LocalSimplify(gof.LocalOptimizer):
  2. def transform(self, node):
  3. if node.op == true_div:
  4. x, y = node.inputs
  5. if x.owner and x.owner.op == mul:
  6. a, b = x.owner.inputs
  7. if y == a:
  8. return [b]
  9. elif y == b:
  10. return [a]
  11. return False
  12. def tracks(self):
  13. # This should be needed for the EquilibriumOptimizer
  14. # but it isn't now
  15. # TODO: do this and explain it
  16. return [] # that's not what you should do
  17.  
  18. local_simplify = LocalSimplify()

Todo

Fix up previous example… it’s bad and incomplete.

The definition of transform is the inner loop of the global optimizer,where the node is given as argument. If no changes are to be made,False must be returned. Else, a list of what to replace the node’soutputs with must be returned. This list must have the same length asnode.ouputs. If one of node.outputs don’t have clients(it is not usedin the graph), you can put None in the returned list to remove it.

In order to apply the local optimizer we must use it in conjunctionwith a Navigator. Basically, a Navigator is a globaloptimizer that loops through all nodes in the graph (or a well-definedsubset of them) and applies one or several local optimizers on them.

  1. >>> x = float64('x')
  2. >>> y = float64('y')
  3. >>> z = float64('z')
  4. >>> a = add(z, mul(true_div(mul(y, x), y), true_div(z, x)))
  5. >>> e = gof.FunctionGraph([x, y, z], [a])
  6. >>> e
  7. [add(z, mul(true_div(mul(y, x), y), true_div(z, x)))]
  8. >>> simplify = gof.TopoOptimizer(local_simplify)
  9. >>> simplify.optimize(e)
  10. (<theano.gof.opt.TopoOptimizer object at 0x...>, 1, 5, 3, ..., ..., ...)
  11. >>> e
  12. [add(z, mul(x, true_div(z, x)))]

OpSub, OpRemove, PatternSub

Theano defines some shortcuts to make LocalOptimizers:

  • OpSub(op1, op2)
  • Replaces all uses of op1 by op2. In otherwords, the outputs of all Apply involving op1 by the outputsof Apply nodes involving op2, where their inputs are the same.
  • OpRemove(op)
  • Removes all uses of op in the following way:if y = op(x) then y is replaced by x. op must have as manyoutputs as it has inputs. The first output becomes the first input,the second output becomes the second input, and so on.
  • PatternSub(pattern1, pattern2)
  • Replaces all occurrences of the first pattern by the second pattern.See PatternSub.
  1. from theano.gof.opt import OpSub, OpRemove, PatternSub
  2.  
  3. # Replacing add by mul (this is not recommended for primarily
  4. # mathematical reasons):
  5. add_to_mul = OpSub(add, mul)
  6.  
  7. # Removing identity
  8. remove_identity = OpRemove(identity)
  9.  
  10. # The "simplify" operation we've been defining in the past few
  11. # sections. Note that we need two patterns to account for the
  12. # permutations of the arguments to mul.
  13. local_simplify_1 = PatternSub((true_div, (mul, 'x', 'y'), 'y'),
  14. 'x')
  15. local_simplify_2 = PatternSub((true_div, (mul, 'x', 'y'), 'x'),
  16. 'y')

Note

OpSub, OpRemove and PatternSub produce local optimizers, whichmeans that everything we said previously about local optimizersapply: they need to be wrapped in a Navigator, etc.

Todo

wtf is a navigator?

When an optimization can be naturally expressed using OpSub, OpRemoveor PatternSub, it is highly recommended to use them.

WRITEME: more about using PatternSub (syntax for the patterns, how touse constraints, etc. - there’s some decent doc atPatternSub for those interested)

The optimization database (optdb)

Theano exports a symbol called optdb which acts as a sort ofordered database of optimizations. When you make a new optimization,you must insert it at the proper place in the database. Furthermore,you can give each optimization in the database a set of tags that canserve as a basis for filtering.

The point of optdb is that you might want to apply many optimizationsto a computation graph in many unique patterns. For example, you mightwant to do optimization X, then optimization Y, then optimizationZ. And then maybe optimization Y is an EquilibriumOptimizer containingLocalOptimizers A, B and C which are applied on every node of thegraph until they all fail to change it. If some optimizations act up,we want an easy way to turn them off. Ditto if some optimizations arevery CPU-intensive and we don’t want to take the time to apply them.

The optdb system allows us to tag each optimization with a unique nameas well as informative tags such as ‘stable’, ‘buggy’ or‘cpu_intensive’, all this without compromising the structure of theoptimizations.

Definition of optdb

optdb is an object which is an instance ofSequenceDB,itself a subclass of DB.There exist (for now) two types of DB, SequenceDB and EquilibriumDB.When given an appropriate Query, DB objects build an Optimizer matchingthe query.

A SequenceDB contains Optimizer or DB objects. Each of them has aname, an arbitrary number of tags and an integer representing theirorder in the sequence. When a Query is applied to a SequenceDB, allOptimizers whose tags match the query are inserted in proper order ina SequenceOptimizer, which is returned. If the SequenceDB contains DBinstances, the Query will be passed to them as well and the optimizersthey return will be put in their places.

An EquilibriumDB contains LocalOptimizer or DB objects. Each of themhas a name and an arbitrary number of tags. When a Query is applied toan EquilibriumDB, all LocalOptimizers that match the query areinserted into an EquilibriumOptimizer, which is returned. If theSequenceDB contains DB instances, the Query will be passed to them aswell and the LocalOptimizers they return will be put in their places(note that as of yet no DB can produce LocalOptimizer objects, so thisis a moot point).

Theano contains one principal DB object, optdb, whichcontains all of Theano’s optimizers with proper tags. It isrecommended to insert new Optimizers in it. As mentioned previously,optdb is a SequenceDB, so, at the top level, Theano applies a sequenceof global optimizations to the computation graphs.

Query

A Query is built by the following call:

  1. theano.gof.Query(include, require=None, exclude=None, subquery=None)
  • class Query
    • include
    • A set of tags (a tag being a string) such that everyoptimization obtained through this Query must have one of the tagslisted. This field is required and basically acts as a starting pointfor the search.

    • require

    • A set of tags such that every optimization obtainedthrough this Query must have all of these tags.

    • exclude

    • A set of tags such that every optimization obtainedthrough this Query must have none of these tags.

    • subquery

    • optdb can contain sub-databases; subquery is adictionary mapping the name of a sub-database to a special Query.If no subquery is given for a sub-database, the original Query will beused again.

Furthermore, a Query object includes three methods, including,requiring and excluding which each produce a new Query objectwith include, require and exclude sets refined to contain the new [WRITEME]

Examples

Here are a few examples of how to use a Query on optdb to produce anOptimizer:

  1. from theano.gof import Query
  2. from theano.compile import optdb
  3.  
  4. # This is how the optimizer for the fast_run mode is defined
  5. fast_run = optdb.query(Query(include=['fast_run']))
  6.  
  7. # This is how the optimizer for the fast_compile mode is defined
  8. fast_compile = optdb.query(Query(include=['fast_compile']))
  9.  
  10. # This is the same as fast_run but no optimizations will replace
  11. # any operation by an inplace version. This assumes, of course,
  12. # that all inplace operations are tagged as 'inplace' (as they
  13. # should!)
  14. fast_run_no_inplace = optdb.query(Query(include=['fast_run'],
  15. exclude=['inplace']))

Registering an Optimizer

Let’s say we have a global optimizer called simplify. We can addit to optdb as follows:

  1. # optdb.register(name, optimizer, order, *tags)
  2. optdb.register('simplify', simplify, 0.5, 'fast_run')

Once this is done, the FAST_RUN mode will automatically include youroptimization (since you gave it the ‘fast_run’ tag). Of course,already-compiled functions will see no change. The ‘order’ parameter(what it means and how to choose it) will be explained inoptdb structure below.

Registering a LocalOptimizer

LocalOptimizers may be registered in two ways:

  • Wrap them in a Navigator and insert them like a global optimizer(see previous section).
  • Put them in an EquilibriumDB.

Theano defines two EquilibriumDBs where you can put localoptimizations:

  • canonicalize()
  • This contains optimizations that aim to simplify the graph:

    • Replace rare or esoterical operations with their equivalents usingelementary operations.
    • Order operations in a canonical way (any sequence ofmultiplications and divisions can be rewritten to contain at mostone division, for example; xx can be rewritten x*2; etc.)
    • Fold constants (Constant(2)*Constant(2) becomes Constant(4))
  • specialize()
  • This contains optimizations that aim to specialize the graph:

    • Replace a combination of operations with a special operation thatdoes the same thing (but better).

For each group, all optimizations of the group that are selected bythe Query will be applied on the graph over and over again until noneof them is applicable, so keep that in mind when designing it: checkcarefully that your optimization leads to a fixpoint (a point where itcannot apply anymore) at which point it returns False to indicate itsjob is done. Also be careful not to undo the work of another localoptimizer in the group, because then the graph will oscillate betweentwo or more states and nothing will get done.

optdb structure

optdb contains the following Optimizers and sub-DBs, with the givenpriorities and tags:

OrderNameDescription
0merge1First merge operation
1canonicalizeSimplify the graph
2specializeAdd specialized operations
49merge2Second merge operation
49.5add_destroy_handlerEnable inplace optimizations
100merge3Third merge operation

The merge operations are meant to put together parts of the graph thatrepresent the same computation. Since optimizations can modify thegraph in such a way that two previously different-looking parts of thegraph become similar, we merge at the beginning, in the middle and atthe very end. Technically, we only really need to do it at the end,but doing it in previous steps reduces the size of the graph andtherefore increases the efficiency of the process.

See previous section for more information about the canonicalize andspecialize steps.

The add_destroy_handler step is not really an optimization. It isa marker. Basically:

Warning

Any optimization which inserts inplace operations in thecomputation graph must appear after the add_destroy_handler“optimizer”. In other words, the priority of any such optimizationmust be >= 50. Failure to comply by this restriction can leadto the creation of incorrect computation graphs.

The reason the destroy handler is not inserted at the beginning isthat it is costly to run. It is cheaper to run most optimizationsunder the assumption there are no inplace operations.

Navigator

WRITEME

Profiling Theano function compilation

You find that compiling a Theano function is taking too much time? Youcan get profiling information about Theano optimization. The normalTheano profiler will provide you with veryhigh-level information. The indentation shows the included in/subsetrelationship between sections. The top of its output look like this:

  1. Function profiling

Message: PATHTOA_FILE:23 Time in 0 calls to Function.__call: 0.000000e+00s Total compile time: 1.131874e+01s Number of Apply nodes: 50 Theano Optimizer time: 1.152431e+00s Theano validate time: 2.790451e-02s Theano Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s Import time 1.153541e-02s Time in all call to theano.grad() 4.732513e-02s

Explanations:

  • Total compile time: 1.131874e+01s gives the total time spent inside theano.function.
  • Number of Apply nodes: 50 means that after optimization, there are 50 apply node in the graph.
  • Theano Optimizer time: 1.152431e+00s means that we spend 1.15s in the theano.function phase where we optimize (modify) the graph to make it faster / more stable numerically / work on GPU /…
  • Theano validate time: 2.790451e-02s means that we spent 2.8e-2s in the validate subset of the optimization phase.
  • Theano Linker time (includes C, CUDA code generation/compiling): 7.893991e-02s means that we spent 7.9e-2s in linker phase of theano.function.
  • Import time 1.153541e-02s is a subset of the linker time where we import the compiled module.
  • Time in all call to theano.grad() 4.732513e-02s tells that we spent a total of 4.7e-2s in all calls to theano.grad. This is outside of the calls to theano.function.

The linker phase includes the generation of the C code, the time spentby g++ to compile and the time needed by Theano to build the object wereturn. The C code generation and compilation is cached, so the firsttime you compile a function and the following ones could take differentamount of execution time.

Detailed profiling of Theano optimizer

You can get more detailed profiling information about the Theanooptimizer phase by setting to True the Theano flagsconfig.profile_optimizer (this require config.profile to be _True_as well).

This will output something like this:

  1. Optimizer Profile

SeqOptimizer OPT_FAST_RUN time 1.152s for 123/50 nodes before/after optimization 0.028s for fgraph.validate() 0.131s for callback time - (name, class, index) - validate time 0.751816s - ('canonicalize', 'EquilibriumOptimizer', 4) - 0.004s EquilibriumOptimizer canonicalize time 0.751s for 14 passes nb nodes (start, end, max) 108 81 117 time io_toposort 0.029s time in local optimizers 0.687s time in global optimizers 0.010s 0 - 0.050s 27 (0.000s in global opts, 0.002s io_toposort) - 108 nodes - ('local_dimshuffle_lift', 9) ('local_upcast_elemwise_constant_inputs', 5) ('local_shape_to_shape_i', 3) ('local_fill_sink', 3) ('local_fill_to_alloc', 2) … 1 - 0.288s 26 (0.002s in global opts, 0.002s io_toposort) - 117 nodes - ('local_dimshuffle_lift', 8) ('local_fill_sink', 4) ('constant_folding', 4) ('local_useless_elemwise', 3) ('local_subtensor_make_vector', 3) … 2 - 0.044s 13 (0.002s in global opts, 0.003s io_toposort) - 96 nodes - ('constant_folding', 4) ('local_dimshuffle_lift', 3) ('local_fill_sink', 3) ('local_useless_elemwise', 1) ('local_fill_to_alloc', 1) … 3 - 0.045s 11 (0.000s in global opts, 0.002s io_toposort) - 91 nodes - ('constant_folding', 3) ('local_fill_to_alloc', 2) ('local_dimshuffle_lift', 2) ('local_mul_canonizer', 2) ('MergeOptimizer', 1) … 4 - 0.035s 8 (0.002s in global opts, 0.002s io_toposort) - 93 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1) 5 - 0.035s 6 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_fill_sink', 2) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('local_mul_canonizer', 1) 6 - 0.038s 10 (0.001s in global opts, 0.002s io_toposort) - 95 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 3) ('constant_folding', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) 7 - 0.032s 5 (0.001s in global opts, 0.002s io_toposort) - 91 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1) 8 - 0.034s 5 (0.000s in global opts, 0.002s io_toposort) - 92 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_greedy_distributor', 1) 9 - 0.031s 6 (0.001s in global opts, 0.002s io_toposort) - 90 nodes - ('local_fill_sink', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1) ('local_greedy_distributor', 1) 10 - 0.032s 5 (0.000s in global opts, 0.002s io_toposort) - 89 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_fill_sink', 1) 11 - 0.030s 5 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1) 12 - 0.026s 1 (0.000s in global opts, 0.003s io_toposort) - 81 nodes - ('MergeOptimizer', 1) 13 - 0.031s 0 (0.000s in global opts, 0.003s io_toposort) - 81 nodes - times - times applied - nb node created - name: 0.263s - 15 - 0 - constant_folding 0.096s - 2 - 14 - local_greedy_distributor 0.066s - 4 - 19 - local_mul_canonizer 0.046s - 28 - 57 - local_fill_sink 0.042s - 35 - 78 - local_dimshuffle_lift 0.018s - 5 - 15 - local_upcast_elemwise_constant_inputs 0.010s - 11 - 4 - MergeOptimizer 0.009s - 4 - 0 - local_useless_elemwise 0.005s - 11 - 2 - local_fill_to_alloc 0.004s - 3 - 6 - local_neg_to_mul 0.002s - 1 - 3 - local_lift_transpose_through_dot 0.002s - 3 - 4 - local_shape_to_shape_i 0.002s - 2 - 4 - local_subtensor_lift 0.001s - 3 - 0 - local_subtensor_make_vector 0.001s - 1 - 1 - local_sum_all_to_none 0.131s - in 62 optimization that where not used (display only those with a runtime > 0) 0.050s - local_add_canonizer 0.018s - local_mul_zero 0.016s - local_one_minus_erf 0.010s - local_func_inv 0.006s - local_0_dot_x 0.005s - local_track_shape_i 0.004s - local_mul_switch_sink 0.004s - local_fill_cut 0.004s - local_one_minus_erf2 0.003s - local_remove_switch_const_cond 0.003s - local_cast_cast 0.002s - local_IncSubtensor_serialize 0.001s - local_sum_div_dimshuffle 0.001s - local_div_switch_sink 0.001s - local_dimshuffle_no_inplace_at_canonicalize 0.001s - local_cut_useless_reduce 0.001s - local_reduce_join 0.000s - local_sum_sum 0.000s - local_useless_alloc 0.000s - local_reshape_chain 0.000s - local_useless_subtensor 0.000s - local_reshape_lift 0.000s - local_flatten_lift 0.000s - local_useless_slice 0.000s - local_subtensor_of_alloc 0.000s - local_subtensor_of_dot 0.000s - local_subtensor_merge 0.101733s - ('elemwise_fusion', 'SeqOptimizer', 13) - 0.000s SeqOptimizer elemwise_fusion time 0.102s for 78/50 nodes before/after optimization 0.000s for fgraph.validate() 0.004s for callback 0.095307s - ('composite_elemwise_fusion', 'FusionOptimizer', 1) - 0.000s FusionOptimizer nb_iter 3 nb_replacement 10 nb_inconsistency_replace 0 validate_time 0.000249624252319 callback_time 0.00316381454468 time_toposort 0.00375390052795 0.006412s - ('local_add_mul_fusion', 'FusionOptimizer', 0) - 0.000s FusionOptimizer nb_iter 2 nb_replacement 3 nb_inconsistency_replace 0 validate_time 6.43730163574e-05 callback_time 0.000783205032349 time_toposort 0.0035240650177 0.090089s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 30) - 0.019s 0.048993s - ('BlasOpt', 'SeqOptimizer', 8) - 0.000s SeqOptimizer BlasOpt time 0.049s for 81/80 nodes before/after optimization 0.000s for fgraph.validate() 0.003s for callback 0.035997s - ('gemm_optimizer', 'GemmOptimizer', 1) - 0.000s GemmOptimizer nb_iter 2 nb_replacement 2 nb_replacement_didn_t_remove 0 nb_inconsistency_make 0 nb_inconsistency_replace 0 time_canonicalize 0.00720071792603 time_factor_can 9.05990600586e-06 time_factor_list 0.00128507614136 time_toposort 0.00311398506165 validate_time 4.60147857666e-05 callback_time 0.00174236297607 0.004569s - ('local_dot_to_dot22', 'TopoOptimizer', 0) - 0.000s TopoOptimizer nb_node (start, end, changed) (81, 81, 5) init io_toposort 0.00139284133911 loop time 0.00312399864197 callback_time 0.00172805786133 0.002283s - ('local_dot22_to_dot22scalar', 'TopoOptimizer', 2) - 0.000s TopoOptimizer nb_node (start, end, changed) (80, 80, 0) init io_toposort 0.00171804428101 loop time 0.000502109527588 callback_time 0.0 0.002257s - ('local_gemm_to_gemv', 'EquilibriumOptimizer', 3) - 0.000s EquilibriumOptimizer local_gemm_to_gemv time 0.002s for 1 passes nb nodes (start, end, max) 80 80 80 time io_toposort 0.001s time in local optimizers 0.000s time in global optimizers 0.000s 0 - 0.002s 0 (0.000s in global opts, 0.001s io_toposort) - 80 nodes - 0.002227s - ('use_c_blas', 'TopoOptimizer', 4) - 0.000s TopoOptimizer nb_node (start, end, changed) (80, 80, 0) init io_toposort 0.0014750957489 loop time 0.00068998336792 callback_time 0.0 0.001632s - ('use_scipy_ger', 'TopoOptimizer', 5) - 0.000s TopoOptimizer nb_node (start, end, changed) (80, 80, 0) init io_toposort 0.00138401985168 loop time 0.000202178955078 callback_time 0.0 0.031740s - ('specialize', 'EquilibriumOptimizer', 9) - 0.000s EquilibriumOptimizer specialize time 0.031s for 2 passes nb nodes (start, end, max) 80 78 80 time io_toposort 0.003s time in local optimizers 0.022s time in global optimizers 0.004s 0 - 0.017s 6 (0.002s in global opts, 0.001s io_toposort) - 80 nodes - ('constant_folding', 2) ('local_mul_to_sqr', 1) ('local_elemwise_alloc', 1) ('local_div_to_inv', 1) ('local_mul_specialize', 1) 1 - 0.014s 0 (0.002s in global opts, 0.001s io_toposort) - 78 nodes - times - times applied - nb node created - name: 0.003s - 1 - 1 - local_mul_specialize 0.002s - 1 - 2 - local_elemwise_alloc 0.002s - 2 - 0 - constant_folding 0.001s - 1 - 1 - local_div_to_inv 0.001s - 1 - 1 - local_mul_to_sqr 0.016s - in 69 optimization that where not used (display only those with a runtime > 0) 0.004s - crossentropy_to_crossentropy_with_softmax_with_bias 0.002s - local_one_minus_erf 0.002s - Elemwise{sub,no_inplace}(z, Elemwise{mul,no_inplace}(alpha subject to <function <lambda> at 0x7f475e4da050>, SparseDot(x, y))) -> Usmm{no_inplace}(Elemwise{neg,no_inplace}(alpha), x, y, z) 0.002s - local_add_specialize 0.001s - local_func_inv 0.001s - local_useless_elemwise 0.001s - local_abs_merge 0.001s - local_track_shape_i 0.000s - local_one_minus_erf2 0.000s - local_sum_mul_by_scalar 0.000s - local_elemwise_sub_zeros 0.000s - local_cast_cast 0.000s - local_alloc_unary 0.000s - Elemwise{log,no_inplace}(Softmax(x)) -> <function make_out_pattern at 0x7f47619a8410>(x) 0.000s - local_sum_div_dimshuffle 0.000s - local_sum_alloc 0.000s - local_dimshuffle_lift 0.000s - local_reduce_broadcastable 0.000s - local_grad_log_erfc_neg 0.000s - local_advanced_indexing_crossentropy_onehot 0.000s - local_log_erfc 0.000s - local_log1p 0.000s - local_log_add 0.000s - local_useless_alloc 0.000s - local_neg_neg 0.000s - local_neg_div_neg …

To understand this profile here is some explanation of how optimizations work:

  • Optimizations are organized in an hierarchy. At the top level, thereis a SeqOptimizer (Sequence Optimizer). It contains other optimizers,and applies them in the order they were specified. Those sub-optimizers can beof other types, but are all global optimizers.

  • Each Optimizer in the hierarchy will print some stats aboutitself. The information that it prints depends of the type of theoptimizer.

  • The SeqOptimizer will print some stats at the start:

  1. Optimizer Profile
  2. ————————-
  3. SeqOptimizer OPT_FAST_RUN time 1.152s for 123/50 nodes before/after optimization
  4. 0.028s for fgraph.validate()
  5. 0.131s for callback
  6. time - (name, class, index) - validate time

Then it will print, with some additional indentation, each sub-optimizer’s profileinformation. These sub-profiles are ordered by the time they took to execute,not by their execution order.

  • OPT_FAST_RUN is the name of the optimizer
  • 1.152s is the total time spent in that optimizer
  • 123/50 means that before this optimization, there were 123 apply node in the function graph, and after only 50.
  • 0.028s means it spent that time calls to fgraph.validate()
  • 0.131s means it spent that time for callbacks. This is a mechanism that can trigger other execution when there is a change to the FunctionGraph.
  • time - (name, class, index) - validate time tells how the information for each sub-optimizer get printed.
  • All other instances of SeqOptimizer are described like this. In particular, some sub-optimizer from OPT_FAST_RUN that are also SeqOptimizer.
    • The SeqOptimizer will print some stats at the start:
  1. 0.751816s - ('canonicalize', 'EquilibriumOptimizer', 4) - 0.004s
  2. EquilibriumOptimizer canonicalize
  3. time 0.751s for 14 passes
  4. nb nodes (start, end, max) 108 81 117
  5. time io_toposort 0.029s
  6. time in local optimizers 0.687s
  7. time in global optimizers 0.010s
  8. 0 - 0.050s 27 (0.000s in global opts, 0.002s io_toposort) - 108 nodes - ('local_dimshuffle_lift', 9) ('local_upcast_elemwise_constant_inputs', 5) ('local_shape_to_shape_i', 3) ('local_fill_sink', 3) ('local_fill_to_alloc', 2)
  9. 1 - 0.288s 26 (0.002s in global opts, 0.002s io_toposort) - 117 nodes - ('local_dimshuffle_lift', 8) ('local_fill_sink', 4) ('constant_folding', 4) ('local_useless_elemwise', 3) ('local_subtensor_make_vector', 3)
  10. 2 - 0.044s 13 (0.002s in global opts, 0.003s io_toposort) - 96 nodes - ('constant_folding', 4) ('local_dimshuffle_lift', 3) ('local_fill_sink', 3) ('local_useless_elemwise', 1) ('local_fill_to_alloc', 1)
  11. 3 - 0.045s 11 (0.000s in global opts, 0.002s io_toposort) - 91 nodes - ('constant_folding', 3) ('local_fill_to_alloc', 2) ('local_dimshuffle_lift', 2) ('local_mul_canonizer', 2) ('MergeOptimizer', 1)
  12. 4 - 0.035s 8 (0.002s in global opts, 0.002s io_toposort) - 93 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1)
  13. 5 - 0.035s 6 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_fill_sink', 2) ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('local_mul_canonizer', 1)
  14. 6 - 0.038s 10 (0.001s in global opts, 0.002s io_toposort) - 95 nodes - ('local_fill_sink', 3) ('local_dimshuffle_lift', 3) ('constant_folding', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1)
  15. 7 - 0.032s 5 (0.001s in global opts, 0.002s io_toposort) - 91 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1)
  16. 8 - 0.034s 5 (0.000s in global opts, 0.002s io_toposort) - 92 nodes - ('local_fill_sink', 3) ('MergeOptimizer', 1) ('local_greedy_distributor', 1)
  17. 9 - 0.031s 6 (0.001s in global opts, 0.002s io_toposort) - 90 nodes - ('local_fill_sink', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_dimshuffle_lift', 1) ('local_greedy_distributor', 1)
  18. 10 - 0.032s 5 (0.000s in global opts, 0.002s io_toposort) - 89 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('local_fill_sink', 1)
  19. 11 - 0.030s 5 (0.000s in global opts, 0.002s io_toposort) - 88 nodes - ('local_dimshuffle_lift', 2) ('local_fill_to_alloc', 1) ('MergeOptimizer', 1) ('constant_folding', 1)
  20. 12 - 0.026s 1 (0.000s in global opts, 0.003s io_toposort) - 81 nodes - ('MergeOptimizer', 1)
  21. 13 - 0.031s 0 (0.000s in global opts, 0.003s io_toposort) - 81 nodes -
  22. times - times applied - nb node created - name:
  23. 0.263s - 15 - 0 - constant_folding
  24. 0.096s - 2 - 14 - local_greedy_distributor
  25. 0.066s - 4 - 19 - local_mul_canonizer
  26. 0.046s - 28 - 57 - local_fill_sink
  27. 0.042s - 35 - 78 - local_dimshuffle_lift
  28. 0.018s - 5 - 15 - local_upcast_elemwise_constant_inputs
  29. 0.010s - 11 - 4 - MergeOptimizer
  30. 0.009s - 4 - 0 - local_useless_elemwise
  31. 0.005s - 11 - 2 - local_fill_to_alloc
  32. 0.004s - 3 - 6 - local_neg_to_mul
  33. 0.002s - 1 - 3 - local_lift_transpose_through_dot
  34. 0.002s - 3 - 4 - local_shape_to_shape_i
  35. 0.002s - 2 - 4 - local_subtensor_lift
  36. 0.001s - 3 - 0 - local_subtensor_make_vector
  37. 0.001s - 1 - 1 - local_sum_all_to_none
  38. 0.131s - in 62 optimization that where not used (display only those with a runtime > 0)
  39. 0.050s - local_add_canonizer
  40. 0.018s - local_mul_zero
  41. 0.016s - local_one_minus_erf
  42. 0.010s - local_func_inv
  43. 0.006s - local_0_dot_x
  44. 0.005s - local_track_shape_i
  45. 0.004s - local_mul_switch_sink
  46. 0.004s - local_fill_cut
  47. 0.004s - local_one_minus_erf2
  48. 0.003s - local_remove_switch_const_cond
  49. 0.003s - local_cast_cast
  50. 0.002s - local_IncSubtensor_serialize
  51. 0.001s - local_sum_div_dimshuffle
  52. 0.001s - local_div_switch_sink
  53. 0.001s - local_dimshuffle_no_inplace_at_canonicalize
  54. 0.001s - local_cut_useless_reduce
  55. 0.001s - local_reduce_join
  56. 0.000s - local_sum_sum
  57. 0.000s - local_useless_alloc
  58. 0.000s - local_reshape_chain
  59. 0.000s - local_useless_subtensor
  60. 0.000s - local_reshape_lift
  61. 0.000s - local_flatten_lift
  62. 0.000s - local_useless_slice
  63. 0.000s - local_subtensor_of_alloc
  64. 0.000s - local_subtensor_of_dot
  65. 0.000s - local_subtensor_merge
  • 0.751816s - ('canonicalize', 'EquilibriumOptimizer', 4) - 0.004sThis line is from SeqOptimizer, and indicates information relatedto a sub-optimizer. It means that this sub-optimizer tooka total of .7s. Its name is 'canonicalize'. It is anEquilibriumOptimizer. It was executed at index 4 by theSeqOptimizer. It spent 0.004s in the validate phase.

  • All other lines are from the profiler of the EquilibriumOptimizer.

  • An EquilibriumOptimizer does multiple passes on the Apply nodes fromthe graph, trying to apply local and global optimizations.Conceptually, it tries to execute all global optimizations,and to apply all local optimizations on allnodes in the graph. If no optimization got applied during a pass, itstops. So it tries to find an equilibrium state where none of theoptimizations get applied. This is useful when we do not know a fixed order forthe execution of the optimization.

  • time 0.751s for 14 passes means that it took .7s and did 14 passes over the graph.

  • nb nodes (start, end, max) 108 81 117 means that at the start,the graph had 108 node, at the end, it had 81 and the maximum sizewas 117.

  • Then it prints some global timing information: it spent 0.029s inio_toposort, all local optimizers took 0.687s together for allpasses, and global optimizers took a total of 0.010s.

  • Then we print the timing for each pass, the optimization thatgot applied, and the number of time they got applied. For example,in pass 0, the local_dimshuffle_lift optimizer changed the graph 9time.

  • Then we print the time spent in each optimizer, the number of timesthey changed the graph and the number of nodes they introduced inthe graph.

  • Optimizations with that pattern local_op_lift means that a nodewith that op will be replaced by another node, with the same op,but will do computation closer to the inputs of the graph.For instance, local_op(f(x)) getting replaced by f(local_op(x)).

  • Optimization with that pattern local_op_sink is the opposite oflift. For instance f(local_op(x)) getting replaced by local_op(f(x)).

  • Local optimizers can replace any arbitrary node in the graph, notonly the node it received as input. For this, it must return adict. The keys being nodes to replace and thevalues being the corresponding replacement.

This is useful to replace a client of the node received asparameter.