Profiling Theano function
Note
This method replaces the old ProfileMode. Do not use ProfileMode anymore.
Besides checking for errors, another important task is to profile your code in terms of speed and/or memory usage.
You can profile your functions using either of the following two options:

- Use the Theano flag config.profile to enable profiling.
  - To enable the memory profiler, use the Theano flag config.profile_memory in addition to config.profile.
  - Moreover, to enable profiling of the Theano optimization phase, use the Theano flag config.profile_optimizer in addition to config.profile.
  - You can also use the Theano flags profiling.n_apply, profiling.n_ops and profiling.min_memory_size to modify the quantity of information printed.
- Pass the argument profile=True to theano.function, then call f.profile.summary() on the returned function f.
  - Use this option when you want to profile not all the functions but one or more specific function(s).
  - You can also combine the profiles of many functions.

The profiler outputs one profile per Theano function, plus a profile that is the sum of the printed profiles. Each profile contains four sections: global info, class info, Ops info and Apply node info.
In the global section, the "Message" is the name of the Theano function. theano.function() has an optional parameter name that defaults to None. Change it to something else to help you profile many Theano functions. In that section, we also see the number of times the function was called (1) and the total time spent in all those calls. The time spent in Function.fn.__call__ and in the thunks is useful for understanding Theano overhead.
Also, we see the time spent in the two parts of the compilation process: optimization (modifying the graph to make it more stable/faster) and linking (compiling C code and making the Python callable returned by function).
The class, Ops and Apply node sections contain the same information: information about the Apply nodes that ran. The Ops section takes the information from the Apply section and merges the Apply nodes that have exactly the same op. If two Apply nodes in the graph have two Ops that compare equal, they will be merged. Some Ops, like Elemwise, will not compare equal if their parameters differ (the scalar operation being executed). So the class section will merge more Apply nodes than the Ops section.
Note that the profile also shows which Ops were running a C implementation.
Developers wishing to optimize the performance of their graph should focus on the worst-offending Ops and Apply nodes, either by optimizing an implementation, providing a missing C implementation, or by writing a graph optimization that eliminates the offending Op altogether. You should strongly consider emailing one of our lists about your issue before spending too much time on this.
Here is an example output when we disable some Theano optimizations, to give you a better idea of the difference between sections. With all optimizations enabled, there would be only one Op left in the graph.
Note
To profile the peak memory usage on the GPU, you need to:

- In the file theano/sandbox/cuda/cuda_ndarray.cu, set the macro COMPUTE_GPU_MEM_USED to 1.
- Then call theano.sandbox.cuda.theano_allocated(). It returns a tuple of two ints: the first is the current GPU memory allocated by Theano; the second is the peak GPU memory that was allocated by Theano.
Do not always enable this, as it slows down memory allocation and freeing. Since that slows down the computation, it will also distort speed profiling, so don't use both at the same time.
To run the example:
THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
The output:
Function profiling
  Message: None
  Time in 1 calls to Function.__call__: 5.698204e-05s
  Time in Function.fn.__call__: 1.192093e-05s (20.921%)
  Time in thunks: 6.198883e-06s (10.879%)
  Total compile time: 3.642474e+00s
    Theano Optimizer time: 7.326508e-02s
      Theano validate time: 3.712177e-04s
    Theano Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s
Class
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  100.0%  100.0%  0.000s  2.07e-06s  C  3  3  <class 'theano.tensor.elemwise.Elemwise'>
  ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  65.4%   65.4%  0.000s  2.03e-06s  C  2  2  Elemwise{add,no_inplace}
  34.6%  100.0%  0.000s  2.15e-06s  C  1  1  Elemwise{mul,no_inplace}
  ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  50.0%   50.0%  0.000s  3.10e-06s  1  0  Elemwise{add,no_inplace}(x, y)
  34.6%   84.6%  0.000s  2.15e-06s  1  2  Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
  15.4%  100.0%  0.000s  9.54e-07s  1  1  Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
  ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)