Profiling

This part describes the profiling abilities of Cython. If you are familiar with profiling pure Python code, you only need to read the first section (Cython Profiling Basics). If you are not familiar with Python profiling, you should also read the tutorial (Profiling Tutorial), which takes you through a complete example step by step.

Cython Profiling Basics

Profiling in Cython is controlled by a compiler directive. It can be set either for an entire file or on a per-function basis via a Cython decorator.

Enabling profiling for a complete source file

Profiling is enabled for a complete source file via a global directive to theCython compiler at the top of a file:

    # cython: profile=True

Note that profiling adds a slight overhead to each function call, making your program a little slower (or a lot, if you call some small functions very often).

Once enabled, your Cython code will behave just like Python code when called from the cProfile module. This means you can profile your Cython code together with your Python code using the same tools as for Python code alone.
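As a quick illustration, a Cython function compiled with profile=True appears in cProfile output just like the plain Python stand-in below (my_func and the report handling are only illustrative, not part of any Cython API):

```python
import cProfile
import io
import pstats

def my_func():
    # Stand-in for a function compiled with "# cython: profile=True";
    # a real Cython function would be reported the same way.
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
my_func()
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print("my_func" in report)  # the function shows up in the profile
```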

Disabling profiling function-wise

If your profiling is messed up because of the call overhead to some small functions that you would rather not see in your profile - either because you plan to inline them anyway or because you are sure that you can't make them any faster - you can use a special decorator to disable profiling for one function only (regardless of whether it is globally enabled or not):

    cimport cython

    @cython.profile(False)
    def my_often_called_function():
        pass

Enabling line tracing

To get more detailed trace information (for tools that can make use of it), you can enable line tracing:

    # cython: linetrace=True

This will also enable profiling support, so the profile=True option above is not needed. Line tracing is needed for coverage analysis, for example.

Note that even if line tracing is enabled via the compiler directive, it is not used by default. As the runtime slowdown can be substantial, it must additionally be compiled in by the C compiler by setting the C macro definition CYTHON_TRACE=1. To include nogil functions in the trace, set CYTHON_TRACE_NOGIL=1 (which implies CYTHON_TRACE=1). C macros can be defined either in the extension definition of the setup.py script or by setting the respective distutils options in the source file with the following file header comment (if cythonize() is used for compilation):

    # distutils: define_macros=CYTHON_TRACE_NOGIL=1
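For the setup.py route, a minimal sketch might look like this (the module name mymodule and its .pyx file are placeholders, not taken from this document):

```python
# setup.py - minimal sketch of setting the trace macro in the extension
# definition; "mymodule" and "mymodule.pyx" are placeholder names.
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension(
        "mymodule",
        ["mymodule.pyx"],
        # Compile the line tracing hooks into the generated C code.
        define_macros=[("CYTHON_TRACE_NOGIL", "1")],
    ),
]

setup(
    ext_modules=cythonize(
        extensions,
        # Equivalent to the "# cython: linetrace=True" file header.
        compiler_directives={"linetrace": True},
    ),
)
```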

Enabling coverage analysis

Since Cython 0.23, line tracing (see above) also enables support for coverage reporting with the coverage.py tool. To make the coverage analysis understand Cython modules, you also need to enable Cython's coverage plugin in your .coveragerc file as follows:

    [run]
    plugins = Cython.Coverage

With this plugin, your Cython source files should show up normally in the coverage reports.

To include the coverage report in the Cython annotated HTML file, you need to first run the coverage.py tool to generate an XML result file. Pass this file into the cython command as follows:

    $ cython --annotate-coverage coverage.xml package/mymodule.pyx

This will recompile the Cython module and generate one HTML output file next to each Cython source file it processes, containing colour markers for lines that were contained in the coverage report.
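Putting the steps together, a hypothetical end-to-end session could look like this (test_mymodule.py and the module names are placeholders):

```shell
# Build with the tracing macro compiled in (see the linetrace section above).
python setup.py build_ext --inplace

# Run the test suite under coverage.py, with Cython.Coverage enabled
# in .coveragerc as shown above.
coverage run test_mymodule.py
coverage report      # terminal summary, including .pyx files
coverage xml         # writes coverage.xml

# Feed the XML result back into cython for annotated HTML output.
cython --annotate-coverage coverage.xml package/mymodule.pyx
```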

Profiling Tutorial

This will be a complete tutorial, start to finish, of profiling Python code, turning it into Cython code, and profiling it further until it is fast enough.

As a toy example, we would like to evaluate the summation of the reciprocals of squares up to a certain integer n for evaluating \pi. The relation we want to use was proven by Euler in 1735 and is known as the Basel problem.

\pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} = 6 \lim_{k \to \infty} \left( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{k^2} \right) \approx 6 \left( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2} \right)

A simple Python code for evaluating the truncated sum looks like this:

    # calc_pi.py

    def recip_square(i):
        return 1. / i ** 2

    def approx_pi(n=10000000):
        val = 0.
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** .5

On my box, this needs approximately 4 seconds to run the function with the default n. The higher we choose n, the better the approximation for \pi will be. An experienced Python programmer will already see plenty of places to optimize this code. But remember the golden rule of optimization: Never optimize without having profiled. Let me repeat this: Never optimize without having profiled your code. Your thoughts about which part of your code takes too much time are wrong. At least, mine are always wrong. So let's write a short script to profile our code:

    # profile.py

    import pstats, cProfile

    import calc_pi

    cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof")

    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()

Running this on my box gives the following output:

    Sat Nov  7 17:40:54 2009    Profile.prof

             10000004 function calls in 6.211 CPU seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    3.243    3.243    6.211    6.211 calc_pi.py:7(approx_pi)
     10000000    2.526    0.000    2.526    0.000 calc_pi.py:4(recip_square)
            1    0.442    0.442    0.442    0.442 {range}
            1    0.000    0.000    6.211    6.211 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells us that the code runs in 6.2 CPU seconds. Note that the code got slower by about 2 seconds because it ran inside the cProfile module. The table contains the really valuable information. You might want to check the Python profiling documentation for the nitty-gritty details. The most important columns here are tottime (total time spent in this function, not counting functions that were called by this function) and cumtime (total time spent in this function, also counting the functions called by this function). Looking at the tottime column, we see that approximately half the time is spent in approx_pi and the other half is spent in recip_square. Also, half a second is spent in range ... of course we should have used xrange for such a big iteration. And in fact, just changing range to xrange makes the code run in 5.8 seconds.
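The difference between tottime and cumtime can be demonstrated with a small pure-Python experiment (inner and outer are made-up names for this sketch):

```python
import cProfile
import pstats

def inner():
    # Does the actual number crunching.
    return sum(range(200000))

def outer():
    # Most of its cumtime is really spent inside inner().
    total = 0
    for _ in range(5):
        total += inner()
    return total

profiler = cProfile.Profile()
profiler.runcall(outer)
stats = pstats.Stats(profiler)

# stats.stats maps (filename, lineno, funcname) to
# (callcount, ncalls, tottime, cumtime, callers).
timings = {key[2]: value for key, value in stats.stats.items()}
cc, nc, tottime, cumtime, callers = timings["outer"]
print(cumtime >= tottime)  # cumtime counts inner(), tottime does not
```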

We could optimize a lot in the pure Python version, but since we are interested in Cython, let's move forward and bring this module to Cython. We would do this anyway at some point to get the loop to run faster. Here is our first Cython version:

    # cython: profile=True

    # calc_pi.pyx

    def recip_square(int i):
        return 1. / i ** 2

    def approx_pi(int n=10000000):
        cdef double val = 0.
        cdef int k
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** .5

Note the first line: we have to tell Cython that profiling should be enabled. This makes the Cython code slightly slower, but without it we would not get meaningful output from the cProfile module. The rest of the code is mostly unchanged; I only typed some variables, which will likely speed things up a bit.

We also need to modify our profiling script to import the Cython module directly. Here is the complete version, adding the import of the pyximport module:

    # profile.py

    import pstats, cProfile

    import pyximport
    pyximport.install()

    import calc_pi

    cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof")

    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()

We only added two lines; the rest stays completely the same. Alternatively, we could also manually compile our code into an extension; then we wouldn't need to change the profile script at all. The script now outputs the following:

    Sat Nov  7 18:02:33 2009    Profile.prof

             10000004 function calls in 4.406 CPU seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    3.305    3.305    4.406    4.406 calc_pi.pyx:7(approx_pi)
     10000000    1.101    0.000    1.101    0.000 calc_pi.pyx:4(recip_square)
            1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
            1    0.000    0.000    4.406    4.406 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We gained 1.8 seconds. Not too shabby. Comparing the output to the previous one, we see that the recip_square function got faster while the approx_pi function has not changed a lot. Let's concentrate on the recip_square function a bit more. First, note that this function is not meant to be called from code outside of our module, so it would be wise to turn it into a cdef function to reduce call overhead. We should also get rid of the power operator: it is turned into a pow(i, 2) function call by Cython, whereas just writing i * i could be faster. The whole function is also a good candidate for inlining. Let's look at the necessary changes for these ideas:

    # cython: profile=True

    # calc_pi.pyx

    cdef inline double recip_square(int i):
        return 1. / (i * i)

    def approx_pi(int n=10000000):
        cdef double val = 0.
        cdef int k
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** .5

Now running the profile script yields:

    Sat Nov  7 18:10:11 2009    Profile.prof

             10000004 function calls in 2.622 CPU seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    1.782    1.782    2.622    2.622 calc_pi.pyx:7(approx_pi)
     10000000    0.840    0.000    0.840    0.000 calc_pi.pyx:4(recip_square)
            1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
            1    0.000    0.000    2.622    2.622 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

That bought us another 1.8 seconds. Not the dramatic change we could have expected. And why is recip_square still in this table; it is supposed to be inlined, isn't it? The reason is that Cython still generates profiling code even if the function call is eliminated. Let's tell it not to profile recip_square any more; we couldn't get the function to be much faster anyway:

    # cython: profile=True

    # calc_pi.pyx

    cimport cython

    @cython.profile(False)
    cdef inline double recip_square(int i):
        return 1. / (i * i)

    def approx_pi(int n=10000000):
        cdef double val = 0.
        cdef int k
        for k in range(1, n + 1):
            val += recip_square(k)
        return (6 * val) ** .5

Running this shows an interesting result:

    Sat Nov  7 18:15:02 2009    Profile.prof

             4 function calls in 0.089 CPU seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.089    0.089    0.089    0.089 calc_pi.pyx:10(approx_pi)
            1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
            1    0.000    0.000    0.089    0.089 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

First, note the tremendous speed gain: this version only takes 1/50 of the time of our first Cython version. Also note that recip_square has vanished from the table like we wanted. But the most peculiar and important change is that approx_pi also got much faster. This is a problem with all profiling: calling a function in a profile run adds a certain overhead to the function call. This overhead is not added to the time spent in the called function, but to the time spent in the calling function. In this example, approx_pi didn't actually need 2.622 seconds in the last run; but it called recip_square 10000000 times, each time taking a little while to set up profiling for it. This adds up to a massive time loss of around 2.6 seconds. Having disabled profiling for the often-called function now reveals realistic timings for approx_pi; we could continue optimizing it now if needed.

This concludes our profiling tutorial. There is still some room for improvement in this code. We could try to replace the power operator in approx_pi with a call to sqrt from the C stdlib; but this is not necessarily faster than calling pow(x, 0.5).
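As a sanity check before such a change, the two spellings compute the same value; only their speed can differ once compiled. A quick pure-Python check (n = 1000 here is arbitrary):

```python
import math

# Truncated Basel sum for a small n.
val = sum(1.0 / (k * k) for k in range(1, 1001))

# math.sqrt and the ** .5 power operator agree on the result (up to
# rounding in the last bit), so swapping them is purely a speed question.
assert abs(math.sqrt(6 * val) - (6 * val) ** 0.5) < 1e-12
print((6 * val) ** 0.5)  # already close to pi for n = 1000
```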

Even so, the result we achieved here is quite satisfactory: we came up with a solution that is much faster than our original Python version while retaining functionality and readability.