Standard Interpreter Optimizations

Introduction

One of the advantages – indeed, one of the motivating goals – of the PyPy standard interpreter (compared to CPython) is its increased flexibility and configurability.

One example of this is that we can provide several implementations of the same object (e.g. lists) without exposing any difference to application-level code. This makes it easy to provide a specialized implementation of a type that is optimized for a certain situation without disturbing the implementation for the regular case.

This document describes several such optimizations. Most of them are not enabled by default. Also, for many of these optimizations it is not clear whether they are worth it in practice for a real-world application (they sure make some microbenchmarks a lot faster and use less memory, which is not saying too much). If you have any observation in that direction, please let us know! By the way: alternative object implementations are a great way to get into PyPy development since you have to know only a rather small part of PyPy to do them. And they are fun too!

Object Optimizations

Integer Optimizations

Caching Small Integers

Similar to CPython, it is possible to enable caching of small integer objects to not have to allocate all the time when doing simple arithmetic. Every time a new integer object is created it is checked whether the integer is small enough to be retrieved from the cache.
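
The mechanism can be pictured with a few lines of plain Python. This is an illustrative sketch, not PyPy's actual code: the cache bounds (-5 to 256) are borrowed from CPython's well-known defaults, and the class and function names are invented.

```python
# Illustrative sketch of a prebuilt small-integer cache, not PyPy's actual
# implementation. The cache range (-5 to 256) mirrors CPython's defaults;
# in PyPy the bounds are configurable at translation time.

CACHE_MIN, CACHE_MAX = -5, 256

class W_Int:
    """Stand-in for an interpreter-level boxed integer (hypothetical name)."""
    def __init__(self, value):
        self.value = value

# Prebuild the boxes once, at interpreter startup.
_small_int_cache = {n: W_Int(n) for n in range(CACHE_MIN, CACHE_MAX + 1)}

def wrap_int(n):
    # Reuse a prebuilt box when the value is small enough; otherwise allocate.
    if CACHE_MIN <= n <= CACHE_MAX:
        return _small_int_cache[n]
    return W_Int(n)
```

With this in place, wrapping the same small value twice returns the very same box, while a large value allocates a fresh object each time.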

This option is disabled by default; you can enable it with the --objspace-std-withprebuiltint option.

Integers as Tagged Pointers

An even more aggressive way to save memory when using integers is the "small int" integer implementation. It is another integer implementation used for integers that only need 31 bits (or 63 bits on a 64 bit machine). These integers are represented as tagged pointers by setting their lowest bits to distinguish them from normal pointers. This completely avoids the boxing step, saving time and memory.
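
The encoding can be sketched in plain Python. The function names and the 32-bit word size here are assumptions for illustration; the real implementation operates on machine words directly rather than on Python integers.

```python
# Hedged sketch of the tagging scheme, not actual PyPy code: a machine word
# whose lowest bit is 1 holds a 31-bit (or 63-bit) integer directly; a lowest
# bit of 0 means an ordinary (aligned) pointer.

def tag_int(n, word_bits=32):
    """Encode n as a tagged word, or return None if it does not fit."""
    limit = 1 << (word_bits - 2)      # one bit lost to the tag, one to the sign
    if -limit <= n < limit:
        return ((n << 1) | 1) & ((1 << word_bits) - 1)
    return None                       # too big: a real boxed object is needed

def untag_int(word, word_bits=32):
    """Decode a tagged word back to its integer value."""
    assert word & 1, "not a tagged integer"
    n = word >> 1
    if n >= 1 << (word_bits - 2):     # restore the sign (arithmetic shift)
        n -= 1 << (word_bits - 1)
    return n
```

Values that fit round-trip through a single word with no allocation; anything wider falls back to a normal boxed representation.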

You can enable this feature with the --objspace-std-withsmalllong option.

Dictionary Optimizations

Dict Strategies

Dict strategies are an implementation approach for dictionaries (and lists) that makes it possible to use a specialized representation of the dictionary's data, while still being able to switch back to a general representation should that become necessary later.

Dict strategies are always enabled; by default there are special strategies for dicts with just string keys, just unicode keys and just integer keys. If one of those specialized strategies is used, then dict lookup can use much faster hashing and comparison for the dict keys. There is of course also a strategy for general keys.
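
A toy version of the switching mechanism, with invented names, only two strategies and none of PyPy's actual machinery, might look like this:

```python
# Simplified sketch of the strategy idea (names are made up, not PyPy's).
# A dict starts with a specialized strategy and devolves to a generic one
# the moment a key of the "wrong" type appears.

class IntKeyStrategy:
    """Specialized storage: keys are known to be ints, so hashing is cheap."""
    def setitem(self, d, key, value):
        if isinstance(key, int):
            d.storage[key] = value
        else:
            d.switch_to_generic()           # wrong key type: change strategy
            d.strategy.setitem(d, key, value)
    def getitem(self, d, key):
        return d.storage[key]

class GenericStrategy:
    """Fallback: any hashable key is accepted."""
    def setitem(self, d, key, value):
        d.storage[key] = value
    def getitem(self, d, key):
        return d.storage[key]

class StrategyDict:
    def __init__(self):
        self.strategy = IntKeyStrategy()
        self.storage = {}
    def switch_to_generic(self):
        self.strategy = GenericStrategy()   # existing storage is kept
    def __setitem__(self, key, value):
        self.strategy.setitem(self, key, value)
    def __getitem__(self, key):
        return self.strategy.getitem(self, key)
```

As long as only integer keys are used, every operation goes through the specialized code path; the first non-integer key silently switches the dict to the general strategy without losing its contents.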

Identity Dicts

We also have a strategy specialized for keys that are instances of classes which compare "by identity", which is the default unless you override __hash__, __eq__ or __cmp__. This strategy will be used only with new-style classes.
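
The application-level behaviour that makes this strategy safe can be demonstrated directly; this is standard Python semantics, nothing PyPy-specific:

```python
# Instances of a class that does not override __eq__ or __hash__ hash and
# compare by identity, so a dict holding only such keys never needs to run
# user-defined comparison code.

class Plain:
    pass

class ByValue:
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, ByValue) and self.x == other.x
    def __hash__(self):
        return hash(self.x)

a, b = Plain(), Plain()
d = {a: 1}
assert a in d          # same object: found
assert b not in d      # distinct object, distinct key, even though "equal-ish"

# ByValue keys force the general strategy: equality runs user code.
assert ByValue(3) == ByValue(3)
```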

Map Dicts

Map dictionaries are a special representation used together with dict strategies. This dict strategy is used only for instance dictionaries and tries to make instance dictionaries use less memory (in fact, usually memory behaviour should be mostly like that of using slots).

The idea is the following: Most instances of the same class have very similar attributes, and are even adding these keys to the dictionary in the same order while __init__() is being executed. That means that all the dictionaries of these instances look very similar: they have the same set of keys with different values per instance. What map dicts do is store these common keys in a common structure object and thus save the space in the individual instance dicts: the representation of the instance dict contains only a list of values.
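
A toy model of the idea, with invented names and none of PyPy's actual machinery, might look like this:

```python
# Sketch of the "map" idea (hypothetical names, not PyPy internals):
# instances of one class that acquire attributes in the same order share a
# single key->index map, and each instance keeps only a list of values.

class Map:
    def __init__(self):
        self.indexes = {}        # attribute name -> position in the value list
        self.transitions = {}    # attribute name -> next shared Map

    def index_of(self, name):
        return self.indexes.get(name, -1)

    def add_attribute(self, name):
        # Reuse the successor map so instances keep sharing structure.
        if name not in self.transitions:
            new_map = Map()
            new_map.indexes = dict(self.indexes)
            new_map.indexes[name] = len(self.indexes)
            self.transitions[name] = new_map
        return self.transitions[name]

EMPTY_MAP = Map()

class Instance:
    def __init__(self):
        self.map = EMPTY_MAP
        self.values = []         # per-instance storage: just the values

    def setattr(self, name, value):
        idx = self.map.index_of(name)
        if idx >= 0:
            self.values[idx] = value
        else:
            self.map = self.map.add_attribute(name)
            self.values.append(value)

    def getattr(self, name):
        idx = self.map.index_of(name)
        if idx < 0:
            raise AttributeError(name)
        return self.values[idx]
```

Two instances that set the same attributes in the same order end up pointing at the very same map object; the per-instance cost is a short list of values, much as with slots.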

List Optimizations

Range-Lists

Range-lists solve the same problem that the xrange builtin solves poorly: the problem that range allocates memory even if the resulting list is only ever used for iterating over it. Range lists are a different implementation for lists. They are created only as a result of a call to range. As long as the resulting list is used without being mutated, the list stores only the start, stop and step of the range. Only when somebody mutates the list is the actual list created. This gives the memory and speed behaviour of xrange and the generality of use of range, and makes xrange essentially useless.
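
A sketch of the lazy representation in plain Python follows; the class name is invented, and negative indices in the lazy state and non-positive steps are omitted for brevity:

```python
# Sketch of a lazy range-list (not PyPy's actual class): the list stores only
# start/step/length until a mutation forces it to build real storage.

class RangeList:
    def __init__(self, start, stop, step=1):
        assert step > 0                  # simplification for this sketch
        self.start, self.step = start, step
        self.length = max(0, (stop - start + step - 1) // step)
        self.items = None                # stays None until somebody mutates us

    def __len__(self):
        return self.length if self.items is None else len(self.items)

    def __getitem__(self, i):
        if self.items is not None:
            return self.items[i]
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.start + i * self.step   # computed, no storage needed

    def _force(self):
        # Mutation: materialize the actual list of items.
        if self.items is None:
            self.items = [self.start + i * self.step
                          for i in range(self.length)]

    def __setitem__(self, i, value):
        self._force()
        self.items[i] = value
```

Iteration and indexing never allocate the element list; the first write does.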

This feature is enabled by default as part of the --objspace-std-withliststrategies option.

User Class Optimizations

Method Caching

A method cache is introduced where the result of a method lookup is stored (which involves potentially many lookups in the base classes of a class). Entries in the method cache are stored using a hash computed from the name being looked up, the call site (i.e. the bytecode object and the current program counter), and a special "version" of the type where the lookup happens (this version is incremented every time the type or one of its base classes is changed). On subsequent lookups the cached version can be used, as long as the instance did not shadow any of its classes' attributes.

This feature is enabled by default.

Interpreter Optimizations

Special Bytecodes

LOOKUP_METHOD & CALL_METHOD

An unusual feature of Python’s version of object oriented programming is the concept of a “bound method”. While the concept is clean and powerful, the allocation and initialization of the object is not without its performance cost. We have implemented a pair of bytecodes that alleviate this cost.

For a given method call obj.meth(x, y), the standard bytecode looks like this:

  1. LOAD_GLOBAL obj # push 'obj' on the stack
  2. LOAD_ATTR meth # read the 'meth' attribute out of 'obj'
  3. LOAD_GLOBAL x # push 'x' on the stack
  4. LOAD_GLOBAL y # push 'y' on the stack
  5. CALL_FUNCTION 2 # call the 'obj.meth' object with arguments x, y

We improved this by keeping method lookup separated from method call, unlike some other approaches, but using the value stack as a cache instead of building a temporary object. We extended the bytecode compiler to (optionally) generate the following code for obj.meth(x, y):

  1. LOAD_GLOBAL obj
  2. LOOKUP_METHOD meth
  3. LOAD_GLOBAL x
  4. LOAD_GLOBAL y
  5. CALL_METHOD 2

LOOKUP_METHOD contains exactly the same attribute lookup logic as LOAD_ATTR - thus fully preserving semantics - but pushes two values onto the stack instead of one. These two values are an “inlined” version of the bound method object: the im_func and im_self, i.e. respectively the underlying Python function object and a reference to obj. This is only possible when the attribute actually refers to a function object from the class; when this is not the case, LOOKUP_METHOD still pushes two values, but one (im_func) is simply the regular result that LOAD_ATTR would have returned, and the other (im_self) is an interpreter-level None placeholder.

After pushing the arguments, the layout of the stack in the above example is as follows (the stack grows upwards):

y (2nd arg)
x (1st arg)
obj (im_self)
function object (im_func)

The CALL_METHOD N bytecode emulates a bound method call by inspecting the im_self entry in the stack below the N arguments: if it is not None, then it is considered to be an additional first argument in the call to the im_func object from the stack.
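
The behaviour of the two bytecodes can be sketched with a plain Python list standing in for the value stack. The helper names are invented, and the class-side lookup is simplified to a single class-dict probe (the real opcodes do a full attribute lookup with MRO walking and descriptor handling):

```python
# Hedged sketch of LOOKUP_METHOD / CALL_METHOD on a list used as the value
# stack; None plays the role of the interpreter-level null placeholder.

def lookup_method(stack, obj, name):
    """LOOKUP_METHOD: push (im_func, im_self) instead of a bound method."""
    func = type(obj).__dict__.get(name)    # simplified: base classes omitted
    if callable(func):                     # a plain function found on the class
        stack.append(func)                 # im_func: the underlying function
        stack.append(obj)                  # im_self: obj, the implicit first arg
    else:
        stack.append(getattr(obj, name))   # whatever LOAD_ATTR would have given
        stack.append(None)                 # placeholder: no implicit first arg

def call_method(stack, n_args):
    """CALL_METHOD n: pop n args, then im_self and im_func, and call."""
    args = [stack.pop() for _ in range(n_args)][::-1]
    im_self = stack.pop()
    im_func = stack.pop()
    if im_self is not None:
        return im_func(im_self, *args)     # emulate the bound method call
    return im_func(*args)                  # plain call of the looked-up value
```

The pair never builds a bound method object: the (im_func, im_self) slots on the stack carry exactly the information the temporary object would have held.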

Overall Effects

The impact these various optimizations have on performance unsurprisingly depends on the program being run. Using the default multi-dict implementation that simply special-cases string-keyed dictionaries is a clear win on all benchmarks, improving results by anything from 15-40 per cent.

Another optimization – or rather set of optimizations – with a uniformly good effect is the pair of ‘method optimizations’, i.e. the method cache and the LOOKUP_METHOD and CALL_METHOD opcodes. On a heavily object-oriented benchmark (richards) they combine to give a speed-up of nearly 50%, and even on the extremely un-object-oriented pystone benchmark, the improvement is over 20%.

When building pypy, all generally useful optimizations are turned on by default unless you explicitly lower the translation optimization level with the --opt option.