Using multiple GPUs

Theano has a feature to allow the use of multiple GPUs at the same time in one function. The multiple GPU feature requires the use of the GpuArray backend, so make sure that works correctly.

In order to keep a reasonably high level of abstraction you do not refer to device names directly for multiple-GPU use. You instead refer to what we call context names. These are then mapped to a device using the Theano configuration. This allows portability of models between machines.

Warning

The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to double-check your results before publishing a paper or anything of the sort.

Note

For data-parallelism, you are probably better off using platoon.

Defining the context map

The mapping from context names to devices is done through the config.contexts option. The format looks like this:

  dev0->cuda0;dev1->cuda1

Let’s break it down. First there is a list of mappings. Each of these mappings is separated by a semicolon ‘;’. There can be any number of such mappings, but in the example above we have two of them: dev0->cuda0 and dev1->cuda1.

The mappings themselves are composed of a context name followed by the two characters ‘->’ and the device name. The context name is a simple string which does not have any special meaning for Theano. For parsing reasons, the context name cannot contain the sequence ‘->’ or ‘;’. To avoid confusion, context names that begin with ‘cuda’ or ‘opencl’ are disallowed. The device name is a device in the form that gpuarray expects, like ‘cuda0’ or ‘opencl0:0’.

Note

Since there are a bunch of shell special characters in the syntax, defining this on the command-line will require proper quoting, like this:

  $ THEANO_FLAGS="contexts=dev0->cuda0"

When you define a context map, if config.print_active_device is True (the default), Theano will print the mappings as they are defined. The output will look like this:

  $ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
  Mapped name dev0 to device cuda0: GeForce GTX TITAN X
  Mapped name dev1 to device cuda1: GeForce GTX TITAN X

If you don’t have enough GPUs for a certain model, you can assign the same device to more than one name, as shown below. You can also assign extra names that a model doesn’t need to some other devices. However, a proliferation of names is not always a good idea, since Theano often assumes that different context names will be on different devices and will optimize accordingly. So you may get faster performance with a single name and a single device.
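For example, to run a model written for the two context names above on a machine with a single GPU, both names can point to the same device:

  dev0->cuda0;dev1->cuda0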

Note

It is often the case that multi-GPU operation requires or assumes that all the GPUs involved are equivalent. This is not the case for this implementation. Since the user has the task of distributing the jobs across the different devices, a model can be built on the assumption that one of the GPUs is slower or has smaller memory.

A simple graph on two GPUs

The following simple program works on two GPUs. It builds a function which performs two dot products on two different GPUs.

  import numpy
  import theano

  v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                      target='dev0')
  v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                      target='dev0')
  v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                      target='dev1')
  v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                      target='dev1')

  f = theano.function([], [theano.tensor.dot(v01, v02),
                           theano.tensor.dot(v11, v12)])

  f()

This model requires a context map with assignments for ‘dev0’ and ‘dev1’. It should run twice as fast when the devices are different.
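For instance, assuming the program above is saved as multi_gpu.py (a hypothetical filename), it could be launched with a context map that puts each name on its own GPU:

  $ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python multi_gpu.py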

Explicit transfers of data

Since operations themselves cannot work on more than one device, they will pick a device to work on based on their inputs and automatically insert transfers for any input which is not on the right device.
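As a minimal sketch of this automatic behaviour (assuming a context map that defines ‘dev0’ and ‘dev1’, as in the earlier example), an operation whose inputs live on two different devices will still compile; Theano simply moves one of the inputs for you:

  import numpy
  import theano

  # Each shared variable lives on a different context.
  a = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
  b = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev1')

  # The dot product can only run on one device, so Theano inserts a
  # transfer for the input that is not already there.
  f = theano.function([], theano.tensor.dot(a, b))
  f()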

However, you may want some explicit control over where and how these transfers are done at some points. This is done by using the new transfer() method that is present on variables. It works for moving data between GPUs and also between the host and the GPUs. Here is an example.

  import theano

  v = theano.tensor.fmatrix()

  # Move to the device associated with 'gpudev'
  gv = v.transfer('gpudev')

  # Move back to the cpu
  cv = gv.transfer('cpu')

Of course you can mix transfers and operations in any order you choose. However, you should try to minimize transfer operations because they will introduce overhead and may reduce performance.
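As an illustration (again assuming context names ‘dev0’ and ‘dev1’ are defined in the context map), an input can be moved to one GPU, used in a computation, and the intermediate result explicitly moved to the other GPU before being brought back to the host:

  import theano

  x = theano.tensor.fmatrix()
  y = theano.tensor.fmatrix()

  # Compute the product on the device mapped to 'dev0' ...
  r = theano.tensor.dot(x.transfer('dev0'), y.transfer('dev0'))

  # ... then explicitly move the result to 'dev1' for further work there,
  # and finally bring the output back to the host.
  out = theano.tensor.exp(r.transfer('dev1')).transfer('cpu')

  f = theano.function([x, y], out)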