Idioms

There are certain types of computation that show up again and again in parallel computation. This section shows how to perform them with jug.

Map/Reduce

Currently, the dominant paradigm for large scale distributed computing is the map/reduce paradigm. Originally made prominent by Google’s proprietary implementation, it is now available in many implementations, including open-source ones such as Hadoop.

Jug is not a direct competitor to Hadoop as it focuses on medium sized problems. Jug does not implement a distributed file system of any sort, but assumes that all compute nodes have access to a central repository of information (such as a shared filesystem or a redis server).

On the other hand, jug supports much more complex computations than map-reduce. If, however, your problem is naturally described as a map/reduce computation, then jug has some helper functions.

jug.mapreduce.mapreduce

The jug.mapreduce.mapreduce function implements mapreduce:

  1. jug.mapreduce(reducer, mapper, inputs)

is roughly equivalent to:

  1. Task{ reduce(reducer, map(mapper, inputs)) }

If the syntax of Python supported such a thing.

An issue that might come up is that your map function can be too fast. A good task should take at least a few seconds (otherwise, the overhead of scheduling and loading the data overwhelms the performance advantages of parallelism. Analogously for the reduce function.

Therefore, jug groups your inputs so that a mapping task actually consists of mapping and reducing more than one input. How many is controlled by the map_step parameter. By default, it is set to 4. Similarly, the reduce_step parameter controls how many reduction steps to perform in a single task (by default, 8; reflecting the fact that reduce operations tend to be lighter than map operations).

The compound task section has a worked out example of using map/reduce.

Parameter Sweep

This is a standard problem in many fields, for example, in machine learning. You have an algorithm that takes a few parameters (let’s call them p0 and p1) and a function which takes your input data (data) and the parameters and outputs the score of this parameter combination.

In pure Python, we’d write something like:

  1. best = None
  2. best_val = float("-Inf")
  3. for p0 in range(100):
  4. for p1 in range(-20, 20):
  5. cur = score(data, p0, p1)
  6. if cur > best_val:
  7. best = p0,p1
  8. best_val = cur
  9. print('Best parameter pair', best)

This is, obviously, an embarassingly parallel problem and we want jug to handle it.

First note: we can, of course, perform this with a map/reduce:

  1. def mapper(data, p0, p1):
  2. return (p0, p1, score(data, p0, p1))
  3. def reducer(a, b):
  4. _, _, av = a
  5. _, _, bv = b
  6. if av > bv: return a
  7. return b
  8. best = jug.mapreduce.mapreduce(
  9. reducer,
  10. mapper,
  11. [(p0, p1)
  12. for p0 in range(101)
  13. for p1 in range(-20, 21)])

However, if you want to look at the whole parameter space instead of just the best score, this will not work. Instead, you can do:

  1. from jug import TaskGenerator
  2. score = TaskGenerator(score)
  3. results = {}
  4. for p0 in range(100):
  5. for p1 in range(-20, 20):
  6. result[p0,p1] = value(data, p0, p1)

Now, after you’ve run ``jug execute``, you can use jug shell and load the result dictionary to look at all the results.

  1. result = value(result)
  2. print(result[0, -2])
  3. # Look for the maximum score
  4. print(max(result.values()))
  5. # Look at maximum score *and* the parameters that generated it:
  6. print(max((v, k) for k, v in result.iteritems()))