Getting Started with PySpark

MLeap PySpark integration provides serialization of PySpark-trained ML pipelines to MLeap Bundles. MLeap also provides several extensions to Spark, including enhanced one-hot encoding and one-vs-rest models. Unlike the MLeap<>Spark integration, MLeap does not yet provide PySpark integration for the Spark Extensions transformers.

Adding MLeap Spark to Your Project

Before adding MLeap PySpark to your project, you first have to compile and add MLeap Spark.

MLeap PySpark is available in the combust/mleap GitHub repository, in the python package.

To add MLeap to your PySpark project, just clone the git repository, add the mleap/python path, and import mleap.pyspark:

    git clone git@github.com:combust/mleap.git

Then, in your Python environment, do:

    import sys
    sys.path.append('<git directory>/mleap/python')
    import mleap.pyspark

Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
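For instance, a minimal sketch of the required ordering (the SimpleSparkSerializer import mirrors MLeap's PySpark support module; treat the exact paths as an assumption if your version differs):

    # mleap.pyspark must be imported before any other PySpark modules,
    # because it patches PySpark classes with serialization methods.
    import mleap.pyspark
    from mleap.pyspark.spark_support import SimpleSparkSerializer

    # Only after the above should the regular PySpark imports follow.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline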

Note: If you are working from a notebook environment, be sure to take a look at the instructions for setting up MLeap PySpark in that environment.

Using PIP

Alternatively, PIP support for MLeap PySpark is available at: https://pypi.python.org/pypi/mleap.
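For example, assuming the package name matches the PyPI link above:

    pip install mleap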

To use MLeap extensions to PySpark:

  1. See build instructions to build MLeap from source.
  2. See core concepts for an overview of ML pipelines.
  3. See Spark documentation to learn how to train ML pipelines in Spark.
  4. See the demo notebook on how to use PySpark and MLeap to serialize your pipeline to Bundle.ml (a minimal sketch of the serialization call follows this list).
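As a rough end-to-end illustration, the sketch below trains a small pipeline and writes it out as an MLeap Bundle. The DataFrame contents, column names, and output path are illustrative assumptions; serializeToBundle is the method that the mleap.pyspark import attaches to fitted pipelines.

    # A minimal sketch of serializing a fitted PySpark pipeline to an
    # MLeap Bundle. The data, column names, and output path are
    # illustrative; see the demo notebook for a complete example.
    import mleap.pyspark
    from mleap.pyspark.spark_support import SimpleSparkSerializer

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    spark = SparkSession.builder.appName("mleap-example").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1.0), ("b", 2.0), ("a", 3.0)],
        ["category", "value"],
    )

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="category", outputCol="category_index"),
        VectorAssembler(inputCols=["category_index", "value"],
                        outputCol="features"),
    ])
    model = pipeline.fit(df)

    # serializeToBundle is added to the fitted PipelineModel by the
    # mleap.pyspark import; the bundle is written as a zip archive,
    # hence the jar:file: URI scheme.
    model.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                            model.transform(df))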