Data Frames

Data Frames are used to store data during execution of an ML pipeline. They are similar to a SQL table in that they have a schema, which stores the data type of every column, and rows, which store the actual values.
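As a rough illustration of the schema-plus-rows idea, the sketch below checks that every row agrees with a schema. It is not taken from any of the libraries discussed here: the `validate` helper and the `PYTHON_TYPES` mapping are hypothetical names introduced only for this example, and the type names "string" and "double" are assumed to map to Python's `str` and `float`.

```python
from typing import Any, Dict, List

# Hypothetical mapping from schema type names to Python types (an assumption).
PYTHON_TYPES = {"string": str, "double": float}


def validate(schema: List[Dict[str, str]], rows: List[List[Any]]) -> None:
    """Raise ValueError if any row does not match the schema."""
    for row_number, row in enumerate(rows):
        # Every row must supply exactly one value per column.
        if len(row) != len(schema):
            raise ValueError(
                f"row {row_number}: expected {len(schema)} values, got {len(row)}"
            )
        # Every value must have the type declared for its column.
        for field, value in zip(schema, row):
            if not isinstance(value, PYTHON_TYPES[field["type"]]):
                raise ValueError(
                    f"row {row_number}: column {field['name']!r} expects {field['type']}"
                )


schema = [
    {"name": "state", "type": "string"},
    {"name": "bedrooms", "type": "double"},
]
rows = [["NY", 3.0], ["CA", 2.0]]

validate(schema, rows)  # returns silently because both rows match the schema
```

Keeping the schema separate from the rows means the column types are stated once, rather than repeated or inferred for every value.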

Spark, Scikit-learn, and MLeap each have their own version of a data frame. Tensorflow instead uses a graph of inputs and outputs to execute transformations, which is easy to interface with a data frame structure.

Spark Data Frames

Spark’s data frames are optimized for distributed computation, making them excellent for processing large datasets. They are very heavyweight, as they need to handle network failure scenarios, compilation of execution plans, redundancy, and many other requirements of a distributed context. Spark data frames also offer a lot of functionality outside of ML pipelines, such as joining large datasets, mapping, reducing, and SQL queries.

Scikit-learn Data Frames

Scikit-learn data frames are provided by Pandas and NumPy. These are lightweight data structures that offer much of the same functionality as Spark data frames, but without the distributed execution.

MLeap Data Frames: Leap Frames

Leap frames are very lightweight data structures meant to support very basic operations and ML transformations. Because of their simplicity, they are highly optimized for use as a realtime prediction engine or for small-batch predictions. Leap frames can also be abstracted over Spark data frames, so they do not lose their ability to act as an efficient batch-mode data store.

Example Leap Frame

Here is an example leap frame in JSON, taken from our AirBnB demo:

    {
      "schema": {
        "fields": [{
          "name": "state",
          "type": "string"
        }, {
          "name": "bathrooms",
          "type": "double"
        }, {
          "name": "square_feet",
          "type": "double"
        }, {
          "name": "bedrooms",
          "type": "double"
        }, {
          "name": "review_scores_rating",
          "type": "double"
        }, {
          "name": "room_type",
          "type": "string"
        }, {
          "name": "cancellation_policy",
          "type": "string"
        }]
      },
      "rows": [["NY", 2.0, 1250.0, 3.0, 50.0, "Entire home/apt", "strict"]]
    }
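To show how the schema and rows fit together, here is a minimal sketch that reads this JSON with Python's standard `json` module and pairs each value with its column name. It does not use the MLeap API; the variable names (`LEAP_FRAME_JSON`, `names`, `records`) are chosen only for illustration.

```python
import json

# The example leap frame from above, embedded as a string.
LEAP_FRAME_JSON = """
{
  "schema": {
    "fields": [
      {"name": "state", "type": "string"},
      {"name": "bathrooms", "type": "double"},
      {"name": "square_feet", "type": "double"},
      {"name": "bedrooms", "type": "double"},
      {"name": "review_scores_rating", "type": "double"},
      {"name": "room_type", "type": "string"},
      {"name": "cancellation_policy", "type": "string"}
    ]
  },
  "rows": [["NY", 2.0, 1250.0, 3.0, 50.0, "Entire home/apt", "strict"]]
}
"""

frame = json.loads(LEAP_FRAME_JSON)

# Column names come from the schema, in the same order as the row values.
names = [field["name"] for field in frame["schema"]["fields"]]

# Pair every row with the column names to get one dictionary per row.
records = [dict(zip(names, row)) for row in frame["rows"]]

print(records[0]["square_feet"])  # 1250.0
```

Values are matched to columns purely by position, so the order of each row has to follow the order of the fields in the schema.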

Tensorflow

Tensorflow does not have data frames like Spark, Scikit-learn, and MLeap. Instead, Tensorflow relies on input nodes and output nodes, connected by a graph of transformation operations. This paradigm is neatly compatible with data frames: certain columns can provide the data for the input nodes, while the values of the output nodes can be placed in new columns of a data frame. Leap frames are specifically designed to be compatible with Tensorflow graphs, Spark data frames, and, to a certain extent, Scikit-learn data frames.
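The column-to-node mapping can be sketched without Tensorflow itself. In the example below a plain Python function stands in for a graph: its parameters play the role of input nodes and its return value the role of an output node. Everything here is an assumption made for illustration, including the simplified frame layout, the `with_output` helper, and the `feet_per_bedroom` function; none of it is the MLeap or Tensorflow API.

```python
from typing import Any, Callable, Dict, List

# Simplified frame layout used only in this sketch:
# {"fields": [column names], "rows": [[values], ...]}
Frame = Dict[str, Any]


def with_output(
    frame: Frame,
    inputs: List[str],
    output: str,
    graph: Callable[..., Any],
) -> Frame:
    """Feed the named input columns to `graph` and append its result as a new column."""
    indices = [frame["fields"].index(name) for name in inputs]
    new_rows = [
        row + [graph(*(row[i] for i in indices))]  # new list, original row untouched
        for row in frame["rows"]
    ]
    return {"fields": frame["fields"] + [output], "rows": new_rows}


def feet_per_bedroom(square_feet: float, bedrooms: float) -> float:
    """Stand-in for a graph with two input nodes and one output node."""
    return square_feet / bedrooms


frame = {"fields": ["square_feet", "bedrooms"], "rows": [[1250.0, 3.0]]}
result = with_output(
    frame,
    inputs=["square_feet", "bedrooms"],
    output="square_feet_per_bedroom",
    graph=feet_per_bedroom,
)

print(result["fields"])  # ['square_feet', 'bedrooms', 'square_feet_per_bedroom']
```

The helper returns a new frame rather than modifying the original, so the same input frame can be passed through several transformations independently.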