Local or remote execution?

Because datasets are frequently large, you can choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it on a distributed Spark cluster. However, if you only need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.

Executing a transform process

Once you've created your TransformProcess using your Schema, and you've either loaded your dataset into an Apache Spark JavaRDD or have a RecordReader that loads your dataset, you can execute the transform.
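For reference, a minimal sketch of that setup, assuming a hypothetical three-column dataset (the column names and transform step are illustrative, not part of the executor API):

  import org.datavec.api.transform.TransformProcess;
  import org.datavec.api.transform.schema.Schema;

  // Describe the structure of the raw input data (hypothetical columns)
  Schema schema = new Schema.Builder()
      .addColumnString("name")
      .addColumnInteger("age")
      .addColumnDouble("income")
      .build();

  // Describe the operations to apply to that data
  TransformProcess transformProcess = new TransformProcess.Builder(schema)
      .removeColumns("name")
      .build();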

Locally this looks like:

  import org.datavec.local.transforms.LocalTransformExecutor;

  List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);
  List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);
  List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);
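The local executor operates on data that has already been loaded into memory (note the inputWritables parameter in the reference section below), so a common pattern is to drain a RecordReader into a list first. A minimal sketch, assuming a hypothetical CSV input file and the transformProcess defined above:

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import org.datavec.api.records.reader.RecordReader;
  import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
  import org.datavec.api.split.FileSplit;
  import org.datavec.api.writable.Writable;
  import org.datavec.local.transforms.LocalTransformExecutor;

  RecordReader reader = new CSVRecordReader();
  reader.initialize(new FileSplit(new File("input.csv")));   // hypothetical input file

  // Collect the parsed records into memory
  List<List<Writable>> input = new ArrayList<>();
  while (reader.hasNext()) {
      input.add(reader.next());
  }

  // Apply the transform process locally
  List<List<Writable>> transformed = LocalTransformExecutor.execute(input, transformProcess);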

When using Spark this looks like:

  import org.datavec.spark.transform.SparkTransformExecutor;

  JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);
  JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);
  JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
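On Spark, the raw data first has to be turned into a JavaRDD<List<Writable>>, typically with the record-reader adapter functions from the datavec-spark module. A minimal sketch, assuming a hypothetical HDFS path and the transformProcess defined above:

  import java.util.List;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
  import org.datavec.api.writable.Writable;
  import org.datavec.spark.transform.SparkTransformExecutor;
  import org.datavec.spark.transform.misc.StringToWritablesFunction;
  import org.datavec.spark.transform.misc.WritablesToStringFunction;

  JavaSparkContext sc = new JavaSparkContext();   // configure as appropriate for your cluster

  // Parse each line of the raw CSV into a List<Writable>
  JavaRDD<String> lines = sc.textFile("hdfs:///data/input.csv");   // hypothetical path
  JavaRDD<List<Writable>> inputRdd = lines.map(new StringToWritablesFunction(new CSVRecordReader()));

  // Apply the transform process on the cluster
  JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);

  // Optionally write the processed records back out as CSV
  transformed.map(new WritablesToStringFunction(",")).saveAsTextFile("hdfs:///data/output");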

Available executors


LocalTransformExecutor

Local transform executor

execute
  public static List<List<Writable>> execute(List<List<Writable>> inputWritables, TransformProcess transformProcess)

Execute the specified TransformProcess with the given input data. Note: this method can only be used if the TransformProcess returns non-sequence data. For TransformProcesses that return a sequence, use executeToSequence(List, TransformProcess).

  • param inputWritables Input data to process
  • param transformProcess TransformProcess to execute
  • return Processed data
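For example, with a small in-memory dataset (the rows below and the matching two-column transformProcess are hypothetical):

  import java.util.Arrays;
  import java.util.List;
  import org.datavec.api.writable.IntWritable;
  import org.datavec.api.writable.Text;
  import org.datavec.api.writable.Writable;
  import org.datavec.local.transforms.LocalTransformExecutor;

  // Two rows of (name, age) data matching a hypothetical two-column schema
  List<List<Writable>> inputWritables = Arrays.asList(
      Arrays.<Writable>asList(new Text("alice"), new IntWritable(30)),
      Arrays.<Writable>asList(new Text("bob"), new IntWritable(25)));

  List<List<Writable>> processed = LocalTransformExecutor.execute(inputWritables, transformProcess);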

SparkTransformExecutor

Execute a DataVec transform process on Spark RDDs.

isTryCatch
  public static boolean isTryCatch()

  • deprecated Use static methods instead of instance methods on SparkTransformExecutor
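The executeJoin variants shown earlier additionally require a Join definition describing how the two inputs relate. A hedged sketch, assuming hypothetical customer and purchase datasets that share a customerID column (verify the Join builder methods against your DataVec version):

  import java.util.List;
  import org.apache.spark.api.java.JavaRDD;
  import org.datavec.api.transform.join.Join;
  import org.datavec.api.writable.Writable;
  import org.datavec.spark.transform.SparkTransformExecutor;

  // Join two datasets on a shared key column (schemas, RDDs, and column name are hypothetical)
  Join join = new Join.Builder(Join.JoinType.Inner)
      .setJoinColumns("customerID")
      .setSchemas(customerSchema, purchasesSchema)
      .build();

  JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, customerRdd, purchasesRdd);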