Design Overview for ElasticDL on SQLFlow

Overview

This is a design doc on the integration of ElasticDL with SQLFlow.

User Interface

Training Job Submission

  SELECT
      c1, c2, c3, c4, c5 as class
  FROM training_data
  TRAIN ElasticDLKerasClassifier
  WITH
      model.optimizer = "optimizer",
      model.loss = "loss",
      model.eval_metrics_fn = "eval_metrics_fn",
      model.num_classes = 3,
      model.dataset_fn = "dataset_fn",
      train.shuffle = 120,
      train.epoch = 2,
      train.grads_to_wait = 2,
      train.tensorboard_log_dir = "",
      train.checkpoint_steps = 0,
      train.checkpoint_dir = "",
      train.keep_checkpoint_max = 0,
      eval.steps = 0,
      eval.start_delay_secs = 100,
      eval.throttle_secs = 0,
      eval.checkpoint_dir_for_init = "",
      engine.master_resource_request = "cpu=400m,memory=1024Mi",
      engine.master_resource_limit = "cpu=1,memory=2048Mi",
      engine.worker_resource_request = "cpu=400m,memory=2048Mi",
      engine.worker_resource_limit = "cpu=1,memory=3072Mi",
      engine.num_workers = 2,
      engine.volume = "",
      engine.image_pull_policy = "Never",
      engine.restart_policy = "Never",
      engine.extra_pypi_index = "",
      engine.namespace = "default",
      engine.minibatch_size = 64,
      engine.master_pod_priority = "",
      engine.cluster_spec = "",
      engine.num_minibatches_per_task = 2,
      engine.docker_image_repository = "",
      engine.envs = ""
  COLUMN
      c1,
      c2,
      c3,
      c4
  LABEL class
  INTO trained_elasticdl_keras_classifier;

Prediction Job Submission

  SELECT
      c1, c2, c3, c4
  FROM prediction_data
  PREDICT prediction_results_table
  WITH
      model.num_classes = 10,
      model.dataset_fn = "dataset_fn",
      predict.checkpoint_dir_for_init = "v1/",
      engine.master_resource_request = "cpu=400m,memory=1024Mi",
      engine.master_resource_limit = "cpu=1,memory=2048Mi",
      engine.worker_resource_request = "cpu=400m,memory=2048Mi",
      engine.worker_resource_limit = "cpu=1,memory=3072Mi",
      engine.num_workers = 2,
      engine.volume = "",
      engine.image_pull_policy = "Never",
      engine.restart_policy = "Never",
      engine.extra_pypi_index = "",
      engine.namespace = "default",
      engine.minibatch_size = 64,
      engine.master_pod_priority = "",
      engine.cluster_spec = "",
      engine.num_minibatches_per_task = 2,
      engine.docker_image_repository = "",
      engine.envs = ""
  USING trained_elasticdl_keras_classifier;

Implementation

Mapping Extended SQL

The components of the extended SQL defined by SQLFlow are mapped to an elasticDLFiller struct that looks like the following:

  type elasticDLFiller struct {
      IsTraining         bool
      TrainInputTable    string
      EvalInputTable     string
      PredictInputTable  string
      PredictOutputTable string
      PredictInputModel  string
      OutputShape        int
      InputShape         int
      ModelDir           string
      LabelColName       string
      FeaturesList       string
      TrainClause        *resolvedTrainClause
      PredictClause      *resolvedPredictClause
  }

This elasticDLFiller struct is used to fill a pre-defined template that generates the model definition components required by ElasticDL, such as the tf.keras model definition, loss, optimizer, and dataset_fn.
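
The filling itself can be pictured as ordinary Go template execution. The sketch below is illustrative only: it assumes Go's standard text/template package and a made-up, heavily trimmed template string with just two placeholders, whereas the real template renders the complete model definition.

  package main

  import (
      "os"
      "text/template"
  )

  // A made-up, trimmed-down template for illustration: only InputShape and
  // LabelColName are substituted here.
  const datasetFnTmpl = `def dataset_fn(dataset, mode, metadata):
      features_shape = ({{.InputShape}}, 1)
      label_col_name = "{{.LabelColName}}"
      ...
  `

  func main() {
      // Stand-in for the elasticDLFiller fields used by this snippet,
      // filled with the values from the example training statement.
      data := struct {
          InputShape   int
          LabelColName string
      }{InputShape: 4, LabelColName: "class"}

      tmpl := template.Must(template.New("dataset_fn").Parse(datasetFnTmpl))
      // Writes the generated Python snippet to stdout.
      if err := tmpl.Execute(os.Stdout, data); err != nil {
          panic(err)
      }
  }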

For example, the dataset_fn is generated using the FeaturesList, LabelColName, InputShape, IsTraining, and TrainClause in the elasticDLFiller struct:

  import tensorflow as tf  # Mode comes from ElasticDL and is imported in the full generated model file

  def dataset_fn(dataset, mode, metadata):
      def _parse_data(record):
          def _get_features_without_labels(
              record, label_col_ind, features_shape
          ):
              features = [
                  record[:label_col_ind],
                  record[label_col_ind + 1 :],
              ]
              features = tf.concat(features, -1)
              return tf.reshape(features, features_shape)

          record = tf.strings.to_number(record, tf.float32)
          # The values below are filled in from the example training statement:
          # four feature columns (c1-c4) and the label column "class".
          features_shape = (4, 1)
          labels_shape = (1,)
          label_col_name = "class"
          if mode != Mode.PREDICTION:
              if label_col_name not in metadata.column_names:
                  raise ValueError(
                      "Missing the label column '%s' in the retrieved "
                      "table." % label_col_name
                  )
              label_col_ind = metadata.column_names.index(label_col_name)
              labels = tf.reshape(record[label_col_ind], labels_shape)
              return (
                  _get_features_without_labels(
                      record, label_col_ind, features_shape
                  ),
                  labels,
              )
          return tf.reshape(record, features_shape)

      dataset = dataset.map(_parse_data)
      # "true" and the buffer size 120 are filled in from train.shuffle = 120.
      if mode != Mode.PREDICTION and "true" == "true":
          dataset = dataset.shuffle(buffer_size=120)
      return dataset

Some fields used to generate the above dataset_fn are obtained directly from the extended SQL statement. For example, FeaturesList is obtained from the SELECT clause, LabelColName is obtained from the LABEL clause, and TrainClause.ShuffleBufferSize is obtained from train.shuffle in the WITH clause. There are also fields that are obtained indirectly. For example, InputShape is inferred from FeaturesList, as sketched below.
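
As an illustration only, the filler for the example training statement could be populated as follows. This is a hypothetical construction, not SQLFlow code; the struct here repeats just the fields used in the sketch, and InputShape is derived from the number of selected feature columns rather than read from the SQL.

  package main

  import (
      "fmt"
      "strings"
  )

  // Only the filler fields used in this sketch; see the full elasticDLFiller
  // struct above.
  type elasticDLFiller struct {
      IsTraining      bool
      TrainInputTable string
      LabelColName    string
      FeaturesList    string
      InputShape      int
  }

  func main() {
      featuresList := []string{"c1", "c2", "c3", "c4"} // SELECT / COLUMN clauses
      filler := elasticDLFiller{
          IsTraining:      true,
          TrainInputTable: "training_data", // FROM clause
          LabelColName:    "class",         // LABEL clause
          FeaturesList:    strings.Join(featuresList, ","),
          InputShape:      len(featuresList), // inferred, not written in the SQL
      }
      fmt.Printf("%+v\n", filler)
  }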

Note that in the template we currently hard-code the type of every column to tf.float32 in the generated dataset_fn. We should infer this information from the database instead. We also hard-coded other components of the model definition, such as the loss and optimizer; these components should be derived from the model zoo instead.

Generate ElasticDL Command

Once we have generated the components for the model definition, we can generate the ElasticDL command to submit the job. Below is an example:

  elasticdl train \
      --image_base=elasticdl:ci \
      --model_zoo=<model-zoo> \
      --model_def=<path-to-generated-model-def> \
      --loss=<loss-function-name> \
      --eval_metrics_fn=<eval-metrics-function-name> \
      --training_data=<training-input-table> \
      --validation_data=<validation-input-table> \
      --num_epochs=2 \
      --master_resource_request="cpu=400m,memory=1024Mi" \
      --master_resource_limit="cpu=1,memory=2048Mi" \
      --worker_resource_request="cpu=400m,memory=2048Mi" \
      --worker_resource_limit="cpu=1,memory=3072Mi" \
      --minibatch_size=64 \
      --num_minibatches_per_task=10 \
      --num_workers=2 \
      --checkpoint_steps=10 \
      --evaluation_steps=15 \
      --grads_to_wait=2 \
      --job_name=test-iris \
      --log_level=INFO \
      --image_pull_policy=Never \
      --output=<model-output> \
      --envs=<env-vars> \
      --data_reader_params=<data-reader-params>

In the command, --model_def is the path to the model definition file we generated earlier. Additional model definition arguments such as --loss and --eval_metrics_fn are obtained from parameters whose names start with the model. prefix.

The rest of the arguments are derived from the extended SQL, for example:

  • --model_zoo is obtained from the TRAIN clause.
  • --training_data is obtained from the FROM clause.
  • --num_epochs is obtained from train.epoch in the WITH clause.

ElasticDL engine-specific arguments such as --grads_to_wait and --num_workers are obtained from parameters whose names start with the engine. prefix, as sketched below.
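
The following sketch is my simplification of this prefix-based routing, not the actual SQLFlow code. It assumes that a model. or engine. attribute keeps its name when it becomes a flag, which holds for the examples above, while train.* and eval.* attributes are translated individually (e.g. train.epoch becomes --num_epochs) and are therefore not handled here.

  package main

  import (
      "fmt"
      "strings"
  )

  // attributeToFlag turns a WITH attribute into an elasticdl flag by stripping
  // its prefix. Simplified for illustration; train.* and eval.* attributes are
  // omitted because their flag names differ from the attribute names.
  func attributeToFlag(name, value string) (string, bool) {
      for _, prefix := range []string{"model.", "engine."} {
          if strings.HasPrefix(name, prefix) {
              return fmt.Sprintf("--%s=%s", strings.TrimPrefix(name, prefix), value), true
          }
      }
      return "", false
  }

  func main() {
      for _, attr := range [][2]string{
          {"model.loss", "loss"},
          {"engine.num_workers", "2"},
      } {
          if flag, ok := attributeToFlag(attr[0], attr[1]); ok {
              fmt.Println(flag) // prints --loss=loss and --num_workers=2
          }
      }
  }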

In order to integrate with the different databases we support, we pass additional information to the ElasticDL command.

For example, we pass the necessary environment variables, such as the access ID and key of the ODPS account, via --envs. In addition, we pass the list of column names that we want to read from ODPS via --data_reader_params, as illustrated below.
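
Concretely, the two flags could end up looking roughly like the following. Everything specific here is a placeholder: the environment variable names and the data_reader_params format are assumptions made for illustration, not documented ElasticDL or ODPS conventions.

  package main

  import "fmt"

  func main() {
      // Placeholder values only; the real variable names and parameter format
      // depend on the ODPS data reader configuration.
      envs := "ODPS_ACCESS_ID=<id>,ODPS_ACCESS_KEY=<key>"
      dataReaderParams := "columns=c1,c2,c3,c4,class"
      fmt.Printf("--envs=%s --data_reader_params=%s\n", envs, dataReaderParams)
  }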

Future Work

  • Support the tf.feature_column API via the COLUMN clause.
  • Support evaluation jobs. Evaluation on a separate evaluation table is not yet supported in SQLFlow. Please check out #675 for details.
  • Switch to using the intermediate representation for ElasticDL codegen. For details, please see #1075.
  • Support synchronous calls in the high-level API. For details, please see #1285.
  • Unify the model zoos of SQLFlow and ElasticDL and support submitting an ElasticDL job for a model defined in the model zoo. Please see #22 and #1063 for details.
  • Currently the only database ElasticDL supports is ODPS. However, we should expose the necessary abstractions so that ElasticDL can fully leverage SQLFlow’s functionality to read from and write to different SQL databases.
  • Support prediction jobs and add integration tests on Travis CI.
  • Currently we have hard-coded the types for each column to be tf.float32 in the generated dataset_fn. We should infer this information from the database instead.