MLflow Node

Overview

MLflow is an excellent open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

The MLflow task plugin is used to execute MLflow tasks. It currently supports MLflow Projects and MLflow Models; Model Registry support is planned.

  • MLflow Projects: Package data science code in a format to reproduce runs on any platform.
  • MLflow Models: Deploy machine learning models in diverse serving environments.
  • Model Registry: Store, annotate, discover, and manage models in a central repository.

The MLflow plugin supports, or will support, the following:

  • MLflow Projects
    • BasicAlgorithm: contains LogisticRegression, svm, lightgbm, xgboost
    • AutoML: AutoML tools, including autosklearn and flaml
    • Custom projects: Support for running your own MLflow projects
  • MLflow Models
    • MLFLOW: Use mlflow models serve to deploy a model service
    • Docker: Package the model as a Docker image, then run the container

Create Task

  • Click Project Management -> Project Name -> Workflow Definition, and click the Create Workflow button to enter the DAG editing page.
  • Drag the MLflow task node from the toolbar onto the canvas.

Task Parameters and Example

Parameter | Description
MLflow Tracking Server URI | The URI of the MLflow Tracking Server. Defaults to http://localhost:5000.
Experiment Name | The experiment under which the task runs. It is created if it does not exist. If the name is empty, it is set to Default, as in MLflow.
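
The defaults above can be sketched in a few lines of Python. This is only an illustration: the helper name mlflow_env is hypothetical, and whether the plugin passes these values through environment variables is an assumption. MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME are the standard variables read by the MLflow client.

```python
import os

DEFAULT_TRACKING_URI = "http://localhost:5000"
DEFAULT_EXPERIMENT = "Default"


def mlflow_env(tracking_uri="", experiment_name=""):
    """Return MLflow client environment variables with the documented defaults.

    Hypothetical helper: empty fields fall back to the default URI and to
    the "Default" experiment.
    """
    return {
        "MLFLOW_TRACKING_URI": tracking_uri.strip() or DEFAULT_TRACKING_URI,
        "MLFLOW_EXPERIMENT_NAME": experiment_name.strip() or DEFAULT_EXPERIMENT,
    }


# Merge the values into a copy of the current environment, e.g. for a subprocess.
env = {**os.environ, **mlflow_env(experiment_name="")}
print(env["MLFLOW_TRACKING_URI"], env["MLFLOW_EXPERIMENT_NAME"])
```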

MLflow Projects

BasicAlgorithm

[Screenshot: mlflow-conda-env]

Task Parameters

Parameter | Description
Register Model | Whether to register the model. If selected, the following parameters are expanded.
Model Name | The name of the registered model. A new version is added to the existing model versions and registered as Production.
Data Path | The absolute path of a file or folder. A file must end with .csv; a folder must contain train.csv and test.csv (the recommended approach, since users should build their own test sets for model evaluation).
Parameters | Parameters used when initializing the algorithm/AutoML model; may be empty. For example, time_budget=30;estimator_list=['lgbm'] for flaml. By convention, parameters are separated by ';'. The text before the equals sign is the parameter name, and the text after it is passed to Python eval() to obtain the parameter value.
Algorithm | The algorithm to use. Currently supports LR, SVM, LightGBM, and XGBoost in their scikit-learn style.
Parameter Search Space | The parameter search space used when running the selected algorithm; may be empty. For example, max_depth=[5, 10];n_estimators=[100, 200] for lightgbm. By convention, parameters are separated by ';'. The text before the equals sign is the parameter name, and the text after it is passed to Python eval() to obtain the parameter value.
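
The parameter convention described above can be illustrated with a minimal Python sketch. The helper name parse_params is hypothetical and this is not the plugin's actual code; it only mirrors the stated rule: split on ';', take the text before the first '=' as the name, and evaluate the rest with eval().

```python
def parse_params(text):
    """Parse 'name=value;name=value' into a dict, evaluating each value with eval().

    Hypothetical helper that mirrors the documented convention.
    """
    params = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate an empty field or a trailing ';'
        name, _, value = item.partition("=")
        # eval() executes arbitrary Python, so only pass trusted input.
        params[name.strip()] = eval(value.strip())
    return params


# Parameters field (flaml example from the table)
print(parse_params("time_budget=30;estimator_list=['lgbm']"))
# Parameter Search Space field (lightgbm example from the table)
print(parse_params("max_depth=[5, 10];n_estimators=[100, 200]"))
```

Because the sketch splits on every ';', a value that itself contains a semicolon would be cut in two. Values must also use straight quotes, since eval() rejects typographic quotes.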

AutoML

[Screenshot: mlflow-automl]

Task Parameters

Parameter | Description
Register Model | Whether to register the model. If selected, the following parameters are expanded.
Model Name | The name of the registered model. A new version is added to the existing model versions and registered as Production.
Data Path | The absolute path of a file or folder. A file must end with .csv; a folder must contain train.csv and test.csv (the recommended approach, since users should build their own test sets for model evaluation).
Parameters | Parameters used when initializing the algorithm/AutoML model; may be empty. For example, n_estimators=200;learning_rate=0.2 for flaml. By convention, parameters are separated by ';'. The text before the equals sign is the parameter name, and the text after it is passed to Python eval() to obtain the parameter value. For the detailed parameter list, see the documentation of the selected AutoML tool.
AutoML tool | The AutoML tool to use. Currently supports autosklearn and flaml.
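
The Data Path rule can be checked before a workflow is submitted. The sketch below is an illustration only: is_valid_data_path is a hypothetical helper, not part of the plugin, and it assumes the check runs on the worker where the data lives.

```python
import tempfile
from pathlib import Path


def is_valid_data_path(path):
    """Return True if path follows the documented Data Path rule.

    Hypothetical helper: an absolute .csv file, or an absolute folder that
    contains both train.csv and test.csv.
    """
    p = Path(path)
    if not p.is_absolute():
        return False
    if p.is_file():
        return p.suffix == ".csv"
    if p.is_dir():
        return (p / "train.csv").is_file() and (p / "test.csv").is_file()
    return False


# Demonstration with a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "train.csv").write_text("x,y\n1,0\n")
    print(is_valid_data_path(folder))  # test.csv is still missing
    (folder / "test.csv").write_text("x,y\n2,1\n")
    print(is_valid_data_path(folder))  # both files are present
```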

Custom projects

[Screenshot: mlflow-custom-project]

Task Parameters

Parameter | Description
Parameters | The --param-list of mlflow run. For example, -P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9.
Repository | The repository URL of the MLflow project. Supports a git address or a directory on the worker. If the project is in a subdirectory, append # followed by the subdirectory path (same as mlflow run), for example https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native.
Project Version | The version of the project. Defaults to master.

You can use this feature to run any MLflow project hosted on GitHub (for example, the MLflow examples). You can also package your own machine learning code as an MLflow project to reuse your work, and then run it from DolphinScheduler with one click.
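
To see how the three fields relate to the mlflow run command line, the sketch below assembles an equivalent command. The function build_mlflow_run is hypothetical, and the exact command the plugin generates is an assumption; the options used (--version, -P, --experiment-name) are standard mlflow run options.

```python
import shlex


def build_mlflow_run(repository, version="master", parameters="", experiment_name=None):
    """Assemble an `mlflow run` argument list from the task fields (hypothetical helper)."""
    argv = ["mlflow", "run", repository]
    if version:
        # --version only applies to Git URIs; pass version=None for a local directory.
        argv += ["--version", version]
    # The Parameters field already holds "-P name=value" pairs; split it shell-style.
    argv += shlex.split(parameters)
    if experiment_name:
        argv += ["--experiment-name", experiment_name]
    return argv


argv = build_mlflow_run(
    "https://github.com/mlflow/mlflow#examples/xgboost/xgboost_native",
    parameters="-P learning_rate=0.2 -P colsample_bytree=0.8 -P subsample=0.9",
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where MLflow is installed; printing it first is a convenient way to verify the repository and parameter fields.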

MLflow Models

General Parameters

Parameter | Description
Model-URI | The model URI in MLflow. Supports the models:/<model_name>/suffix format and the runs:/ format. See https://mlflow.org/docs/latest/tracking.html#artifact-stores
Port | The port to listen on.
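
The two deployment types correspond roughly to the MLflow commands sketched below. This is an approximation under stated assumptions: the helper names, the model name my_model, and the image name my-model-image are hypothetical, and the commands the plugin actually issues may differ. The options shown (-m, -h, -p, -n) exist in the MLflow CLI, and images produced by mlflow models build-docker serve on container port 8080.

```python
import shlex


def serve_command(model_uri, port):
    """Command for the MLFLOW deploy type: serve the model directly (hypothetical helper)."""
    return ["mlflow", "models", "serve", "-m", model_uri, "-h", "0.0.0.0", "-p", str(port)]


def docker_commands(model_uri, port, image="my-model-image"):
    """Commands for the Docker deploy type: build an image, then run it (hypothetical helper)."""
    build = ["mlflow", "models", "build-docker", "-m", model_uri, "-n", image]
    # Map the chosen host port to the container's serving port 8080.
    run = ["docker", "run", "-p", f"{port}:8080", image]
    return build, run


# "my_model" is a placeholder for a registered model name.
model_uri = "models:/my_model/Production"
print(shlex.join(serve_command(model_uri, 7000)))
for cmd in docker_commands(model_uri, 7000):
    print(shlex.join(cmd))
```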

MLflow

[Screenshot: mlflow-models-mlflow]

Docker

[Screenshot: mlflow-models-docker]

Environment to Prepare

Conda Environment

Please install Anaconda or Miniconda in advance.

Method A:

Configure the Anaconda environment in /dolphinscheduler/conf/env/dolphinscheduler_env.sh.

Add the following content to the file:

    # config anaconda environment
    export PATH=/opt/anaconda3/bin:$PATH

Method B:

Log in with the admin account and configure a conda environment variable.

[Screenshot: mlflow-conda-env]

Note: When configuring the task, select the conda environment created above. Otherwise, the task cannot find the conda environment.

[Screenshot: mlflow-set-conda-env]

Start the MLflow Service

Make sure MLflow is installed. You can install it with pip install mlflow.

Create a folder where you want to save your experiments and models, and start the MLflow service from it.

    mkdir mlflow
    cd mlflow
    mlflow server -h 0.0.0.0 -p 5000 --serve-artifacts --backend-store-uri sqlite:///mlflow.db

After running, an MLflow service is started.

After this, you can visit the MLflow service (http://localhost:5000) page to view the experiments and models.
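
Before scheduling tasks, it can help to confirm from the worker that the service is reachable. The sketch below is one way to do so, assuming the server exposes its /health endpoint at the tracking URI; the function name tracking_server_is_up is hypothetical.

```python
import urllib.request


def tracking_server_is_up(base_uri="http://localhost:5000", timeout=3.0):
    """Return True if the MLflow server answers its /health endpoint with HTTP 200."""
    url = base_uri.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


print(tracking_server_is_up())
```

If the check returns False on a worker while the page opens in your browser, the tracking server URI configured in the task likely points to an address that the worker cannot reach.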

[Screenshot: mlflow-server]

Preset Algorithm Repository Configuration

If you can’t access GitHub, you can modify the following fields in the common.properties configuration file to replace the GitHub address with an accessible address.

    # mlflow task plugin preset repository
    ml.mlflow.preset_repository=https://github.com/apache/dolphinscheduler-mlflow
    # mlflow task plugin preset repository version
    ml.mlflow.preset_repository_version="main"