Overview of Kubeflow Pipelines

Understanding the goals and main concepts of Kubeflow Pipelines

Beta

This Kubeflow component has beta status. See the Kubeflow versioning policies. The Kubeflow team is interested in your feedback about the usability of the feature.

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

Quickstart

Run your first pipeline by following the pipelines quickstart guide.

What is Kubeflow Pipelines?

The Kubeflow Pipelines platform consists of:

  • A user interface (UI) for managing and tracking experiments, jobs, and runs.
  • An engine for scheduling multi-step ML workflows.
  • An SDK for defining and manipulating pipelines and components.
  • Notebooks for interacting with the system using the SDK.

The following are the goals of Kubeflow Pipelines:

  • End-to-end orchestration: enabling and simplifying the orchestration of machine learning pipelines.
  • Easy experimentation: making it easy for you to try numerous ideas and techniques and manage your various trials/experiments.
  • Easy re-use: enabling you to re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time.

In Kubeflow v0.1.3 and later, Kubeflow Pipelines is one of the Kubeflow core components. It’s automatically deployed during Kubeflow deployment.

Due to kubeflow/pipelines#1700, the container builder in Kubeflow Pipelines currently prepares credentials for Google Cloud Platform (GCP) only. As a result, the container builder supports only Google Container Registry. However, you can store the container images on other registries, provided you set up the credentials correctly to fetch the image.

What is a pipeline?

A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph. (See the screenshot below showing an example of a pipeline graph.) The pipeline includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component.

After developing your pipeline, you can upload and share it on the Kubeflow Pipelines UI.

A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. For example, a component can be responsible for data preprocessing, data transformation, model training, and so on.

See the conceptual guides to pipelines and components.
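Conceptually, a pipeline is a directed acyclic graph (DAG) of component steps: each component runs after the steps it depends on. The pure-Python sketch below (this is an illustration only, not the Kubeflow Pipelines SDK; the step names simply mirror the XGBoost example later on this page) models that idea by resolving a dependency graph into an execution order:

```python
# Illustrative model of a pipeline as a DAG of components (NOT the KFP SDK).
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each key is a "component"; the set holds its ".after()"-style dependencies.
pipeline = {
    'create_cluster':   set(),
    'analyze':          {'create_cluster'},
    'transform':        {'analyze'},
    'train':            {'transform'},
    'predict':          {'train'},
    'confusion_matrix': {'predict'},
    'roc':              {'predict'},
}

# A pipeline engine would execute steps in an order like this one.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

Note that `confusion_matrix` and `roc` both depend only on `predict`, so a real engine is free to run them in parallel once `predict` finishes.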

Example of a pipeline

The screenshots and code below show the xgboost-training-cm.py pipeline, which creates an XGBoost model using structured data in CSV format. You can see the source code and other information about the pipeline on GitHub.

The runtime execution graph of the pipeline

The screenshot below shows the example pipeline’s runtime execution graph in the Kubeflow Pipelines UI:

XGBoost results on the pipelines UI

The Python code that represents the pipeline

Below is an extract from the Python code that defines the xgboost-training-cm.py pipeline. You can see the full code on GitHub.

```python
@dsl.pipeline(
    name='XGBoost Trainer',
    description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output='gs://your-gcs-bucket',
    project='your-gcp-project',
    cluster_name='xgb-%s' % dsl.RUN_ID_PLACEHOLDER,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
    output_template = str(output) + '/' + dsl.RUN_ID_PLACEHOLDER + '/data'

    # Current GCP pyspark/spark op do not provide outputs as return values, instead,
    # we need to use strings to pass the uri around.
    analyze_output = output_template
    transform_output_train = os.path.join(output_template, 'train', 'part-*')
    transform_output_eval = os.path.join(output_template, 'eval', 'part-*')
    train_output = os.path.join(output_template, 'train_output')
    predict_output = os.path.join(output_template, 'predict_output')

    with dsl.ExitHandler(exit_op=dataproc_delete_cluster_op(
        project_id=project,
        region=region,
        name=cluster_name
    )):
        _create_cluster_op = dataproc_create_cluster_op(
            project_id=project,
            region=region,
            name=cluster_name,
            initialization_actions=[
                os.path.join(_PYSRC_PREFIX,
                             'initialization_actions.sh'),
            ],
            image_version='1.2'
        )

        _analyze_op = dataproc_analyze_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            schema=schema,
            train_data=train_data,
            output=output_template
        ).after(_create_cluster_op).set_display_name('Analyzer')

        _transform_op = dataproc_transform_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=train_data,
            eval_data=eval_data,
            target=target,
            analysis=analyze_output,
            output=output_template
        ).after(_analyze_op).set_display_name('Transformer')

        _train_op = dataproc_train_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            train_data=transform_output_train,
            eval_data=transform_output_eval,
            target=target,
            analysis=analyze_output,
            workers=workers,
            rounds=rounds,
            output=train_output
        ).after(_transform_op).set_display_name('Trainer')

        _predict_op = dataproc_predict_op(
            project=project,
            region=region,
            cluster_name=cluster_name,
            data=transform_output_eval,
            model=train_output,
            target=target,
            analysis=analyze_output,
            output=predict_output
        ).after(_train_op).set_display_name('Predictor')

        _cm_op = confusion_matrix_op(
            predictions=os.path.join(predict_output, 'part-*.csv'),
            output_dir=output_template
        ).after(_predict_op)

        _roc_op = roc_op(
            predictions_dir=os.path.join(predict_output, 'part-*.csv'),
            true_class=true_label,
            true_score_column=true_label,
            output_dir=output_template
        ).after(_predict_op)

    dsl.get_pipeline_conf().add_op_transformer(
        gcp.use_gcp_secret('user-gcp-sa'))
```

Pipeline input data on the Kubeflow Pipelines UI

The partial screenshot below shows the Kubeflow Pipelines UI for kicking off a run of the pipeline. The pipeline definition in your code determines which parameters appear in the UI form. The pipeline definition can also set default values for the parameters:

Starting the XGBoost run on the pipelines UI
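The parameters the UI form offers correspond to the pipeline function’s signature, and defaults in the signature become the form’s default values. As a rough illustration of that mapping (plain Python introspection on a cut-down stand-in function, not the actual UI code):

```python
import inspect

# Stand-in for the pipeline function; in real code this carries @dsl.pipeline.
def xgb_train_pipeline(output='gs://your-gcs-bucket',
                       project='your-gcp-project',
                       rounds=200,
                       workers=2):
    pass

# Collect parameter names and defaults -- roughly what the UI form renders.
form_fields = {
    name: param.default
    for name, param in inspect.signature(xgb_train_pipeline).parameters.items()
}
print(form_fields)
# {'output': 'gs://your-gcs-bucket', 'project': 'your-gcp-project', 'rounds': 200, 'workers': 2}
```

A user can override any of these values in the form before starting the run.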

Outputs from the pipeline

The following screenshots show examples of the pipeline output visible on the Kubeflow Pipelines UI.

Prediction results:

Prediction output

Confusion matrix:

Confusion matrix

Receiver operating characteristics (ROC) curve:

ROC

Architectural overview

Pipelines architectural diagram

At a high level, the execution of a pipeline proceeds as follows:

  • Python SDK: You create components or specify a pipeline using the Kubeflow Pipelines domain-specific language (DSL).
  • DSL compiler: The DSL compiler transforms your pipeline’s Python code into a static configuration (YAML).
  • Pipeline Service: You call the Pipeline Service to create a pipeline run from the static configuration.
  • Kubernetes resources: The Pipeline Service calls the Kubernetes API server to create the necessary Kubernetes resources (CRDs) to run the pipeline.
  • Orchestration controllers: A set of orchestration controllers execute the containers needed to complete the pipeline execution specified by the Kubernetes resources (CRDs). The containers execute within Kubernetes Pods on virtual machines. An example controller is the Argo Workflow controller, which orchestrates task-driven workflows.
  • Artifact storage: The Pods store two kinds of data:

    • Metadata: Experiments, jobs, runs, etc. Also single scalar metrics, generally aggregated for the purposes of sorting and filtering. Kubeflow Pipelines stores the metadata in a MySQL database.
    • Artifacts: Pipeline packages, views, etc. Also large-scale metrics like time series, usually used for investigating an individual run’s performance and for debugging. Kubeflow Pipelines stores the artifacts in an artifact store like Minio server or Cloud Storage. The MySQL database and the Minio server are both backed by the Kubernetes PersistentVolume (PV) subsystem.
  • Persistence agent and ML metadata: The Pipeline Persistence Agent watches the Kubernetes resources created by the Pipeline Service and persists the state of these resources in the ML Metadata Service. The Pipeline Persistence Agent records the set of containers that executed as well as their inputs and outputs. The input/output consists of either container parameters or data artifact URIs.

  • Pipeline web server: The Pipeline web server gathers data from various services to display relevant views: the list of pipelines currently running, the history of pipeline execution, the list of data artifacts, debugging information about individual pipeline runs, and execution status about individual pipeline runs.
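The pivotal step above is compilation: the DSL compiler turns your in-memory Python pipeline definition into a static, serializable specification that the Pipeline Service can act on. The toy sketch below illustrates that idea only; it emits JSON rather than the Argo Workflow YAML the real compiler produces, and all names and fields here are invented for illustration:

```python
import json

# Invented in-memory pipeline definition: a flat list of steps with
# ".after()"-style dependency edges (illustrative, not KFP data structures).
steps = [
    {'name': 'analyze',   'image': 'gcr.io/example/analyze',   'after': []},
    {'name': 'transform', 'image': 'gcr.io/example/transform', 'after': ['analyze']},
    {'name': 'train',     'image': 'gcr.io/example/train',     'after': ['transform']},
]

def compile_pipeline(name, steps):
    """Toy 'DSL compiler': produce a static configuration from the definition.
    The real compiler emits an Argo Workflow YAML document instead."""
    return {
        'apiVersion': 'example/v1',
        'kind': 'Workflow',
        'metadata': {'name': name},
        'spec': {'templates': steps},
    }

static_config = json.dumps(compile_pipeline('xgboost-trainer', steps), indent=2)
print(static_config)
```

Once serialized like this, the pipeline no longer depends on the Python process that defined it, which is what lets the Pipeline Service hand it to the Kubernetes API server and the orchestration controllers as plain Kubernetes resources.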

Next steps


Last modified 12.02.2020: fix link in pipeline overview (#1679) (c380e917)