Component Specification

Definition of a Kubeflow Pipelines component

This specification describes the container component data model for Kubeflow Pipelines. The data model is serialized to a file in YAML format for sharing.

Below are the main parts of the component definition:

Metadata: Name, description, and other metadata.
Interface (inputs and outputs): Name, type, default value.
Implementation: How to run the component, given the input arguments.

Example of a component specification

A component specification takes the form of a YAML file, component.yaml. Below is an example:

name: xgboost4j - Train classifier
description: Trains a boosted tree ensemble classifier using xgboost4j
inputs:
- {name: Training data}
- {name: Rounds, type: Integer, default: '30', description: Number of training rounds}
outputs:
- {name: Trained model, type: XGBoost model, description: Trained XGBoost model}
implementation:
  container:
    image: gcr.io/ml-pipeline/xgboost-classifier-train@sha256:b3a64d57
    command: [
      /ml/train.py,
      --train-set, {inputPath: Training data},
      --rounds,    {inputValue: Rounds},
      --out-model, {outputPath: Trained model},
    ]

See some examples of real-world component specifications.
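To make the three parts of the definition concrete, the example above can be pictured as a small in-memory data model. This is only an illustrative sketch: the class names `Port`, `Container`, and `Component` are hypothetical and are not the classes used by the Kubeflow Pipelines SDK. The field names follow the properties listed in this specification, and the image reference is copied as-is from the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Port:
    """One entry of `inputs` or `outputs` (hypothetical class name)."""
    name: str
    type: Optional[str] = None
    description: Optional[str] = None
    default: Optional[str] = None  # only meaningful for inputs


@dataclass
class Container:
    """The `implementation.container` part (hypothetical class name)."""
    image: str
    # Each item is a plain string or a one-key placeholder dict,
    # for example {"inputValue": "Rounds"}.
    command: List[Union[str, dict]] = field(default_factory=list)


@dataclass
class Component:
    """Metadata, interface, and implementation (hypothetical class name)."""
    name: str
    description: Optional[str] = None
    inputs: List[Port] = field(default_factory=list)
    outputs: List[Port] = field(default_factory=list)
    container: Optional[Container] = None


component = Component(
    name="xgboost4j - Train classifier",
    description="Trains a boosted tree ensemble classifier using xgboost4j",
    inputs=[
        Port(name="Training data"),
        Port(name="Rounds", type="Integer", default="30",
             description="Number of training rounds"),
    ],
    outputs=[
        Port(name="Trained model", type="XGBoost model",
             description="Trained XGBoost model"),
    ],
    container=Container(
        image="gcr.io/ml-pipeline/xgboost-classifier-train@sha256:b3a64d57",
        command=[
            "/ml/train.py",
            "--train-set", {"inputPath": "Training data"},
            "--rounds", {"inputValue": "Rounds"},
            "--out-model", {"outputPath": "Trained model"},
        ],
    ),
)
print([port.name for port in component.inputs])
```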

Detailed specification (ComponentSpec)

This section describes the ComponentSpec.

Metadata

name: Human-readable name of the component.
description: Description of the component.
metadata: Standard object’s metadata:
- annotations: A string key-value map used to add information about the component. Currently, the annotations get translated to Kubernetes annotations when the component task is executed on Kubernetes. Current limitation: the key cannot contain more than one slash (“/”). See more information in the Kubernetes user guide.
- labels: Deprecated. Use annotations.
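
The slash limitation on annotation keys can be checked before a component is shared. The following is a minimal sketch: the helper name `check_annotation_keys` and the sample keys are hypothetical, and the check covers only the slash-count rule stated above, not the full Kubernetes annotation syntax.

```python
def check_annotation_keys(annotations):
    """Return the keys that contain more than one slash ("/")."""
    return [key for key in annotations if key.count("/") > 1]


annotations = {
    "example.com/owner": "ml-team",       # one slash: accepted
    "example.com/team/owner": "ml-team",  # two slashes: rejected
}
bad_keys = check_annotation_keys(annotations)
print(bad_keys)
```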

Interface

inputs and outputs: Specifies the list of inputs/outputs and their properties. Each input or output has the following properties:
- name: Human-readable name of the input/output. Name must be unique inside the inputs or outputs section, but an output may have the same name as an input.
- description: Human-readable description of the input/output.
- default: Specifies the default value for an input. Only valid for inputs.
- type: Specifies the type of input/output. The types are used as hints for pipeline authors and can be used by the pipeline system/UI to validate arguments and connections between components. Basic types are String, Integer, Float, and Bool. See the full list of types defined by the Kubeflow Pipelines SDK.
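
The naming and default rules above can be expressed as a small check. This is a sketch under stated assumptions: the function name `validate_interface` is hypothetical, the spec is assumed to be already loaded into plain Python dictionaries, and the pipeline system may apply different or additional checks.

```python
from collections import Counter


def validate_interface(spec):
    """Return a list of problems found in the inputs/outputs of a spec dict."""
    problems = []
    for section in ("inputs", "outputs"):
        ports = spec.get(section, [])
        # Names must be unique within a section, not across sections.
        counts = Counter(port["name"] for port in ports)
        for name, count in counts.items():
            if count > 1:
                problems.append(f"duplicate name in {section}: {name!r}")
        # `default` is only valid for inputs.
        if section == "outputs":
            for port in ports:
                if "default" in port:
                    problems.append(f"output {port['name']!r} has a default")
    return problems


# An output may reuse the name of an input, so this spec has no problems.
print(validate_interface({"inputs": [{"name": "Model"}],
                          "outputs": [{"name": "Model"}]}))
```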

Implementation

implementation: Specifies how to execute the component instance. There are two implementation types, container and graph. (The latter is not in scope for this document.) In the future we may introduce more implementation types like daemon or K8sResource.
- container: Describes the Docker container that implements the component. A portable subset of the Kubernetes Container v1 spec.
  - image: Name of the Docker image.
  - command: Entrypoint array. The Docker image’s ENTRYPOINT is used if this is not provided. Each item is either a string or a placeholder. The most common placeholders are {inputValue: Input name}, {inputPath: Input name} and {outputPath: Output name}.
  - args: Arguments to the entrypoint. The Docker image’s CMD is used if this is not provided. Each item is either a string or a placeholder. The most common placeholders are {inputValue: Input name}, {inputPath: Input name} and {outputPath: Output name}.
  - env: Map of environment variables to set in the container.
  - fileOutputs: Legacy property that is only needed when the container always stores its output data in a hard-coded, non-configurable local location. This property specifies a map between some outputs and the local file paths where the program writes the output data files. Such containers need to be fixed by modifying the program or by adding a wrapper script that copies the output to a configurable location. Otherwise the component may be incompatible with future storage systems.

You can set all other Kubernetes container properties when you use the component inside a pipeline.
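
The wrapper-script approach mentioned for fileOutputs can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `run_and_copy_output` and its parameters are hypothetical. It runs a program that writes to a fixed location and then copies the result to a destination chosen by the caller, so that the destination could instead be supplied on the command line, for example through an outputPath placeholder.

```python
import shutil
import subprocess
from pathlib import Path


def run_and_copy_output(program_argv, hard_coded_path, output_path):
    """Run the program, then copy its fixed-location output to output_path."""
    # Fail loudly if the wrapped program exits with a non-zero status.
    subprocess.run(program_argv, check=True)
    destination = Path(output_path)
    # The parent directories of the destination may not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(hard_coded_path, destination)
    return destination
```

A wrapper built this way would be listed as the component's command in place of the original program, with the original program's arguments passed through in `program_argv`.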

Using placeholders for command-line arguments

Consuming input by value

The {inputValue: <Input name>} placeholder is replaced by the value of the input argument:

In component.yaml:

  command: [program.py, --rounds, {inputValue: Rounds}]

In the pipeline code:

  task1 = component1(rounds=150)

Resulting command-line code (showing the value of the input argument that has replaced the placeholder):

  program.py --rounds 150
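
The substitution above can be sketched in a few lines. This is an illustration of the idea only, not the pipeline system's actual code: the function name `resolve_input_values` is hypothetical, the placeholder is modeled as a one-key dictionary (the form a YAML parser would typically produce for `{inputValue: Rounds}`), and arguments are keyed by input name.

```python
def resolve_input_values(command, arguments):
    """Replace {inputValue: name} placeholders with the argument values."""
    resolved = []
    for item in command:
        if isinstance(item, dict) and "inputValue" in item:
            # Command-line arguments are always passed as strings.
            resolved.append(str(arguments[item["inputValue"]]))
        else:
            # Plain strings (and other placeholders) are left untouched.
            resolved.append(item)
    return resolved


command = ["program.py", "--rounds", {"inputValue": "Rounds"}]
argv = resolve_input_values(command, {"Rounds": 150})
print(" ".join(argv))
```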

Consuming input by file

The {inputPath: <Input name>} placeholder is replaced by the (auto-generated) local file path where the system has put the argument data passed for the “Input name” input.

In component.yaml:

  command: [program.py, --train-set, {inputPath: training_data}]

In the pipeline code:

  task2 = component1(training_data=some_task1.outputs['some_data'])

Resulting command-line code (the placeholder is replaced by the generated path):

  program.py --train-set /inputs/training_data/data
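
From the program's side, the resolved placeholder is an ordinary file path argument. The sketch below shows how a program such as the hypothetical program.py might consume it. The function name `count_rows` and the assumption that the data is line-oriented text are illustrative only.

```python
import argparse


def count_rows(argv):
    """Parse --train-set and count the lines in the file it points to."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-set", required=True,
                        help="Local path generated by the system")
    args = parser.parse_args(argv)
    # The program only reads from the path; it does not choose it.
    with open(args.train_set, encoding="utf-8") as data_file:
        return sum(1 for _ in data_file)
```

In a real program, `argv` would typically be `sys.argv[1:]`.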

Producing outputs

The {outputPath: <Output name>} placeholder is replaced by a (generated) local file path where the component program is supposed to write the output data. The parent directories of the path may or may not exist. Your program must handle both cases without error.

In component.yaml:

  command: [program.py, --out-model, {outputPath: trained_model}]

In the pipeline code:

  task1 = component1()
  # You can now pass `task1.outputs['trained_model']` to other components as argument.

Resulting command-line code (the placeholder is replaced by the generated path):

  program.py --out-model /outputs/trained_model/data
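
Because the parent directories of the output path may or may not exist, the program should create them before writing. A minimal sketch of how a program such as the hypothetical program.py could do this is shown below. The function name `write_model` and the use of raw bytes for the model are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def write_model(argv, model_bytes):
    """Parse --out-model and write the model data to that path."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--out-model", required=True,
                        help="Local path generated by the system")
    args = parser.parse_args(argv)
    out_path = Path(args.out_model)
    # Works whether or not the parent directories already exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(model_bytes)
    return out_path
```

Passing `exist_ok=True` is what lets the same code handle both cases without raising an error.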

Last modified 27.10.2019: [Pipelines] Greatly simplified the "Create Reusable Components" tutorial (#1277) (a1893e88)