Best Practices for Designing Components

Designing and writing components for Kubeflow Pipelines

This page describes some recommended practices for designing components. For an application of these best practices, see the component development guide. If you’re new to pipelines, see the conceptual guides to pipelines and components.

General component design rules

  • Design your components with composability in mind. Think about upstream and downstream components: what formats to consume as inputs from the upstream components, and what formats to use for output data so that downstream components can consume it.
  • Component code must use local files for input/output data (unless impossible; for example, Cloud ML Engine and BigQuery require Cloud Storage staging paths).
  • Components must be pure: they must not use any outside data except data that comes through inputs (unless impossible). Everything should either be inside the container or come from inputs. Network access is strongly discouraged unless that’s the explicit purpose of a component (for example, upload/download).
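
As a rough illustration of the rules above, the following Python sketch shows component logic that touches only local files handed to it by the caller. The function name `count_lines` and its behavior are hypothetical and stand in for whatever work a real component would do.

```python
from pathlib import Path


def count_lines(input_path: str, output_path: str) -> int:
    """Hypothetical component logic: count the lines in a local input file.

    Everything the function needs arrives through its arguments. It reads no
    environment variables, network resources, or fixed file locations.
    """
    with open(input_path, encoding="utf-8") as src:
        line_count = sum(1 for _ in src)

    # The caller chooses the output location, so its directory may not exist yet.
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(str(line_count), encoding="utf-8")
    return line_count
```

Because the input and output are plain local paths, a downstream component can consume the output file directly, and the same function can be exercised outside the container.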

Writing component code

  • The program must be runnable both locally and inside the Docker container.
  • Programming languages:

    • Generally, use the language that makes the most sense. If the component wraps a Java library, then it may make sense to use Java to expose that library.
    • For most new components where performance is not a concern, Python is preferred (use version 3 wherever possible).
    • If a component wraps an existing program, it’s preferred to directly expose the program in the component command line.
    • If there needs to be some wrapper around the program (small pre-processing or post-processing like file renaming), it can be done with a shell script.
    • Follow the best practices for the chosen language.
  • Each output data piece should be written to a separate file (see the next item).

  • The input and output file paths must be passed in the command line and not hardcoded:

    • Typical command line:
        program.py --input-data <input path> --output-data <output path> --param 42
    • Do NOT hardcode paths in the program:
        open("/output.txt", "w")
  • For temporary data, use library functions that create temporary files. For example, for Python use https://docs.python.org/3/library/tempfile.html. Do not just write to the root, or testing will be hard.

  • Design the code to be testable.
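
The command-line guidance in this section can be sketched in Python roughly as follows. The flag names mirror the typical command line shown earlier; the transformation (repeating the input text `--param` times) and the scratch file name are placeholders invented for illustration only.

```python
import argparse
import shutil
import tempfile
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical component program")
    parser.add_argument("--input-data", required=True, help="Path of the input file")
    parser.add_argument("--output-data", required=True, help="Path of the output file")
    parser.add_argument("--param", type=int, default=42, help="Illustrative parameter")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Scratch space comes from tempfile, never from a fixed location such as "/".
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch_file = Path(scratch_dir) / "partial.txt"
        text = Path(args.input_data).read_text(encoding="utf-8")
        # Placeholder transformation: repeat the input text --param times.
        scratch_file.write_text(text * args.param, encoding="utf-8")

        # The output path is supplied by the caller; create its directory if needed.
        output_path = Path(args.output_data)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(scratch_file, output_path)


# Demonstration with throwaway paths, standing in for a real command line.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_input = Path(demo_dir) / "input.txt"
    demo_input.write_text("abc", encoding="utf-8")
    demo_output = Path(demo_dir) / "out" / "output.txt"
    main(["--input-data", str(demo_input),
          "--output-data", str(demo_output),
          "--param", "2"])
    demo_result = demo_output.read_text(encoding="utf-8")
```

In a real program the demonstration at the bottom would be replaced by a call to `main()` under an `if __name__ == "__main__":` guard. Accepting `argv` as a parameter keeps the entry point callable from tests without spawning a process.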

Writing tests

  • Follow the general rules section so that writing the tests is easier.
  • Use the unit testing libraries that are standard for the language you’re using.
  • Try to design the component code so that it can be tested using unit tests. Do not use the network unless necessary.

  • Prepare small input data files so that the component code can be tested in isolation. For example, for an ML prediction component prepare a small model and evaluation dataset.

  • Use testing best practices.

    • Test the expected behavior of the code. Don’t just verify that “nothing has changed”:

      • For training you can look at the loss at the final iteration.
      • For prediction you can look at the result metrics.
      • For data augmenting you can check for some desired post-invariants.
  • If the component cannot be tested locally or in isolation, then create a small proof-of-concept pipeline that tests the component. You can use conditionals to verify the output values of a particular task and only enable the “success” task if the results are expected.
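
A unit test along these lines might look like the sketch below, which uses Python's standard `unittest` and `tempfile` modules. The function `augment_lines` is a hypothetical stand-in for data-augmentation component code, defined inline so the example is self-contained. The tests check post-invariants of the output rather than comparing against a stored snapshot.

```python
import tempfile
import unittest
from pathlib import Path


def augment_lines(input_path, output_path):
    """Hypothetical component logic: append a reversed copy of every line."""
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    augmented = lines + [line[::-1] for line in lines]
    Path(output_path).write_text("\n".join(augmented) + "\n", encoding="utf-8")


class AugmentLinesTest(unittest.TestCase):
    def setUp(self):
        # A small input file in a temporary directory keeps the test isolated.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.input_path = Path(self._tmp.name) / "input.txt"
        self.output_path = Path(self._tmp.name) / "output.txt"
        self.input_path.write_text("abc\nxyz\n", encoding="utf-8")

    def test_output_doubles_the_row_count(self):
        augment_lines(self.input_path, self.output_path)
        output_lines = self.output_path.read_text(encoding="utf-8").splitlines()
        self.assertEqual(len(output_lines), 4)

    def test_original_rows_are_preserved(self):
        augment_lines(self.input_path, self.output_path)
        output_lines = self.output_path.read_text(encoding="utf-8").splitlines()
        self.assertEqual(output_lines[:2], ["abc", "xyz"])
        self.assertEqual(output_lines[2:], ["cba", "zyx"])
```

Saved as a module, these tests can be run with `python -m unittest`. No network access or container is required.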

Writing a Dockerfile

  • Follow the Docker best practices

  • Structure the Dockerfile so that the required packages are installed first and the main component scripts/binaries are added last. Ideally, split the Dockerfile in two parts (base image and component code) so that the main component image build is fast and more reliable (it does not require network access).

Writing a component specification YAML file

For the complete definition of a Kubeflow Pipelines component, see the component specification. When creating your component.yaml file, you can look at the definitions for some existing components.

  • Use the {inputValue: Input name} command-line placeholder for small values that should be directly inserted into the command line.
  • Use the {inputPath: Input name} command-line placeholder for input file locations.
  • Use the {outputPath: Output name} command-line placeholder for output file locations.
  • Specify the full command line in ‘command:’ instead of just arguments to the entry point.
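
To show how the three placeholders fit into a full command line, here is a sketch that builds the structure of a component specification as a Python dictionary and serializes it. The component name, image, program path, and input/output names are all hypothetical. The output is JSON, which YAML parsers generally accept for simple structures like this one, but treat that as an assumption to verify against the component specification.

```python
import json

# All names below (component name, image, program path, input and output
# names) are hypothetical and only illustrate the shape of the specification.
component_spec = {
    "name": "Example component",
    "inputs": [
        {"name": "Input data"},
        {"name": "Param", "type": "Integer"},
    ],
    "outputs": [
        {"name": "Output data"},
    ],
    "implementation": {
        "container": {
            "image": "example.com/my-project/my-component:latest",
            # The full command line goes in "command", not only the
            # arguments to the image's entry point.
            "command": [
                "python3", "/component/program.py",
                "--input-data", {"inputPath": "Input data"},
                "--output-data", {"outputPath": "Output data"},
                "--param", {"inputValue": "Param"},
            ],
        }
    },
}

# Serialized text that could be written to a component.yaml file.
component_yaml_text = json.dumps(component_spec, indent=2)
```

Here the small integer value is inserted with `inputValue`, while the data files are referenced through `inputPath` and `outputPath` so that the program receives local file paths.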