ceph-mgr orchestrator modules

Warning

This is developer documentation, describing Ceph internals that are only relevant to people writing ceph-mgr orchestrator modules.

In this context, orchestrator refers to some external service that provides the ability to discover devices and create Ceph services. This includes external projects such as ceph-ansible, DeepSea, and Rook.

An orchestrator module is a ceph-mgr module (ceph-mgr module developer’s guide) which implements common management operations using a particular orchestrator.

Orchestrator modules subclass the Orchestrator class: this class is an interface that only provides method definitions to be implemented by subclasses. The purpose of defining this common interface for different orchestrators is to enable common UI code, such as the dashboard, to work with various different backends.

    digraph G {
        subgraph cluster_1 {
            volumes [label="mgr/volumes"]
            rook [label="mgr/rook"]
            dashboard [label="mgr/dashboard"]
            orchestrator_cli [label="mgr/orchestrator_cli"]
            orchestrator [label="Orchestrator Interface"]
            ansible [label="mgr/ansible"]
            ssh [label="mgr/ssh"]
            deepsea [label="mgr/deepsea"]
            label = "ceph-mgr";
        }
        volumes -> orchestrator
        dashboard -> orchestrator
        orchestrator_cli -> orchestrator
        orchestrator -> rook -> rook_io
        orchestrator -> ansible -> ceph_ansible
        orchestrator -> deepsea -> suse_deepsea
        orchestrator -> ssh
        rook_io [label="Rook"]
        ceph_ansible [label="ceph-ansible"]
        suse_deepsea [label="DeepSea"]
        rankdir="TB";
    }

Behind all the abstraction, the purpose of orchestrator modules is simple: enable Ceph to do things like discover available hardware, create and destroy OSDs, and run MDS and RGW services.

A tutorial is not included here: for full and concrete examples, see the existing implemented orchestrator modules in the Ceph source tree.
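
For orientation, here is a purely illustrative, hedged skeleton of what a backend module can look like. It assumes the ceph-mgr runtime, where the orchestrator module is importable; the class name and its behaviour are hypothetical, and real backends typically also inherit from MgrModule and implement far more of the interface.

    # Purely illustrative skeleton of a backend module; not taken from the Ceph tree.
    import orchestrator

    class MyOrchestratorBackend(orchestrator.Orchestrator):
        """Hypothetical backend that forwards requests to an external service."""

        def available(self):
            # Cheap health check: can we reach the external orchestrator?
            return True, ""

        def get_hosts(self):
            # Should return a Completion that eventually resolves to a list of
            # InventoryNode objects; raising NotImplementedError is the documented
            # way to signal an unsupported method.
            raise NotImplementedError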

Glossary

  • Stateful service
  • a daemon that uses local storage, such as OSD or mon.

  • Stateless service
  • a daemon that doesn’t use any local storage, such as an MDS, RGW, nfs-ganesha, iSCSI gateway.

  • Label
  • arbitrary string tags that may be applied by administrators to nodes. Typically administrators use labels to indicate which nodes should run which kinds of service. Labels are advisory (from human input) and do not guarantee that nodes have particular physical capabilities.

  • Drive group
  • collection of block devices with common/shared OSD formatting (typically one or more SSDs acting as journals/dbs for a group of HDDs).

  • Placement
  • choice of which node is used to run a service.

Key Concepts

The underlying orchestrator remains the source of truth for information about whether a service is running, what is running where, which nodes are available, etc. Orchestrator modules should avoid taking any internal copies of this information, and read it directly from the orchestrator backend as much as possible.

Bootstrapping nodes and adding them to the underlying orchestration system is outside the scope of Ceph’s orchestrator interface. Ceph can only work on nodes when the orchestrator is already aware of them.

Calls to orchestrator modules are all asynchronous, and return completion objects (see below) rather than returning values immediately.

Where possible, placement of stateless services should be left up to theorchestrator.

Completions and batching

All methods that read or modify the state of the system can potentially be long running. To handle that, all such methods return a Completion object. Orchestrator modules must implement the process method: this takes a list of completions, and is responsible for checking whether they are finished and advancing the underlying operations as needed.

Each orchestrator module implements its own underlying mechanisms for completions. This might involve running the underlying operations in threads, or batching the operations up before later executing them in one go in the background. If implementing such a batching pattern, the module would do no work on any operation until it appeared in a list of completions passed into process.

Some operations need to show progress. Those operations need to add a ProgressReference to the completion. At some point, the progress reference becomes effective, meaning that the operation has really happened (e.g. a service has actually been started).
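
As a concrete illustration of that batching pattern, consider the following hedged sketch. Only process() and the Completion properties come from the interface described here; the class name, the lock, and the _advance helper are hypothetical.

    import threading

    class BatchingBackend:
        """Hypothetical backend that defers all work until process() is called."""

        def __init__(self):
            self._lock = threading.Lock()

        def process(self, completions):
            # Advance any completions that have not finished yet.  A real module
            # would drive its external orchestrator (Rook, DeepSea, ...) here.
            with self._lock:
                for completion in completions:
                    if not completion.is_finished:
                        self._advance(completion)

        def _advance(self, completion):
            # Hypothetical helper: execute the queued operation in the backend
            # and, once it succeeds, finalize the completion with its result.
            ...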

  • Orchestrator.process(completions)
  • Given a list of Completion instances, process any which are incomplete.

Callers should inspect the detail of each completion to identify partial completion/progress information, and present that information to the user.

This method should not block, as that would make it slow to query a status while other long-running operations are in progress.

  • class orchestrator.Completion(_first_promise=None, value=, on_complete=None, name=None)
  • Combines multiple promises into one overall operation.
  • Completions are composable by being able to call one completion from another completion, i.e. making them re-usable using Promises. E.g.:

    >>> return Orchestrator().get_hosts().then(self._create_osd)

    where get_hosts returns a Completion of a list of hosts and _create_osd takes a list of hosts.

    The concept behind this is to store the computation steps explicitly and then explicitly evaluate the chain:

    >>> p = Completion(on_complete=lambda x: x*2).then(on_complete=lambda x: str(x))
    ... p.finalize(2)
    ... assert p.result == "4"

    or graphically:

    +---------------+      +------------------+
    |               | then |                  |
    | lambda x: x*2 | +--> | lambda x: str(x) |
    |               |      |                  |
    +---------------+      +------------------+

    • fail(e)
    • Sets the whole completion to be failed with this exception and ends the evaluation.

    • property has_result

    • Does the operation already have a result?

    For write operations, it can already have a result if the orchestrator’s configuration is persistently written. Typically this would indicate that an update had been written to a manifest, but that the update had not necessarily been pushed out to the cluster.

    • property is_errored
    • Has the completion failed? The default implementation looks for self.exception. Can be overwritten.

    • property is_finished

    • Could the external operation be deemed as complete, or should we wait? We must wait for a read operation only if it is not complete.

    • property needs_result

    • Could the external operation be deemed as complete, or should we wait? We must wait for a read operation only if it is not complete.

    • property progress_reference

    • ProgressReference. Marks this completion as a write completion.

    • property result

    • The result of the operation that we were waiting for. Only valid after calling Orchestrator.process() on this completion.

    • result_str()

    • Force a string.

    • class orchestrator.ProgressReference(message, mgr, completion=None)
      • completion = None
      • The completion can already have a result before the write operation is effective. progress == 1 means the services are created / removed.

      • property progress

      • If an orchestrator module can provide more detailed progress information, it needs to also call progress.update().

Placement

In general, stateless services do not require any specific placement rules, as they can run anywhere that sufficient system resources are available. However, some orchestrators may not include the functionality to choose a location in this way, so we can optionally specify a location when creating a stateless service.

OSD services generally require a specific placement choice, as this will determine which storage devices are used.

Error Handling

The main goal of error handling within orchestrator modules is to provide debug information to assist users when dealing with deployment errors.

  • class orchestrator.OrchestratorError
  • General orchestrator specific error.

    Used for deployment, configuration or user errors.

    It’s not intended for programming errors or orchestrator internal errors.

  • class orchestrator.NoOrchestrator(msg='No orchestrator configured (try ceph orchestrator set backend)')
  • No orchestrator is configured.

  • class orchestrator.OrchestratorValidationError
  • Raised when an orchestrator doesn’t support a specific feature.

In detail, orchestrators need to explicitly deal with different kinds of errors:

  • No orchestrator configured

    See NoOrchestrator.

  • An orchestrator doesn’t implement a specific method.

    For example, an Orchestrator doesn’t support add_host.

    In this case, a NotImplementedError is raised.

  • Missing features within implemented methods.

    E.g. optional parameters to a command that are not supported by the backend (e.g. the hosts field in the Orchestrator.update_mons() command with the rook backend).

    See OrchestratorValidationError.

  • Input validation errors

    The orchestrator_cli module and other calling modules are supposed to provide meaningful error messages.

    See OrchestratorValidationError.

  • Errors when actually executing commands

    The resulting Completion should contain an error string that assists in understanding the problem. In addition, Completion.is_errored() is set to True.

  • Invalid configuration in the orchestrator modules

    This can be tackled in the same way as errors that occur when actually executing commands.

All other errors are unexpected orchestrator issues and thus should raise an exception that is then logged into the mgr log file. If there is a completion object at that point, Completion.result() may contain an error message.
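
To make the distinction between these cases concrete, a calling module might separate them roughly as follows. This is a hedged sketch: self is assumed to be an OrchestratorClientMixin subclass (see Client Modules below), and the helper name and host name are hypothetical.

    import orchestrator

    def try_add_host(self, hostname):
        # Hypothetical helper inside an OrchestratorClientMixin subclass.
        try:
            completion = self.add_host(hostname)
            self._orchestrator_wait([completion])
            return completion.result
        except orchestrator.NoOrchestrator:
            self.log.error("no orchestrator backend is configured")
        except NotImplementedError:
            self.log.error("the configured backend does not implement add_host")
        except orchestrator.OrchestratorValidationError as e:
            self.log.error("invalid request: %s", e)
        except orchestrator.OrchestratorError as e:
            self.log.error("orchestrator error: %s", e)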

Excluded functionality

  • Ceph’s orchestrator interface is not a general purpose framework for managing linux servers – it is deliberately constrained to manage the Ceph cluster’s services only.

  • Multipathed storage is not handled (multipathing is unnecessary for Ceph clusters). Each drive is assumed to be visible only on a single node.

Host management

  • Orchestrator.add_host(host)
  • Add a host to the orchestrator inventory.

      • Parameters
      • host – hostname

  • Orchestrator.remove_host(host)
  • Remove a host from the orchestrator inventory.

      • Parameters
      • host – hostname

  • Orchestrator.get_hosts()
  • Report the hosts in the cluster.

    The default implementation is extra slow.

      • Returns
      • list of InventoryNodes

Inventory and status

  • Orchestrator.get_inventory(node_filter=None, refresh=False)
  • Returns something that was created by ceph-volume inventory.

      • Returns
      • list of InventoryNode

  • class orchestrator.InventoryFilter(labels=None, nodes=None)
  • When fetching inventory, use this filter to avoid unnecessarily scanning the whole estate. A usage sketch follows at the end of this section.

    Typical use:

      • filter by node when presenting UI workflow for configuring a particular server.
      • filter by label when not all of the estate is Ceph servers, and we want to only learn about the Ceph servers.
      • filter by label when we are interested particularly in e.g. OSD servers.

  • class ceph.deployment.inventory.Devices(devices)
  • A container for Device instances with reporting

  • class ceph.deployment.inventory.Device(path, sys_api=None, available=None, rejected_reasons=None, lvs=None, device_id=None)

  • Orchestrator.describe_service(service_type=None, service_id=None, node_name=None, refresh=False)
  • Describe a service (of any kind) that is already configured in the orchestrator. For example, when viewing an OSD in the dashboard we might like to also display information about the orchestrator’s view of the service (like the kubernetes pod ID).

    When viewing a CephFS filesystem in the dashboard, we would use this to display the pods being currently run for MDS daemons.

      • Returns
      • list of ServiceDescription objects.

  • class orchestrator.ServiceDescription(nodename=None, container_id=None, container_image_id=None, container_image_name=None, service=None, service_instance=None, service_type=None, version=None, rados_config_location=None, service_url=None, status=None, status_desc=None)
  • For responding to queries about the status of a particular service, stateful or stateless.

    This is not about health or performance monitoring of services: it’s about letting the orchestrator tell Ceph whether and where a service is scheduled in the cluster. When an orchestrator tells Ceph “it’s running on node123”, that’s not a promise that the process is literally up this second, it’s a description of where the orchestrator has decided the service should run.
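
The sketch below shows how a client module might fetch a filtered inventory. It is a hedged example: self is assumed to be an OrchestratorClientMixin subclass, and the InventoryNode attribute names (name, devices) are assumptions based on the classes listed above.

    import orchestrator

    def list_devices_on_node(self, node_name):
        # Restrict the scan to a single node using InventoryFilter.
        node_filter = orchestrator.InventoryFilter(nodes=[node_name])
        completion = self.get_inventory(node_filter=node_filter)
        self._orchestrator_wait([completion])
        for node in completion.result:            # assumed: list of InventoryNode
            for device in node.devices.devices:   # Devices container holds Device objects
                self.log.info("%s: %s", node.name, device.path)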

Service Actions

  • Orchestrator.service_action(action, service_type, service_name=None, service_id=None)
  • Perform an action (start/stop/reload) on a service.

    Either service_name or service_id must be specified (see the example below):

      • If using service_name, perform the action on that entire logical service (i.e. all daemons providing that named service).

      • If using service_id, perform the action on a single specific daemon instance.

      • Parameters

        • action – one of “start”, “stop”, “restart”, “redeploy”, “reconfig”

        • service_type – e.g. “mds”, “rgw”, …

        • service_name – name of logical service (“cephfs”, “us-east”, …)

        • service_id – service daemon instance (usually a short hostname)

      • Return type
      • Completion
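
A minimal, hedged sketch of the two addressing modes; the logical service name “cephfs” and the daemon id “node1” are hypothetical:

    def restart_mds(orch):
        # Restart every daemon backing the logical service "cephfs" ...
        by_name = orch.service_action("restart", "mds", service_name="cephfs")
        # ... or restart a single daemon instance (ids are usually short hostnames).
        by_id = orch.service_action("restart", "mds", service_id="node1")
        orch.process([by_name, by_id])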

OSD management

  • Orchestrator.create_osds(drive_group)
  • Create one or more OSDs within a single Drive Group (see the sketch at the end of this section).

    The principal argument here is the drive_group member of OsdSpec: other fields are advisory/extensible for any finer-grained OSD feature enablement (choice of backing store, compression/encryption, etc).

      • Parameters

        • drive_group – DriveGroupSpec

        • all_hosts – TODO, this is required because the orchestrator methods are not composable. Probably this parameter can easily be removed because each orchestrator can use the “get_inventory” method and the “drive_group.host_pattern” attribute to obtain the list of hosts where to apply the operation.

  • Orchestrator.remove_osds(osd_ids)

      • Parameters

        • osd_ids – list of OSD IDs

        • destroy – marks the OSD as being destroyed. See OSD Replacement.

    Note that this can only remove OSDs that were successfully created (i.e. got an OSD ID).

  • class ceph.deployment.drive_group.DeviceSelection(paths=None, model=None, size=None, rotational=None, limit=None, vendor=None, all=False)
  • Used within ceph.deployment.drive_group.DriveGroupSpec to specify the devices used by the Drive Group.

    Any attributes (even none) can be included in the device specification structure.

      • all = None
      • Matches all devices. Can only be used for data devices.

      • limit = None
      • Limit the number of devices added to this Drive Group. Devices are used from top to bottom in the output of ceph-volume inventory.

      • model = None
      • A wildcard string. e.g: “SDD*” or “SanDisk SD8SN8U5”

      • paths = None
      • List of absolute paths to the devices.

      • rotational = None
      • Is the drive rotating or not?

      • size = None
      • Size specification of format LOW:HIGH. Can also take the form :HIGH, LOW: or an exact value (as ceph-volume inventory reports).

      • vendor = None
      • Match on the VENDOR property of the drive.

  • class ceph.deployment.drive_group.DriveGroupSpec(host_pattern, data_devices=None, db_devices=None, wal_devices=None, journal_devices=None, data_directories=None, osds_per_device=None, objectstore='bluestore', encrypted=False, db_slots=None, wal_slots=None, osd_id_claims=None, block_db_size=None, block_wal_size=None, journal_size=None)
  • Describe a drive group in the same form that ceph-volume understands.

      • block_db_size = None
      • Set (or override) the “bluestore_block_db_size” value, in bytes

      • block_wal_size = None
      • Set (or override) the “bluestore_block_wal_size” value, in bytes

      • data_devices = None
      • A ceph.deployment.drive_group.DeviceSelection

      • data_directories = None
      • A list of strings, containing paths which should back OSDs

      • db_devices = None
      • A ceph.deployment.drive_group.DeviceSelection

      • db_slots = None
      • How many OSDs per DB device

      • encrypted = None
      • true or false

      • host_pattern = None
      • An fnmatch pattern to select hosts. Can also be a single host.

      • journal_devices = None
      • A ceph.deployment.drive_group.DeviceSelection

      • journal_size = None
      • Set journal_size in bytes

      • objectstore = None
      • filestore or bluestore

      • osd_id_claims = None
      • Optional: mapping of OSD id to DeviceSelection, used when the created OSDs are meant to replace previous OSDs on the same node. See OSD Replacement.

      • osds_per_device = None
      • Number of osd daemons per “DATA” device. To fully utilize nvme devices multiple osds are required.

      • wal_devices = None
      • A ceph.deployment.drive_group.DeviceSelection

      • wal_slots = None
      • How many OSDs per WAL device

  • Orchestrator.blink_device_light(ident_fault, on, locations)
  • Instructs the orchestrator to enable or disable either the ident or the fault LED.

  • class orchestrator.DeviceLightLoc
  • Describes a specific device on a specific host. Used for enabling or disabling LEDs on devices.

      • hostname – as in orchestrator.Orchestrator.get_hosts()

      • device_id – e.g. ABC1234DEF567-1R1234_ABC8DE0Q. See ceph osd metadata | jq '.[].device_ids'
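
To tie these pieces together, here is a hedged sketch of building a drive group (rotational data devices plus non-rotational DB devices, matching the glossary’s description of a drive group) and handing it to create_osds. The host pattern is hypothetical, and whether rotational accepts plain booleans in a given backend is an assumption.

    from ceph.deployment.drive_group import DeviceSelection, DriveGroupSpec

    def create_hdd_osds_with_ssd_db(orch):
        spec = DriveGroupSpec(
            host_pattern="storage-*",                        # hypothetical fnmatch pattern
            data_devices=DeviceSelection(rotational=True),   # HDDs hold the data
            db_devices=DeviceSelection(rotational=False),    # SSDs hold the BlueStore DB
            objectstore="bluestore",
        )
        completion = orch.create_osds(spec)
        orch.process([completion])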

OSD Replacement

See Replacing an OSD for the underlying process.

Replacing OSDs is fundamentally a two-staged process, as users need to physically replace drives. The orchestrator therefore exposes this two-staged process.

Phase one is a call to Orchestrator.remove_osds() with destroy=True in order to mark the OSD as destroyed.

Phase two is a call to Orchestrator.create_osds() with a Drive Group that has DriveGroupSpec.osd_id_claims set to the destroyed OSD ids.
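
A hedged sketch of the two phases follows. The OSD id, host and device path are hypothetical, and the exact shape of osd_id_claims (described above as a mapping of OSD id to DeviceSelection) should be checked against the backend in use.

    from ceph.deployment.drive_group import DeviceSelection, DriveGroupSpec

    def replace_osd(orch, osd_id, host, new_device_path):
        # Phase one: mark the old OSD as destroyed so that its id can be reused.
        phase_one = orch.remove_osds([osd_id], destroy=True)
        orch.process([phase_one])

        # Phase two: create the replacement OSD, claiming the destroyed id.
        spec = DriveGroupSpec(
            host_pattern=host,
            data_devices=DeviceSelection(paths=[new_device_path]),
            # per the description above: a mapping of OSD id to DeviceSelection
            osd_id_claims={osd_id: DeviceSelection(paths=[new_device_path])},
        )
        phase_two = orch.create_osds(spec)
        orch.process([phase_two])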

Stateless Services

  • class orchestrator.StatelessServiceSpec(name, placement=None)
  • Details of stateless service creation.

    Request to orchestrator for a group of stateless services such as MDS, RGW or iSCSI gateway.

  • Orchestrator.add_mds(spec)
  • Create a new MDS cluster

  • Orchestrator.remove_mds(name)
  • Remove an MDS cluster

  • Orchestrator.update_mds(spec)
  • Update / redeploy an existing MDS cluster, for example by changing the number of service instances.

  • Orchestrator.add_rgw(spec)
  • Create a new RGW zone

  • Orchestrator.remove_rgw(zone)
  • Remove an RGW zone

  • Orchestrator.update_rgw(spec)
  • Update / redeploy an existing RGW zone, for example by changing the number of service instances.

  • class orchestrator.NFSServiceSpec(name, pool=None, namespace=None, placement=None)

  • Orchestrator.add_nfs(spec)
  • Create a new NFS cluster

  • Orchestrator.remove_nfs(name)
  • Remove an NFS cluster

  • Orchestrator.update_nfs(spec)
  • Update / redeploy an existing NFS cluster, for example by changing the number of service instances.
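
For example, here is a hedged sketch of requesting an MDS service through this interface; the filesystem name is hypothetical and real callers may need to fill in more of the spec (e.g. placement):

    import orchestrator

    def deploy_mds(orch, fs_name):
        # Ask the backend to run MDS daemons for the named filesystem.
        spec = orchestrator.StatelessServiceSpec(name=fs_name)
        completion = orch.add_mds(spec)
        orch.process([completion])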

Upgrades

  • Orchestrator.upgrade_available()
  • Report on what versions are available to upgrade to

      • Returns
      • List of strings

  • Orchestrator.upgrade_start(upgrade_spec)

  • Orchestrator.upgrade_status()
  • If an upgrade is currently underway, report on where we are in the process, or if some error has occurred.

      • Returns
      • UpgradeStatusSpec instance

  • class orchestrator.UpgradeSpec

  • class orchestrator.UpgradeStatusSpec

Utility

  • Orchestrator.available()
  • Report whether we can talk to the orchestrator. This is the place to give the user a meaningful message if the orchestrator isn’t running or can’t be contacted.

    This method may be called frequently (e.g. every page load to conditionally display a warning banner), so make sure it’s not too expensive. It’s okay to give a slightly stale status (e.g. based on a periodic background ping of the orchestrator) if that’s necessary to make this method fast.

    Note

    True doesn’t mean that the desired functionality is actually available in the orchestrator. I.e. this won’t work as expected:

        >>> if OrchestratorClientMixin().available()[0]:  # wrong.
        ...     OrchestratorClientMixin().get_hosts()

      • Returns
      • two-tuple of boolean, string

  • Orchestrator.get_feature_set()
  • Describes which methods this orchestrator implements

    Note

    True doesn’t mean that the desired functionality is actually possible in the orchestrator. I.e. this won’t work as expected:

        >>> api = OrchestratorClientMixin()
        ... if api.get_feature_set()['get_hosts']['available']:  # wrong.
        ...     api.get_hosts()

    It’s better to ask for forgiveness instead:

        >>> try:
        ...     OrchestratorClientMixin().get_hosts()
        ... except (OrchestratorError, NotImplementedError):
        ...     ...

      • Returns
      • Dict of API method names to {'available': True or False}

Client Modules

  • class orchestrator.OrchestratorClientMixin
  • A module that inherits from OrchestratorClientMixin can directly call all Orchestrator methods without manually calling remote.

    Every interface method from Orchestrator is converted into a stub method that internally calls OrchestratorClientMixin._oremote()

        >>> class MyModule(OrchestratorClientMixin):
        ...     def func(self):
        ...         completion = self.add_host('somehost')  # calls `_oremote()`
        ...         self._orchestrator_wait([completion])
        ...         self.log.debug(completion.result)

      • set_mgr(mgr)
      • Usable in the Dashboard, which uses a global mgr.