Ceph Cluster CRD

Rook allows creation and customization of storage clusters through the custom resource definitions (CRDs).

Sample

To get you started, here is a simple example of a CRD to configure a Ceph cluster with all nodes and all devices. More examples are included later in this doc.

NOTE: In addition to your CephCluster object, you need to create RBAC rules for the namespace in which you will create the CephCluster. See the Common Cluster Resources section below.

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        # see the "Cluster Settings" section below for more details on which image of ceph to run
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      storage:
        useAllNodes: true
        useAllDevices: true

In addition to the CRD, you will also need to create a namespace, role, and role binding as seen in the common cluster resources below.

Settings

Settings can be specified at the global level to apply to the cluster as a whole, while other settings can be specified at more fine-grained levels. If any setting is unspecified, a suitable default will be used automatically.

Cluster metadata

  • name: The name that will be used internally for the Ceph cluster. Most commonly the name is the same as the namespace since multiple clusters are not supported in the same namespace.
  • namespace: The Kubernetes namespace that will be created for the Rook cluster. The services, pods, and other resources created by the operator will be added to this namespace. The common scenario is to create a single Rook cluster. If multiple clusters are created, they must not have conflicting devices or host paths.

Cluster Settings

  • cephVersion: The version information for launching the ceph daemons.
    • image: The image used for running the ceph daemons. For example, ceph/ceph:v12.2.9-20181026 or ceph/ceph:v13.2.4-20190109. For the latest ceph images, see the Ceph DockerHub. To ensure a consistent version of the image is running across all nodes in the cluster, it is recommended to use a very specific image version. Tags also exist that would give the latest version, but they are only recommended for test environments. For example, the tag v13 will be updated each time a new mimic build is released. Using the v13 or similar tag is not recommended in production because it may lead to inconsistent versions of the image running across different nodes in the cluster.
    • allowUnsupported: If true, allow an unsupported major version of the Ceph release. Currently only luminous and mimic are supported, so nautilus would require this to be set to true. Should be set to false in production.
  • dataDirHostPath: The path on the host (hostPath) where config and data should be stored for each of the services. If the directory does not exist, it will be created. Because this directory persists on the host, it will remain after pods are deleted.
    • On Minikube environments, use /data/rook. Minikube boots into a tmpfs but it provides some directories where files can be persisted across reboots. Using one of these directories will ensure that Rook’s data and configuration files are persisted and that enough storage space is available.
    • WARNING: For test scenarios, if you delete a cluster and start a new cluster on the same hosts, the path used by dataDirHostPath must be deleted. Otherwise, stale keys and other config will remain from the previous cluster and the new mons will fail to start. If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes empty dir docs.
  • dashboard: Settings for the Ceph dashboard. To view the dashboard in your browser see the dashboard guide.
    • enabled: Whether to enable the dashboard to view cluster status
    • urlPrefix: Serves the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
  • network: The network settings for the cluster
    • hostNetwork: uses network of the hosts instead of using the SDN below the containers.
  • mon: Contains mon-related options. See the mon settings below. For more details on the mons and when to choose a number other than 3, see the mon health design doc.
  • rbdMirroring: The settings for rbd mirror daemon(s). Configuring which pools or images to be mirrored must be completed in the rook toolbox by running the rbd mirror command.
    • workers: The number of rbd daemons to perform the rbd mirroring between clusters.
  • placement: placement configuration settings
  • resources: resources configuration settings
  • storage: Storage selection and configuration that will be used across the cluster. Note that these settings can be overridden for specific nodes.
    • useAllNodes: true or false, indicating if all nodes in the cluster should be used for storage according to the cluster level storage selection and configuration values. If individual nodes are specified under the nodes field below, then useAllNodes must be set to false.
    • nodes: Names of individual nodes in the cluster that should have their storage included in accordance with either the cluster level configuration specified above or any node specific overrides described in the next section below. useAllNodes must be set to false to use specific nodes and their config.
    • config: Config settings applied to all OSDs on the node unless overridden by devices or directories. See the config settings below.
    • storage selection settings
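As a rough illustration of how the cluster-level settings above fit together, the fragment below is a sketch only. The field names come from the list above; the urlPrefix path and the workers count are hypothetical placeholder values, not recommendations.

```yaml
spec:
  cephVersion:
    image: ceph/ceph:v13.2.4-20190109
    allowUnsupported: false        # should stay false in production
  dataDirHostPath: /var/lib/rook
  dashboard:
    enabled: true
    urlPrefix: /ceph-dashboard     # hypothetical subpath behind a reverse proxy
  network:
    hostNetwork: false
  rbdMirroring:
    workers: 1                     # hypothetical number of rbd mirror daemons
```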

Node Updates

Nodes can be added and removed over time by updating the Cluster CRD, for example with kubectl -n rook-ceph edit cephcluster rook-ceph. This will bring up your default text editor and allow you to add and remove storage nodes from the cluster. This feature is only available when useAllNodes has been set to false.
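A minimal sketch of what the edited storage section might look like after adding a node. The node names here are hypothetical; in practice each name should match the node's kubernetes.io/hostname label.

```yaml
spec:
  storage:
    useAllNodes: false      # required for managing individual nodes
    nodes:
    - name: "node-a"        # hypothetical existing storage node
    - name: "node-b"        # hypothetical node newly added in the editor
```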

Mon Settings

  • count: set the number of mons to be started. The number should be odd and between 1 and 9. If not specified the default is set to 3 and allowMultiplePerNode is also set to true.
  • allowMultiplePerNode: enable (true) or disable (false) the placement of multiple mons on one node. Default is false.

If these settings are changed in the CRD the operator will update the number of mons during a periodic check of the mon health, which by default is every 45 seconds.
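For illustration, the two mon settings above could be set in the CephCluster spec roughly as follows. This is a sketch using the documented field names; the values shown are one reasonable combination, not a requirement.

```yaml
spec:
  mon:
    count: 3                      # odd number between 1 and 9
    allowMultiplePerNode: false   # place each mon on a separate node
```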

To change the defaults that the operator uses to determine the mon health and whether to fail over a mon, the following environment variables can be changed in operator.yaml. The intervals should be small enough that you have confidence the mons will maintain quorum, while also being long enough to ignore network blips so that mons are not failed over too often.

  • ROOK_MON_HEALTHCHECK_INTERVAL: The frequency with which to check if mons are in quorum (default is 45 seconds)
  • ROOK_MON_OUT_TIMEOUT: The interval to wait before marking a mon as “out” and starting a new mon to replace it in the quorum (default is 5 minutes)
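As a sketch, the two variables might appear in the env list of the operator container in operator.yaml as shown below. The variable names come from the list above; the duration-string value format ("45s", "300s") is an assumption, so check it against the operator.yaml shipped with your Rook release.

```yaml
# Fragment of the operator container definition in operator.yaml
env:
- name: ROOK_MON_HEALTHCHECK_INTERVAL
  value: "45s"     # assumed format; 45 seconds is the documented default
- name: ROOK_MON_OUT_TIMEOUT
  value: "300s"    # assumed format; 5 minutes is the documented default
```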

Node Settings

In addition to the cluster level settings specified above, each individual node can also specify configuration to override the cluster level settings and defaults. If a node does not specify any configuration then it will inherit the cluster level settings.

  • name: The name of the node, which should match its kubernetes.io/hostname label.
  • config: Config settings applied to all OSDs on the node unless overridden by devices or directories. See the config settings below.
  • storage selection settings

Storage Selection Settings

Below are the settings available, both at the cluster and individual node level, for selecting which storage resources will be included in the cluster.

  • useAllDevices: true or false, indicating whether all devices found on nodes in the cluster should be automatically consumed by OSDs. Not recommended unless you have a very controlled environment where you will not risk formatting of devices with existing data. When true, all devices will be used except those with partitions created or a local filesystem. Is overridden by deviceFilter if specified.
  • deviceFilter: A regular expression that allows selection of devices to be consumed by OSDs. If individual devices have been specified for a node then this filter will be ignored. This field uses golang regular expression syntax. For example:
    • sdb: Only selects the sdb device if found
    • ^sd.: Selects all devices starting with sd
    • ^sd[a-d]: Selects devices starting with sda, sdb, sdc, and sdd if found
    • ^s: Selects all devices that start with s
    • ^[^r]: Selects all devices that do not start with r
  • devices: A list of individual device names belonging to this node to include in the storage cluster.
    • name: The name of the device (e.g., sda).
    • config: Device-specific config settings. See the config settings below.
  • directories: A list of directory paths that will be included in the storage cluster. Note that using two directories on the same physical device can cause a negative performance impact.
    • path: The path on disk of the directory (e.g., /rook/storage-dir).
    • config: Directory-specific config settings. See the config settings below.
  • location: Location information about the cluster to help with data placement, such as region or data center. This is directly fed into the underlying Ceph CRUSH map. More information on CRUSH maps can be found in the ceph docs.
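To make the selection settings above more concrete, here is a small sketch that selects devices by regular expression at the cluster level instead of consuming every device. It assumes nodes whose data disks are named sda through sdd; adjust the pattern to your own device naming.

```yaml
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sd[a-d]"   # only devices starting with sda, sdb, sdc, or sdd
```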

OSD Configuration Settings

The following storage selection settings are specific to Ceph and do not apply to other backends. All variables are key-value pairs represented as strings.

  • metadataDevice: Name of a device to use for the metadata of OSDs on each node. Performance can be improved by using a low latency device (such as SSD or NVMe) as the metadata device, while other spinning platter (HDD) devices on a node are used to store data.
  • storeType: filestore or bluestore, the underlying storage format to use for each OSD. The default is set dynamically to bluestore for devices, while filestore is the default for directories. Set this store type explicitly to override the default. Warning: Bluestore is not recommended for directories in production. Bluestore does not purge data from the directory and over time will grow without the ability to compact or shrink.
  • databaseSizeMB: The size in MB of a bluestore database. Include quotes around the size.
  • walSizeMB: The size in MB of a bluestore write ahead log (WAL). Include quotes around the size.
  • journalSizeMB: The size in MB of a filestore journal. Include quotes around the size.
  • osdsPerDevice**: The number of OSDs to create on each device. High performance devices such as NVMe can handle running multiple OSDs. If desired, this can be overridden for each node and each device.

** NOTE: Depending on the Ceph image running in your cluster, OSDs will be configured differently. Newer images will configure OSDs with ceph-volume, which provides support for osdsPerDevice as well as other features that will be exposed in future Rook releases. OSDs created prior to Rook v0.9 or with older images of Luminous and Mimic are not created with ceph-volume and thus would not support the same features. For ceph-volume, the following images are supported:

  • Luminous 12.2.10 or newer
  • Mimic 13.2.3 or newer
  • Nautilus
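The following fragment sketches how the OSD configuration settings above might be combined: a cluster-level config block, with osdsPerDevice overridden for a single fast device. The node name and device name are hypothetical, and osdsPerDevice only takes effect with a ceph-volume capable image as described in the note above.

```yaml
spec:
  storage:
    useAllNodes: false
    config:
      storeType: bluestore
      databaseSizeMB: "1024"   # sizes are quoted strings
      walSizeMB: "1024"
    nodes:
    - name: "node-a"           # hypothetical node name
      devices:
      - name: "nvme0n1"        # hypothetical NVMe device
        config:
          osdsPerDevice: "4"   # hypothetical count for a high performance device
```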

Placement Configuration Settings

Placement configuration for the cluster services. It includes the following keys: mgr, mon, osd and all. Each service will have its placement configuration generated by merging the generic configuration under all with the most specific one (which will override any attributes).

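A Placement configuration is specified (according to the Kubernetes PodSpec) as:

  • nodeAffinity: Kubernetes NodeAffinity
  • podAffinity: Kubernetes PodAffinity
  • podAntiAffinity: Kubernetes PodAntiAffinity
  • tolerations: list of Kubernetes Toleration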

The mon pod does not allow Pod affinity or anti-affinity. This is because of the mons having built-in anti-affinity with each other through the operator. The operator chooses which nodes are to run a mon on. Each mon is then tied to a node with a node selector using a hostname. See the mon design doc for more details on the mon failover design.
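As a sketch of the merge behavior described above, the fragment below puts a toleration under all and a node affinity only under osd. Under the stated merging rule, OSD pods would be expected to receive both, while other services receive only the toleration. The label value osd-node is hypothetical.

```yaml
spec:
  placement:
    all:
      tolerations:
      - key: storage-node
        operator: Exists
    osd:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - osd-node   # hypothetical label value for OSD nodes
```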

Cluster-wide Resources Configuration Settings

Resources should be specified so that the Rook components are handled according to the Kubernetes Pod Quality of Service classes. This helps keep Rook components running when, for example, a node runs out of memory, because whether a pod is killed depends on its Quality of Service class.

You can set resource requests/limits for rook components through the Resource Requirements/Limits structure in the following keys:

  • mgr: Set resource requests/limits for MGRs.
  • mon: Set resource requests/limits for Mons.
  • osd: Set resource requests/limits for OSDs.

Resource Requirements/Limits

For more information on resource requests/limits see the official Kubernetes documentation: Kubernetes - Managing Compute Resources for Containers

  • requests: Requests for cpu or memory.
    • cpu: Request for CPU (example: one CPU core 1, 50% of one CPU core 500m).
    • memory: Request for Memory (example: one gigabyte of memory 1Gi, half a gigabyte of memory 512Mi).
  • limits: Limits for cpu or memory.
    • cpu: Limit for CPU (example: one CPU core 1, 50% of one CPU core 500m).
    • memory: Limit for Memory (example: one gigabyte of memory 1Gi, half a gigabyte of memory 512Mi).
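A minimal sketch of cluster-wide resource settings using the keys and structure described above. The numbers are hypothetical placeholders, not sizing recommendations; consult the Ceph hardware recommendations before choosing real values.

```yaml
spec:
  resources:
    mgr:
      requests:
        cpu: "500m"        # 50% of one CPU core
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"
    mon:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"           # one CPU core
        memory: "1Gi"
```

Setting requests equal to limits for every container in a pod places it in the Kubernetes Guaranteed Quality of Service class, as the mon entry here does.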

Samples

Here are several samples for configuring Ceph clusters. Each of the samples must also include the namespace and corresponding access granted for management by the Ceph operator. See the common cluster resources below.

Storage configuration: All devices

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      network:
        hostNetwork: false
      dashboard:
        enabled: true
      # cluster level storage configuration and selection
      storage:
        useAllNodes: true
        useAllDevices: true
        deviceFilter:
        location:
        config:
          metadataDevice:
          databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
          journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
          osdsPerDevice: "1"

Storage Configuration: Specific devices

Individual nodes and their config can be specified so that only the named nodes below will be used as storage resources. Each node’s ‘name’ field should match its ‘kubernetes.io/hostname’ label.

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      network:
        hostNetwork: false
      dashboard:
        enabled: true
      # cluster level storage configuration and selection
      storage:
        useAllNodes: false
        useAllDevices: false
        deviceFilter:
        location:
        config:
          metadataDevice:
          databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
          journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
        nodes:
        - name: "172.17.4.101"
          directories: # specific directories to use for storage can be specified for each node
          - path: "/rook/storage-dir"
        - name: "172.17.4.201"
          devices: # specific devices to use for storage can be specified for each node
          - name: "sdb"
          - name: "sdc"
          config: # configuration can be specified at the node level which overrides the cluster level config
            storeType: bluestore
        - name: "172.17.4.301"
          deviceFilter: "^sd."

Storage Configuration: Cluster wide Directories

This example is based on the Storage Configuration: Specific devices example. Individual nodes can override the cluster-wide directories list.

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      network:
        hostNetwork: false
      dashboard:
        enabled: true
      # cluster level storage configuration and selection
      storage:
        useAllNodes: false
        useAllDevices: false
        config:
          databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
          journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
        directories:
        - path: "/rook/storage-dir"
        nodes:
        - name: "172.17.4.101"
          directories: # specific directories to use for storage can be specified for each node
          # overrides the above `directories` values for this node
          - path: "/rook/my-node-storage-dir"
        - name: "172.17.4.201"

Node Affinity

To control where various services will be scheduled by Kubernetes, use the placement configuration sections below. The example under ‘all’ would have all services scheduled on Kubernetes nodes labeled with ‘role=storage-node’ and tolerate taints with a key of ‘storage-node’.

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      network:
        hostNetwork: false
      dashboard:
        enabled: true
      placement:
        all:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: role
                  operator: In
                  values:
                  - storage-node
          tolerations:
          - key: storage-node
            operator: Exists
        mgr:
          nodeAffinity:
          tolerations:
        mon:
          nodeAffinity:
          tolerations:
        osd:
          nodeAffinity:
          tolerations:

Resource requests/Limits

To control how many resources the rook components can request/use, you can set requests and limits in Kubernetes for them. You can override these requests/limits for OSDs per node when using useAllNodes: false in the node item in the nodes list.

WARNING Before setting resource requests/limits, please take a look at the Ceph documentation for recommendations for each component: Ceph - Hardware Recommendations.

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: ceph/ceph:v13.2.4-20190109
      dataDirHostPath: /var/lib/rook
      # cluster level resource requests/limits configuration
      resources:
      storage:
        useAllNodes: false
        nodes:
        - name: "172.17.4.201"
          resources:
            limits:
              cpu: "2"
              memory: "4096Mi"
            requests:
              cpu: "2"
              memory: "4096Mi"

Common Cluster Resources

Each Ceph cluster must be created in a namespace, and the Rook operator must be given access to manage the cluster in that namespace. The namespace and these access controls must be created along with each of the examples previously shown.

    apiVersion: v1
    kind: Namespace
    metadata:
      name: rook-ceph
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: rook-ceph-osd
      namespace: rook-ceph
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: rook-ceph-mgr
      namespace: rook-ceph
    ---
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-osd
      namespace: rook-ceph
    rules:
    - apiGroups: [""]
      resources: ["configmaps"]
      verbs: [ "get", "list", "watch", "create", "update", "delete" ]
    ---
    # Aspects of ceph-mgr that require access to the system namespace
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-mgr-system
      namespace: rook-ceph
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
      - watch
    ---
    # Aspects of ceph-mgr that operate within the cluster's namespace
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-mgr
      namespace: rook-ceph
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      - services
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - batch
      resources:
      - jobs
      verbs:
      - get
      - list
      - watch
      - create
      - update
      - delete
    - apiGroups:
      - ceph.rook.io
      resources:
      - "*"
      verbs:
      - "*"
    ---
    # Allow the operator to create resources in this cluster's namespace
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-cluster-mgmt
      namespace: rook-ceph
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: rook-ceph-cluster-mgmt
    subjects:
    - kind: ServiceAccount
      name: rook-ceph-system
      namespace: rook-ceph-system
    ---
    # Allow the osd pods in this namespace to work with configmaps
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-osd
      namespace: rook-ceph
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: rook-ceph-osd
    subjects:
    - kind: ServiceAccount
      name: rook-ceph-osd
      namespace: rook-ceph
    ---
    # Allow the ceph mgr to access the cluster-specific resources necessary for the mgr modules
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-mgr
      namespace: rook-ceph
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: rook-ceph-mgr
    subjects:
    - kind: ServiceAccount
      name: rook-ceph-mgr
      namespace: rook-ceph
    ---
    # Allow the ceph mgr to access the rook system resources necessary for the mgr modules
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-mgr-system
      namespace: rook-ceph-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: rook-ceph-mgr-system
    subjects:
    - kind: ServiceAccount
      name: rook-ceph-mgr
      namespace: rook-ceph
    ---
    # Allow the ceph mgr to access cluster-wide resources necessary for the mgr modules
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1beta1
    metadata:
      name: rook-ceph-mgr-cluster
      namespace: rook-ceph
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: rook-ceph-mgr-cluster
    subjects:
    - kind: ServiceAccount
      name: rook-ceph-mgr
      namespace: rook-ceph