Hardware Recommendations

Ceph was designed to run on commodity hardware, which makes building and maintaining petabyte-scale data clusters economically feasible. When planning out your cluster hardware, you will need to balance a number of considerations, including failure domains and potential performance issues. Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts. Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that utilize your data cluster (e.g., OpenStack, CloudStack, etc).

Tip

Check out the Ceph blog too.

CPU

Ceph metadata servers dynamically redistribute their load, which is CPU intensive. So your metadata servers should have significant processing power (e.g., quad core or better CPUs). Ceph OSDs run the RADOS service, calculate data placement with CRUSH, replicate data, and maintain their own copy of the cluster map. Therefore, OSDs should have a reasonable amount of processing power (e.g., dual core processors). Monitors simply maintain a master copy of the cluster map, so they are not CPU intensive. You must also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on separate hosts.

RAM

Generally, more RAM is better.

Monitors and managers (ceph-mon and ceph-mgr)

Monitor and manager daemon memory usage generally scales with the size of the cluster. For small clusters, 1-2 GB is generally sufficient. For large clusters, you should provide more (5-10 GB). You may also want to consider tuning settings like mon_osd_cache_size or rocksdb_cache_size.

Metadata servers (ceph-mds)

The metadata daemon memory utilization depends on how much memory its cache is configured to consume. We recommend 1 GB as a minimum for most systems. See mds_cache_memory.

OSDs (ceph-osd)

By default, OSDs that use the BlueStore backend require 3-5 GB of RAM. You can adjust the amount of memory the OSD consumes with the osd_memory_target configuration option when BlueStore is in use. When using the legacy FileStore backend, the operating system page cache is used for caching data, so no tuning is normally needed, and the OSD memory consumption is generally related to the number of PGs per daemon in the system.
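As a rough illustration of how the per-OSD figure above adds up on a host, the following sketch estimates a RAM budget. The 4 GiB target is simply a value inside the 3-5 GB range quoted above, and the `os_reserve_gib` and `headroom` parameters are hypothetical planning knobs, not Ceph settings. Treat the result as a starting point for sizing, not a guarantee.

```python
def host_ram_budget_gib(num_osds, osd_memory_target_gib=4.0,
                        os_reserve_gib=4.0, headroom=0.2):
    """Rough RAM budget (GiB) for one OSD host.

    osd_memory_target_gib: assumed per-OSD memory target (illustrative value).
    os_reserve_gib:        hypothetical reserve for the OS and other daemons.
    headroom:              hypothetical safety margin on top of the OSD total,
                           e.g. for recovery and backfill spikes.
    """
    if num_osds < 0:
        raise ValueError("num_osds must be non-negative")
    osd_total = num_osds * osd_memory_target_gib
    return osd_total * (1.0 + headroom) + os_reserve_gib


if __name__ == "__main__":
    # Example: a host with 12 BlueStore OSDs.
    print(f"{host_ram_budget_gib(12):.1f} GiB")
```

With these assumed values, a 12-OSD host comes out at a little over 60 GiB; lowering the per-OSD target or the margin changes the result proportionally.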

Data Storage

Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider when planning for data storage. Simultaneous OS operations, and simultaneous requests for read and write operations from multiple daemons against a single drive, can slow performance considerably.

Important

Since Ceph has to write all data to the journal before it can send an ACK (for XFS at least), having the journal and OSD performance in balance is really important!

Hard Disk Drives

OSDs should have plenty of hard disk drive space for object data. We recommend a minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage of larger disks. We recommend dividing the price of the hard disk drive by the number of gigabytes to arrive at a cost per gigabyte, because larger drives may have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would generally increase the cost per gigabyte by roughly 40% (50% using the unrounded figures), rendering your cluster substantially less cost efficient.

Also, the larger the storage drive capacity, the more memory per Ceph OSD Daemon you will need, especially during rebalancing, backfilling and recovery. A general rule of thumb is ~1GB of RAM for 1TB of storage space.
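The cost comparison above is simple enough to script when evaluating several drive models. This minimal sketch reuses the example prices and the 1024 GB-per-TB convention from the text; the function name is illustrative only.

```python
def cost_per_gb(price_usd, capacity_tb, gb_per_tb=1024):
    """Price of a drive divided by its capacity in gigabytes."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return price_usd / (capacity_tb * gb_per_tb)


small = cost_per_gb(75.00, 1)    # 1 TB drive at $75
large = cost_per_gb(150.00, 3)   # 3 TB drive at $150

# Relative premium paid per gigabyte when choosing the smaller drive.
premium = small / large - 1

if __name__ == "__main__":
    print(f"1 TB: ${small:.4f}/GB, 3 TB: ${large:.4f}/GB, premium: {premium:.0%}")
```

Note that the unrounded figures give a 50% premium for the 1 terabyte drive, while the rounded values of $0.07 and $0.05 give 40%.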

Tip

Running multiple OSDs on a single disk, irrespective of partitions, is NOT a good idea.

Tip

Running an OSD and a monitor or a metadata server on a single disk, irrespective of partitions, is NOT a good idea either.

Storage drives are subject to limitations on seek time, access time, read and write times, as well as total throughput. These physical limitations affect overall system performance, especially during recovery. We recommend using a dedicated drive for the operating system and software, and one drive for each Ceph OSD Daemon you run on the host. Most “slow OSD” issues arise due to running an operating system, multiple OSDs, and/or multiple journals on the same drive. Since the cost of troubleshooting performance issues on a small cluster likely exceeds the cost of the extra disk drives, you can optimize your cluster design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely lead to resource contention and diminish the overall throughput. You may store a journal and object data on the same drive, but this may increase the time it takes to journal a write and ACK to the client. Ceph must write to the journal before it can ACK the write.

Ceph best practices dictate that you should run operating systems, OSD data and OSD journals on separate drives.

Solid State Drives

One opportunity for performance improvement is to use solid-state drives (SSDs) to reduce random access time and read latency while accelerating throughput. SSDs often cost more than 10x as much per gigabyte when compared to a hard disk drive, but SSDs often exhibit access times that are at least 100x faster than a hard disk drive.

SSDs do not have moving mechanical parts so they are not necessarily subject to the same types of limitations as hard disk drives. SSDs do have significant limitations though. When evaluating SSDs, it is important to consider the performance of sequential reads and writes. An SSD that has 400MB/s sequential write throughput may have much better performance than an SSD with 120MB/s of sequential write throughput when storing multiple journals for multiple OSDs.
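To make the 400MB/s versus 120MB/s comparison concrete, the sketch below estimates how many OSD journals one SSD could absorb at once. It assumes a worst case in which every journal receives its OSD's full write stream simultaneously, and it borrows the approximate 100MB/s commodity hard disk throughput cited in the Networks section as the per-OSD write rate. Both are simplifying assumptions, so measure real hardware before committing to a ratio.

```python
def max_journals_per_ssd(ssd_seq_write_mb_s, per_osd_write_mb_s=100.0):
    """Worst-case number of journals an SSD can sustain concurrently.

    per_osd_write_mb_s is an assumed sustained write rate per OSD; the
    default mirrors a commodity hard disk at roughly 100 MB/s.
    """
    if per_osd_write_mb_s <= 0:
        raise ValueError("per_osd_write_mb_s must be positive")
    return int(ssd_seq_write_mb_s // per_osd_write_mb_s)


if __name__ == "__main__":
    for ssd in (400, 120):
        print(f"{ssd} MB/s SSD: about {max_journals_per_ssd(ssd)} journal(s)")
```

Under these assumptions the faster SSD can back four journals before it becomes the bottleneck, while the slower one can back only one.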

Important

We recommend exploring the use of SSDs to improve performance. However, before making a significant investment in SSDs, we strongly recommend both reviewing the performance metrics of an SSD and testing the SSD in a test configuration to gauge performance.

Since SSDs have no moving mechanical parts, it makes sense to use them in the areas of Ceph that do not use a lot of storage space (e.g., journals). Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There are a few important performance considerations for journals and SSDs:

  • Write-intensive semantics: Journaling involves write-intensive semantics, so you should ensure that the SSD you choose to deploy will perform equal to or better than a hard disk drive when writing data. Inexpensive SSDs may introduce write latency even as they accelerate access time, because sometimes high performance hard drives can write as fast or faster than some of the more economical SSDs available on the market!

  • Sequential Writes: When you store multiple journals on an SSD you must consider the sequential write limitations of the SSD too, since they may be handling requests to write to multiple OSD journals simultaneously.

  • Partition Alignment: A common problem with SSD performance is that people like to partition drives as a best practice, but they often overlook proper partition alignment with SSDs, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned.
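For the partition alignment point, a quick arithmetic check can flag a misaligned partition once you know its starting sector. The sketch below assumes 512-byte logical sectors and a 1 MiB alignment boundary. The 1 MiB value is a commonly used convention rather than something stated in this document, so substitute the boundary recommended for your particular SSD.

```python
def is_aligned(start_sector, sector_size=512, alignment_bytes=1024 * 1024):
    """Return True if a partition's byte offset falls on the alignment boundary.

    sector_size and alignment_bytes are assumptions; adjust them to match
    the drive's logical sector size and preferred boundary.
    """
    if start_sector < 0 or sector_size <= 0 or alignment_bytes <= 0:
        raise ValueError("arguments must be positive (start_sector >= 0)")
    return (start_sector * sector_size) % alignment_bytes == 0


if __name__ == "__main__":
    for sector in (2048, 63):
        state = "aligned" if is_aligned(sector) else "NOT aligned"
        print(f"start sector {sector}: {state}")
```

A partition starting at sector 2048 sits exactly on a 1 MiB boundary, whereas the legacy starting sector 63 does not.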

While SSDs are cost prohibitive for object storage, OSDs may see a significant performance improvement by storing an OSD’s journal on an SSD and the OSD’s object data on a separate hard disk drive. The osd journal configuration setting defaults to /var/lib/ceph/osd/$cluster-$id/journal. You can mount this path to an SSD or to an SSD partition so that it is not merely a file on the same disk as the object data.

One way Ceph accelerates CephFS file system performance is to segregate the storage of CephFS metadata from the storage of the CephFS file contents. Ceph provides a default metadata pool for CephFS metadata. You will never have to create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for your CephFS metadata pool that points only to a host’s SSD storage media. See Mapping Pools to Different Types of OSDs for details.

Controllers

Disk controllers also have a significant impact on write throughput. Carefully consider your selection of disk controllers to ensure that they do not create a performance bottleneck.

Tip

The Ceph blog is often an excellent source of information on Ceph performance issues. See Ceph Write Throughput 1 and Ceph Write Throughput 2 for additional details.

Additional Considerations

You may run multiple OSDs per host, but you should ensure that the sum of the total throughput of your OSD hard disks doesn’t exceed the network bandwidth required to service a client’s need to read or write data. You should also consider what percentage of the overall data the cluster stores on each host. If the percentage on a particular host is large and the host fails, it can lead to problems such as exceeding the full ratio, which causes Ceph to halt operations as a safety precaution that prevents data loss.
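Both checks described above can be approximated with back-of-the-envelope arithmetic. The sketch below compares a host's combined disk throughput against its network bandwidth, then estimates cluster utilization after one host is lost. It assumes roughly 100MB/s per drive, an even redistribution of the stored data onto the surviving hosts, and a full-ratio threshold of 0.95. All three values are illustrative assumptions, so check the threshold your cluster is actually configured with.

```python
FULL_RATIO = 0.95  # assumed full-ratio threshold; verify against your cluster


def aggregate_disk_gbps(num_osds, per_disk_mb_s=100.0):
    """Combined throughput of a host's OSD drives, converted from MB/s to Gbps."""
    return num_osds * per_disk_mb_s * 8 / 1000.0


def disks_exceed_network(num_osds, nic_gbps, per_disk_mb_s=100.0):
    """True if the drives together can move more data than the network can."""
    return aggregate_disk_gbps(num_osds, per_disk_mb_s) > nic_gbps


def utilization_after_host_loss(host_capacity_tb, used_tb, failed_host):
    """Fraction of the remaining raw capacity in use once one host is gone."""
    remaining = sum(cap for i, cap in enumerate(host_capacity_tb)
                    if i != failed_host)
    if remaining <= 0:
        raise ValueError("no capacity left after the failure")
    return used_tb / remaining


if __name__ == "__main__":
    print(disks_exceed_network(12, nic_gbps=1.0))    # 12 drives on a 1Gbps link
    after = utilization_after_host_loss([10, 10, 10, 10], used_tb=29, failed_host=0)
    print(f"utilization after failure: {after:.1%}, over full ratio: {after > FULL_RATIO}")
```

In the example, four 10 TB hosts holding 29 TB in total are about 73% full, but losing one host pushes utilization of the remaining capacity past the assumed 0.95 threshold.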

When you run multiple OSDs per host, you also need to ensure that the kernel is up to date. See OS Recommendations for notes on glibc and syncfs(2) to ensure that your hardware performs as expected when running multiple OSDs per host.

Networks

We recommend that each host has at least two 1Gbps network interface controllers (NICs). Since most commodity hard disk drives have a throughput of approximately 100MB/second, your NICs should be able to handle the traffic for the OSD disks on your host. We recommend a minimum of two NICs to account for a public (front-side) network and a cluster (back-side) network. A cluster network (preferably not connected to the internet) handles the additional load for data replication and helps stop denial of service attacks that prevent the cluster from achieving active + clean states for placement groups as OSDs replicate data across the cluster.

Consider starting with a 10Gbps network in your racks. Replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. In a petabyte-scale cluster, failure of an OSD disk should be an expectation, not an exception. System administrators will appreciate PGs recovering from a degraded state to an active + clean state as rapidly as possible, with price / performance tradeoffs taken into consideration.

Additionally, some deployment tools (e.g., Dell’s Crowbar) deploy with five different networks, but employ VLANs to make hardware and network cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs and Switches. The added hardware expense may be offset by the operational cost savings for network setup and maintenance. When using VLANs to handle VM traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for each network also need to be able to communicate with spine routers that have even faster throughput (e.g., 40Gbps to 100Gbps).
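The replication times quoted above can be sanity-checked with a simple transfer-time calculation. This sketch computes the ideal time at full line rate, assuming decimal units (1TB = 10^12 bytes) and an optional `efficiency` factor, which is a hypothetical parameter standing in for protocol overhead and other real-world losses.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move data_tb terabytes over a link of link_gbps.

    efficiency (0 < efficiency <= 1) is an assumed fraction of the nominal
    line rate that is actually achieved.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and 0 < efficiency <= 1")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    for tb in (1, 3):
        for gbps in (1, 10):
            print(f"{tb} TB over {gbps} Gbps: {transfer_hours(tb, gbps):.2f} h")
```

At an ideal line rate, 1TB over 1Gbps takes a little over two hours, so the figures in the text are best read as practical estimates that already include some overhead.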

Your server hardware should have a Baseboard Management Controller (BMC). Administration and deployment tools may also use BMCs extensively, so consider the cost/benefit tradeoff of an out-of-band network for administration. Hypervisor SSH access, VM image uploads, OS image installs, management sockets, etc. can impose significant loads on a network. Running three networks may seem like overkill, but each traffic path represents a potential capacity, throughput and/or performance bottleneck that you should carefully consider before deploying a large scale data cluster.

Failure Domains

A failure domain is any failure that prevents access to one or more OSDs. That could be a stopped daemon on a host, a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few failure domains, and the added costs of isolating every potential failure domain.

Minimum Hardware Recommendations

Ceph can run on inexpensive commodity hardware. Small production clusters and development clusters can run successfully with modest hardware.

Process    Criteria         Minimum Recommended
---------  ---------------  ------------------------------------------
ceph-osd   Processor        1x 64-bit AMD-64
                            1x 32-bit ARM dual-core or better
           RAM              ~1GB for 1TB of storage per daemon
           Volume Storage   1x storage drive per daemon
           Journal          1x SSD partition per daemon (optional)
           Network          2x 1 Gbps Ethernet NICs
ceph-mon   Processor        1x 64-bit AMD-64
                            1x 32-bit ARM dual-core or better
           RAM              1 GB per daemon
           Disk Space       10 GB per daemon
           Network          2x 1 Gbps Ethernet NICs
ceph-mds   Processor        1x 64-bit AMD-64 quad-core
                            1x 32-bit ARM quad-core
           RAM              1 GB minimum per daemon
           Disk Space       1 MB per daemon
           Network          2x 1 Gbps Ethernet NICs

Tip

If you are running an OSD with a single disk, create a partition for your volume storage that is separate from the partition containing the OS. Generally, we recommend separate disks for the OS and the volume storage.

Production Cluster Examples

Production clusters for petabyte scale data storage may also use commodity hardware, but should have considerably more memory, processing power and data storage to account for heavy traffic loads.

Dell Example

A recent (2012) Ceph cluster project is using two fairly robust hardware configurations for Ceph OSDs, and a lighter configuration for monitors.

Configuration  Criteria         Minimum Recommended
-------------  ---------------  ------------------------------------
Dell PE R510   Processor        2x 64-bit quad-core Xeon CPUs
               RAM              16 GB
               Volume Storage   8x 2TB drives. 1 OS, 7 Storage
               Client Network   2x 1 Gbps Ethernet NICs
               OSD Network      2x 1 Gbps Ethernet NICs
               Mgmt. Network    2x 1 Gbps Ethernet NICs
Dell PE R515   Processor        1x hex-core Opteron CPU
               RAM              16 GB
               Volume Storage   12x 3TB drives. Storage
               OS Storage       1x 500GB drive. Operating System.
               Client Network   2x 1 Gbps Ethernet NICs
               OSD Network      2x 1 Gbps Ethernet NICs
               Mgmt. Network    2x 1 Gbps Ethernet NICs