Architecture

Ceph uniquely delivers object, block, and file storage in one unified system. Ceph is highly reliable, easy to manage, and free. The power of Ceph can transform your company’s IT infrastructure and your ability to manage vast amounts of data. Ceph delivers extraordinary scalability: thousands of clients accessing petabytes to exabytes of data. A Ceph Node leverages commodity hardware and intelligent daemons, and a Ceph Storage Cluster accommodates large numbers of nodes, which communicate with each other to replicate and redistribute data dynamically.

[Figure 1: the Ceph stack]

The Ceph Storage Cluster

Ceph provides an infinitely scalable Ceph Storage Cluster based upon RADOS, which you can read about in RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.

A Ceph Storage Cluster consists of two types of daemons:

[Figure 2]

A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph monitors ensures high availability should a monitor daemon fail. Storage cluster clients retrieve a copy of the cluster map from the Ceph Monitor.

A Ceph OSD Daemon checks its own state and the state of other OSDs and reports back to monitors.

Storage cluster clients and each Ceph OSD Daemon use the CRUSH algorithm to efficiently compute information about data location, instead of having to depend on a central lookup table. Ceph’s high-level features include providing a native interface to the Ceph Storage Cluster via librados, and a number of service interfaces built on top of librados.

Storing Data

The Ceph Storage Cluster receives data from Ceph Clients, whether it comes through a Ceph Block Device, Ceph Object Storage, the Ceph File System, or a custom implementation you create using librados, and it stores the data as objects. Each object corresponds to a file in a filesystem, which is stored on an Object Storage Device. Ceph OSD Daemons handle the read/write operations on the storage disks.

[Figure 3]

Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no hierarchy of directories). An object has an identifier, binary data, and metadata consisting of a set of name/value pairs. The semantics are completely up to Ceph Clients. For example, CephFS uses metadata to store file attributes such as the file owner, created date, last modified date, and so forth.

[Figure 4]

Note

An object ID is unique across the entire cluster, not just the local filesystem.
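The object model is easy to exercise directly with the librados Python bindings. The sketch below is illustrative only; the configuration file path, pool name, object ID, and xattr values are assumptions, not part of the text above.

    import rados

    # Connect to the cluster (path to ceph.conf is an assumption).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')   # hypothetical pool name

    # An object is an identifier, binary data, and name/value metadata.
    ioctx.write_full('object-id-1234', b'binary payload')
    ioctx.set_xattr('object-id-1234', 'owner', b'alice')
    print(ioctx.get_xattr('object-id-1234', 'owner'))

    ioctx.close()
    cluster.shutdown()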

Scalability and High Availability

In traditional architectures, clients talk to a centralized component (e.g., a gateway, broker, API, facade, etc.), which acts as a single point of entry to a complex subsystem. This imposes a limit on both performance and scalability, while introducing a single point of failure (i.e., if the centralized component goes down, the whole system goes down, too).

Ceph eliminates the centralized gateway to enable clients to interact with Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster of monitors to ensure high availability. To eliminate centralization, Ceph uses an algorithm called CRUSH.

CRUSH Introduction

Ceph Clients and Ceph OSD Daemons both use the CRUSH algorithm to efficiently compute information about object location, instead of having to depend on a central lookup table. CRUSH provides a better data management mechanism compared to older approaches, and enables massive scale by cleanly distributing the work to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data replication to ensure resiliency, which is better suited to hyper-scale storage. The following sections provide additional details on how CRUSH works. For a detailed discussion of CRUSH, see CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data.

Cluster Map

Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the cluster topology, which includes five maps collectively referred to as the “Cluster Map”:

  • The Monitor Map: Contains the cluster fsid, and the position, name, address, and port of each monitor. It also indicates the current epoch, when the map was created, and the last time it changed. To view a monitor map, execute ceph mon dump.

  • The OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, and a list of OSDs and their status (e.g., up, in). To view an OSD map, execute ceph osd dump.

  • The PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios, and details on each placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean), and data usage statistics for each pool.

  • The CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (e.g., device, host, rack, row, room, etc.), and rules for traversing the hierarchy when storing data. To view a CRUSH map, execute ceph osd getcrushmap -o {filename}; then, decompile it by executing crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You can view the decompiled map in a text editor or with cat.

  • The MDS Map: Contains the current MDS map epoch, when the map was created, and the last time it changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, execute ceph fs dump.

Each map maintains an iterative history of its operating state changes. Ceph Monitors maintain a master copy of the cluster map including the cluster members, state, changes, and the overall health of the Ceph Storage Cluster.
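A client that already speaks librados can also request these maps from the monitors programmatically. A minimal sketch using the Python bindings is shown below; the ceph.conf path is an assumption, and the JSON field names are based on the typical ceph mon dump JSON output.

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Equivalent to `ceph mon dump --format json`, sent to the monitors.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({"prefix": "mon dump", "format": "json"}), b'')
    monmap = json.loads(outbuf)
    print(monmap.get("epoch"), [m.get("name") for m in monmap.get("mons", [])])

    cluster.shutdown()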

High Availability Monitors

Before Ceph Clients can read or write data, they must contact a Ceph Monitor to obtain the most recent copy of the cluster map. A Ceph Storage Cluster can operate with a single monitor; however, this introduces a single point of failure (i.e., if the monitor goes down, Ceph Clients cannot read or write data).

For added reliability and fault tolerance, Ceph supports a cluster of monitors. In a cluster of monitors, latency and other faults can cause one or more monitors to fall behind the current state of the cluster. For this reason, Ceph must have agreement among various monitor instances regarding the state of the cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.) and the Paxos algorithm to establish a consensus among the monitors about the current state of the cluster.

For details on configuring monitors, see the Monitor Config Reference.

High Availability Authentication

To identify users and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system to authenticate users and daemons.

Note

The cephx protocol does not address data encryption in transport (e.g., SSL/TLS) or encryption at rest.

Cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client’s secret key. The authentication protocol is such that both parties are able to prove to each other they have a copy of the key without actually revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is sure that the cluster has a copy of the secret key.

A key scalability feature of Ceph is to avoid a centralized interface to the Ceph object store, which means that Ceph clients must be able to interact with OSDs directly. To protect data, Ceph provides its cephx authentication system, which authenticates users operating Ceph clients. The cephx protocol behaves in a manner similar to Kerberos.

A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each monitor can authenticate users and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure similar to a Kerberos ticket that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the user’s permanent secret key, so that only the user can request services from the Ceph Monitor(s). The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph Monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. Like Kerberos, cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained surreptitiously. This form of authentication will prevent attackers with access to the communications medium from either creating bogus messages under another user’s identity or altering another user’s legitimate messages, as long as the user’s secret key is not divulged before it expires.

To use cephx, an administrator must set up users first. In the following diagram, the client.admin user invokes ceph auth get-or-create-key from the command line to generate a username and secret key. Ceph’s auth subsystem generates the username and key, stores a copy with the monitor(s) and transmits the user’s secret back to the client.admin user. This means that the client and the monitor share a secret key.

Note

The client.admin user must provide the user ID and secret key to the user in a secure manner.

[Figure 5]

To authenticate with the monitor, the client passes in the user name to the monitor, and the monitor generates a session key and encrypts it with the secret key associated to the user name. Then, the monitor transmits the encrypted ticket back to the client. The client then decrypts the payload with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket on behalf of the user signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.

[Figure 6]

The cephx protocol authenticates ongoing communications between the client machine and the Ceph servers. Each message sent between a client and server, subsequent to the initial authentication, is signed using a ticket that the monitors, OSDs and metadata servers can verify with their shared secret.

[Figure 7]

The protection offered by this authentication is between the Ceph client and the Ceph server hosts. The authentication is not extended beyond the Ceph client. If the user accesses the Ceph client from a remote host, Ceph authentication is not applied to the connection between the user’s host and the client host.

For configuration details, see Cephx Config Guide. For user management details, see User Management.

Smart Daemons Enable Hyperscale

In many clustered architectures, the primary purpose of cluster membership is so that a centralized interface knows which nodes it can access. Then the centralized interface provides services to the client through a double dispatch, which is a huge bottleneck at the petabyte-to-exabyte scale.

Ceph eliminates the bottleneck: Ceph’s OSD Daemons AND Ceph Clients are cluster aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients to interact directly with Ceph OSD Daemons.

The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph nodes to easily perform tasks that would bog down a centralized server. The ability to leverage this computing power leads to several major benefits:

  • OSDs Service Clients Directly: Since any network device has a limit to the number of concurrent connections it can support, a centralized system has a low physical limit at high scales. By enabling Ceph Clients to contact Ceph OSD Daemons directly, Ceph increases both performance and total system capacity simultaneously, while removing a single point of failure. Ceph Clients can maintain a session when they need to, and with a particular Ceph OSD Daemon instead of a centralized server.

  • OSD Membership and Status: Ceph OSD Daemons join a cluster and report on their status. At the lowest level, the Ceph OSD Daemon status is up or down, reflecting whether or not it is running and able to service Ceph Client requests. If a Ceph OSD Daemon is down and in the Ceph Storage Cluster, this status may indicate the failure of the Ceph OSD Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD Daemon cannot notify the Ceph Monitor that it is down. The OSDs periodically send messages to the Ceph Monitor (MPGStats pre-luminous, and a new MOSDBeacon in luminous). If the Ceph Monitor doesn’t see that message after a configurable period of time then it marks the OSD down. This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will determine if a neighboring OSD is down and report it to the Ceph Monitor(s). This assures that Ceph Monitors are lightweight processes. See Monitoring OSDs and Heartbeats for additional details.

  • Data Scrubbing: As part of maintaining data consistency and cleanliness, Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph OSD Daemons can compare object metadata in one placement group with its replicas in placement groups stored on other OSDs. Scrubbing (usually performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also perform deeper scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a drive that weren’t apparent in a light scrub. See Data Scrubbing for details on configuring scrubbing.

  • Replication: Like Ceph Clients, Ceph OSD Daemons use the CRUSH algorithm, but the Ceph OSD Daemon uses it to compute where replicas of objects should be stored (and for rebalancing). In a typical write scenario, a client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, then looks at the CRUSH map to identify the primary OSD for the placement group.

The client writes the object to the identified placement group in the primary OSD. Then, the primary OSD with its own copy of the CRUSH map identifies the secondary and tertiary OSDs for replication purposes, and replicates the object to the appropriate placement groups in the secondary and tertiary OSDs (as many OSDs as additional replicas), and responds to the client once it has confirmed the object was stored successfully.

[Figure 8]

With the ability to perform data replication, Ceph OSD Daemons relieve Ceph clients from that duty, while ensuring high data availability and data safety.

Dynamic Cluster Management

In the Scalability and High Availability section, we explained how Ceph uses CRUSH, cluster awareness and intelligent daemons to scale and maintain high availability. Key to Ceph’s design is the autonomous, self-healing, and intelligent Ceph OSD Daemon. Let’s take a deeper look at how CRUSH works to enable modern cloud storage infrastructures to place data, rebalance the cluster and recover from faults dynamically.

About Pools

The Ceph storage system supports the notion of ‘Pools’, which are logical partitions for storing objects.

Ceph Clients retrieve a Cluster Map from a Ceph Monitor, and write objects to pools. The pool’s size or number of replicas, the CRUSH rule and the number of placement groups determine how Ceph will place the data.

[Figure 9]

Pools set at least the following parameters:

  • Ownership/Access to Objects

  • The Number of Placement Groups, and

  • The CRUSH Rule to Use.

See Set Pool Values for details.
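Pool parameters such as the replica size, PG count, and CRUSH rule are normally managed with the ceph CLI, but clients can also create pools and write into them through librados. A minimal sketch with the Python bindings follows; the ceph.conf path and the reliance on pool defaults are assumptions.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create the pool if it does not exist, then open an I/O context on it.
    # Replica size, PG count, and CRUSH rule come from the cluster defaults
    # here; they can be changed afterwards with `ceph osd pool set`.
    if not cluster.pool_exists('liverpool'):
        cluster.create_pool('liverpool')

    ioctx = cluster.open_ioctx('liverpool')
    ioctx.write_full('john', b'some data')
    ioctx.close()
    cluster.shutdown()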

Mapping PGs to OSDs

Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. When a Ceph Client stores objects, CRUSH will map each object to a placement group.

Mapping objects to placement groups creates a layer of indirection between the Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph Client “knew” which Ceph OSD Daemon had which object, that would create a tight coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online. The following diagram depicts how CRUSH maps objects to placement groups, and placement groups to OSDs.

[Figure 10]

With a copy of the cluster map and the CRUSH algorithm, the client can compute exactly which OSD to use when reading or writing a particular object.

Calculating PG IDs

When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the Cluster Map. With the cluster map, the client knows about all of the monitors, OSDs, and metadata servers in the cluster. However, it doesn’t know anything about object locations.

Object locations get computed.

The only input required by the client is the object ID and the pool. It’s simple: Ceph stores data in named pools (e.g., “liverpool”). When a client wants to store a named object (e.g., “john,” “paul,” “george,” “ringo”, etc.) it calculates a placement group using the object name, a hash code, the number of PGs in the pool and the pool name. Ceph clients use the following steps to compute PG IDs.

  • The client inputs the pool name and the object ID (e.g., pool = “liverpool” and object-id = “john”).

  • Ceph takes the object ID and hashes it.

  • Ceph calculates the hash modulo the number of PGs (e.g., 58) to get a PG ID.

  • Ceph gets the pool ID given the pool name (e.g., “liverpool” = 4).

  • Ceph prepends the pool ID to the PG ID (e.g., 4.58).

Computing object locations is much faster than performing an object location query over a chatty session. The CRUSH algorithm allows a client to compute where objects should be stored, and enables the client to contact the primary OSD to store or retrieve the objects.
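The following Python sketch illustrates the shape of that computation. It is illustrative only: Ceph itself uses the rjenkins hash and a “stable modulo”, and then maps the PG to OSDs with CRUSH, so the numbers it produces will not match a real cluster.

    import zlib

    def pg_id(pool_id: int, pg_num: int, object_id: str) -> str:
        """Sketch of the PG ID calculation described above."""
        h = zlib.crc32(object_id.encode())   # stand-in for Ceph's object hash
        pg = h % pg_num                      # hash modulo the number of PGs
        return f"{pool_id}.{pg:x}"           # prepend the pool ID to the PG ID

    # Hypothetical values: pool "liverpool" has pool ID 4 and 128 PGs.
    print(pg_id(4, 128, "john"))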

Peering and Sets

In previous sections, we noted that Ceph OSD Daemons check each other’s heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD Daemons do is called ‘peering’, which is the process of bringing all of the OSDs that store a Placement Group (PG) into agreement about the state of all of the objects (and their metadata) in that PG. In fact, Ceph OSD Daemons Report Peering Failure to the Ceph Monitors. Peering issues usually resolve themselves; however, if the problem persists, you may need to refer to the Troubleshooting Peering Failure section.

Note

Agreeing on the state does not mean that the PGs have the latest contents.

The Ceph Storage Cluster was designed to store at least two copies of an object (i.e., size = 2), which is the minimum requirement for data safety. For high availability, a Ceph Storage Cluster should store more than two copies of an object (e.g., size = 3 and min size = 2) so that it can continue to run in a degraded state while maintaining data safety.

Referring back to the diagram in Smart Daemons Enable Hyperscale, we do not name the Ceph OSD Daemons specifically (e.g., osd.0, osd.1, etc.), but rather refer to them as Primary, Secondary, and so forth. By convention, the Primary is the first OSD in the Acting Set, and is responsible for coordinating the peering process for each placement group where it acts as the Primary, and is the ONLY OSD that will accept client-initiated writes to objects for a given placement group where it acts as the Primary.

We refer to the series of OSDs that are responsible for a placement group as an Acting Set. An Acting Set may refer to the Ceph OSD Daemons that are currently responsible for the placement group, or the Ceph OSD Daemons that were responsible for a particular placement group as of some epoch.

The Ceph OSD Daemons that are part of an Acting Set may not always be up. When an OSD in the Acting Set is up, it is part of the Up Set. The Up Set is an important distinction, because Ceph can remap PGs to other Ceph OSD Daemons when an OSD fails.

Note

In an Acting Set for a PG containing osd.25, osd.32 and osd.61, the first OSD, osd.25, is the Primary. If that OSD fails, the Secondary, osd.32, becomes the Primary, and osd.25 will be removed from the Up Set.

Rebalancing

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets updated with the new OSD. Referring back to Calculating PG IDs, this changes the cluster map. Consequently, it changes object placement, because it changes an input for the calculations. The following diagram depicts the rebalancing process (albeit rather crudely, since it is substantially less impactful with large clusters) where some, but not all of the PGs migrate from existing OSDs (OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is stable. Many of the placement groups remain in their original configuration, and each OSD gets some added capacity, so there are no load spikes on the new OSD after rebalancing is complete.

[Figure 11]

Data Consistency

As part of maintaining data consistency and cleanliness, Ceph OSDs can also scrub objects within placement groups. That is, Ceph OSDs can compare object metadata in one placement group with its replicas in placement groups stored in other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem errors. OSDs can also perform deeper scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a disk that weren’t apparent in a light scrub.

See Data Scrubbing for details on configuring scrubbing.

Erasure Coding

An erasure coded pool stores each object as K+M chunks. It is divided into K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored in an OSD in the acting set. The rank of the chunk is stored as an attribute of the object.

For instance, an erasure coded pool is created to use five OSDs (K+M = 5) and sustain the loss of two of them (M = 2).

Reading and Writing Encoded Chunks

When the object NYAN containing ABCDEFGHI is written to the pool, the erasure encoding function splits the content into three data chunks simply by dividing the content in three: the first contains ABC, the second DEF and the last GHI. The content will be padded if the content length is not a multiple of K. The function also creates two coding chunks: the fourth with YXY and the fifth with QGC. Each chunk is stored in an OSD in the acting set. The chunks are stored in objects that have the same name (NYAN) but reside on different OSDs. The order in which the chunks were created must be preserved and is stored as an attribute of the object (shard_t), in addition to its name. Chunk 1 contains ABC and is stored on OSD5 while chunk 4 contains YXY and is stored on OSD3.
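The K-way split itself is easy to picture. The sketch below only produces the data chunks; the M coding chunks are computed by the pool’s erasure-code plugin (e.g., jerasure), which is not reproduced here.

    def split_into_data_chunks(payload: bytes, k: int) -> list:
        """Split an object payload into K equally sized data chunks,
        padding the content when its length is not a multiple of K."""
        chunk_len = -(-len(payload) // k)              # ceiling division
        padded = payload.ljust(k * chunk_len, b'\0')
        return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

    print(split_into_data_chunks(b'ABCDEFGHI', 3))     # [b'ABC', b'DEF', b'GHI']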

[Figure 12]

When the object NYAN is read from the erasure coded pool, the decoding function reads three chunks: chunk 1 containing ABC, chunk 3 containing GHI and chunk 4 containing YXY. Then, it rebuilds the original content of the object, ABCDEFGHI. The decoding function is informed that chunks 2 and 5 are missing (they are called ‘erasures’). Chunk 5 could not be read because OSD4 is out. The decoding function can be called as soon as three chunks are read: OSD2 was the slowest and its chunk was not taken into account.

[Figure 13]

Interrupted Full Writes

In an erasure coded pool, the primary OSD in the up set receives all write operations. It is responsible for encoding the payload into K+M chunks and sends them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group logs.

In the following diagram, an erasure coded placement group has been created with K = 2 + M = 1 and is supported by three OSDs, two for K and one for M. The acting set of the placement group is made of OSD 1, OSD 2 and OSD 3. An object has been encoded and stored in the OSDs: the chunk D1v1 (i.e. Data chunk number 1, version 1) is on OSD 1, D2v1 on OSD 2 and C1v1 (i.e. Coding chunk number 1, version 1) on OSD 3. The placement group logs on each OSD are identical (i.e. 1,1 for epoch 1, version 1).

[Figure 14]

OSD 1 is the primary and receives a WRITE FULL from a client, which means the payload is to replace the object entirely instead of overwriting a portion of it. Version 2 (v2) of the object is created to override version 1 (v1). OSD 1 encodes the payload into three chunks: D1v2 (i.e. Data chunk number 1, version 2) will be on OSD 1, D2v2 on OSD 2 and C1v2 (i.e. Coding chunk number 1, version 2) on OSD 3. Each chunk is sent to the target OSD, including the primary OSD which is responsible for storing chunks in addition to handling write operations and maintaining an authoritative version of the placement group logs. When an OSD receives the message instructing it to write the chunk, it also creates a new entry in the placement group logs to reflect the change. For instance, as soon as OSD 3 stores C1v2, it adds the entry 1,2 (i.e. epoch 1, version 2) to its logs. Because the OSDs work asynchronously, some chunks may still be in flight (such as D2v2) while others are acknowledged and on disk (such as C1v1 and D1v1).

[Figure 15]

If all goes well, the chunks are acknowledged on each OSD in the acting set and the logs’ last_complete pointer can move from 1,1 to 1,2.

[Figure 16]

Finally, the files used to store the chunks of the previous version of the object can be removed: D1v1 on OSD 1, D2v1 on OSD 2 and C1v1 on OSD 3.

[Figure 17]

But accidents happen. If OSD 1 goes down while D2v2 is still in flight, the object’s version 2 is partially written: OSD 3 has one chunk but that is not enough to recover. It lost two chunks, D1v2 and D2v2, and the erasure coding parameters K = 2, M = 1 require that at least two chunks are available to rebuild the third. OSD 4 becomes the new primary and finds that the last_complete log entry (i.e., all objects before this entry were known to be available on all OSDs in the previous acting set) is 1,1 and that will be the head of the new authoritative log.

[Figure 18]

The log entry 1,2 found on OSD 3 is divergent from the new authoritative log provided by OSD 4: it is discarded and the file containing the C1v2 chunk is removed. The D1v1 chunk is rebuilt with the decode function of the erasure coding library during scrubbing and stored on the new primary OSD 4.

[Figure 19]

See Erasure Code Notes for additional details.

Cache Tiering

A cache tier provides Ceph Clients with better I/O performance for a subset of the data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier. The Ceph objecter handles where to place the objects and the tiering agent determines when to flush objects from the cache to the backing storage tier. So the cache tier and the backing storage tier are completely transparent to Ceph clients.

[Figure 20]

See Cache Tiering for additional details.

Extending Ceph

You can extend Ceph by creating shared object classes called ‘Ceph Classes’. Ceph loads .so classes stored in the osd class dir directory dynamically (i.e., $libdir/rados-classes by default). When you implement a class, you can create new object methods that have the ability to call the native methods in the Ceph Object Store, or other class methods you incorporate via libraries or create yourself.

On writes, Ceph Classes can call native or class methods, perform any series of operations on the inbound data and generate a resulting write transaction that Ceph will apply atomically.

On reads, Ceph Classes can call native or class methods, perform any series of operations on the outbound data and return the data to the client.

See src/objclass/objclass.h, src/fooclass.cc and src/barclass for exemplary implementations.
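From a client’s point of view, an object class method is invoked through librados. The sketch below uses the Python bindings’ Ioctx.execute() call; the pool name is a placeholder, and “hello”/“say_hello” refer to the sample class shipped in the Ceph source tree, so whether it is available depends on how your OSDs were built.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')   # hypothetical pool name

    # Invoke a method of an object class on the OSD that stores the object.
    print(ioctx.execute('an-object', 'hello', 'say_hello', b''))

    ioctx.close()
    cluster.shutdown()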

Summary

Ceph Storage Clusters are dynamic, like a living organism. Whereas many storage appliances do not fully utilize the CPU and RAM of a typical commodity server, Ceph does. From heartbeats, to peering, to rebalancing the cluster or recovering from faults, Ceph offloads work from clients (and from a centralized gateway, which doesn’t exist in the Ceph architecture) and uses the computing power of the OSDs to perform the work. When referring to Hardware Recommendations and the Network Config Reference, be cognizant of the foregoing concepts to understand how Ceph utilizes computing resources.

Ceph Protocol

Ceph Clients use the native protocol for interacting with the Ceph Storage Cluster. Ceph packages this functionality into the librados library so that you can create your own custom Ceph Clients. The following diagram depicts the basic architecture.

[Figure 21]

Native Protocol and librados

Modern applications need a simple object storage interface with asynchronous communication capability. The Ceph Storage Cluster provides a simple object storage interface with asynchronous communication capability. The interface provides direct, parallel access to objects throughout the cluster. A brief librados sketch follows the list below.

  • Pool Operations

  • Snapshots and Copy-on-write Cloning

  • Read/Write Objects
      - Create or Remove
      - Entire Object or Byte Range
      - Append or Truncate

  • Create/Set/Get/Remove XATTRs

  • Create/Set/Get/Remove Key/Value Pairs

  • Compound operations and dual-ack semantics

  • Object Classes
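The sketch below exercises a few of these operations through the librados Python bindings. It is illustrative only; the ceph.conf path, pool name, object names, and snapshot name are assumptions.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')       # hypothetical pool name

    # Read/write objects: entire object, append, truncate.
    ioctx.write_full('greeting', b'hello')
    ioctx.append('greeting', b' world')
    print(ioctx.read('greeting'))              # b'hello world'
    ioctx.trunc('greeting', 5)

    # XATTRs.
    ioctx.set_xattr('greeting', 'lang', b'en')
    print(ioctx.get_xattr('greeting', 'lang'))

    # Pool operations, e.g., pool snapshots.
    ioctx.create_snap('before-cleanup')
    ioctx.remove_object('greeting')
    ioctx.remove_snap('before-cleanup')

    ioctx.close()
    cluster.shutdown()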

Object Watch/Notify

A client can register a persistent interest with an object and keep a session to the primary OSD open. The client can send a notification message and a payload to all watchers and receive notification when the watchers receive the notification. This enables a client to use any object as a synchronization/communication channel.

[Figure 22]

Data Striping

Storage devices have throughput limitations, which impact performance and scalability. So storage systems often support striping (storing sequential pieces of information across multiple storage devices) to increase throughput and performance. The most common form of data striping comes from RAID. The RAID type most similar to Ceph’s striping is RAID 0, or a ‘striped volume’. Ceph’s striping offers the throughput of RAID 0 striping, the reliability of n-way RAID mirroring and faster recovery.

Ceph provides three types of clients: Ceph Block Device, Ceph File System, and Ceph Object Storage. A Ceph Client converts its data from the representation format it provides to its users (a block device image, RESTful objects, CephFS filesystem directories) into objects for storage in the Ceph Storage Cluster.

Tip

The objects Ceph stores in the Ceph Storage Cluster are not striped. Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their data over multiple Ceph Storage Cluster objects. Ceph Clients that write directly to the Ceph Storage Cluster via librados must perform the striping (and parallel I/O) for themselves to obtain these benefits.

The simplest Ceph striping format involves a stripe count of 1 object. Ceph Clients write stripe units to a Ceph Storage Cluster object until the object is at its maximum capacity, and then create another object for additional stripes of data. The simplest form of striping may be sufficient for small block device images, S3 or Swift objects and CephFS files. However, this simple form doesn’t take maximum advantage of Ceph’s ability to distribute data across placement groups, and consequently doesn’t improve performance very much. The following diagram depicts the simplest form of striping:

[Figure 23]

If you anticipate large image sizes, large S3 or Swift objects (e.g., video), or large CephFS directories, you may see considerable read/write performance improvements by striping client data over multiple objects within an object set. Significant write performance occurs when the client writes the stripe units to their corresponding objects in parallel. Since objects get mapped to different placement groups and further mapped to different OSDs, each write occurs in parallel at the maximum write speed. A write to a single disk would be limited by the head movement (e.g., 6ms per seek) and bandwidth of that one device (e.g., 100MB/s). By spreading that write over multiple objects (which map to different placement groups and OSDs) Ceph can reduce the number of seeks per drive and combine the throughput of multiple drives to achieve much faster write (or read) speeds.

Note

Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically.

In the following diagram, client data gets striped across an object set (object set 1 in the following diagram) consisting of 4 objects, where the first stripe unit is stripe unit 0 in object 0, and the fourth stripe unit is stripe unit 3 in object 3. After writing the fourth stripe, the client determines if the object set is full. If the object set is not full, the client begins writing a stripe to the first object again (object 0 in the following diagram). If the object set is full, the client creates a new object set (object set 2 in the following diagram), and begins writing to the first stripe (stripe unit 16) in the first object in the new object set (object 4 in the diagram below).

[Figure 24]

Three important variables determine how Ceph stripes data (a worked sketch of the resulting mapping follows the list):

  • Object Size: Objects in the Ceph Storage Cluster have a maximum configurable size (e.g., 2MB, 4MB, etc.). The object size should be large enough to accommodate many stripe units, and should be a multiple of the stripe unit.

  • Stripe Width: Stripes have a configurable unit size (e.g., 64kb). The Ceph Client divides the data it will write to objects into equally sized stripe units, except for the last stripe unit. A stripe width should be a fraction of the Object Size so that an object may contain many stripe units.

  • Stripe Count: The Ceph Client writes a sequence of stripe units over a series of objects determined by the stripe count. The series of objects is called an object set. After the Ceph Client writes to the last object in the object set, it returns to the first object in the object set.
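The sketch below works through the resulting mapping from a logical byte offset to an object and an offset inside it, given these three variables. It is a simplification: real Ceph clients also derive object names, handle partial stripe units, and so on.

    def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
        """Map a logical byte offset to (object number, offset within that object)."""
        su_index = offset // stripe_unit              # which stripe unit overall
        su_per_object = object_size // stripe_unit    # stripe units per object
        su_per_set = su_per_object * stripe_count     # stripe units per object set

        object_set = su_index // su_per_set
        index_in_set = su_index % su_per_set
        object_in_set = index_in_set % stripe_count   # round-robin across objects
        row_in_object = index_in_set // stripe_count

        object_no = object_set * stripe_count + object_in_set
        return object_no, row_in_object * stripe_unit + offset % stripe_unit

    # 4 objects per set and 4 stripe units per object, as in the diagram above:
    # stripe unit 16 lands at the start of object 4 (the first object of set 2).
    print(locate(offset=16 * 65536, stripe_unit=65536, stripe_count=4,
                 object_size=4 * 65536))              # -> (4, 0)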

Important

Test the performance of your striping configuration before putting your cluster into production. You CANNOT change these striping parameters after you stripe the data and write it to objects.

Once the Ceph Client has striped data to stripe units and mapped the stripe units to objects, Ceph’s CRUSH algorithm maps the objects to placement groups, and the placement groups to Ceph OSD Daemons before the objects are stored as files on a storage disk.

Note

Since a client writes to a single pool, all data striped into objects gets mapped to placement groups in the same pool. So they use the same CRUSH map and the same access controls.

Ceph Clients

Ceph Clients include a number of service interfaces. These include:

  • Block Devices: The Ceph Block Device (a.k.a., RBD) service provides resizable, thin-provisioned block devices with snapshotting and cloning. Ceph stripes a block device across the cluster for high performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor that uses librbd directly, avoiding the kernel object overhead for virtualized systems.

  • Object Storage: The Ceph Object Storage (a.k.a., RGW) service provides RESTful APIs with interfaces that are compatible with Amazon S3 and OpenStack Swift.

  • Filesystem: The Ceph File System (CephFS) service provides a POSIX compliant filesystem usable with mount or as a filesystem in user space (FUSE).

Ceph can run additional instances of OSDs, MDSs, and monitors for scalability and high availability. The following diagram depicts the high-level architecture.

[Figure 25]

Ceph Object Storage

The Ceph Object Storage daemon, radosgw, is a FastCGI service that provides a RESTful HTTP API to store objects and metadata. It layers on top of the Ceph Storage Cluster with its own data formats, and maintains its own user database, authentication, and access control. The RADOS Gateway uses a unified namespace, which means you can use either the OpenStack Swift-compatible API or the Amazon S3-compatible API. For example, you can write data using the S3-compatible API with one application and then read data using the Swift-compatible API with another application.
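Because RGW exposes the S3-compatible API, any S3 SDK can talk to it. A minimal sketch using boto3 is shown below; the endpoint URL and credentials are placeholders for an RGW user you have created.

    import boto3

    # Endpoint and credentials are placeholders, not real values.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:8080',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='my-bucket')
    s3.put_object(Bucket='my-bucket', Key='hello.txt', Body=b'written via S3')
    # The same object could then be read back through the Swift-compatible API.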

See Ceph Object Storage for details.

Ceph Block Device

A Ceph Block Device stripes a block device image over multiple objects in the Ceph Storage Cluster, where each object gets mapped to a placement group and distributed, and the placement groups are spread across separate ceph-osd daemons throughout the cluster.

Important

Striping allows RBD block devices to perform better than a single server could!

Thin-provisioned snapshottable Ceph Block Devices are an attractive option for virtualization and cloud computing. In virtual machine scenarios, people typically deploy a Ceph Block Device with the rbd network storage driver in QEMU/KVM, where the host machine uses librbd to provide a block device service to the guest. Many cloud computing stacks use libvirt to integrate with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and libvirt to support OpenStack and CloudStack among other solutions.

While we do not provide librbd support with other hypervisors at this time, you may also use Ceph Block Device kernel objects to provide a block device to a client. Other virtualization technologies such as Xen can access the Ceph Block Device kernel object(s). This is done with the command-line tool rbd.
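For completeness, the librbd path mentioned above can also be driven from Python. The sketch below creates and writes a small image; the ceph.conf path, pool name, and image name are assumptions.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')      # hypothetical pool name

    # Create a 1 GiB thin-provisioned image, then write/read through librbd.
    rbd.RBD().create(ioctx, 'myimage', 1024 ** 3)
    with rbd.Image(ioctx, 'myimage') as image:
        image.write(b'hello block device', 0)
        print(image.read(0, 18))

    ioctx.close()
    cluster.shutdown()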

Ceph File System

The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a service that is layered on top of the object-based Ceph Storage Cluster. CephFS files get mapped to objects that Ceph stores in the Ceph Storage Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as a Filesystem in User Space (FUSE).

[Figure 26]

The Ceph File System service includes the Ceph Metadata Server (MDS) deployed with the Ceph Storage Cluster. The purpose of the MDS is to store all the filesystem metadata (directories, file ownership, access modes, etc.) in high-availability Ceph Metadata Servers where the metadata resides in memory. The reason for the MDS (a daemon called ceph-mds) is that simple filesystem operations like listing a directory or changing a directory (ls, cd) would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from the data means that the Ceph File System can provide high performance services without taxing the Ceph Storage Cluster.

CephFS separates the metadata from the data, storing the metadata in the MDS, and storing the file data in one or more objects in the Ceph Storage Cluster. The Ceph filesystem aims for POSIX compatibility. ceph-mds can run as a single process, or it can be distributed out to multiple physical machines, either for high availability or for scalability.

  • High Availability: The extra ceph-mds instances can be standby, ready to take over the duties of any failed ceph-mds that was active. This is easy because all the data, including the journal, is stored on RADOS. The transition is triggered automatically by ceph-mon.

  • Scalability: Multiple ceph-mds instances can be active, and they will split the directory tree into subtrees (and shards of a single busy directory), effectively balancing the load amongst all active servers.

Combinations of standby and active MDS daemons are possible, for example running 3 active ceph-mds instances for scaling, and one standby instance for high availability.