RBD Exclusive Locks

Exclusive locks are a mechanism designed to prevent multiple processes from accessing the same RADOS Block Device (RBD) in an uncoordinated fashion. Exclusive locks are heavily used in virtualization (where they prevent VMs from clobbering each other's writes), and also in RBD mirroring (where they are a prerequisite for journaling).

Exclusive locks are enabled on newly created images by default, unless overridden via the rbd_default_features configuration option or the --image-feature flag for rbd create.
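Image features are commonly expressed as a bitmask. The sketch below decodes such a mask and checks whether exclusive-lock is set. It assumes the numeric form of rbd_default_features and the feature bit values as listed in the Ceph RBD documentation; the helper names are hypothetical and 61 is only a sample mask, so check the actual value configured on your cluster.

```python
# Feature bit values as listed in the Ceph RBD documentation (assumed here).
RBD_FEATURES = {
    "layering": 1,
    "striping": 2,
    "exclusive-lock": 4,
    "object-map": 8,
    "fast-diff": 16,
    "deep-flatten": 32,
    "journaling": 64,
}


def feature_names(mask):
    """Return the names of all features whose bit is set in mask."""
    return [name for name, bit in RBD_FEATURES.items() if mask & bit]


def has_exclusive_lock(mask):
    """Return True if the exclusive-lock bit is set in mask."""
    return bool(mask & RBD_FEATURES["exclusive-lock"])


# 61 is a sample mask, not read from any cluster.
print(feature_names(61))
print(has_exclusive_lock(61))
```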

To ensure proper exclusive locking operations, any client using an RBD image whose exclusive-lock feature is enabled should use a CephX identity whose capabilities include profile rbd.

Exclusive locking is mostly transparent to the user.

  • Whenever any librbd client process or kernel RBD client starts using an RBD image on which exclusive locking has been enabled, it obtains an exclusive lock on the image before the first write.

  • Whenever any such client process gracefully terminates, it automatically relinquishes the lock.

  • This subsequently enables another process to acquire the lock and write to the image.

Note that it is perfectly possible for two or more concurrently running processes to merely open the image, and also to read from it. The client acquires the exclusive lock only when attempting to write to the image.
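The lifecycle described above can be illustrated with a small in-memory model. This is a toy sketch, not the librbd or krbd API: the class ToyImage and its methods are hypothetical, and the model simply refuses a second writer rather than modelling how real clients negotiate the lock with one another.

```python
class ToyImage:
    """Toy in-memory stand-in for an RBD image with exclusive-lock enabled."""

    def __init__(self):
        self.lock_owner = None   # client currently holding the lock, if any
        self.openers = set()     # clients that currently have the image open
        self.data = {}           # offset -> value

    def open(self, client):
        # Opening the image takes no lock.
        self.openers.add(client)

    def read(self, client, offset):
        # Reading takes no lock either.
        if client not in self.openers:
            raise ValueError(f"{client} has not opened the image")
        return self.data.get(offset)

    def write(self, client, offset, value):
        if client not in self.openers:
            raise ValueError(f"{client} has not opened the image")
        if self.lock_owner is None:
            # The lock is acquired only on the first write.
            self.lock_owner = client
        if self.lock_owner != client:
            raise PermissionError(f"lock held by {self.lock_owner}")
        self.data[offset] = value

    def close(self, client):
        # Graceful termination relinquishes the lock automatically.
        self.openers.discard(client)
        if self.lock_owner == client:
            self.lock_owner = None


image = ToyImage()
image.open("client-a")
image.open("client-b")
image.write("client-a", 0, "hello")   # client-a now holds the lock
print(image.read("client-b", 0))      # client-b can still read
image.close("client-a")               # lock released on graceful close
image.write("client-b", 0, "world")   # client-b can now acquire it
print(image.lock_owner)
```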

Blacklisting

Sometimes, a client process (or, in the case of a krbd client, a client node's kernel thread) that previously held an exclusive lock on an image does not terminate gracefully, but dies abruptly. This may be due to having received a KILL or ABRT signal, for example, or a hard reboot or power failure of the client node. In that case, the exclusive lock is never gracefully released. Thus, when a new process starts and attempts to use the device, it needs a way to break the previously held exclusive lock.

However, a process (or kernel thread) may also hang, or merely lose network connectivity to the Ceph cluster for some amount of time. In that case, simply breaking the lock would be potentially catastrophic: the hung process or connectivity issue may resolve itself, and the old process may then compete with one that has started in the interim, accessing RBD data in an uncoordinated and destructive manner.

Thus, in the event that a lock cannot be acquired in the standard graceful manner, the overtaking process not only breaks the lock, but also blacklists the previous lock holder. This is negotiated between the new client process and the Ceph Mon: upon receiving the blacklist request,

  • the Mon instructs the relevant OSDs to no longer serve requests from the old client process;

  • once the associated OSD map update is complete, the Mon grants the lock to the new client;

  • once the new client has acquired the lock, it can commence writing to the image.

Blacklisting is thus a form of storage-level resource fencing.
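To see why the ordering matters, the following toy model fences the old lock holder before handing the lock to the new client. It is a sketch under stated assumptions, not Ceph code: ToyCluster, break_lock and the client names are hypothetical, and the Mon and OSD roles are collapsed into a single object.

```python
class ToyCluster:
    """Toy model of lock breaking with blacklisting; not the Ceph API."""

    def __init__(self):
        self.lock_owner = None
        self.blacklist = set()   # clients whose requests are refused
        self.data = {}           # offset -> value

    def write(self, client, offset, value):
        # Blacklisted clients are rejected before anything else happens.
        if client in self.blacklist:
            raise PermissionError(f"{client} is blacklisted")
        if self.lock_owner is None:
            self.lock_owner = client
        if self.lock_owner != client:
            raise PermissionError(f"lock held by {self.lock_owner}")
        self.data[offset] = value

    def break_lock(self, new_client):
        old_client = self.lock_owner
        if old_client is not None and old_client != new_client:
            # First fence the previous holder ...
            self.blacklist.add(old_client)
        # ... and only then grant the lock to the new client.
        self.lock_owner = new_client


cluster = ToyCluster()
cluster.write("old-client", 0, "v1")   # old-client holds the lock, then hangs
cluster.break_lock("new-client")       # new-client takes over
cluster.write("new-client", 0, "v2")

try:
    cluster.write("old-client", 0, "stale")   # old-client wakes up again
except PermissionError as err:
    print("rejected:", err)
print(cluster.data[0])
```

Without the blacklist step, the resumed old client would only be stopped by the lock check, which a client that still believes it owns the lock would not consult; the fence is what makes the handover safe.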

In order for blacklisting to work, the client must have the osd blacklist capability. This capability is included in the profile rbd capability profile, which should generally be set on all Ceph client identities using RBD.
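As a rough illustration, the helper below assembles a ceph auth get-or-create command line that grants profile rbd on both the Mon and the OSDs. The function name, the client name "qemu" and the pool name "vms" are hypothetical; the snippet only builds and prints the command, it does not run it, and the exact capabilities appropriate for a deployment should be checked against the Ceph user management documentation.

```python
import shlex


def rbd_auth_command(client_name, pool):
    """Build the argv for creating a CephX identity with the rbd profile."""
    return [
        "ceph", "auth", "get-or-create", f"client.{client_name}",
        "mon", "profile rbd",
        "osd", f"profile rbd pool={pool}",
    ]


# "qemu" and "vms" are placeholder names for illustration only.
print(shlex.join(rbd_auth_command("qemu", "vms")))
```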