CephFS Distributed Metadata Cache

While the data for inodes in a Ceph file system is stored in RADOS and accessed by the clients directly, inode metadata and directory information are managed by the Ceph metadata server (MDS). The MDSs act as mediators for all metadata-related activity, storing the resulting information in a separate RADOS pool from the file data.

CephFS clients can request that the MDS fetch or change inode metadata on their behalf, but an MDS can also grant the client capabilities (aka caps) for each inode (see Capabilities in CephFS).

A capability grants the client the ability to cache and possibly manipulate some portion of the data or metadata associated with the inode. When another client needs access to the same information, the MDS will revoke the capability and the client will eventually return it, along with an updated version of the inode’s metadata (in the event that it made changes to it while it held the capability).

Clients can request capabilities and will generally get them, but when there is competing access or memory pressure on the MDS, they may be revoked. When a capability is revoked, the client is responsible for returning it as soon as it is able. Clients that fail to do so in a timely fashion may end up blacklisted and unable to communicate with the cluster.
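The grant-and-revoke cycle described above can be sketched with a toy model. This is an illustration only, not how the MDS is implemented: the `ToyMDS` class, the `shared`/`exclusive` cap kinds, and the client names are all hypothetical simplifications of the real capability bits, and revocation is modelled as instantaneous even though a real client returns its caps asynchronously.

```python
from dataclasses import dataclass, field

# Simplified stand-ins for real cap bits (assumption, not the CephFS encoding).
SHARED, EXCLUSIVE = "shared", "exclusive"


@dataclass
class ToyMDS:
    """Tracks which client holds which kind of cap on each inode."""
    caps: dict = field(default_factory=dict)     # inode number -> {client: cap kind}
    revoked: list = field(default_factory=list)  # log of (client, inode) revocations

    def request_cap(self, client, ino, kind):
        holders = self.caps.setdefault(ino, {})
        for other, held in list(holders.items()):
            # Two shared caps can coexist; any pairing with an exclusive cap conflicts.
            if other != client and EXCLUSIVE in (kind, held):
                self.revoked.append((other, ino))
                del holders[other]  # modelled as immediate; really asynchronous
        holders[client] = kind
        return kind


mds = ToyMDS()
mds.request_cap("client.a", 0x100, SHARED)
mds.request_cap("client.b", 0x100, SHARED)     # no conflict, both keep their caps
mds.request_cap("client.b", 0x100, EXCLUSIVE)  # conflicts with client.a's shared cap
print(mds.caps[0x100])  # {'client.b': 'exclusive'}
print(mds.revoked)      # [('client.a', 256)]
```

The model leaves out the timeout path: in the real system, a client that does not return a revoked cap in time risks being cut off from the cluster.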

Since the cache is distributed, the MDS must take great care to ensure that no client holds capabilities that may conflict with other clients’ capabilities, or with operations that it performs itself. This allows CephFS clients to rely on much greater cache coherence than a filesystem like NFS, where the client may cache data and metadata beyond the point where it has changed on the server.

Client Metadata Requests

When a client needs to query/change inode metadata or perform an operation on a directory, it has two options. It can make a request to the MDS directly, or serve the information out of its cache. With CephFS, the latter is only possible if the client has the necessary caps.

Clients can send simple requests to the MDS to query or request changes to certain metadata. The replies to these requests may also grant the client a certain set of caps for the inode, allowing it to perform subsequent requests without consulting the MDS.

Clients can also request caps directly from the MDS, which is necessary in order to read or write file data.
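As a rough sketch of the two options a client has, the toy client below answers a metadata query from its cache when it holds a suitable cap and otherwise makes a round trip to the MDS, caching whatever caps the reply grants. `ToyClient`, `fake_mds_lookup`, and the `"shared"` cap label are hypothetical names for illustration, not part of any CephFS client API.

```python
class ToyClient:
    """Serves metadata from cache only while it holds a cap covering the read."""

    def __init__(self, mds_lookup):
        self.mds_lookup = mds_lookup  # callable: ino -> (attrs, granted caps)
        self.cache = {}               # ino -> (attrs, caps)
        self.mds_round_trips = 0

    def stat(self, ino):
        cached = self.cache.get(ino)
        if cached and "shared" in cached[1]:
            return cached[0]  # answered locally, no MDS involved
        self.mds_round_trips += 1
        attrs, caps = self.mds_lookup(ino)
        if caps:
            # The reply granted caps, so later queries can be served from cache.
            self.cache[ino] = (attrs, caps)
        return attrs

    def handle_revoke(self, ino):
        # Once the cap is gone, the cached copy can no longer be trusted.
        self.cache.pop(ino, None)


def fake_mds_lookup(ino):
    # Hypothetical MDS reply: attributes plus a granted "shared" cap.
    return {"ino": ino, "mode": 0o644}, {"shared"}


client = ToyClient(fake_mds_lookup)
client.stat(1)
client.stat(1)
print(client.mds_round_trips)  # 1
client.handle_revoke(1)
client.stat(1)
print(client.mds_round_trips)  # 2
```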

Distributed Locks in an MDS Cluster

When an MDS wants to read or change information about an inode, it must gather the appropriate locks for it. The MDS cluster may have a series of different types of locks on the given inode and each MDS may have disjoint sets of locks.

If there are outstanding caps that would conflict with these locks, then they must be revoked before the lock can be acquired. Once the competing caps are returned to the MDS, then it can get the locks and do the operation.
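The ordering constraint here (revoke conflicting caps, wait for them to come back, then take the lock) can be illustrated with a small state machine. This is a sketch under simplifying assumptions: `ToyInodeLock` is a hypothetical class with a single write lock per inode, whereas the real MDS has several lock types per inode.

```python
class ToyInodeLock:
    """A single write lock that cannot be taken while conflicting caps are out."""

    def __init__(self, cap_holders=()):
        self.cap_holders = set(cap_holders)  # clients holding conflicting caps
        self.pending_revokes = set()         # revokes sent, caps not yet returned
        self.locked = False

    def try_write_lock(self):
        if self.cap_holders:
            # Ask every conflicting holder to give its cap back, then wait.
            self.pending_revokes |= self.cap_holders
            return False
        self.locked = True
        return True

    def cap_returned(self, client):
        self.cap_holders.discard(client)
        self.pending_revokes.discard(client)


lock = ToyInodeLock(cap_holders={"client.a", "client.b"})
first_attempt = lock.try_write_lock()   # caps outstanding, lock not acquired
lock.cap_returned("client.a")
lock.cap_returned("client.b")
second_attempt = lock.try_write_lock()  # all caps returned, lock acquired
print(first_attempt, second_attempt)    # False True
```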

On a filesystem served by multiple MDSs, the metadata cache is also distributed among the MDSs in the cluster. For every inode, at any given time, only one MDS in the cluster is considered authoritative. Any requests to change that inode must be handled by the authoritative MDS, though a non-authoritative MDS can forward requests to the authoritative one.
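The forwarding rule can be sketched as follows. The `ToyRank` class and the shared `cluster` dictionary are hypothetical constructs for illustration; they say nothing about how the MDS actually tracks authority or routes messages.

```python
class ToyRank:
    """One MDS in a toy cluster; only the auth rank applies changes to an inode."""

    def __init__(self, rank, cluster):
        self.rank = rank
        self.cluster = cluster  # shared state: {"auth": {ino: rank}, "ranks": {rank: ToyRank}}
        self.handled = []       # changes this rank applied itself

    def change_inode(self, ino, change):
        auth = self.cluster["auth"][ino]
        if auth != self.rank:
            # Not authoritative for this inode: forward to the rank that is.
            return self.cluster["ranks"][auth].change_inode(ino, change)
        self.handled.append((ino, change))
        return self.rank


cluster = {"auth": {0x200: 1}, "ranks": {}}
for r in (0, 1):
    cluster["ranks"][r] = ToyRank(r, cluster)

# The request arrives at rank 0, but rank 1 is authoritative for inode 0x200.
served_by = cluster["ranks"][0].change_inode(0x200, {"mode": 0o600})
print(served_by)                    # 1
print(cluster["ranks"][0].handled)  # []
```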

Non-auth MDSs can also obtain read locks that prevent the auth MDS from changing the data until the lock is dropped, so that they can serve inode info to the clients.

The auth MDS for an inode can change over time as well. The MDSs will actively balance responsibility for the inode cache amongst themselves, but this can be overridden by pinning certain subtrees to a single MDS.
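Pinning is configured by setting the `ceph.dir.pin` extended attribute on a directory to the desired MDS rank. A minimal sketch in Python, assuming a Linux client with CephFS mounted: the `pin_directory` helper and the `/mnt/cephfs/projects` path are hypothetical, and the example injects a recording stand-in for `os.setxattr` so that it runs without a mount.

```python
import os


def pin_directory(path, rank, setxattr=None):
    """Pin the subtree rooted at `path` to MDS rank `rank` (-1 removes the pin)."""
    if setxattr is None:
        setxattr = os.setxattr  # available on Linux only
    # The attribute value is the decimal rank encoded as bytes.
    setxattr(path, "ceph.dir.pin", str(rank).encode("ascii"))


# Dry run with a recording stand-in, so no CephFS mount is needed.
calls = []
pin_directory("/mnt/cephfs/projects", 1, setxattr=lambda *args: calls.append(args))
print(calls)  # [('/mnt/cephfs/projects', 'ceph.dir.pin', b'1')]
```

On a real mount, calling `pin_directory(path, rank)` without the `setxattr` argument would issue the actual extended-attribute call.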