CephFS Reclaim Interface

Introduction

NFS servers typically do not track ephemeral state on stable storage. If the NFS server is restarted, it comes back with no ephemeral state, and during a grace period the NFS clients are expected to send requests to reclaim the state they held.

In order to support this use-case, libcephfs has grown several functions that allow a client that has been stopped and restarted to destroy or reclaim state held by a previous incarnation of itself. This allows the client to reacquire state held by its previous incarnation, and to avoid the long wait for the old session to time out before releasing the state previously held.

As soon as an NFS server running over cephfs goes down, it’s racing against its MDS session timeout. If the Ceph session times out before the NFS grace period is started, then conflicting state could be acquired by another client. This mechanism also allows us to increase the timeout for these clients, to ensure that the server has a long window of time to be restarted.

Setting the UUID

In order to properly reset or reclaim against the old session, we need a way to identify the old session. This is done by setting a unique opaque value on the session using ceph_set_uuid(). The uuid value can be any string and is treated as opaque by the client.

Setting the uuid directly can only be done on a new session, prior to mounting. When reclaim is performed the current session will inherit the old session’s uuid.
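Because the next incarnation has to name the old session by its uuid, the string should be something the server can regenerate after a restart. A minimal sketch, assuming the uuid is derived from a service prefix and a numeric node id; make_session_uuid, the prefix, and the node id are hypothetical names and are not part of libcephfs:

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of libcephfs): format "<prefix>-<nodeid>"
 * into buf so that every incarnation of the same server produces the same
 * uuid string.
 *
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int make_session_uuid(char *buf, size_t len, const char *prefix,
                      unsigned nodeid)
{
    int n = snprintf(buf, len, "%s-%u", prefix, nodeid);

    /* snprintf reports the length it wanted; reject truncated output. */
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

The resulting string would be passed to ceph_set_uuid() on the new session before mounting, and to ceph_start_reclaim() by the next incarnation.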

Starting Reclaim

After calling ceph_create and ceph_init on the resulting struct ceph_mount_info, the client should then issue ceph_start_reclaim, passing in the uuid of the previous incarnation of the client with any flags.

CEPH_RECLAIM_RESET
This flag indicates that we do not intend to do any sort of reclaim against the old session indicated by the given uuid, and that it should just be discarded. Any state held by the previous client should be released immediately.

Finishing Reclaim

After the Ceph client has completed all of its reclaim operations, the client should issue ceph_finish_reclaim to indicate that the reclaim is now complete.

Setting Session Timeout (Optional)

When a client dies and is restarted, and we need to preserve its state, we are effectively racing against the session expiration clock. In this situation we generally want a longer timeout since we expect to eventually kill off the old session manually. The timeout is given in seconds to ceph_set_session_timeout.
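One way to choose the value is to start from how long the server is expected to take to come back. The sketch below is only an illustration of that arithmetic; pick_session_timeout and its parameters are hypothetical and are not part of libcephfs:

```c
#include <limits.h>

/*
 * Hypothetical helper (not part of libcephfs): choose a session timeout in
 * seconds that covers the expected restart time plus a safety margin, and
 * never drops below floor_secs.
 */
unsigned pick_session_timeout(unsigned restart_secs, unsigned margin_secs,
                              unsigned floor_secs)
{
    unsigned want = restart_secs + margin_secs;

    /* Unsigned addition wraps on overflow; saturate instead. */
    if (want < restart_secs)
        want = UINT_MAX;

    return want > floor_secs ? want : floor_secs;
}
```

The result would be handed to ceph_set_session_timeout() on the new session, in the same place where Example 1 passes a fixed 300 seconds.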

Example 1: Reset Old Session

This example just kills off the MDS session held by a previous instance of itself. An NFS server can start a grace period and then ask the MDS to tear down the old session. This allows clients to start reclaim immediately.

(Note: error handling omitted for clarity)

struct ceph_mount_info *cmount;
const char *uuid = "foobarbaz";
int rc;

/*
 * Set up a new cephfs session, but don't mount it yet. Passing NULL as
 * the second argument to ceph_create selects the default client id.
 */
rc = ceph_create(&cmount, NULL);
rc = ceph_init(cmount);

/*
 * Set the timeout to 5 minutes to lengthen the window of time for
 * the server to restart, should it crash.
 */
ceph_set_session_timeout(cmount, 300);

/*
 * Start reclaim vs. session with old uuid. Before calling this,
 * all NFS servers that could acquire conflicting state _must_ be
 * enforcing their grace period locally.
 */
rc = ceph_start_reclaim(cmount, uuid, CEPH_RECLAIM_RESET);

/* Declare reclaim complete */
ceph_finish_reclaim(cmount);

/* Set uuid held by new session (nodeid is supplied by the caller) */
ceph_set_uuid(cmount, nodeid);

/*
 * Now mount up the file system and do normal open/lock operations to
 * satisfy reclaim requests (rootpath is supplied by the caller).
 */
rc = ceph_mount(cmount, rootpath);
...
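Example 1 omits error handling. A minimal sketch of the check that could follow each call that returns int, assuming the usual libcephfs convention of a negative errno value on failure; call_failed is a hypothetical helper and is not part of libcephfs:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libcephfs): report a failed call.
 * Assumes rc follows the negative-errno convention, so -rc can be passed
 * to strerror().
 *
 * Returns 1 if the call failed, 0 otherwise.
 *
 * Intended use inside Example 1, e.g.:
 *     rc = ceph_start_reclaim(cmount, uuid, CEPH_RECLAIM_RESET);
 *     if (call_failed("ceph_start_reclaim", rc))
 *         goto out;
 */
int call_failed(const char *what, int rc)
{
    if (rc >= 0)
        return 0;

    fprintf(stderr, "%s failed: %s (%d)\n", what, strerror(-rc), rc);
    return 1;
}
```

If ceph_start_reclaim fails, the old session has not been dealt with, so the server should not go on to mount and serve reclaim requests as if it had.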