Ceph file system client eviction

When a file system client is unresponsive or otherwise misbehaving, it may be necessary to forcibly terminate its access to the file system. This process is called eviction.

Evicting a CephFS client prevents it from communicating further with MDS daemons and OSD daemons. If a client was doing buffered IO to the file system, any un-flushed data will be lost.

Clients may either be evicted automatically (if they fail to communicate promptly with the MDS), or manually (by the system administrator).

The client eviction process applies to clients of all kinds: FUSE mounts, kernel mounts, nfs-ganesha gateways, and any process using libcephfs.

Automatic client eviction

There are three situations in which a client may be evicted automatically.

  • On an active MDS daemon, if a client has not communicated with the MDS for over session_autoclose (a file system variable) seconds (300 seconds by default), then it will be evicted automatically.

  • On an active MDS daemon, if a client has not responded to cap revoke messages for over mds_cap_revoke_eviction_timeout (configuration option) seconds. This is disabled by default.

  • During MDS startup (including on failover), the MDS passes through a state called reconnect. During this state, it waits for all the clients to connect to the new MDS daemon. If any clients fail to do so within the time window (mds_reconnect_timeout, 45 seconds by default), then they will be evicted.

A warning message is sent to the cluster log if any of these situations arises.
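The timeouts above can be adjusted if the defaults do not suit your environment. As an illustrative sketch, assuming a file system named cephfs and a release recent enough to support the centralized config store (ceph config); the values shown are examples only:

  # session_autoclose is a per-file-system setting (seconds)
  ceph fs set cephfs session_autoclose 600

  # Cap revoke eviction is disabled by default; a non-zero timeout
  # (seconds) enables it for all MDS daemons
  ceph config set mds mds_cap_revoke_eviction_timeout 300

  # Reconnect window during MDS startup/failover (seconds)
  ceph config set mds mds_reconnect_timeout 60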

Manual client eviction

Sometimes, the administrator may want to evict a client manually. This could happen if a client has died and the administrator does not want to wait for its session to time out, or it could happen if a client is misbehaving and the administrator does not have access to the client node to unmount it.

It is useful to inspect the list of clients first:

  ceph tell mds.0 client ls

  [
      {
          "id": 4305,
          "num_leases": 0,
          "num_caps": 3,
          "state": "open",
          "replay_requests": 0,
          "completed_requests": 0,
          "reconnecting": false,
          "inst": "client.4305 172.21.9.34:0/422650892",
          "client_metadata": {
              "ceph_sha1": "ae81e49d369875ac8b569ff3e3c456a31b8f3af5",
              "ceph_version": "ceph version 12.0.0-1934-gae81e49 (ae81e49d369875ac8b569ff3e3c456a31b8f3af5)",
              "entity_id": "0",
              "hostname": "senta04",
              "mount_point": "/tmp/tmpcMpF1b/mnt.0",
              "pid": "29377",
              "root": "/"
          }
      }
  ]
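Because the output is JSON, it is easy to filter with standard tools. For example, an illustrative jq one-liner that looks up a session ID by hostname (using the senta04 entry from the listing above):

  ceph tell mds.0 client ls | jq '.[] | select(.client_metadata.hostname == "senta04") | .id'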

Once you have identified the client you want to evict, you can do that using its unique ID, or various other attributes to identify it:

  # These all work
  ceph tell mds.0 client evict id=4305
  ceph tell mds.0 client evict client_metadata.hostname=senta04
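Depending on your Ceph release, the same filter syntax may also be accepted by the session listing command, which makes it possible to preview which sessions a filter matches before evicting them:

  # Preview what a filter matches before evicting (filter support for
  # "client ls" may depend on your release)
  ceph tell mds.0 client ls client_metadata.hostname=senta04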

Advanced: Un-blacklisting a client

Ordinarily, a blacklisted client may not reconnect to the servers: it must be unmounted and then mounted anew.

However, in some situations it may be useful to permit a client that was evicted to attempt to reconnect.

Because CephFS uses the RADOS OSD blacklist to control client eviction, CephFS clients can be permitted to reconnect by removing them from the blacklist:

  $ ceph osd blacklist ls
  listed 1 entries
  127.0.0.1:0/3710147553 2018-03-19 11:32:24.716146
  $ ceph osd blacklist rm 127.0.0.1:0/3710147553
  un-blacklisting 127.0.0.1:0/3710147553

Doing this may put data integrity at risk if other clients have accessed files that the blacklisted client was doing buffered IO to. It is also not guaranteed to result in a fully functional client; the best way to get a fully healthy client back after an eviction is to unmount the client and do a fresh mount.

If you are trying to reconnect clients in this way, you may also find it useful to set client_reconnect_stale to true in the FUSE client, to prompt the client to try to reconnect.
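As a sketch of applying that setting, assuming the centralized config store is available (it can equally be placed in the [client] section of ceph.conf on the client host):

  # Prompt FUSE clients to attempt reconnection after eviction
  ceph config set client client_reconnect_stale true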

Advanced: Configuring blacklisting

If you are experiencing frequent client evictions, due to slow client hosts or an unreliable network, and you cannot fix the underlying issue, then you may want to ask the MDS to be less strict.

It is possible to respond to slow clients by simply dropping their MDS sessions, but permit them to re-open sessions and to continue talking to OSDs. To enable this mode, set mds_session_blacklist_on_timeout to false on your MDS nodes.

For the equivalent behaviour on manual evictions, set mds_session_blacklist_on_evict to false.
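For illustration, both options set via the centralized config store (on releases without ceph config, put the same settings in the [mds] section of ceph.conf on each MDS node):

  # Timed-out clients lose their session but are not blacklisted
  ceph config set mds mds_session_blacklist_on_timeout false

  # Manually evicted clients are likewise not blacklisted
  ceph config set mds mds_session_blacklist_on_evict false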

Note that if blacklisting is disabled, then evicting a client will only have an effect on the MDS you send the command to. On a system with multiple active MDS daemons, you would need to send an eviction command to each active daemon. When blacklisting is enabled (the default), sending an eviction command to just a single MDS is sufficient, because the blacklist propagates it to the others.
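A sketch of evicting the same client from every active daemon when blacklisting is disabled, assuming two active ranks addressed as mds.0 and mds.1 as in the examples above, and the session ID from the earlier listing:

  # Repeat the eviction once per active MDS
  for rank in 0 1; do
      ceph tell mds.$rank client evict id=4305
  done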

Background: Blacklisting and OSD epoch barrier

After a client is blacklisted, it is necessary to make sure that other clients and MDS daemons have the latest OSDMap (including the blacklist entry) before they try to access any data objects that the blacklisted client might have been accessing.

This is ensured using an internal “osdmap epoch barrier” mechanism.
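On recent releases, the epoch barrier an MDS is currently enforcing can be inspected through its admin socket (mds.a here is a placeholder daemon name):

  # The output includes osdmap_epoch and osdmap_epoch_barrier fields
  ceph daemon mds.a status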

The purpose of the barrier is to ensure that when we hand out any capabilities which might allow touching the same RADOS objects, the clients we hand out the capabilities to must have a sufficiently recent OSD map to not race with cancelled operations (from ENOSPC) or blacklisted clients (from evictions).

More specifically, the cases where an epoch barrier is set are:

  • Client eviction (where the client is blacklisted and other clients must wait for a post-blacklist epoch to touch the same objects).

  • OSD map full flag handling in the client (where the client may cancel some OSD ops from a pre-full epoch, so other clients must wait until the full epoch or later before touching the same objects).

  • MDS startup, because we don’t persist the barrier epoch, so must assume that the latest OSD map is always required after a restart.

Note that this is a global value for simplicity. We could maintain it on a per-inode basis, but we don’t, because:

  • It would be more complicated.

  • It would use an extra 4 bytes of memory for every inode.

  • It would not be much more efficient, as almost always everyone has the latest OSD map, and in most cases everyone will breeze through this barrier rather than waiting.

  • This barrier is done in very rare cases, so any benefit from per-inode granularity would only very rarely be seen.

The epoch barrier is transmitted along with all capability messages, and instructs the receiver of the message to avoid sending any more RADOS operations to OSDs until it has seen this OSD epoch. This mainly applies to clients (doing their data writes directly to files), but also applies to the MDS because things like file size probing and file deletion are done directly from the MDS.