Configuring multiple active MDS daemons

Also known as: multi-mds, active-active MDS

Each CephFS file system is configured for a single active MDS daemon by default. To scale metadata performance for large scale systems, you may enable multiple active MDS daemons, which will share the metadata workload with one another.

When should I use multiple active MDS daemons?

You should configure multiple active MDS daemons when your metadata performance is bottlenecked on the single MDS that runs by default.

Adding more daemons may not increase performance on all workloads. Typically, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.

Increasing the MDS active cluster size

Each CephFS file system has a max_mds setting, which controls how many ranks will be created. The actual number of ranks in the file system will only be increased if a spare daemon is available to take on the new rank. For example, if there is only one MDS daemon running and max_mds is set to two, no second rank will be created. (Note that such a configuration is not Highly Available (HA) because no standby is available to take over for a failed rank. The cluster will complain via health warnings when configured this way.)

Set max_mds to the desired number of ranks. In the following examples the “fsmap” line of “ceph status” is shown to illustrate the expected result of commands.

  # fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby

  ceph fs set <fs_name> max_mds 2

  # fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
  # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby

The newly created rank (1) will pass through the ‘creating’ state and then enter the ‘active’ state.

Standby daemons

Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.

Consequently, the practical maximum of max_mds for highly available systems is at most one less than the total number of MDS servers in your system.

To remain available in the event of multiple server failures, increase the number of standby daemons in the system to match the number of server failures you wish to withstand.
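
For example, with five MDS daemons you might run three active ranks and keep two standbys in reserve. As a sketch (assuming a recent Ceph release where the standby_count_wanted file system setting is available), you can also ask the cluster to raise a health warning when the number of standbys drops below the level you want:

  # run three active ranks out of a pool of five MDS daemons
  ceph fs set <fs_name> max_mds 3
  # warn if fewer than two standby daemons are available
  ceph fs set <fs_name> standby_count_wanted 2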

Decreasing the number of ranks

Reducing the number of ranks is as simple as reducing max_mds:

  # fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
  ceph fs set <fs_name> max_mds 1
  # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
  # fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
  ...
  # fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby

The cluster will automatically stop extra ranks incrementally until max_mds is reached.

See CephFS Administrative commands for more details on which forms <role> can take.

Note: a stopped rank will first enter the stopping state for a period of time while it hands off its share of the metadata to the remaining active daemons. This phase can take from seconds to minutes. If the MDS appears to be stuck in the stopping state then that should be investigated as a possible bug.

If an MDS daemon crashes or is killed while in the up:stopping state, a standby will take over and the cluster monitors will again try to stop the daemon.

When a daemon finishes stopping, it will respawn itself and go back to being a standby.
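
One way to watch a rank drain, as a sketch: poll the per-rank state until the stopping rank disappears from the output (both commands below are standard Ceph CLI commands):

  # show per-rank state (active/stopping) and the standby list for one file system
  ceph fs status <fs_name>
  # or watch the fsmap line of the overall cluster status
  ceph status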

Manually pinning directory trees to a particular rank

In multiple active metadata server configurations, a balancer runs which works to spread metadata load evenly across the cluster. This usually works well enough for most users but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users’ metadata requests on the entire cluster.

The mechanism provided for this purpose is called an export pin, an extended attribute of directories. The name of this extended attribute is ceph.dir.pin. Users can set this attribute using standard commands:

  setfattr -n ceph.dir.pin -v 2 path/to/dir

The value of the extended attribute is the rank to assign the directory subtree to. A default value of -1 indicates the directory is not pinned.
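
To inspect or clear a pin, the same standard extended-attribute tools can be used. A minimal sketch, assuming a directory that was previously pinned:

  # read the export pin currently set on a directory
  getfattr -n ceph.dir.pin path/to/dir
  # unpin the directory again by restoring the default value
  setfattr -n ceph.dir.pin -v -1 path/to/dir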

A directory’s export pin is inherited from its closest parent with a set export pin. In this way, setting the export pin on a directory affects all of its children. However, the parent’s pin can be overridden by setting the child directory’s export pin. For example:

  mkdir -p a/b
  # "a" and "a/b" both start without an export pin set
  setfattr -n ceph.dir.pin -v 1 a/
  # a and b are now pinned to rank 1
  setfattr -n ceph.dir.pin -v 0 a/b
  # a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
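
Afterwards, reading the attribute back should confirm the pins that were set explicitly in the example above (getfattr reports the value stored on each directory itself):

  # a/ should report a pin of 1, a/b should report a pin of 0
  getfattr -n ceph.dir.pin a/
  getfattr -n ceph.dir.pin a/b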