Configuring Directory fragmentation

In CephFS, directories are fragmented when they become very large or very busy. This splits up the metadata so that it can be shared between multiple MDS daemons, and between multiple objects in the metadata pool.

In normal operation, directory fragmentation is invisible to users and administrators, and all the configuration settings mentioned here should be left at their default values.

While directory fragmentation enables CephFS to handle very large numbers of entries in a single directory, application programmers should remain conservative about creating very large directories, as they still have a resource cost in situations such as a CephFS client listing the directory, where all the fragments must be loaded at once.

Tip

The root directory cannot be fragmented.

All directories are initially created as a single fragment. This fragment may be split to divide up the directory into more fragments, and these fragments may be merged to reduce the number of fragments in the directory.

Splitting and merging

When an MDS identifies a directory fragment to be split, it does not do the split immediately. Because splitting interrupts metadata IO, a short delay is used to allow short bursts of client IO to complete before the split begins. This delay is configured with mds_bal_fragment_interval, which defaults to 5 seconds.

When the split is done, the directory fragment is broken up into a power of two number of new fragments. The number of new fragments is given by two to the power mds_bal_split_bits, i.e. if mds_bal_split_bits is 2, then four new fragments will be created. The default setting is 3, i.e. splits create 8 new fragments.
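The arithmetic above can be checked with a minimal Python sketch. The function name fragments_per_split is hypothetical and exists only for illustration; it is not part of Ceph.

```python
def fragments_per_split(split_bits: int) -> int:
    """Return the number of new fragments one split creates: 2 ** split_bits.

    split_bits stands in for the mds_bal_split_bits setting.
    """
    if split_bits < 0:
        raise ValueError("split_bits must not be negative")
    return 1 << split_bits  # same as 2 ** split_bits


# With the default of 3, a single split yields 8 fragments.
for bits in (2, 3):
    print(f"mds_bal_split_bits={bits} -> {fragments_per_split(bits)} new fragments")
```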

The criteria for initiating a split or a merge are described in the following sections.

Size thresholds

A directory fragment is eligible for splitting when its size exceeds mds_bal_split_size (default 10000). Ordinarily this split is delayed by mds_bal_fragment_interval, but if the fragment size exceeds the split size by a factor of mds_bal_fragment_fast_factor, the split will happen immediately (holding up any client metadata IO on the directory).

mds_bal_fragment_size_max is the hard limit on the size of directory fragments. If it is reached, clients will receive ENOSPC errors if they try to create files in the fragment. On a properly configured system, this limit should never be reached on ordinary directories, as they will have split long before. By default, this is set to 10 times the split size, giving a dirfrag size limit of 100000. Increasing this limit may lead to oversized directory fragment objects in the metadata pool, which the OSDs may not be able to handle.

A directory fragment is eligible for merging when its size is less than mds_bal_merge_size. There is no merge equivalent of the "fast splitting" explained above: fast splitting exists to avoid creating oversized directory fragments, and there is no equivalent issue to avoid when merging. The default merge size is 50.
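The size rules in this section can be summarised as a small decision function. This is a simplified sketch, not the MDS implementation: the function name and return labels are hypothetical, and the fast_factor default of 1.5 is an illustrative assumption, since this section does not state the default of mds_bal_fragment_fast_factor.

```python
def size_action(entries: int,
                split_size: int = 10000,     # mds_bal_split_size
                merge_size: int = 50,        # mds_bal_merge_size
                fast_factor: float = 1.5,    # assumed value, for illustration only
                size_max: int = 100000) -> str:  # mds_bal_fragment_size_max
    """Classify a fragment by entry count according to the size thresholds."""
    if entries >= size_max:
        # Hard limit reached: new file creations in this fragment get ENOSPC.
        return "at_hard_limit"
    if entries > split_size * fast_factor:
        # Far over the split size: split immediately, without the delay.
        return "split_now"
    if entries > split_size:
        # Over the split size: split after mds_bal_fragment_interval.
        return "split_delayed"
    if entries < merge_size:
        return "merge"
    return "none"


for n in (10, 5000, 12000, 20000, 100000):
    print(n, size_action(n))
```

Checking the hard limit first is a modelling choice made here for clarity; a fragment that large would also be far past the immediate-split threshold.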

Activity thresholds

In addition to splitting fragments based on their size, the MDS may split directory fragments if their activity exceeds a threshold.

The MDS maintains separate time-decaying load counters for read and write operations on directory fragments. The decaying load counters have an exponential decay based on the mds_decay_halflife setting.
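To illustrate what an exponentially decaying counter with a half-life does, here is a minimal Python sketch. It is not Ceph's counter implementation; the class name, its methods, and the 5-second half-life in the usage lines are assumptions chosen for the example.

```python
class DecayingCounter:
    """Load counter whose value halves every `halflife` seconds."""

    def __init__(self, halflife: float):
        self.halflife = halflife  # stands in for mds_decay_halflife
        self.value = 0.0
        self.last = 0.0           # timestamp of the last decay step

    def _decay(self, now: float) -> None:
        elapsed = now - self.last
        if elapsed > 0:
            # After `halflife` seconds the value is multiplied by 0.5.
            self.value *= 0.5 ** (elapsed / self.halflife)
            self.last = now

    def hit(self, now: float, amount: float = 1.0) -> float:
        """Decay to `now`, then record `amount` new operations."""
        self._decay(now)
        self.value += amount
        return self.value

    def get(self, now: float) -> float:
        """Return the decayed value at time `now`."""
        self._decay(now)
        return self.value


counter = DecayingCounter(halflife=5.0)  # illustrative half-life
counter.hit(now=0.0, amount=100.0)
print(counter.get(now=5.0))   # 50.0 after one half-life
print(counter.get(now=10.0))  # 25.0 after two half-lives
```

Because old activity fades away, a short burst only crosses a threshold if it is intense enough, while sustained load accumulates.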

On writes, the write counter is incremented, and compared with mds_bal_split_wr, triggering a split if the threshold is exceeded. Write operations include metadata IO such as renames, unlinks and creations.

The mds_bal_split_rd threshold is applied based on the read operation load counter, which tracks readdir operations.

By default, the read threshold is 25000 and the write threshold is 10000, i.e. 2.5x as many reads as writes would be required to trigger a split.
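The threshold comparison can be sketched as follows. The function name and return values are hypothetical, and the inputs are assumed to be the already-decayed read and write load values for one fragment; the real MDS logic may differ in detail.

```python
from typing import Optional


def activity_split_reason(read_load: float,
                          write_load: float,
                          split_rd: float = 25000.0,   # mds_bal_split_rd
                          split_wr: float = 10000.0    # mds_bal_split_wr
                          ) -> Optional[str]:
    """Return which counter, if any, exceeds its split threshold."""
    if write_load > split_wr:
        return "write"
    if read_load > split_rd:
        return "read"
    return None


print(activity_split_reason(read_load=12000.0, write_load=500.0))   # None
print(activity_split_reason(read_load=26000.0, write_load=500.0))   # read
print(activity_split_reason(read_load=100.0, write_load=10500.0))   # write
```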

After fragments are split due to the activity thresholds, they are only merged based on the size threshold (mds_bal_merge_size), so a spike in activity may cause a directory to stay fragmented forever unless some entries are unlinked.