CephFS Dynamic Metadata Management

Metadata operations usually account for more than 50 percent of all file system operations. Metadata also scales in a more complex fashion than storage (which in turn scales I/O throughput linearly), owing to the hierarchical and interdependent nature of file system metadata. In CephFS, the metadata workload is therefore decoupled from the data workload to avoid placing unnecessary strain on the RADOS cluster. Metadata is handled by a cluster of Metadata Servers (MDSs), and CephFS distributes it across MDSs via Dynamic Subtree Partitioning.

Dynamic Subtree Partitioning

In traditional subtree partitioning, subtrees of the file system hierarchy are assigned to individual MDSs. This metadata distribution strategy provides good hierarchical locality, linear growth of cache, horizontal scaling across MDSs, and a fairly good distribution of metadata across MDSs.

Figure: subtree partitioning (subtree-partitioning.svg)

The problem with traditional subtree partitioning is that workload growth by depth (on a single MDS) leads to a hotspot of activity. This results in a lack of vertical scaling and wastes non-busy resources and MDSs.

This led to the adoption of a more dynamic way of handling metadata: Dynamic Subtree Partitioning, where load-intensive portions of the directory hierarchy are migrated from busy MDSs to non-busy MDSs.

This strategy ensures that activity hotspots are relieved as they appear, and so leads to vertical scaling of the metadata workload in addition to horizontal scaling.
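The idea of moving a load-intensive subtree from a busy MDS to a non-busy one can be illustrated with a toy model. The following is only a sketch and not the balancer CephFS actually implements: `ToyMDS`, `rebalance_once`, and the numeric load values are hypothetical names and data invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyMDS:
    """Hypothetical stand-in for one MDS and the subtrees it is authoritative for."""
    name: str
    subtrees: Dict[str, float] = field(default_factory=dict)  # path -> load

    @property
    def load(self) -> float:
        return sum(self.subtrees.values())


def rebalance_once(cluster: List[ToyMDS]) -> Optional[Tuple[str, str, str]]:
    """Migrate one subtree from the busiest MDS to the least busy one.

    Returns (path, source, target), or None when no single move would
    narrow the load gap between the two.
    """
    busiest = max(cluster, key=lambda m: m.load)
    idlest = min(cluster, key=lambda m: m.load)
    gap = busiest.load - idlest.load
    # Moving a subtree of load l changes the gap to |gap - 2*l|,
    # which is smaller than gap exactly when 0 < l < gap.
    candidates = {p: l for p, l in busiest.subtrees.items() if 0 < l < gap}
    if not candidates:
        return None
    path = max(candidates, key=candidates.get)  # hottest subtree that still helps
    idlest.subtrees[path] = busiest.subtrees.pop(path)
    return path, busiest.name, idlest.name
```

Calling `rebalance_once` repeatedly until it returns `None` relieves hotspots one subtree at a time, which mirrors the "as they appear" behaviour described above in a much simplified form.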

Export Process During Subtree Migration

Once the exporter verifies that the subtree is permissible to export (non-degraded cluster, non-frozen subtree root), the subtree root directory is temporarily auth pinned, the subtree freeze is initiated, and the exporter is committed to the subtree migration, barring an intervening failure of the importer or itself.

The MExportDiscover message is exchanged to ensure that the inode for the base directory being exported is open on the destination node. The importer auth pins this inode to prevent it from being trimmed. This occurs before the exporter completes the freeze of the subtree, to ensure that the importer is able to replicate the necessary metadata. When the exporter receives the MDiscoverAck, it allows the freeze to proceed by removing its temporary auth pin.
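The interplay between the temporary auth pin and the pending freeze on the exporter can be sketched as follows. This is an illustrative model only: `ToySubtree`, `begin_export`, and `on_discover_ack` are hypothetical names, and treating an already-freezing subtree as not exportable is an assumption of the sketch rather than something stated above.

```python
from dataclasses import dataclass


@dataclass
class ToySubtree:
    """Hypothetical model of a subtree root directory on the exporter."""
    path: str
    frozen: bool = False
    freezing: bool = False
    auth_pins: int = 0


def begin_export(subtree: ToySubtree, cluster_degraded: bool) -> bool:
    """Start an export; return False if the subtree may not be exported."""
    # Assumption: a subtree whose freeze is already in progress is rejected too.
    if cluster_degraded or subtree.frozen or subtree.freezing:
        return False
    subtree.auth_pins += 1   # temporary auth pin on the subtree root
    subtree.freezing = True  # freeze is initiated but cannot complete yet
    return True


def on_discover_ack(subtree: ToySubtree) -> None:
    """Drop the temporary auth pin so the pending freeze can proceed."""
    subtree.auth_pins -= 1
    if subtree.freezing and subtree.auth_pins == 0:
        subtree.freezing = False
        subtree.frozen = True
```

In this model the freeze cannot complete while the exporter still holds its pin, which is what gives the importer time to replicate the metadata it needs.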

A warning stage occurs only if the base subtree directory is open by nodes other than the importer and exporter. If it is not, then this implies that no metadata within or nested beneath the subtree is replicated by any node other than the importer and exporter. If it is, then an MExportWarning message informs any bystanders that the authority for the region is temporarily ambiguous, and lists both the exporter and importer as authoritative MDS nodes. In particular, bystanders who are trimming items from their cache must send MCacheExpire messages to both the old and new authorities. This is necessary to ensure that the surviving authority reliably receives all expirations even if the importer or exporter fails. While the subtree is frozen (on both the importer and exporter), expirations will not be immediately processed; instead, they will be queued until the region is unfrozen and it can be determined that the node is or is not authoritative.
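The routing and queueing of cache expirations during the ambiguous period can be modelled roughly as below. The class and method names (`ToyBystander`, `ToyAuthority`, and so on) are hypothetical, and discarding queued expirations on a node that turns out not to be authoritative is an assumption of this sketch; the text above only says that they are held until authority can be determined.

```python
from typing import Dict, List, Tuple


class ToyBystander:
    """Hypothetical bystander MDS that tracks who is authoritative for a subtree."""

    def __init__(self) -> None:
        self.authority: Dict[str, Tuple[str, ...]] = {}

    def on_export_warning(self, path: str, exporter: str, importer: str) -> None:
        # Authority is ambiguous: both nodes are listed as authoritative.
        self.authority[path] = (exporter, importer)

    def expire_targets(self, path: str) -> Tuple[str, ...]:
        """Nodes that must each receive an MCacheExpire for this subtree."""
        return self.authority[path]


class ToyAuthority:
    """Hypothetical exporter or importer that queues expirations while frozen."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.frozen = True
        self.queued: List[Tuple[str, str]] = []
        self.processed: List[Tuple[str, str]] = []

    def on_cache_expire(self, sender: str, item: str) -> None:
        target = self.queued if self.frozen else self.processed
        target.append((sender, item))

    def unfreeze(self, is_authoritative: bool) -> None:
        self.frozen = False
        pending, self.queued = self.queued, []
        if is_authoritative:
            self.processed.extend(pending)
        # Assumption: a node that is not authoritative drops what it queued.
```

Because the bystander sends to every node returned by `expire_targets`, whichever authority survives a failure has received the expiration.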

The exporter then packages all metadata of the subtree into an MExport message, flags the objects as non-authoritative, and sends the message to the importer. Upon receipt, the importer inserts the data into its cache, marks all objects as authoritative, and logs a copy of all metadata in an EImportStart journal message. Once that has safely flushed, it replies with an MExportAck. The exporter can now log an EExport journal entry, which ultimately specifies that the export was a success. In the presence of failures, it is only the existence of the EExport entry that disambiguates authority during recovery.
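The recovery rule in the last sentence can be stated as a one-line decision. The sketch below is a simplification under stated assumptions: the journal is reduced to a list of entry type names for a single subtree, and `authority_after_recovery` is a hypothetical helper, not a Ceph function.

```python
from typing import Sequence


def authority_after_recovery(
    exporter_journal: Sequence[str], exporter: str, importer: str
) -> str:
    """Decide who is authoritative for a subtree after an interrupted migration.

    Only the presence of an EExport entry in the exporter's journal
    means the export succeeded; otherwise authority stays with the exporter.
    """
    return importer if "EExport" in exporter_journal else exporter
```

Note that an EImportStart entry in the importer's journal is deliberately not consulted here, since the text names the EExport entry as the only deciding record.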

Once logged, the exporter will send an MExportNotify to any bystanders, informing them that the authority is no longer ambiguous and cache expirations should be sent only to the new authority (the importer). Once these are acknowledged back to the exporter, implicitly flushing the bystander-to-exporter message streams of any stray expiration notices, the exporter unfreezes the subtree, cleans up its migration-related state, and sends a final MExportFinish to the importer. Upon receipt, the importer logs an EImportFinish(true) (noting locally that the export was indeed a success), unfreezes its subtree, processes any queued cache expirations, and cleans up its state.
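As a summary, the order of messages and journal entries in a successful export can be listed programmatically. This is only a restatement of the sequence described in this section: `export_trace` is a hypothetical helper, and the bystander acknowledgements of MExportNotify are omitted because the text does not name that message.

```python
from typing import List


def export_trace(has_bystanders: bool) -> List[str]:
    """Return messages and journal entries in the order the text describes."""
    trace = ["MExportDiscover", "MDiscoverAck"]
    if has_bystanders:
        # Warning stage: only when other nodes have the base directory open.
        trace.append("MExportWarning")
    trace += ["MExport", "EImportStart", "MExportAck", "EExport"]
    if has_bystanders:
        # Sent after EExport is logged; authority is no longer ambiguous.
        trace.append("MExportNotify")
    trace += ["MExportFinish", "EImportFinish(true)"]
    return trace
```

For example, `export_trace(False)` omits both bystander messages, while `export_trace(True)` places MExportWarning before MExport and MExportNotify after EExport.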