Advanced: Metadata repair tools

Warning

If you do not have expert knowledge of CephFS internals, you will need to seek assistance before using any of these tools.

The tools mentioned here can easily cause damage as well as fixing it.

It is essential to understand exactly what has gone wrong with your file system before attempting to repair it.

If you do not have access to professional support for your cluster, consult the ceph-users mailing list or the #ceph IRC channel.

Journal export

Before attempting dangerous operations, make a copy of the journal like so:

    cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted, in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
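
The backup step can be wrapped so that later, destructive commands never run without a copy of the journal. The following is a minimal sketch, not part of the Ceph tooling: the function name backup_journal and the JOURNAL_TOOL override variable are assumptions made for illustration.

```shell
# Hypothetical helper: export the journal and report clearly on failure.
# JOURNAL_TOOL is an assumed override so the function can be exercised
# without a cluster; it defaults to the real cephfs-journal-tool.
backup_journal() {
    dest="$1"
    if [ -z "$dest" ]; then
        echo "usage: backup_journal <output file>" >&2
        return 2
    fi
    if "${JOURNAL_TOOL:-cephfs-journal-tool}" journal export "$dest"; then
        echo "journal exported to $dest"
    else
        # The export can fail on a badly corrupted journal.
        echo "export failed: take a RADOS-level copy before continuing" >&2
        return 1
    fi
}
```

A recovery script could call backup_journal backup.bin and stop if it returns non-zero, so that no repair is attempted without a backup.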

Dentry recovery from journal

If a journal is damaged, or an MDS is unable to replay it for any other reason, attempt to recover as much file metadata as possible like so:

    cephfs-journal-tool event recover_dentries summary

By default, this command acts on MDS rank 0; pass --rank=<n> to operate on other ranks.

This command will write any inodes/dentries recoverable from the journal into the backing store, if these inodes/dentries are higher-versioned than the previous contents of the backing store. If any regions of the journal are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update the InoTables of each ‘in’ MDS rank, to indicate that any written inodes’ numbers are now in use. In simple cases, this will result in an entirely valid backing store state.

Warning

The resulting state of the backing store is not guaranteed to be self-consistent, and an online MDS scrub will be required afterwards. This command does not modify the journal contents, so you should truncate the journal separately after recovering what you can.

Journal truncation

If the journal is corrupt or MDSs cannot replay it for any reason, you can truncate it like so:

    cephfs-journal-tool [--rank=N] journal reset

Specify the MDS rank using the --rank option when the file system has or had multiple active MDS daemons.

Warning

Resetting the journal will lose metadata unless you have extracted it by other means such as recover_dentries. It is likely to leave some orphaned objects in the data pool. It may result in re-allocation of already-written inodes, such that permissions rules could be violated.
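
Because a reset discards whatever has not been extracted, it helps to fix the order of operations: export, then recover dentries, then reset, stopping at the first failure. The sketch below illustrates that ordering only. The function name reset_journal_safely and the RUN dry-run hook are assumptions, and the rank argument is passed to --rank verbatim.

```shell
# Hypothetical wrapper: back up, recover dentries, then reset the journal,
# stopping at the first step that fails. RUN is an assumed dry-run hook:
# with RUN=echo the commands are printed instead of executed.
reset_journal_safely() {
    rank="$1"     # value handed to --rank unchanged
    backup="$2"   # file that receives the exported journal
    ${RUN:-} cephfs-journal-tool --rank="$rank" journal export "$backup" || return 1
    ${RUN:-} cephfs-journal-tool --rank="$rank" event recover_dentries summary || return 1
    ${RUN:-} cephfs-journal-tool --rank="$rank" journal reset
}
```

Setting RUN=echo before calling reset_journal_safely 0 backup.bin prints the three commands for review without touching the cluster.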

MDS table wipes

After the journal has been reset, it may no longer be consistent with respect to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

    cephfs-table-tool all reset session

This command acts on the tables of all ‘in’ MDS ranks. Replace ‘all’ with an MDS rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you also need to reset the other tables then replace ‘session’ with ‘snap’ or ‘inode’.
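
Since a mistyped rank or table name is easy to miss, the arguments can be checked before the tool is invoked. The following is a sketch under stated assumptions: reset_mds_table and the RUN dry-run hook are hypothetical names, and only the two rank forms described above ("all" or a plain number) are accepted.

```shell
# Hypothetical helper: validate the arguments, then reset one MDS table.
# RUN is an assumed dry-run hook (RUN=echo prints the command instead).
reset_mds_table() {
    rank="$1"    # "all" or a numeric MDS rank
    table="$2"   # session, snap, or inode
    case "$rank" in
        all) ;;
        ''|*[!0-9]*) echo "invalid rank: $rank" >&2; return 2 ;;
    esac
    case "$table" in
        session|snap|inode) ;;
        *) echo "invalid table: $table" >&2; return 2 ;;
    esac
    ${RUN:-} cephfs-table-tool "$rank" reset "$table"
}
```

For example, with RUN=echo set, reset_mds_table all session prints the same command shown earlier, while reset_mds_table 0 sessions is rejected before anything runs.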

MDS map reset

Once the in-RADOS state of the file system (i.e. contents of the metadata pool) is somewhat recovered, it may be necessary to update the MDS map to reflect the contents of the metadata pool. Use the following command to reset the MDS map to a single MDS:

    ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored; as a result, this command can cause data loss.

One might wonder what the difference is between ‘fs reset’ and ‘fs remove; fs new’. The key distinction is that doing a remove/new will leave rank 0 in ‘creating’ state, such that it would overwrite any existing root inode on disk and orphan any existing files. In contrast, the ‘reset’ command will leave rank 0 in ‘active’ state such that the next MDS daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

Recovery from missing metadata objects

Depending on what objects are missing or corrupt, you may need to run various commands to regenerate default versions of the objects.

    # Session table
    cephfs-table-tool 0 reset session
    # SnapServer
    cephfs-table-tool 0 reset snap
    # InoTable
    cephfs-table-tool 0 reset inode
    # Journal
    cephfs-journal-tool --rank=0 journal reset
    # Root inodes ("/" and MDS directory)
    cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files and directories based on the contents of a data pool. This is a three-phase process. First, all objects are scanned to calculate size and mtime metadata for inodes. Second, the first object of every file is scanned to collect this metadata and inject it into the metadata pool. Third, inode linkages are checked and any errors found are fixed.

    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    cephfs-data-scan scan_links

The scan_extents and scan_inodes commands may take a very long time if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers (worker_m), and pass each worker a distinct number (worker_n) in the range 0 to worker_m - 1.

The example below shows how to run 4 workers simultaneously:

    # Worker 0
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

    # Worker 0
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    # Worker 1
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    # Worker 2
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    # Worker 3
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is important to ensure that all workers have completed the scan_extents phase before any workers enter the scan_inodes phase.
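
One way to enforce that ordering from a single shell is to start the workers of a phase in the background and wait for all of them before starting the next phase. The sketch below assumes the workers run on one host. The function names run_scan_phase and scan_pool_parallel and the RUN dry-run hook are hypothetical; only the cephfs-data-scan invocations come from the examples above.

```shell
# Hypothetical helper: run one scan phase with N background workers and
# block until every worker has exited. RUN is an assumed dry-run hook
# (RUN=echo prints the commands instead of executing them).
run_scan_phase() {
    phase="$1"; workers="$2"; pool="$3"
    pids=""
    n=0
    while [ "$n" -lt "$workers" ]; do
        ${RUN:-} cephfs-data-scan "$phase" --worker_n "$n" --worker_m "$workers" "$pool" &
        pids="$pids $!"
        n=$((n + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1   # remember any worker that failed
    done
    return "$failed"
}

# scan_inodes starts only after every scan_extents worker has finished;
# scan_links then runs once, as a single process.
scan_pool_parallel() {
    workers="$1"; pool="$2"
    run_scan_phase scan_extents "$workers" "$pool" || return 1
    run_scan_phase scan_inodes "$workers" "$pool" || return 1
    ${RUN:-} cephfs-data-scan scan_links
}
```

The wait loop acts as the barrier between phases. If the workers are spread across several hosts, the same barrier has to be provided by whatever launches them.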

After completing the metadata recovery, you may want to run the cleanup operation to delete ancillary data generated during recovery.

    cephfs-data-scan cleanup <data pool>

Using an alternate metadata pool for recovery

Warning

There has not been extensive testing of this procedure. It should be undertaken with great care.

If an existing file system is damaged and inoperative, it is possible to create a fresh metadata pool and attempt to reconstruct the file system metadata into this new pool, leaving the old metadata in place. This could be used to make a safer attempt at recovery since the existing metadata pool would not be overwritten.

Caution

During this process, multiple metadata pools will contain data referring to the same data pool. Extreme caution must be exercised to avoid changing the data pool contents while this is the case. Once recovery is complete, the damaged metadata pool should be deleted.

To begin this process, first create the fresh metadata pool and initialize it with empty file system data structures:

    ceph fs flag set enable_multiple true --yes-i-really-mean-it
    ceph osd pool create recovery replicated <crush-rule-name>
    ceph fs new recovery-fs recovery <data pool> --allow-dangerous-metadata-overlay
    cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
    ceph fs reset recovery-fs --yes-i-really-mean-it
    cephfs-table-tool recovery-fs:all reset session
    cephfs-table-tool recovery-fs:all reset snap
    cephfs-table-tool recovery-fs:all reset inode

Next, run the recovery toolset using the --alternate-pool argument to output results to the alternate pool:

    cephfs-data-scan scan_extents --alternate-pool recovery --filesystem <original file system name> <original data pool name>
    cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem <original file system name> --force-corrupt --force-init <original data pool name>
    cephfs-data-scan scan_links --filesystem recovery-fs

If the damaged file system contains dirty journal data, it may be recovered next with:

    cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
    cephfs-journal-tool --rank recovery-fs:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics. Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set to false (the default) to prevent the MDS from checking the statistics, then run a forward scrub to repair them. Ensure you have an MDS running and issue:

    ceph tell mds.a scrub start / recursive repair

Note

In Nautilus and later releases, the scrub command of the tell interface is preferred over scrub_path. Older releases support only the scrub_path admin socket (asok) command. Example:

    ceph daemon mds.a scrub_path / recursive repair
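
The choice between the two scrub forms can be expressed as a small helper. This is only a sketch: the function name scrub_command is hypothetical, the caller is assumed to supply the major release number, and the threshold of 14 relies on Nautilus being the 14.x release series. The helper prints the command rather than running it.

```shell
# Hypothetical helper: print the scrub command suited to a release.
# Assumes the caller passes the MDS name and the major version number;
# 14 corresponds to Nautilus.
scrub_command() {
    mds="$1"
    major="$2"
    if [ "$major" -ge 14 ]; then
        echo "ceph tell mds.$mds scrub start / recursive repair"
    else
        echo "ceph daemon mds.$mds scrub_path / recursive repair"
    fi
}
```

Printing the command first leaves the decision to run it with the operator, which suits a repair step that modifies metadata.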