Application best practices for distributed file systems

CephFS is POSIX compatible, and therefore should work with any existing applications that expect a POSIX file system. However, because it is a network file system (unlike e.g. XFS) and it is highly consistent (unlike e.g. NFS), there are some consequences that application authors may benefit from knowing about.

The following sections describe some areas where distributed file systems may have noticeably different performance behaviours compared with local file systems.

ls -l

When you run “ls -l”, the ls program first does a directory listing, and then calls stat on every file in the directory.

This is usually far in excess of what an application really needs, and it can be slow for large directories. If you don’t really need all this metadata for each file, then use a plain ls.
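The difference is easy to see in code. The sketch below (function names are illustrative) contrasts a names-only listing with the “ls -l” pattern of one stat per entry:

```python
import os

def list_names(path):
    """Plain listing: one readdir pass, no per-file stat."""
    with os.scandir(path) as it:
        return [entry.name for entry in it]

def list_with_sizes(path):
    """The 'ls -l' pattern: a stat for every entry.

    On a network file system each stat may be a round trip, so this
    can be far slower than list_names() on large directories.
    """
    with os.scandir(path) as it:
        return [(entry.name, entry.stat().st_size) for entry in it]
```

If your application only iterates over file names, the first form is all it needs.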

ls/stat on files being extended

If another client is currently extending files in the listed directory, then an ls -l may take an exceptionally long time to complete, as the lister must wait for the writer to flush data in order to do a valid read of every file’s size. So unless you really need to know the exact size of every file in the directory, just don’t do it!

This also applies to any application code that directly issues stat system calls on files being appended from another node.
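For example, a watcher that only needs to notice new files can compare name sets instead of stat-ing each entry. A minimal sketch (the function name and arguments are illustrative):

```python
import os

def new_entries(path, seen):
    """Return names in `path` that are not in `seen`, without any stat.

    Comparing name sets avoids forcing writers on other clients to
    flush their data just so this client can read exact file sizes.
    """
    current = set(os.listdir(path))
    return current - seen
```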

Very large directories

Do you really need that 10,000,000 file directory? While directory fragmentation enables CephFS to handle it, it is always going to be less efficient than splitting your files into more modest-sized directories.
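One common way to keep directories modest is to shard file names across subdirectories by a hash. The two-level layout below is a sketch of that idea, not anything CephFS itself requires:

```python
import hashlib
import os

def sharded_path(root, name, buckets=256):
    """Map a file name into one of `buckets` hash-named subdirectories.

    e.g. 10,000,000 files spread over 256 buckets gives directories
    of roughly 40,000 entries each, instead of one enormous directory.
    """
    digest = hashlib.sha1(name.encode()).hexdigest()
    bucket = "%02x" % (int(digest, 16) % buckets)
    return os.path.join(root, bucket, name)
```

The mapping is deterministic, so readers can compute the same path from the name alone.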

Even standard userspace tools can become quite slow when operating on very large directories. For example, the default behaviour of ls is to give an alphabetically ordered result, but readdir system calls do not return an ordered result (this is true in general, not just with CephFS). So when you ls a million-file directory, it loads a list of a million names into memory, sorts the list, then writes it out to the display.
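When order does not matter, iterating entries in raw readdir order keeps memory bounded instead of buffering and sorting every name. A sketch (names are illustrative):

```python
import os

def process_unsorted(path, handle):
    """Stream directory entries in readdir order, without sorting.

    os.scandir yields entries lazily, so this never holds the whole
    listing in memory, even for very large directories.
    """
    with os.scandir(path) as it:
        for entry in it:
            handle(entry.name)
```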

Hard links

Hard links have an intrinsic cost in terms of the internal housekeeping that a file system has to do to keep two references to the same data. In CephFS there is a particular performance cost, because with normal files the inode is embedded in the directory (i.e. there is no extra fetch of the inode after looking up the path); a hard-linked file loses this optimisation.

Working set size

The MDS acts as a cache for the metadata stored in RADOS. Metadata performance is very different for workloads whose metadata fits within that cache.

If your workload has more files than fit in your cache (configured using the mds_cache_memory_limit setting), then make sure you test it appropriately: don’t test your system with a small number of files and then expect equivalent performance when you move to a much larger number of files.
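The limit can be inspected and adjusted with the ceph CLI; the 8 GiB value below is purely an illustration, not a recommendation:

```shell
# Show the current MDS cache size limit (in bytes)
ceph config get mds mds_cache_memory_limit

# Raise it to 8 GiB so a larger working set of inodes stays cached
ceph config set mds mds_cache_memory_limit 8589934592
```

Remember that the MDS process's actual memory use will exceed this limit somewhat, so leave headroom on the host.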

Do you need a file system?

Remember that Ceph also includes an object storage interface. If your application needs to store huge flat collections of files where you just read and write whole files at once, then you might well be better off using the Object Gateway.