Differences from POSIX

CephFS aims to adhere to POSIX semantics wherever possible. For example, in contrast to many other common network file systems like NFS, CephFS maintains strong cache coherency across clients. The goal is for processes communicating via the file system to behave the same when they are on different hosts as when they are on the same host.

However, there are a few places where CephFS diverges from strict POSIX semantics for various reasons:

  • If a client is writing to a file and fails, its writes are not necessarily atomic. That is, the client may call write(2) on a file opened with O_SYNC with an 8 MB buffer and then crash, and the write may be only partially applied. (Almost all file systems, even local file systems, have this behavior.)

  • In shared simultaneous writer situations, a write that crosses object boundaries is not necessarily atomic. This means that you could have writer A write “aa|aa” and writer B write “bb|bb” simultaneously (where | is the object boundary), and end up with “aa|bb” rather than the proper “aa|aa” or “bb|bb”.

  • Sparse files propagate incorrectly to the stat(2) st_blocks field. Because CephFS does not explicitly track which parts of a file are allocated/written, the st_blocks field is always populated by the file size divided by the block size. This will cause tools like du(1) to overestimate consumed space. (The recursive size field, maintained by CephFS, also includes file “holes” in its count.)

  • When a file is mapped into memory via mmap(2) on multiple hosts, writes are not coherently propagated to other clients’ caches. That is, if a page is cached on host A, and then updated on host B, host A’s page is not coherently invalidated. (Shared writable mmap appears to be quite rare; we have yet to hear any complaints about this behavior, and implementing cache coherency properly is complex.)

  • CephFS clients present a hidden .snap directory that is used to access, create, delete, and rename snapshots. Although the virtual directory is excluded from readdir(2), any process that tries to create a file or directory with the same name will get an error code. The name of this hidden directory can be changed at mount time with -o snapdirname=.somethingelse (Linux) or the config option client_snapdir (libcephfs, ceph-fuse).
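
The st_blocks behavior described above can be observed by comparing a file's apparent size with the allocation that stat(2) reports. The following is a minimal sketch, assuming Linux, where st_blocks is counted in 512-byte units; the helper names size_report and make_sparse_file are hypothetical and not part of any Ceph tooling. On a file system that tracks allocation, a sparse file reports far fewer allocated bytes than its apparent size; on CephFS, per the description above, the two figures would be expected to be roughly equal.

```python
import os
import tempfile


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for path.

    Assumes Linux, where st_blocks is expressed in 512-byte units.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(directory, size):
    """Create a file made of one large hole followed by a single byte."""
    path = os.path.join(directory, "sparse.bin")
    with open(path, "wb") as f:
        f.seek(size - 1)  # skip ahead without writing, leaving a hole
        f.write(b"\0")
    return path


if __name__ == "__main__":
    # Hypothetical usage: point the directory at a CephFS mount to compare.
    with tempfile.TemporaryDirectory() as d:
        apparent, allocated = size_report(make_sparse_file(d, 64 * 1024 * 1024))
        print(f"apparent={apparent} allocated={allocated}")
```

The allocated figure depends entirely on the underlying file system, so the sketch prints both numbers rather than asserting a relationship between them.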

Perspective

People talk a lot about “POSIX compliance,” but in reality most file system implementations do not strictly adhere to the spec, including local Linux file systems like ext4 and XFS. For example, for performance reasons, the atomicity requirements for reads are relaxed: processes reading from a file that is also being written may see torn results.

Similarly, NFS has extremely weak consistency semantics when multiple clients are interacting with the same files or directories, opting instead for “close-to-open”. In the world of network attached storage, where most environments use NFS, whether or not the server’s file system is “fully POSIX” may not be relevant, and whether client applications notice depends on whether data is being shared between clients or not. NFS may also “tear” the results of concurrent writers as client data may not even be flushed to the server until the file is closed (and more generally writes will be significantly more time-shifted than CephFS, leading to less predictable results).

However, all of these are very close to POSIX, and most of the time applications don’t notice too much. Many other storage systems (e.g., HDFS) claim to be “POSIX-like” but diverge significantly from the standard by dropping support for things like in-place file modifications, truncate, or directory renames.

Bottom line

CephFS relaxes more than local Linux kernel file systems (e.g., writes spanning object boundaries may be torn). It relaxes strictly less than NFS when it comes to multiclient consistency, and generally less than NFS when it comes to write atomicity.
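
To make the torn-write case concrete, here is a toy model of two writers whose writes span an object boundary. It is a sketch for illustration only and not CephFS code: the 2-byte OBJECT_SIZE, the in-memory dictionary of objects, and all function names are assumptions. Each write is split into per-object pieces, and each piece is applied atomically, but nothing orders the pieces of one write relative to the pieces of the other.

```python
OBJECT_SIZE = 2  # hypothetical and tiny; chosen only to keep the example readable


def split_by_object(offset, data, object_size=OBJECT_SIZE):
    """Split one write into (object_index, offset_in_object, chunk) pieces."""
    pieces = []
    while data:
        idx, off = divmod(offset, object_size)
        n = min(object_size - off, len(data))
        pieces.append((idx, off, data[:n]))
        offset += n
        data = data[n:]
    return pieces


def apply_piece(objects, piece, object_size=OBJECT_SIZE):
    """Apply a single per-object piece; this step is atomic in the model."""
    idx, off, chunk = piece
    buf = objects.setdefault(idx, bytearray(object_size))
    buf[off:off + len(chunk)] = chunk


def read_all(objects):
    """Concatenate the objects in index order to get the file contents."""
    return b"".join(bytes(objects[i]) for i in sorted(objects))


def demo_torn_write():
    """Interleave writers A and B so that each one 'wins' a different object."""
    a = split_by_object(0, b"aaaa")
    b = split_by_object(0, b"bbbb")
    objects = {}
    # Object 0 sees B then A; object 1 sees A then B.
    for piece in (b[0], a[0], a[1], b[1]):
        apply_piece(objects, piece)
    return read_all(objects)


if __name__ == "__main__":
    print(demo_torn_write())
```

In this model the final contents mix the two writes, with the first object holding A's data and the second holding B's, which corresponds to the “aa|bb” outcome described earlier.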

In other words, when it comes to POSIX,

  1. HDFS < NFS < CephFS < {XFS, ext4}

fsync() and error reporting

POSIX is somewhat vague about the state of an inode after fsync reports an error. In general, CephFS uses the standard error-reporting mechanisms in the client’s kernel, and therefore follows the same conventions as other file systems.

In modern Linux kernels (v4.17 or later), writeback errors are reported once to every file description that is open at the time of the error. In addition, unreported errors that occurred before the file description was opened will also be returned on fsync.
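
Because a writeback error is reported only once per file description, an application should treat the first fsync failure as final for that descriptor rather than retrying and trusting a later success. The following is a minimal sketch of that pattern using Python's os module; the function name write_and_sync and its return convention are assumptions made for illustration.

```python
import os


def write_and_sync(path, data):
    """Write data to path and fsync it.

    Returns None on success, or the OSError raised by write/fsync.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    except OSError as err:
        # The error has now been consumed for this file description. A second
        # fsync on the same descriptor may succeed without the data being
        # durable, so report the failure instead of retrying.
        return err
    finally:
        os.close(fd)
    return None
```

A caller that receives an error would typically rewrite the data from its own copy, or abort, rather than assume the earlier writes reached stable storage.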

See PostgreSQL’s summary of fsync() error reporting across operating systems and Matthew Wilcox’s presentation on Linux IO error handling for more information.