BlueStore Internals

Small write strategies

  • U: Uncompressed write of a complete, new blob.

    • write to new blob

    • kv commit

  • P: Uncompressed partial write to an unused region of an existing blob.

    • write to unused chunk(s) of existing blob

    • kv commit

  • W: WAL overwrite: commit intent to overwrite, then overwrite asynchronously. Must be aligned to chunk_size = MAX(block_size, csum_block_size).

    • kv commit

    • wal overwrite (chunk-aligned) of existing blob

  • N: Uncompressed partial write to a new blob. Initially sparsely utilized. Future writes will either be P or W.

    • write into a new (sparse) blob

    • kv commit

  • R+W: Read the partial chunk, then do a WAL overwrite.

    • read (out to chunk boundaries)

    • kv commit

    • wal overwrite (chunk-aligned) of existing blob

  • C: Compress data, write to new blob.

    • compress and write to new blob

    • kv commit
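The chunk alignment required by W, and the "read out to chunk boundaries" step of R+W, can be sketched as follows. This is only an illustration: the function names and sample sizes are hypothetical and are not BlueStore's actual code. Only the formula chunk_size = MAX(block_size, csum_block_size) comes from the list above.

```python
# Illustrative sketch only; names are hypothetical, not BlueStore APIs.

def chunk_size(block_size: int, csum_block_size: int) -> int:
    """chunk_size = MAX(block_size, csum_block_size), as stated for W."""
    return max(block_size, csum_block_size)


def is_chunk_aligned(offset: int, length: int, chunk: int) -> bool:
    """True if both ends of [offset, offset + length) fall on chunk boundaries."""
    return offset % chunk == 0 and length % chunk == 0


def expand_to_chunks(offset: int, length: int, chunk: int) -> tuple[int, int]:
    """Widen a range out to chunk boundaries (the read step of R+W).

    Returns (aligned_offset, aligned_length).
    """
    start = (offset // chunk) * chunk
    end = -(-(offset + length) // chunk) * chunk  # round up to next boundary
    return start, end - start


if __name__ == "__main__":
    chunk = chunk_size(block_size=4096, csum_block_size=16384)
    # A 4 KB write at offset 4 KB is not aligned to a 16 KB chunk,
    # so under these assumptions it would need R+W rather than plain W.
    print(is_chunk_aligned(4096, 4096, chunk))
    print(expand_to_chunks(4096, 4096, chunk))
```

With a 4 KB block and a 16 KB checksum block, a 4 KB overwrite is smaller than one chunk, which matches the P | R+W entry for csum (16 KB) blobs in the mapping.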

Possible future modes

  • F: Fragment lextent space by writing a small piece of data into a piecemeal blob (that collects random, noncontiguous bits of data we need to write).

    • write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)

    • kv commit

  • X: WAL read/modify/write on a single block (like legacy bluestore). No checksum.

    • kv commit

    • wal read/modify/write
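Across the current and possible future modes, the main structural difference is whether the kv commit comes after the data write or before a deferred (WAL) step. The sketch below records the step lists exactly as given above and derives that distinction. The names STEPS and has_deferred_step are hypothetical and exist only for this illustration.

```python
# Step sequences copied from the lists above; the dict itself is illustrative.
STEPS = {
    "U": ["write to new blob", "kv commit"],
    "P": ["write to unused chunk(s) of existing blob", "kv commit"],
    "W": ["kv commit", "wal overwrite (chunk-aligned) of existing blob"],
    "N": ["write into a new (sparse) blob", "kv commit"],
    "R+W": [
        "read (out to chunk boundaries)",
        "kv commit",
        "wal overwrite (chunk-aligned) of existing blob",
    ],
    "C": ["compress and write to new blob", "kv commit"],
    "F": ["write to a piecemeal blob", "kv commit"],
    "X": ["kv commit", "wal read/modify/write"],
}


def has_deferred_step(mode: str) -> bool:
    """True if some step follows the kv commit, i.e. data is modified after it."""
    steps = STEPS[mode]
    return steps.index("kv commit") < len(steps) - 1


if __name__ == "__main__":
    for mode in STEPS:
        kind = "commit, then WAL step" if has_deferred_step(mode) else "write, then commit"
        print(f"{mode}: {kind}")
```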

Mapping

This very roughly maps the type of write onto what we do when we encounter a given blob. In practice it’s a bit more complicated since there might be several blobs to consider (e.g., we might be able to W into one or P into another), but it should communicate a rough idea of strategy.

                          raw     raw (cached)  csum (4 KB)  csum (16 KB)  comp (128 KB)
128+ KB (over)write       U       U             U            U             C
64 KB (over)write         U       U             U            U             U or C
4 KB overwrite            W       P | W         P | W        P | R+W       P | N (F?)
100 byte overwrite        R+W     P | W         P | R+W      P | R+W       P | N (F?)
100 byte append           R+W     P | W         P | R+W      P | R+W       P | N (F?)
4 KB clone overwrite      P | N   P | N         P | N        P | N         N (F?)
100 byte clone overwrite  P | N   P | N         P | N        P | N         N (F?)
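
The mapping can also be read as a lookup from (write type, blob type) to a list of candidate strategies. The sketch below encodes the cells as they appear in the mapping and splits each into its alternatives. The names COLUMNS, MAPPING, and candidates are hypothetical, and the tentative "(F?)" marker is dropped since F is only a possible future mode.

```python
import re

# Blob types (columns) and write types (rows), transcribed from the mapping.
COLUMNS = ["raw", "raw (cached)", "csum (4 KB)", "csum (16 KB)", "comp (128 KB)"]

MAPPING = {
    "128+ KB (over)write":      ["U", "U", "U", "U", "C"],
    "64 KB (over)write":        ["U", "U", "U", "U", "U or C"],
    "4 KB overwrite":           ["W", "P | W", "P | W", "P | R+W", "P | N (F?)"],
    "100 byte overwrite":       ["R+W", "P | W", "P | R+W", "P | R+W", "P | N (F?)"],
    "100 byte append":          ["R+W", "P | W", "P | R+W", "P | R+W", "P | N (F?)"],
    "4 KB clone overwrite":     ["P | N", "P | N", "P | N", "P | N", "N (F?)"],
    "100 byte clone overwrite": ["P | N", "P | N", "P | N", "P | N", "N (F?)"],
}


def candidates(write_type: str, blob_type: str) -> list[str]:
    """Return the candidate strategies for one cell, in the order listed."""
    cell = MAPPING[write_type][COLUMNS.index(blob_type)]
    cell = cell.replace("(F?)", "")  # F is only a possible future mode
    return [s.strip() for s in re.split(r"\||\bor\b", cell) if s.strip()]


if __name__ == "__main__":
    print(candidates("4 KB overwrite", "csum (16 KB)"))
    print(candidates("64 KB (over)write", "comp (128 KB)"))
```

As the text notes, this is only a rough guide: a real write may see several blobs and choose among them, which a single-cell lookup does not capture.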