SeaStore

This is a rough design doc for a new ObjectStore implementation designed to facilitate higher performance on solid state devices.

Name

SeaStore maximizes the opportunity for confusion (seastar? seashore?) and associated fun. Alternative suggestions welcome.

Goals

  • Target NVMe devices. Not primarily concerned with pmem or HDD.

  • make use of SPDK for user-space driven IO

  • Use Seastar futures programming model to facilitate run-to-completion and a sharded memory/processing model

  • Allow zero- (or minimal) data copying on read and write paths when combined with a seastar-based messenger using DPDK

Motivation and background

All flash devices are internally structured in terms of segments that can be written efficiently but must be erased in their entirety. The NVMe device generally has limited knowledge about what data in a segment is still “live” (hasn’t been logically discarded), making the inevitable garbage collection within the device inefficient. We can design an on-disk layout that is friendly to GC at lower layers and drive garbage collection at higher layers.

In principle a fine-grained discard could communicate our intent to the device, but in practice discard is poorly implemented in the device and intervening software layers.

Basics

The basic idea is that all data will be streamed out sequentially to large segments on the device. In the SSD hardware, segments are likely to be on the order of 100’s of MB to tens of GB.

SeaStore’s logical segments would ideally be perfectly aligned with the hardware segments. In practice, it may be challenging to determine geometry and to sufficiently hint to the device that LBAs being written should be aligned to the underlying hardware. In the worst case, we can structure our logical segments to correspond to e.g. 5x the physical segment size so that we have about 20% of our data misaligned (roughly one in five physical segments would then straddle a logical segment boundary).

When we reach some utilization threshold, we mix cleaning work in with the ongoing write workload in order to evacuate live data from previously written segments. Once they are completely free we can discard the entire segment so that it can be erased and reclaimed by the device.

The key is to mix a small bit of cleaning work with every write transaction to avoid spikes and variance in write latency.
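
As a rough illustration, a per-transaction cleaning quota could be computed along these lines (hypothetical names and thresholds, not the actual SeaStore interface):

    // Hypothetical sketch: amortize cleaning across write transactions so that
    // reclaim keeps pace with incoming writes once utilization passes a threshold.
    #include <algorithm>
    #include <cstdint>

    struct SpaceStats {
      uint64_t total_bytes;
      uint64_t used_bytes;        // live data
      uint64_t reclaimable_bytes; // dead data in sealed segments
    };

    // Returns how many bytes of live data to evacuate alongside a write of
    // `write_bytes`, scaling the cleaning effort with utilization so that
    // latency stays smooth instead of spiking when free segments run out.
    uint64_t cleaning_quota(const SpaceStats& s, uint64_t write_bytes,
                            double start_threshold = 0.8) {
      double util = double(s.used_bytes) / double(s.total_bytes);
      if (util < start_threshold || s.reclaimable_bytes == 0)
        return 0;  // plenty of free segments; no cleaning needed yet
      // Ramp cleaning from 0x to 2x of the incoming write size as utilization
      // climbs from the threshold toward 100%.
      double pressure = (util - start_threshold) / (1.0 - start_threshold);
      uint64_t quota = uint64_t(2.0 * pressure * double(write_bytes));
      return std::min(quota, s.reclaimable_bytes);
    }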

Data layout basics

One or more cores/shards will be reading and writing to the device at once. Each shard will have its own independent data it is operating on and stream to its own open segments. Devices that support streams can be hinted accordingly so that data from different shards is not mixed on the underlying media.

Global state

There will be a simple global table of segments and their usage/empty status. Each shard will occasionally claim new empty segments for writing as needed, or return cleaned segments to the global free list.

At a high level, all metadata will be structured as a b-tree. The root for the metadata btree will also be stored centrally (along with the segment allocation table).

This is hand-wavey, but it is probably sufficient to update the root pointer for the btree either as each segment is sealed or as a new segment is opened.
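
A minimal sketch of what this global state might look like, with illustrative names rather than actual SeaStore types:

    // Hypothetical sketch of the global state: a flat table of segment states
    // plus the metadata b-tree root, updated at segment seal/open boundaries.
    #include <cstdint>
    #include <vector>

    enum class segment_state_t : uint8_t {
      EMPTY,   // erased/discarded, ready to be claimed by a shard
      OPEN,    // currently being written by some shard
      CLOSED   // sealed and read-only until cleaned and discarded
    };

    struct segment_entry_t {
      segment_state_t state = segment_state_t::EMPTY;
      uint32_t owner_shard = 0;   // shard currently writing it, if OPEN
    };

    struct root_block_t {
      uint64_t btree_root_addr = 0;           // address of the metadata b-tree root
      std::vector<segment_entry_t> segments;  // segment allocation table
    };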

Writing segments

Each segment will be written sequentially as a sequence of transactions. Each transaction will be an on-disk expression of an ObjectStore::Transaction. It will consist of (a sketch of the record format follows the list):

  • new data blocks

  • some metadata describing changes to b-tree metadata blocks. This will be written compactly as a delta: which keys are removed and which keys/values are inserted into the b-tree block.
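
A hedged sketch of such an on-disk transaction record, with illustrative names rather than the actual SeaStore encoding:

    // Hypothetical sketch of the on-disk transaction record described above:
    // new data blocks plus compact deltas against b-tree metadata blocks.
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    struct data_block_t {
      uint64_t laddr;               // logical address the block belongs to
      std::vector<uint8_t> bytes;   // block payload
    };

    struct btree_delta_t {
      uint64_t block_addr;                    // b-tree block being modified
      std::vector<std::string> removed_keys;  // keys deleted from the block
      std::vector<std::pair<std::string, std::string>> inserted;  // key/values added
    };

    struct transaction_record_t {
      std::vector<data_block_t> data_blocks;  // new data written by this transaction
      std::vector<btree_delta_t> deltas;      // compact metadata changes
    };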

As each b-tree block is modified, we update the block in memory and put it on a ‘dirty’ list. However, again, only the (compact) delta is journaled to the segment.

As we approach the end of the segment, the goal is to undirty all of our dirty blocks in memory. Based on the number of dirty blocks and the remaining space, we include a proportional number of dirty blocks in each transaction write so that we undirty some of the b-tree blocks. Eventually, the last transaction written to the segment will include all of the remaining dirty b-tree blocks.
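
One possible way to express the proportional write-back, assuming we can estimate how many transactions still fit in the open segment (names are illustrative):

    // Hypothetical sketch of the proportional write-back policy: pick how many
    // dirty b-tree blocks to fold into the next transaction based on how much
    // of the open segment remains.
    #include <cstdint>

    // dirty_blocks: b-tree blocks modified since they were last written whole
    // txns_left:    rough estimate of transactions that still fit in the segment
    // Returns the number of dirty blocks to write back in this transaction so
    // that the dirty list drains to zero by the time the segment is sealed.
    uint64_t dirty_blocks_to_write(uint64_t dirty_blocks, uint64_t txns_left) {
      if (txns_left <= 1)
        return dirty_blocks;   // last transaction: flush everything remaining
      // Spread the remaining dirty blocks evenly over the remaining
      // transactions, rounding up so we never fall behind the schedule.
      return (dirty_blocks + txns_left - 1) / txns_left;
    }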

Segment inventory

At the end of each segment, an inventory will be written that includes any metadata needed to test whether blocks in the segment are still live. For data blocks, that means an object id (e.g., ino number) and offset to test whether the block is still referenced. For metadata blocks, it would be at least one metadata key that lands in any b-tree block that is modified (via a delta) in the segment, enough for us to do a forward lookup in the b-tree to check whether the b-tree block is still referenced. Once this is written, the segment is sealed and read-only.
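
A sketch of what the inventory’s back-references might contain, using illustrative names:

    // Hypothetical sketch of the per-segment inventory: enough back-references
    // to re-check, via forward lookup, whether each block in the segment is
    // still live at cleaning time.
    #include <cstdint>
    #include <string>
    #include <variant>
    #include <vector>

    struct data_backref_t {
      uint64_t ino;      // object the block was written for
      uint64_t offset;   // offset within that object
    };

    struct metadata_backref_t {
      std::string key;   // any key landing in the modified b-tree block
    };

    struct inventory_entry_t {
      uint64_t segment_offset;                               // where the block lives in the segment
      std::variant<data_backref_t, metadata_backref_t> ref;  // how to test liveness
    };

    using segment_inventory_t = std::vector<inventory_entry_t>;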

Crash recovery

On any crash, we simply “replay” the currently open segment in memory. For any b-tree delta encountered, we load the original block, modify in memory, and mark it dirty. Once we continue writing, the normal “write dirty blocks as we near the end of the segment” behavior will pick up where we left off.
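
A sketch of the replay loop, building on the transaction record sketch above; the helper functions here are assumptions rather than SeaStore APIs:

    // Hypothetical sketch of replay after a crash: walk the transactions in the
    // open segment and re-apply each b-tree delta to an in-memory copy of the
    // affected block, leaving it dirty.
    #include <cstdint>
    #include <vector>

    struct btree_block_t;                                          // in-memory b-tree block (assumed)
    btree_block_t* load_block(uint64_t addr);                      // read block from a sealed segment (assumed)
    void apply_delta(btree_block_t& blk, const btree_delta_t& d);  // re-apply removed/inserted keys (assumed)
    void mark_dirty(btree_block_t& blk);                           // put the block on the dirty list (assumed)

    void replay_open_segment(const std::vector<transaction_record_t>& txns) {
      for (const auto& txn : txns) {
        // Data blocks need no action: they are already durable in the segment.
        for (const auto& delta : txn.deltas) {
          btree_block_t* blk = load_block(delta.block_addr);
          apply_delta(*blk, delta);  // bring the in-memory copy up to date
          mark_dirty(*blk);          // it will be written whole before the segment seals
        }
      }
    }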

ObjectStore considerations

Splits, merges, and sharding

One of the current ObjectStore requirements is to be able to split a collection (PG) in O(1) time. Starting in mimic, we also need to be able to merge two collections into one (i.e., exactly the reverse of a split).

However, the PGs that we split into would hash to different shards of the OSD in the current sharding scheme. One can imagine replacing that sharding scheme with a temporary mapping directing the smaller child PG to the right shard since we generally then migrate that PG to another OSD anyway, but this wouldn’t help us in the merge case where the constituent pieces may start out on different shards and ultimately need to be handled in the same collection (and be operated on via single transactions).

This suggests that we likely need a way for data written via one shard to “switch ownership” and later be read and managed by a different shard.