Peering

Concepts

  • Peering
  • the process of bringing all of the OSDs that store a Placement Group (PG) into agreement about the state of all of the objects (and their metadata) in that PG. Note that agreeing on the state does not mean that they all have the latest contents.

  • Acting set

  • the ordered list of OSDs who are (or were as of some epoch) responsible for a particular PG.

  • Up set

  • the ordered list of OSDs responsible for a particular PG for a particular epoch according to CRUSH. Normally this is the same as the acting set, except when the acting set has been explicitly overridden via PG temp in the OSDMap.

  • PG temp

  • a temporary placement group acting set used while backfilling the primary OSD. Let’s say acting is [0,1,2] and we are active+clean. Something happens and acting is now [3,1,2]. osd.3 is empty and can’t serve reads although it is the primary. osd.3 will see that and request a PG temp of [1,2,3] from the monitors using a MOSDPGTemp message so that osd.1 temporarily becomes the primary. It will select osd.3 as a backfill peer and continue to serve reads and writes while osd.3 is backfilled. When backfilling is complete, PG temp is discarded, the acting set changes back to [3,1,2], and osd.3 becomes the primary.

  • current interval or past interval

  • a sequence of OSD map epochs during which the acting set and up set for a particular PG do not change.

  • primary

  • the (by convention first) member of the acting set, who is responsible for coordinating peering and is the only OSD that will accept client-initiated writes to objects in a placement group.

  • replica

  • a non-primary OSD in the acting set for a placement group (and who has been recognized as such and activated by the primary).

  • stray

  • an OSD who is not a member of the current acting set, but has not yet been told that it can delete its copies of a particular placement group.

  • recovery

  • ensuring that copies of all of the objects in a PG are on all of the OSDs in the acting set. Once peering has been performed, the primary can start accepting write operations, and recovery can proceed in the background.

  • PG info

  • basic metadata about the PG’s creation epoch, the version for the most recent write to the PG, last epoch started, last epoch clean, and the beginning of the current interval. Any inter-OSD communication about PGs includes the PG info, such that any OSD that knows a PG exists (or once existed) also has a lower bound on last epoch clean or last epoch started. (A simplified sketch of the PG info, PG log, and missing set structures follows this list.)

  • PG log

  • a list of recent updates made to objects in a PG. Note that these logs can be truncated after all OSDs in the acting set have acknowledged up to a certain point.

  • missing set

  • Each OSD notes update log entries and, if they imply updates to the contents of an object, adds that object to a list of needed updates. This list is called the missing set for that OSD and PG.

  • Authoritative History

  • a complete and fully ordered set of operations that, if performed, would bring an OSD’s copy of a Placement Group up to date.

  • epoch

  • a (monotonically increasing) OSD map version number

  • last epoch started

  • the last epoch at which all nodes in the acting set for a particular placement group agreed on an authoritative history. At this point, peering is deemed to have been successful.

  • up_thru

  • before a primary can successfully complete the peering process, it must inform a monitor that it is alive through the current OSD map epoch by having the monitor set its up_thru in the osd map. This helps peering ignore previous acting sets for which peering never completed after certain sequences of failures, such as the second interval below:

    • acting set = [A,B]

    • acting set = [A]

    • acting set = [] very shortly after (e.g., simultaneous failure, but staggered detection)

    • acting set = [B] (B restarts, A does not)

  • last epoch clean

  • the last epoch at which all nodes in the acting set for a particular placement group were completely up to date (both PG logs and object contents). At this point, recovery is deemed to have been completed.
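
To make the glossary above a little more concrete, the sketch below shows roughly what the PG info, PG log, and missing set contain. It is illustrative C++ only, not the actual Ceph types (the real ones, e.g. pg_info_t and pg_log_entry_t in src/osd/osd_types.h, carry considerably more state):

  // Illustrative only: a drastically simplified sketch of the per-PG
  // metadata described above; not the actual Ceph types.
  #include <cstdint>
  #include <list>
  #include <map>
  #include <string>

  using epoch_t = std::uint32_t;       // OSD map version number

  struct Eversion {                    // (epoch, version) stamp on an update
    epoch_t epoch = 0;
    std::uint64_t version = 0;
  };

  struct PGInfo {                      // "PG info": basic PG metadata
    epoch_t created = 0;               // creation epoch
    Eversion last_update;              // version of the most recent write
    epoch_t last_epoch_started = 0;    // lower bound: last successful peering
    epoch_t last_epoch_clean = 0;      // lower bound: last completed recovery
    epoch_t same_interval_since = 0;   // beginning of the current interval
  };

  struct PGLogEntry {                  // one entry in the "PG log"
    enum class Op { Modify, Delete } op;
    std::string oid;                   // object affected by the update
    Eversion version;                  // version stamped on the update
  };

  struct PGLog {                       // recent updates; truncatable once all
    std::list<PGLogEntry> entries;     // acting-set OSDs have acknowledged them
  };

  // "missing set": objects whose updates appear in the log but whose new
  // contents this OSD does not yet have, keyed by object name.
  using MissingSet = std::map<std::string, Eversion>;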

Description of the Peering Process

The Golden Rule is that no write operation to any PG is acknowledged to a client until it has been persisted by all members of the acting set for that PG. This means that if we can communicate with at least one member of each acting set since the last successful peering, someone will have a record of every (acknowledged) operation since the last successful peering. It follows that it should be possible for the current primary to construct and disseminate a new authoritative history.
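
A minimal sketch of that invariant (illustrative only, not Ceph’s replication code): the primary tracks which acting-set members still owe an acknowledgment for each in-flight write, and the client is answered only when that set is empty.

  #include <set>
  #include <string>

  // One client write that the primary has submitted to every member of the
  // acting set but not yet acknowledged to the client.
  struct InFlightWrite {
    std::string oid;               // object being written
    std::set<int> waiting_on;      // acting-set OSDs that have not yet acked
  };

  // Called on the primary when one OSD (including itself) reports that the
  // update is durable. Returns true only when *every* member of the acting
  // set has persisted it, i.e. when the client may finally be acknowledged.
  bool mark_persisted(InFlightWrite& w, int osd_id) {
    w.waiting_on.erase(osd_id);
    return w.waiting_on.empty();
  }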

It is also important to appreciate the role of the OSD map (list of all known OSDs and their states, as well as some information about the placement groups) in the peering process:

  • When OSDs go up or down (or get added or removed), this has the potential to affect the acting sets of many placement groups.

  • Before a primary successfully completes the peering process, the OSD map must reflect that the OSD was alive and well as of the first epoch in the current interval.

  • Changes can only be made after successful peering.

Thus, a new primary can use the latest OSD map along with a recent history of past maps to generate a set of past intervals to determine which OSDs must be consulted before we can successfully peer. The set of past intervals is bounded by last epoch started, the most recent past interval for which we know peering completed. The process by which an OSD discovers a PG exists in the first place is by exchanging PG info messages, so the OSD always has some lower bound on last epoch started.
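
As an illustration of how that walk could look (a sketch only, under the assumed helper functions mapping_at and up_thru_at; the real bookkeeping in the Ceph source, e.g. PastIntervals, handles far more cases), a new primary scans the map history from last epoch started to the present, closing an interval whenever the acting set changes and noting whether peering could have completed in it:

  #include <cstdint>
  #include <functional>
  #include <vector>

  using epoch_t = std::uint32_t;

  struct Interval {
    epoch_t first, last;             // epoch range covered by the interval
    std::vector<int> acting;         // acting set during the interval
    bool maybe_went_rw;              // could peering have completed here?
  };

  // mapping_at(e):    acting set for this PG according to the OSD map at epoch e
  // up_thru_at(p, e): up_thru recorded for OSD p in the OSD map at epoch e
  std::vector<Interval> past_intervals(
      epoch_t last_epoch_started, epoch_t current_epoch,
      const std::function<std::vector<int>(epoch_t)>& mapping_at,
      const std::function<epoch_t(int, epoch_t)>& up_thru_at) {
    std::vector<Interval> out;
    Interval cur{last_epoch_started, last_epoch_started,
                 mapping_at(last_epoch_started), false};
    for (epoch_t e = last_epoch_started + 1; e < current_epoch; ++e) {
      std::vector<int> acting = mapping_at(e);
      if (acting == cur.acting) {
        cur.last = e;                // same acting set: the interval continues
        continue;
      }
      // The interval ends at e-1. Peering could only have completed in it if
      // it had a primary whose up_thru covered the interval's first epoch.
      cur.maybe_went_rw = !cur.acting.empty() &&
          up_thru_at(cur.acting.front(), cur.last) >= cur.first;
      out.push_back(cur);
      cur = Interval{e, e, acting, false};
    }
    return out;                      // the current interval itself is excluded
  }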

The high level process is for the current PG primary to:

  1. get a recent OSD map (to identify the members of all the interesting acting sets, and confirm that we are still the primary).

  2. generate a list of past intervals since last epoch started. Consider the subset of those for which up_thru was greater than the first interval epoch by the last interval epoch’s OSD map; that is, the subset for which peering could have completed before the acting set changed to another set of OSDs.

    Successful peering will require that we be able to contact at least one OSD from each past interval’s acting set.

  3. ask every node in that list for its PG info, which includes the most recent write made to the PG, and a value for last epoch started. If we learn about a last epoch started that is newer than our own, we can prune older past intervals and reduce the peer OSDs we need to contact.

  4. if anyone else has (in its PG log) operations that I do not have, instruct them to send me the missing log entries so that the primary’s PG log is up to date (includes the newest write).

  5. for each member of the current acting set:

    1. ask it for copies of all PG log entries since last epoch started so that I can verify that they agree with mine (or know what objects I will be telling it to delete).

      If the cluster failed before an operation was persisted by all members of the acting set, and the subsequent peering did not remember that operation, and a node that did remember that operation later rejoined, its logs would record a different (divergent) history than the authoritative history that was reconstructed in the peering after the failure.

      Since the divergent events were not recorded in other logs from that acting set, they were not acknowledged to the client, and there is no harm in discarding them (so that all OSDs agree on the authoritative history). But, we will have to instruct any OSD that stores data from a divergent update to delete the affected (and now deemed to be apocryphal) objects. (A simplified sketch of this log reconciliation follows this list.)

    2. ask it for its missing set (object updates recorded in its PG log, but for which it does not have the new data). This is the list of objects that must be fully replicated before we can accept writes.

  6. at this point, the primary’s PG log contains an authoritative history of the placement group, and the OSD now has sufficient information to bring any other OSD in the acting set up to date.

  7. if the primary’s up_thru value in the current OSD map is not greater than or equal to the first epoch in the current interval, send a request to the monitor to update it, and wait until we receive an updated OSD map that reflects the change.

  8. for each member of the current acting set:

    1. send them log updates to bring their PG logs into agreement with my own (authoritative history) … which may involve deciding to delete divergent objects.

    2. await acknowledgment that they have persisted the PG log entries.

  9. at this point all OSDs in the acting set agree on all of the metadata, and would (in any future peering) return identical accounts of all updates.

    1. start accepting client write operations (because we have unanimous agreement on the state of the objects into which those updates are being accepted). Note, however, that if a client tries to write to an object that has not yet been recovered, that object will be promoted to the front of the recovery queue, and the write will be applied after it is fully replicated to the current acting set.

    2. update the last epoch started value in our local PG info, and instruct other acting set OSDs to do the same.

    3. start pulling object data updates that other OSDs have, but I do not. We may need to query OSDs from additional past intervals prior to last epoch started (the last time peering completed) and following last epoch clean (the last epoch that recovery completed) in order to find copies of all objects.

    4. start pushing object data updates to other OSDs that do not yet have them.

      We push these updates from the primary (rather than having the replicas pull them) because this allows the primary to ensure that a replica has the current contents before sending it an update write. It also makes it possible for a single read (from the primary) to be used to write the data to multiple replicas. If each replica did its own pulls, the data might have to be read multiple times.

  10. once all replicas store all copies of all objects (that existed prior to the start of this epoch) we can update last epoch clean in the PG info, and we can dismiss all of the stray replicas, allowing them to delete their copies of objects for which they are no longer in the acting set.

    We could not dismiss the strays prior to this because it was possible that one of those strays might hold the sole surviving copy of an old object (all of whose copies disappeared before they could be replicated on members of the current acting set).
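
The log handling in steps 5 and 8 can be summarized in a small sketch (illustrative only; the apply_authoritative helper below is assumed for the example and is not part of Ceph): the primary’s merged log is the authoritative history, local entries that never made it into it are divergent and may be discarded, and any authoritative update whose object contents are not yet present locally lands in the missing set.

  #include <algorithm>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  struct Version { unsigned epoch = 0, v = 0; };
  inline bool operator==(const Version& a, const Version& b) {
    return a.epoch == b.epoch && a.v == b.v;
  }

  struct LogEntry { std::string oid; Version version; };

  // Replace `local` with the authoritative log. Returns the objects whose
  // divergent (never client-acknowledged) updates must be rolled back or
  // deleted, and records in `missing` each object whose authoritative update
  // this OSD has not yet applied to its local copy.
  std::set<std::string> apply_authoritative(
      std::vector<LogEntry>& local,
      const std::vector<LogEntry>& authoritative,
      const std::set<std::string>& locally_current,   // oids whose data is up to date
      std::map<std::string, Version>& missing) {
    std::set<std::string> divergent;
    for (const LogEntry& e : local) {
      bool in_auth = std::any_of(
          authoritative.begin(), authoritative.end(),
          [&](const LogEntry& a) { return a.oid == e.oid && a.version == e.version; });
      if (!in_auth)
        divergent.insert(e.oid);     // never acked to a client; safe to discard
    }
    for (const LogEntry& a : authoritative)
      if (!locally_current.count(a.oid))
        missing[a.oid] = a.version;  // object data still has to be recovered
    local = authoritative;           // all logs now tell the same history
    return divergent;
  }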

State Model

digraph G {
  size="7,7"
  compound=true;
  subgraph cluster0 {
    label = "PeeringMachine";
    color = "blue";
    Crashed;
    Initial[shape=Mdiamond];
    Reset;
    subgraph cluster1 {
      label = "Started";
      color = "blue";
      Start[shape=Mdiamond];
      subgraph cluster2 {
        label = "Primary";
        color = "blue";
        WaitActingChange;
        subgraph cluster3 {
          label = "Peering";
          color = "blue";
          GetInfo[shape=Mdiamond];
          GetLog;
          GetMissing;
          WaitUpThru;
          Down;
          Incomplete;
        }
        subgraph cluster4 {
          label = "Active";
          color = "blue";
          Clean;
          Recovered;
          Backfilling;
          WaitRemoteBackfillReserved;
          WaitLocalBackfillReserved;
          NotBackfilling;
          NotRecovering;
          Recovering;
          WaitRemoteRecoveryReserved;
          WaitLocalRecoveryReserved;
          Activating[shape=Mdiamond];
        }
      }
      subgraph cluster5 {
        label = "ReplicaActive";
        color = "blue";
        RepRecovering;
        RepWaitBackfillReserved;
        RepWaitRecoveryReserved;
        RepNotRecovering[shape=Mdiamond];
      }
      Stray;
      subgraph cluster6 {
        label = "ToDelete";
        color = "blue";
        WaitDeleteReserved[shape=Mdiamond];
        Deleting;
      }
    }
  }
  GetInfo -> WaitActingChange [label="NeedActingChange",ltail=cluster2,];
  RepRecovering -> RepNotRecovering [label="RemoteReservationCanceled",];
  RepNotRecovering -> RepNotRecovering [label="RemoteReservationCanceled",];
  RepWaitRecoveryReserved -> RepNotRecovering [label="RemoteReservationCanceled",];
  RepWaitBackfillReserved -> RepNotRecovering [label="RemoteReservationCanceled",];
  Clean -> WaitLocalRecoveryReserved [label="DoRecovery",];
  Recovered -> WaitLocalRecoveryReserved [label="DoRecovery",];
  NotRecovering -> WaitLocalRecoveryReserved [label="DoRecovery",];
  Activating -> WaitLocalRecoveryReserved [label="DoRecovery",];
  Recovered -> Clean [label="GoClean",];
  Start -> GetInfo [label="MakePrimary",lhead=cluster2,];
  Initial -> Crashed [label="boost::statechart::event_base",];
  Reset -> Crashed [label="boost::statechart::event_base",];
  Start -> Crashed [label="boost::statechart::event_base",ltail=cluster1,];
  GetLog -> GetMissing [label="GotLog",];
  Initial -> GetInfo [label="MNotifyRec",lhead=cluster2,];
  Down -> GetInfo [label="MNotifyRec",];
  Incomplete -> GetLog [label="MNotifyRec",];
  Initial -> Stray [label="MLogRec",];
  Stray -> RepNotRecovering [label="MLogRec",lhead=cluster5,];
  Recovering -> NotRecovering [label="DeferRecovery",];
  Activating -> Recovered [label="AllReplicasRecovered",];
  Recovering -> Recovered [label="AllReplicasRecovered",];
  Recovering -> NotRecovering [label="UnfoundRecovery",];
  RepNotRecovering -> RepWaitRecoveryReserved [label="RequestRecoveryPrio",];
  WaitRemoteRecoveryReserved -> Recovering [label="AllRemotesReserved",];
  Initial -> Reset [label="Initialize",];
  Backfilling -> NotBackfilling [label="RemoteReservationRevokedTooFull",];
  Backfilling -> NotBackfilling [label="UnfoundBackfill",];
  Deleting -> WaitDeleteReserved [label="DeleteInterrupted",];
  NotBackfilling -> WaitLocalBackfillReserved [label="RequestBackfill",];
  Activating -> WaitLocalBackfillReserved [label="RequestBackfill",];
  Recovering -> WaitLocalBackfillReserved [label="RequestBackfill",];
  Reset -> Start [label="ActMap",lhead=cluster1,];
  WaitDeleteReserved -> WaitDeleteReserved [label="ActMap",ltail=cluster6,lhead=cluster6,];
  GetMissing -> WaitUpThru [label="NeedUpThru",];
  RepWaitRecoveryReserved -> RepRecovering [label="RemoteRecoveryReserved",];
  WaitLocalRecoveryReserved -> WaitRemoteRecoveryReserved [label="LocalRecoveryReserved",];
  RepNotRecovering -> RepWaitBackfillReserved [label="RequestBackfillPrio",];
  WaitRemoteBackfillReserved -> Backfilling [label="AllBackfillsReserved",];
  Backfilling -> Recovered [label="Backfilled",];
  Backfilling -> NotBackfilling [label="DeferBackfill",];
  RepNotRecovering -> WaitDeleteReserved [label="DeleteStart",ltail=cluster5,lhead=cluster6,];
  Stray -> WaitDeleteReserved [label="DeleteStart",lhead=cluster6,];
  Initial -> Stray [label="MInfoRec",];
  Stray -> RepNotRecovering [label="MInfoRec",lhead=cluster5,];
  GetInfo -> Down [label="IsDown",];
  RepRecovering -> RepNotRecovering [label="RecoveryDone",];
  RepNotRecovering -> RepNotRecovering [label="RecoveryDone",];
  RepRecovering -> RepNotRecovering [label="RemoteReservationRejectedTooFull",];
  RepNotRecovering -> RepNotRecovering [label="RemoteReservationRejectedTooFull",];
  WaitRemoteBackfillReserved -> NotBackfilling [label="RemoteReservationRejectedTooFull",];
  RepWaitBackfillReserved -> RepNotRecovering [label="RemoteReservationRejectedTooFull",];
  GetLog -> Incomplete [label="IsIncomplete",];
  WaitLocalBackfillReserved -> WaitRemoteBackfillReserved [label="LocalBackfillReserved",];
  GetInfo -> Activating [label="Activate",ltail=cluster3,lhead=cluster4,];
  WaitLocalRecoveryReserved -> NotRecovering [label="RecoveryTooFull",];
  GetInfo -> GetLog [label="GotInfo",];
  Start -> Reset [label="AdvMap",ltail=cluster1,];
  GetInfo -> Reset [label="AdvMap",ltail=cluster3,];
  GetLog -> Reset [label="AdvMap",];
  WaitActingChange -> Reset [label="AdvMap",];
  Incomplete -> Reset [label="AdvMap",];
  RepWaitBackfillReserved -> RepRecovering [label="RemoteBackfillReserved",];
  Start -> Stray [label="MakeStray",];
  WaitDeleteReserved -> Deleting [label="DeleteReserved",];
  Backfilling -> WaitLocalBackfillReserved [label="RemoteReservationRevoked",];
  WaitRemoteBackfillReserved -> NotBackfilling [label="RemoteReservationRevoked",];
}