RADOS client protocol

This is very incomplete, but one must start somewhere.

Basics

Requests are MOSDOp messages. Replies are MOSDOpReply messages.

An object request is targeted at an hobject_t, which includes a pool, hash value, object name, placement key (usually empty), and snapid.

The hash value is a 32-bit value, normally generated by hashing the object name. The hobject_t can be arbitrarily constructed, though, with any hash value and name. Note that in the MOSDOp these components are spread across several fields and not logically assembled in an actual hobject_t member (mainly for historical reasons).

A request can also target a PG. In this case, the ps value matches a specific PG, the object name is empty, and (hopefully) the ops in the request are PG ops.

Either way, the request ultimately targets a PG, either by using the explicit pgid or by folding the hash value onto the current number of PGs in the pool. The client sends the request to the primary for the associated PG.
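
The folding step can be illustrated with a small sketch. It follows the "stable mod" approach, in which a hash is masked to the next power of two and falls back to a narrower mask when the result is not a valid PG. The function names below are illustrative and are not taken from this document; treat the exact rule as an assumption rather than a statement of what the client code does.

```python
def pg_mask(pg_num: int) -> int:
    """Smallest mask of the form 2**n - 1 that covers pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold x into [0, b); most values keep their slot as b grows."""
    if (x & bmask) < b:
        return x & bmask
    # The masked value is not a valid PG yet: drop the top mask bit.
    return x & (bmask >> 1)


def hash_to_ps(hash_value: int, pg_num: int) -> int:
    """Map a 32-bit object hash to a placement seed (hypothetical helper)."""
    return stable_mod(hash_value & 0xFFFFFFFF, pg_num, pg_mask(pg_num))


if __name__ == "__main__":
    # With 12 PGs the mask is 0xf; 0x1d masks to 13, which is >= 12,
    # so it falls back to the 3-bit mask and lands in PG 5.
    print(hash_to_ps(0x1D, 12))
    print(hash_to_ps(0x2B, 12))
```

With this rule, growing a pool from 12 to 16 PGs moves only the objects whose masked hash is 12 through 15; everything else stays in the same PG.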

Each request is assigned a unique tid.

Resends

If there is a connection drop, the client will resend any outstanding requests.

Any time there is a PG mapping change such that the primary changes, the client is responsible for resending the request. Note that although there may be an interval change from the OSD’s perspective (triggering PG peering), if the primary doesn’t change then the client need not resend.

There are a few exceptions to this rule:

  • There is a last_force_op_resend field in the pg_pool_t in the OSDMap. If this changes, then the clients are forced to resend any outstanding requests. (This happens when tiering is adjusted, for example.)

  • Some requests are such that they are resent on any PG interval change, as defined by pg_interval_t’s is_new_interval() (the same criteria used by peering in the OSD).

  • If the PAUSE OSDMap flag is set and then unset, outstanding requests are resent.

Each time a request is sent to the OSD the attempt field is incremented. The first time it is 0, the next 1, etc.
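
The resend rules above can be summarized as a single predicate. This is a minimal sketch under stated assumptions: the `Target` and `Request` classes, their field names, and the way an interval change is detected (by comparing a stored interval start epoch) are hypothetical simplifications, not the client's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Client's view of where a request goes (hypothetical fields)."""
    primary: int
    interval_start: int        # stand-in for the is_new_interval() criteria
    last_force_op_resend: int  # value seen in the pool's pg_pool_t
    paused: bool               # PAUSE OSDMap flag


@dataclass
class Request:
    tid: int
    target: Target
    resend_on_any_interval: bool = False
    next_attempt: int = 0


def needs_resend(req: Request, new: Target, connection_reset: bool = False) -> bool:
    old = req.target
    if connection_reset:
        return True
    if new.primary != old.primary:
        return True
    if new.last_force_op_resend != old.last_force_op_resend:
        return True
    if req.resend_on_any_interval and new.interval_start != old.interval_start:
        return True
    if old.paused and not new.paused:
        return True
    return False


def send(req: Request) -> int:
    """Return the attempt number placed on the wire: 0, then 1, and so on."""
    attempt = req.next_attempt
    req.next_attempt += 1
    return attempt
```

Note that a plain interval change with the same primary returns False for ordinary requests, which matches the rule that the client need not resend in that case.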

Backoff

Ordinarily the OSD will simply queue any requests it can’t immediately process in memory until such time as it can. This can become problematic because the OSD limits the total amount of RAM consumed by incoming messages: if either of the thresholds for the number of messages or the number of bytes is reached, new messages will not be read off the network socket, causing backpressure through the network.

In some cases, though, the OSD knows or expects that a PG or object will be unavailable for some time and does not want to consume memory by queuing requests. In these cases it can send a MOSDBackoff message to the client.

A backoff request has four properties:

  • the op code (block, unblock, or ack-block)

  • id, a unique id assigned within this session

  • hobject_t begin

  • hobject_t end

There are two types of backoff: a PG backoff will plug all requests targeting an entire PG at the client, as described by a range of the hash/hobject_t space [begin,end), while an object backoff will plug all requests targeting a single object (begin == end).

When the client receives a block backoff message, it is now responsible for not sending any requests for hobject_ts described by the backoff. The backoff remains in effect until the backoff is cleared (via an unblock message) or the OSD session is closed. An ack-block message is sent back to the OSD immediately to acknowledge receipt of the backoff.

When an unblock is received, it will reference a specific id that the client had previously blocked. However, the range described by the unblock may be smaller than the original range, as the PG may have split on the OSD. The unblock should only unblock the range specified in the unblock message. Any requests that fall within the unblock request range are reexamined and, if no other installed backoff applies, resent.
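
The client-side behavior described above can be sketched as a small per-session table. This is an illustration only: positions in the hobject_t space are modeled as plain comparable values, the class and method names are hypothetical, and keeping the leftover part of a partially unblocked range under the same id is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Backoff:
    id: int
    begin: int
    end: int

    def covers(self, obj) -> bool:
        if self.begin == self.end:           # object backoff
            return obj == self.begin
        return self.begin <= obj < self.end  # PG backoff: [begin, end)


class Session:
    def __init__(self):
        self.backoffs = []  # installed Backoff entries
        self.held = []      # (tid, obj) requests that may not be sent
        self.sent = []      # log of outgoing messages

    def blocked(self, obj) -> bool:
        return any(b.covers(obj) for b in self.backoffs)

    def submit(self, tid, obj):
        if self.blocked(obj):
            self.held.append((tid, obj))
        else:
            self.sent.append(("op", tid))

    def handle_block(self, backoff_id, begin, end):
        self.backoffs.append(Backoff(backoff_id, begin, end))
        self.sent.append(("ack-block", backoff_id))  # acknowledge immediately

    def handle_unblock(self, backoff_id, begin, end):
        remaining = []
        for b in self.backoffs:
            if b.id != backoff_id:
                remaining.append(b)
            elif b.begin == b.end:
                if not (begin == end == b.begin):
                    remaining.append(b)
            else:
                # Assumption: keep whatever lies outside the unblocked range.
                if b.begin < begin:
                    remaining.append(Backoff(b.id, b.begin, min(begin, b.end)))
                if end < b.end:
                    remaining.append(Backoff(b.id, max(end, b.begin), b.end))
        self.backoffs = remaining
        still_held = []
        for tid, obj in self.held:
            in_range = obj == begin if begin == end else begin <= obj < end
            if in_range and not self.blocked(obj):
                self.sent.append(("op", tid))  # reexamined and resent
            else:
                still_held.append((tid, obj))
        self.held = still_held
```

For example, after a block on [0, 100) and a later unblock on [0, 50), a held request at position 20 is resent while one at position 70 stays held.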

On the OSD, backoffs are also tracked across ranges of the hash space, and exist in three states:

  • new

  • acked

  • deleting

A newly installed backoff is set to new and a message is sent to the client. When the ack-block message is received it is changed to the acked state. The OSD may process other messages from the client that are covered by the backoff in the new state, but once the backoff is acked it should never see a blocked request unless there is a bug.

If the OSD wants to remove a backoff in the acked state it can simply remove it and notify the client. If the backoff is in the new state it must move it to the deleting state and continue to use it to discard client requests until the ack-block message is received, at which point it can finally be removed. This is necessary to preserve the order of operations processed by the OSD.
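
The OSD-side lifecycle can be sketched as a three-state machine. The class below is hypothetical and simplified: ranges are plain comparable values, and sending the unblock right away when a backoff in the new state is removed is an assumption of this sketch, since the text above only states that the client is notified in the acked case.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    ACKED = "acked"
    DELETING = "deleting"


class OsdBackoffs:
    def __init__(self):
        self.by_id = {}   # backoff id -> [state, begin, end]
        self.outbox = []  # messages queued for the client

    def add(self, bid, begin, end):
        self.by_id[bid] = [State.NEW, begin, end]
        self.outbox.append(("block", bid, begin, end))

    def handle_ack_block(self, bid):
        entry = self.by_id.get(bid)
        if entry is None:
            return
        if entry[0] is State.DELETING:
            del self.by_id[bid]      # the ack arrived, safe to forget it now
        else:
            entry[0] = State.ACKED

    def remove(self, bid):
        state, begin, end = self.by_id[bid]
        # Assumption: the unblock is sent in both cases.
        self.outbox.append(("unblock", bid, begin, end))
        if state is State.ACKED:
            del self.by_id[bid]
        else:
            # Not acked yet: keep discarding until the ack-block shows up.
            self.by_id[bid][0] = State.DELETING

    def should_discard(self, obj) -> bool:
        for _state, begin, end in self.by_id.values():
            if (obj == begin) if begin == end else (begin <= obj < end):
                return True
        return False
```

Keeping the entry in the deleting state means a request that the client sent before it saw the block is still dropped, so it cannot be processed ahead of requests the client will resend after the unblock.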