msgr2 protocol

This is a revision of the legacy Ceph on-wire protocol that was implemented by the SimpleMessenger. It addresses performance and security issues.

Goals

This protocol revision has several goals relative to the original protocol:

  • Flexible handshaking. The original protocol's negotiation was not flexible enough to allow for features that were not required.

  • Encryption. We will incorporate encryption over the wire.

  • Performance. We would like to provide for protocol features (e.g., padding) that keep computation and memory copies out of the fast path where possible.

  • Signing. We will allow for traffic to be signed (but not necessarily encrypted). This may not be implemented in the initial version.

Definitions

  • client (C): the party initiating a (TCP) connection

  • server (S): the party accepting a (TCP) connection

  • connection: an instance of a (TCP) connection between two processes.

  • entity: a ceph entity instantiation, e.g. ‘osd.0’. each entity has one or more unique entity_addr_t’s by virtue of the ‘nonce’ field, which is typically a pid or random value.

  • session: a stateful session between two entities in which message exchange is ordered and lossless. A session might span multiple connections if there is an interruption (TCP connection disconnect).

  • frame: a discrete message sent between the peers. Each frame consists of a tag (type code), payload, and (if signing or encryption is enabled) some other fields. See below for the structure.

  • tag: a type code associated with a frame. The tag determines the structure of the payload.

Phases

A connection has four distinct phases:

  • banner

  • authentication frame exchange

  • message flow handshake frame exchange

  • message frame exchange

Banner

Both the client and server, upon connecting, send a banner:

  1. "ceph %x %x\n", protocol_features_supported, protocol_features_required

The protocol features are a new, distinct namespace. Initially no features are defined or required, so this will be “ceph 0 0\n”.

If the remote party advertises required features we don’t support, we can disconnect.
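The banner exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`make_banner`, `parse_banner`, `missing_features`) are hypothetical and not taken from the Ceph source, and the sketch assumes the banner is read as a single newline-terminated line.

```python
def make_banner(supported: int, required: int) -> bytes:
    """Format the banner line: "ceph %x %x\\n" with both masks in hex."""
    return b"ceph %x %x\n" % (supported, required)


def parse_banner(line: bytes) -> tuple:
    """Return (supported, required) parsed from a peer's banner line."""
    parts = line.rstrip(b"\n").split(b" ")
    if len(parts) != 3 or parts[0] != b"ceph":
        raise ValueError("not a msgr2 banner: %r" % line)
    return int(parts[1].decode("ascii"), 16), int(parts[2].decode("ascii"), 16)


def missing_features(our_supported: int, peer_required: int) -> int:
    """Bits the peer requires that we do not support (0 means compatible)."""
    return peer_required & ~our_supported


# With no features defined, both sides send the same banner.
banner = make_banner(0, 0)
peer_supported, peer_required = parse_banner(banner)
compatible = missing_features(0, peer_required) == 0
```

A non-zero result from `missing_features` corresponds to the case in which the receiving side may disconnect.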

msgr2 protocol - Figure 1

Frame format

All further data sent or received is contained by a frame. Each frame has the form:

  1. frame_len (le32)
  2. tag (TAG_* le32)
  3. frame_header_checksum (le32)
  4. payload
  5. [payload padding -- only present after stream auth phase]
  6. [signature -- only present after stream auth phase]
  • The frame_header_checksum is over just the frame_len and tag values (8 bytes).

  • frame_len includes everything after the frame_len le32 up to the end of the frame (all payloads, signatures, and padding).

  • The payload format and length is determined by the tag.

  • The signature portion is only present if the authentication phase has completed (TAG_AUTH_DONE has been sent) and signatures are enabled.
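As a rough illustration of the layout above, the following Python sketch packs and unpacks a frame as it would look before the authentication phase completes (no padding, no signature). The text does not say which checksum algorithm is used, so `zlib.crc32` is used here purely as a stand-in, and the function names and the tag value are hypothetical.

```python
import struct
import zlib

# frame_len, tag, frame_header_checksum: three little-endian 32-bit fields.
HEADER = struct.Struct("<III")


def header_checksum(frame_len: int, tag: int) -> int:
    # ASSUMPTION: the checksum algorithm is not specified in the text;
    # zlib.crc32 is only a placeholder over the 8 bytes of frame_len + tag.
    return zlib.crc32(struct.pack("<II", frame_len, tag)) & 0xFFFFFFFF


def encode_frame(tag: int, payload: bytes) -> bytes:
    # frame_len counts everything after the frame_len field itself:
    # tag (4 bytes) + checksum (4 bytes) + payload.
    frame_len = 4 + 4 + len(payload)
    return HEADER.pack(frame_len, tag, header_checksum(frame_len, tag)) + payload


def decode_frame(buf: bytes) -> tuple:
    """Return (tag, payload, remaining_bytes) for the first frame in buf."""
    if len(buf) < HEADER.size:
        raise ValueError("short frame header")
    frame_len, tag, csum = HEADER.unpack_from(buf, 0)
    if csum != header_checksum(frame_len, tag):
        raise ValueError("frame header checksum mismatch")
    end = 4 + frame_len
    if frame_len < 8 or len(buf) < end:
        raise ValueError("truncated frame")
    return tag, buf[HEADER.size:end], buf[end:]


EXAMPLE_TAG = 1  # hypothetical tag value, for illustration only
frame = encode_frame(EXAMPLE_TAG, b"hello")
tag, payload, rest = decode_frame(frame)
```

Checking the header checksum before trusting frame_len lets a receiver reject a corrupted length field without waiting for a payload that will never arrive.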

Hello

  • TAG_HELLO: client->server and server->client:
  1. __u8 entity_type
  2. entity_addr_t peer_socket_address
  • We immediately share our entity type and the address of the peer (which can be useful for detecting our effective IP address, especially in the presence of NAT).

Authentication

  • TAG_AUTH_REQUEST: client->server:
  1. __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
  2. __le32 num_preferred_modes;
  3. list<__le32> mode // CEPH_CON_MODE_*
  4. method specific payload
  • TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method:
  1. __le32 method
  2. __le32 negative error result code
  3. __le32 num_methods
  4. list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...}
  5. __le32 num_modes
  6. list<__le32> allowed_modes // CEPH_CON_MODE_*
  • Returns the attempted auth method, an error code (-EOPNOTSUPP if the method is unsupported), and the lists of allowed authentication methods and connection modes.
  • TAG_AUTH_REPLY_MORE: server->client:
  1. __le32 len;
  2. method specific payload
  • TAG_AUTH_REQUEST_MORE: client->server:
  1. __le32 len;
  2. method specific payload
  • TAG_AUTH_DONE: (server->client):
  1. __le64 global_id
  2. __le32 connection mode // CEPH_CON_MODE_*
  3. method specific payload
  • The server is the one to decide authentication has completed and what the final connection mode will be.
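The field lists above translate directly into little-endian packing. The sketch below encodes a TAG_AUTH_REQUEST payload and decodes a TAG_AUTH_BAD_METHOD payload. The function names are hypothetical, the numeric method and mode values in the example are placeholders rather than the real CEPH_AUTH_* / CEPH_CON_MODE_* constants, and the error code is read as a signed 32-bit value because the text describes it as negative.

```python
import struct


def encode_auth_request(method: int, preferred_modes: list,
                        method_payload: bytes = b"") -> bytes:
    """method, num_preferred_modes, the modes, then the method payload."""
    head = struct.pack("<II", method, len(preferred_modes))
    modes = struct.pack("<%dI" % len(preferred_modes), *preferred_modes)
    return head + modes + method_payload


def decode_auth_bad_method(payload: bytes) -> dict:
    """Parse the fixed fields and the two length-prefixed le32 lists."""
    # The result is a negative error code, so decode it as signed ('i').
    method, result, num_methods = struct.unpack_from("<IiI", payload, 0)
    off = 12
    methods = list(struct.unpack_from("<%dI" % num_methods, payload, off))
    off += 4 * num_methods
    (num_modes,) = struct.unpack_from("<I", payload, off)
    off += 4
    modes = list(struct.unpack_from("<%dI" % num_modes, payload, off))
    return {
        "method": method,
        "result": result,
        "allowed_methods": methods,
        "allowed_modes": modes,
    }


# Placeholder values: method 2 offered with modes [2, 1]; the server
# rejects it with error -95 and allows method 1 with modes [1, 2].
request = encode_auth_request(2, [2, 1], b"xy")
rejection = decode_auth_bad_method(struct.pack("<IiI1II2I", 2, -95, 1, 1, 2, 1, 2))
```

On a rejection, a client would typically pick a method from `allowed_methods` and send a new TAG_AUTH_REQUEST.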

Example of authentication phase interaction when the client uses an allowed authentication method:

msgr2 protocol - Figure 2

Example of authentication phase interaction when the client uses a forbidden authentication method as the first attempt:

msgr2 protocol - Figure 3

Post-auth frame format

The frame format is fixed (see above), but can take three different forms, depending on the AUTH_DONE flags:

  • If neither FLAG_SIGNED nor FLAG_ENCRYPTED is specified, things are simple:
  1. frame_len
  2. tag
  3. payload
  4. payload_padding (out to auth block_size)
  • The padding is some number of bytes < the auth block_size that brings the total length of the payload + payload_padding to a multiple of block_size. It does not include the frame_len or tag. Padding content can be zeros or (better) random bytes.
  • If FLAG_SIGNED has been specified:
  1. frame_len
  2. tag
  3. payload
  4. payload_padding (out to auth block_size)
  5. signature (sig_size bytes)

Here the padding just makes life easier for the signature. It can be random data to act as an additional confounder. Note also that the signature input must include some state from the session key and the previous message.

  • If FLAG_ENCRYPTED has been specified:
  1. frame_len
  2. tag
  3. {
  4. payload
  5. payload_padding (out to auth block_size)
  6. } ^ stream cipher

Note that the padding ensures that the total frame is a multiple of the auth method’s block_size so that the message can be sent out over the wire without waiting for the next frame in the stream.
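The padding rule used by all three forms (fewer than block_size bytes, bringing payload + payload_padding to a multiple of block_size) reduces to one modular expression. The Python sketch below shows it; the helper names are hypothetical and the block size of 16 is just an example value, since the real block_size comes from the auth method.

```python
import os


def padding_len(payload_len: int, block_size: int) -> int:
    """Bytes needed to round payload_len up to a multiple of block_size.

    The result is always in the range [0, block_size), matching the rule
    that the padding is strictly smaller than the block size.
    """
    return (-payload_len) % block_size


def pad_payload(payload: bytes, block_size: int, random_fill: bool = True) -> bytes:
    """Append padding; random bytes are preferred, zeros are also allowed."""
    n = padding_len(len(payload), block_size)
    fill = os.urandom(n) if random_fill else bytes(n)
    return payload + fill


# A 13-byte payload with an example block size of 16 needs 3 padding bytes.
padded = pad_payload(b"a" * 13, 16)
```

The frame_len and tag fields are not part of this calculation, as stated above.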

Message flow handshake

In this phase the peers identify each other and (if desired) reconnect to an established session.

  • TAG_CLIENT_IDENT (client->server): identify ourselves:
  1. __le32 num_addrs
  2. entity_addrvec_t*num_addrs entity addrs
  3. entity_addr_t target entity addr
  4. __le64 gid (numeric part of osd.0, client.123456, ...)
  5. __le64 global_seq
  6. __le64 features supported (CEPH_FEATURE_* bitmask)
  7. __le64 features required (CEPH_FEATURE_* bitmask)
  8. __le64 flags (CEPH_MSG_CONNECT_* bitmask)
  9. __le64 cookie
  • client will send first, server will reply with same. if this is a new session, the client and server can proceed to the message exchange.

  • the target addr is who the client is trying to connect to, so that the server side can close the connection if the client is talking to the wrong daemon.

  • type.gid (entity_name_t) is set here, by combining the type shared in the hello frame with the gid here. this means we don’t need it in the header of every message. it also means that we can’t send messages “from” other entity_name_t’s. the current implementations set this at the top of _send_message etc so this shouldn’t break any existing functionality. implementation will likely want to mask this against what the authenticated credential allows.

  • cookie is the client cookie used to identify a session, and can be used to reconnect to an existing session.

  • we’ve dropped the ‘protocol_version’ field from msgr1

  • TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_IDENT with too few features:
  1. __le64 features we require that the peer didn't advertise
  • TAG_SERVER_IDENT (server->client): accept client ident and identify server:
  1. __le32 num_addrs
  2. entity_addrvec_t*num_addrs entity addrs
  3. __le64 gid (numeric part of osd.0, client.123456, ...)
  4. __le64 global_seq
  5. __le64 features supported (CEPH_FEATURE_* bitmask)
  6. __le64 features required (CEPH_FEATURE_* bitmask)
  7. __le64 flags (CEPH_MSG_CONNECT_* bitmask)
  8. __le64 cookie
  • The server cookie can be used by the client if it is later disconnected and wants to reconnect and resume the session.
  • TAG_RECONNECT (client->server): reconnect to an established session:
  1. __le32 num_addrs
  2. entity_addr_t * num_addrs
  3. __le64 client_cookie
  4. __le64 server_cookie
  5. __le64 global_seq
  6. __le64 connect_seq
  7. __le64 msg_seq (the last msg seq received)
  • TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt:
  1. __le64 msg_seq (last msg seq received)
  • once the client receives this, the client can proceed to message exchange.

  • once the server sends this, the server can proceed to message exchange.

  • TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq

  • TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq

  • TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.

    • Indicates that the server is already connecting to the client, and that direction should win the race. The client should wait for that connection to complete.
  • TAG_RESET_SESSION (server only): ask client to reset session:

  1. __u8 full
  • full flag indicates whether peer should do a full reset, i.e., drop message queue.
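The server-side decisions in this phase can be sketched as two small functions: one computing the mask sent in TAG_IDENT_MISSING_FEATURES, and one choosing a reply to TAG_RECONNECT. This is a hedged illustration only. The `SessionState` record, the order of the checks, and the reading of "stale" as "lower than the value the server has recorded" are all assumptions made for the sketch; the text above names the reply tags but does not define the exact comparisons.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical server-side record of an established session."""
    server_cookie: int
    global_seq: int
    connect_seq: int
    connecting_out: bool = False  # server is already connecting to the client


def missing_features(server_required: int, client_supported: int) -> int:
    """Features we require that the peer didn't advertise (0 if none)."""
    return server_required & ~client_supported


def reconnect_reply(existing: Optional[SessionState], server_cookie: int,
                    global_seq: int, connect_seq: int) -> str:
    """Pick a reply tag for a TAG_RECONNECT (assumed check order)."""
    # ASSUMPTION: an unknown session or cookie mismatch leads to a reset.
    if existing is None or existing.server_cookie != server_cookie:
        return "TAG_RESET_SESSION"
    # ASSUMPTION: "stale" means lower than what the server has recorded.
    if global_seq < existing.global_seq:
        return "TAG_RECONNECT_RETRY_GLOBAL"
    if connect_seq < existing.connect_seq:
        return "TAG_RECONNECT_RETRY_SESSION"
    if existing.connecting_out:
        return "TAG_RECONNECT_WAIT"
    return "TAG_RECONNECT_OK"


session = SessionState(server_cookie=0xBEEF, global_seq=10, connect_seq=3)
reply = reconnect_reply(session, 0xBEEF, 10, 3)
```

Returning the tag name as a string keeps the sketch independent of the numeric tag values, which are not given in this section.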

Example of failure scenarios:

  • First client’s client_ident message is lost, and then client reconnects.

msgr2 protocol - Figure 4

  • Server’s server_ident message is lost, and then client reconnects.

msgr2 protocol - Figure 5

  • Server’s server_ident message is lost, and then server reconnects.

msgr2 protocol - Figure 6

  • Connection failure after session is established, and then client reconnects.

msgr2 protocol - Figure 7

  • Connection failure after session is established because server reset, and then client reconnects.

msgr2 protocol - Figure 8

RC* means that the reset session full flag depends on the policy.resetcheck of the connection.

  • Connection failure after session is established because client reset, and then client reconnects.

msgr2 protocol - Figure 9

Message exchange

Once a session is established, we can exchange messages.

  • TAG_MSG: a message:
  1. ceph_msg_header2
  2. front
  3. middle
  4. data_pre_padding
  5. data
    • The ceph_msg_header2 is modified from ceph_msg_header:
      • include an ack_seq. This avoids the need for a TAG_ACK message most of the time.

      • remove the src field, which we now get from the message flow handshake (TAG_IDENT).

      • specifies the data_pre_padding length, which can be used to adjust the alignment of the data payload. (NOTE: is this useful?)

  • TAG_ACK: acknowledge receipt of message(s):
  1. __le64 seq
  • This is only used for stateful sessions.
  • TAG_KEEPALIVE2: check for connection liveness:
  1. ceph_timespec stamp
  • Time stamp is local to sender.
  • TAG_KEEPALIVE2_ACK: reply to a keepalive2:
  1. ceph_timespec stamp
  • Time stamp is from the TAG_KEEPALIVE2 we are responding to.
  • TAG_CLOSE: terminate a connection

Indicates that a connection should be terminated. This is equivalent to a hangup or reset (i.e., should trigger ms_handle_reset). It isn’t strictly necessary or useful as we could just disconnect the TCP connection.
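To illustrate how acknowledgements keep a stateful session lossless, the sketch below tracks sent-but-unacknowledged messages. An acknowledged seq, whether it arrives in a TAG_ACK frame or as the ack_seq in a message header, trims the queue; after a reconnect, the msg_seq reported by the peer determines what must be resent. The `SendQueue` class and its method names are hypothetical, and starting sequence numbers at 1 is an assumption.

```python
from collections import deque


class SendQueue:
    """Hypothetical sender-side bookkeeping for one session."""

    def __init__(self):
        self.next_seq = 1      # ASSUMPTION: sequence numbers start at 1
        self.unacked = deque() # (seq, message) pairs, oldest first

    def send(self, message: bytes) -> int:
        """Assign the next seq and remember the message until it is acked."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.append((seq, message))
        return seq

    def handle_ack(self, ack_seq: int) -> None:
        """Drop every message with seq <= ack_seq (acks are cumulative)."""
        while self.unacked and self.unacked[0][0] <= ack_seq:
            self.unacked.popleft()

    def to_resend(self, peer_msg_seq: int) -> list:
        """After a reconnect, return messages the peer has not received."""
        self.handle_ack(peer_msg_seq)
        return list(self.unacked)


queue = SendQueue()
for body in (b"a", b"b", b"c"):
    queue.send(body)
queue.handle_ack(2)            # peer acknowledged seq 1 and 2
pending = queue.to_resend(2)   # peer reported msg_seq 2 on reconnect
```

Treating acks as cumulative is what allows the ack_seq carried in ordinary messages to replace most explicit TAG_ACK frames.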

Example of protocol interaction (WIP)

msgr2 protocol - Figure 10