Rados Gateway Data Layout

Although the source code is the ultimate guide, this document helps new developers to get up to speed with the implementation details.

Introduction

Swift offers something called a container, which we use interchangeably with the term bucket. One may say that RGW’s buckets implement Swift containers.

This document does not consider how RGW operates on these structures, e.g. the use of encode() and decode() methods for serialization and so on.

Conceptual View

Although RADOS only knows about pools and objects with their xattrs and omap[1], conceptually RGW organizes its data into three different kinds: metadata, bucket index, and data.

Metadata

We have 3 ‘sections’ of metadata: ‘user’, ‘bucket’, and ‘bucket.instance’. You can use the following commands to introspect metadata entries:

  $ radosgw-admin metadata list
  $ radosgw-admin metadata list bucket
  $ radosgw-admin metadata list bucket.instance
  $ radosgw-admin metadata list user

  $ radosgw-admin metadata get bucket:<bucket>
  $ radosgw-admin metadata get bucket.instance:<bucket>:<bucket_id>
  $ radosgw-admin metadata get user:<user> # get or set

The metadata sections used in the commands above are:

  • user: Holds user information

  • bucket: Holds a mapping between bucket name and bucket instance id

  • bucket.instance: Holds bucket instance information[2]
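The key syntax used by the `metadata get` commands above can be sketched as a small helper. This is only an illustration of how the section name and identifiers are joined with colons; the function name `metadata_key` and the bucket name `mybucket` are hypothetical and not part of RGW.

```python
# Sketch only: builds the "<section>:<id>[:<id>...]" strings seen in the
# radosgw-admin commands above. metadata_key() is a hypothetical helper.
SECTIONS = ("user", "bucket", "bucket.instance")


def metadata_key(section, *parts):
    """Join a metadata section and its identifiers with colons."""
    if section not in SECTIONS:
        raise ValueError(f"unknown metadata section: {section}")
    if not parts:
        # A bare section name is what 'metadata list <section>' takes.
        return section
    return section + ":" + ":".join(parts)


print(metadata_key("bucket", "mybucket"))
print(metadata_key("bucket.instance", "mybucket", "default.7593.4"))
print(metadata_key("user", "foo"))
```

The resulting strings are what one would pass to `radosgw-admin metadata get`.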

Every metadata entry is kept on a single rados object. See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a rados pgls operation on the containing pool.

Bucket Index

The bucket index is a different kind of metadata and is kept separately. It holds a key-value map in rados objects. By default it is a single rados object per bucket, but since Hammer it is possible to shard that map over multiple rados objects. The map itself is kept in omap, associated with each rados object. Each omap key is the name of an object, and the value holds some basic metadata of that object, namely the metadata that shows up when listing the bucket. Each omap also holds a header, in which we keep some bucket accounting metadata (number of objects, total size, etc.).

Note that we also hold other information in the bucket index, and it’s kept in other key namespaces. We can hold the bucket index log there, and for versioned objects there is more information that we keep on other keys.
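As a conceptual aid, the map-plus-header structure of one bucket index object can be modeled in a few lines of Python. This is not RGW's on-disk encoding: the class names, the `etag` field, and the way the header counters are updated are assumptions made for illustration. The other key namespaces (index log, versioning) are left out.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Basic per-object metadata that a bucket listing would show (hypothetical fields)."""
    size: int
    etag: str = ""


@dataclass
class BucketIndexShard:
    """Toy model of one index object: an omap plus a header with accounting data."""
    entries: dict = field(default_factory=dict)  # omap: object name -> IndexEntry
    num_objects: int = 0                         # header: number of objects
    total_size: int = 0                          # header: total size in bytes

    def put(self, name, entry):
        old = self.entries.get(name)
        if old is None:
            self.num_objects += 1
        else:
            self.total_size -= old.size  # overwrite: drop the old size first
        self.entries[name] = entry
        self.total_size += entry.size

    def remove(self, name):
        old = self.entries.pop(name, None)
        if old is not None:
            self.num_objects -= 1
            self.total_size -= old.size

    def list(self, prefix=""):
        """Return object names in sorted key order, optionally filtered by prefix."""
        return sorted(n for n in self.entries if n.startswith(prefix))


idx = BucketIndexShard()
idx.put("image.png", IndexEntry(size=10))
idx.put("docs/readme", IndexEntry(size=5))
print(idx.list(), idx.num_objects, idx.total_size)
```

Keeping the counters in the header means bucket statistics can be read without walking every key.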

Data

Object data is kept in one or more rados objects for each rgw object.

Object Lookup Path

When accessing objects, ReST APIs come to RGW with three parameters: account information (access key in S3 or account name in Swift), bucket or container name, and object name (or key). At present, RGW only uses account information to find out the user ID and for access control. Only the bucket name and object key are used to address the object in a pool.

The user ID in RGW is a string, typically the actual user name from the user credentials and not a hashed or mapped identifier.

When accessing a user’s data, the user record is loaded from an object “<user_id>” in pool “default.rgw.meta” with namespace “users.uid”.

Bucket names are represented in the pool “default.rgw.meta” with namespace “root”. The bucket record is loaded in order to obtain the so-called marker, which serves as a bucket ID.

The object is located in pool “default.rgw.buckets.data”. Object name is “<marker>_<key>”, for example “default.7593.4_image.png”, where the marker is “default.7593.4” and the key is “image.png”. Since these concatenated names are not parsed, only passed down to RADOS, the choice of the separator is not important and causes no ambiguity. For the same reason, slashes are permitted in object names (keys).
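The lookup path described above can be summarized as three (pool, namespace, object name) triples. The sketch below only assembles names. It does not talk to RADOS, and the helper names are hypothetical. It assumes the “default” zone prefix used throughout this section, and it assumes that the bucket record in namespace “root” is an object named after the bucket.

```python
from typing import NamedTuple


class RadosLocation(NamedTuple):
    pool: str
    namespace: str
    oid: str


ZONE = "default"  # zone prefix of the pool names used in this section


def user_record(user_id):
    """Where the user record lives."""
    return RadosLocation(f"{ZONE}.rgw.meta", "users.uid", user_id)


def bucket_record(bucket):
    """Where the bucket record lives (assumption: the object is named after the bucket)."""
    return RadosLocation(f"{ZONE}.rgw.meta", "root", bucket)


def head_object(marker, key):
    """Where the object itself lives: the name is simply '<marker>_<key>'."""
    return RadosLocation(f"{ZONE}.rgw.buckets.data", "", f"{marker}_{key}")


print(user_record("foo"))
print(bucket_record("testcont"))
print(head_object("default.7593.4", "image.png"))
```

Because the name is built by plain concatenation and never parsed back, a key such as `photos/2020/image.png` needs no special handling in this sketch.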

It is also possible to create multiple data pools and make it so that different users’ buckets will be created in different rados pools by default, thus providing the necessary scaling. The layout and naming of these pools is controlled by a ‘policy’ setting.[3]

An RGW object may consist of several RADOS objects, the first of which is the head that contains the metadata, such as manifest, ACLs, content type, ETag, and user-defined metadata. The metadata is stored in xattrs. The head may also contain up to 512 kilobytes of object data, for efficiency and atomicity. The manifest describes how each object is laid out in RADOS objects.
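To make the head/tail split concrete, the following sketch computes how many bytes would land in the head and in each subsequent RADOS object. The 512 KB head limit comes from the text above. The 4 MiB tail chunk size and the rule that the head is always filled first are assumptions for illustration only; the real layout is whatever the manifest records.

```python
HEAD_MAX = 512 * 1024          # head may hold up to 512 KB of data (from the text)
TAIL_CHUNK = 4 * 1024 * 1024   # assumed tail object size, illustrative only


def split_sizes(total, head_max=HEAD_MAX, tail_chunk=TAIL_CHUNK):
    """Return the byte counts for the head followed by each tail object."""
    if total < 0:
        raise ValueError("size must be non-negative")
    head = min(total, head_max)
    sizes = [head]              # the head always exists; it carries the xattrs
    remaining = total - head
    while remaining > 0:
        chunk = min(remaining, tail_chunk)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


print(split_sizes(100))               # small object: fits in the head
print(split_sizes(512 * 1024 + 1))    # one byte spills into a tail object
```

A small object therefore occupies a single RADOS object, which is what makes small writes atomic.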

Bucket and Object Listing

Buckets that belong to a given user are listed in an omap of an object named “<user_id>.buckets” (for example, “foo.buckets”) in pool “default.rgw.meta” with namespace “users.uid”. These objects are accessed when listing buckets, when updating bucket contents, and when updating and retrieving bucket statistics (e.g. for quota).

See the user-visible, encoded class ‘cls_user_bucket_entry’ and its nested class ‘cls_user_bucket’ for the values of these omap entries.

These listings are kept consistent with buckets in pool “.rgw”.
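The naming of the per-user objects can be sketched as follows. The “<tenant>$<user>” form is inferred from the “prodtx$prodt” example in the compendium at the end of this document; the helper names are hypothetical.

```python
def user_oid(user, tenant=""):
    """Object holding the user record; the tenant is prefixed only when non-empty."""
    return f"{tenant}${user}" if tenant else user


def user_buckets_oid(user, tenant=""):
    """Object whose omap lists the buckets owned by the user."""
    return user_oid(user, tenant) + ".buckets"


print(user_oid("test2"), user_buckets_oid("test2"))
print(user_oid("prodt", "prodtx"), user_buckets_oid("prodt", "prodtx"))
```

Both objects live in the same pool and namespace, so only the “.buckets” suffix tells them apart.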

Objects that belong to a given bucket are listed in a bucket index, as discussed in sub-section ‘Bucket Index’ above. The default naming for index objects is “.dir.<marker>” in pool “default.rgw.buckets.index”.
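A minimal sketch of the index object naming follows. The unsharded name “.dir.<marker>” is taken from the text. For sharded indexes this document only says that the shard index is appended after the marker, so the “.” separator and the zero-based numbering used here are assumptions, as is the helper name.

```python
def index_oids(marker, num_shards=0):
    """Return the bucket index object names for a bucket with the given marker."""
    if num_shards <= 0:
        # Unsharded: a single index object per bucket.
        return [f".dir.{marker}"]
    # Sharded: assumed '.dir.<marker>.<shard>' with shards numbered from 0.
    return [f".dir.{marker}.{shard}" for shard in range(num_shards)]


print(index_oids("default.7593.4"))
print(index_oids("default.7593.4", num_shards=2))
```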

Footnotes

[1] Omap is a key-value store, associated with an object, in a way similar to how Extended Attributes associate with a POSIX file. An object’s omap is not physically located in the object’s storage, but its precise implementation is invisible and immaterial to RADOS Gateway. In Hammer, one LevelDB is used to store omap in each OSD.

[2] Before the Dumpling release, the ‘bucket.instance’ metadata did not exist and the ‘bucket’ metadata contained its information. It is possible to encounter such buckets in old installations.

[3] The pool names have been changed starting with the Infernalis release. If you are looking at an older setup, some details may be different. In particular there was a different pool for each of the namespaces that are now being used inside the default.root.meta pool.

Appendix: Compendium

Known pools:

  • .rgw.root
    • Unspecified region, zone, and global information records, one per object.

  • .rgw.control
    • notify.

  • .rgw.meta
    • Multiple namespaces with different kinds of metadata:

    • namespace: root
      • .bucket.meta.: # see put_bucket_instance_info()

      The tenant is used to disambiguate buckets, but not bucket instances. Example:

        .bucket.meta.prodtx:test%25star:default.84099.6
        .bucket.meta.testcont:default.4126.1
        .bucket.meta.prodtx:testcont:default.84099.4
        prodtx/testcont
        prodtx/test%25star
        testcont

    • namespace: users.uid
      • Contains both per-user information (RGWUserInfo) in “<user_id>” objects and per-user lists of buckets in omaps of “<user_id>.buckets” objects. The “<user_id>” may contain the tenant if non-empty, for example:

        prodtx$prodt
        test2.buckets
        prodtx$prodt.buckets
        test2

    • namespace: users.email
      • Unimportant

    • namespace: users.keys
      • 47UA98JSTJZ9YAN3OS3O

      This allows radosgw to look up users by their access keys during authentication.

    • namespace: users.swift
      • test:tester

  • .rgw.buckets.index
    • Objects are named “.dir.<marker>”, and each contains a bucket index. If the index is sharded, each shard appends the shard index after the marker.

  • .rgw.buckets.data
    • default.7593.4_shadow.488urDFerTYXavx4yAd-Op8mxehnvTI1

An example of a marker would be “default.16004.1” or “default.7593.4”. The current format is “<zone>.<instance_id>.<bucket_id>”. But once generated, a marker is not parsed again, so its format may change freely in the future.