Deduplication

Introduction

Applying data deduplication to an existing software stack is not easy, due to the additional metadata management and changes to the original data-processing procedure.

In a typical deduplication system, the input source as a data object is split into multiple chunks by a chunking algorithm. The deduplication system then compares each chunk with the existing data chunks, stored in the storage previously. To this end, a fingerprint index that stores the hash value of each chunk is employed by the deduplication system in order to easily find the existing chunks by comparing hash values rather than searching all contents that reside in the underlying storage.
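As a minimal sketch of this pipeline (not Ceph's implementation), the following Python uses fixed-size chunking, SHA-1 fingerprints, and an in-memory dict as the fingerprint index; all names are illustrative.

```python
import hashlib

def dedup_store(data: bytes, index: dict, chunk_size: int = 4):
    """Split data into fixed-size chunks and store only chunks whose
    fingerprint is not already present in the index."""
    stored = deduped = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()  # fingerprint of the chunk
        if fp in index:
            deduped += 1               # duplicate: only bump the reference count
            index[fp][1] += 1
        else:
            stored += 1                # new chunk: store data with refcount 1
            index[fp] = [chunk, 1]
    return stored, deduped

index = {}
# three identical 4-byte chunks: one is stored, two are deduplicated
stored, deduped = dedup_store(b"abcdabcdabcd", index)
```

Comparing fingerprints this way avoids scanning all stored contents; only a hash lookup is needed per chunk.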

There are many challenges in implementing deduplication on top of Ceph. Among them, two issues are essential. The first is managing the scalability of the fingerprint index; the second is ensuring compatibility between the newly applied deduplication metadata and the existing metadata.

Key Idea

  1. Content hashing (double hashing): Each client can find the object data for an object ID using CRUSH. With CRUSH, a client knows the object's location in the Base tier. By hashing the object's content at the Base tier, a new OID (chunk ID) is generated. The Chunk tier stores the new OID, which holds a partial content of the original object.

Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K -> CRUSH(K) -> chunk's location
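A toy sketch of the double-hashing flow above; `crush()` here is just a deterministic stand-in for the real CRUSH placement algorithm, and all names are illustrative.

```python
import hashlib

def crush(oid: str, num_osds: int = 8) -> int:
    """Stand-in for CRUSH: deterministically map an OID to a location.
    (Real CRUSH is a hierarchical placement algorithm; this is a toy.)"""
    return int(hashlib.md5(oid.encode()).hexdigest(), 16) % num_osds

def chunk_oid(content: bytes) -> str:
    """Second hash: the chunk's OID is the hash of the chunk's content."""
    return hashlib.sha256(content).hexdigest()

content = b"partial content of the original object"
k = chunk_oid(content)   # HASH(content) = K, the new OID
location = crush(k)      # CRUSH(K) -> chunk's location in the chunk tier
```

Because the chunk OID is derived from content alone, any client holding the same content computes the same OID and therefore the same location, with no external lookup table.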

  2. Self-contained object: An external metadata design makes integration with existing storage features difficult, since those features cannot recognize the additional external data structures. If we can design the data deduplication system without any external component, the original storage features can be reused.

More details can be found at https://ieeexplore.ieee.org/document/8416369

Design

Deduplication - Figure 1

Pool-based object management: We define two pools. The metadata pool stores metadata objects and the chunk pool stores chunk objects. Since these two pools are divided based on purpose and usage, each pool can be managed more efficiently according to its characteristics. The base pool and the chunk pool can separately select a redundancy scheme between replication and erasure coding depending on usage, and each pool can be placed in a different storage location depending on the required performance.

Manifest Object: Metadata objects, which contain the metadata for data deduplication, are stored in the base pool.

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

Chunk Object: Chunk objects are stored in the chunk pool. A chunk object contains the chunk data and its reference count information.

Although chunk objects and manifest objects have a different purpose from existing objects, they can be handled in the same way as original objects. Therefore, no additional operations are needed to support existing features such as replication.

Usage

To set up deduplication pools, you must have two pools: one acts as the base pool and the other as the chunk pool. The base pool needs to be configured with the fingerprint_algorithm option as follows.

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 --yes-i-really-mean-it
  • Create objects

    rados -p base_pool put foo ./foo
    rados -p chunk_pool put foo-chunk ./foo-chunk

  • Make a manifest object

    rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Interface

  • set-redirect

Set up a redirect between a base_object in the base_pool and a target_object in the target_pool. A redirected object forwards all client operations to the target_object.

  rados -p base_pool set-redirect <base_object> --target-pool <target_pool> <target_object>
  • set-chunk

Set a chunk offset in a source_object to create a link between it and a target_object.

  rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool <caspool> <target_object> <target-offset>
  • tier-promote

Promote the object (including chunks).

  rados -p base_pool tier-promote <obj-name>
  • unset-manifest

Unset the manifest option on an object that has a manifest.

  rados -p base_pool unset-manifest <obj-name>
  • tier-flush

Flush an object that has chunks to the chunk pool.

  rados -p base_pool tier-flush <obj-name>

Dedup tool

The dedup tool has two features: finding the optimal chunk offset for dedup chunking, and fixing reference counts.

  • find optimal chunk offset

    • fixed chunk

To find a good fixed chunk length, you need to run the following command multiple times while varying the chunk_size.

  ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
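Conceptually, the estimate op measures how much a given chunk size would deduplicate your data. A hypothetical Python model of that idea (not the tool's actual computation; names are illustrative):

```python
import hashlib

def estimate_fixed(data: bytes, chunk_size: int) -> float:
    """Fraction of fixed-size chunks that duplicate an earlier chunk."""
    fps = [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
           for i in range(0, len(data), chunk_size)]
    return 1 - len(set(fps)) / len(fps)

data = b"abcdefgh" * 64
# sweep candidate chunk sizes, as the repeated estimate runs would
ratios = {size: estimate_fixed(data, size) for size in (4, 8, 16)}
```

Running the sweep and picking the size with the best ratio mirrors the manual "run many times while changing chunk_size" procedure above.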
  • rabin chunk (Rabin-Karp algorithm)

The Rabin-Karp algorithm is a string-searching algorithm based on a rolling hash. A rolling hash alone is not enough for deduplication, because it does not tell us where the chunk boundaries are; we need content-based slicing that uses a rolling hash for content-defined chunking. The current implementation uses the simplest approach: look for chunk boundaries by inspecting the rolling hash for a pattern (for example, the lower N bits all being zero).
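A small Python sketch of this idea (illustrative, not the tool's implementation): a polynomial rolling hash over a fixed window, cutting a chunk wherever the lower mask_bits bits of the hash are all zero. The parameter names loosely mirror the tool's options.

```python
def cdc_chunks(data, mask_bits=3, window=4, min_chunk=2,
               prime=257, mod=1 << 31):
    """Content-defined chunking: cut where the rolling hash of the last
    `window` bytes has its lower `mask_bits` bits all zero."""
    chunks, start, h = [], 0, 0
    pow_w = pow(prime, window - 1, mod)  # weight of the oldest byte
    for i, b in enumerate(data):
        if i - start >= window:
            h = (h - data[i - window] * pow_w) % mod  # drop oldest byte
        h = (h * prime + b) % mod                     # shift in new byte
        if i - start + 1 >= min_chunk and h & ((1 << mask_bits) - 1) == 0:
            chunks.append(data[start:i + 1])          # boundary found: cut
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                   # trailing partial chunk
    return chunks
```

Because boundaries depend only on content, inserting a byte near the front shifts only the first boundary; later chunks realign and still deduplicate.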

  • Usage

    Users who want to use deduplication need to find an ideal chunk offset. To do so, they should discover the optimal configuration for their data workload via ceph-dedup-tool. This chunking information is then used for object chunking through the set-chunk API.

    ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size --chunk-algorithm rabin --fingerprint-algorithm rabin

    ceph-dedup-tool has many options for tuning rabin chunking:

    mod-prime <uint64_t>
    rabin-prime <uint64_t>
    pow <uint64_t>
    chunk-mask-bit <uint32_t>
    window-size <uint32_t>
    min-chunk <uint32_t>
    max-chunk <uint64_t>

    The following equation shows how the above options are used to compute the rabin hash.

    rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % mod_prime
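A quick Python check of the recurrence, assuming (as the option names suggest) that pow is rabin_prime raised to the window size modulo mod_prime; the constants below are illustrative. The rolling update must equal a full polynomial-hash recomputation over the current window.

```python
rabin_prime, mod_prime, window_size = 257, 1_000_000_007, 4
data = b"abcdefgh"

def full_hash(win):
    """Polynomial hash of a whole window, highest power first."""
    h = 0
    for byte in win:
        h = (h * rabin_prime + byte) % mod_prime
    return h

pow_ = pow(rabin_prime, window_size, mod_prime)  # the `pow` parameter
rabin_hash = full_hash(data[:window_size])
for i in range(window_size, len(data)):
    old_byte, new_byte = data[i - window_size], data[i]
    # the documented recurrence: shift in the new byte, drop the old one
    rabin_hash = (rabin_hash * rabin_prime + new_byte
                  - old_byte * pow_) % mod_prime
    assert rabin_hash == full_hash(data[i - window_size + 1:i + 1])
```

This O(1) update per byte is what makes scanning every possible boundary position affordable.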
  • Fixed chunk vs content-defined chunk

Content-defined chunking may or may not be the optimal solution. For example,

Data chunk A : abcdefgabcdefgabcdefg

Let’s think about Data chunk A’s deduplication. The ideal chunk is the repeating unit from offset 1 to 7 (abcdefg), so with fixed chunking a chunk length of 7 is optimal. But in the case of content-based slicing, the optimal chunk length may not be found (the dedup ratio will not be 100%), because we need to find optimal parameters such as the boundary bits, window size and prime value. This is not as easy as fixed chunking. However, content-defined chunking is very effective in the following case.

Data chunk B : abcdefgabcdefgabcdefg

Data chunk C : Tabcdefgabcdefgabcdefg
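The inserted leading "T" in chunk C shifts every fixed-chunk boundary, so fixed chunking shares nothing with B, while content-defined chunking realigns after the insertion. A small Python illustration; cutting after each b"g" is a simplified stand-in for a rolling-hash boundary condition.

```python
def fixed_chunks(data: bytes, size: int):
    """Fixed chunking: boundaries at multiples of `size`."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes):
    """Simplified content-defined chunking: cut after each b"g"."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if b == ord("g"):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

B = b"abcdefg" * 3
C = b"T" + b"abcdefg" * 3
shared_fixed = set(fixed_chunks(B, 7)) & set(fixed_chunks(C, 7))
shared_cdc = set(content_chunks(B)) & set(content_chunks(C))
```

With fixed chunks of length 7, C produces "Tabcdef", "gabcdef", ... and shares no chunk with B; content-defined chunking still produces "abcdefg" chunks for both.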

  • fix reference count

The reference counting scheme for dedup is designed to allow only false positives, meaning that (manifest object with no ref, chunk object with a ref) can happen instead of (manifest object with a ref, chunk object with no ref). To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub.

  ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
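Conceptually, a chunk scrub recomputes each chunk's true reference count from the manifests and repairs mismatches. A hypothetical in-memory sketch (not the tool's implementation; all names are illustrative):

```python
def chunk_scrub(manifests, chunks):
    """Recount references from the manifests and repair any chunk whose
    stored refcount disagrees (e.g. a chunk holding a ref that no
    manifest points to -- the allowed false-positive case)."""
    true_refs = {}
    for chunk_oids in manifests.values():
        for oid in chunk_oids:
            true_refs[oid] = true_refs.get(oid, 0) + 1
    fixed = 0
    for oid, chunk in chunks.items():
        real = true_refs.get(oid, 0)
        if chunk["refcount"] != real:
            chunk["refcount"] = real   # repair the inconsistency
            fixed += 1
    return fixed

# c3 has a ref but no manifest points to it: the false-positive case
manifests = {"foo": ["c1"], "bar": ["c1", "c2"]}
chunks = {"c1": {"refcount": 2}, "c2": {"refcount": 1},
          "c3": {"refcount": 1}}
repaired = chunk_scrub(manifests, chunks)
```

Because errors only ever leave extra references behind, scrubbing can only decrease refcounts; it never risks reclaiming a chunk that is still in use.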