FULL OSDMAP VERSION PRUNING

For each incremental osdmap epoch, the monitor will keep a full osdmap epoch in the store.

While this is great when serving osdmap requests from clients, allowing us to fulfill their request without having to recompute the full osdmap from a myriad of incrementals, it can also become a burden once we start keeping an unbounded number of osdmaps.

The monitors will attempt to keep a bounded number of osdmaps in the store. This number is defined (and configurable) via mon_min_osdmap_epochs, and defaults to 500 epochs. Generally speaking, we will remove older osdmap epochs once we go over this limit.

However, there are a few constraints to removing osdmaps. These are all defined in OSDMonitor::get_trim_to().

In the event one of these conditions is not met, we may go over the bounds defined by mon_min_osdmap_epochs. And if the cluster does not meet the trim criteria for some time (e.g., unclean pgs), the monitor may start keeping a lot of osdmaps. This can start putting pressure on the underlying key/value store, as well as on the available disk space.

One way to mitigate this problem would be to stop keeping full osdmap epochs on disk. We would have to rebuild osdmaps on-demand, or grab them from cache if they had been recently served. We would still have to keep at least one osdmap, and apply all incrementals on top of either this oldest map epoch kept in the store or a more recent map grabbed from cache. While this would be feasible, it seems like a lot of cpu (and potentially IO) would be going into rebuilding osdmaps.

Additionally, this would prevent the aforementioned problem going forward, but would do nothing for stores currently in a state that would truly benefit from not keeping osdmaps.

This brings us to full osdmap pruning.

Instead of not keeping full osdmap epochs, we are going to prune some of them when we have too many.

Deciding whether we have too many will be dictated by a configurable option mon_osdmap_full_prune_min (default: 10000). The pruning algorithm will be engaged once we go over this threshold.

We will not remove all mon_osdmap_full_prune_min full osdmap epochs though. Instead, we are going to poke some holes in the sequence of full maps. By default, we will keep one full osdmap per 10 maps since the last map kept; i.e., if we keep epoch 1, we will also keep epoch 10 and remove full map epochs 2 to 9. The size of this interval is configurable with mon_osdmap_full_prune_interval.

Essentially, we are proposing to keep ~10% of the full maps, but we will always honour the minimum number of osdmap epochs, as defined by mon_min_osdmap_epochs, and these won't be used for the count of the minimum versions to prune. For instance, if we have on-disk versions [1..50000], we would allow the pruning algorithm to operate only over osdmap epochs [1..49500); but, if we have on-disk versions [1..10200], we won't be pruning because the algorithm would only operate on versions [1..9700), and this interval contains fewer versions than the minimum required by mon_osdmap_full_prune_min.

ALGORITHM

Say we have 50,000 osdmap epochs in the store, and we're using the defaults for all configurable options::

   -----------------------------------------------------------
   |1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
   -----------------------------------------------------------
    ^ first                                             last ^

We will prune when all the following constraints are met:

  • number of versions is greater than mon_min_osdmap_epochs;

  • the number of versions between first and prune_to is greater than (or equal to) mon_osdmap_full_prune_min, with prune_to being equal to last minus mon_min_osdmap_epochs.

If any of these conditions fails, we will not prune any maps.

Furthermore, if it is known that we have been pruning, but since then we are no longer satisfying at least one of the above constraints, we will not continue to prune. In essence, we only prune full osdmaps if the number of epochs in the store so warrants it.
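
To make these constraints concrete, here is a minimal C++ sketch; the function name should_prune, its parameters, and the free-standing constants are hypothetical stand-ins for the monitor's configuration, not the actual OSDMonitor code::

   #include <cassert>
   #include <cstdint>

   using version_t = uint64_t;

   // Defaults mentioned in this document, used here as plain constants.
   constexpr version_t mon_min_osdmap_epochs     = 500;
   constexpr version_t mon_osdmap_full_prune_min = 10000;

   // Decide whether the pruning algorithm should engage, given the first
   // and last full osdmap epochs currently in the store.
   bool should_prune(version_t first, version_t last) {
     // constraint 1: number of versions greater than mon_min_osdmap_epochs
     if (last - first <= mon_min_osdmap_epochs) {
       return false;
     }
     // prune_to is the highest epoch the pruning algorithm may touch.
     version_t prune_to = last - mon_min_osdmap_epochs;
     // constraint 2: versions between first and prune_to must reach
     // mon_osdmap_full_prune_min.
     return prune_to - first >= mon_osdmap_full_prune_min;
   }

   int main() {
     assert(should_prune(1, 50000));    // example from the text: would prune
     assert(!should_prune(1, 10200));   // example from the text: would not prune
   }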

As pruning will create gaps in the sequence of full maps, we need to keep track of the intervals of missing maps. We do this by keeping a manifest of pinned maps – i.e., a list of maps that, by being pinned, are not to be pruned.

While pinned maps are not removed from the store, maps between two consecutive pinned maps will be; and the number of maps to be removed will be dictated by the configurable option mon_osdmap_full_prune_interval. The algorithm makes an effort to keep pinned maps apart by as many maps as defined by this option, but in corner cases it may allow smaller intervals. Additionally, as this option is read each time a prune iteration occurs, the interval may change if the user changes the config option.

Pinning maps is performed lazily: we will be pinning maps as we are removing maps. This grants us more flexibility to change the prune interval while pruning is happening, but also considerably simplifies the algorithm, as well as the information we need to keep in the manifest. Below we show a simplified version of the algorithm::

   manifest.pin(first)
   last_to_prune = last - mon_min_osdmap_epochs

   while manifest.get_last_pinned() + prune_interval < last_to_prune AND
         last_to_prune - first > mon_min_osdmap_epochs AND
         last_to_prune - first > mon_osdmap_full_prune_min AND
         num_pruned < mon_osdmap_full_prune_txsize:

       last_pinned = manifest.get_last_pinned()
       new_pinned = last_pinned + prune_interval
       manifest.pin(new_pinned)
       for e in (last_pinned .. new_pinned):
           store.erase(e)
           ++num_pruned

In essence, the algorithm ensures that the first version in the store is always pinned. After all, we need a starting point when rebuilding maps, and we can't simply remove the earliest map we have; otherwise we would be unable to rebuild maps for the very first pruned interval.

Once we have at least one pinned map, each iteration of the algorithm can simply base itself on the manifest's last pinned map (which we can obtain by reading the element at the tail of the manifest's pinned maps list).

We'll next need to determine the interval of maps to be removed: all the maps from last_pinned up to new_pinned, which in turn is nothing more than last_pinned plus mon_osdmap_full_prune_interval. We know that all maps between these two values, last_pinned and new_pinned, can be removed, considering new_pinned has been pinned.

The algorithm ceases to execute as soon as one of the two initial preconditions is not met, or if we do not meet two additional conditions that have no bearing on the algorithm's correctness:

  • We will stop if we are not able to create a new pruning interval properly aligned with mon_osdmap_full_prune_interval that is lower than last_to_prune. There is no particular technical reason why we enforce this requirement, besides allowing us to keep the intervals with an expected size, and preventing small, irregular intervals that would be bound to happen eventually (e.g., pruning continues over the course of several iterations, removing one or two or three maps each time).

  • We will stop once we know that we have pruned more than a certain number of maps. This value is defined by mon_osdmap_full_prune_txsize, and ensures we don't spend an unbounded number of cycles pruning maps. We don't enforce this value religiously (deletes do not cost much), but we make an effort to honor it.

We could do the removal in one go, but we have no idea how long that would take. Therefore, we will perform several iterations, removing at most mon_osdmap_full_prune_txsize osdmaps per iteration.
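
As an illustration of how such a bounded iteration could be driven, here is a hedged C++ sketch mirroring the pseudocode above; Manifest, MapStore, prune_iteration, and the free-standing constants are made-up stand-ins, not the monitor's actual types::

   #include <cstdint>
   #include <set>

   using version_t = uint64_t;

   // The first three match defaults mentioned in this document; txsize is illustrative.
   constexpr version_t mon_min_osdmap_epochs          = 500;
   constexpr version_t mon_osdmap_full_prune_min      = 10000;
   constexpr version_t mon_osdmap_full_prune_interval = 10;
   constexpr version_t mon_osdmap_full_prune_txsize   = 100;

   // Hypothetical stand-ins for the manifest and the full-map store.
   struct Manifest {
     std::set<version_t> pinned;
     void pin(version_t v) { pinned.insert(v); }
     version_t last_pinned() const { return *pinned.rbegin(); }
   };
   struct MapStore {
     std::set<version_t> full_epochs;             // epochs with a full map on disk
     void erase(version_t e) { full_epochs.erase(e); }
   };

   // One bounded prune pass; returns true if another pass is still needed.
   bool prune_iteration(Manifest& manifest, MapStore& store,
                        version_t first, version_t last) {
     if (last - first <= mon_min_osdmap_epochs) {
       return false;                              // nothing we may touch
     }
     if (manifest.pinned.empty()) {
       manifest.pin(first);                       // the first version is always pinned
     }
     const version_t last_to_prune = last - mon_min_osdmap_epochs;
     version_t num_pruned = 0;

     while (manifest.last_pinned() + mon_osdmap_full_prune_interval < last_to_prune &&
            last_to_prune - first > mon_min_osdmap_epochs &&
            last_to_prune - first > mon_osdmap_full_prune_min &&
            num_pruned < mon_osdmap_full_prune_txsize) {
       const version_t last_pinned = manifest.last_pinned();
       const version_t new_pinned  = last_pinned + mon_osdmap_full_prune_interval;
       manifest.pin(new_pinned);
       for (version_t e = last_pinned + 1; e < new_pinned; ++e) {
         store.erase(e);                          // drop the full maps in between
         ++num_pruned;
       }
     }
     // More work remains only if we stopped because we hit the txsize cap.
     return num_pruned >= mon_osdmap_full_prune_txsize &&
            manifest.last_pinned() + mon_osdmap_full_prune_interval < last_to_prune;
   }

Repeating such a call until it reports that no work remains spreads the deletions over several bounded steps, which is the behavior described above.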

In the end, our on-disk map sequence will look similar to::

   ------------------------------------------
   |1|10|20|30|..|49500|49501|..|49999|50000|
   ------------------------------------------
    ^ first                            last ^

Because we are not pruning all versions in one go, we need to keep state about how far along in our pruning we are. With that in mind, we have created a data structure, osdmap_manifest_t, that holds the set of pinned maps::

   struct osdmap_manifest_t:
       set<version_t> pinned;

Given we are only pinning maps while we are pruning, we don't need to keep track of additional state about the last pruned version. We know as a matter of fact that we have pruned all the intermediate maps between any two consecutive pinned maps.
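
For illustration only, a fleshed-out C++ sketch of this structure with the helpers the algorithm needs; the method names here are made up to match the operations described in this document, and do not necessarily match the real interface::

   #include <cstdint>
   #include <iterator>
   #include <set>

   using version_t = uint64_t;

   struct osdmap_manifest_t {
     std::set<version_t> pinned;

     void pin(version_t v) { pinned.insert(v); }

     bool is_pinned(version_t v) const { return pinned.count(v) > 0; }

     version_t get_last_pinned() const {
       return pinned.empty() ? 0 : *pinned.rbegin();
     }

     // Closest pinned epoch at or below v: the full map we would start from
     // when rebuilding a pruned epoch by replaying incrementals.
     // Returns 0 (not a valid epoch) if no such pinned epoch exists.
     version_t get_lower_closest_pinned(version_t v) const {
       auto it = pinned.upper_bound(v);       // first pinned epoch > v
       return it == pinned.begin() ? 0 : *std::prev(it);
     }
   };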

The question one could ask, though, is how we can be sure we pruned all the intermediate maps if, for instance, the monitor crashes. To ensure we are protected against such an event, we always write the osdmap manifest to disk in the same transaction that is deleting the maps. This way we have the guarantee that, if the monitor crashes, we will read the latest version of the manifest: either it contains the newly pinned maps, meaning we also pruned the in-between maps; or we will find the previous version of the osdmap manifest, which does not contain the maps we were pinning at the time we crashed, given the transaction that would have written the updated osdmap manifest was not applied (alongside the map removals).

The osdmap manifest will be written to the store each time we prune, with an updated list of pinned maps. It is written in the same transaction that effectively prunes the maps, so we guarantee the manifest is always up to date. As a consequence, the first time we write the osdmap manifest is the first time we prune. If an osdmap manifest does not exist, we can be certain we do not hold pruned map intervals.
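
To illustrate the crash-safety argument, a schematic C++ sketch follows; the Transaction type, the key names, and prune_one_interval are stand-ins, not the monitor's actual key/value store interface::

   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   using version_t = uint64_t;

   // Minimal stand-in for a key/value store transaction: all queued
   // operations are applied together, which is the atomicity the text
   // relies on for crash safety.
   struct Transaction {
     std::vector<std::pair<std::string, std::string>> puts;
     std::vector<std::string> erases;
     void put(std::string key, std::string value) {
       puts.emplace_back(std::move(key), std::move(value));
     }
     void erase(std::string key) { erases.push_back(std::move(key)); }
     void apply(std::map<std::string, std::string>& kv) const {
       for (const auto& [k, v] : puts) kv[k] = v;
       for (const auto& k : erases) kv.erase(k);
     }
   };

   // Write the updated manifest and delete the pruned full maps in the same
   // transaction: after a crash we either see both effects or neither.
   void prune_one_interval(std::map<std::string, std::string>& store,
                           version_t last_pinned, version_t new_pinned,
                           const std::string& encoded_manifest) {
     Transaction t;
     t.put("osdmap_manifest", encoded_manifest);
     for (version_t e = last_pinned + 1; e < new_pinned; ++e) {
       t.erase("full_" + std::to_string(e));   // illustrative key naming
     }
     t.apply(store);
   }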

We will rely on the manifest to ascertain whether we have pruned map intervals. In theory, this will always be the on-disk osdmap manifest, but we make sure to read the on-disk osdmap manifest each time we update from paxos; this way we always ensure having an up-to-date in-memory osdmap manifest.

Once we finish pruning maps, we will keep the manifest in the store, to allow us to easily find which maps have been pinned (instead of checking the store until we find a map). This has the added benefit of allowing us to quickly figure out the next interval we need to prune (i.e., last pinned plus the prune interval). This doesn't, however, mean we will forever keep the osdmap manifest: it will no longer be required once the monitor trims osdmaps and the earliest available epoch in the store is greater than the last map we pruned.

The same conditions from OSDMonitor::get_trim_to() that force the monitor to keep a lot of osdmaps, thus requiring us to prune, may eventually change and allow the monitor to remove some of its oldest maps.

MAP TRIMMING

If the monitor trims maps, we must then adjust the osdmap manifest to reflect our pruning status, or remove the manifest entirely if it no longer makes sense to keep it. For instance, take the map sequence from before, but let us assume we did not finish pruning all the maps::

   -------------------------------------------------------------
   |1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
   -------------------------------------------------------------
    ^ first           ^ pinned.last()                    last ^

   pinned = {1, 10, 20, ..., 490, 500}

Now let us assume that the monitor will trim up to epoch 501. This means removing all maps prior to epoch 501, and updating the first_committed pointer to 501. Given removing all those maps would invalidate our existing pruning efforts, we can consider our pruning has finished and drop our osdmap manifest. Doing so also simplifies starting a new prune, if all the starting conditions are met once we have refreshed our state from the store.

We would then have the following map sequence::

   ---------------------------------------
   |501|502|..|49500|49501|..|49999|50000|
   ---------------------------------------
    ^ first                         last ^

However, imagine a slightly more convoluted scenario: the monitor will trim up to epoch 491. In this case, epoch 491 has been previously pruned from the store.

Given we will always need to have the oldest known map in the store, before we trim we will have to check whether that map is in the prune interval (i.e., if said map epoch belongs to [ pinned.first()..pinned.last() )). If so, we need to check whether it is a pinned map, in which case there is not much to do aside from removing lower epochs from the manifest's pinned list. On the other hand, if the map being trimmed to is not a pinned map, we will need to rebuild said map and pin it, and only then will we remove the pinned maps prior to the map's epoch.

In this case, we would end up with the following sequence::

   -----------------------------------------------
   |491|500|501|502|..|49500|49501|..|49999|50000|
   -----------------------------------------------
    ^   ^- pinned.last()                   last ^
    `- first

There is still an edge case that we should mention. Consider that we are going to trim up to epoch 499, which is the very last pruned epoch.

Much like the scenario above, we would end up writing osdmap epoch 499 to the store; but what should we do about pinned maps and pruning?

The simplest solution is to drop the osdmap manifest. After all, given we are trimming to the last pruned map, and we are rebuilding this map, we can guarantee that all maps greater than epoch 499 are sequential (because we have not pruned any of them). Dropping the osdmap manifest in this case is essentially the same as if we were trimming past the last pruned epoch: we can prune again later if we meet the required conditions.
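
Tying the trimming cases above together, here is a speculative C++ sketch of how the manifest could be adjusted when trimming to a given epoch; handle_trim, rebuild_and_store_full_map, and the simplified pinned set are all hypothetical::

   #include <cstdint>
   #include <optional>
   #include <set>

   using version_t = uint64_t;
   using pinned_set = std::set<version_t>;   // simplified stand-in for the manifest

   // Hypothetical helper: rebuild the full map for epoch e from the closest
   // lower pinned full map plus the incrementals up to e, and store it.
   void rebuild_and_store_full_map(version_t /*e*/) {
     // In the monitor this would apply incrementals on top of a pinned full map.
   }

   // Adjust the pinned set when trimming to trim_to.
   // Returns the set to keep, or nullopt if the manifest should be dropped.
   std::optional<pinned_set> handle_trim(pinned_set pinned, version_t trim_to) {
     if (pinned.empty()) {
       return std::nullopt;                   // nothing was ever pruned
     }
     const version_t last_pinned = *pinned.rbegin();
     if (trim_to < last_pinned && pinned.count(trim_to) == 0) {
       // trim_to falls inside a pruned gap: rebuild that full map and pin it,
       // as it becomes the oldest map we must be able to serve.
       rebuild_and_store_full_map(trim_to);
       pinned.insert(trim_to);
     }
     if (trim_to + 1 >= last_pinned) {
       // No pruned epochs remain above trim_to: every surviving map is
       // sequential again, so the manifest is no longer needed.
       return std::nullopt;
     }
     // Otherwise keep the manifest, dropping pins below the new oldest epoch.
     pinned.erase(pinned.begin(), pinned.lower_bound(trim_to));
     return pinned;
   }

With pinned = {1, 10, ..., 490, 500}, trimming to 501 or 499 drops the manifest, while trimming to 491 rebuilds and pins epoch 491 and keeps {491, 500}, matching the scenarios above.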

And, with this, we have covered full osdmap pruning in depth. Later in this document one can find detailed REQUIREMENTS, CONDITIONS & INVARIANTS for the whole algorithm, from pruning to trimming. Additionally, the next section details several additional checks to guarantee the sanity of our configuration options. Enjoy.

CONFIGURATION OPTIONS SANITY CHECKS

We perform additional checks before pruning to ensure all configuration options involved are sane; a sketch of these checks follows the list below:

  • If mon_osdmap_full_prune_interval is zero we will not prune; we require an actual positive number, greater than one, to be able to prune maps. If the interval is one, we would not actually be pruning any maps, as the interval between pinned maps would essentially be a single epoch. This means we would have zero maps in-between pinned maps, hence no maps would ever be pruned.

  • If mon_osdmap_full_prune_min is zero we will not prune; we require a positive value greater than zero so we know the threshold over which we should prune. We don't want to guess.

  • If mon_osdmap_full_prune_interval is greater than mon_osdmap_full_prune_min we will not prune, as it is impossible to ascertain a proper prune interval.

  • If mon_osdmap_full_prune_txsize is lower than mon_osdmap_full_prune_interval we will not prune; we require a txsize with a value at least equal to interval, and (depending on the value of the latter) ideally higher.
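
A compact sketch of these checks, using a made-up option struct and function name (the monitor reads these values from its configuration instead)::

   #include <cstdint>

   // Illustrative bundle of the options involved.
   struct prune_options {
     uint64_t prune_min;        // mon_osdmap_full_prune_min
     uint64_t prune_interval;   // mon_osdmap_full_prune_interval
     uint64_t prune_txsize;     // mon_osdmap_full_prune_txsize
   };

   // Returns true only if the option values allow pruning to proceed.
   bool prune_options_are_sane(const prune_options& o) {
     if (o.prune_interval <= 1) {
       return false;   // need at least one prunable map between pins
     }
     if (o.prune_min == 0) {
       return false;   // no threshold to decide when to start pruning
     }
     if (o.prune_interval > o.prune_min) {
       return false;   // cannot derive a proper prune interval
     }
     if (o.prune_txsize < o.prune_interval) {
       return false;   // txsize must be at least as large as the interval
     }
     return true;
   }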

REQUIREMENTS, CONDITIONS & INVARIANTS

REQUIREMENTS

  • All monitors in the quorum need to support pruning.

  • Once pruning has been enabled, monitors not supporting pruning will not be allowed in the quorum, nor will they be allowed to synchronize.

  • Removing the osdmap manifest results in disabling the pruning feature quorum requirement. This means that monitors not supporting pruning will be allowed to synchronize and join the quorum, provided they support any other features required.

CONDITIONS & INVARIANTS

  • Pruning has never happened, or we have trimmed past its previous intervals::

       invariant: first_committed > 1

       condition: pinned.empty() AND !store.exists(manifest)

  • Pruning has happened at least once::

       invariant: first_committed > 0
       invariant: !pinned.empty()
       invariant: pinned.first() == first_committed
       invariant: pinned.last() < last_committed

       precond:   pinned.last() < prune_to AND
                  pinned.last() + prune_interval < prune_to

       postcond:  pinned.size() > old_pinned.size() AND
                  (for each v in [pinned.first()..pinned.last()]:
                      if pinned.count(v) > 0: store.exists_full(v)
                      else: !store.exists_full(v)
                  )

  • Pruning has finished::

       invariant: first_committed > 0
       invariant: !pinned.empty()
       invariant: pinned.first() == first_committed
       invariant: pinned.last() < last_committed

       condition: pinned.last() == prune_to OR
                  pinned.last() + prune_interval < prune_to

  • Pruning intervals can be trimmed::

       precond:   OSDMonitor::get_trim_to() > 0

       condition: !pinned.empty()

       invariant: pinned.first() == first_committed
       invariant: pinned.last() < last_committed
       invariant: pinned.first() <= OSDMonitor::get_trim_to()
       invariant: pinned.last() >= OSDMonitor::get_trim_to()

  • Trim pruned intervals::

       invariant: !pinned.empty()
       invariant: pinned.first() == first_committed
       invariant: pinned.last() < last_committed
       invariant: pinned.first() <= OSDMonitor::get_trim_to()
       invariant: pinned.last() >= OSDMonitor::get_trim_to()

       postcond:  pinned.empty() OR
                  (pinned.first() == OSDMonitor::get_trim_to() AND
                   pinned.last() > pinned.first() AND
                   (for each v in [0..pinned.first()]:
                       !store.exists(v) AND
                       !store.exists_full(v)
                   ) AND
                   (for each m in [pinned.first()..pinned.last()]:
                       if pinned.count(m) > 0: store.exists_full(m)
                       else: !store.exists_full(m) AND store.exists(m)
                   )
                  )
       postcond:  !pinned.empty() OR
                  (!store.exists(manifest) AND
                   (for each v in [pinned.first()..pinned.last()]:
                       !store.exists(v) AND
                       !store.exists_full(v)
                   )
                  )