Fast Cluster Restore

The Fast Cluster Restore procedure documented in this page is recommended to speed up the performance of arangorestore in a Cluster environment.

It is assumed that a Cluster environment is running and that a logical backup with arangodump has already been taken.

The procedure described in this page is particularly useful for ArangoDB version 3.3, but can be used in 3.4 and later versions as well. Note that from v3.4, arangorestore includes the option --threads, which is already a good first step towards parallelizing the restore and speeding it up. However, the procedure below allows for even further parallelization (making use of different Coordinators), and the part regarding temporarily setting the replication factor to 1 is still useful in 3.4 and later versions.

The speed improvement obtained by the procedure below is achieved by:

  • Restoring into a Cluster that has replication factor 1, thus reducing the number of network hops needed during the restore operation (the replication factor is reverted to its initial value at the end of the procedure - steps #2, #3 and #6).
  • Restoring multiple collections in parallel on different Coordinators (steps #4 and #5).

Please refer to this section for further context on the factors affecting restore speed when restoring using arangorestore in a Cluster.

Step 1: Copy the dump directory to all Coordinators

The first step is to copy the directory that contains the dump to all machines where Coordinators are running.

This step is not strictly required, as the backup can be restored over the network. However, if the restore is executed locally, the restore speed is significantly improved.
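
As a minimal sketch (assuming passwordless SSH access, and using the hypothetical hostnames coord1, coord2, coord3 and a hypothetical dump path; adjust both to your environment), the dump directory could be copied with scp:

# Hypothetical Coordinator hostnames and dump path; adjust to your environment.
for host in coord1 coord2 coord3; do
  scp -r /path/to/dump "$host":/path/to/dump
done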

Step 2: Restore collection structures

The collection structures have to be restored from exactly one Coordinator (any Coordinator can be used) with a command similar to the following one. Please add any additional option needed for your specific use case, e.g. --create-database if the database into which you want to restore does not exist yet:

arangorestore \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.password <password> \
  --import-data false \
  --input-directory <dump-directory>

If you are using v3.3.22 or higher, or v3.4.2 or higher, please also add the option --replication-factor 1 to the command above.
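
On those versions, the command from this step would then look similar to the following (a sketch; placeholders as above):

arangorestore \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.password <password> \
  --import-data false \
  --replication-factor 1 \
  --input-directory <dump-directory>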

The option --import-data false tells arangorestore to restore only the collection structure and no data.

Step 3: Set Replication Factor to 1

This step is not needed if you are using v3.3.22 or higher, or v3.4.2 or higher, and you have used the option --replication-factor 1 in the previous step.

To speed up the restore, it is possible to set the replication factor to 1 before importing any data. Run the following command from exactly one Coordinator (any Coordinator can be used):

echo 'db._collections().filter(function(c) { return c.name()[0] !== "_"; })
.forEach(function(c) { print("collection:", c.name(), "replicationFactor:",
c.properties().replicationFactor); c.properties({ replicationFactor: 1 }); });' \
| arangosh \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.username <user-name> \
  --server.password <password>
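
Optionally, you can check that the change took effect with a similar arangosh one-liner that only prints the replicationFactor of each non-system collection (a minimal sketch based on the command above):

echo 'db._collections().filter(function(c) { return c.name()[0] !== "_"; })
.forEach(function(c) { print(c.name(), c.properties().replicationFactor); });' \
| arangosh \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.username <user-name> \
  --server.password <password>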

Step 4: Create parallel restore scripts

Now that the Cluster is prepared, the parallelRestore script will be used.

Please create the parallelRestore script below on any of your Coordinators.

When executed (see below for further details), this script will create other scripts that can then be copied to and executed on each Coordinator.

#!/bin/sh
#
# Version: 0.3
#
# Release Notes:
# - v0.3: fixed a bug that was happening when the collection name included an underscore
# - v0.2: compatibility with version 3.4: now each coordinator_<number-of-coordinator>.sh
#         includes a single restore command (instead of one for each collection),
#         which allows making use of the --threads option in v3.4.0 and later
# - v0.1: initial version
if test -z "$ARANGOSH" ; then
  export ARANGOSH=arangosh
fi
cat > /tmp/parallelRestore$$.js <<'EOF'
var fs = require("fs");
var print = require("internal").print;
var exit = require("internal").exit;
var arangorestore = "arangorestore";
var env = require("internal").env;
if (env.hasOwnProperty("ARANGORESTORE")) {
  arangorestore = env["ARANGORESTORE"];
}
// Check ARGUMENTS: dumpDir coordinator1 coordinator2 ...
if (ARGUMENTS.length < 2) {
  print("Need at least two arguments DUMPDIR and COORDINATOR_ENDPOINTS!");
  exit(1);
}
var dumpDir = ARGUMENTS[0];
var coordinators = ARGUMENTS[1].split(",");
var otherArgs = ARGUMENTS.slice(2);
// Quickly check the dump dir:
var files = fs.list(dumpDir).filter(f => !fs.isDirectory(f));
var found = files.indexOf("ENCRYPTION");
if (found === -1) {
  print("This directory does not have an ENCRYPTION entry.");
  exit(2);
}
// Remove ENCRYPTION entry:
files = files.slice(0, found).concat(files.slice(found + 1));
for (let i = 0; i < files.length; ++i) {
  if (files[i].slice(-5) !== ".json") {
    print("This directory has files which do not end in '.json'!");
    exit(3);
  }
}
files = files.map(function(f) {
  var fullName = fs.join(dumpDir, f);
  var collName = "";
  if (f.slice(-10) === ".data.json") {
    var pos;
    if (f.slice(0, 1) === "_") {   // system collection
      pos = f.slice(1).indexOf("_") + 1;
      collName = "_" + f.slice(1, pos);
    } else {
      pos = f.lastIndexOf("_");
      collName = f.slice(0, pos);
    }
  }
  return {name: fullName, collName, size: fs.size(fullName)};
});
files = files.sort(function(a, b) { return b.size - a.size; });
var dataFiles = [];
for (let i = 0; i < files.length; ++i) {
  if (files[i].name.slice(-10) === ".data.json") {
    dataFiles.push(i);
  }
}
// Produce the scripts, one for each coordinator:
var scripts = [];
var collections = [];
for (let i = 0; i < coordinators.length; ++i) {
  scripts.push([]);
  collections.push([]);
}
var cnum = 0;
var temp = '';
var collections = [];
for (let i = 0; i < dataFiles.length; ++i) {
  var f = files[dataFiles[i]];
  if (typeof collections[cnum] == 'undefined') {
    collections[cnum] = (`--collection ${f.collName}`);
  } else {
    collections[cnum] += (` --collection ${f.collName}`);
  }
  cnum += 1;
  if (cnum >= coordinators.length) {
    cnum = 0;
  }
}
var cnum = 0;
for (let i = 0; i < coordinators.length; ++i) {
  scripts[i].push(`${arangorestore} --input-directory ${dumpDir} --server.endpoint ${coordinators[i]} ` + collections[i] + ' ' + otherArgs.join(" "));
}
for (let i = 0; i < coordinators.length; ++i) {
  let f = "coordinator_" + i + ".sh";
  print("Writing file", f, "...");
  fs.writeFileSync(f, scripts[i].join("\n"));
}
EOF
${ARANGOSH} --javascript.execute /tmp/parallelRestore$$.js -- "$@"
rm /tmp/parallelRestore$$.js

To run this script, all Coordinator endpoints of the Cluster have to be provided. The script accepts all options of the tool arangorestore.

The command below can for instance be used on a Cluster with three Coordinators. Note that the Coordinator endpoints are passed as a single comma-separated argument:

./parallelRestore <dump-directory> \
  tcp://<ip-of-coordinator1>:<port-of-coordinator1>,tcp://<ip-of-coordinator2>:<port-of-coordinator2>,tcp://<ip-of-coordinator3>:<port-of-coordinator3> \
  --server.username <username> \
  --server.password <password> \
  --server.database <database-name> \
  --create-collection false

Notes:

  • The option --create-collection false is passed since the collection structures were already created in the previous step.
  • Starting from v3.4.0, the arangorestore option --threads N can be passed to the command above, where N is an integer, to further parallelize the restore (the default is --threads 2).

The above command will create three scripts, where three corresponds to the number of listed Coordinators.

The resulting scripts are named coordinator_<number-of-coordinator>.sh (e.g. coordinator_0.sh, coordinator_1.sh, coordinator_2.sh).
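
For illustration only, a generated coordinator_0.sh is a plain shell script containing a single arangorestore invocation, roughly like the following (the collection names products and orders are hypothetical; the actual distribution of collections over the scripts depends on your dump):

arangorestore --input-directory <dump-directory> --server.endpoint tcp://<ip-of-coordinator1>:<port-of-coordinator1> --collection products --collection orders --server.username <username> --server.password <password> --server.database <database-name> --create-collection false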

Step 5: Execute parallel restore scripts

The coordinator_<number-of-coordinator>.sh scripts that were created in the previous step now have to be executed on each machine where a Coordinator is running. This will start a parallel restore of the dump.
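
As a minimal sketch (reusing the hypothetical hostnames coord1, coord2 and coord3 from Step 1, and assuming the dump directory is available at the same path on every machine), the scripts could be distributed and started like this:

# Copy each generated script to "its" Coordinator machine ...
scp coordinator_0.sh coord1:
scp coordinator_1.sh coord2:
scp coordinator_2.sh coord3:

# ... then, on every Coordinator machine, run the corresponding script, e.g. on coord1:
sh coordinator_0.sh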

Step 6: Revert to the initial Replication Factor

Once the arangorestore process on every Coordinator has completed, the replication factor has to be set back to its initial value.

Run the following command from exactly one Coordinator (any Coordinator can be used). Please adjust the replicationFactor value to your specific case (2 in the example below):

echo 'db._collections().filter(function(c) { return c.name()[0] !== "_"; })
.forEach(function(c) { print("collection:", c.name(), "replicationFactor:",
c.properties().replicationFactor); c.properties({ replicationFactor: 2 }); });' \
| arangosh \
  --server.endpoint <endpoint-of-a-coordinator> \
  --server.database <database-name> \
  --server.username <user-name> \
  --server.password <password>