Dumping Data from an ArangoDB database

To dump data from an ArangoDB server instance, you will need to invoke arangodump. Dumps can be re-imported with arangorestore. arangodump can be invoked by executing the following command:

    unix> arangodump --output-directory "dump"

This will connect to an ArangoDB server and dump all non-system collections from the default database (_system) into an output directory named dump. Invoking arangodump will fail if the output directory already exists. This is an intentional security measure to prevent you from accidentally overwriting already dumped data. If you are positive that you want to overwrite data in the output directory, you can use the parameter --overwrite true to confirm this:

    unix> arangodump --output-directory "dump" --overwrite true

arangodump will by default connect to the _system database using the default endpoint. If you want to connect to a different database or a different endpoint, or use authentication, you can use the following command-line options:

  • --server.database <string>: name of the database to connect to
  • --server.endpoint <string>: endpoint to connect to
  • --server.username <string>: username
  • --server.password <string>: password to use (omit this and you’ll be prompted for the password)
  • --server.authentication <bool>: whether or not to use authentication

Here’s an example of dumping data from a non-standard endpoint, using a dedicated database name:

    unix> arangodump --server.endpoint tcp://192.168.173.13:8531 --server.username backup --server.database mydb --output-directory "dump"

When finished, arangodump will print out a summary line with some aggregate statistics about what it did, e.g.:

    Processed 43 collection(s), wrote 408173500 byte(s) into datafiles, sent 88 batch(es)

By default, arangodump will dump both structural information and documents from all non-system collections. To adjust this, there are the following command-line arguments:

  • --dump-data <bool>: set to true to include documents in the dump. Set to false to exclude documents. The default value is true.
  • --include-system-collections <bool>: whether or not to include system collections in the dump. The default value is false.

For example, to only dump structural information of all collections (including system collections), use:

    unix> arangodump --dump-data false --include-system-collections true --output-directory "dump"

To restrict the dump to just specific collections, there is the --collection option. It can be specified multiple times if required:

    unix> arangodump --collection myusers --collection myvalues --output-directory "dump"

Structural information for a collection will be saved in files with name pattern <collection-name>.structure.json. Each structure file will contain a JSON object with these attributes:

  • parameters: contains the collection properties
  • indexes: contains the collection indexes

Document data for a collection will be saved in files with name pattern <collection-name>.data.json. Each line in a data file is a document insertion/update or deletion marker, along with some metadata.
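The file layout described above can be inspected with a few lines of Python. This is a minimal sketch, assuming an unencrypted dump directory that follows exactly the naming patterns given here; the function name summarize_dump is made up for illustration, and the sketch only counts the lines of a data file without interpreting the markers.

```python
import json
from pathlib import Path


def summarize_dump(dump_dir):
    """Return {collection: {"indexes": n, "data_lines": n}} for a dump directory."""
    summary = {}
    suffix = ".structure.json"
    for structure_file in sorted(Path(dump_dir).glob("*" + suffix)):
        name = structure_file.name[: -len(suffix)]
        # Each structure file holds one JSON object with "parameters" and "indexes".
        structure = json.loads(structure_file.read_text(encoding="utf-8"))
        data_file = structure_file.with_name(name + ".data.json")
        data_lines = 0
        if data_file.exists():
            with data_file.open(encoding="utf-8") as handle:
                # Each non-empty line is one insertion/update or deletion marker.
                data_lines = sum(1 for line in handle if line.strip())
        summary[name] = {
            "indexes": len(structure.get("indexes", [])),
            "data_lines": data_lines,
        }
    return summary
```

Calling summarize_dump("dump") after one of the commands above would list every dumped collection with its index count and number of data lines.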

Starting with Version 2.1 of ArangoDB, the arangodump tool also supports sharding. Simply point it to one of the coordinators and it will behave exactly as described above, working on sharded collections in the cluster.

However, as opposed to the single instance situation, this operation does not guarantee to dump a consistent snapshot if write operations happen during the dump operation. It is therefore recommended not to perform any data-modification operations on the cluster whilst arangodump is running.

As above, the output will be one structure description file and one data file per sharded collection. Note that the data in the data file is sorted first by shards and within each shard by ascending timestamp. The structural information of the collection contains the number of shards and the shard keys.

Note that the version of the arangodump client tool needs to match the version of the ArangoDB server it connects to.

Advanced cluster options

Starting with version 3.1.17, collections may be created with shard distribution identical to an existing prototypical collection; i.e. shards are distributed in the very same pattern as in the prototype collection. Such collections cannot be dumped without the reference collection, or arangodump will yield an error.

    unix> arangodump --collection clonedCollection --output-directory "dump"
    ERROR Collection clonedCollection's shard distribution is based on a that of collection prototypeCollection, which is not dumped along. You may dump the collection regardless of the missing prototype collection by using the --ignore-distribute-shards-like-errors parameter.

There are two ways to approach that problem. Solve it, i.e. dump the prototype collection along:

    unix> arangodump --collection clonedCollection --collection prototypeCollection --output-directory "dump"
    Processed 2 collection(s), wrote 81920 byte(s) into datafiles, sent 1 batch(es)

Or override that behavior to be able to dump the collection individually:

    unix> arangodump --collection clonedCollection --output-directory "dump" --ignore-distribute-shards-like-errors
    Processed 1 collection(s), wrote 34217 byte(s) into datafiles, sent 1 batch(es)

Note that, in consequence, restoring such a collection without its prototype is affected. See arangorestore.

Encryption

In the ArangoDB Enterprise Edition there are the additional parameters:

Encryption key stored in file

--encryption.keyfile path-of-keyfile

The file path-of-keyfile must contain the encryption key. This file must be secured, so that only arangod can access it. You should also ensure that, in case someone steals the hardware, they will not be able to read the file, for example by encrypting /mytmpfs or by creating an in-memory file system under /mytmpfs.

Encryption key generated by a program

--encryption.key-generator path-to-my-generator

The program path-to-my-generator must output the encryption key on standard output and exit.

Creating keys

The encryption keyfile must contain 32 bytes of random data.

You can create it with a command like this:

    dd if=/dev/random bs=1 count=32 of=yourSecretKeyFile

For security, it is best to create these keys offline (away from your database servers) and directly store them in your secret management tool.
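Where dd is not available, the same 32-byte key can be produced from Python. This is a minimal sketch rather than an official tool: the function name write_keyfile is hypothetical, and restricting the file to owner-only permissions is an assumption that goes slightly beyond the dd command above.

```python
import os
import secrets


def write_keyfile(path):
    """Write 32 random bytes to `path`, accessible by the owner only."""
    key = secrets.token_bytes(32)  # cryptographically strong random data
    # O_EXCL refuses to overwrite an existing key file;
    # mode 0o600 limits read/write access to the owner (on POSIX systems).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key)
```

The resulting file can then be passed to arangodump with --encryption.keyfile.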

Data Maskings

--maskings path-of-config

Introduced in: v3.3.22, v3.4.2

This feature allows you to define how sensitive data shall be dumped. It is possible to exclude collections entirely, limit the dump to the structural information of a collection (name, indexes, sharding etc.) or to obfuscate certain fields for a dump. A JSON configuration file is used to define which collections and fields to mask and how.

The general structure of the configuration file looks like this:

    {
      "collection-name": {
        "type": MASKING_TYPE,
        "maskings": [
          MASKING1,
          MASKING2,
          ...
        ]
      },
      ...
    }

At the top level, there is an object with collection names and the masking settings to be applied to them. Using "*" as collection name defines a default behavior for collections not listed explicitly.

Masking Types

type is a string describing how to mask the given collection. Possible values are:

  • "exclude": the collection is ignored completely and not even the structure data is dumped.
  • "structure": only the collection structure is dumped, but no data at all.
  • "masked": the collection structure and all data is dumped. However, the data is subject to obfuscation defined in the attribute maskings. It is an array of objects, with one object per field to mask. Each object needs at least a path and a type attribute to define which field to mask and which masking function to apply. Depending on the masking type, there may exist additional attributes.
  • "full": the collection structure and all data is dumped. No masking is applied to this collection at all.

Example

    {
      "private": {
        "type": "exclude"
      },
      "log": {
        "type": "structure"
      },
      "person": {
        "type": "masked",
        "maskings": [
          {
            "path": "name",
            "type": "xifyFront",
            "unmaskedLength": 2
          },
          {
            "path": ".security_id",
            "type": "xifyFront",
            "unmaskedLength": 2
          }
        ]
      }
    }

  • The collection called private is completely ignored.
  • Only the structure of the collection log is dumped, but not the data itself.
  • The collection person is dumped completely but with maskings applied:
    • The name field is masked if it occurs on the top-level.
    • It also masks fields with the name security_id anywhere in the document.
    • The masking function is of type xifyFront in both cases. The additional setting unmaskedLength is specific to xifyFront.
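Before handing a configuration file to arangodump, it can help to check it against the rules stated above. The following is a minimal sketch of such a check, not part of ArangoDB: the function name check_maskings_config is hypothetical, and it only verifies the four collection-level types and the presence of path and type in each masking.

```python
import json

MASKING_TYPES = {"exclude", "structure", "masked", "full"}


def check_maskings_config(text):
    """Return a list of problems found in a maskings configuration (JSON text)."""
    config = json.loads(text)
    if not isinstance(config, dict):
        return ["top level must be an object keyed by collection name"]
    problems = []
    for collection, settings in config.items():
        if not isinstance(settings, dict):
            problems.append(f"{collection}: settings must be an object")
            continue
        kind = settings.get("type")
        if kind not in MASKING_TYPES:
            problems.append(f"{collection}: unknown type {kind!r}")
        if kind == "masked":
            # Every masking needs at least "path" and "type".
            for index, masking in enumerate(settings.get("maskings", [])):
                for key in ("path", "type"):
                    if key not in masking:
                        problems.append(f"{collection}: masking {index} lacks {key!r}")
    return problems
```

An empty list means the file passes these basic checks; it says nothing about whether the masking functions themselves accept the additional attributes.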

Masking vs. dump-data option

arangodump also supports a very coarse masking with the option --dump-data false. This basically removes all data from the dump.

You can either use --maskings or --dump-data false, but not both.

Masking vs. include-collection option

arangodump also supports a very coarse masking with the option --include-collection. This will restrict the collections that are dumped to the ones explicitly listed.

It is possible to combine --maskings and --include-collection. This will take the intersection of exportable collections.

Path

path defines which field to obfuscate. There can only be a single path per masking, but an unlimited amount of maskings per collection.

To mask a top-level attribute value, the path is simply the attribute name, for instance "name" to mask the value "foobar":

    {
      "_key": "1234",
      "name": "foobar"
    }

The path to a nested attribute name with a top-level attribute person as its parent is "person.name":

    {
      "_key": "1234",
      "person": {
        "name": "foobar"
      }
    }

If the path starts with a . then it matches any path ending in name. For example, .name will match the field name of all leaf attributes in the document. Leaf attributes are attributes whose value is null, true, false, or of data type string, number or array. That means, it matches name at the top level as well as at any nested level (e.g. foo.bar.name), but not nested objects themselves.

On the other hand, name will only match leaf attributes at top level. person.name will match the attribute name of a leaf in the top-level object person. If person was itself an object, then the masking settings for this path would be ignored, because it is not a leaf attribute.

If the attribute value is an array then the masking is applied to all array elements individually.

If you have an attribute name that contains a dot, you need to quote the name with either a tick or a backtick. For example:

    "path": "´name.with.dots´"

or

    "path": "`name.with.dots`"

Example

The following configuration will replace the value of the name attribute with an “xxxx”-masked string:

    {
      "type": "xifyFront",
      "path": ".name",
      "unmaskedLength": 2
    }

The document:

    {
      "name": "top-level-name",
      "age": 42,
      "nicknames" : [ { "name": "hugo" }, "egon" ],
      "other": {
        "name": [ "emil", { "secret": "superman" } ]
      }
    }

… will be changed as follows:

    {
      "name": "xxxxxxxxxxxxme",
      "age": 42,
      "nicknames" : [ { "name": "xxgo" }, "egon" ],
      "other": {
        "name": [ "xxil", { "secret": "superman" } ]
      }
    }

The values "egon" and "superman" are not replaced, because they are not contained in an attribute value of which the attribute name is name.
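The path rules can be illustrated with a small Python model. This is only a sketch of the behavior described in this section, not ArangoDB's implementation: the function mask_document and its mask callback are hypothetical, quoting of attribute names containing dots is not handled, and treating a leading dot as a suffix match on the attribute path is an assumption.

```python
def mask_document(doc, path, mask):
    """Return a copy of `doc` with `mask` applied to the leaf values selected by `path`."""
    anywhere = path.startswith(".")          # ".name" matches at any nesting level
    wanted = path.lstrip(".").split(".")

    def matches(keys):
        if anywhere:
            return keys[-len(wanted):] == wanted
        return keys == wanted                # "name" or "person.name": exact path only

    def visit(value, keys, selected):
        if isinstance(value, dict):
            # Objects are never masked themselves; each attribute is checked on its own.
            return {k: visit(v, keys + [k], matches(keys + [k])) for k, v in value.items()}
        if isinstance(value, list):
            # Arrays hand the decision down to every element, including sub-arrays.
            return [visit(item, keys, selected) for item in value]
        return mask(value) if selected else value

    return visit(doc, [], False)
```

With path ".name" and a placeholder callback such as lambda value: "***", the document from the example above has its three name values replaced while "egon" and "superman" stay untouched.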

Nested objects and arrays

If you specify a path and the attribute value is an array then the masking decision is applied to each element of the array as if this was the value of the attribute. This applies to arrays inside the array too.

If the attribute value is an object, then it is ignored and the attribute does not get masked. To mask nested fields, specify the full path for each leaf attribute.

If some documents have an attribute email with a string as value, but other documents store a nested object under the same attribute name, then make sure to set up proper masking for the latter case, in which sub-attributes will not get masked if there is only a masking configured for the attribute email but not its nested attributes.

Examples

Masking email with the Xify Front function will convert:

    {
      "email" : "email address"
    }

… into:

    {
      "email" : "xxxil xxxxxss"
    }

because email is a leaf attribute. The document:

    {
      "email" : [
        "address one",
        "address two",
        [
          "address three"
        ]
      ]
    }

… will be converted into:

    {
      "email" : [
        "xxxxxss xne",
        "xxxxxss xwo",
        [
          "xxxxxss xxxee"
        ]
      ]
    }

… because the masking is applied to each array element individually including the elements of the sub-array. The document:

    {
      "email" : {
        "address" : "email address"
      }
    }

… will not be changed because email is not a leaf attribute. To mask the email address, you could use the paths email.address or .address.

Masking Functions

The following masking functions are only available in the Enterprise Edition.

Random String

This masking type will replace all values of attributes with key name with an anonymized string. It is not guaranteed that the string will be of the same length.

A hash of the original string is computed. If the original string is shorter than the hash, then the hash will be used. This will result in a longer replacement string. If the string is longer than the hash, then characters will be repeated as many times as needed to reach the full original string length.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "randomString"

Example

    {
      "path": ".name",
      "type": "randomString"
    }

The above masking setting applies to all leaf attributes with the name name. A document like:

    {
      "_key" : "1234",
      "name" : [
        "My Name",
        {
          "other" : "Hallo Name"
        },
        [
          "Name One",
          "Name Two"
        ],
        true,
        false,
        null,
        1.0,
        1234,
        "This is a very long name"
      ],
      "deeply": {
        "nested": {
          "name": "John Doe",
          "not-a-name": "Pizza"
        }
      }
    }

… will be converted to:

    {
      "_key": "1234",
      "name": [
        "+y5OQiYmp/o=",
        {
          "other": "Hallo Name"
        },
        [
          "ihCTrlsKKdk=",
          "yo/55hfla0U="
        ],
        true,
        false,
        null,
        1.0,
        1234,
        "hwjAfNe5BGw=hwjAfNe5BGw="
      ],
      "deeply": {
        "nested": {
          "name": "55fHctEM/wY=",
          "not-a-name": "Pizza"
        }
      }
    }
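The length rule can be sketched in Python as follows. The hash that arangodump actually uses is not stated here, so a keyed BLAKE2b digest encoded as Base64 (12 characters, like the values in the example) is used purely as a stand-in; the function name random_string, the per-run SECRET, and truncating the repeated hash to the original length are likewise assumptions.

```python
import base64
import hashlib
import secrets

SECRET = secrets.token_bytes(16)  # assumption: a fresh secret per run


def random_string(value):
    """Replace a string by a hash; repeat the hash for longer inputs."""
    if not isinstance(value, str):
        return value  # the example above leaves booleans, null and numbers untouched
    digest = hashlib.blake2b(value.encode("utf-8"), key=SECRET, digest_size=8).digest()
    token = base64.b64encode(digest).decode("ascii")  # 8 bytes encode to 12 characters
    if len(value) <= len(token):
        return token  # short inputs grow to the hash length
    repeats = -(-len(value) // len(token))  # ceiling division
    return (token * repeats)[: len(value)]
```

With this stand-in, "My Name" (7 characters) becomes a 12-character string, and "This is a very long name" (24 characters) becomes the 12-character hash written twice, mirroring the shape of the output shown above.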

Xify Front

This masking type replaces the front characters with x and blanks. Alphanumeric characters, _ and - are replaced by x, everything else is replaced by a blank.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "xifyFront"
  • unmaskedLength (number, default: 2): how many characters to leave as-is on the right-hand side of each word, as an integer value
  • hash (bool, default: false): whether to append a hash value to the masked string to avoid possible unique constraint violations caused by the obfuscation
  • seed (integer, default: 0): used as secret for computing the hash. A value of 0 means a random seed

Examples

    {
      "path": ".name",
      "type": "xifyFront",
      "unmaskedLength": 2
    }

This will mask all alphanumeric characters of a word except the last two characters. Words of length 1 and 2 are unmasked. If the attribute value is not a string the result will be xxxx.

    "This is a test!Do you agree?"

… will become:

    "xxis is a xxst Do xou xxxee "
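The transformation can be reproduced with a short Python function. This is a sketch of the rule as described here, not the Enterprise Edition code: the name xify_front is hypothetical, only ASCII letters and digits are treated as alphanumeric, and the hash and seed options are left out.

```python
import re

WORD = re.compile(r"[A-Za-z0-9_-]+")      # characters that count as part of a word
NON_WORD = re.compile(r"[^A-Za-z0-9_-]")  # everything else becomes a blank


def xify_front(value, unmasked_length=2):
    """Mask the front of every word with "x", keeping the last characters."""
    if not isinstance(value, str):
        return "xxxx"

    def mask_word(match):
        word = match.group()
        hidden = max(len(word) - unmasked_length, 0)
        return "x" * hidden + word[hidden:]

    blanked = NON_WORD.sub(" ", value)
    return WORD.sub(mask_word, blanked)
```

Applied to "This is a test!Do you agree?" with the default unmasked_length of 2, this returns "xxis is a xxst Do xou xxxee ", the same result as shown above.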

There is a catch. If you have an index on the attribute, the masking might distort the index efficiency or even cause errors in case of a unique index.

    {
      "type": "xifyFront",
      "path": ".name",
      "unmaskedLength": 2,
      "hash": true
    }

This will add a hash at the end of the string.

    "This is a test!Do you agree?"

… will become

    "xxis is a xxst Do xou xxxee NAATm8c9hVQ="

Note that the hash is based on a random secret that is different for each run. This avoids dictionary attacks which can be used to guess values based on pre-computations on dictionaries.

If you need reproducible results, i.e. hashes that do not change between different runs of arangodump, you need to specify a secret as seed, a number which must not be 0.

    {
      "type": "xifyFront",
      "path": ".name",
      "unmaskedLength": 2,
      "hash": true,
      "seed": 246781478647
    }

Zip

This masking type replaces a zip code with a random one. It uses the following rules:

  • If a character of the original zip code is a digit it will be replaced by a random digit.
  • If a character of the original zip code is a letter it will be replaced by a random letter keeping the case.
  • If the attribute value is not a string then the default value is used.

Note that this will generate random zip codes. Therefore there is a chance that the same zip code value is generated multiple times, which can cause unique constraint violations if a unique index is or will be used on the zip code attribute.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "zip"
  • default (string, default: "12345"): if the input field is not of data type string, then this value is used

Examples

    {
      "path": ".code",
      "type": "zip"
    }

This replaces real zip codes stored in fields called code at any level with random ones. "12345" is used as fallback value.

    {
      "path": ".code",
      "type": "zip",
      "default": "abcdef"
    }

If the original zip code is:

    50674

… it will be replaced by e.g.:

    98146

If the original zip code is:

    SA34-EA

… it will be replaced by e.g.:

    OW91-JI

If the original zip code is null, true, false or a number, then the user-defined default value of "abcdef" will be used.
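The character-by-character replacement can be modelled in a few lines of Python. This is an illustrative sketch with a hypothetical function name, mask_zip; keeping characters that are neither digits nor letters (such as the hyphen) is an assumption drawn from the SA34-EA example, since the rules above do not state it.

```python
import random
import string


def mask_zip(value, default="12345"):
    """Replace digits and letters of a zip code with random ones of the same kind."""
    if not isinstance(value, str):
        return default
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(random.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(random.choice(string.ascii_uppercase))  # keep upper case
        elif ch in string.ascii_lowercase:
            out.append(random.choice(string.ascii_lowercase))  # keep lower case
        else:
            out.append(ch)  # assumption: separators are left as they are
    return "".join(out)
```

The result always has the same shape as the input, which is also why two different inputs can end up with the same masked value.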

Datetime

This masking type replaces the value of the attribute with a random date between two configured dates in a customizable format.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "datetime"
  • begin (string, default: "1970-01-01T00:00:00.000"): earliest point in time to return. Date time string in ISO 8601 format.
  • end (string, default: now): latest point in time to return. Date time string in ISO 8601 format. In case a partial date time string is provided (e.g. 2010-06 without day and time) the earliest date and time is assumed (2010-06-01T00:00:00.000). The default value is the current system date and time.
  • format (string, default: ""): the formatting string format is described in DATE_FORMAT(). If no format is specified, then the result will be an empty string.

Example

    {
      "path": "eventDate",
      "type": "datetime",
      "begin" : "2019-01-01",
      "end": "2019-12-31",
      "format": "%yyyy-%mm-%dd"
    }

The above example masks the field eventDate by returning a random date time string in the range of January 1st and December 31st in 2019 using a format like 2019-06-17.
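The effect of begin, end and format can be approximated in Python. This is a rough sketch under stated assumptions: mask_datetime is a hypothetical name, the format uses Python's strftime codes ("%Y-%m-%d") in place of the DATE_FORMAT() placeholders ("%yyyy-%mm-%dd"), and only complete ISO 8601 dates are accepted, not partial ones such as 2010-06.

```python
import random
from datetime import datetime, timedelta


def mask_datetime(begin, end, fmt="%Y-%m-%d"):
    """Return a random point in time between `begin` and `end`, formatted with `fmt`."""
    start = datetime.fromisoformat(begin)
    stop = datetime.fromisoformat(end)
    # Pick a uniformly distributed offset inside the configured range.
    offset = random.uniform(0, (stop - start).total_seconds())
    return (start + timedelta(seconds=offset)).strftime(fmt)
```

mask_datetime("2019-01-01", "2019-12-31") yields a string in the same style as the 2019-06-17 mentioned above.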

Integral Number

This masking type replaces the value of the attribute with a random integral number. It will replace the value even if it is a string, Boolean, or null.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "integer"
  • lower (number, default: -100): smallest integer value to return
  • upper (number, default: 100): largest integer value to return

Example

    {
      "path": "count",
      "type": "integer",
      "lower" : -100,
      "upper": 100
    }

This masks the field count with a random number between -100 and 100 (inclusive).

Decimal Number

This masking type replaces the value of the attribute with a random floating point number. It will replace the value even if it is a string, Boolean, or null.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "decimal"
  • lower (number, default: -1): smallest floating point value to return
  • upper (number, default: 1): largest floating point value to return
  • scale (number, default: 2): maximal amount of digits in the decimal fraction part

Examples

    {
      "path": "rating",
      "type": "decimal",
      "lower" : -0.3,
      "upper": 0.3
    }

This masks the field rating with a random floating point number between -0.3 and +0.3 (inclusive). By default, the decimal has a scale of 2. That means, it has at most 2 digits after the dot.

The configuration:

    {
      "path": "rating",
      "type": "decimal",
      "lower" : -0.3,
      "upper": 0.3,
      "scale": 3
    }

… will generate numbers with at most 3 decimal digits.
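The interplay of lower, upper and scale can be shown with a one-line Python model. It is only a sketch: mask_decimal is a hypothetical name, and rounding a uniformly drawn value is an assumption about how the scale is enforced.

```python
import random


def mask_decimal(lower=-1.0, upper=1.0, scale=2):
    """Return a random float in [lower, upper] with at most `scale` fraction digits."""
    return round(random.uniform(lower, upper), scale)
```

mask_decimal(-0.3, 0.3, scale=3) returns values such as 0.127 or -0.25, which never have more than three digits after the dot.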

Credit Card Number

This masking type replaces the value of the attribute with a random credit card number (as integer number). See Luhn algorithm for details.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "creditCard"

Example

    {
      "path": "ccNumber",
      "type": "creditCard"
    }

This generates a random credit card number to mask field ccNumber, e.g. 4111111414443302.
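The Luhn algorithm referenced above can be sketched in Python as follows. The function names, the fixed length of 16 digits and the leading "4" are illustrative assumptions; the text does not say which prefixes or lengths arangodump generates.

```python
import random


def luhn_checksum(digits):
    """Luhn sum modulo 10 of a digit list whose last element is the check digit."""
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:       # double every second digit, counted from the right
            digit *= 2
            if digit > 9:
                digit -= 9       # same as adding the two digits of the product
        total += digit
    return total % 10


def is_luhn_valid(number):
    return luhn_checksum([int(ch) for ch in str(number)]) == 0


def random_card_number(length=16, prefix="4"):
    """Build a random number of `length` digits that passes the Luhn check."""
    body = [int(ch) for ch in prefix]
    body += [random.randint(0, 9) for _ in range(length - len(body) - 1)]
    # Choose the check digit so that the overall sum is a multiple of 10.
    check = (10 - luhn_checksum(body + [0])) % 10
    return int("".join(str(d) for d in body + [check]))
```

The sample value 4111111414443302 from above passes is_luhn_valid, as does every number returned by random_card_number.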

Phone Number

This masking type replaces a phone number with a random one. It uses the following rules:

  • If a character of the original number is a digit it will be replaced by a random digit.
  • If it is a letter it is replaced by a random letter.
  • All other characters are left unchanged.
  • If the attribute value is not a string it is replaced by the default value.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "phone"
  • default (string, default: "+1234567890"): if the input field is not of data type string, then this value is used

Examples

    {
      "path": "phone.landline",
      "type": "phone"
    }

This will replace an existing phone number with a random one, for instance "+31 66-77-88-xx" might get substituted by "+75 10-79-52-sb".

    {
      "path": "phone.landline",
      "type": "phone",
      "default": "+49 12345 123456789"
    }

This masks a phone number as before, but falls back to a different default phone number in case the input value is not a string.

Email Address

This masking type takes an email address, computes a hash value and splits it into three equal parts AAAA, BBBB, and CCCC. The resulting email address is in the format AAAA.BBBB@CCCC.invalid. The hash is based on a random secret that is different for each run.

Masking settings:

  • path (string): which field to mask
  • type (string): masking function name "email"

Example

    {
      "path": ".email",
      "type": "email"
    }

This masks every leaf attribute email with a random email address similar to "EHwG.3AOg@hGU=.invalid".
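How a hash turns into an address of the form AAAA.BBBB@CCCC.invalid can be sketched in Python. The actual hash function is not named here, so a keyed BLAKE2b digest encoded as Base64 (12 characters, split into three groups of four) serves as a stand-in; mask_email and SECRET are hypothetical names.

```python
import base64
import hashlib
import secrets

SECRET = secrets.token_bytes(16)  # different for each run, as described above


def mask_email(value):
    """Turn a value into a syntactically plausible but invalid email address."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), key=SECRET, digest_size=8).digest()
    token = base64.b64encode(digest).decode("ascii")  # 8 bytes encode to 12 characters
    first, second, third = token[0:4], token[4:8], token[8:12]
    return f"{first}.{second}@{third}.invalid"
```

Within one run the same input always maps to the same address, while a new run produces different addresses because the secret changes.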