Arangoimp

This manual describes the ArangoDB importer arangoimp, which can be used for bulk imports.

The most convenient method to import a lot of data into ArangoDB is to use the arangoimp command-line tool. It allows you to import data records from a file into an existing database collection.

It is possible to import document keys with the documents using the _key attribute. When importing into an edge collection, it is mandatory that all imported documents have the _from and _to attributes, and that they contain valid references.

Let’s assume for the following examples you want to import user data into an existing collection named “users” on the server.

Importing Data into an ArangoDB Database

Importing JSON-encoded Data

Let’s further assume the import at hand is encoded in JSON. We’ll be using these example user records to import:

  1. { "name" : { "first" : "John", "last" : "Connor" }, "active" : true, "age" : 25, "likes" : [ "swimming"] }
  2. { "name" : { "first" : "Jim", "last" : "O'Brady" }, "age" : 19, "likes" : [ "hiking", "singing" ] }
  3. { "name" : { "first" : "Lisa", "last" : "Jones" }, "dob" : "1981-04-09", "likes" : [ "running" ] }

To import these records, all you need to do is to put them into a file (with one line for each record to import) and run the following command:

  1. > arangoimp --file "data.json" --type jsonl --collection "users"

This will transfer the data to the server, import the records, and print a status summary. To show the intermediate progress during the import process, the option --progress can be added. This option will show the percentage of the input file that has been sent to the server. This will only be useful for big import files.

  1. > arangoimp --file "data.json" --type json --collection users --progress true

It is also possible to use the output of another command as an input for arangoimp. For example, the following shell command can be used to pipe data from the cat process to arangoimp:

  1. > cat data.json | arangoimp --file - --type json --collection users

Note that you have to use --file - if you want to use another command as input for arangoimp. No progress can be reported for such imports as the size of the input will be unknown to arangoimp.

By default, the endpoint tcp://127.0.0.1:8529 will be used. If you want to specify a different endpoint, you can use the --server.endpoint option. You probably want to specify a database user and password as well. You can do so by using the options --server.username and --server.password. If you do not specify a password, you will be prompted for one.

  1. > arangoimp --server.endpoint tcp://127.0.0.1:8529 --server.username root --file "data.json" --type json --collection "users"

Note that the collection (users in this case) must already exist or the import will fail. If you want to create a new collection with the import data, you need to specify the --create-collection option. Note that by default it will create a document collection, not an edge collection.

  1. > arangoimp --file "data.json" --type json --collection "users" --create-collection true

To create an edge collection instead, use the --create-collection-type option and set it to edge:

  1. > arangoimp --file "data.json" --collection "myedges" --create-collection true --create-collection-type edge

When importing data into an existing collection it is often convenient to first remove all data from the collection and then start the import. This can be achieved by passing the --overwrite parameter to arangoimp. If it is set to true, any existing data in the collection will be removed prior to the import. Note that any existing index definitions for the collection will be preserved even if --overwrite is set to true.

  1. > arangoimp --file "data.json" --type json --collection "users" --overwrite true

As the import file already contains the data in JSON format, attribute names and data types are fully preserved. As can be seen in the example data, there is no need for all data records to have the same attribute names or types. Records can be inhomogeneous.

Please note that by default, arangoimp will import data into the specified collection in the default database (_system). To specify a different database, use the --server.database option when invoking arangoimp.
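For example, to import into a database named mydb (the database name here is just an illustrative assumption; the database must already exist on a running server):

```shell
# Import into the database "mydb" instead of the default _system database.
# "mydb" is a hypothetical name; requires a running ArangoDB server.
arangoimp --server.database mydb --file "data.json" --type json --collection "users"
```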

The tool also supports parallel imports, with multiple threads. Using multiple threads may provide a speedup, especially when using the RocksDB storage engine. To specify the number of parallel threads use the --threads option:

  1. > arangoimp --threads 4 --file "data.json" --type json --collection "users"

Note that using multiple threads may lead to a non-sequential import of the input data. Data that appears later in the input file may be imported earlier than data that appears earlier in the input file. This is normally not a problem but may cause issues when there are data dependencies or duplicates in the import data. In this case, the number of threads should be set to 1.

JSON input file formats

Note: arangoimp supports two formats when importing JSON data from a file. The first format, which we also used above, is commonly known as JSONL. However, in contrast to the JSONL specification it requires the input file to contain one complete JSON document in each line, e.g.

  1. { "_key": "one", "value": 1 }
  2. { "_key": "two", "value": 2 }
  3. { "_key": "foo", "value": "bar" }
  4. ...

So one could argue that this is only a subset of JSONL.
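Before running an import, it can be useful to check that an input file actually follows this one-document-per-line format. The following sketch builds a small sample file and validates each line as a standalone JSON document (python3 is assumed to be available; the file name is arbitrary):

```shell
# Write a small JSONL file: one complete JSON document per line
printf '%s\n' \
  '{ "_key": "one", "value": 1 }' \
  '{ "_key": "two", "value": 2 }' > data.jsonl

# Each line must parse as a standalone JSON document, as arangoimp expects
while IFS= read -r line; do
  printf '%s' "$line" | python3 -m json.tool > /dev/null && echo "valid"
done < data.jsonl
# prints "valid" once per well-formed line
```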

The above format can be imported sequentially by arangoimp. It will read data from the input file in chunks and send it in batches to the server. Each batch will be about as big as specified in the command-line parameter --batch-size.

An alternative is to put one big JSON document into the input file like this:

  1. [
  2. { "_key": "one", "value": 1 },
  3. { "_key": "two", "value": 2 },
  4. { "_key": "foo", "value": "bar" },
  5. ...
  6. ]

This format allows line breaks within the input file as required. The downside is that the whole input file will need to be read by arangoimp before it can send the first batch. This might be a problem if the input file is big. By default, arangoimp will allow importing such files up to a size of about 16 MB.

If you want to allow your arangoimp instance to use more memory, you may want to increase the maximum file size by specifying the command-line option --batch-size. For example, to set the batch size to 32 MB, use the following command:

  1. > arangoimp --file "data.json" --type json --collection "users" --batch-size 33554432

Please also note that you may need to increase the value of --batch-size if a single document inside the input file is bigger than the value of --batch-size.
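The value of --batch-size is given in bytes. The 33554432 in the example above is simply 32 MB expressed in bytes:

```shell
# 32 MB expressed in bytes, as passed to --batch-size above
echo $((32 * 1024 * 1024))   # -> 33554432
```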

Importing CSV Data

arangoimp also offers the possibility to import data from CSV files. This comes in handy when the data at hand is in CSV format already and you don’t want to spend time converting it to JSON for the import.

To import data from a CSV file, make sure your file contains the attribute names in the first row. All the following lines in the file will be interpreted as data records and will be imported.

The CSV import requires the data to have a homogeneous structure. All records must have exactly the same number of columns as there are headers. By default, lines with a different number of values will not be imported and there will be warnings for them. To still import lines with fewer values than in the header, there is the --ignore-missing option. If set to true, lines that have a different number of fields will be imported. In this case only those attributes will be populated for which there are values. Attributes for which there are no values present will silently be discarded.

Example:

  1. "first","last","age","active","dob"
  2. "John","Connor",25,true
  3. "Jim","O'Brady"

With --ignore-missing this will produce the following documents:

  1. { "first" : "John", "last" : "Connor", "active" : true, "age" : 25 }
  2. { "first" : "Jim", "last" : "O'Brady" }

The cell values can have different data types though. If a cell does not have any value, it can be left empty in the file. These values will not be imported, so the attributes will not “be there” in the created document. Values enclosed in quotes will be imported as strings, so to import numeric values, boolean values or the null value, don’t enclose the value in quotes in your file.

We’ll be using the following input file for the CSV import:

  1. "first","last","age","active","dob"
  2. "John","Connor",25,true,
  3. "Jim","O'Brady",19,,
  4. "Lisa","Jones",,,"1981-04-09"
  5. Hans,dos Santos,0123,,
  6. Wayne,Brewer,,false,

The command line to execute the import is:

  1. > arangoimp --file "data.csv" --type csv --collection "users"

The above data will be imported into 5 documents, which will look as follows:

  1. { "first" : "John", "last" : "Connor", "active" : true, "age" : 25 }
  2. { "first" : "Jim", "last" : "O'Brady", "age" : 19 }
  3. { "first" : "Lisa", "last" : "Jones", "dob" : "1981-04-09" }
  4. { "first" : "Hans", "last" : "dos Santos", "age" : 123 }
  5. { "first" : "Wayne", "last" : "Brewer", "active" : false }

As can be seen, values left completely empty in the input file will be treated as absent. Numeric values not enclosed in quotes will be treated as numbers. Note that leading zeros in numeric values will be removed. To import numbers with leading zeros, please use strings. The literals true and false will be treated as booleans if they are not enclosed in quotes. Other values not enclosed in quotes will be treated as strings. Any values enclosed in quotes will be treated as strings, too.

String values containing the quote character or the separator must be enclosed with quote characters. Within a string, the quote character itself must be escaped with another quote character (or with a backslash if the --backslash-escape option is used).

Note that the quote and separator characters can be adjusted via the --quote and --separator arguments when invoking arangoimp. The quote character defaults to the double quote ("). To use a literal quote in a string, you can use two quote characters. To use backslash for escaping quote characters, please set the option --backslash-escape to true.
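As an illustration, the following sketch builds a small file that uses semicolons as separators and single quotes for quoting, and shows a matching invocation (the file and collection names are assumptions; the import command itself requires a running server):

```shell
# Build a semicolon-separated file that quotes values with single quotes
printf '%s\n' \
  "'first';'last'" \
  "'John';'Connor'" > data_semicolon.csv
cat data_semicolon.csv   # prints the two lines written above

# Import it with adjusted separator and quote characters (needs a server):
# arangoimp --file "data_semicolon.csv" --type csv --separator ";" --quote "'" --collection "users"
```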

The importer supports Windows (CRLF) and Unix (LF) line breaks. Line breaks might also occur inside values that are enclosed with the quote character.

Here’s an example for using literal quotes and newlines inside values:

  1. "name","password"
  2. "Foo","r4ndom""123!"
  3. "Bar","wow!
  4. this is a
  5. multiline password!"
  6. "Bartholomew ""Bart"" Simpson","Milhouse"

Extra whitespace at the end of each line will be ignored. Whitespace at the start of lines or between field values will not be ignored, so please make sure that there is no extra whitespace in front of values or between them.

Importing TSV Data

You may also import tab-separated values (TSV) from a file. This format is very simple: every line in the file represents a data record. There is no quoting or escaping. That also means that the separator character (which defaults to the tab character) must not be used anywhere in the actual data.

As with CSV, the first line in the TSV file must contain the attribute names,and all lines must have an identical number of values.

If a different separator character or string should be used, it can be specified with the --separator argument.

An example command line to execute the TSV import is:

  1. > arangoimp --file "data.tsv" --type tsv --collection "users"

Attribute Name Translation

For the CSV and TSV input formats, attribute names can be translated automatically. This is useful in case the import file has different attribute names than those that should be used in ArangoDB.

A common use case is to rename an “id” column from the input file into “_key” as it is expected by ArangoDB. To do this, specify the following translation when invoking arangoimp:

  1. > arangoimp --file "data.csv" --type csv --translate "id=_key"

Other common cases are to rename columns in the input file to _from and _to:

  1. > arangoimp --file "data.csv" --type csv --translate "from=_from" --translate "to=_to"

The translate option can be specified multiple times. The source attribute name and the target attribute must be separated with a =.

Ignoring Attributes

For the CSV and TSV input formats, certain attribute names can be ignored on imports. In an ArangoDB cluster there are cases where this can come in handy, when your documents already contain a _key attribute and your collection has a sharding attribute other than _key: in the cluster this configuration is not supported, because ArangoDB needs to guarantee the uniqueness of the _key attribute in all shards of the collection.

  1. > arangoimp --file "data.csv" --type csv --remove-attribute "_key"

The same thing would apply if your data contains an _id attribute:

  1. > arangoimp --file "data.csv" --type csv --remove-attribute "_id"

Importing into an Edge Collection

arangoimp can also be used to import data into an existing edge collection. The import data must, for each edge to import, contain at least the _from and _to attributes. These indicate which other two documents the edge should connect. It is necessary that these attributes are set for all records, and point to valid document ids in existing collections.

Examples

  1. { "_from" : "users/1234", "_to" : "users/4321", "desc" : "1234 is connected to 4321" }

Note: The edge collection must already exist when the import is started. Using the --create-collection flag will not work because arangoimp will always try to create a regular document collection if the target collection does not exist.

Updating existing documents

By default, arangoimp will try to insert all documents from the import file into the specified collection. In case the import file contains documents that are already present in the target collection (matching is done via the _key attributes), then a default arangoimp run will not import these documents and complain about unique key constraint violations.

However, arangoimp can be used to update or replace existing documents in case they already exist in the target collection. It provides the command-line option --on-duplicate to control the behavior in case a document is already present in the database.

The default value of --on-duplicate is error. This means that when the import file contains a document that is present in the target collection already, then trying to re-insert a document with the same _key value is considered an error, and the document in the database will not be modified.

Other possible values for --on-duplicate are:

  • update: each document present in the import file that is also present in the target collection already will be updated by arangoimp. update will perform a partial update of the existing document, modifying only the attributes that are present in the import file and leaving all other attributes untouched.

The values of system attributes _id, _key, _rev, _from and _to cannot be updated or replaced in existing documents.

  • replace: each document present in the import file that is also present in the target collection already will be replaced by arangoimp. replace will replace the existing document entirely, resulting in a document with only the attributes specified in the import file.

The values of system attributes _id, _key, _rev, _from and _to cannot be updated or replaced in existing documents.

  • ignore: each document present in the import file that is also present in the target collection already will be ignored and not modified in the target collection.

When --on-duplicate is set to either update or replace, arangoimp will return the number of documents updated/replaced in the updated return value. When set to another value, the value of updated will always be zero. When --on-duplicate is set to ignore, arangoimp will return the number of ignored documents in the ignored return value. When set to another value, ignored will always be zero.

It is possible to perform a combination of inserts and updates/replaces with a single arangoimp run. When --on-duplicate is set to update or replace, all documents present in the import file will be inserted into the target collection provided they are valid and do not already exist with the specified _key. Documents that are already present in the target collection (identified by the _key attribute) will instead be updated/replaced.
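For example, to insert all new documents from the import file and partially update those whose _key already exists in the collection:

```shell
# Insert new documents; documents with an existing _key are partially updated.
# Requires a running ArangoDB server and an existing "users" collection.
arangoimp --file "data.json" --type json --collection "users" --on-duplicate update
```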

Arangoimp result output

An arangoimp import run will print out the final results on the command line. It will show the

  • number of documents created (created)
  • number of documents updated/replaced (updated/replaced, only non-zero if --on-duplicate was set to update or replace, see below)
  • number of warnings or errors that occurred on the server side (warnings/errors)
  • number of ignored documents (only non-zero if --on-duplicate was set to ignore)

Example:
  1. created: 2
  2. warnings/errors: 0
  3. updated/replaced: 0
  4. ignored: 0

For CSV and TSV imports, the total number of input file lines read will also be printed (lines read).

arangoimp will also print out details about warnings and errors that happened on the server side (if any).

Attribute Naming and Special Attributes

Attributes whose names start with an underscore are treated in a special way by ArangoDB:

  • the optional _key attribute contains the document’s key. If specified, the value must be formally valid (e.g. must be a string and conform to the naming conventions). Additionally, the key value must be unique within the collection the import is run for.
  • _from: when importing into an edge collection, this attribute contains the id of one of the documents connected by the edge. The value of _from must be a syntactically valid document id and the referred collection must exist.
  • _to: when importing into an edge collection, this attribute contains the id of the other document connected by the edge. The value of _to must be a syntactically valid document id and the referred collection must exist.
  • _rev: this attribute contains the revision number of a document. However, the revision numbers are managed by ArangoDB and cannot be specified on import. Thus any value in this attribute is ignored on import.

If you import values into _key, you should make sure they are valid and unique.

When importing data into an edge collection, you should make sure that all import documents contain _from and _to and that their values point to existing documents.

To avoid specifying complete document ids (consisting of collection names and document keys) for _from and _to values, there are the options --from-collection-prefix and --to-collection-prefix. If specified, these values will be automatically prepended to each value in _from (or _to, respectively). This allows specifying only document keys inside _from and/or _to.

Example

  1. > arangoimp --from-collection-prefix users --to-collection-prefix products ...

Importing the following document will then create an edge between users/1234 and products/4321:

  1. { "_from" : "1234", "_to" : "4321", "desc" : "users/1234 is connected to products/4321" }

Automatic pacing with busy or low throughput disk subsystems

Arangoimport has an automatic pacing algorithm that limits how fast data is sent to the ArangoDB servers. This pacing algorithm exists to prevent the import operation from failing due to slow responses.

Google Compute and other VM providers limit the throughput of disk devices. Google’s limit is more strict for smaller disk rentals than for larger ones. Specifically, a user could choose the smallest disk space and be limited to 3 Mbytes per second. Similarly, other users’ processes on the shared VM can limit available throughput of the disk devices.

The automatic pacing algorithm adjusts the transmit block size dynamically based upon the actual throughput of the server over the last 20 seconds. Further, each thread delivers its portion of the data in mostly non-overlapping chunks. The thread timing creates intentional windows of non-import activity to allow the server extra time for meta operations.

Automatic pacing intentionally does not use the full throughput of a disk device. An unlimited (really fast) disk device might not need pacing. Raising the number of threads via the --threads X command-line option to any value of X greater than 2 will increase the total throughput used.

Automatic pacing frees the user from adjusting the throughput used to match available resources. It is disabled by manually specifying any --batch-size. 16777216 was the previous default for --batch-size. Having --batch-size too large can lead to transmitted data piling up on the server, resulting in a TimeoutError.

The pacing algorithm works successfully with MMFiles with disks limited to read and write throughput as small as 1 Mbyte per second. The algorithm works successfully with RocksDB with disks limited to read and write throughput as small as 3 Mbytes per second.