Features and Improvements in ArangoDB 3.2

The following list shows in detail which features have been added or improved in ArangoDB 3.2. ArangoDB 3.2 also contains several bugfixes that are not listed here.

Storage engines

ArangoDB 3.2 offers two storage engines:

the always-existing memory-mapped files storage engine
a new storage engine based on RocksDB

Memory-mapped files storage engine (MMFiles)

The former storage engine (named MMFiles engine henceforth) persists data in memory-mapped files.

Any data changes are done first in the engine’s write-ahead log (WAL). The WAL is replayed after a crash so the engine offers durability and crash-safety. Data from the WAL is eventually moved to collection-specific datafiles. The files are always written in an append-only fashion, so data in files is never overwritten. Obsolete data in files will eventually be purged by background compaction threads.

Most of this engine’s indexes are built in RAM. When a collection is loaded, this requires rebuilding the indexes in RAM from the data stored on disk. The MMFiles engine has collection-level locking.

This storage engine is a good choice when data (including the indexes) can fit in the server’s available RAM. If the size of data plus the in-memory indexes exceeds the size of the available RAM, then this engine may try to allocate more memory than available. This will either make the operating system swap out parts of the data (and cause disk I/O) or, when no swap space is configured, invoke the operating system’s out-of-memory process killer.

The locking strategy allows parallel reads and is often good enough in read-mostly workloads. Writes need exclusive locks on the collections, so they can block other operations in the same collection. The locking strategy also provides transactional consistency and isolation.

RocksDB storage engine

The RocksDB storage engine is new in ArangoDB 3.2. It is designed to store datasets that are bigger than the server’s available RAM. It persists all data (including the indexes) in a RocksDB instance.

That means any document read or write operations will be answered by RocksDB under the hood. RocksDB will serve the data from its own in-RAM caches or from disk. The RocksDB engine has a write-ahead log (WAL) and uses background threads for compaction. It supports data compression.

The RocksDB storage engine has document-level locking. Read operations do not block and are never blocked by other operations. Write operations only block writes on the same documents/index values. Because multiple writers can operate in parallel on the same collection, there is the possibility of write-write conflicts. If such a write conflict is detected, one of the write operations is aborted with error 1200 (“conflict”). Client applications can then either abort the operation or retry, based on the required consistency semantics.
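
The retry option mentioned above can be sketched as a small helper. This is a minimal sketch, not ArangoDB's own client code: the helper name `retryOnConflict`, the simulated write `makeFlakyWrite` and the assumption that the error object exposes the error number as `errorNum` are all illustrative.

```javascript
// Error number that the text above gives for write-write conflicts.
const ERROR_ARANGO_CONFLICT = 1200;

// Runs `operation` and retries it when it fails with a conflict error.
// Any other error, or a conflict on the last allowed attempt, is rethrown.
// Assumption: the thrown error carries the error number in `errorNum`.
function retryOnConflict(operation, maxAttempts) {
  for (let attempt = 1; ; attempt++) {
    try {
      return operation();
    } catch (err) {
      const isConflict = Boolean(err) && err.errorNum === ERROR_ARANGO_CONFLICT;
      if (!isConflict || attempt >= maxAttempts) {
        throw err;
      }
    }
  }
}

// Hypothetical stand-in for a real write operation:
// it fails with a conflict `failures` times and then succeeds.
function makeFlakyWrite(failures) {
  let calls = 0;
  return function () {
    calls += 1;
    if (calls <= failures) {
      const err = new Error("conflict");
      err.errorNum = ERROR_ARANGO_CONFLICT;
      throw err;
    }
    return { calls: calls };
  };
}

const result = retryOnConflict(makeFlakyWrite(2), 5);
console.log("write succeeded after", result.calls, "attempts");
```

Whether a retry is appropriate depends on the operation: blindly repeating a write that was computed from stale data may not give the intended result, so a real application would typically re-read the document inside `operation`.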

Storage engine selection

The storage engine to use in an ArangoDB cluster or a single-server instance must be selected initially. The default storage engine in ArangoDB 3.2 is the MMFiles engine if no storage engine is selected explicitly. This ensures all users upgrading from earlier versions can continue with the well-known MMFiles engine.

To select the storage engine, use the configuration option --server.storage-engine. It can be set to either mmfiles, rocksdb or auto. While the first two values will explicitly select a storage engine, the auto option will automatically choose the storage engine based on which storage engine was previously selected. If no engine was selected previously, auto will select the MMFiles engine. Once an engine has been selected, the selection is written to a file named ENGINE in the server’s database directory and is read from there at all subsequent server starts.

Once the storage engine has been selected, the selection cannot be changed by adjusting --server.storage-engine. In order to switch to another storage engine, it is required to restart the server with another (empty) database directory. In order to use data created with the other storage engine, it is required to dump the data first with the old engine and restore it using the new storage engine. This can be achieved by invoking arangodump and arangorestore.
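
The selection rules described above can be summarized in a few lines of code. This is only a sketch of the documented behavior, not the server's implementation: the function name `resolveStorageEngine` is hypothetical, and how the server reacts when an explicit option contradicts the stored selection is not stated here, so the sketch simply treats that case as an error.

```javascript
const VALID_OPTIONS = ["mmfiles", "rocksdb", "auto"];

// `option` is the value passed for --server.storage-engine.
// `storedEngine` is the content of the ENGINE file in the database
// directory, or null if no engine was selected before.
function resolveStorageEngine(option, storedEngine) {
  if (!VALID_OPTIONS.includes(option)) {
    throw new Error("invalid storage engine option: " + option);
  }
  if (storedEngine === null) {
    // First start: "auto" falls back to the MMFiles engine.
    return option === "auto" ? "mmfiles" : option;
  }
  if (option === "auto" || option === storedEngine) {
    // Later starts: the stored selection is used.
    return storedEngine;
  }
  // Assumption: a mismatch is modeled as an error because the stored
  // selection cannot be changed by adjusting the option.
  throw new Error(
    "database directory was created with engine " + storedEngine +
    ", cannot switch to " + option
  );
}

console.log(resolveStorageEngine("auto", null));
console.log(resolveStorageEngine("rocksdb", null));
console.log(resolveStorageEngine("auto", "rocksdb"));
```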

Unlike in MySQL, the storage engine selection in ArangoDB is for an entire cluster or an entire single-server instance. All databases and collections will use the same storage engine.

RocksDB storage engine: supported index types

The existing indexes in the RocksDB engine are all persistent. The following indexes are supported there:

primary: this type of index is automatically created. It indexes _id / _key
edge: this index is automatically created for edge collections. It indexes _from and _to
hash, skiplist, persistent: these are user-defined indexes. Despite their names, they are neither hash nor skiplist indexes. These index types map to the same RocksDB-based sorted index implementation. The same is true for the “persistent” index. The names “hash”, “skiplist” and “persistent” are only used for compatibility with the MMFiles engine, where these indexes existed in previous versions and still exist in the current version of ArangoDB.
geo: user-defined index for proximity searches
fulltext: user-defined sorted inverted index on words occurring in documents

Satellite Collections

With SatelliteCollections, you can define collections to shard to a cluster and collections to replicate to each machine. The ArangoDB query optimizer knows where each shard is located and sends the requests to the DBServers involved, which then execute the query locally. With this approach, network hops during join operations on sharded collections can be avoided and response times can be close to that of a single instance.

Satellite collections are available in the Enterprise Edition.

Memory management

make arangod start with fewer V8 JavaScript contexts

This speeds up the server start and makes arangod use less memory at start. Whenever a V8 context is needed by a Foxx action or some other JavaScript operation and there is no usable V8 context, a new context will be created dynamically now.

Up to --javascript.v8-contexts V8 contexts will be created, so this option will change its meaning. Previously as many V8 contexts as specified by this option were created at server start, and the number of V8 contexts did not change at runtime. Now up to this number of V8 contexts will be in use at the same time, but the actual number of V8 contexts is dynamic.

The garbage collector thread will automatically delete unused V8 contexts after a while. The number of spare contexts will go down to as few as configured in the new option --javascript.v8-contexts-minimum. That many V8 contexts are also created at server start.

The first few requests in new V8 contexts may take longer than in contexts that have been there already. Performance may therefore suffer a bit for the initial requests sent to ArangoDB or when there are only few but performance-critical situations in which new V8 contexts need to be created. If this is a concern, it can easily be fixed by setting --javascript.v8-contexts-minimum and --javascript.v8-contexts to a relatively high value, which will guarantee that this many V8 contexts are created at startup and kept around even when unused.

Waiting for an unused V8 context will now also abort and write a log message in case no V8 context can be acquired/created after 60 seconds.
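
The interplay of the two options can be pictured with a toy model of a pool that is bounded by a minimum and a maximum. This is a sketch under simplifying assumptions and not how arangod is implemented: the class `ContextPool` is hypothetical, contexts are plain objects, and `acquire` returns null when the maximum is reached instead of waiting up to 60 seconds.

```javascript
// Toy model: `minimum` plays the role of --javascript.v8-contexts-minimum,
// `maximum` the role of --javascript.v8-contexts.
class ContextPool {
  constructor(minimum, maximum) {
    this.minimum = minimum;
    this.maximum = maximum;
    this.idle = [];
    this.busy = 0;
    this.created = 0;
    // The minimum number of contexts is created up front.
    for (let i = 0; i < minimum; i++) {
      this.idle.push(this.create());
    }
  }

  create() {
    this.created += 1;
    return { id: this.created };
  }

  total() {
    return this.idle.length + this.busy;
  }

  // Hands out an idle context, creating a new one on demand.
  // Returns null when the maximum number of contexts is already in use.
  acquire() {
    if (this.idle.length === 0) {
      if (this.total() >= this.maximum) {
        return null;
      }
      this.idle.push(this.create());
    }
    this.busy += 1;
    return this.idle.pop();
  }

  release(context) {
    this.busy -= 1;
    this.idle.push(context);
  }

  // Drops idle contexts until only `minimum` contexts remain in total.
  collectGarbage() {
    while (this.idle.length > 0 && this.total() > this.minimum) {
      this.idle.pop();
    }
  }
}

const pool = new ContextPool(2, 4);
const held = [pool.acquire(), pool.acquire(), pool.acquire()];
console.log("contexts while busy:", pool.total());
held.forEach((context) => pool.release(context));
pool.collectGarbage();
console.log("contexts after cleanup:", pool.total());
```

In this model the third `acquire` call is the one that pays the cost of creating a new context, which corresponds to the slower first requests described above.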

the number of pending operations in arangod can now be limited to a configurable number. If this number is exceeded, the server will now respond with HTTP 503 (service unavailable). The maximum size of pending operations is controlled via the startup option --server.maximal-queue-size. Setting it to 0 means “no limit”.
the in-memory document revisions cache was removed entirely because it did not provide the expected benefits. The 3.1 implementation shadowed document data in RAM, which increased the server’s RAM usage but did not speed up document lookups too much.

This also obsoletes the startup options --database.revision-cache-chunk-size and --database.revision-cache-target-size.

The MMFiles engine now does not use a document revisions cache but has in-memory indexes and maps documents to RAM automatically via mmap when documents are accessed. The RocksDB engine has its own mechanism for caching accessed documents.

Communication Layer

HTTP responses returned by arangod will now include the extra HTTP header x-content-type-options: nosniff to work around a cross-site scripting bug in MSIE
the default value for --ssl.protocol was changed from TLSv1 to TLSv1.2. When not explicitly set, arangod and all client tools will now use TLSv1.2.
the JSON data in all incoming HTTP requests is now validated for duplicate attribute names.

Incoming JSON data with duplicate attribute names will now be rejected as invalid. Previous versions of ArangoDB only validated the uniqueness of attribute names inside incoming JSON for some API endpoints, but not consistently for all APIs.
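
Duplicate attribute names are easy to miss on the client side because many JSON parsers accept them silently. The following snippet illustrates this with JavaScript's built-in `JSON.parse`, which keeps the last value for a repeated name. It only shows the client-side behavior and says nothing about how arangod performs its validation.

```javascript
// A JSON text in which the attribute "a" appears twice.
const body = '{"a": 1, "b": 2, "a": 3}';

// JSON.parse does not complain; the last occurrence of "a" wins.
const parsed = JSON.parse(body);
console.log(parsed.a, Object.keys(parsed).length);
```

A request body like this would now be rejected by the server, so it is worth checking how request payloads are assembled if they are built by string concatenation rather than by serializing objects.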

Internal JavaScript REST actions will now hide their stack traces from the client in HTTP responses. Instead, they will always log them to the logfile.

JavaScript

updated V8 version to 5.7.0.0
change undocumented behavior in case of invalid revision ids in If-Match and If-None-Match headers from 400 (BAD) to 412 (PRECONDITION FAILED).
change default string truncation length from 80 characters to 256 characters for print/printShell functions in ArangoShell and arangod. This will emit longer prefixes of string values before truncating them with …, which is helpful for debugging. This change is mostly useful when using the ArangoShell (arangosh).
the @arangodb module now provides a time function which returns the current time in seconds as a floating point value with microsecond precision.

Foxx

There is now an official HTTP API for managing services, allowing services to be installed, modified, uninstalled and reconfigured without the administrative web interface.
It is now possible to upload a single JavaScript file instead of a zip archive if your service requires no configuration, additional files or setup. A minimal manifest will be generated automatically upon installation and the uploaded file will be used as the service’s main entry point.

Distributed Graph Processing

We added support for executing distributed graph algorithms aka Pregel.
Users can run arbitrary algorithms on an entire graph, including in cluster mode.
We implemented a number of algorithms for various well-known graph measures:
- Connected Components
- PageRank
- Shortest Paths
- Centrality Measures (Centrality and Betweenness)
- Community Detection (via Label Propagation, Speakers-Listeners Label Propagation or DMID)
Users can contribute their own algorithms

AQL

Optimizer improvements

Geo indexes are now implicitly and automatically used when using appropriate SORT/FILTER statements in AQL, without the need to use the somewhat limited special-purpose geo AQL functions NEAR or WITHIN.

Compared to using the special purpose AQL functions this approach has the advantage that it is more composable, and will also honor any LIMIT values used in the AQL query.

The special purpose NEAR AQL function can now be substituted with the following AQL (provided there is a geo index present on the doc.latitude and doc.longitude attributes):

FOR doc in geoSort
  SORT DISTANCE(doc.latitude, doc.longitude, 0, 0)
  LIMIT 5
  RETURN doc

WITHIN can be substituted with the following AQL:

FOR doc in geoFilter
  FILTER DISTANCE(doc.latitude, doc.longitude, 0, 0) < 2000
  RETURN doc
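
To make the semantics of the two queries concrete, here is a client-side sketch of what they compute, without any index support. It relies on assumptions that are not stated in the text above: that distances are measured in meters, and that a haversine formula on a sphere with a mean radius of 6371 km is a reasonable approximation. The function names and the sample documents are hypothetical.

```javascript
// Assumption: spherical Earth with this mean radius, in meters.
const EARTH_RADIUS_METERS = 6371000;

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two coordinates (haversine formula).
function haversineDistance(lat1, lon1, lat2, lon2) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
}

// Counterpart of the SORT DISTANCE(...) LIMIT n query.
function nearest(docs, lat, lon, limit) {
  return docs
    .slice()
    .sort(
      (a, b) =>
        haversineDistance(a.latitude, a.longitude, lat, lon) -
        haversineDistance(b.latitude, b.longitude, lat, lon)
    )
    .slice(0, limit);
}

// Counterpart of the FILTER DISTANCE(...) < radius query.
function within(docs, lat, lon, radius) {
  return docs.filter(
    (doc) => haversineDistance(doc.latitude, doc.longitude, lat, lon) < radius
  );
}

const places = [
  { name: "far", latitude: 0, longitude: 0.05 },
  { name: "near", latitude: 0, longitude: 0.01 },
  { name: "origin", latitude: 0, longitude: 0 },
];

console.log(nearest(places, 0, 0, 2).map((doc) => doc.name));
console.log(within(places, 0, 0, 2000).map((doc) => doc.name));
```

The point of the optimizer improvement is that the server does not have to scan and sort all documents like this sketch does: with a geo index in place, the equivalent AQL queries can be answered from the index.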

Miscellaneous improvements

added REGEX_REPLACE AQL function

REGEX_REPLACE(text, search, replacement, caseInsensitive) → string

Replace the pattern search with the string replacement in the string text, using regular expression matching.

text (string): the string to search in
search (string): a regular expression search pattern
replacement (string): the string to replace the search pattern with
returns string (string): the string text with the search regex pattern replaced with the replacement string wherever the pattern exists in text
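
For readers more familiar with JavaScript, the behavior described above roughly corresponds to a global regular expression replace. The following is only an analogy, not the AQL implementation: the helper `regexReplace` is hypothetical, and the regular expression dialects of AQL and JavaScript may differ in details.

```javascript
// JavaScript analogy for REGEX_REPLACE(text, search, replacement, caseInsensitive).
function regexReplace(text, search, replacement, caseInsensitive) {
  // "g" replaces every match; "i" is added for case-insensitive matching.
  const flags = caseInsensitive ? "gi" : "g";
  // A replacer function keeps `replacement` literal, so "$" sequences
  // in it are not interpreted by String.prototype.replace.
  return text.replace(new RegExp(search, flags), () => replacement);
}

console.log(regexReplace("the quick brown fox", "the.*fox", "jumped over", false));
console.log(regexReplace("Hello hello", "hello", "bye", true));
```
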
added new startup option --query.fail-on-warning to make AQL queries abort instead of continuing with warnings.

When set to true, this will make an AQL query throw an exception and abort in case a warning occurs. This option should be used in development to catch errors early. If set to false, warnings will not be propagated to exceptions and will be returned with the query results. The startup option can also be overridden on a per-query level.

the slow query list now contains the values of bind variables used in the slow queries. Bind variables are also provided for the currently running queries. This helps debugging slow or blocking queries that use dynamic collection names via bind parameters.
AQL breaking change in cluster: The SHORTEST_PATH statement using edge collection names instead of a graph name now requires explicitly naming the vertex collections within the AQL query in the cluster. This can be done by adding WITH <name> at the beginning of the query.

Example:

FOR v,e IN OUTBOUND SHORTEST_PATH @start TO @target edges [...]

Now has to be:

WITH vertices
FOR v,e IN OUTBOUND SHORTEST_PATH @start TO @target edges [...]

This change is needed to avoid deadlock situations in the clustered case. An error message stating the above is included.

Client tools

added data export tool, arangoexport.

arangoexport can be used to export collections to json, jsonl or xml and export a graph or collections to xgmml.

added “jsonl” as input file type for arangoimp
added --translate option for arangoimp to translate attribute names from the input files to attribute names expected by ArangoDB

The --translate option can be specified multiple times (once per translation to be executed). The following example renames the “id” column from the input file to “_key”, and the “from” column to “_from”, and the “to” column to “_to”:

arangoimp --type csv --file data.csv --translate "id=_key" --translate "from=_from" --translate "to=_to"

--translate works for CSV and TSV inputs only.
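
The effect of the translations on the column names can be illustrated with a few lines of code. This is a sketch of the renaming described above, not arangoimp's implementation. The function names and the extra `weight` column are made up for the example.

```javascript
// Parses specifications such as "id=_key" into a Map
// from the input column name to the target attribute name.
function parseTranslations(specs) {
  const translations = new Map();
  for (const spec of specs) {
    const pos = spec.indexOf("=");
    if (pos <= 0 || pos === spec.length - 1) {
      throw new Error("invalid translation: " + spec);
    }
    translations.set(spec.slice(0, pos), spec.slice(pos + 1));
  }
  return translations;
}

// Renames the columns that have a translation and keeps all others.
function translateHeader(columns, translations) {
  return columns.map((name) =>
    translations.has(name) ? translations.get(name) : name
  );
}

const translations = parseTranslations(["id=_key", "from=_from", "to=_to"]);
const header = translateHeader(["id", "from", "to", "weight"], translations);
console.log(header.join(","));
```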

added --threads option to arangoimp to specify the number of parallel import threads
changed default value for client tools option --server.max-packet-size from 128 MB to 256 MB. This allows transferring bigger result sets from the server without the client tools rejecting them as invalid.

Authentication

added LDAP authentication (Enterprise Edition only)

Authorization

added read only mode for users
collection level authorization rights

Read more in the overview.

Foxx and authorization

the cookie session transport now supports all options supported by the cookie method of the response object.
it’s now possible to provide your own version of the graphql-sync module when using the GraphQL extensions for Foxx by passing a copy of the module using the new graphql option.
custom API endpoints can now be tagged using the tag method to generate cleaner Swagger documentation.

Miscellaneous Changes

arangod now validates several OS/environment settings on startup and warns if the settings are non-ideal. It will additionally print out ways to remedy them.

Most of the checks are executed on Linux systems only.

added “deduplicate” attribute for array indexes, which controls whether inserting duplicate index values from the same document into a unique array index will lead to an error or not:

// with deduplicate = true, which is the default value:
db._create("test");
db.test.ensureIndex({ type: "hash", fields: ["tags[*]"], unique: true, deduplicate: true });
db.test.insert({ tags: ["a", "b"] });
db.test.insert({ tags: ["c", "d", "c"] }); // will work, because deduplicate = true
db.test.insert({ tags: ["a"] }); // will fail, because "a" is already indexed for the first document

// with deduplicate = false
db._drop("test"); // start over with an empty collection
db._create("test");
db.test.ensureIndex({ type: "hash", fields: ["tags[*]"], unique: true, deduplicate: false });
db.test.insert({ tags: ["a", "b"] });
db.test.insert({ tags: ["c", "d", "c"] }); // will not work, because deduplicate = false
db.test.insert({ tags: ["a"] }); // will fail, because "a" is already indexed for the first document
