Different data models and scalability

In this section we discuss scalability in the context of the different data models supported by ArangoDB.

Key/value pairs

The key/value store data model is the easiest to scale. In ArangoDB, it is implemented on top of document collections: every document collection has a primary key attribute, _key, and in the absence of secondary indexes the collection behaves like a simple key/value store.

The only operations that are possible in this context are single-key lookups and key/value pair insertions and updates. If _key is the only sharding attribute, then the sharding is done with respect to the primary key and all these operations scale linearly. If the sharding is done using different shard keys, then a lookup of a single key involves asking all shards and thus does not scale linearly.
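The difference between the two cases can be illustrated with a minimal sketch. It is not ArangoDB's actual sharding code: the hash function, the shard count, and the names shard_for and shards_to_ask are assumptions made for illustration only.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shards


def shard_for(value, num_shards=NUM_SHARDS):
    """Map a shard-key value to a shard number with a stable hash.

    Illustrative only; the real hash function used by the database may differ.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_to_ask(lookup_key, sharded_by_key, num_shards=NUM_SHARDS):
    """Return the list of shards that a lookup by _key has to contact."""
    if sharded_by_key:
        # The key itself determines the shard, so one shard is enough.
        return [shard_for(lookup_key, num_shards)]
    # Sharded by some other attribute: the key says nothing about the
    # location of the document, so every shard has to be asked.
    return list(range(num_shards))


if __name__ == "__main__":
    print(len(shards_to_ask("user-1", sharded_by_key=True)))
    print(len(shards_to_ask("user-1", sharded_by_key=False)))
```

In the first case the cost of a lookup stays constant as shards are added, whereas in the second case it grows with the number of shards.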

Document store

For the document store case, essentially the same arguments apply even in the presence of secondary indexes, since an index for a sharded collection is simply the same as a local index for each shard. Therefore, single document operations still scale linearly with the size of the cluster, unless a special sharding configuration makes lookups or write operations more expensive.
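The idea of one local index per shard can be sketched as follows. This is a toy in-memory model under stated assumptions, not ArangoDB's implementation: the classes Shard and ShardedCollection, the indexed attribute "city", and the use of CRC32 for shard selection are all hypothetical.

```python
import zlib
from collections import defaultdict


class Shard:
    """One shard: its documents plus a local secondary index."""

    def __init__(self):
        self.docs = {}
        # Local index on the hypothetical attribute "city": value -> set of keys.
        self.by_city = defaultdict(set)

    def insert(self, doc):
        self.docs[doc["_key"]] = doc
        self.by_city[doc["city"]].add(doc["_key"])


class ShardedCollection:
    """A collection sharded by _key, with one local index per shard."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _owner(self, key):
        # Illustrative shard selection by a stable hash of the key.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def insert(self, doc):
        # A single-document write touches one shard and only its local index.
        self._owner(doc["_key"]).insert(doc)

    def get(self, key):
        # A single-document read by _key also touches exactly one shard.
        return self._owner(key).docs.get(key)

    def find_by_city(self, city):
        # A lookup by the indexed attribute consults each shard's local index.
        keys = []
        for shard in self.shards:
            keys.extend(shard.by_city.get(city, ()))
        return keys
```

In this model, insert and get each involve a single shard regardless of how many shards exist, which is the reason single document operations keep scaling with the cluster size.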

Complex queries and joins

The AQL query language allows complex queries that use multiple collections, secondary indexes, and joins. Joins in particular can be a challenge to scale, since a lot of communication has to happen if the data to be joined resides on different machines. The AQL query execution engine organizes a data pipeline across the cluster to put together the results in the most efficient way. The query optimizer is aware of the cluster structure and knows what data is where and how it is indexed. Therefore, it can arrive at an informed decision about what parts of the query ought to run where in the cluster.

Nevertheless, for certain complicated joins, there are limits as to what can be achieved.

Graph database

Graph databases are particularly good at queries on graphs that involve paths of an a priori unknown length. Examples include finding the shortest path between two vertices in a graph, or finding all paths that match a certain pattern starting at a given vertex.
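To make "paths of unknown length" concrete, here is a minimal breadth-first search for a shortest path on an in-memory adjacency list. It is a generic textbook sketch, not the traversal code of ArangoDB; the function name shortest_path and the dictionary representation of the graph are assumptions for illustration.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return a shortest path from start to goal as a list of vertices.

    adjacency maps each vertex to an iterable of its direct successors.
    Returns None if goal is not reachable from start.
    """
    if start == goal:
        return [start]
    parents = {start: None}  # vertex -> predecessor on a shortest path
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor in parents:
                continue  # already reached by a path that is at least as short
            parents[neighbor] = vertex
            if neighbor == goal:
                # Walk the predecessor chain back to the start vertex.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

The number of hops is not known in advance: the search keeps expanding one level at a time until it reaches the goal or runs out of vertices.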

However, if the vertices and edges along the occurring paths are distributed across the cluster, then a lot of communication is necessary between nodes, and performance suffers. To achieve good performance at scale, it is therefore necessary to get the distribution of the graph data across the shards in the cluster right. Most of the time, the application developers and users of ArangoDB know best how their graphs are structured. Therefore, ArangoDB allows users to specify the attributes according to which the graph data is sharded. A useful first step is usually to make sure that the edges originating at a vertex reside on the same cluster node as the vertex.
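The effect of that first step can be sketched with a small placement model. This is an illustration under stated assumptions rather than ArangoDB's sharding logic: the helpers shard_of, place_edges, and shards_holding_out_edges are hypothetical, edges are plain (edge_key, from_vertex, to_vertex) tuples, and vertices are assumed to be sharded by their own key.

```python
import zlib


def shard_of(value, num_shards):
    """Illustrative shard selection by a stable hash of the shard-key value."""
    return zlib.crc32(value.encode("utf-8")) % num_shards


def place_edges(edges, num_shards, by_source):
    """Assign edges to shards and return {shard_number: [edge, ...]}.

    If by_source is True, an edge is sharded by its source vertex, so it lands
    on the same shard as that vertex. Otherwise it is sharded by its own key.
    """
    placement = {shard: [] for shard in range(num_shards)}
    for edge in edges:
        edge_key, source, _target = edge
        shard_value = source if by_source else edge_key
        placement[shard_of(shard_value, num_shards)].append(edge)
    return placement


def shards_holding_out_edges(placement, vertex):
    """Return the set of shards that store at least one outgoing edge of vertex."""
    return {
        shard
        for shard, shard_edges in placement.items()
        if any(source == vertex for _key, source, _target in shard_edges)
    }
```

With by_source=True, all outgoing edges of a vertex are found on the shard that holds the vertex itself, so expanding that vertex during a traversal needs no extra network hop. With by_source=False, the same edges may be scattered over several shards.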