ArangoSearch Views

ArangoSearch Views enable sophisticated information retrieval queries such as full-text search for unstructured or semi-structured data over documents from different collections, filtering on multiple document attributes and sorting the documents that satisfy the search criteria by relevance.

Comparison with the Full-text Index:

| Feature                           | ArangoSearch | Full-text Index |
|-----------------------------------|--------------|-----------------|
| Term search                       | Yes          | Yes             |
| Prefix search                     | Yes          | Yes             |
| Boolean expressions               | Yes          | Restricted      |
| Range search                      | Yes          | No              |
| Phrase search                     | Yes          | No              |
| Relevance ranking                 | Yes          | No              |
| Configurable Analyzers            | Yes          | No              |
| AQL composable language construct | Yes          | No              |
| Indexed attributes per collection | Unlimited    | 1               |
| Indexed collections               | Unlimited    | 1               |

Views guarantee the best execution plan (merge join) when querying multiple attributes, unlike collections with user-defined indexes.

Concept

A View can be understood as an abstraction over a transformation applied to documents of zero or more collections. The transformation is View-implementation specific and may even be as simple as an identity transformation, thus making the View represent all documents available in the specified set of source collections. Currently there is a single supported View implementation, ArangoSearch Views.

ArangoSearch Views combine two information retrieval models: Boolean and generalized ranking retrieval. Each document “approved” by the Boolean model gets its own rank from the ranking model.

For text retrieval, the Vector Space Model (VSM) is used as the ranking model. According to the model, documents and the query are represented as vectors in a space formed by the terms of the query. Typically terms are single words, keywords or even phrases. Value analysis such as splitting text into words (tokenization) and normalizing them is possible with the help of Analyzers. As this is application-dependent, ArangoDB offers configurable Analyzers aside from a set of built-in Analyzers.

The document vectors that are closer to a query vector are more relevant. Practically the closeness is expressed as the cosine of the angle between two vectors, namely cosine similarity.

In order to define how relevant document d is to query q, the following expression is evaluated:

cos α = (d · q) / (|d| |q|)

where d · q is the dot product of the query vector q and the document vector d, |d| is the norm of the vector d, and |q| is the norm of the vector q.

The required vector components have to be computed upfront. Since the space is formed by the terms, term weights can be used as the coordinates. There are a number of probabilistic/statistical weighting models, of which two of the most widely used schemes are implemented for ArangoSearch Views:

  • Okapi BM25
  • TF-IDF

Under the hood both models rely on two main components:

  • Term frequency (TF): in the simplest case defined as the number of times that term t occurs in document d

  • Inverse document frequency (IDF): a measure of how much information the word provides, i.e. whether the term is common or rare across all documents

The searching and ranking capabilities are provided by the IResearch library.
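Both schemes are exposed to AQL as scoring functions. As a minimal sketch (the View name myView and its text attribute are placeholders chosen for illustration), a query could rank matching documents like this:

  FOR doc IN myView
    SEARCH ANALYZER(doc.text IN TOKENS("information retrieval", "text_en"), "text_en")
    SORT BM25(doc) DESC
    RETURN doc

TFIDF(doc) can be used in place of BM25(doc) to rank with the TF-IDF scheme instead.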

Integration

The collections to act as data source need to be linked to a View. An ArangoSearch Link is a uni-directional connection from an ArangoDB collection to an ArangoSearch View, describing how data coming from that collection should be made available in the given View. You can think of it as a data flow from a collection to a View. A View can have zero or more links, each to a distinct ArangoDB collection within a database. The same collections may be linked to other Views too.

ArangoSearch Views are not updated synchronously as the source collections change in order to minimize the performance impact. They are eventually consistent, with a configurable consolidation policy.

Document as well as edge collections can be linked, which means graphs can be treated as flat and interconnected data structures simultaneously. For example, one can find the most relevant vertices by searching and sorting via a View, then do a regular traversal within a specified depth.

Links can be managed by editing the View Definition. It is possible to index all attributes or particular attributes (optionally including nested attributes). Any document attribute at any depth can be indexed. A list of Analyzers to process the values with can be defined for each such field.

The Analyzer(s) defined in the View and the one(s) specified in a query must match to produce a result. For example, if a field was only indexed using the "identity" Analyzer but the search expression compares the value to something using a different Analyzer (e.g. "text_en") then nothing is found.
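The query-side Analyzer is selected with the ANALYZER() function. As a sketch (assuming a View myView whose text field is indexed with the "text_en" Analyzer), a matching query could look like:

  FOR doc IN myView
    SEARCH ANALYZER(doc.text == "quick", "text_en")
    RETURN doc

Omitting ANALYZER() makes the comparison use the "identity" Analyzer, which would not match the tokens indexed with "text_en".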

The elements of arrays are indexed individually by default, as if the source attribute had each element as value at the same time. Strings may get transformed by Analyzers into multiple tokens, which are handled similarly to an array of strings. See the AQL SEARCH operation for details. Primitive values other than strings (null, true, false, numbers) are indexed unchanged. The values of nested objects are optionally indexed under the respective attribute path, including objects in arrays.

Views can be managed in the Web UI, via an HTTP API and through a JavaScript API.

Finally, Views can be queried with AQL via the SEARCH operation.
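As an end-to-end sketch in arangosh (the View name, collection name and field are placeholders), a View can be created, linked to a collection and then queried:

  // create an empty ArangoSearch View
  var view = db._createView("myView", "arangosearch", {});

  // link the existing collection "coll1" and index its "text" attribute with the "text_en" Analyzer
  view.properties({
    links: {
      coll1: {
        fields: { text: { analyzers: [ "text_en" ] } }
      }
    }
  });

  // query the View with AQL
  db._query('FOR doc IN myView SEARCH ANALYZER(doc.text == "quick", "text_en") RETURN doc');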

Primary Sort Order

The index behind an ArangoSearch View can have a primary sort order. A direction can be specified upon View creation for each uniquely named attribute (ascending or descending), to enable an optimization for AQL queries which iterate over a View and sort by one or multiple of the attributes. If the field(s) and the sorting direction(s) match, then the data can be read directly from the index without an actual sort operation.

View definition example:

  {
    "links": {
      "coll1": {
        "fields": {
          "text": {}
        }
      },
      "coll2": {
        "fields": {
          "text": {}
        }
      }
    },
    "primarySort": [
      {
        "field": "text",
        "direction": "asc"
      }
    ]
  }

AQL query example:

  FOR doc IN viewName
    SORT doc.text
    RETURN doc

Execution plan without a sorted index being used:

  Execution plan:
   Id   NodeType            Est.   Comment
    1   SingletonNode          1   * ROOT
    2   EnumerateViewNode      1     - FOR doc IN viewName   /* view query */
    3   CalculationNode        1       - LET #1 = doc.`val`   /* attribute expression */
    4   SortNode               1       - SORT #1 ASC   /* sorting strategy: standard */
    5   ReturnNode             1       - RETURN doc

Execution plan with the primary sort order of the index being utilized:

  Execution plan:
   Id   NodeType            Est.   Comment
    1   SingletonNode          1   * ROOT
    2   EnumerateViewNode      1     - FOR doc IN viewName SORT doc.`val` ASC   /* view query */
    5   ReturnNode             1     - RETURN doc

To define more than one attribute to sort by, simply add more sub-objects to the primarySort array:

  1. "primarySort": [
  2. {
  3. "field": "date",
  4. "direction": "desc"
  5. },
  6. {
  7. "field": "text",
  8. "direction": "asc"
  9. }
  10. ]

The optimization can be applied to View queries which sort by both fields as defined (SORT doc.date DESC, doc.text), but also if they sort in descending order by the date attribute only (SORT doc.date DESC). Queries which sort by text alone (SORT doc.text) are not eligible, because the View is sorted by date first. This is similar to skiplist indexes, but inverted sorting directions are not covered by the View index (e.g. SORT doc.date, doc.text DESC).
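To illustrate these rules with the two-field primarySort from the previous example (the comments restate which queries are covered):

  FOR doc IN viewName SORT doc.date DESC, doc.text RETURN doc   // optimized: matches the primarySort definition
  FOR doc IN viewName SORT doc.date DESC RETURN doc             // optimized: left-most prefix of the primarySort
  FOR doc IN viewName SORT doc.text RETURN doc                  // not optimized: the View is sorted by date first
  FOR doc IN viewName SORT doc.date, doc.text DESC RETURN doc   // not optimized: inverted sorting directions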

Note that the primarySort option is immutable: it can not be changed after View creation. It is therefore not possible to configure it through the Web UI. The View needs to be created via the HTTP or JavaScript API (arangosh) to set it.
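For instance, in arangosh this could be done as follows (the View name and fields are placeholders):

  db._createView("myView", "arangosearch", {
    primarySort: [
      { field: "date", direction: "desc" },
      { field: "text", direction: "asc" }
    ]
  });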

View Definition/Modification

An ArangoSearch View is configured via an object containing a set of View-specific configuration directives and a map of link-specific configuration directives.

During View creation the following directives apply:

  • name (string, immutable): the View name
  • type (string, immutable): the value "arangosearch"
  • any of the directives from the section View Properties

During View modification the following directives apply:

  • links (object, optional): a mapping of collection-name / collection-identifier to one of:

    • link creation - link definition as per the section Link properties
    • link removal - JSON keyword null (i.e. nullify a link if present)
  • any of the directives from the section View Properties
Link Properties

  • analyzers (optional; type: array; subtype: string; default: [ 'identity' ])

A list of Analyzers, by name as defined via the Analyzers, that should be applied to values of processed document attributes.

  • fields (optional; type: object; default: {})

An object { attribute-name: [Link properties], … } of fields that should be processed at each level of the document. Each key specifies the document attribute to be processed. Note that the value of includeAllFields is also consulted when selecting fields to be processed. It is a recursive data structure. Each value specifies the Link properties directives to be used when processing the specified field; a Link properties value of {} denotes inheritance of all (except fields) directives from the current level. A combined link-definition example is shown after this list.

  • includeAllFields (optional; type: boolean; default: false)

If set to true, then process all document attributes. Otherwise, only consider attributes mentioned in fields. Attributes not explicitly specified in fields will be processed with default link properties, i.e. {}.

  • trackListPositions (optional; type: boolean; default: false)

If set to true, then for array values track the value position in arrays. E.g., when querying for the input { attr: [ 'valueX', 'valueY', 'valueZ' ] }, the user must specify: doc.attr[1] == 'valueY'. Otherwise, all values in an array are treated as equal alternatives. E.g., when querying for the input { attr: [ 'valueX', 'valueY', 'valueZ' ] }, the user must specify: doc.attr == 'valueY'.

  • storeValues (optional; type: string; default: "none")

This property controls how the view should keep track of the attribute values. Valid values are:

  • none: Do not store values with the view.
  • id: Store information about value presence to allow use of the EXISTS() function.
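For illustration only (the collection name articles and the attribute names are made up), a links definition combining several of these directives could look like:

  "links": {
    "articles": {
      "includeAllFields": false,
      "storeValues": "id",
      "fields": {
        "title": { "analyzers": [ "text_en" ] },
        "tags": { "trackListPositions": true }
      }
    }
  }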

View Properties

  • primarySort (optional; type: array; default: []; immutable)

A primary sort order can be defined to enable an AQL optimization. If a query iterates over all documents of a View, wants to sort them by attribute values and the (left-most) fields to sort by as well as their sorting direction match with the primarySort definition, then the SORT operation is optimized away. Also see Primary Sort Order.

An inverted index is the heart of ArangoSearch Views. The index consists of several independent segments and the index segment itself is meant to be treated as a standalone index. Commit is meant to be treated as the procedure of accumulating processed data creating new index segments. Consolidation is meant to be treated as the procedure of joining multiple index segments into a bigger one and removing garbage documents (e.g. deleted from a collection). Cleanup is meant to be treated as the procedure of removing unused segments after release of internal resources.

  • cleanupIntervalStep (optional; type: integer; default: 2; to disable use: 0)

ArangoSearch waits at least this many commits between removing unused files in its data directory. For the case where the consolidation policies merge segments often (i.e. a lot of commit+consolidate), a lower value will cause a lot of disk space to be wasted. For the case where the consolidation policies rarely merge segments (i.e. few inserts/deletes), a higher value will impact performance without any added benefits.

With every commit or consolidate operation a new state of the View-internal data-structures is created on disk. Old states/snapshots are released once there are no longer any users remaining. However, the files for the released states/snapshots are left on disk, and only removed by the “cleanup” operation.

  • commitIntervalMsec (optional; type: integer; default: 1000; to disable use: 0)

Wait at least this many milliseconds between committing View data store changes and making documents visible to queries (default: 1000, to disable use: 0). For the case where there are a lot of inserts/updates, a lower value, until commit, will cause the index not to account for them and memory usage would continue to grow. For the case where there are a few inserts/updates, a higher value will impact performance and waste disk space for each commit call without any added benefits.

For data retrieval ArangoSearch Views follow the concept of “eventually-consistent”, i.e. eventually all the data in ArangoDB will be matched by corresponding query expressions. The concept of the ArangoSearch View “commit” operation is introduced to control the upper-bound on the time until document additions/removals are actually reflected by corresponding query expressions. Once a “commit” operation is complete, all documents added/removed prior to the start of the “commit” operation will be reflected by queries invoked in subsequent ArangoDB transactions, while in-progress ArangoDB transactions will still continue to return a repeatable-read state.

  • consolidationIntervalMsec (optional; type: integer; default: 60000; to disable use: 0)

ArangoSearch waits at least this many milliseconds between applying the consolidationPolicy to consolidate the View data store and possibly release space on the filesystem. A lower value causes consolidation to be applied more often, which adds CPU and I/O load, while a higher value lets more small segments and removed documents accumulate before they are merged.

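Taken together, the maintenance intervals could be expressed in a View properties object like the following sketch (the values shown are simply the defaults listed above):

  {
    "cleanupIntervalStep": 2,
    "commitIntervalMsec": 1000,
    "consolidationIntervalMsec": 60000
  }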

ArangoSearch performs operations in its index based on numerous writer objects that are mapped to processed segments. In order to control memory that is used by these writers (in terms of a “writers pool”) one can use the writebuffer* properties of a View.

  • writebufferIdle (optional; type: integer; default: 64; to disable use: 0; immutable)

Maximum number of writers (segments) cached in the pool.

  • writebufferActive (optional; type: integer; default: 0; to disable use: 0; immutable)

Maximum number of concurrent active writers (segments) that perform a transaction. Other writers (segments) wait until the current active writers (segments) finish.

  • writebufferSizeMax (optional; type: integer; default: 33554432; to disable use: 0; immutable)

Maximum memory byte size per writer (segment) before a writer (segment) flush is triggered. A value of 0 turns off this limit for any writer (buffer) and data will be flushed periodically based on the value defined for the flush thread (ArangoDB server startup option). A value of 0 should be used carefully due to potentially high memory consumption.
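Since these settings are immutable, they have to be passed at View creation time. A properties sketch using the defaults listed above:

  {
    "writebufferIdle": 64,
    "writebufferActive": 0,
    "writebufferSizeMax": 33554432
  }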

  • consolidationPolicy (optional; type: object; default: {})

The consolidation policy to apply for selecting data store segment merge candidates.

With each ArangoDB transaction that inserts documents, one or more ArangoSearch internal segments get created. Similarly, for removed documents the segments containing such documents will have these documents marked as “deleted”. Over time this approach causes a lot of small and sparse segments to be created. A consolidation operation selects one or more segments and copies all of their valid documents into a single new segment, thereby allowing the search algorithm to perform more optimally and for extra file handles to be released once old segments are no longer used.

  • type (optional; type: string; default: "bytes_accum")

The segment candidates for the “consolidation” operation are selected based upon several possible configurable formulas as defined by their types. The currently supported types are:

  • "bytes_accum": Consolidation is performed based on current memory consumption of segments and threshold property value.
  • "tier": Consolidate based on segment byte size and live document count as dictated by the customization attributes.

consolidationPolicy properties for "bytes_accum" type:

  • threshold (optional; type: float; default: 0.1)

Defines a threshold value in the possible range [0.0, 1.0]. Consolidation is performed on segments whose accumulated size in bytes is less than all segments’ byte size multiplied by the threshold; i.e. the following formula is applied for each segment: {threshold} > (segment_bytes + sum_of_merge_candidate_segment_bytes) / all_segment_bytes.

consolidationPolicy properties for "tier" type:

  • segmentsMin (optional; type: integer; default: 1)

The minimum number of segments that will be evaluated as candidates for consolidation.

  • segmentsMax (optional; type: integer; default: 10)

The maximum number of segments that will be evaluated as candidates for consolidation.

  • segmentsBytesMax (optional; type: integer; default: 5368709120)

Maximum allowed size of all consolidated segments in bytes.

  • segmentsBytesFloor (optional; type: integer; default: 2097152)

Defines the value (in bytes) to treat all smaller segments as equal for consolidation selection.
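As an example, a consolidationPolicy definition of the "tier" type using the default values listed above would look like this:

  "consolidationPolicy": {
    "type": "tier",
    "segmentsMin": 1,
    "segmentsMax": 10,
    "segmentsBytesMax": 5368709120,
    "segmentsBytesFloor": 2097152
  }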