Detailed overview of ArangoSearch Views

ArangoSearch is a powerful fulltext search component with additional functionality, supported via the text analyzer and tfidf / bm25 scorers, without impact on performance when specifying documents from different collections or filtering on multiple document attributes.

View datasource

Search functionality is exposed to ArangoDB via the view API for views of type arangosearch. The ArangoSearch View is merely an identity transformation applied onto documents stored in linked collections of the same ArangoDB database. In plain terms, an ArangoSearch View only allows filtering and sorting of documents located in collections of the same database. The matching documents themselves are returned as-is from their corresponding collections.

The concept of an ArangoDB collection ‘link’ is introduced to allow specifying which ArangoDB collections a given ArangoSearch View should query for documents and how these documents should be queried.

An ArangoSearch Link is a uni-directional connection from an ArangoDB collection to an ArangoSearch View describing how data coming from the said collection should be made available in the given view. Each ArangoSearch Link in an ArangoSearch View is uniquely identified by the name of the ArangoDB collection it links to. An ArangoSearch View may have zero or more links, each to a distinct ArangoDB collection. Similarly, an ArangoDB collection may be referenced via links by zero or more distinct ArangoSearch Views. In other words, any given ArangoSearch View may be linked to any given ArangoDB collection of the same database with zero or one link. However, any ArangoSearch View may be linked to multiple distinct ArangoDB collections, and similarly any ArangoDB collection may be referenced by multiple ArangoSearch Views.

To configure an ArangoSearch View for consideration of documents from a given ArangoDB collection, a link definition must be added to the properties of the said ArangoSearch View, defining the link parameters as per the section View definition/modification.

Index

The inverted index is the heart of ArangoSearch. The index consists of several independent segments, and the index segment itself is meant to be treated as a standalone index.

Analyzers

To simplify query syntax, ArangoSearch provides a concept of named analyzers, which are merely aliases for type+configuration of IResearch analyzers.

View definition/modification

An ArangoSearch View is configured via an object containing a set of view-specific configuration directives and a map of link-specific configuration directives.

During view creation the following directives apply:

  • name (required; type: string): the view name
  • type (required; type: string): the value "arangosearch"
  • any of the directives from the section View properties

During view modification the following directives apply:

  • links (optional; type: object): a mapping of collection-name/collection-identifier to one of:

    • link creation - link definition as per the section Link properties
    • link removal - JSON keyword null (i.e. nullify a link if present)
  • any of the directives from the section View properties
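
As an illustration of the directives above, the following minimal sketch builds the JSON bodies for view creation and view modification. The helper names (make_create_body, make_modify_body), the view name demoView and the collection names articles and drafts are hypothetical, and the sketch only constructs the payloads; how they are sent to the server is left out.

```python
import json


def make_create_body(name, properties=None):
    """Build a creation body: name and type are required directives."""
    body = {"name": name, "type": "arangosearch"}
    body.update(properties or {})  # optional view properties
    return body


def make_modify_body(add_links=None, remove_links=(), properties=None):
    """Build a modification body with link creations and link removals."""
    body = dict(properties or {})
    links = {}
    for collection, link_definition in (add_links or {}).items():
        links[collection] = link_definition  # link creation
    for collection in remove_links:
        links[collection] = None  # link removal, serialized as JSON null
    if links:
        body["links"] = links
    return body


# Hypothetical view and collection names, for illustration only.
create = make_create_body("demoView", {"cleanupIntervalStep": 10})
modify = make_modify_body(
    add_links={"articles": {"includeAllFields": True}},
    remove_links=["drafts"],
)
print(json.dumps(create))
print(json.dumps(modify))
```

Mapping a collection name to None (JSON null) is what distinguishes a link removal from a link creation in the same links object.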

View properties

The following terminology from the ArangoSearch architecture is needed to understand the view properties:

The index consists of several independent segments, and the index segment itself is meant to be treated as a standalone index. Commit is meant to be treated as the procedure of accumulating processed data and creating new index segments. Consolidation is meant to be treated as the procedure of joining multiple index segments into a bigger one and removing garbage documents (e.g. deleted from a collection). Cleanup is meant to be treated as the procedure of removing unused segments after release of internal resources.

  • cleanupIntervalStep (optional; type: integer; default: 10; to disable use: 0)

ArangoSearch waits at least this many commits between removing unused files in its data directory for the case where the consolidation policies merge segments often (i.e. a lot of commit+consolidate). A lower value will cause a lot of disk space to be wasted for the case where the consolidation policies rarely merge segments (i.e. few inserts/deletes). A higher value will impact performance without any added benefits.

With every commit or consolidate operation a new state of the view's internal data structures is created on disk. Old states/snapshots are released once there are no longer any users remaining. However, the files for the released states/snapshots are left on disk, and only removed by the “cleanup” operation.

  • consolidationIntervalMsec (optional; type: integer; default: 60000; to disable use: 0)

ArangoSearch waits at least this many milliseconds between committing view data store changes and making documents visible to queries. A lower value will cause the view not to account for them (until commit), and memory usage would continue to grow for the case where there are a few inserts/updates. A higher value will impact performance and waste disk space for each commit call without any added benefits.

For data retrieval ArangoSearch Views follow the concept of “eventually-consistent”, i.e. eventually all the data in ArangoDB will be matched by corresponding query expressions. The concept of an ArangoSearch View “commit” operation is introduced to control the upper bound on the time until document additions/removals are actually reflected by corresponding query expressions. Once a commit operation is complete, all documents added/removed prior to the start of the commit operation will be reflected by queries invoked in subsequent ArangoDB transactions, while in-progress ArangoDB transactions will still continue to return a repeatable-read state.

ArangoSearch performs operations in its index based on numerous writer objects that are mapped to processed segments. In order to control the memory that is used by these writers (in terms of a “writers pool”) one can use the writebuffer* properties of a view.

  • writebufferIdle (optional; type: integer; default: 64; to disable use: 0)

Maximum number of writers (segments) cached in the pool.

  • writebufferActive (optional; type: integer; default: 0; to disable use: 0)

Maximum number of concurrent active writers (segments) that perform a transaction. Other writers (segments) wait until the current active writers (segments) finish.

  • writebufferSizeMax (optional; type: integer; default: 33554432; to disable use: 0)

Maximum memory byte size per writer (segment) before a writer (segment) flush is triggered. A value of 0 turns off this limit for any writer (buffer), and data will be flushed periodically based on the value defined for the flush thread (ArangoDB server startup option). A value of 0 should be used carefully due to high potential memory consumption.
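
The integer view properties described so far can be summarized with their documented defaults. The following sketch merges user overrides onto those defaults and rejects unknown names or negative values. The helper effective_properties and its validation rules are assumptions made for illustration, not server behavior.

```python
# Defaults as listed in this section; 0 disables the respective limit.
DEFAULT_VIEW_PROPERTIES = {
    "cleanupIntervalStep": 10,
    "consolidationIntervalMsec": 60000,
    "writebufferIdle": 64,
    "writebufferActive": 0,
    "writebufferSizeMax": 33554432,  # 32 MiB per writer (segment)
}


def effective_properties(overrides):
    """Return the defaults with overrides applied (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_VIEW_PROPERTIES)
    if unknown:
        raise KeyError(f"unknown view properties: {sorted(unknown)}")
    for key, value in overrides.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return {**DEFAULT_VIEW_PROPERTIES, **overrides}


# Turning off the per-writer memory limit; to be used carefully.
props = effective_properties({"writebufferSizeMax": 0})
print(props["writebufferSizeMax"], props["writebufferIdle"])
```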

  • consolidationPolicy (optional; type: object; default: {})

The consolidation policy to apply for selecting data store segment merge candidates.

With each ArangoDB transaction that inserts documents, one or more ArangoSearch internal segments get created. Similarly, for removed documents the segments containing such documents will have these documents marked as “deleted”. Over time this approach causes a lot of small and sparse segments to be created. A consolidation operation selects one or more segments and copies all of their valid documents into a single new segment, thereby allowing the search algorithm to perform more optimally and extra file handles to be released once old segments are no longer used.

  • type (optional; type: string; default: "bytes_accum")

The segment candidates for the “consolidation” operation are selected based upon several possible configurable formulas as defined by their types. The currently supported types are:

  1. bytes_accum: Consolidation is performed based on the current memory consumption of segments and the threshold property value.
  2. tier: Consolidate based on segment byte size and live document count as dictated by the customization attributes.

consolidationPolicy properties for bytes_accum type

  • threshold (optional; type: float; default: 0.1)

Defines a threshold value in the range [0.0, 1.0]. Consolidation is performed on segments whose accumulated size in bytes is less than all segments’ byte size multiplied by the threshold; i.e. the following formula is applied for each segment: {threshold} > (segment_bytes + sum_of_merge_candidate_segment_bytes) / all_segment_bytes.
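
The formula can be exercised with a small sketch. It assumes segments are visited from smallest to largest and that each accepted segment adds to the running sum of merge candidates; the visiting order and the function name select_bytes_accum are assumptions, as the text above only states the per-segment condition.

```python
def select_bytes_accum(segment_bytes, threshold=0.1):
    """Return indexes of segments selected as merge candidates.

    Applies: threshold > (segment_bytes + sum_of_candidates) / all_bytes.
    The smallest-first visiting order is an assumption of this sketch.
    """
    all_segment_bytes = sum(segment_bytes)
    if all_segment_bytes == 0:
        return []
    candidates = []
    candidate_bytes = 0  # sum_of_merge_candidate_segment_bytes
    ordered = sorted(enumerate(segment_bytes), key=lambda pair: pair[1])
    for index, size in ordered:
        if threshold > (size + candidate_bytes) / all_segment_bytes:
            candidates.append(index)
            candidate_bytes += size
    return candidates


# Four segments totalling 1000 bytes; only the two small ones qualify,
# since (10 + 0) / 1000 and (20 + 10) / 1000 are both below 0.1.
print(select_bytes_accum([500, 10, 20, 470], threshold=0.1))  # [1, 2]
```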

consolidationPolicy properties for tier type

  • segmentsMin (optional; type: integer; default: 1)

The minimum number of segments that will be evaluated as candidates for consolidation.

  • segmentsMax (optional; type: integer; default: 10)

The maximum number of segments that will be evaluated as candidates for consolidation.

  • segmentsBytesMax (optional; type: integer; default: 5368709120)

Maximum allowed size of all consolidated segments in bytes.

  • segmentsBytesFloor (optional; type: integer; default: 2097152)

Defines the value (in bytes) to treat all smaller segments as equal for consolidation selection.

  • lookahead (optional; type: integer; default: 18446744073709552000)

The number of additionally searched tiers, beyond the candidates initially chosen based on segmentsMin, segmentsMax, segmentsBytesMax and segmentsBytesFloor, with respect to the defined values. The default value is treated as searching among all existing segments.
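
To make the tier attributes more concrete, the sketch below shows one possible reading of segmentsBytesFloor and of the count and size limits. The helper names floored_size and within_limits are hypothetical, the interpretation is an assumption rather than the actual selection algorithm, and lookahead is deliberately left out.

```python
# Documented defaults for the tier policy (lookahead omitted).
TIER_DEFAULTS = {
    "type": "tier",
    "segmentsMin": 1,
    "segmentsMax": 10,
    "segmentsBytesMax": 5368709120,  # 5 GiB
    "segmentsBytesFloor": 2097152,   # 2 MiB
}


def floored_size(segment_bytes, floor=TIER_DEFAULTS["segmentsBytesFloor"]):
    """Treat every segment smaller than the floor as if it had floor size."""
    return max(segment_bytes, floor)


def within_limits(candidate_sizes, policy=TIER_DEFAULTS):
    """Check a candidate set against the count and total size limits."""
    count = len(candidate_sizes)
    if not policy["segmentsMin"] <= count <= policy["segmentsMax"]:
        return False
    return sum(candidate_sizes) <= policy["segmentsBytesMax"]


# A 1 KiB and a 1 MiB segment compare as equal once floored.
print(floored_size(1024) == floored_size(1024 * 1024))  # True
print(within_limits([1024, 2048]))                      # True
```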

Link properties

  • analyzers (optional; type: array; subtype: string; default: ['identity'])

A list of analyzers, by name as defined via the section Analyzers, that should be applied to values of processed document attributes.

  • fields (optional; type: object; default: {})

An object {attribute-name: [Link properties]} of fields that should be processed at each level of the document. Each key specifies the document attribute to be processed. Note that the value of includeAllFields is also consulted when selecting fields to be processed. Each value specifies the Link properties directives to be used when processing the specified field; a Link properties value of {} denotes inheritance of all (except fields) directives from the current level.

  • includeAllFields (optional; type: boolean; default: false)

If set to true, then process all document attributes. Otherwise, only consider attributes mentioned in fields. Attributes not explicitly specified in fields will be processed with default link properties, i.e. {}.
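
The interplay of fields and includeAllFields can be modeled with a short sketch. It is a simplified, single-level model: the function processed_attributes and the sample document are hypothetical, and nested fields are not resolved.

```python
def processed_attributes(document, link):
    """Map each processed top-level attribute to its link properties."""
    fields = link.get("fields", {})
    include_all = link.get("includeAllFields", False)
    result = {}
    for attribute in document:
        if attribute in fields:
            result[attribute] = fields[attribute]  # explicit directives
        elif include_all:
            result[attribute] = {}  # default link properties
    return result


document = {"title": "Views", "body": "An overview", "hits": 3}
link = {"fields": {"title": {"analyzers": ["identity"]}}}

# Only attributes named in fields are considered by default.
print(processed_attributes(document, link))
# With includeAllFields, the remaining attributes are processed with {}.
print(processed_attributes(document, {**link, "includeAllFields": True}))
```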

  • trackListPositions (optional; type: boolean; default: false)

If set to true, then for array values track the value position in arrays. E.g., when querying for the input { attr: [ 'valueX', 'valueY', 'valueZ' ] }, the user must specify: doc.attr[1] == 'valueY'. Otherwise, all values in an array are treated as equal alternatives. E.g., when querying for the input { attr: [ 'valueX', 'valueY', 'valueZ' ] }, the user must specify: doc.attr == 'valueY'.
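
The two matching modes can be mimicked with a toy model. The function matches is hypothetical and only imitates the semantics described above; it is not how the index evaluates queries.

```python
def matches(values, wanted, position=None, track_list_positions=False):
    """Toy model of array matching with and without position tracking."""
    if track_list_positions:
        # The query must name the position, as in doc.attr[1] == 'valueY'.
        if position is None:
            return False
        return 0 <= position < len(values) and values[position] == wanted
    # All array values are equal alternatives: doc.attr == 'valueY'.
    return wanted in values


attr = ["valueX", "valueY", "valueZ"]
print(matches(attr, "valueY"))                                         # True
print(matches(attr, "valueY", position=1, track_list_positions=True))  # True
print(matches(attr, "valueY", position=0, track_list_positions=True))  # False
```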

  • storeValues (optional; type: string; default: "none")

This property controls how the view should keep track of the attribute values. Valid values are:

  • none: Do not store values with the view.
  • id: Store information about value presence to allow use of the EXISTS() function.