Configuration

The following describes the available embeddings configuration. These parameters are set via the Embeddings constructor.

format

  format: pickle|json

Sets the configuration storage format. Defaults to pickle.
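To give a feel for the difference between the two formats, the sketch below serializes the same configuration dictionary with Python's standard pickle and json modules. It is not txtai's own save routine, and the values in `config` are purely illustrative.

```python
import json
import pickle

# Illustrative configuration; the values are assumptions for this sketch
config = {"format": "json", "batch": 500, "tokenize": False}

# pickle produces an opaque binary payload
binary = pickle.dumps(config)

# json produces human-readable text that can be inspected or diffed
text = json.dumps(config, sort_keys=True)

# Both formats restore the same dictionary
restored_pickle = pickle.loads(binary)
restored_json = json.loads(text)
```

The json form is easier to read and edit by hand, while pickle can hold arbitrary Python objects.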

path

  path: string

Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the Hugging Face Hub or a local file path. Otherwise, it must be a local file path to a word embeddings model.
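As a minimal sketch, a configuration that points at a Hugging Face Hub model might look like the following. The model name is only an example choice and is not prescribed by this document.

```python
# Example configuration. The model name is an illustrative choice; a local
# file path to a model can be used instead.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "format": "json",
}

# With txtai installed, this dictionary is passed to the constructor:
#   from txtai.embeddings import Embeddings
#   embeddings = Embeddings(config)
```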

method

  method: transformers|sentence-transformers|words|external

Sentence embeddings method to use. If the method is not provided, it is inferred using the path.

sentence-transformers and words require the similarity extras package to be installed.

transformers

Builds sentence embeddings using a transformers model. While this can be any transformers model, it works best with models trained to build sentence embeddings.

sentence-transformers

Same as transformers but loads models with the sentence-transformers library.

words

Builds sentence embeddings using a word embeddings model. Transformers models are the preferred vector backend in most cases. Word embeddings models may be deprecated in the future.

storevectors

  storevectors: boolean

Enables copying of a vectors model set in path into the embeddings model's output directory on save. This option enables a fully encapsulated index with no external file dependencies.

scoring

  scoring: bm25|tfidf|sif

A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
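The idea of a weighted average versus a plain mean can be sketched in a few lines of pure Python. This is a conceptual illustration only, not the library's implementation; the `weighted_average` helper and the toy vectors are hypothetical.

```python
def weighted_average(vectors, weights=None):
    """Pools word vectors into one sentence vector.

    With weights (for example, per-word BM25 or TF-IDF scores) this is a
    weighted average; without weights it reduces to a plain mean.
    """
    if weights is None:
        weights = [1.0] * len(vectors)

    total = sum(weights)
    dimensions = len(vectors[0])

    return [
        sum(weight * vector[i] for vector, weight in zip(vectors, weights)) / total
        for i in range(dimensions)
    ]


# Two toy 2-dimensional word vectors
vectors = [[1.0, 0.0], [0.0, 1.0]]

mean = weighted_average(vectors)                  # [0.5, 0.5]
weighted = weighted_average(vectors, [3.0, 1.0])  # [0.75, 0.25]
```

A higher weight pulls the sentence vector toward that word, which is how rare or informative terms can dominate the result.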

pca

  pca: int

Removes n principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. This step is applied after the word vectors are pooled into a single sentence embedding.
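Component removal can be sketched with NumPy as follows. A plain, uncentered SVD stands in here for the TruncatedSVD model mentioned above, and the `remove_components` helper is hypothetical, so treat this as an illustration of the idea rather than the library's code.

```python
import numpy as np


def remove_components(embeddings, n):
    """Removes the projection onto the top n principal components."""
    # Rows of vt are the right singular vectors, ordered by singular value
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    components = vt[:n]

    # Subtract each embedding's projection onto those directions
    return embeddings - embeddings @ components.T @ components


# Toy data: most of the variance lies along the first axis
embeddings = np.array([[3.0, 1.0], [-3.0, 1.0], [3.0, -1.0], [-3.0, -1.0]])
cleaned = remove_components(embeddings, 1)
```

After removal, the dominant first axis is zeroed out and only the variation along the second axis remains.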

external

Sentence embeddings are loaded via an external model or API. Requires setting the transform parameter to a function that translates data into vectors.

transform

  transform: function

When method is external, this function transforms input content into embeddings. The input to this function is a list of data. This function must return either a numpy array or a list of numpy arrays.
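A transform function could be sketched as below. The hashing logic is a hypothetical stand-in for a real external model or API call and produces vectors with no semantic meaning; only the input and output shapes follow the description above.

```python
import hashlib

import numpy as np


def transform(inputs):
    """Hypothetical stand-in for a call to an external model or API.

    Takes a list of data and returns one vector per input as a numpy array.
    """
    vectors = []
    for text in inputs:
        # Derive a deterministic toy vector from a hash of the text
        digest = hashlib.sha256(str(text).encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:8]])

    return np.array(vectors, dtype=np.float32)


# Configuration wiring the function in as the external vector method
config = {"method": "external", "transform": transform}

vectors = transform(["first document", "second document"])
```

In practice the body of `transform` would call the external service and convert its response into an array with one row per input.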

batch

  batch: int

Sets the transform batch size. This parameter controls how input streams are chunked and vectorized.
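Chunking an input stream into fixed-size batches can be sketched as follows. This shows the general technique only; the `batches` helper is hypothetical and is not taken from the library.

```python
from itertools import islice


def batches(stream, size):
    """Yields lists of up to `size` items from any iterable."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# A generator stands in for a large input stream
stream = (f"document {i}" for i in range(5))
chunks = list(batches(stream, 2))
# chunks holds three lists with 2, 2 and 1 documents
```

Because the stream is consumed lazily, only one batch needs to be held in memory at a time.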

encodebatch

  encodebatch: int

Sets the encode batch size. This parameter controls the underlying vector model batch size. This often corresponds to a GPU batch size, which controls GPU memory usage.

tokenize

  tokenize: boolean

Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of English language sentence embeddings in some situations.

instructions

  instructions:
      query: prefix for queries
      data: prefix for indexing

Instruction-based models use prefixes to modify how embeddings are computed. This is especially useful with asymmetric search, where the query and indexed data are of vastly different lengths, such as short queries against long documents.

E5-base is an example of a model that accepts instructions. It takes query: and passage: prefixes and uses those to generate embeddings that work well for asymmetric search.
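The effect of these prefixes can be sketched as follows. The `with_prefix` helper is hypothetical, and the trailing space in each prefix is an assumption about how the prefix and text are joined.

```python
# Prefixes follow the E5 convention described above
config = {
    "instructions": {
        "query": "query: ",
        "data": "passage: ",
    }
}


def with_prefix(texts, kind, instructions):
    """Prepends the configured prefix for `kind` ("query" or "data")."""
    prefix = instructions.get(kind, "")
    return [prefix + text for text in texts]


queries = with_prefix(["capital of France"], "query", config["instructions"])
passages = with_prefix(
    ["Paris is the capital of France."], "data", config["instructions"]
)
```

The model then sees different prefixes at query time and at index time, which lets it embed the two kinds of text differently.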

backend

  backend: faiss|hnsw|annoy|custom

Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to faiss. Additional backends require the similarity extras package to be installed. Add custom backends by setting this parameter to the fully resolvable class string.

Backend-specific settings are set with a corresponding configuration object having the same name as the backend (i.e. annoy, faiss, or hnsw). None of these are required and are set to defaults if omitted.
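For instance, selecting a backend and overriding some of its settings might look like the sketch below. The model name is an illustrative choice, and the hnsw values shown simply restate that backend's documented defaults.

```python
config = {
    # Illustrative model choice
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "backend": "hnsw",
    # Settings object named after the backend; omitted keys use defaults
    "hnsw": {
        "efconstruction": 200,
        "m": 16,
    },
}
```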

faiss

  faiss:
      components: comma separated list of components - defaults to "Flat" for small
                  indexes and "IVFx,Flat" for larger indexes where
                  x = 4 * sqrt(embeddings count)
      nprobe: search probe setting (int) - defaults to x/16 (as defined above)
              for larger indexes
      quantize: store vectors with 8-bit precision vs 32-bit (boolean)
                defaults to false
      mmap: load as on-disk index (boolean) - trade query response time for a
            smaller RAM footprint, defaults to false
      sample: fraction of data to use for model training (0.0 - 1.0)
              reduces indexing time for larger (>1M row) indexes, defaults to 1.0

See the Faiss documentation for more information on these parameters.
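The default components and nprobe formulas can be worked through in a short sketch. The `small` cutoff and the integer truncation are assumptions made for illustration; the text above does not state either.

```python
import math


def faiss_defaults(count, small=5000):
    """Returns (components, nprobe) following the formulas above.

    `small` is a hypothetical cutoff between "small" and "larger" indexes;
    the documentation above does not state the actual value.
    """
    if count <= small:
        return "Flat", None

    # x = 4 * sqrt(embeddings count), truncated to an integer (assumption)
    x = int(4 * math.sqrt(count))

    # nprobe defaults to x/16, truncated to an integer (assumption)
    return f"IVF{x},Flat", x // 16


components, nprobe = faiss_defaults(1_000_000)
# components == "IVF4000,Flat", nprobe == 250
```

For one million embeddings, sqrt(1,000,000) = 1000, so x = 4000 and nprobe = 4000 / 16 = 250.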

hnsw

  hnsw:
      efconstruction: ef_construction param for init_index (int) - defaults to 200
      m: M param for init_index (int) - defaults to 16
      randomseed: random_seed param for init_index (int) - defaults to 100
      efsearch: ef search param (int) - defaults to None and not set

See Hnswlib documentation for more information on these parameters.

annoy

  annoy:
      ntrees: number of trees (int) - defaults to 10
      searchk: search_k search setting (int) - defaults to -1

See Annoy documentation for more information on these parameters. Note that annoy indexes cannot be modified after creation; upserts, deletes and other modifications are not supported.

content

  content: boolean|sqlite|duckdb|custom

Enables content storage. When true, the default storage engine, sqlite, will be used. Also supports duckdb. Add custom storage engines by setting this parameter to the fully resolvable class string.

functions

  functions: list

List of user-defined SQL functions, only used when content is enabled. Each list element must be one of the following:

  • function
  • callable object
  • dict with fields for name, argcount and function

An example can be found here.
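As a rough sketch, the three accepted element types could be written as shown below. The function names are hypothetical. The dict fields happen to line up with the arguments of Python's `sqlite3` `create_function`, which is used here only to show what each field means; whether the library registers functions this way is an assumption.

```python
import sqlite3


def textlength(text):
    """Plain function: number of characters in a text value."""
    return len(text)


class WordCount:
    """Callable object: number of whitespace-separated words."""

    def __call__(self, text):
        return len(text.split())


def shout(text):
    """Upper-cases a text value."""
    return text.upper()


# The three accepted element types: function, callable object and dict
functions = [
    textlength,
    WordCount(),
    {"name": "shout", "argcount": 1, "function": shout},
]

# Register the dict entry directly with SQLite to show what each field means
connection = sqlite3.connect(":memory:")
entry = functions[2]
connection.create_function(entry["name"], entry["argcount"], entry["function"])

result = connection.execute("SELECT shout('hello')").fetchone()[0]
connection.close()
```

The dict form is useful when the SQL name or argument count should be set explicitly.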

query

  query:
      path: sets the path for the query model - this can be any model on the
            Hugging Face Model Hub or a local file path.
      prefix: text prefix to prepend to all inputs
      maxlength: maximum generated sequence length

Query translation model. Translates natural language queries to txtai compatible SQL statements.

graph

  graph:
      backend: graph network backend (string), defaults to "networkx"
      batchsize: batch query size, used to query embeddings index (int)
                 defaults to 256
      limit: maximum number of results to return per embeddings query (int)
             defaults to 15
      minscore: minimum score required to consider embeddings query matches (float)
                defaults to 0.1
      approximate: when true, queries only run for nodes without edges (boolean)
                   defaults to true
      topics: see below

Enables graph storage. When set, a graph network is built using the embeddings index. Graph nodes are synced with each embeddings index operation (index/upsert/delete). Graph edges are created using the embeddings index upon completion of each index/upsert/delete embeddings index call.

Add custom graph storage engines by setting the graph.backend parameter to the fully resolvable class string.

Defaults are tuned so that in most cases these values don’t need to be changed.

topics

  topics:
      algorithm: community detection algorithm (string), options are
                 louvain (default), greedy, lpa
      level: controls number of topics (string), options are best (default) or first
      resolution: controls number of topics (int), larger values create more
                  topics, defaults to 100
      labels: scoring index method used to build topic labels (string)
              options are bm25 (default), tfidf, sif
      terms: number of frequent terms to use for topic labels (int), defaults to 4
      stopwords: optional list of stop words to exclude from topic labels
      categories: optional list of categories used to group topics, allows
                  granular topics with broad categories grouping topics

Enables topic modeling. Defaults are tuned so that in most cases these values don’t need to be changed (except for categories). These parameters are available for advanced use cases where one wants full control over the community detection process.
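Putting the graph and topics settings together, a configuration might be sketched as follows. The model name and the category names are hypothetical; the numeric values restate the documented defaults.

```python
config = {
    # Illustrative model choice
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "graph": {
        "limit": 15,
        "minscore": 0.1,
        "topics": {
            # Hypothetical category names used to group topics
            "categories": ["health", "technology", "finance"],
        },
    },
}
```

Only categories is set under topics here, since the remaining topic settings are tuned to work with their defaults.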