Language and Tokenization

Tokenization

Tokenization is the act of taking a sentence or phrase and splitting it into smaller units of language, called tokens. It is the first step of document indexing in the MeiliSearch engine, and a critical factor in the quality of search results.
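To illustrate the concept, here is a deliberately naive tokenizer written in Rust (the language MeiliSearch is implemented in). It treats every run of alphanumeric characters as one token; this is only a sketch of the idea, not how the engine's tokenizer actually works:

```rust
fn main() {
    let phrase = "The quick brown fox jumps over the lazy dog.";

    // Naive tokenization: every maximal run of alphanumeric
    // characters becomes one token; everything else is a separator.
    let tokens: Vec<&str> = phrase
        .split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .collect();

    println!("{:?}", tokens);
    // ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
}
```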

Breaking sentences into smaller chunks requires understanding where one word ends and another begins, making tokenization a highly complex and language-dependent task. MeiliSearch’s solution to this problem is a modular tokenizer that follows different processes, called pipelines, based on the language it detects.

This allows MeiliSearch to function in several different languages with zero setup.

Deep Dive: The Meili Tokenizer

When you add documents to a MeiliSearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g. Latin alphabet, Chinese hanzi, etc.) present there. Then, it applies the corresponding pipeline to each field.

We can break down the tokenization process like so:

  1. Crawl the document(s) and determine the primary language for each field.
  2. Go back over the documents field by field, running the corresponding tokenization pipeline, if it exists (see the sketch after this list).
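To make these two steps concrete, here is a simplified Rust sketch of the dispatch logic. Everything in it (the `Script` enum, the counting heuristic, the pipeline functions) is a hypothetical stand-in for illustration, not the engine's actual implementation:

```rust
enum Script {
    Latin,
    Han, // Chinese hanzi
}

// Step 1: guess the dominant script of a field by counting which
// Unicode range most of its characters fall into (a toy heuristic).
fn detect_script(field: &str) -> Script {
    let han = field
        .chars()
        .filter(|c| ('\u{4E00}'..='\u{9FFF}').contains(c))
        .count();
    let latin = field.chars().filter(|c| c.is_ascii_alphabetic()).count();
    if han > latin { Script::Han } else { Script::Latin }
}

// Step 2: run the pipeline corresponding to the detected script.
fn tokenize_field(field: &str) -> Vec<String> {
    match detect_script(field) {
        Script::Han => chinese_pipeline(field),
        Script::Latin => default_pipeline(field),
    }
}

// Stand-in for the default pipeline: split on non-alphanumeric characters.
fn default_pipeline(field: &str) -> Vec<String> {
    field
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

// Stand-in for the Chinese pipeline: a real one would call a word
// segmenter such as Jieba; here each ideograph naively becomes a token.
fn chinese_pipeline(field: &str) -> Vec<String> {
    field
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    println!("{:?}", tokenize_field("hello world")); // ["hello", "world"]
    println!("{:?}", tokenize_field("你好世界"));     // ["你", "好", "世", "界"]
}
```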

Pipelines include many language-specific operations. Currently, we have two pipelines:

  1. A specialized Chinese pipeline using Jieba (see the example after this list)
  2. A default MeiliSearch pipeline that separates words based on character categories. It works with a wide variety of languages.
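To give a feel for what word segmentation in the Chinese pipeline looks like, here is a small standalone example using the jieba-rs crate. MeiliSearch bundles Jieba internally, so you never need to do this yourself; the snippet assumes `jieba-rs` has been added as a dependency in Cargo.toml:

```rust
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();

    // Chinese is written without spaces between words, so word
    // boundaries must be inferred rather than read off whitespace.
    let words = jieba.cut("我们都喜欢快速的搜索引擎", false);

    println!("{:?}", words);
}
```

Unlike the whitespace splitting shown earlier, the segmenter has to decide where one word ends and the next begins, which is exactly the problem the specialized pipeline exists to solve.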

For more details, check out the feature specification.

Language Support

MeiliSearch is multilingual, featuring optimized support for:

  - Any language that uses whitespace to separate words
  - Chinese (through Jieba)

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our GitHub repository.

Improving Our Language Support

While we have employees from all over the world at MeiliSearch, we don’t speak every language. In fact, we rely almost entirely on feedback from external contributors to know how our engine is performing across different languages.

If you’d like to help us create a more global MeiliSearch, please consider sharing your tests, results, and general feedback with us through GitHub issues. Several languages have already been requested by users and have their own tracking issues.

If you’d like us to add or improve support for a language that isn’t already being tracked, please create an issue saying so, and then make a pull request on the documentation to add it to the list of requested languages.