NLP and tokenization

Data tokenization

Manticore doesn’t store text as is for full-text searching. Instead, it extracts words and builds several structures that allow fast full-text searching. From the extracted words, a dictionary is built, which allows a quick lookup to determine whether a word is present in the index. In addition, other structures record the documents and fields in which each word was found (as well as its position inside a field). All of these are used when a full-text match is performed.
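The structures described above can be pictured with a toy inverted index. The following is only a minimal Python sketch under simplifying assumptions: the function name `build_index`, the nested-dictionary layout, and the sample documents are invented for illustration and say nothing about Manticore's actual on-disk format.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: {field: [positions]}} (a toy posting structure)."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Positions are 1-based word offsets inside the field.
            for pos, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
                index[word][doc_id][field].append(pos)
    return index


# Hypothetical sample documents, each with two text fields.
docs = {
    1: {"title": "Red apple", "body": "An apple a day"},
    2: {"title": "Green pear", "body": "A pear is not an apple"},
}
index = build_index(docs)

# Dictionary lookup: is the word present in the index at all?
print("apple" in index)         # True
print("banana" in index)        # False
# Postings: which documents contain the word, and where?
print(sorted(index["apple"]))   # [1, 2]
print(dict(index["apple"][2]))  # {'body': [6]}
```

The membership test corresponds to the dictionary lookup, while the nested entries play the role of the per-document, per-field position lists.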

The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and search time, and it operates at the character level and the word level.

On the character level, the engine allows only certain characters to pass; these are defined by charset_table. Anything else is replaced with whitespace (which is treated as the default word separator). The charset_table also allows mappings, for example lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
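As an illustration of the character-level step, here is a small Python sketch. It is not Manticore code: `apply_charset`, `ALLOWED`, and `MAPPINGS` are hypothetical names, and the rule set (digits, a..z, underscore, with A..Z folded to lowercase) is just an assumed example of what a character table might allow.

```python
def apply_charset(text, allowed, mappings):
    """Map characters, keep the allowed ones, and turn everything else into a space."""
    out = []
    for ch in text:
        ch = mappings.get(ch, ch)  # e.g. fold uppercase to lowercase
        out.append(ch if ch in allowed else " ")
    # Whitespace acts as the word separator.
    return "".join(out).split()


# Hypothetical rules: digits, a..z and underscore pass; A..Z map to a..z.
ALLOWED = set("0123456789abcdefghijklmnopqrstuvwxyz_")
MAPPINGS = {chr(c): chr(c + 32) for c in range(ord("A"), ord("Z") + 1)}

tokens = apply_charset("Hello, World! user_42@example.com", ALLOWED, MAPPINGS)
print(tokens)  # ['hello', 'world', 'user_42', 'example', 'com']
```

Because the same function would be applied to both documents and queries, "HELLO" in a query and "hello" in a document end up as the same token. Ignored and blended characters and phrase boundaries are left out of this sketch.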

At the word level, the base setting is min_word_len, which defines the minimum length, in characters, that a word must have to be accepted into the index. A common request is to match singular and plural forms of a word. For this, morphology processors can be used.
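A rough Python sketch of these two word-level ideas follows. The `normalize` function is a deliberately naive plural-stripping rule standing in for a real stemmer or lemmatizer, and applying the length filter before normalization is an assumption of this sketch, not a statement about the engine's order of operations.

```python
def normalize(word):
    """Naive stand-in for a morphology processor: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def word_level(tokens, min_word_len=1):
    # Drop tokens shorter than min_word_len, then normalize the rest.
    return [normalize(t) for t in tokens if len(t) >= min_word_len]


print(word_level(["a", "box", "of", "red", "apples"], min_word_len=3))
# ['box', 'red', 'apple']
print(normalize("apples") == normalize("apple"))  # True
```

Since "apples" and "apple" normalize to the same token, a query for one form matches documents containing the other.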

Going further, we might want one word to match another because the two are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.
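Conceptually, word forms behave like a lookup table applied to tokens at both indexing and search time. The Python sketch below assumes a hypothetical `WORDFORMS` mapping and handles only single-word sources; it illustrates the idea and is not Manticore's word forms file format.

```python
# Hypothetical mapping: several source words collapse into one target word.
WORDFORMS = {"automobile": "car", "cars": "car", "vehicle": "car"}


def apply_wordforms(tokens, wordforms):
    """Replace every token that has a mapping; leave the others untouched."""
    return [wordforms.get(t, t) for t in tokens]


indexed = apply_wordforms("red automobile for sale".split(), WORDFORMS)
query = apply_wordforms("car".split(), WORDFORMS)

print(indexed)                           # ['red', 'car', 'for', 'sale']
print(all(t in indexed for t in query))  # True
```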

Very common words can have unwanted effects on searching, mostly because their frequency means that processing their doc/hit lists requires a lot of computation. They can be blacklisted with the stop words functionality. This not only speeds up queries but also decreases index size.
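The effect of stop words on what gets indexed can be sketched in a few lines of Python. The `STOPWORDS` set here is a made-up example, and the sketch ignores word positions entirely.

```python
STOPWORDS = {"the", "a", "of", "and"}  # hypothetical stop word list


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stopwords]


tokens = "the rise and fall of the roman empire".split()
kept = remove_stopwords(tokens, STOPWORDS)

print(kept)                    # ['rise', 'fall', 'roman', 'empire']
print(len(tokens), len(kept))  # 8 4
```

In this example half of the tokens never reach the index, which is where both the size reduction and the query speed-up come from.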

A more advanced form of blacklisting is bigrams, which allows creating a special token for the pair formed by a ‘bigram’ (common) word and an uncommon word. This can make phrase searches that contain common words several times faster.
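To make the bigram idea more concrete, the Python sketch below adds a combined token for each adjacent pair in which at least one word is on a common-word list. The pairing rule, the `COMMON` list, and the separator character are assumptions of this sketch; the engine's own bigram modes may select pairs differently.

```python
COMMON = {"the", "of", "to", "a"}  # hypothetical frequent-word list
SEP = "\x01"                       # arbitrary joiner that cannot occur inside a token


def with_bigrams(tokens, common):
    """Return the tokens plus pair tokens for adjacent pairs involving a common word."""
    out = list(tokens)
    for left, right in zip(tokens, tokens[1:]):
        if left in common or right in common:
            out.append(left + SEP + right)
    return out


indexed = set(with_bigrams("enter the matrix".split(), COMMON))

# The phrase "the matrix" can be checked with one lookup of a rare pair token
# instead of walking the long posting list of "the".
print("the" + SEP + "matrix" in indexed)  # True
print("matrix" + SEP + "the" in indexed)  # False
```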

When indexing HTML content, it’s important not to index the HTML tags, as they can introduce a lot of ‘noise’ into the index. HTML stripping can be used for this, and it can be configured to strip tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
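The three behaviours mentioned above (strip tags, keep selected attributes, drop the content of selected elements) can be approximated with Python's standard `html.parser` module. This is a minimal sketch, not the engine's stripper: the class name `Stripper` and the parameters `index_attrs` and `remove_elements` are hypothetical, and the sketch assumes that removed elements always have a closing tag.

```python
from html.parser import HTMLParser


class Stripper(HTMLParser):
    def __init__(self, index_attrs, remove_elements):
        super().__init__()
        self.index_attrs = index_attrs          # {tag: {attribute names to keep}}
        self.remove_elements = remove_elements  # tags whose content is dropped
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_elements:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # The tag itself is dropped; selected attribute values are kept as text.
        for name, value in attrs:
            if value and name in self.index_attrs.get(tag, ()):
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.remove_elements and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html, index_attrs, remove_elements):
    parser = Stripper(index_attrs, remove_elements)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.parts).split())


html = '<p>Fresh <b>apples</b></p><img src="a.png" alt="red apple"><style>p{color:red}</style>'
print(strip_html(html, {"img": {"alt"}}, {"style", "script"}))
# Fresh apples red apple
```

The `alt` text survives because it was listed for indexing, while the `src` value and the whole `style` element disappear.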

Supported languages

Manticore supports many languages. Basic support for most of them is enabled by default via charset_table = non_cjk (which is the default value).

For many languages we provide a stopwords file (you can also use your own), which you can use to improve search relevance.

For a few languages, advanced morphology is available, which can improve search relevance significantly through better segmentation and normalization using dictionary-based lemmatization or stemming algorithms.

The table below includes a complete list of supported languages. Use it to find out how to enable:

  • basic support (column “Supported”)
  • stopwords (column “Stopwords”)
  • advanced morphology (column “Advanced morphology”)

| Language | Supported | Stopwords file name | Advanced morphology | Notes |
|---|---|---|---|---|
| Afrikaans | charset_table=non_cjk | af | - | |
| Arabic | charset_table=non_cjk | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
| Armenian | charset_table=non_cjk | hy | - | |
| Assamese | specify charset_table manually | - | - | |
| Basque | charset_table=non_cjk | eu | - | |
| Bengali | charset_table=non_cjk | bn | - | |
| Bishnupriya | specify charset_table manually | - | - | |
| Buhid | specify charset_table manually | - | - | |
| Bulgarian | charset_table=non_cjk | bg | - | |
| Catalan | charset_table=non_cjk | ca | morphology=libstemmer_ca | |
| Chinese | charset_table=chinese or ngram_chars=chinese | zh | morphology=icu_chinese or ngram_len=1 correspondingly | ICU dictionary-based segmentation is much more accurate than ngram-based |
| Croatian | charset_table=non_cjk | hr | - | |
| Kurdish | charset_table=non_cjk | ckb | - | |
| Czech | charset_table=non_cjk | cz | morphology=stem_cz (Czech stemmer) | |
| Danish | charset_table=non_cjk | da | morphology=libstemmer_da | |
| Dutch | charset_table=non_cjk | nl | morphology=libstemmer_nl | |
| English | charset_table=non_cjk | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter’s English stemmer); morphology=stem_enru (Porter’s English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
| Esperanto | charset_table=non_cjk | eo | - | |
| Estonian | charset_table=non_cjk | et | - | |
| Finnish | charset_table=non_cjk | fi | morphology=libstemmer_fi | |
| French | charset_table=non_cjk | fr | morphology=libstemmer_fr | |
| Galician | charset_table=non_cjk | gl | - | |
| Garo | specify charset_table manually | - | - | |
| German | charset_table=non_cjk | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
| Greek | charset_table=non_cjk | el | morphology=libstemmer_el | |
| Hebrew | charset_table=non_cjk | he | - | |
| Hindi | charset_table=non_cjk | hi | morphology=libstemmer_hi | |
| Hmong | specify charset_table manually | - | - | |
| Ho | specify charset_table manually | - | - | |
| Hungarian | charset_table=non_cjk | hu | morphology=libstemmer_hu | |
| Indonesian | charset_table=non_cjk | id | morphology=libstemmer_id | |
| Irish | charset_table=non_cjk | ga | morphology=libstemmer_ga | |
| Italian | charset_table=non_cjk | it | morphology=libstemmer_it | |
| Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
| Komi | specify charset_table manually | - | - | |
| Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
| Large Flowery Miao | specify charset_table manually | - | - | |
| Latin | charset_table=non_cjk | la | - | |
| Latvian | charset_table=non_cjk | lv | - | |
| Lithuanian | charset_table=non_cjk | lt | morphology=libstemmer_lt | |
| Maba | specify charset_table manually | - | - | |
| Maithili | specify charset_table manually | - | - | |
| Marathi | specify charset_table manually | - | - | |
| Marathi | charset_table=non_cjk | mr | - | |
| Mende | specify charset_table manually | - | - | |
| Mru | specify charset_table manually | - | - | |
| Myene | specify charset_table manually | - | - | |
| Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
| Ngambay | specify charset_table manually | - | - | |
| Norwegian | charset_table=non_cjk | no | morphology=libstemmer_no | |
| Odia | specify charset_table manually | - | - | |
| Persian | charset_table=non_cjk | fa | - | |
| Polish | charset_table=non_cjk | pl | - | |
| Portuguese | charset_table=non_cjk | pt | morphology=libstemmer_pt | |
| Romanian | charset_table=non_cjk | ro | morphology=libstemmer_ro | |
| Russian | charset_table=non_cjk | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter’s Russian stemmer); morphology=stem_enru (Porter’s English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
| Santali | specify charset_table manually | - | - | |
| Sindhi | specify charset_table manually | - | - | |
| Slovak | charset_table=non_cjk | sk | - | |
| Slovenian | charset_table=non_cjk | sl | - | |
| Somali | charset_table=non_cjk | so | - | |
| Sotho | charset_table=non_cjk | st | - | |
| Spanish | charset_table=non_cjk | es | morphology=libstemmer_es | |
| Swahili | charset_table=non_cjk | sw | - | |
| Swedish | charset_table=non_cjk | sv | morphology=libstemmer_sv | |
| Sylheti | specify charset_table manually | - | - | |
| Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
| Thai | charset_table=non_cjk | th | - | |
| Turkish | charset_table=non_cjk | tr | morphology=libstemmer_tr | |
| Ukrainian | charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 | - | morphology=lemmatize_uk_all | Requires installation of UK lemmatizer |
| Yoruba | charset_table=non_cjk | yo | - | |
| Zulu | charset_table=non_cjk | zu | - | |