Create a custom analyzer

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of zero or more character filters, a tokenizer, and zero or more token filters.

Configuration

The custom analyzer accepts the following parameters:

tokenizer

A built-in or customised tokenizer. (Required)

char_filter

An optional array of built-in or customised character filters.

filter

An optional array of built-in or customised token filters.

position_increment_gap

When indexing an array of text values, Elasticsearch inserts a fake “gap” between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100. See position_increment_gap for more.
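
As a minimal sketch of this parameter, the request below defines a custom analyzer with a larger gap and applies it to a text field. The index name my-index-000002, the analyzer name my_gap_analyzer, the field name names, and the gap value 500 are placeholders chosen for illustration, not part of the examples later in this page:

PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 500
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "analyzer": "my_gap_analyzer"
      }
    }
  }
}

With this gap, a match_phrase query for "Abraham Lincoln" would not match a document whose names field is the array [ "John Abraham", "Lincoln Smith" ], because the last term of the first value and the first term of the second value are separated by at least 500 positions.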

Example configuration

Here is an example that combines the following:

Character Filter: HTML Strip Character Filter (html_strip)

Tokenizer: Standard Tokenizer (standard)

Token Filters: Lowercase Token Filter (lowercase), ASCII Folding Token Filter (asciifolding)

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this to how built-in analyzers can be configured: type will be set to the name of the built-in analyzer, like standard or simple.

The above example produces the following terms:

[ is, this, deja, vu ]
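
To use the analyzer at index and search time, reference it from a field mapping. The following is a minimal sketch; the description field name is a placeholder introduced here and is not part of the example above:

PUT my-index-000001/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}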

The previous example used a tokenizer, token filters, and character filters with their default configurations, but it is possible to create configured versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter: Mapping Character Filter, configured to replace :) with _happy_ and :( with _sad_

Tokenizer: Pattern Tokenizer, configured to split on punctuation characters

Token Filters: Lowercase Token Filter, and Stop Token Filter configured to use the pre-defined list of English stop words

Here is an example:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}

In this request:

The analyzer entry assigns the index a custom analyzer, my_custom_analyzer. This analyzer uses a custom tokenizer, character filter, and token filter that are defined later in the request.

The tokenizer entry defines the custom punctuation tokenizer.

The char_filter entry defines the custom emoticons character filter.

The filter entry defines the custom english_stop token filter.

The above example produces the following terms:

[ i'm, _happy_, person, you ]
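
As a closing sketch, if you want a custom analyzer applied to every text field that does not specify its own analyzer, register it under the reserved name default in the index settings. The index name my-index-000003 and the analyzer configuration below are placeholders for illustration:

PUT my-index-000003
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}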