Character group tokenizer

The char_group tokenizer breaks text into terms whenever it encounters a character that belongs to a defined set. It is mostly useful when simple custom tokenization is needed and the overhead of the pattern tokenizer is not acceptable.

Configuration

The char_group tokenizer accepts the following parameters:

tokenize_on_chars

A list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Each entry is either a single character, such as -, or the name of a character group: whitespace, letter, digit, punctuation, symbol.

max_token_length

The maximum token length. A token that exceeds this length is split at max_token_length intervals. Defaults to 255.
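
The effect of the two parameters can be sketched in a few lines of Python. This is an illustrative approximation, not the tokenizer's actual implementation: the function name char_group_tokenize, the GROUPS table, and the mapping of group names onto Unicode categories are all assumptions made for this sketch.

```python
import unicodedata

# Approximate mapping from group names to Unicode tests. This mapping is an
# assumption of the sketch, not the tokenizer's real character classes.
GROUPS = {
    "whitespace": lambda c: c.isspace(),
    "letter": lambda c: unicodedata.category(c).startswith("L"),
    "digit": lambda c: unicodedata.category(c) == "Nd",
    "punctuation": lambda c: unicodedata.category(c).startswith("P"),
    "symbol": lambda c: unicodedata.category(c).startswith("S"),
}


def char_group_tokenize(text, tokenize_on_chars, max_token_length=255):
    """Return (token, start_offset, end_offset) tuples for text."""
    # Entries that are not group names are treated as literal characters.
    literals = {c for c in tokenize_on_chars if c not in GROUPS}
    tests = [GROUPS[c] for c in tokenize_on_chars if c in GROUPS]

    def is_separator(ch):
        return ch in literals or any(test(ch) for test in tests)

    tokens = []
    start = None  # offset where the current token began, None between tokens
    for i, ch in enumerate(text):
        if is_separator(ch):
            # A separator closes the current token, if there is one.
            if start is not None:
                tokens.append((text[start:i], start, i))
                start = None
            continue
        if start is None:
            start = i
        # Emit the token as soon as it reaches max_token_length characters.
        if i + 1 - start == max_token_length:
            tokens.append((text[start:i + 1], start, i + 1))
            start = None
    if start is not None:
        tokens.append((text[start:], start, len(text)))
    return tokens
```

Under these assumptions, char_group_tokenize("abcdefg", ["-"], max_token_length=3) splits the seven-character run into "abc", "def" and "g", which mirrors the max_token_length behavior described above.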

Example output

  POST _analyze
  {
    "tokenizer": {
      "type": "char_group",
      "tokenize_on_chars": [
        "whitespace",
        "-",
        "\n"
      ]
    },
    "text": "The QUICK brown-fox"
  }

returns

  {
    "tokens": [
      {
        "token": "The",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 0
      },
      {
        "token": "QUICK",
        "start_offset": 4,
        "end_offset": 9,
        "type": "word",
        "position": 1
      },
      {
        "token": "brown",
        "start_offset": 10,
        "end_offset": 15,
        "type": "word",
        "position": 2
      },
      {
        "token": "fox",
        "start_offset": 16,
        "end_offset": 19,
        "type": "word",
        "position": 3
      }
    ]
  }
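
In this response, start_offset is the index of a token's first character in the input text and end_offset is the index just past its last character. For this ASCII input, that relationship can be checked with plain Python slicing. The helper name offsets_match is hypothetical, and the triples are copied by hand from the response above.

```python
text = "The QUICK brown-fox"

# (token, start_offset, end_offset) triples copied from the response above.
tokens = [
    ("The", 0, 3),
    ("QUICK", 4, 9),
    ("brown", 10, 15),
    ("fox", 16, 19),
]


def offsets_match(text, tokens):
    """Return True if every token equals text[start_offset:end_offset]."""
    return all(text[start:end] == token for token, start, end in tokens)


print(offsets_match(text, tokens))  # prints True
```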