This topic introduces the Greenplum Database functions you can use to test and debug a search configuration or the individual parser and dictionaries specified in a configuration.

The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.

This section contains the following subtopics:

  • Configuration Testing
  • Parser Testing
  • Dictionary Testing

Configuration Testing

The function ts_debug allows easy testing of a text search configuration.

  ts_debug([<config> regconfig, ] <document> text,
           OUT <alias> text,
           OUT <description> text,
           OUT <token> text,
           OUT <dictionaries> regdictionary[],
           OUT <dictionary> regdictionary,
           OUT <lexemes> text[])
  returns setof record

ts_debug displays information about every token of *document* as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by *config*, or default_text_search_config if that argument is omitted.
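
You can check or change the default for the current session with the default_text_search_config server configuration parameter, for example:

  SHOW default_text_search_config;
  SET default_text_search_config = 'pg_catalog.english';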

ts_debug returns one row for each token identified in the text by the parser. The columns returned are:

  • *alias* text — short name of the token type
  • *description* text — description of the token type
  • *token* text — text of the token
  • *dictionaries* regdictionary[] — the dictionaries selected by the configuration for this token type
  • *dictionary* regdictionary — the dictionary that recognized the token, or NULL if none did
  • *lexemes* text[] — the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word

Here is a simple example:

  SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
     alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
  -----------+-----------------+-------+----------------+--------------+---------
   asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
   blank     | Space symbols   |       | {}             |              |
   blank     | Space symbols   | -     | {}             |              |
   asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
   blank     | Space symbols   |       | {}             |              |
   asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}

For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:

  CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

  CREATE TEXT SEARCH DICTIONARY english_ispell (
      TEMPLATE = ispell,
      DictFile = english,
      AffFile = english,
      StopWords = english
  );

  ALTER TEXT SEARCH CONFIGURATION public.english
      ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
  SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
     alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
  -----------+-----------------+-------------+-------------------------------+----------------+-------------
   asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
   blank     | Space symbols   |             | {}                            |                |
   asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
   blank     | Space symbols   |             | {}                            |                |
   asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}

In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the noun bright. The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).

The word The was recognized by the english_ispell dictionary as a stop word (see Stop Words) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.
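
You can verify this chain of events directly with ts_lexize, described in Dictionary Testing below. This sketch assumes the english_ispell dictionary created above; english_ispell returns NULL for the word it does not know, which is why the token is handed on to english_stem:

  SELECT ts_lexize('english_ispell', 'supernovaes') IS NULL AS unknown_to_ispell,
         ts_lexize('english_stem', 'supernovaes') AS stemmed;
   unknown_to_ispell |   stemmed
  -------------------+-------------
   t                 | {supernova}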

You can reduce the width of the output by explicitly specifying which columns you want to see:

  SELECT alias, token, dictionary, lexemes
  FROM ts_debug('public.english', 'The Brightest supernovaes');
     alias   |    token    |   dictionary   |   lexemes
  -----------+-------------+----------------+-------------
   asciiword | The         | english_ispell | {}
   blank     |             |                |
   asciiword | Brightest   | english_ispell | {bright}
   blank     |             |                |
   asciiword | supernovaes | english_stem   | {supernova}
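
Because ts_debug is an ordinary set-returning function, you can also filter its rows in SQL. For example, this variant of the same query drops the blank tokens:

  SELECT alias, token, dictionary, lexemes
  FROM ts_debug('public.english', 'The Brightest supernovaes')
  WHERE alias <> 'blank';
     alias   |    token    |   dictionary   |   lexemes
  -----------+-------------+----------------+-------------
   asciiword | The         | english_ispell | {}
   asciiword | Brightest   | english_ispell | {bright}
   asciiword | supernovaes | english_stem   | {supernova}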

Parser Testing

The following functions allow direct testing of a text search parser.

  ts_parse(<parser_name> text, <document> text,
           OUT <tokid> integer, OUT <token> text) returns setof record
  ts_parse(<parser_oid> oid, <document> text,
           OUT <tokid> integer, OUT <token> text) returns setof record

ts_parse parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token, which is the text of the token. For example:

  SELECT * FROM ts_parse('default', '123 - a number');
   tokid | token
  -------+--------
      22 | 123
      12 |
      12 | -
       1 | a
      12 |
       1 | number

  ts_token_type(<parser_name> text, OUT <tokid> integer,
                OUT <alias> text, OUT <description> text) returns setof record
  ts_token_type(<parser_oid> oid, OUT <tokid> integer,
                OUT <alias> text, OUT <description> text) returns setof record

ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:

  SELECT * FROM ts_token_type('default');
   tokid |      alias      |               description
  -------+-----------------+------------------------------------------
       1 | asciiword       | Word, all ASCII
       2 | word            | Word, all letters
       3 | numword         | Word, letters and digits
       4 | email           | Email address
       5 | url             | URL
       6 | host            | Host
       7 | sfloat          | Scientific notation
       8 | version         | Version number
       9 | hword_numpart   | Hyphenated word part, letters and digits
      10 | hword_part      | Hyphenated word part, all letters
      11 | hword_asciipart | Hyphenated word part, all ASCII
      12 | blank           | Space symbols
      13 | tag             | XML tag
      14 | protocol        | Protocol head
      15 | numhword        | Hyphenated word, letters and digits
      16 | asciihword      | Hyphenated word, all ASCII
      17 | hword           | Hyphenated word, all letters
      18 | url_path        | URL path
      19 | file            | File or path name
      20 | float           | Decimal notation
      21 | int             | Signed integer
      22 | uint            | Unsigned integer
      23 | entity          | XML entity
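
Because ts_parse reports only the numeric tokid, joining it with ts_token_type is a convenient way to see the token type alias next to each token. A minimal sketch (row order is not guaranteed without an ORDER BY):

  SELECT t.alias, p.token
  FROM ts_parse('default', '123 - a number') AS p
  JOIN ts_token_type('default') AS t ON p.tokid = t.tokid;

In the earlier ts_parse example, this labels 123 as uint and a and number as asciiword, matching tokid values 22 and 1 in the table above.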

Dictionary Testing

The ts_lexize function facilitates dictionary testing.

  ts_lexize(<dict> regdictionary, <token> text) returns text[]

ts_lexize returns an array of lexemes if the input *token* is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.

Examples:

  SELECT ts_lexize('english_stem', 'stars');
   ts_lexize
  -----------
   {star}

  SELECT ts_lexize('english_stem', 'a');
   ts_lexize
  -----------
   {}
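
english_stem is a Snowball dictionary and recognizes every word, so it never returns NULL. To see the NULL result for an unknown word you need a dictionary with a finite vocabulary; this sketch assumes the english_ispell dictionary created earlier in this topic and uses an arbitrary non-word:

  SELECT ts_lexize('english_ispell', 'asdfgh') IS NULL;
   ?column?
  ----------
   t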

Note

The ts_lexize function expects a single token, not text. Here is a case where this can be confusing:

  SELECT ts_lexize('thesaurus_astro', 'supernovae stars') IS NULL;
   ?column?
  ----------
   t

The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:

  SELECT plainto_tsquery('supernovae stars');
   plainto_tsquery
  -----------------
   'sn'
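
The to_tsvector form can be tested the same way, assuming the same configuration in which thesaurus_astro substitutes sn for the phrase:

  SELECT to_tsvector('supernovae stars');
   to_tsvector
  -------------
   'sn':1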

Parent topic: Using Full Text Search