12.8. Testing and Debugging Text Search

The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.

12.8.1. Configuration Testing

The function ts_debug allows easy testing of a text search configuration.

    ts_debug([ config regconfig, ]
             document text,
             OUT alias text,
             OUT description text,
             OUT token text,
             OUT dictionaries regdictionary[],
             OUT dictionary regdictionary,
             OUT lexemes text[])
         returns setof record

ts_debug displays information about every token of document as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by config, or default_text_search_config if that argument is omitted.

ts_debug returns one row for each token identified in the text by the parser. The columns returned are:

  • alias text — short name of the token type
  • description text — description of the token type
  • token text — text of the token
  • dictionaries regdictionary[] — the dictionaries selected by the configuration for this token type
  • dictionary regdictionary — the dictionary that recognized the token, or NULL if none did
  • lexemes text[] — the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word

Here is a simple example:

    SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
       alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
    -----------+-----------------+-------+----------------+--------------+---------
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
     blank     | Space symbols   |       | {}             |              |
     blank     | Space symbols   | -     | {}             |              |
     asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
     blank     | Space symbols   |       | {}             |              |
     asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}

For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:

    CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

    CREATE TEXT SEARCH DICTIONARY english_ispell (
        TEMPLATE = ispell,
        DictFile = english,
        AffFile = english,
        StopWords = english
    );

    ALTER TEXT SEARCH CONFIGURATION public.english
       ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

    SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
       alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
    -----------+-----------------+-------------+-------------------------------+----------------+-------------
     asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
     blank     | Space symbols   |             | {}                            |                |
     asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
     blank     | Space symbols   |             | {}                            |                |
     asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}

In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the noun bright. The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).

The word The was recognized by the english_ispell dictionary as a stop word (Section 12.6.1) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.
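The fallback behavior shown above — offer the token to each dictionary in the configured list until one recognizes it — can be sketched outside the database. The following Python sketch models that chain under simplified assumptions; the dictionary functions are hypothetical stand-ins, not the real english_ispell or english_stem implementations:

```python
# Sketch of the dictionary-chain fallback that ts_debug reports.
# Each "dictionary" is modeled as a function: token -> list of lexemes,
# [] for a stop word, or None if the token is unknown to it.

STOP_WORDS = {"the", "a", "on", "it"}          # assumed stop word list
KNOWN_ISPELL = {"brightest": ["bright"]}       # assumed Ispell entries

def english_ispell(token):
    t = token.lower()
    if t in STOP_WORDS:
        return []                  # recognized, but as a stop word
    return KNOWN_ISPELL.get(t)     # None if the word is unknown

def english_stem(token):
    t = token.lower()
    # A Snowball stemmer recognizes everything; crude suffix stripping here.
    for suffix in ("es", "s"):
        if t.endswith(suffix):
            return [t[: -len(suffix)]]
    return [t]

def process_token(token, dictionaries):
    """Offer the token to each dictionary in order; the first non-None
    result wins, mirroring a text search configuration's mapping."""
    for name, dict_fn in dictionaries:
        lexemes = dict_fn(token)
        if lexemes is not None:
            return name, lexemes
    return None, None              # no dictionary recognized the token

chain = [("english_ispell", english_ispell), ("english_stem", english_stem)]
print(process_token("Brightest", chain))    # ('english_ispell', ['bright'])
print(process_token("supernovaes", chain))  # ('english_stem', ['supernova'])
print(process_token("The", chain))          # ('english_ispell', [])
```

This is why dictionary order matters: a catch-all dictionary such as a Snowball stemmer must come last, or the dictionaries after it will never be consulted.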

You can reduce the width of the output by explicitly specifying which columns you want to see:

    SELECT alias, token, dictionary, lexemes
    FROM ts_debug('public.english', 'The Brightest supernovaes');
       alias   |    token    |   dictionary   |   lexemes
    -----------+-------------+----------------+-------------
     asciiword | The         | english_ispell | {}
     blank     |             |                |
     asciiword | Brightest   | english_ispell | {bright}
     blank     |             |                |
     asciiword | supernovaes | english_stem   | {supernova}

12.8.2. Parser Testing

The following functions allow direct testing of a text search parser.

    ts_parse(parser_name text, document text,
             OUT tokid integer, OUT token text) returns setof record
    ts_parse(parser_oid oid, document text,
             OUT tokid integer, OUT token text) returns setof record

ts_parse parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token which is the text of the token. For example:

    SELECT * FROM ts_parse('default', '123 - a number');
     tokid | token
    -------+--------
        22 | 123
        12 |
        12 | -
         1 | a
        12 |
         1 | number

    ts_token_type(parser_name text, OUT tokid integer,
                  OUT alias text, OUT description text) returns setof record
    ts_token_type(parser_oid oid, OUT tokid integer,
                  OUT alias text, OUT description text) returns setof record

ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:

    SELECT * FROM ts_token_type('default');
     tokid |      alias      |               description
    -------+-----------------+------------------------------------------
         1 | asciiword       | Word, all ASCII
         2 | word            | Word, all letters
         3 | numword         | Word, letters and digits
         4 | email           | Email address
         5 | url             | URL
         6 | host            | Host
         7 | sfloat          | Scientific notation
         8 | version         | Version number
         9 | hword_numpart   | Hyphenated word part, letters and digits
        10 | hword_part      | Hyphenated word part, all letters
        11 | hword_asciipart | Hyphenated word part, all ASCII
        12 | blank           | Space symbols
        13 | tag             | XML tag
        14 | protocol        | Protocol head
        15 | numhword        | Hyphenated word, letters and digits
        16 | asciihword      | Hyphenated word, all ASCII
        17 | hword           | Hyphenated word, all letters
        18 | url_path        | URL path
        19 | file            | File or path name
        20 | float           | Decimal notation
        21 | int             | Signed integer
        22 | uint            | Unsigned integer
        23 | entity          | XML entity

12.8.3. Dictionary Testing

The ts_lexize function facilitates dictionary testing.

    ts_lexize(dict regdictionary, token text) returns text[]

ts_lexize returns an array of lexemes if the input token is known to the dictionary, an empty array if the token is known to the dictionary but is a stop word, or NULL if it is an unknown word.

Examples:

    SELECT ts_lexize('english_stem', 'stars');
     ts_lexize
    -----------
     {star}

    SELECT ts_lexize('english_stem', 'a');
     ts_lexize
    -----------
     {}
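The three-way result convention — array of lexemes, empty array, or NULL — is what a client must inspect to tell a stop word apart from an unknown word. A minimal Python sketch of the same convention, using a toy word list rather than PostgreSQL's english_stem:

```python
# ts_lexize's three-way result convention, emulated with a toy dictionary:
# a list of lexemes for a recognized word, [] for a stop word,
# and None (standing in for SQL NULL) for an unknown word.

STOP_WORDS = {"a", "the"}
KNOWN = {"stars": ["star"], "star": ["star"]}   # assumed toy entries

def toy_lexize(token):
    if token in STOP_WORDS:
        return []                 # known, but a stop word: empty array
    if token in KNOWN:
        return KNOWN[token]       # recognized: array of lexemes
    return None                   # unknown word: NULL

print(toy_lexize("stars"))   # ['star']
print(toy_lexize("a"))       # []
print(toy_lexize("xyzzy"))   # None
```

Note that the empty-array and NULL cases are distinct: testing only for truthiness would conflate a stop word with an unknown word.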

Note

The ts_lexize function expects a single token, not text. Here is a case where this can be confusing:

    SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
     ?column?
    ----------
     t

The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:

    SELECT plainto_tsquery('supernovae stars');
     plainto_tsquery
    -----------------
     'sn'