This topic describes the types of tokens the Greenplum Database text search parser produces from raw text.

    Text search parsers are responsible for splitting raw document text into tokens and identifying each token’s type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all — it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present Greenplum Database provides just one built-in parser, which has been found to be useful for a wide range of applications.
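
    To look at raw parser output directly, without any dictionary processing, you can call the standard ts_parse function against the default parser. The query below is a minimal sketch; for this sample string it returns one row per token (two asciiword tokens, two blank tokens, and a uint token), identified by numeric token type ID rather than by alias:

    -- Raw parser output: one row per token, returned as (tokid, token).
    -- ts_debug, shown later in this topic, reports the alias instead of the numeric ID.
    SELECT * FROM ts_parse('default', 'foo bar 42');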

    The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in the following table.

    Alias            Description                                Example
    ---------------  ----------------------------------------   --------------------------------------------------------
    asciiword        Word, all ASCII letters                    elephant
    word             Word, all letters                          mañana
    numword          Word, letters and digits                   beta1
    asciihword       Hyphenated word, all ASCII                 up-to-date
    hword            Hyphenated word, all letters               lógico-matemática
    numhword         Hyphenated word, letters and digits        postgresql-beta1
    hword_asciipart  Hyphenated word part, all ASCII            postgresql in the context postgresql-beta1
    hword_part       Hyphenated word part, all letters          lógico or matemática in the context lógico-matemática
    hword_numpart    Hyphenated word part, letters and digits   beta1 in the context postgresql-beta1
    email            Email address                              foo@example.com
    protocol         Protocol head                              http://
    url              URL                                        example.com/stuff/index.html
    host             Host                                       example.com
    url_path         URL path                                   /stuff/index.html, in the context of a URL
    file             File or path name                          /usr/local/foo.txt, if not within a URL
    sfloat           Scientific notation                        -1.234e56
    float            Decimal notation                           -1.234
    int              Signed integer                             -1234
    uint             Unsigned integer                           1234
    version          Version number                             8.3.0
    tag              XML tag                                    <a href="dictionaries.html">
    entity           XML entity                                 &amp;
    blank            Space symbols                              (any whitespace or punctuation not otherwise recognized)
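
    The same list can be retrieved at run time with the standard ts_token_type function, which returns each type's numeric ID, alias, and description for a given parser:

    -- Lists every token type the named parser can report.
    SELECT * FROM ts_token_type('default');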

    Note

    The parser’s notion of a “letter” is determined by the database’s locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.

    email does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.
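
    As a concrete illustration of treating word and asciiword alike, a text search configuration for a European language would normally map both token types to the same dictionary list. The following is a minimal sketch, assuming a configuration named my_config and the built-in english_stem dictionary:

    -- Map plain-letter words and ASCII-only words to the same dictionary.
    ALTER TEXT SEARCH CONFIGURATION my_config
        ALTER MAPPING FOR asciiword, word WITH english_stem;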

    It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:

    SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
          alias      |               description                |     token
    -----------------+------------------------------------------+---------------
     numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
     hword_asciipart | Hyphenated word part, all ASCII          | foo
     blank           | Space symbols                            | -
     hword_asciipart | Hyphenated word part, all ASCII          | bar
     blank           | Space symbols                            | -
     hword_numpart   | Hyphenated word part, letters and digits | beta1

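    The same overlap carries through to indexing. As a sketch (using the built-in english configuration purely for illustration), to_tsvector produces lexemes for both the compound token and its component parts:

    -- Both the whole compound and its parts become lexemes,
    -- typically along the lines of:
    --   'bar':3 'beta1':4 'foo':2 'foo-bar-beta1':1
    SELECT to_tsvector('english', 'foo-bar-beta1');
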
    This behavior is desirable since it allows searches to work both for the whole compound word and for its components. Here is another instructive example:

    SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
      alias   |  description  |            token
    ----------+---------------+------------------------------
     protocol | Protocol head | http://
     url      | URL           | example.com/stuff/index.html
     host     | Host          | example.com
     url_path | URL path      | /stuff/index.html
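
    Token types matter in practice because a text search configuration maps each token type to a list of dictionaries, and tokens whose type has no mapping are not indexed at all. As a hedged sketch (my_config is a placeholder configuration name), you could stop indexing URL paths while still indexing hosts and whole URLs:

    -- Drop the mapping for url_path so those tokens are ignored at indexing time.
    ALTER TEXT SEARCH CONFIGURATION my_config
        DROP MAPPING FOR url_path;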
