Greenplum Fuzzy String Match Extension

The Greenplum Database Fuzzy String Match extension provides functions to determine similarities and distance between strings based on various algorithms.

The Greenplum Database installation contains the files required for the functions in this extension module and SQL scripts to define the extension functions in a database and remove the functions from a database.

Warning: The functions soundex, metaphone, dmetaphone, and dmetaphone_alt do not work well with multibyte encodings (such as UTF-8).

The Greenplum Database Fuzzy String Match extension is based on the PostgreSQL fuzzystrmatch module.

Parent topic: Greenplum Database Reference Guide

Soundex Functions

The Soundex system is a method of matching similar-sounding (similar phonemes) names by converting them to the same code.

Note: Soundex is most useful for English names.

These functions work with Soundex codes:

  1. soundex(text <string1>) returns text
  2. difference(text <string1>, text <string2>) returns int

The soundex function converts a string to its Soundex code. Soundex codes consist of four characters.

The difference function converts two strings to their Soundex codes and then reports the number of matching code positions. The result ranges from zero to four, zero being no match and four being an exact match. These are some examples:

  1. SELECT soundex('hello world!');
  2. SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
  3. SELECT soundex('Anne'), soundex('Andrew'), difference('Anne', 'Andrew');
  4. SELECT soundex('Anne'), soundex('Margaret'), difference('Anne', 'Margaret');
  5. CREATE TABLE s (nm text);
  6. INSERT INTO s VALUES ('john');
  7. INSERT INTO s VALUES ('joan');
  8. INSERT INTO s VALUES ('wobbly');
  9. INSERT INTO s VALUES ('jack');
  10. SELECT * FROM s WHERE soundex(nm) = soundex('john');
  11. SELECT * FROM s WHERE difference(s.nm, 'john') > 2;

For information about the Soundex indexing system see https://www.archives.gov/research/census/soundex.html.

Levenshtein Functions

These functions calculate the Levenshtein distance between two strings:

  1. levenshtein(text <source>, text <target>, int <ins_cost>, int <del_cost>, int <sub_cost>) returns int
  2. levenshtein(text <source>, text <target>) returns int
  3. levenshtein_less_equal(text <source>, text <target>, int <ins_cost>, int <del_cost>, int <sub_cost>, int <max_d>) returns int
  4. levenshtein_less_equal(text <source>, text <target>, int max_d) returns int

Both the source and target parameters can be any non-null string, with a maximum of 255 bytes. The cost parameters ins_cost, del_cost, and sub_cost specify cost of a character insertion, deletion, or substitution, respectively. You can omit the cost parameters, as in the second version of the function; in that case the cost parameters default to 1.

levenshtein_less_equal is accelerated version of levenshtein function for low values of distance. If actual distance is less or equal then max_d, then levenshtein_less_equal returns an accurate value of the distance. Otherwise, this function returns value which is greater than max_d. Examples:

  1. test=# SELECT levenshtein('GUMBO', 'GAMBOL');
  2. levenshtein
  3. -------------
  4. 2
  5. (1 row)
  6. test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
  7. levenshtein
  8. -------------
  9. 3
  10. (1 row)
  11. test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
  12. levenshtein_less_equal
  13. ------------------------
  14. 3
  15. (1 row)
  16. test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
  17. levenshtein_less_equal
  18. ------------------------
  19. 4
  20. (1 row)

For information about the Levenshtein algorithm, see http://www.levenshtein.net/.

Metaphone Functions

Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes. This function calculates the metaphone code of an input string :

  1. metaphone(text <source>, int <max_output_length>) returns text

The source parameter must be a non-null string with a maximum of 255 characters. The max_output_length parameter sets the maximum length of the output metaphone code; if longer, the output is truncated to this length. Example:

  1. test=# SELECT metaphone('GUMBO', 4);
  2. metaphone
  3. -----------
  4. KM
  5. (1 row)

For information about the Metaphone algorithm, see https://en.wikipedia.org/wiki/Metaphone.

Double Metaphone Functions

The Double Metaphone system computes two “sounds like” strings for a given input string - a “primary” and an “alternate”. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes:

  1. dmetaphone(text <source>) returns text
  2. dmetaphone_alt(text <source>) returns text

There is no length limit on the input strings. Example:

  1. test=# select dmetaphone('gumbo');
  2. dmetaphone
  3. ------------
  4. KMP
  5. (1 row)

For information about the Double Metaphone algorithm, see https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone.

Installing and Uninstalling the Fuzzy String Match Functions

Greenplum Database supplies SQL scripts to install and uninstall the Fuzzy String Match extension functions.

To install the functions in a database, run the following SQL script:

  1. psql -f $GPHOME/share/postgresql/contrib/fuzzystrmatch.sql

To uninstall the functions, run the following SQL script:

  1. psql -f $GPHOME/share/postgresql/contrib/uninstall_fuzzystrmatch.sql

Note: When you uninstall the Fuzzy String Match functions from a database, routines that you created in the database that use the functions will no longer work.