Percolate query

Percolate queries are also known as persistent queries, prospective search, document routing, search in reverse, and inverse search.

The usual way of doing searches is to store documents and perform search queries against them. However, there are cases when we want to apply a query to a new incoming document to signal a match. There are some scenarios where this is wanted. For example, a monitoring system doesn't just collect data; it is also expected to notify the user about different events, such as a metric reaching a threshold or a certain value appearing in the monitored data. Another similar case is news aggregation. You can notify the user about every piece of fresh news, but the user might want to be notified only about certain categories or topics, or, going further, only about certain "keywords".

This is where traditional search is not a good fit: it would mean running each desired search over the entire collection, multiplied by the number of users, so we would end up with lots of queries running over the entire collection, which can add a lot of extra load. The idea explained in this section is to store the queries instead and test them against an incoming new document or a batch of documents.
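In a nutshell, the reversed flow in Manticore looks like the following minimal SQL sketch (the table and field names here are just illustrative; fuller examples follow below):

  CREATE TABLE alerts(title text) type='pq';                          -- a percolate table stores queries (rules), not documents
  INSERT INTO alerts(query) VALUES('@title iphone');                  -- store a rule: a full-text query
  CALL PQ('alerts', '{"title": "new iphone released"}', 1 as query);  -- test an incoming document against all stored rules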

Google Alerts, AlertHN, Bloomberg Terminal and other systems that let their users subscribe to something use similar technology.

Performing a percolate query with CALL PQ

The key thing to remember about percolate queries is that your search queries are already stored in the table. What you need to provide is documents, to check whether any of them match any of the stored rules.

You can perform a percolate query via the SQL or JSON interfaces, as well as via programming language clients. The SQL way offers more flexibility, while the HTTP way is simpler and provides most of what you need. The table below can help you understand the differences.

Desired behaviour | SQL | HTTP | PHP
Provide a single document | CALL PQ('tbl', '{doc1}') | query.percolate.document{doc1} | $client->pq()->search([$percolate])
Provide a single document (alternative) | CALL PQ('tbl', 'doc1', 0 as docs_json) | - | -
Provide multiple documents | CALL PQ('tbl', ('doc1', 'doc2'), 0 as docs_json) | query.percolate.documents[{doc1}, {doc2}] | $client->pq()->search([$percolate])
Provide multiple documents (alternative) | CALL PQ('tbl', ('{doc1}', '{doc2}')) | - | -
Provide multiple documents (alternative) | CALL PQ('tbl', '[{doc1}, {doc2}]') | - | -
Return matching document ids | 0/1 as docs (disabled by default) | Enabled by default | Enabled by default
Use document's own id to show in the result | 'id field' as docs_id (disabled by default) | Not available | Not available
Consider input documents are JSON | 1 as docs_json (1 by default) | Enabled by default | Enabled by default
Consider input documents are plain text | 0 as docs_json (1 by default) | Not available | Not available
Sparsed distribution mode | default | default | default
Sharded distribution mode | sharded as mode | Not available | Not available
Return all info about matching query | 1 as query (0 by default) | Enabled by default | Enabled by default
Skip invalid JSON | 1 as skip_bad_json (0 by default) | Not available | Not available
Extended info in SHOW META | 1 as verbose (0 by default) | Not available | Not available
Define the number which will be added to document ids if no docs_id fields provided (makes sense mostly in distributed PQ modes) | 1 as shift (0 by default) | Not available | Not available

To demonstrate how it works, here are a few examples. Let's create a PQ table with 2 fields:

  • title (text)
  • color (string)

and 3 rules in it:

  • Just full-text. Query: @title bag
  • Full-text and filtering. Query: @title shoes. Filters: color='red'
  • Full-text and more complex filtering. Query: @title shoes. Filters: color IN('blue', 'green')
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CREATE TABLE products(title text, color string) type='pq';
  2. INSERT INTO products(query) values('@title bag');
  3. INSERT INTO products(query,filters) values('@title shoes', 'color=\'red\'');
  4. INSERT INTO products(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');
  5. select * from products;
  1. PUT /pq/products/doc/
  2. {
  3. "query": {
  4. "match": {
  5. "title": "bag"
  6. }
  7. },
  8. "filters": ""
  9. }
  10. PUT /pq/products/doc/
  11. {
  12. "query": {
  13. "match": {
  14. "title": "shoes"
  15. }
  16. },
  17. "filters": "color='red'"
  18. }
  19. PUT /pq/products/doc/
  20. {
  21. "query": {
  22. "match": {
  23. "title": "shoes"
  24. }
  25. },
  26. "filters": "color IN ('blue', 'green')"
  27. }
  1. $index = [
  2. 'index' => 'products',
  3. 'body' => [
  4. 'columns' => [
  5. 'title' => ['type' => 'text'],
  6. 'color' => ['type' => 'string']
  7. ],
  8. 'settings' => [
  9. 'type' => 'pq'
  10. ]
  11. ]
  12. ];
  13. $client->indices()->create($index);
  14. $query = [
  15. 'index' => 'products',
  16. 'body' => [ 'query'=>['match'=>['title'=>'bag']]]
  17. ];
  18. $client->pq()->doc($query);
  19. $query = [
  20. 'index' => 'products',
  21. 'body' => [ 'query'=>['match'=>['title'=>'shoes']],'filters'=>"color='red'"]
  22. ];
  23. $client->pq()->doc($query);
  24. $query = [
  25. 'index' => 'products',
  26. 'body' => [ 'query'=>['match'=>['title'=>'shoes']],'filters'=>"color IN ('blue', 'green')"]
  27. ];
  28. $client->pq()->doc($query);
  1. utilsApi.sql('create table products(title text, color string) type=\'pq\'')
  2. indexApi.insert({"index" : "products", "doc" : {"query" : "@title bag" }})
  3. indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes", "filters": "color='red'" }})
  4. indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes","filters": "color IN ('blue', 'green')" }})
  1. res = await utilsApi.sql('create table products(title text, color string) type=\'pq\'');
  2. res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title bag" }});
  3. res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes", "filters": "color='red'" }});
  4. res = indexApi.insert({"index" : "products", "doc" : {"query" : "@title shoes","filters": "color IN ('blue', 'green')" }});
  1. utilsApi.sql("create table products(title text, color string) type='pq'");
  2. doc = new HashMap<String,Object>(){{
  3. put("query", "@title bag");
  4. }};
  5. newdoc = new InsertDocumentRequest();
  6. newdoc.index("products").setDoc(doc);
  7. indexApi.insert(newdoc);
  8. doc = new HashMap<String,Object>(){{
  9. put("query", "@title shoes");
  10. put("filters", "color='red'");
  11. }};
  12. newdoc = new InsertDocumentRequest();
  13. newdoc.index("products").setDoc(doc);
  14. indexApi.insert(newdoc);
  15. doc = new HashMap<String,Object>(){{
  16. put("query", "@title shoes");
  17. put("filters", "color IN ('blue', 'green')");
  18. }};
  19. newdoc = new InsertDocumentRequest();
  20. newdoc.index("products").setDoc(doc);
  21. indexApi.insert(newdoc);

Response

  1. +---------------------+--------------+------+---------------------------+
  2. | id | query | tags | filters |
  3. +---------------------+--------------+------+---------------------------+
  4. | 1657852401006149635 | @title shoes | | color IN ('blue', 'green') |
  5. | 1657852401006149636 | @title shoes | | color='red' |
  6. | 1657852401006149637 | @title bag | | |
  7. +---------------------+--------------+------+---------------------------+
  1. {
  2. "index": "products",
  3. "type": "doc",
  4. "_id": "1657852401006149661",
  5. "result": "created"
  6. }
  7. {
  8. "index": "products",
  9. "type": "doc",
  10. "_id": "1657852401006149662",
  11. "result": "created"
  12. }
  13. {
  14. "index": "products",
  15. "type": "doc",
  16. "_id": "1657852401006149663",
  17. "result": "created"
  18. }
  1. Array(
  2. [index] => products
  3. [type] => doc
  4. [_id] => 1657852401006149661
  5. [result] => created
  6. )
  7. Array(
  8. [index] => products
  9. [type] => doc
  10. [_id] => 1657852401006149662
  11. [result] => created
  12. )
  13. Array(
  14. [index] => products
  15. [type] => doc
  16. [_id] => 1657852401006149663
  17. [result] => created
  18. )
  1. {'created': True,
  2. 'found': None,
  3. 'id': 0,
  4. 'index': 'products',
  5. 'result': 'created'}
  6. {'created': True,
  7. 'found': None,
  8. 'id': 0,
  9. 'index': 'products',
  10. 'result': 'created'}
  11. {'created': True,
  12. 'found': None,
  13. 'id': 0,
  14. 'index': 'products',
  15. 'result': 'created'}
  1. "_index":"products","_id":0,"created":true,"result":"created"}
  2. {"_index":"products","_id":0,"created":true,"result":"created"}
  3. {"_index":"products","_id":0,"created":true,"result":"created"}
  1. {total=0, error=, warning=}
  2. class SuccessResponse {
  3. index: products
  4. id: 0
  5. created: true
  6. result: created
  7. found: null
  8. }
  9. class SuccessResponse {
  10. index: products
  11. id: 0
  12. created: true
  13. result: created
  14. found: null
  15. }
  16. class SuccessResponse {
  17. index: products
  18. id: 0
  19. created: true
  20. result: created
  21. found: null
  22. }
Just tell me what PQ rules match my single document

The first document doesn't match any rules. It could match the two 'shoes' rules, but they also require additional filters on color.

The second document matches one rule. Note that CALL PQ by default expects the document to be JSON, but with 0 as docs_json you can pass a plain string instead.

  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CALL PQ('products', 'Beautiful shoes', 0 as docs_json);
  2. CALL PQ('products', 'What a nice bag', 0 as docs_json);
  3. CALL PQ('products', '{"title": "What a nice bag"}');
  1. POST /pq/products/_search
  2. {
  3. "query": {
  4. "percolate": {
  5. "document": {
  6. "title": "What a nice bag"
  7. }
  8. }
  9. }
  10. }
  1. $percolate = [
  2. 'index' => 'products',
  3. 'body' => [
  4. 'query' => [
  5. 'percolate' => [
  6. 'document' => [
  7. 'title' => 'What a nice bag'
  8. ]
  9. ]
  10. ]
  11. ]
  12. ];
  13. $client->pq()->search($percolate);
  1. searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}})
  1. res = await searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}});
  1. PercolateRequest percolateRequest = new PercolateRequest();
  2. query = new HashMap<String,Object>(){{
  3. put("percolate",new HashMap<String,Object >(){{
  4. put("document", new HashMap<String,Object >(){{
  5. put("title","what a nice bag");
  6. }});
  7. }});
  8. }};
  9. percolateRequest.query(query);
  10. searchApi.percolate("test_pq",percolateRequest);

Response

  1. +---------------------+
  2. | id |
  3. +---------------------+
  4. | 1657852401006149637 |
  5. +---------------------+
  6. +---------------------+
  7. | id |
  8. +---------------------+
  9. | 1657852401006149637 |
  10. +---------------------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 1,
  6. "max_score": 1,
  7. "hits": [
  8. {
  9. "_index": "products",
  10. "_type": "doc",
  11. "_id": "1657852401006149644",
  12. "_score": "1",
  13. "_source": {
  14. "query": {
  15. "ql": "@title bag"
  16. }
  17. },
  18. "fields": {
  19. "_percolator_document_slot": [
  20. 1
  21. ]
  22. }
  23. }
  24. ]
  25. }
  26. }
  1. Array
  2. (
  3. [took] => 0
  4. [timed_out] =>
  5. [hits] => Array
  6. (
  7. [total] => 1
  8. [max_score] => 1
  9. [hits] => Array
  10. (
  11. [0] => Array
  12. (
  13. [_index] => products
  14. [_type] => doc
  15. [_id] => 1657852401006149644
  16. [_score] => 1
  17. [_source] => Array
  18. (
  19. [query] => Array
  20. (
  21. [match] => Array
  22. (
  23. [title] => bag
  24. )
  25. )
  26. )
  27. [fields] => Array
  28. (
  29. [_percolator_document_slot] => Array
  30. (
  31. [0] => 1
  32. )
  33. )
  34. )
  35. )
  36. )
  37. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381480',
  2. u'_index': u'products',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'@title bag'}},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [1]}}],
  7. 'total': 1},
  8. 'profile': None,
  9. 'timed_out': False,
  10. 'took': 0}
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 1,
  6. "hits": [
  7. {
  8. "_index": "products",
  9. "_type": "doc",
  10. "_id": "2811045522851233808",
  11. "_score": "1",
  12. "_source": {
  13. "query": {
  14. "ql": "@title bag"
  15. }
  16. },
  17. "fields": {
  18. "_percolator_document_slot": [
  19. 1
  20. ]
  21. }
  22. }
  23. ]
  24. }
  25. }
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 1
  6. maxScore: 1
  7. hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }
I want to know the complete PQ rules matching my document
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CALL PQ('products', '{"title": "What a nice bag"}', 1 as query);
  1. POST /pq/products/_search
  2. {
  3. "query": {
  4. "percolate": {
  5. "document": {
  6. "title": "What a nice bag"
  7. }
  8. }
  9. }
  10. }
  1. $percolate = [
  2. 'index' => 'products',
  3. 'body' => [
  4. 'query' => [
  5. 'percolate' => [
  6. 'document' => [
  7. 'title' => 'What a nice bag'
  8. ]
  9. ]
  10. ]
  11. ]
  12. ];
  13. $client->pq()->search($percolate);
  1. searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}})
  1. res = await searchApi.percolate('products',{"query":{"percolate":{"document":{"title":"What a nice bag"}}}});
  1. PercolateRequest percolateRequest = new PercolateRequest();
  2. query = new HashMap<String,Object>(){{
  3. put("percolate",new HashMap<String,Object >(){{
  4. put("document", new HashMap<String,Object >(){{
  5. put("title","what a nice bag");
  6. }});
  7. }});
  8. }};
  9. percolateRequest.query(query);
  10. searchApi.percolate("test_pq",percolateRequest);

Response

  1. +---------------------+------------+------+---------+
  2. | id | query | tags | filters |
  3. +---------------------+------------+------+---------+
  4. | 1657852401006149637 | @title bag | | |
  5. +---------------------+------------+------+---------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 1,
  6. "max_score": 1,
  7. "hits": [
  8. {
  9. "_index": "products",
  10. "_type": "doc",
  11. "_id": "1657852401006149644",
  12. "_score": "1",
  13. "_source": {
  14. "query": {
  15. "ql": "@title bag"
  16. }
  17. },
  18. "fields": {
  19. "_percolator_document_slot": [
  20. 1
  21. ]
  22. }
  23. }
  24. ]
  25. }
  26. }
  1. Array
  2. (
  3. [took] => 0
  4. [timed_out] =>
  5. [hits] => Array
  6. (
  7. [total] => 1
  8. [max_score] => 1
  9. [hits] => Array
  10. (
  11. [0] => Array
  12. (
  13. [_index] => products
  14. [_type] => doc
  15. [_id] => 1657852401006149644
  16. [_score] => 1
  17. [_source] => Array
  18. (
  19. [query] => Array
  20. (
  21. [match] => Array
  22. (
  23. [title] => bag
  24. )
  25. )
  26. )
  27. [fields] => Array
  28. (
  29. [_percolator_document_slot] => Array
  30. (
  31. [0] => 1
  32. )
  33. )
  34. )
  35. )
  36. )
  37. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381480',
  2. u'_index': u'products',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'@title bag'}},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [1]}}],
  7. 'total': 1},
  8. 'profile': None,
  9. 'timed_out': False,
  10. 'took': 0}
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 1,
  6. "hits": [
  7. {
  8. "_index": "products",
  9. "_type": "doc",
  10. "_id": "2811045522851233808",
  11. "_score": "1",
  12. "_source": {
  13. "query": {
  14. "ql": "@title bag"
  15. }
  16. },
  17. "fields": {
  18. "_percolator_document_slot": [
  19. 1
  20. ]
  21. }
  22. }
  23. ]
  24. }
  25. }
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 1
  6. maxScore: 1
  7. hits: [{_index=products, _type=doc, _id=2811045522851234109, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[1]}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }
How about multiple documents?

Note that via CALL PQ you can provide multiple documents in different ways:

  • as an array of plain documents in round brackets: ('doc1', 'doc2'). This requires 0 as docs_json
  • as an array of JSONs in round brackets: ('{doc1}', '{doc2}')
  • or as a standard JSON array: '[{doc1}, {doc2}]'
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CALL PQ('products', ('nice pair of shoes', 'beautiful bag'), 1 as query, 0 as docs_json);
  2. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "red"}', '{"title": "beautiful bag"}'), 1 as query);
  3. CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query);
  1. POST /pq/products/_search
  2. {
  3. "query": {
  4. "percolate": {
  5. "documents": [
  6. {"title": "nice pair of shoes", "color": "blue"},
  7. {"title": "beautiful bag"}
  8. ]
  9. }
  10. }
  11. }
  1. $percolate = [
  2. 'index' => 'products',
  3. 'body' => [
  4. 'query' => [
  5. 'percolate' => [
  6. 'documents' => [
  7. ['title' => 'nice pair of shoes','color'=>'blue'],
  8. ['title' => 'beautiful bag']
  9. ]
  10. ]
  11. ]
  12. ]
  13. ];
  14. $client->pq()->search($percolate);
  1. searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}})
  1. res = await searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}});
  1. percolateRequest = new PercolateRequest();
  2. query = new HashMap<String,Object>(){{
  3. put("percolate",new HashMap<String,Object >(){{
  4. put("documents", new ArrayList<Object>(){{
  5. add(new HashMap<String,Object >(){{
  6. put("title","nice pair of shoes");
  7. put("color","blue");
  8. }});
  9. add(new HashMap<String,Object >(){{
  10. put("title","beautiful bag");
  11. }});
  12. }});
  13. }});
  14. }};
  15. percolateRequest.query(query);
  16. searchApi.percolate("products",percolateRequest);

Response

  1. +---------------------+------------+------+---------+
  2. | id | query | tags | filters |
  3. +---------------------+------------+------+---------+
  4. | 1657852401006149637 | @title bag | | |
  5. +---------------------+------------+------+---------+
  6. +---------------------+--------------+------+-------------+
  7. | id | query | tags | filters |
  8. +---------------------+--------------+------+-------------+
  9. | 1657852401006149636 | @title shoes | | color='red' |
  10. | 1657852401006149637 | @title bag | | |
  11. +---------------------+--------------+------+-------------+
  12. +---------------------+--------------+------+---------------------------+
  13. | id | query | tags | filters |
  14. +---------------------+--------------+------+---------------------------+
  15. | 1657852401006149635 | @title shoes | | color IN ('blue', 'green') |
  16. | 1657852401006149637 | @title bag | | |
  17. +---------------------+--------------+------+---------------------------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 2,
  6. "max_score": 1,
  7. "hits": [
  8. {
  9. "_index": "products",
  10. "_type": "doc",
  11. "_id": "1657852401006149644",
  12. "_score": "1",
  13. "_source": {
  14. "query": {
  15. "ql": "@title bag"
  16. }
  17. },
  18. "fields": {
  19. "_percolator_document_slot": [
  20. 2
  21. ]
  22. }
  23. },
  24. {
  25. "_index": "products",
  26. "_type": "doc",
  27. "_id": "1657852401006149646",
  28. "_score": "1",
  29. "_source": {
  30. "query": {
  31. "ql": "@title shoes"
  32. }
  33. },
  34. "fields": {
  35. "_percolator_document_slot": [
  36. 1
  37. ]
  38. }
  39. }
  40. ]
  41. }
  42. }
  1. Array
  2. (
  3. [took] => 23
  4. [timed_out] =>
  5. [hits] => Array
  6. (
  7. [total] => 2
  8. [max_score] => 1
  9. [hits] => Array
  10. (
  11. [0] => Array
  12. (
  13. [_index] => products
  14. [_type] => doc
  15. [_id] => 2810781492890828819
  16. [_score] => 1
  17. [_source] => Array
  18. (
  19. [query] => Array
  20. (
  21. [match] => Array
  22. (
  23. [title] => bag
  24. )
  25. )
  26. )
  27. [fields] => Array
  28. (
  29. [_percolator_document_slot] => Array
  30. (
  31. [0] => 2
  32. )
  33. )
  34. )
  35. [1] => Array
  36. (
  37. [_index] => products
  38. [_type] => doc
  39. [_id] => 2810781492890828821
  40. [_score] => 1
  41. [_source] => Array
  42. (
  43. [query] => Array
  44. (
  45. [match] => Array
  46. (
  47. [title] => shoes
  48. )
  49. )
  50. )
  51. [fields] => Array
  52. (
  53. [_percolator_document_slot] => Array
  54. (
  55. [0] => 1
  56. )
  57. )
  58. )
  59. )
  60. )
  61. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381494',
  2. u'_index': u'products',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'@title bag'}},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [2]}},
  7. {u'_id': u'2811025403043381496',
  8. u'_index': u'products',
  9. u'_score': u'1',
  10. u'_source': {u'query': {u'ql': u'@title shoes'}},
  11. u'_type': u'doc',
  12. u'fields': {u'_percolator_document_slot': [1]}}],
  13. 'total': 2},
  14. 'profile': None,
  15. 'timed_out': False,
  16. 'took': 0}
  1. {
  2. "took": 6,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 2,
  6. "hits": [
  7. {
  8. "_index": "products",
  9. "_type": "doc",
  10. "_id": "2811045522851233808",
  11. "_score": "1",
  12. "_source": {
  13. "query": {
  14. "ql": "@title bag"
  15. }
  16. },
  17. "fields": {
  18. "_percolator_document_slot": [
  19. 2
  20. ]
  21. }
  22. },
  23. {
  24. "_index": "products",
  25. "_type": "doc",
  26. "_id": "2811045522851233810",
  27. "_score": "1",
  28. "_source": {
  29. "query": {
  30. "ql": "@title shoes"
  31. }
  32. },
  33. "fields": {
  34. "_percolator_document_slot": [
  35. 1
  36. ]
  37. }
  38. }
  39. ]
  40. }
  41. }
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 2
  6. maxScore: 1
  7. hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }
I want to know what docs match what rules

The option 1 as docs allows you to see which of the provided documents match which rules.

  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query, 1 as docs);
  1. POST /pq/products/_search
  2. {
  3. "query": {
  4. "percolate": {
  5. "documents": [
  6. {"title": "nice pair of shoes", "color": "blue"},
  7. {"title": "beautiful bag"}
  8. ]
  9. }
  10. }
  11. }
  1. $percolate = [
  2. 'index' => 'products',
  3. 'body' => [
  4. 'query' => [
  5. 'percolate' => [
  6. 'documents' => [
  7. ['title' => 'nice pair of shoes','color'=>'blue'],
  8. ['title' => 'beautiful bag']
  9. ]
  10. ]
  11. ]
  12. ]
  13. ];
  14. $client->pq()->search($percolate);
  1. searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}})
  1. res = await searchApi.percolate('products',{"query":{"percolate":{"documents":[{"title":"nice pair of shoes","color":"blue"},{"title":"beautiful bag"}]}}});
  1. percolateRequest = new PercolateRequest();
  2. query = new HashMap<String,Object>(){{
  3. put("percolate",new HashMap<String,Object >(){{
  4. put("documents", new ArrayList<Object>(){{
  5. add(new HashMap<String,Object >(){{
  6. put("title","nice pair of shoes");
  7. put("color","blue");
  8. }});
  9. add(new HashMap<String,Object >(){{
  10. put("title","beautiful bag");
  11. }});
  12. }});
  13. }});
  14. }};
  15. percolateRequest.query(query);
  16. searchApi.percolate("products",percolateRequest);

Response

  1. +---------------------+-----------+--------------+------+---------------------------+
  2. | id | documents | query | tags | filters |
  3. +---------------------+-----------+--------------+------+---------------------------+
  4. | 1657852401006149635 | 1 | @title shoes | | color IN ('blue', 'green') |
  5. | 1657852401006149637 | 2 | @title bag | | |
  6. +---------------------+-----------+--------------+------+---------------------------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 2,
  6. "max_score": 1,
  7. "hits": [
  8. {
  9. "_index": "products",
  10. "_type": "doc",
  11. "_id": "1657852401006149644",
  12. "_score": "1",
  13. "_source": {
  14. "query": {
  15. "ql": "@title bag"
  16. }
  17. },
  18. "fields": {
  19. "_percolator_document_slot": [
  20. 2
  21. ]
  22. }
  23. },
  24. {
  25. "_index": "products",
  26. "_type": "doc",
  27. "_id": "1657852401006149646",
  28. "_score": "1",
  29. "_source": {
  30. "query": {
  31. "ql": "@title shoes"
  32. }
  33. },
  34. "fields": {
  35. "_percolator_document_slot": [
  36. 1
  37. ]
  38. }
  39. }
  40. ]
  41. }
  42. }
  1. Array
  2. (
  3. [took] => 23
  4. [timed_out] =>
  5. [hits] => Array
  6. (
  7. [total] => 2
  8. [max_score] => 1
  9. [hits] => Array
  10. (
  11. [0] => Array
  12. (
  13. [_index] => products
  14. [_type] => doc
  15. [_id] => 2810781492890828819
  16. [_score] => 1
  17. [_source] => Array
  18. (
  19. [query] => Array
  20. (
  21. [match] => Array
  22. (
  23. [title] => bag
  24. )
  25. )
  26. )
  27. [fields] => Array
  28. (
  29. [_percolator_document_slot] => Array
  30. (
  31. [0] => 2
  32. )
  33. )
  34. )
  35. [1] => Array
  36. (
  37. [_index] => products
  38. [_type] => doc
  39. [_id] => 2810781492890828821
  40. [_score] => 1
  41. [_source] => Array
  42. (
  43. [query] => Array
  44. (
  45. [match] => Array
  46. (
  47. [title] => shoes
  48. )
  49. )
  50. )
  51. [fields] => Array
  52. (
  53. [_percolator_document_slot] => Array
  54. (
  55. [0] => 1
  56. )
  57. )
  58. )
  59. )
  60. )
  61. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381494',
  2. u'_index': u'products',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'@title bag'}},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [2]}},
  7. {u'_id': u'2811025403043381496',
  8. u'_index': u'products',
  9. u'_score': u'1',
  10. u'_source': {u'query': {u'ql': u'@title shoes'}},
  11. u'_type': u'doc',
  12. u'fields': {u'_percolator_document_slot': [1]}}],
  13. 'total': 2},
  14. 'profile': None,
  15. 'timed_out': False,
  16. 'took': 0}
  1. {
  2. "took": 6,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 2,
  6. "hits": [
  7. {
  8. "_index": "products",
  9. "_type": "doc",
  10. "_id": "2811045522851233808",
  11. "_score": "1",
  12. "_source": {
  13. "query": {
  14. "ql": "@title bag"
  15. }
  16. },
  17. "fields": {
  18. "_percolator_document_slot": [
  19. 2
  20. ]
  21. }
  22. },
  23. {
  24. "_index": "products",
  25. "_type": "doc",
  26. "_id": "2811045522851233810",
  27. "_score": "1",
  28. "_source": {
  29. "query": {
  30. "ql": "@title shoes"
  31. }
  32. },
  33. "fields": {
  34. "_percolator_document_slot": [
  35. 1
  36. ]
  37. }
  38. }
  39. ]
  40. }
  41. }
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 2
  6. maxScore: 1
  7. hits: [{_index=products, _type=doc, _id=2811045522851234133, _score=1, _source={query={ql=@title bag}}, fields={_percolator_document_slot=[2]}}, {_index=products, _type=doc, _id=2811045522851234135, _score=1, _source={query={ql=@title shoes}}, fields={_percolator_document_slot=[1]}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }
Static ids

By default, matching document ids correspond to their relative numbers in the list you provide. But in some cases each document already has its own id. For this case there's the option 'id field name' as docs_id for CALL PQ.

Note that if the id cannot be found by the provided field name, the PQ rule will not be shown in the results.

This option is only available for CALL PQ via SQL.

  • SQL


  1. CALL PQ('products', '[{"id": 123, "title": "nice pair of shoes", "color": "blue"}, {"id": 456, "title": "beautiful bag"}]', 1 as query, 'id' as docs_id, 1 as docs);

Response

  1. +---------------------+-----------+--------------+------+---------------------------+
  2. | id | documents | query | tags | filters |
  3. +---------------------+-----------+--------------+------+---------------------------+
  4. | 1657852401006149664 | 456 | @title bag | | |
  5. | 1657852401006149666 | 123 | @title shoes | | color IN ('blue', 'green') |
  6. +---------------------+-----------+--------------+------+---------------------------+
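A related SQL-only option from the table at the beginning of this section is shift: when the documents don't carry their own ids (no docs_id), the given number is added to the documents' relative numbers in the result, which mostly makes sense in distributed PQ modes to avoid collisions. A sketch, assuming the shift is simply added to the relative numbers:

  CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as docs, 1000000 as shift);
  -- the "documents" column is expected to show 1000001 and 1000002 instead of 1 and 2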
I may have invalid JSONs, please skip them

If you provide documents as separate JSONs, there is an option for CALL PQ to skip invalid JSONs. In the example below, note that in the 2nd and 3rd queries the 2nd JSON is invalid. Without 1 as skip_bad_json the 2nd query fails; adding it in the 3rd query allows avoiding that. This option is not available via JSON over HTTP, as the whole JSON query must always be valid when sent via the HTTP protocol.

  • SQL


  1. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'));
  2. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'));
  3. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'), 1 as skip_bad_json);

Response

  1. +---------------------+
  2. | id |
  3. +---------------------+
  4. | 1657852401006149635 |
  5. | 1657852401006149637 |
  6. +---------------------+
  7. ERROR 1064 (42000): Bad JSON objects in strings: 2
  8. +---------------------+
  9. | id |
  10. +---------------------+
  11. | 1657852401006149635 |
  12. +---------------------+
I want higher performance of a percolate query

Percolate queries were designed with high throughput and big data volumes in mind, so there are a few things you can do to optimize performance if you are looking for lower latency and higher throughput.

There are two modes of distributing a percolate table and running a percolate query against it:

  • Sparsed. Default. When it is good: there are many documents and the PQ tables are mirrored. The batch of documents you pass will be split into parts according to the number of agents, so each node receives and processes only a part of the documents from your request. It is beneficial when your set of documents is quite big, but the set of queries stored in the PQ table is quite small. Assuming that all the hosts are mirrors, Manticore will split your set of documents and distribute the chunks among the mirrors. Once the agents are done with the queries, it will collect and merge all the results and return the final result set as if it came from one solid table. You can use replication to assist the process.
  • Sharded. When it is good: there are many PQ rules and the rules are split among several PQ tables. The whole set of documents will be broadcast to all tables of the distributed PQ table without any initial split of the documents. It is beneficial when you push a relatively small set of documents, but the number of stored queries is huge. In this case it is more appropriate to store just a part of the PQ rules on each node and then merge the results returned by the nodes, which processed the same set of documents against different sets of PQ rules. This mode has to be set explicitly, since, first of all, it implies multiplication of the network payload and, secondly, it expects tables with different PQ rules, which replication cannot provide out of the box.

Let's assume you have a table pq_d2, which is defined as:

  1. table pq_d2
  2. {
  3. type = distributed
  4. agent = 127.0.0.1:6712:pq
  5. agent = 127.0.0.1:6712:ptitle
  6. }

Each of ‘pq’ and ‘ptitle’ contains:

  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. SELECT * FROM pq;
  1. POST /pq/pq/_search
  1. $params = [
  2. 'index' => 'pq',
  3. 'body' => [
  4. ]
  5. ];
  6. $response = $client->pq()->search($params);
  1. searchApi.search({"index":"pq","query":{"match_all":{}}})
  1. res = await searchApi.search({"index":"pq","query":{"match_all":{}}});
  1. Map<String,Object> query = new HashMap<String,Object>();
  2. query.put("match_all",null);
  3. SearchRequest searchRequest = new SearchRequest();
  4. searchRequest.setIndex("pq");
  5. searchRequest.setQuery(query);
  6. SearchResponse searchResponse = searchApi.search(searchRequest);

Response

  1. +------+-------------+------+-------------------+
  2. | id | query | tags | filters |
  3. +------+-------------+------+-------------------+
  4. | 1 | filter test | | gid>=10 |
  5. | 2 | angry | | gid>=10 OR gid<=3 |
  6. +------+-------------+------+-------------------+
  7. 2 rows in set (0.01 sec)
  1. {
  2. "took":0,
  3. "timed_out":false,
  4. "hits":{
  5. "total":2,
  6. "hits":[
  7. {
  8. "_id":"1",
  9. "_score":1,
  10. "_source":{
  11. "query":{ "ql":"filter test" },
  12. "tags":"",
  13. "filters":"gid>=10"
  14. }
  15. },
  16. {
  17. "_id":"2",
  18. "_score":1,
  19. "_source":{
  20. "query":{"ql":"angry"},
  21. "tags":"",
  22. "filters":"gid>=10 OR gid<=3"
  23. }
  24. }
  25. ]
  26. }
  27. }
  1. (
  2. [took] => 0
  3. [timed_out] =>
  4. [hits] =>
  5. (
  6. [total] => 2
  7. [hits] =>
  8. (
  9. [0] =>
  10. (
  11. [_id] => 1
  12. [_score] => 1
  13. [_source] =>
  14. (
  15. [query] =>
  16. (
  17. [ql] => filter test
  18. )
  19. [tags] =>
  20. [filters] => gid>=10
  21. )
  22. ),
  23. [1] =>
  24. (
  25. [_id] => 1
  26. [_score] => 1
  27. [_source] =>
  28. (
  29. [query] =>
  30. (
  31. [ql] => angry
  32. )
  33. [tags] =>
  34. [filters] => gid>=10 OR gid<=3
  35. )
  36. )
  37. )
  38. )
  39. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381501',
  2. u'_score': 1,
  3. u'_source': {u'filters': u"gid>=10",
  4. u'query': u'filter test',
  5. u'tags': u''}},
  6. {u'_id': u'2811025403043381502',
  7. u'_score': 1,
  8. u'_source': {u'filters': u"gid>=10 OR gid<=3",
  9. u'query': u'angry',
  10. u'tags': u''}}],
  11. 'total': 2},
  12. 'profile': None,
  13. 'timed_out': False,
  14. 'took': 0}
  1. {"hits": {"hits": [{"_id": "2811025403043381501",
  2. "_score": 1,
  3. "_source": {"filters": u"gid>=10",
  4. "query": "filter test",
  5. "tags": ""}},
  6. {"_id": "2811025403043381502",
  7. "_score": 1,
  8. "_source": {"filters": u"gid>=10 OR gid<=3",
  9. "query": "angry",
  10. "tags": ""}}],
  11. "total": 2},
  12. "timed_out": false,
  13. "took": 0}
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 2
  6. maxScore: null
  7. hits: [{_id=2811045522851233962, _score=1, _source={filters=gid>=10, query=filter test, tags=}}, {_id=2811045522851233951, _score=1, _source={filters=gid>=10 OR gid<=3, query=angry,tags=}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }

And you fire CALL PQ at the distributed table with a couple of docs.

  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java


  1. CALL PQ ('pq_d2', ('{"title":"angry test", "gid":3 }', '{"title":"filter test doc2", "gid":13}'), 1 AS docs);
  POST /pq/pq/_search -d '
  {
    "query":
    {
      "percolate":
      {
        "documents" : [
          { "title": "angry test", "gid": 3 },
          { "title": "filter test doc2", "gid": 13 }
        ]
      }
    }
  }
  '
  1. $params = [
  2. 'index' => 'pq',
  3. 'body' => [
  4. 'query' => [
  5. 'percolate' => [
  6. 'documents' => [
  7. [
  8. 'title'=>'angry test',
  9. 'gid' => 3
  10. ],
  11. [
  12. 'title'=>'filter test doc2',
  13. 'gid' => 13
  14. ],
  15. ]
  16. ]
  17. ]
  18. ]
  19. ];
  20. $response = $client->pq()->search($params);
  1. searchApi.percolate('pq',{"percolate":{"documents":[{"title":"angry test","gid":3},{"title":"filter test doc2","gid":13}]}})
  1. res = await searchApi.percolate('pq',{"percolate":{"documents":[{"title":"angry test","gid":3},{"title":"filter test doc2","gid":13}]}});
  1. percolateRequest = new PercolateRequest();
  2. query = new HashMap<String,Object>(){{
  3. put("percolate",new HashMap<String,Object >(){{
  4. put("documents", new ArrayList<Object>(){{
  5. add(new HashMap<String,Object >(){{
  6. put("title","angry test");
  7. put("gid",3);
  8. }});
  9. add(new HashMap<String,Object >(){{
  10. put("title","filter test doc2");
  11. put("gid",13);
  12. }});
  13. }});
  14. }});
  15. }};
  16. percolateRequest.query(query);
  17. searchApi.percolate("pq",percolateRequest);

Response

  1. +------+-----------+
  2. | id | documents |
  3. +------+-----------+
  4. | 1 | 2 |
  5. | 2 | 1 |
  6. +------+-----------+
  1. {
  2. "took":0,
  3. "timed_out":false,
  4. "hits":{
  5. "total":2,"hits":[
  6. {
  7. "_id":"2",
  8. "_score":1,
  9. "_source":{
  10. "query":{"title":"angry"},
  11. "tags":"",
  12. "filters":"gid>=10 OR gid<=3"
  13. }
  14. },
  15. {
  16. "_id":"1",
  17. "_score":1,
  18. "_source":{
  19. "query":{"ql":"filter test"},
  20. "tags":"",
  21. "filters":"gid>=10"
  22. }
  23. }
  24. ]
  25. }
  26. }
  1. (
  2. [took] => 0
  3. [timed_out] =>
  4. [hits] =>
  5. (
  6. [total] => 2
  7. [hits] =>
  8. (
  9. [0] =>
  10. (
  11. [_index] => pq
  12. [_type] => doc
  13. [_id] => 2
  14. [_score] => 1
  15. [_source] =>
  16. (
  17. [query] =>
  18. (
  19. [ql] => angry
  20. )
  21. [tags] =>
  22. [filters] => gid>=10 OR gid<=3
  23. ),
  24. [fields] =>
  25. (
  26. [_percolator_document_slot] =>
  27. (
  28. [0] => 1
  29. )
  30. )
  31. ),
  32. [1] =>
  33. (
  34. [_index] => pq
  35. [_id] => 1
  36. [_score] => 1
  37. [_source] =>
  38. (
  39. [query] =>
  40. (
  41. [ql] => filter test
  42. )
  43. [tags] =>
  44. [filters] => gid>=10
  45. )
  46. [fields] =>
  47. (
  48. [_percolator_document_slot] =>
  49. (
  50. [0] => 0
  51. )
  52. )
  53. )
  54. )
  55. )
  56. )
  1. {'hits': {'hits': [{u'_id': u'2811025403043381480',
  2. u'_index': u'pq',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'angry'},u'tags':u'',u'filters':u"gid>=10 OR gid<=3"},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [1]}},
  7. {u'_id': u'2811025403043381501',
  8. u'_index': u'pq',
  9. u'_score': u'1',
  10. u'_source': {u'query': {u'ql': u'filter test'},u'tags':u'',u'filters':u"gid>=10"},
  11. u'_type': u'doc',
  12. u'fields': {u'_percolator_document_slot': [1]}}],
  13. 'total': 2},
  14. 'profile': None,
  15. 'timed_out': False,
  16. 'took': 0}
  1. {'hits': {'hits': [{u'_id': u'2811025403043381480',
  2. u'_index': u'pq',
  3. u'_score': u'1',
  4. u'_source': {u'query': {u'ql': u'angry'},u'tags':u'',u'filters':u"gid>=10 OR gid<=3"},
  5. u'_type': u'doc',
  6. u'fields': {u'_percolator_document_slot': [1]}},
  7. {u'_id': u'2811025403043381501',
  8. u'_index': u'pq',
  9. u'_score': u'1',
  10. u'_source': {u'query': {u'ql': u'filter test'},u'tags':u'',u'filters':u"gid>=10"},
  11. u'_type': u'doc',
  12. u'fields': {u'_percolator_document_slot': [1]}}],
  13. 'total': 2},
  14. 'profile': None,
  15. 'timed_out': False,
  16. 'took': 0}
  1. class SearchResponse {
  2. took: 10
  3. timedOut: false
  4. hits: class SearchResponseHits {
  5. total: 2
  6. maxScore: 1
  7. hits: [{_index=pq, _type=doc, _id=2811045522851234165, _score=1, _source={query={ql=@title angry}}, fields={_percolator_document_slot=[1]}}, {_index=pq, _type=doc, _id=2811045522851234166, _score=1, _source={query={ql=@title filter test doc2}}, fields={_percolator_document_slot=[2]}}]
  8. aggregations: null
  9. }
  10. profile: null
  11. }

That was an example of the default sparsed mode. To demonstrate the sharded mode, let's create a distributed PQ table consisting of 2 local PQ tables and add 2 rules to "products1" and 1 rule to "products2":

  1. create table products1(title text, color string) type='pq';
  2. create table products2(title text, color string) type='pq';
  3. create table products_distributed type='distributed' local='products1' local='products2';
  4. INSERT INTO products1(query) values('@title bag');
  5. INSERT INTO products1(query,filters) values('@title shoes', 'color=\'red\'');
  6. INSERT INTO products2(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');

Now, if you add 'sharded' as mode to CALL PQ, it will send the documents to all the agents' tables (in this case just local tables, but they can be remote to utilize external hardware). This mode is not available via the JSON interface.

  • SQL


  1. CALL PQ('products_distributed', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 'sharded' as mode, 1 as query);

Response

  1. +---------------------+--------------+------+---------------------------+
  2. | id | query | tags | filters |
  3. +---------------------+--------------+------+---------------------------+
  4. | 1657852401006149639 | @title bag | | |
  5. | 1657852401006149643 | @title shoes | | color IN ('blue', 'green') |
  6. +---------------------+--------------+------+---------------------------+

Note that the syntax of agent mirrors in the configuration (when several hosts are assigned to one agent line, separated with |) has nothing to do with the CALL PQ query mode, so each agent always represents one node regardless of the number of HA mirrors specified for that agent.
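For example, in a configuration sketch like the following (the hosts and table names are illustrative), the second agent line still counts as a single agent, i.e. a single node, for CALL PQ, no matter how many mirrors it lists:

  table pq_mirrored
  {
      type = distributed
      # a regular agent: one node
      agent = 127.0.0.1:6712:pq
      # an agent with two HA mirrors: still ONE node from the CALL PQ mode's point of view
      agent = 10.0.0.1:6712|10.0.0.2:6712:ptitle
  }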

How do I understand more about the performance?

In some cases you might want to get more details about the performance of a percolate query. For that purpose there is the option 1 as verbose, which is available only via SQL and allows saving more performance metrics. You can see them via the SHOW META query, which you can run after CALL PQ. See SHOW META for more info.

  • 1 as verbose
  • 0 as verbose


  1. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 1 as verbose); show meta;
  1. CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 0 as verbose); show meta;

Response

  1. +---------------------+
  2. | id |
  3. +---------------------+
  4. | 1657852401006149644 |
  5. | 1657852401006149646 |
  6. +---------------------+
  7. +-------------------------+-----------+
  8. | Name | Value |
  9. +-------------------------+-----------+
  10. | Total | 0.000 sec |
  11. | Setup | 0.000 sec |
  12. | Queries matched | 2 |
  13. | Queries failed | 0 |
  14. | Document matched | 2 |
  15. | Total queries stored | 3 |
  16. | Term only queries | 3 |
  17. | Fast rejected queries | 0 |
  18. | Time per query | 27, 10 |
  19. | Time of matched queries | 37 |
  20. +-------------------------+-----------+
  1. +---------------------+
  2. | id |
  3. +---------------------+
  4. | 1657852401006149644 |
  5. | 1657852401006149646 |
  6. +---------------------+
  7. +-----------------------+-----------+
  8. | Name | Value |
  9. +-----------------------+-----------+
  10. | Total | 0.000 sec |
  11. | Queries matched | 2 |
  12. | Queries failed | 0 |
  13. | Document matched | 2 |
  14. | Total queries stored | 3 |
  15. | Term only queries | 3 |
  16. | Fast rejected queries | 0 |
  17. +-----------------------+-----------+

Autocomplete

Autocomplete (or word completion) is a feature in which an application predicts the rest of a word a user is typing. On websites it’s used in search boxes, where a user starts to type a word and a dropdown with suggestions pops up so the user can select the ending from the list.


There are a few ways you can do autocomplete in Manticore:

Autocomplete a sentence

To autocomplete a sentence, you can use infixed search. You can find the endings of a document's field by providing its beginning and:

  • using the full-text wildcard operator * to match anything it substitutes
  • and optionally ^ to start from the beginning of the field
  • and perhaps "" for phrase matching
  • and optionally highlight the results so you don’t have to fetch them in full to your application

There is an article about it in our blog and an interactive course. A quick example is:

  • Let’s assume you have a document: My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.
  • Then you can use ^, "" and * so as the user is typing you make queries like: ^"m*", ^"my *", ^"my c*", ^"my ca*" and so on
  • It will find the document, and if you also do highlighting you will get something like: <strong>My cat</strong> loves my dog. The cat ( ... (a short SQL sketch of this follows below)
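A minimal SQL sketch of this approach (the table name, the min_infix_len value and the sample data are assumptions made for illustration):

  CREATE TABLE blog(content text) min_infix_len='2';
  INSERT INTO blog(content) VALUES('My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.');
  -- as the user types "my ca", anchor the phrase to the field start, add the wildcard and highlight the match
  SELECT HIGHLIGHT() FROM blog WHERE MATCH('^"my ca*"');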
Autocomplete a word

In some cases all you need is to autocomplete just a single word or a couple of words. In this case you can use CALL KEYWORDS.

CALL KEYWORDS

CALL KEYWORDS is available via the SQL interface and provides a way to check how keywords are tokenized or to retrieve the tokenized forms of particular keywords. If the table enables infixes, it allows you to quickly find possible endings for given keywords, which makes it suitable for autocomplete.

This is a good alternative to general infixed search, as it provides higher performance: all it needs to work with is the table's dictionary, not the documents themselves.

General syntax

  1. CALL KEYWORDS(text, table [, options])

CALL KEYWORDS statement splits text into keywords. It returns tokenized and normalized forms of the keywords, and, optionally, keyword statistics. It also returns the position of each keyword in the query and all forms of tokenized keywords in case the table enables lemmatizers.

Parameter | Description
text | Text to break down into keywords
table | Name of the table from which to take the text processing settings
0/1 as stats | Show statistics of keywords; default is 0
0/1 as fold_wildcards | Fold wildcards; default is 0
0/1 as fold_lemmas | Fold morphological lemmas; default is 0
0/1 as fold_blended | Fold blended words; default is 0
N as expansion_limit | Override expansion_limit defined in the server configuration; default is 0 (use the value from the configuration)
docs/hits as sort_mode | Sort output results by either 'docs' or 'hits'; default is no sorting

The examples show how it works, assuming the user is trying to get an autocomplete for "my cat …". So on the application side all you need to do is suggest to the user the endings from the "normalized" column for each new word. It often makes sense to sort by hits or docs using 'hits' as sort_mode or 'docs' as sort_mode.

  • Examples


  1. MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats);
  2. +------+-----------+------------+------+------+
  3. | qpos | tokenized | normalized | docs | hits |
  4. +------+-----------+------------+------+------+
  5. | 1 | m* | my | 1 | 2 |
  6. | 1 | m* | mammal | 1 | 1 |
  7. +------+-----------+------------+------+------+
  8. MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats);
  9. +------+-----------+------------+------+------+
  10. | qpos | tokenized | normalized | docs | hits |
  11. +------+-----------+------------+------+------+
  12. | 1 | my* | my | 1 | 2 |
  13. +------+-----------+------------+------+------+
  14. MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
  15. +------+-----------+-------------+------+------+
  16. | qpos | tokenized | normalized | docs | hits |
  17. +------+-----------+-------------+------+------+
  18. | 1 | c* | cat | 1 | 2 |
  19. | 1 | c* | carnivorous | 1 | 1 |
  20. | 1 | c* | catus | 1 | 1 |
  21. +------+-----------+-------------+------+------+
  22. MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
  23. +------+-----------+-------------+------+------+
  24. | qpos | tokenized | normalized | docs | hits |
  25. +------+-----------+-------------+------+------+
  26. | 1 | ca* | cat | 1 | 2 |
  27. | 1 | ca* | carnivorous | 1 | 1 |
  28. | 1 | ca* | catus | 1 | 1 |
  29. +------+-----------+-------------+------+------+
  30. MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
  31. +------+-----------+------------+------+------+
  32. | qpos | tokenized | normalized | docs | hits |
  33. +------+-----------+------------+------+------+
  34. | 1 | cat* | cat | 1 | 2 |
  35. | 1 | cat* | catus | 1 | 1 |
  36. +------+-----------+------------+------+------+

There is a nice trick to improve the above algorithm: use bigram_index. When it is enabled for the table, what gets indexed is not just each single word, but also each pair of adjacent words as a separate token.

This allows predicting not just the current word's ending, but the next word too, which is especially beneficial for autocomplete.
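For example, such a table could be created as follows (a sketch; the table name is illustrative, and 'all' here means that every pair of adjacent words gets indexed as a separate token):

  CREATE TABLE t(content text) min_infix_len='2' bigram_index='all';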

  • Examples


  1. MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats, 'hits' as sort_mode);
  2. +------+-----------+------------+------+------+
  3. | qpos | tokenized | normalized | docs | hits |
  4. +------+-----------+------------+------+------+
  5. | 1 | m* | my | 1 | 2 |
  6. | 1 | m* | mammal | 1 | 1 |
  7. | 1 | m* | my cat | 1 | 1 |
  8. | 1 | m* | my dog | 1 | 1 |
  9. +------+-----------+------------+------+------+
  10. MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats, 'hits' as sort_mode);
  11. +------+-----------+------------+------+------+
  12. | qpos | tokenized | normalized | docs | hits |
  13. +------+-----------+------------+------+------+
  14. | 1 | my* | my | 1 | 2 |
  15. | 1 | my* | my cat | 1 | 1 |
  16. | 1 | my* | my dog | 1 | 1 |
  17. +------+-----------+------------+------+------+
  18. MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
  19. +------+-----------+--------------------+------+------+
  20. | qpos | tokenized | normalized | docs | hits |
  21. +------+-----------+--------------------+------+------+
  22. | 1 | c* | cat | 1 | 2 |
  23. | 1 | c* | carnivorous | 1 | 1 |
  24. | 1 | c* | carnivorous mammal | 1 | 1 |
  25. | 1 | c* | cat felis | 1 | 1 |
  26. | 1 | c* | cat loves | 1 | 1 |
  27. | 1 | c* | catus | 1 | 1 |
  28. | 1 | c* | catus is | 1 | 1 |
  29. +------+-----------+--------------------+------+------+
  30. MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
  31. +------+-----------+--------------------+------+------+
  32. | qpos | tokenized | normalized | docs | hits |
  33. +------+-----------+--------------------+------+------+
  34. | 1 | ca* | cat | 1 | 2 |
  35. | 1 | ca* | carnivorous | 1 | 1 |
  36. | 1 | ca* | carnivorous mammal | 1 | 1 |
  37. | 1 | ca* | cat felis | 1 | 1 |
  38. | 1 | ca* | cat loves | 1 | 1 |
  39. | 1 | ca* | catus | 1 | 1 |
  40. | 1 | ca* | catus is | 1 | 1 |
  41. +------+-----------+--------------------+------+------+
  42. MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
  43. +------+-----------+------------+------+------+
  44. | qpos | tokenized | normalized | docs | hits |
  45. +------+-----------+------------+------+------+
  46. | 1 | cat* | cat | 1 | 2 |
  47. | 1 | cat* | cat felis | 1 | 1 |
  48. | 1 | cat* | cat loves | 1 | 1 |
  49. | 1 | cat* | catus | 1 | 1 |
  50. | 1 | cat* | catus is | 1 | 1 |
  51. +------+-----------+------------+------+------+

CALL KEYWORDS supports distributed tables, so no matter how big your data set is, you can benefit from using it.
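For instance, it can be pointed at a distributed table in exactly the same way as at a local one (a sketch; the table names are illustrative):

  CREATE TABLE t_dist type='distributed' local='t1' local='t2';
  CALL KEYWORDS('ca*', 't_dist', 1 as stats, 'hits' as sort_mode);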

Spell correction

Spell correction, also known as:

  • Auto correction
  • Text correction
  • Fixing a spelling error
  • Typo tolerance
  • “Did you mean?”

and so on, is functionality that suggests alternatives to, or makes automatic corrections of, the text you have typed. The concept of correcting typed text dates back to the 1960s, when the computer scientist Warren Teitelman, who also invented the "undo" command, came up with a philosophy of computing called D.W.I.M., or "Do What I Mean": rather than programming computers to accept only perfectly formatted instructions, Teitelman argued we should program them to recognize obvious mistakes.

The first well-known product to provide spell correction functionality was Microsoft Word 6.0, released in 1993.

How it works

There are a few ways spell correction can be done, but the important thing is that there is no purely programmatic way to convert your mistyped "ipone" into "iphone" (at least with decent quality). Mostly, there has to be a data set the system is based on. The data set can be:

  • A dictionary of properly spelled words. In its turn it can be:
    • Based on your real data. The idea here is that the spelling in a dictionary built from your data is mostly correct, so for each typed word the system simply tries to find the word that is most similar to it (we'll discuss how exactly this can be done with Manticore shortly)
    • Or based on an external dictionary which has nothing to do with your data. The problem that may arise here is that your data and the external dictionary can be too different: some words may be missing in the dictionary, others may be missing in your data.
  • Not just dictionary-based, but context-aware, e.g. "white ber" would be corrected to "white bear", while "dark ber" to "dark beer". The context may be not just a neighbouring word in your query, but also your location, the date and time, the grammar of the current sentence (to, say, change "there" to "their" or not), your search history and virtually anything else that can affect your intent.
  • Another classical approach is to use previous search queries as the data set for spell correction. This is utilized even more in autocomplete functionality, but it makes sense for autocorrect too. The idea here is that users are mostly right with spelling, so we can use words from their search history as a source of truth even if we don't have the words in our documents and don't use an external dictionary. Context-awareness is possible here as well.

Manticore provides commands CALL QSUGGEST and CALL SUGGEST that can be used for the purpose of automatic spell correction.

CALL QSUGGEST, CALL SUGGEST

They are both available via SQL only and the general syntax is:

  1. CALL QSUGGEST(word, table [,options])
  2. CALL SUGGEST(word, table [,options])
  3. options: N as option_name[, M as another_option, ...]

For a given word, these commands provide all suggestions from the dictionary. They work only on tables with infixing enabled and dict=keywords. They return the suggested keywords, the Levenshtein distance between the suggested and the original keywords, and the docs statistics of the suggested keyword.

If the first parameter is not a single word, but multiple, then:

  • CALL QSUGGEST will return suggestions only for the last word, ignoring the rest
  • CALL SUGGEST will return suggestions only for the first word

That’s the only difference between them. Several options are supported for customization:

Option | Description | Default
limit | Returns N top matches | 5
max_edits | Keeps only dictionary words whose Levenshtein distance is less than or equal to N | 4
result_stats | Provides Levenshtein distance and document count of the found words | 1 (enabled)
delta_len | Keeps only dictionary words whose length difference is less than N | 3
max_matches | Number of matches to keep | 25
reject | Rejected words are matches that are not better than those already in the match queue. They are put in a rejected queue that gets reset in case one actually can go into the match queue. This parameter defines the size of the rejected queue (as reject*max(max_matched,limit)). If the rejected queue is filled, the engine stops looking for potential matches | 4
result_line | Alternate mode to display the data by returning all suggestions, distances and docs, each in one row | 0
non_char | Do not skip dictionary words with non-alphabet symbols | 0 (skip such words)

To show how it works, let's create a table and add a few documents to it.

  1. create table products(title text) min_infix_len='2';
  2. insert into products values (0,'Crossbody Bag with Tassel'), (0,'microfiber sheet set'), (0,'Pet Hair Remover Glove');
Single word example

As you can see, we have a typo in "crossbUdy", which gets corrected to "crossbody". In addition to that, by default CALL SUGGEST/QSUGGEST return:

  • distance - the Levenshtein distance, i.e. how many edits it took to convert the given word into the suggestion
  • docs - the number of docs that contain the suggested word

To disable the display of these stats you can use the option 0 as result_stats (see also the sketch after the example below).

  • Example


  1. call suggest('crossbudy', 'products');

Response

  1. +-----------+----------+------+
  2. | suggest | distance | docs |
  3. +-----------+----------+------+
  4. | crossbody | 1 | 1 |
  5. +-----------+----------+------+
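The same call with the stats display disabled could look like this (a sketch; only the suggest column is expected in the output):

  call suggest('crossbudy', 'products', 0 as result_stats);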
CALL SUGGEST takes only the first word

If the first parameter is not a single word, but multiple, then CALL SUGGEST will return suggestions only for the first word.

  • Example


  1. call suggest('bagg with tasel', 'products');

Response

  1. +---------+----------+------+
  2. | suggest | distance | docs |
  3. +---------+----------+------+
  4. | bag | 1 | 1 |
  5. +---------+----------+------+
CALL QSUGGEST takes only the last word

If the first parameter is not a single word, but multiple, then CALL QSUGGEST will return suggestions only for the last word.

  • Example


  1. CALL QSUGGEST('bagg with tasel', 'products');

Response

  1. +---------+----------+------+
  2. | suggest | distance | docs |
  3. +---------+----------+------+
  4. | tassel | 1 | 1 |
  5. +---------+----------+------+
Different display mode

Using 1 as result_line in the options turns on an alternate display mode that returns all suggestions, distances and docs, each in its own row.

  • Example


  1. CALL QSUGGEST('bagg with tasel', 'products', 1 as result_line);

Response

  1. +----------+--------+
  2. | name | value |
  3. +----------+--------+
  4. | suggests | tassel |
  5. | distance | 1 |
  6. | docs | 1 |
  7. +----------+--------+
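Options can also be combined in a single call, for example (a sketch; the actual suggestions depend on the data in the table):

  CALL QSUGGEST('bagg with tasel', 'products', 3 as limit, 2 as max_edits, 1 as result_line);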

Interactive course

This interactive course demonstrates how it works on a web page and provides different examples.

Typical flow with Manticore and a database