Multi-queries

Multi-queries, or query batches, let you send multiple search queries to Manticore in one go (more formally, one network request).

👍 Why use multi-queries?

Generally, it all boils down to performance. First, by sending requests to Manticore in a batch instead of one by one, you always save a bit by doing fewer network round-trips. Second, and somewhat more importantly, sending queries in a batch enables Manticore to perform certain internal optimizations. If there are no batch optimizations to apply, the queries are simply processed one by one internally.

⛔ When not to use multi-queries?

Multi-queries require all the search queries in a batch to be independent, and sometimes they aren’t. That is, sometimes query B is based on query A’s results and can only be set up after executing query A. For instance, you might want to display results from a secondary index if and only if no results were found in a primary table. Or you might need to set the offset of the 2nd result set based on the number of matches in the 1st one. In such cases, you will have to use separate queries (or separate batches).
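
For illustration, here is a minimal sketch of such a dependent pair (the table names primary_products and secondary_products are hypothetical); the application runs the first query, checks whether it matched anything, and only then decides whether to send the second one, so the two cannot go into one batch:

  1. SELECT id FROM primary_products WHERE MATCH('wireless headphones') LIMIT 20;

  1. SELECT id FROM secondary_products WHERE MATCH('wireless headphones') LIMIT 20;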

You can run multiple search queries with SQL by just separating them with a semicolon. When Manticore receives a query formatted like that from a client, all possible inter-statement optimizations will be applied.

Multi-queries don’t support queries with FACET. The number of multi-queries in one batch shouldn’t exceed max_batch_queries.

  • SQL


  1. SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price DESC; SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price ASC

Multi-queries optimizations

There are two major optimizations to be aware of: common query optimization and common subtree optimization.

Common query optimization means that searchd will identify all those queries in a batch where only the sorting and group-by settings differ, and perform the search only once. For instance, if a batch consists of 3 queries, all of them for “ipod nano”, but the 1st query requests the top 10 results sorted by price, the 2nd groups by vendor ID and requests the top 5 vendors sorted by rating, and the 3rd requests the maximum price, the full-text search for “ipod nano” will only be performed once, and its results will be reused to build 3 different result sets.

Faceted search is a particularly important case that benefits from this optimization. Indeed, faceted searching can be implemented by running a number of queries: one to retrieve the search results themselves, and a few others with the same full-text query but different group-by settings to retrieve all the required groups of results (top-3 authors, top-5 vendors, etc.). As long as the full-text query and filtering settings stay the same, common query optimization will trigger and greatly improve performance.
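
For illustration, the “ipod nano” batch described above might look like this in SQL (the products table and its price, vendor_id and rating attributes are hypothetical); all three statements share the same full-text query and filters, so common query optimization can kick in:

  1. SELECT id, price FROM products WHERE MATCH('ipod nano') ORDER BY price DESC LIMIT 10;
  2. SELECT vendor_id, AVG(rating) rating FROM products WHERE MATCH('ipod nano') GROUP BY vendor_id ORDER BY rating DESC LIMIT 5;
  3. SELECT MAX(price) max_price FROM products WHERE MATCH('ipod nano');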

Common subtree optimization is even more interesting. It lets searchd exploit similarities between batched full-text queries. It identifies common full-text query parts (subtrees) in all queries, and caches them between queries. For instance, look at the following query batch:

  1. donald trump president
  2. donald trump barack obama john mccain
  3. donald trump speech

There’s a common two-word part donald trump that can be computed only once, then cached and shared across the queries. And common subtree optimization does just that. Per-query cache size is strictly controlled by subtree_docs_cache and subtree_hits_cache directives (so that caching all sixteen gazillions of documents that match “i am” does not exhaust the RAM and instantly kill your server).
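
As a rough SQL sketch (the news table is hypothetical), such a batch could be sent as:

  1. SELECT id FROM news WHERE MATCH('donald trump president') LIMIT 10;
  2. SELECT id FROM news WHERE MATCH('donald trump barack obama john mccain') LIMIT 10;
  3. SELECT id FROM news WHERE MATCH('donald trump speech') LIMIT 10;

The common donald trump part is then evaluated once and reused by all three queries, within the subtree_docs_cache and subtree_hits_cache limits.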

How can you tell whether the queries in the batch were actually optimized? If they were, the respective query log will have a “multiplier” field that specifies how many queries were processed together:

Note the “x3” field. It means that this query was optimized and processed in a sub-batch of 3 queries.

  • log


  1. [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/rel 747541 (0,20)] [lj] the
  2. [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
  3. [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the

For reference, this is how the regular log would look if the queries were not batched:

  • log


  1. [Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext/0/rel 747541 (0,20)] [lj] the
  2. [Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext/0/ext 747541 (0,20)] [lj] the
  3. [Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext/0/ext 747541 (0,20)] [lj] the

Note how the per-query time in the multi-query case improved by a factor of 1.5x to 2.3x, depending on the particular sorting mode.

Sub-selects

Manticore supports SELECT subqueries via SQL in the following format:

  1. SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y

The outer select allows only ORDER BY and LIMIT clauses. Sub-select queries currently have 2 use cases:

  1. We have a query with 2 ranking UDFs, one very fast and one slow, and we perform a full-text search with a big match result set. Without a sub-select the query would look like:

    1. SELECT id,slow_rank() as slow,fast_rank() as fast FROM index
    2. WHERE MATCH('some common query terms') ORDER BY fast DESC, slow DESC LIMIT 20
    3. OPTION max_matches=1000;

    With sub-selects the query can be rewritten as:

    1. SELECT * FROM
    2. (SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE
    3. MATCH('some common query terms')
    4. ORDER BY fast DESC LIMIT 100 OPTION max_matches=1000)
    5. ORDER BY slow DESC LIMIT 20;

    In the initial query the slow_rank() UDF is computed for the entire match result set. With SELECT sub-queries only fast_rank() is computed for the entire match result set, while slow_rank() is only computed for a limited set.

  2. The second case comes in handy for a large result set coming from a distributed table.

    For this query:

    1. SELECT * FROM my_dist_index WHERE some_conditions LIMIT 50000;

    If we have 20 nodes, each node can send back to the master up to 50K records, resulting in 20 x 50K = 1M records. However, as the master sends back only 50K (out of 1M), it might be good enough for us if the nodes send only their top 10K records each. With a sub-select we can rewrite the query as:

    1. SELECT * FROM
    2. (SELECT * FROM my_dist_index WHERE some_conditions LIMIT 10000)
    3. ORDER by some_attr LIMIT 50000;

    In this case, the nodes receive and execute only the inner query. This means the master will receive only 20 x 10K = 200K records. The master will take all the received records, reorder them by the OUTER clause and return the best 50K records. The sub-select helps reduce the traffic between the master and the nodes and also reduces the master’s computation time (as it processes only 200K records instead of 1M).

Grouping search results

It’s often useful to group search results to get per-group match counts or other aggregations. For instance, to draw a nice graph of how many matching blog posts there were per month, or to group web search results by site, or forum posts by author, etc.

Manticore supports grouping of search results by one or multiple columns and computed expressions. The results can:

  • be sorted inside a group
  • have more than one row returned per group
  • have groups filtered
  • have groups sorted
  • be aggregated with the help of aggregation functions
  • SQL
  • JSON


General syntax

  1. SELECT {* | SELECT_expr [, SELECT_expr ...]}
  2. ...
  3. GROUP BY {field_name | alias } [, ...]
  4. [HAVING where_condition]
  5. [WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
  6. ...
  7. SELECT_expr: { field_name | function_name(...) }
  8. where_condition: {aggregation expression alias | COUNT(*)}

The JSON query format currently supports simple grouping that can retrieve the aggregate values and their count(*).

  1. {
  2. "index": "<index_name>",
  3. "limit": 0,
  4. "aggs": {
  5. "<aggr_name>": {
  6. "terms": {
  7. "field": "<attribute>",
  8. "size": <int value>
  9. }
  10. }
  11. }
  12. }

The normal query output returns the result set without grouping and can be hidden with limit (or size). Each aggregation requires setting a size that defines how many groups will be returned.

Just Grouping

Grouping is very simple - just add “GROUP BY smth” to the end of your SELECT query. The something can be:

  • any non-full-text field from the table: integer, float, string, MVA (multi-value attribute)
  • or if you used an alias in the SELECT list - you can GROUP BY it too

You can omit any aggregation functions in the SELECT list and it will work too:

  • SQL


  1. SELECT release_year FROM films GROUP BY release_year LIMIT 5;

Response

  1. +--------------+
  2. | release_year |
  3. +--------------+
  4. | 2004 |
  5. | 2002 |
  6. | 2001 |
  7. | 2005 |
  8. | 2000 |
  9. +--------------+

But in most cases you want to get something aggregated for each group, for example:

  • COUNT(*) to just get the number of elements in each group
  • or AVG(field) to get an average value of the field in the group
  • SQL1
  • SQL2
  • JSON
  • PHP
  • Python
  • Javascript
  • Java


  1. SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;
  1. SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5;
  1. POST /search -d '
  2. {
  3. "index" : "films",
  4. "limit": 0,
  5. "aggs" :
  6. {
  7. "release_year" :
  8. {
  9. "terms" :
  10. {
  11. "field":"release_year",
  12. "size":100
  13. }
  14. }
  15. }
  16. }
  17. '
  1. $index->setName('films');
  2. $search = $index->search('');
  3. $search->limit(0);
  4. $search->facet('release_year','release_year',100);
  5. $results = $search->get();
  6. print_r($results->getFacets());
  1. res =searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})
  1. res = await searchApi.search({"index":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}});
  1. HashMap<String,Object> aggs = new HashMap<String,Object>(){{
  2. put("release_year", new HashMap<String,Object>(){{
  3. put("terms", new HashMap<String,Object>(){{
  4. put("field","release_year");
  5. put("size",100);
  6. }});
  7. }});
  8. }};
  9. searchRequest = new SearchRequest();
  10. searchRequest.setIndex("films");
  11. searchRequest.setLimit(0);
  12. query = new HashMap<String,Object>();
  13. query.put("match_all",null);
  14. searchRequest.setQuery(query);
  15. searchRequest.setAggs(aggs);
  16. searchResponse = searchApi.search(searchRequest);

Response

  1. +--------------+----------+
  2. | release_year | count(*) |
  3. +--------------+----------+
  4. | 2004 | 108 |
  5. | 2002 | 108 |
  6. | 2001 | 91 |
  7. | 2005 | 93 |
  8. | 2000 | 97 |
  9. +--------------+----------+
  1. +--------------+------------------+
  2. | release_year | avg(rental_rate) |
  3. +--------------+------------------+
  4. | 2004 | 2.78629661 |
  5. | 2002 | 3.08259249 |
  6. | 2001 | 3.09989142 |
  7. | 2005 | 2.90397978 |
  8. | 2000 | 3.17556739 |
  9. +--------------+------------------+
  1. {
  2. "took": 2,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 10000,
  6. "hits": [
  7. ]
  8. },
  9. "release_year": {
  10. "group_brand_id": {
  11. "buckets": [
  12. {
  13. "key": 2004,
  14. "doc_count": 108
  15. },
  16. {
  17. "key": 2002,
  18. "doc_count": 108
  19. },
  20. {
  21. "key": 2000,
  22. "doc_count": 97
  23. },
  24. {
  25. "key": 2005,
  26. "doc_count": 93
  27. },
  28. {
  29. "key": 2001,
  30. "doc_count": 91
  31. }
  32. ]
  33. }
  34. }
  35. }
  1. Array
  2. (
  3. [release_year] => Array
  4. (
  5. [buckets] => Array
  6. (
  7. [0] => Array
  8. (
  9. [key] => 2009
  10. [doc_count] => 99
  11. )
  12. [1] => Array
  13. (
  14. [key] => 2008
  15. [doc_count] => 102
  16. )
  17. [2] => Array
  18. (
  19. [key] => 2007
  20. [doc_count] => 93
  21. )
  22. [3] => Array
  23. (
  24. [key] => 2006
  25. [doc_count] => 103
  26. )
  27. [4] => Array
  28. (
  29. [key] => 2005
  30. [doc_count] => 93
  31. )
  32. [5] => Array
  33. (
  34. [key] => 2004
  35. [doc_count] => 108
  36. )
  37. [6] => Array
  38. (
  39. [key] => 2003
  40. [doc_count] => 106
  41. )
  42. [7] => Array
  43. (
  44. [key] => 2002
  45. [doc_count] => 108
  46. )
  47. [8] => Array
  48. (
  49. [key] => 2001
  50. [doc_count] => 91
  51. )
  52. [9] => Array
  53. (
  54. [key] => 2000
  55. [doc_count] => 97
  56. )
  57. )
  58. )
  59. )
  1. {'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
  2. u'key': 2009},
  3. {u'doc_count': 102,
  4. u'key': 2008},
  5. {u'doc_count': 93,
  6. u'key': 2007},
  7. {u'doc_count': 103,
  8. u'key': 2006},
  9. {u'doc_count': 93,
  10. u'key': 2005},
  11. {u'doc_count': 108,
  12. u'key': 2004},
  13. {u'doc_count': 106,
  14. u'key': 2003},
  15. {u'doc_count': 108,
  16. u'key': 2002},
  17. {u'doc_count': 91,
  18. u'key': 2001},
  19. {u'doc_count': 97,
  20. u'key': 2000}]}},
  21. 'hits': {'hits': [], 'max_score': None, 'total': 1000},
  22. 'profile': None,
  23. 'timed_out': False,
  24. 'took': 0}
  1. {"took":0,"timed_out":false,"aggregations":{"release_year":{"buckets":[{"key":2009,"doc_count":99},{"key":2008,"doc_count":102},{"key":2007,"doc_count":93},{"key":2006,"doc_count":103},{"key":2005,"doc_count":93},{"key":2004,"doc_count":108},{"key":2003,"doc_count":106},{"key":2002,"doc_count":108},{"key":2001,"doc_count":91},{"key":2000,"doc_count":97}]}},"hits":{"total":1000,"hits":[]}}
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
  5. hits: class SearchResponseHits {
  6. maxScore: null
  7. total: 1000
  8. hits: []
  9. }
  10. profile: null
  11. }
Sorting groups

By default the groups are not sorted, and the next thing you normally want to do is to order them by something, for example the field you are grouping by:

  • SQL


  1. SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;

Response

  1. +--------------+----------+
  2. | release_year | count(*) |
  3. +--------------+----------+
  4. | 2000 | 97 |
  5. | 2001 | 91 |
  6. | 2002 | 108 |
  7. | 2003 | 106 |
  8. | 2004 | 108 |
  9. +--------------+----------+

Or vice-versa - by the aggregation:

  • by count(*) to see the groups with the most elements first
  • by avg(rental_rate) to see the years with the highest average rental rate first. Note that in the example it’s done via an alias: avg(rental_rate) is first mapped to avg in the SELECT list and then we just do ORDER BY avg
  • SQL1
  • SQL2


  1. SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;
  1. SELECT release_year, AVG(rental_rate) avg FROM films GROUP BY release_year ORDER BY avg desc LIMIT 5;

Response

  1. +--------------+----------+
  2. | release_year | count(*) |
  3. +--------------+----------+
  4. | 2004 | 108 |
  5. | 2002 | 108 |
  6. | 2003 | 106 |
  7. | 2006 | 103 |
  8. | 2008 | 102 |
  9. +--------------+----------+
  1. +--------------+------------+
  2. | release_year | avg |
  3. +--------------+------------+
  4. | 2006 | 3.26184368 |
  5. | 2000 | 3.17556739 |
  6. | 2001 | 3.09989142 |
  7. | 2002 | 3.08259249 |
  8. | 2008 | 2.99000049 |
  9. +--------------+------------+
GROUP BY multiple fields at once

In some cases you might want to group not by a single field, but by multiple fields at once, for example the movie’s category and release year:

  • SQL


  1. SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;

Response

  1. +-------------+--------------+----------+
  2. | category_id | release_year | count(*) |
  3. +-------------+--------------+----------+
  4. | 1 | 2000 | 5 |
  5. | 1 | 2001 | 2 |
  6. | 1 | 2002 | 6 |
  7. | 1 | 2003 | 6 |
  8. | 1 | 2004 | 5 |
  9. | 1 | 2005 | 10 |
  10. | 1 | 2006 | 4 |
  11. | 1 | 2007 | 5 |
  12. | 1 | 2008 | 7 |
  13. | 1 | 2009 | 14 |
  14. | 2 | 2000 | 10 |
  15. | 2 | 2001 | 5 |
  16. | 2 | 2002 | 6 |
  17. | 2 | 2003 | 6 |
  18. | 2 | 2004 | 10 |
  19. | 2 | 2005 | 4 |
  20. | 2 | 2006 | 5 |
  21. | 2 | 2007 | 8 |
  22. | 2 | 2008 | 8 |
  23. | 2 | 2009 | 4 |
  24. +-------------+--------------+----------+
Give me N rows

Sometimes it’s useful to see not a single element per group, but multiple. This can be easily done with the help of GROUP N BY. For example, here we get 2 movies for each year rather than the single one that a simple GROUP BY release_year would return.

  • SQL


  1. SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;

Response

  1. +--------------+-----------------------------+
  2. | release_year | title |
  3. +--------------+-----------------------------+
  4. | 2009 | ALICE FANTASIA |
  5. | 2009 | ALIEN CENTER |
  6. | 2008 | AMADEUS HOLY |
  7. | 2008 | ANACONDA CONFESSIONS |
  8. | 2007 | ANGELS LIFE |
  9. | 2007 | ARACHNOPHOBIA ROLLERCOASTER |
  10. +--------------+-----------------------------+
Sorting inside a group

Another essential analytics demand is to sort elements within a group. For that there’s the WITHIN GROUP ORDER BY ... {ASC|DESC} clause. For example, let’s get the highest-rated film for each year. Note that it works in parallel with plain ORDER BY:

  • WITHIN GROUP ORDER BY sorts results inside a group
  • while just ORDER BY sorts the groups themselves

They work absolutely independently.

  • SQL


  1. SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;

Response

  1. +--------------+------------------+-------------+
  2. | release_year | title | rental_rate |
  3. +--------------+------------------+-------------+
  4. | 2009 | AMERICAN CIRCUS | 4.990000 |
  5. | 2008 | ANTHEM LUKE | 4.990000 |
  6. | 2007 | ATTACKS HATE | 4.990000 |
  7. | 2006 | ALADDIN CALENDAR | 4.990000 |
  8. | 2005 | AIRPLANE SIERRA | 4.990000 |
  9. +--------------+------------------+-------------+
Filter groups

HAVING is a useful clause to filter groups. While WHERE is applied before grouping, HAVING works on the groups. For example, let’s keep only those years for which the average rental rate of that year’s films was higher than 3. We get only 4 years:

  • SQL


  1. SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;

Response

  1. +--------------+------------+
  2. | release_year | avg |
  3. +--------------+------------+
  4. | 2002 | 3.08259249 |
  5. | 2001 | 3.09989142 |
  6. | 2000 | 3.17556739 |
  7. | 2006 | 3.26184368 |
  8. +--------------+------------+
GROUPBY()

There is a function GROUPBY() which returns the key of the current group. It’s useful in many cases, especially when you GROUP BY an MVA or a JSON value.

It can also be used in HAVING, for example to keep only the years 2000 and 2002.

Note that GROUPBY() is not recommended when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is a compound of the field values, it may not look the way you expect.

  • SQL


  1. SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);

Response

  1. +--------------+----------+
  2. | release_year | count(*) |
  3. +--------------+----------+
  4. | 2002 | 108 |
  5. | 2000 | 97 |
  6. +--------------+----------+
Grouping by MVA (multi-value attributes)

Manticore supports grouping by MVA. To show how it works, let’s create a table “shoes” with the MVA “sizes” and insert a few documents into it:

  1. create table shoes(title text, sizes multi);
  2. insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));

so we have:

  1. SELECT * FROM shoes;
  2. +---------------------+----------+---------+
  3. | id | sizes | title |
  4. +---------------------+----------+---------+
  5. | 1657851069130080265 | 40,41,42 | nike |
  6. | 1657851069130080266 | 41,43 | adidas |
  7. | 1657851069130080267 | 42,43 | reebook |
  8. +---------------------+----------+---------+

If we now GROUP BY “sizes”, it will process all our multi-value attributes and return an aggregation for each, in this case just the count:

  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java


  1. SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;
  1. POST /search -d '
  2. {
  3. "index" : "shoes",
  4. "limit": 0,
  5. "aggs" :
  6. {
  7. "sizes" :
  8. {
  9. "terms" :
  10. {
  11. "field":"sizes",
  12. "size":100
  13. }
  14. }
  15. }
  16. }
  17. '
  1. $index->setName('shoes');
  2. $search = $index->search('');
  3. $search->limit(0);
  4. $search->facet('sizes','sizes',100);
  5. $results = $search->get();
  6. print_r($results->getFacets());
  1. res =searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})
  1. res = await searchApi.search({"index":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}});
  1. HashMap<String,Object> aggs = new HashMap<String,Object>(){{
  2. put("release_year", new HashMap<String,Object>(){{
  3. put("terms", new HashMap<String,Object>(){{
  4. put("field","release_year");
  5. put("size",100);
  6. }});
  7. }});
  8. }};
  9. searchRequest = new SearchRequest();
  10. searchRequest.setIndex("shoes");
  11. searchRequest.setLimit(0);
  12. query = new HashMap<String,Object>();
  13. query.put("match_all",null);
  14. searchRequest.setQuery(query);
  15. searchRequest.setAggs(aggs);
  16. searchResponse = searchApi.search(searchRequest);

Response

  1. +------+----------+
  2. | gb | count(*) |
  3. +------+----------+
  4. | 40 | 1 |
  5. | 41 | 2 |
  6. | 42 | 2 |
  7. | 43 | 2 |
  8. +------+----------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 3,
  6. "hits": [
  7. ]
  8. },
  9. "aggregations": {
  10. "sizes": {
  11. "buckets": [
  12. {
  13. "key": 43,
  14. "doc_count": 2
  15. },
  16. {
  17. "key": 42,
  18. "doc_count": 2
  19. },
  20. {
  21. "key": 41,
  22. "doc_count": 2
  23. },
  24. {
  25. "key": 40,
  26. "doc_count": 1
  27. }
  28. ]
  29. }
  30. }
  31. }
  1. Array
  2. (
  3. [sizes] => Array
  4. (
  5. [buckets] => Array
  6. (
  7. [0] => Array
  8. (
  9. [key] => 43
  10. [doc_count] => 2
  11. )
  12. [1] => Array
  13. (
  14. [key] => 42
  15. [doc_count] => 2
  16. )
  17. [2] => Array
  18. (
  19. [key] => 41
  20. [doc_count] => 2
  21. )
  22. [3] => Array
  23. (
  24. [key] => 40
  25. [doc_count] => 1
  26. )
  27. )
  28. )
  29. )
  1. {'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
  2. {u'doc_count': 2, u'key': 42},
  3. {u'doc_count': 2, u'key': 41},
  4. {u'doc_count': 1, u'key': 40}]}},
  5. 'hits': {'hits': [], 'max_score': None, 'total': 3},
  6. 'profile': None,
  7. 'timed_out': False,
  8. 'took': 0}
  1. {"took":0,"timed_out":false,"aggregations":{"sizes":{"buckets":[{"key":43,"doc_count":2},{"key":42,"doc_count":2},{"key":41,"doc_count":2},{"key":40,"doc_count":1}]}},"hits":{"total":3,"hits":[]}}
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. aggregations: {sizes={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
  5. hits: class SearchResponseHits {
  6. maxScore: null
  7. total: 3
  8. hits: []
  9. }
  10. profile: null
  11. }
Grouping by a JSON node

If you have a field of type JSON, you can GROUP BY any node of it. To demonstrate this, let’s create a table “products” with a few documents, each having a color in the “meta” JSON field:

  1. create table products(title text, meta json);
  2. insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');

This gives us:

  1. SELECT * FROM products;
  2. +---------------------+-------------------+--------+
  3. | id | meta | title |
  4. +---------------------+-------------------+--------+
  5. | 1657851069130080268 | {"color":"red"} | nike |
  6. | 1657851069130080269 | {"color":"red"} | adidas |
  7. | 1657851069130080270 | {"color":"green"} | puma |
  8. +---------------------+-------------------+--------+

To group the products by color we can just GROUP BY meta.color and to show the corresponding group key in the SELECT list we can use GROUPBY():

  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java


  1. SELECT groupby() color, count(*) from products GROUP BY meta.color;
  1. POST /search -d '
  2. {
  3. "index" : "products",
  4. "limit": 0,
  5. "aggs" :
  6. {
  7. "color" :
  8. {
  9. "terms" :
  10. {
  11. "field":"meta.color",
  12. "size":100
  13. }
  14. }
  15. }
  16. }
  17. '
  1. $index->setName('products');
  2. $search = $index->search('');
  3. $search->limit(0);
  4. $search->facet('meta.color','color',100);
  5. $results = $search->get();
  6. print_r($results->getFacets());
  1. res =searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}})
  1. res = await searchApi.search({"index":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}});
  1. HashMap<String,Object> aggs = new HashMap<String,Object>(){{
  2. put("color", new HashMap<String,Object>(){{
  3. put("terms", new HashMap<String,Object>(){{
  4. put("field","meta.color");
  5. put("size",100);
  6. }});
  7. }});
  8. }};
  9. searchRequest = new SearchRequest();
  10. searchRequest.setIndex("products");
  11. searchRequest.setLimit(0);
  12. query = new HashMap<String,Object>();
  13. query.put("match_all",null);
  14. searchRequest.setQuery(query);
  15. searchRequest.setAggs(aggs);
  16. searchResponse = searchApi.search(searchRequest);

Response

  1. +-------+----------+
  2. | color | count(*) |
  3. +-------+----------+
  4. | red | 2 |
  5. | green | 1 |
  6. +-------+----------+
  1. {
  2. "took": 0,
  3. "timed_out": false,
  4. "hits": {
  5. "total": 3,
  6. "hits": [
  7. ]
  8. },
  9. "aggregations": {
  10. "color": {
  11. "buckets": [
  12. {
  13. "key": "green",
  14. "doc_count": 1
  15. },
  16. {
  17. "key": "red",
  18. "doc_count": 2
  19. }
  20. ]
  21. }
  22. }
  23. }
  1. Array
  2. (
  3. [color] => Array
  4. (
  5. [buckets] => Array
  6. (
  7. [0] => Array
  8. (
  9. [key] => green
  10. [doc_count] => 1
  11. )
  12. [1] => Array
  13. (
  14. [key] => red
  15. [doc_count] => 2
  16. )
  17. )
  18. )
  19. )
  1. {'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
  2. u'key': u'green'},
  3. {u'doc_count': 2, u'key': u'red'}]}},
  4. 'hits': {'hits': [], 'max_score': None, 'total': 3},
  5. 'profile': None,
  6. 'timed_out': False,
  7. 'took': 0}
  1. {"took":0,"timed_out":false,"aggregations":{"color":{"buckets":[{"key":"green","doc_count":1},{"key":"red","doc_count":2}]}},"hits":{"total":3,"hits":[]}}
  1. class SearchResponse {
  2. took: 0
  3. timedOut: false
  4. aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
  5. hits: class SearchResponseHits {
  6. maxScore: null
  7. total: 3
  8. hits: []
  9. }
  10. profile: null
  11. }

Aggregation functions

Besides COUNT(*), which returns the number of elements in each group, you can use various other aggregation functions:

COUNT(DISTINCT field)

While COUNT(*) returns the number of all elements in the group, COUNT(DISTINCT field) returns the number of different values of the field in the group, which may be quite different from the total count: you can have 100 elements in the group, but all with the same value of some field. COUNT(DISTINCT field) helps to figure that out. To demonstrate it, let’s create a table “students” with the student’s name, age and major:

  1. CREATE TABLE students(name text, age int, major string);
  2. INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');

so we have:

  1. MySQL [(none)]> SELECT * from students;
  2. +---------------------+------+----------+---------+
  3. | id | age | major | name |
  4. +---------------------+------+----------+---------+
  5. | 1657851069130080271 | 21 | arts | John |
  6. | 1657851069130080272 | 22 | business | William |
  7. | 1657851069130080273 | 21 | cs | Richard |
  8. | 1657851069130080274 | 22 | cs | Rebecca |
  9. | 1657851069130080275 | 21 | arts | Monica |
  10. +---------------------+------+----------+---------+

In the example below you can see that if we GROUP BY major and show both COUNT(*) and COUNT(DISTINCT age), it becomes clear that there are 2 students who chose the major “cs” and 2 unique ages, while for the major “arts” there are also 2 students, but only one unique age.

There can be at most one COUNT(DISTINCT) per query.

COUNT(DISTINCT) against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema (identical set/order of fields, though the tokenization settings may differ).

  • SQL


  1. SELECT major, count(*), count(distinct age) FROM students GROUP BY major;

Response

  1. +----------+----------+---------------------+
  2. | major | count(*) | count(distinct age) |
  3. +----------+----------+---------------------+
  4. | arts | 2 | 1 |
  5. | business | 1 | 1 |
  6. | cs | 2 | 2 |
  7. +----------+----------+---------------------+
GROUP_CONCAT(field)

Often you want to understand the contents of each group better. You can use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT() enriches your grouping by concatenating the values of some field in the group. Let’s take the previous example and improve it by getting all the ages in each group.

GROUP_CONCAT(field) returns the values as a comma-separated list.

  • SQL


  1. SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major

Response

  1. +----------+----------+---------------------+-------------------+
  2. | major | count(*) | count(distinct age) | group_concat(age) |
  3. +----------+----------+---------------------+-------------------+
  4. | arts | 2 | 1 | 21,21 |
  5. | business | 1 | 1 | 22 |
  6. | cs | 2 | 2 | 21,22 |
  7. +----------+----------+---------------------+-------------------+
SUM(), MIN(), MAX(), AVG()

And of course you can get the sum, average, minimum and maximum values in the group.

  • SQL


  1. SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;

Response

  1. +------+------------+----------+----------+------------+
  2. | year | sum | min | max | avg |
  3. +------+------------+----------+----------+------------+
  4. | 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
  5. | 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
  6. | 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
  7. | 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
  8. | 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
  9. +------+------------+----------+----------+------------+

Grouping accuracy

Grouping is done in fixed memory, which depends on the max_matches setting. If max_matches allows for storage of all found groups, the results will be 100% accurate. However, if the value of max_matches is lower, the results will be less accurate.

When parallel processing is involved, it can become more complicated. When pseudo_sharding is enabled and/or when using an RT index with several disk chunks, each chunk or pseudo shard gets a result set that is no larger than max_matches. This can lead to inaccuracies in aggregates and group counts when the result sets from different threads are merged. To fix this, either a larger max_matches value or disabling parallel processing can be used.

Manticore will try to increase max_matches up to max_matches_increase_threshold if it detects that groupby may return inaccurate results. Detection is based on the number of unique values of the groupby attribute, which is retrieved from secondary indexes (if present).

To ensure accurate aggregates and/or group counts when using RT indexes or pseudo_sharding, accurate_aggregation can be enabled. This will try to increase max_matches up to the threshold, and if the threshold is not high enough, Manticore will disable parallel processing for the query.

  • SQL


  1. MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
  2. +------+----------+
  3. | year | count(*) |
  4. +------+----------+
  5. | 2004 | 108 |
  6. | 2002 | 108 |
  7. | 2001 | 91 |
  8. | 2005 | 93 |
  9. | 2000 | 97 |
  10. +------+----------+
  11. MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
  12. +------+----------+
  13. | year | count(*) |
  14. +------+----------+
  15. | 2004 | 76 |
  16. +------+----------+
  17. MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
  18. +------+----------+
  19. | year | count(*) |
  20. +------+----------+
  21. | 2004 | 76 |
  22. | 2002 | 74 |
  23. +------+----------+
  24. MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
  25. +------+----------+
  26. | year | count(*) |
  27. +------+----------+
  28. | 2004 | 108 |
  29. | 2002 | 108 |
  30. | 2001 | 91 |
  31. +------+----------+
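
As a minimal sketch, assuming accurate_aggregation can be set as a per-query OPTION in the same way as max_matches above, requesting accurate group counts for the same query might look like this:

  1. SELECT release_year year, count(*) FROM films GROUP BY year LIMIT 5 OPTION accurate_aggregation=1;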