Tutorial: Querying data

This tutorial will demonstrate how to query data in Apache Druid, with examples for Druid SQL and Druid’s native query format.

The tutorial assumes that you’ve already completed one of the 4 ingestion tutorials, as we will be querying the sample Wikipedia edits data.

Druid queries are sent over HTTP. The Druid console includes a view to issue queries to Druid and nicely format the results.

Druid SQL queries

Druid supports a dialect of SQL for querying.

This query retrieves the 10 Wikipedia pages with the most page edits on 2015-09-12.

  1. SELECT page, COUNT(*) AS Edits
  2. FROM wikipedia
  3. WHERE TIMESTAMP '2015-09-12 00:00:00' <= "__time" AND "__time" < TIMESTAMP '2015-09-13 00:00:00'
  4. GROUP BY page
  5. ORDER BY Edits DESC
  6. LIMIT 10

Let’s look at the different ways to issue this query.

Query SQL via the console

You can issue the above query from the console.

Query autocomplete

The console query view provides autocomplete functionality with inline documentation.

Query options

You can also configure extra context flags to be sent with the query from the ... options menu.

Note that the console will (by default) wrap your SQL queries in a limit where appropriate so that queries such as SELECT * FROM wikipedia can complete. You can turn off this behavior from the Smart query limit toggle.

Query actions

The query view provides contextual actions that can write and modify the query for you.

Query SQL via dsql

For convenience, the Druid package includes a SQL command-line client, located at bin/dsql from the Druid package root.

Let’s now run bin/dsql; you should see the following prompt:

  1. Welcome to dsql, the command-line client for Druid SQL.
  2. Type "\h" for help.
  3. dsql>

To submit the query, paste it to the dsql prompt and press enter:

  1. dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
  2. ┌──────────────────────────────────────────────────────────┬───────┐
  3. page Edits
  4. ├──────────────────────────────────────────────────────────┼───────┤
  5. Wikipedia:Vandalismusmeldung 33
  6. User:Cyde/List of candidates for speedy deletion/Subpage 28
  7. Jeremy Corbyn 27
  8. Wikipedia:Administrators' noticeboard/Incidents │ 21 │
  9. │ Flavia Pennetta │ 20 │
  10. │ Total Drama Presents: The Ridonculous Race │ 18 │
  11. │ User talk:Dudeperson176123 │ 18 │
  12. │ Wikipédia:Le Bistro/12 septembre 2015 │ 18 │
  13. │ Wikipedia:In the news/Candidates │ 17 │
  14. │ Wikipedia:Requests for page protection │ 17 │
  15. └──────────────────────────────────────────────────────────┴───────┘
  16. Retrieved 10 rows in 0.06s.

Query SQL over HTTP

The SQL queries are submitted as JSON over HTTP.

The tutorial package includes an example file that contains the SQL query shown above at quickstart/tutorial/wikipedia-top-pages-sql.json. Let’s submit that query to the Druid Broker:

  1. curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8888/druid/v2/sql

The following results should be returned:

  1. [
  2. {
  3. "page": "Wikipedia:Vandalismusmeldung",
  4. "Edits": 33
  5. },
  6. {
  7. "page": "User:Cyde/List of candidates for speedy deletion/Subpage",
  8. "Edits": 28
  9. },
  10. {
  11. "page": "Jeremy Corbyn",
  12. "Edits": 27
  13. },
  14. {
  15. "page": "Wikipedia:Administrators' noticeboard/Incidents",
  16. "Edits": 21
  17. },
  18. {
  19. "page": "Flavia Pennetta",
  20. "Edits": 20
  21. },
  22. {
  23. "page": "Total Drama Presents: The Ridonculous Race",
  24. "Edits": 18
  25. },
  26. {
  27. "page": "User talk:Dudeperson176123",
  28. "Edits": 18
  29. },
  30. {
  31. "page": "Wikipédia:Le Bistro/12 septembre 2015",
  32. "Edits": 18
  33. },
  34. {
  35. "page": "Wikipedia:In the news/Candidates",
  36. "Edits": 17
  37. },
  38. {
  39. "page": "Wikipedia:Requests for page protection",
  40. "Edits": 17
  41. }
  42. ]

More Druid SQL examples

Here is a collection of queries to try out:

Query over time

  1. SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted
  2. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
  3. GROUP BY 1

Query example

General group by

  1. SELECT channel, page, SUM(added)
  2. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
  3. GROUP BY channel, page
  4. ORDER BY SUM(added) DESC

Query example

Select raw data

  1. SELECT user, page
  2. FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00'
  3. LIMIT 5

Query example

Explain query plan

Druid SQL has the ability to explain the query plan for a given query. In the console this functionality is accessible from the ... button.

Explain query

If you are querying in other ways you can get the plan by prepending EXPLAIN PLAN FOR to a Druid SQL query.

Using a query from an example above:

EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;

  1. dsql> EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10;
  2. ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  3. PLAN
  4. ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
  5. DruidQueryRel(query=[{"queryType":"topN","dataSource":{"type":"table","name":"wikipedia"},"virtualColumns":[],"dimension":{"type":"default","dimension":"page","outputName":"d0","outputType":"STRING"},"metric":{"type":"numeric","metric":"a0"},"threshold":10,"intervals":{"type":"intervals","intervals":["2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.001Z"]},"filter":null,"granularity":{"type":"all"},"aggregations":[{"type":"count","name":"a0"}],"postAggregations":[],"context":{},"descending":false}], signature=[{d0:STRING, a0:LONG}])
  6. └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
  7. Retrieved 1 row in 0.03s.

Native JSON queries

Druid’s native query format is expressed in JSON.

Native query via the console

You can issue native Druid queries from the console’s Query view.

Here is a query that retrieves the 10 Wikipedia pages with the most page edits on 2015-09-12.

  1. {
  2. "queryType" : "topN",
  3. "dataSource" : "wikipedia",
  4. "intervals" : ["2015-09-12/2015-09-13"],
  5. "granularity" : "all",
  6. "dimension" : "page",
  7. "metric" : "count",
  8. "threshold" : 10,
  9. "aggregations" : [
  10. {
  11. "type" : "count",
  12. "name" : "count"
  13. }
  14. ]
  15. }

Simply paste it into the console to switch the editor into JSON mode.

Native query

Native queries over HTTP

We have included a sample native TopN query under quickstart/tutorial/wikipedia-top-pages.json:

Let’s submit this query to Druid:

  1. curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages.json http://localhost:8888/druid/v2?pretty

You should see the following query results:

  1. [ {
  2. "timestamp" : "2015-09-12T00:46:58.771Z",
  3. "result" : [ {
  4. "count" : 33,
  5. "page" : "Wikipedia:Vandalismusmeldung"
  6. }, {
  7. "count" : 28,
  8. "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  9. }, {
  10. "count" : 27,
  11. "page" : "Jeremy Corbyn"
  12. }, {
  13. "count" : 21,
  14. "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  15. }, {
  16. "count" : 20,
  17. "page" : "Flavia Pennetta"
  18. }, {
  19. "count" : 18,
  20. "page" : "Total Drama Presents: The Ridonculous Race"
  21. }, {
  22. "count" : 18,
  23. "page" : "User talk:Dudeperson176123"
  24. }, {
  25. "count" : 18,
  26. "page" : "Wikipédia:Le Bistro/12 septembre 2015"
  27. }, {
  28. "count" : 17,
  29. "page" : "Wikipedia:In the news/Candidates"
  30. }, {
  31. "count" : 17,
  32. "page" : "Wikipedia:Requests for page protection"
  33. } ]
  34. } ]

Further reading

The Queries documentation has more information on Druid’s native JSON queries.

The Druid SQL documentation has more information on using Druid SQL queries.