Monitoring and alarming

This document introduces Doris’s monitoring items and explains how to collect and display them. How to configure alarms is still to be written (TODO).

Dashboard template: click to download

Note: For versions earlier than 0.9.0, use revision 1. For version 0.9.x, use revision 2. For version 0.10.x, use revision 3.

Dashboard templates are updated from time to time. The way to update the template is shown in the last section.

Contributions of better dashboards are welcome.

Components

Doris uses [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) to collect and display monitoring items.

Monitoring and alarming - Figure 2

  1. Prometheus

    Prometheus is an open source system monitoring and alarm suite. It collects monitored items by Pull or Push and stores them in its own time series database. Its rich multi-dimensional query language then serves the different data display needs of users.

  2. Grafana

    Grafana is an open source data analysis and display platform. It supports multiple mainstream time series data sources, including Prometheus. It obtains the data to display by running the corresponding query statements against the data source, and its flexible, configurable dashboards present that data to users as graphs.

Note: This document only describes one way to collect and display Doris monitoring data using Prometheus and Grafana. Doris does not develop or maintain these components. For more details on them, refer to their official documentation.

Monitoring data

Doris’s monitoring data is exposed through the HTTP interfaces of Frontend and Backend. The data is presented as key-value text, and each Key may be further distinguished by different Labels. Once Doris has been deployed, the monitoring data of a node can be viewed in a browser through the following interfaces:

  • Frontend: fe_host:fe_http_port/metrics
  • Backend: be_host:be_web_server_port/metrics
  • Broker: Not available for now
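The same interfaces can also be read from a script instead of a browser. The following is a minimal sketch using only the Python standard library; the helper names `metrics_url` and `fetch_metrics` are illustrative, and the host and port in the usage comment are placeholders for your own `fe_host` and `fe_http_port`.

```python
from urllib.request import urlopen


def metrics_url(host: str, http_port: int) -> str:
    """Build the metrics URL of one Doris node (Frontend or Backend)."""
    return f"http://{host}:{http_port}/metrics"


def fetch_metrics(host: str, http_port: int, timeout: float = 5.0) -> str:
    """Return the raw key-value text served by the node's metrics interface."""
    with urlopen(metrics_url(host, http_port), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (placeholder host and port, requires a reachable node):
#   text = fetch_metrics("fe_host1", 8030)
#   print(text[:500])
```

A failed connection raises `urllib.error.URLError`, which is a quick way to tell an unreachable node from one that is running but returning unexpected data.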

Each interface returns output like the following (a partial list of FE monitoring items):

```
# HELP jvm_heap_size_bytes jvm heap stat
# TYPE jvm_heap_size_bytes gauge
jvm_heap_size_bytes{type="max"} 41661235200
jvm_heap_size_bytes{type="committed"} 19785285632
jvm_heap_size_bytes{type="used"} 10113221064
# HELP jvm_non_heap_size_bytes jvm non heap stat
# TYPE jvm_non_heap_size_bytes gauge
jvm_non_heap_size_bytes{type="committed"} 105295872
jvm_non_heap_size_bytes{type="used"} 103184784
# HELP jvm_young_size_bytes jvm young mem pool stat
# TYPE jvm_young_size_bytes gauge
jvm_young_size_bytes{type="used"} 6505306808
jvm_young_size_bytes{type="peak_used"} 10308026368
jvm_young_size_bytes{type="max"} 10308026368
# HELP jvm_old_size_bytes jvm old mem pool stat
# TYPE jvm_old_size_bytes gauge
jvm_old_size_bytes{type="used"} 3522435544
jvm_old_size_bytes{type="peak_used"} 6561017832
jvm_old_size_bytes{type="max"} 30064771072
# HELP jvm_direct_buffer_pool_size_bytes jvm direct buffer pool stat
# TYPE jvm_direct_buffer_pool_size_bytes gauge
jvm_direct_buffer_pool_size_bytes{type="count"} 91
jvm_direct_buffer_pool_size_bytes{type="used"} 226135222
jvm_direct_buffer_pool_size_bytes{type="capacity"} 226135221
# HELP jvm_young_gc jvm young gc stat
# TYPE jvm_young_gc gauge
jvm_young_gc{type="count"} 2186
jvm_young_gc{type="time"} 93650
# HELP jvm_old_gc jvm old gc stat
# TYPE jvm_old_gc gauge
jvm_old_gc{type="count"} 21
jvm_old_gc{type="time"} 58268
# HELP jvm_thread jvm thread stat
# TYPE jvm_thread gauge
jvm_thread{type="count"} 767
jvm_thread{type="peak_count"} 831
...
```

This monitoring data is presented in [Prometheus Format](https://prometheus.io/docs/practices/naming/). We take one of these monitoring items as an example:

    # HELP jvm_heap_size_bytes jvm heap stat
    # TYPE jvm_heap_size_bytes gauge
    jvm_heap_size_bytes{type="max"} 41661235200
    jvm_heap_size_bytes{type="committed"} 19785285632
    jvm_heap_size_bytes{type="used"} 10113221064

  1. Lines beginning with “#” are comment lines. HELP is the description of the monitored item. TYPE is the data type of the monitored item; in this example it is Gauge, a scalar value. Other data types include Counter and Histogram. For details, see the [Prometheus Official Document](https://prometheus.io/docs/practices/instrumentation/#counter-vs.-gauge,-summary-vs.-histogram).
  2. jvm_heap_size_bytes is the name of the monitored item (Key). type="max" is a Label named type with the value max. A monitoring item can have multiple Labels.
  3. The final number, such as 41661235200, is the monitored value.
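The three parts described above (name, Labels, value) can be pulled apart mechanically. The following is a simplified sketch, not a complete parser for the Prometheus text format: the function name `parse_metrics` is illustrative, and it assumes label values contain no escaped quotes and that lines carry no trailing timestamp.

```python
import re

# A sample line: a metric name, an optional {label="value",...} part, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE), blank lines, and lines that do not
    look like a sample (such as an elided "...") are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXAMPLE = """\
# HELP jvm_heap_size_bytes jvm heap stat
# TYPE jvm_heap_size_bytes gauge
jvm_heap_size_bytes{type="max"} 41661235200
jvm_heap_size_bytes{type="committed"} 19785285632
jvm_heap_size_bytes{type="used"} 10113221064
"""

# Index the heap samples by their "type" Label.
heap = {labels["type"]: value for _, labels, value in parse_metrics(EXAMPLE)}
heap_used_ratio = heap["used"] / heap["max"]
```

Indexing by Label, as in the last lines, is the same idea a dashboard query uses when it selects one series of a monitoring item by its Label value.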

Monitoring Architecture

The entire monitoring architecture is shown in the following figure:

Monitoring and alarming - Figure 3

  1. The yellow part is the Prometheus-related components. Prometheus Server is the main Prometheus process. At present, Prometheus pulls from the monitoring interface of each Doris node and stores the time series data in its time series database, TSDB (TSDB is built into the Prometheus process and does not need to be deployed separately). Prometheus also supports a [Push Gateway](https://github.com/prometheus/pushgateway): the monitored system pushes its data to the Push Gateway, and Prometheus Server then pulls the data from the Push Gateway.
  2. [Alert Manager](https://github.com/prometheus/alertmanager) is the Prometheus alarm component. It must be deployed separately (no deployment guide is provided here yet, but it can be set up by following the official documentation). Through Alert Manager, users can configure alarm strategies and receive alarms by email, SMS, and other channels.
  3. The green part is the Grafana-related components. Grafana Server is the main Grafana process. After startup, users configure Grafana through its web pages, including data source settings, user settings, and Dashboard drawing. This is also where end users view monitoring data.

Start building

Please start building the monitoring system after you have completed the deployment of Doris.

Prometheus

  1. Download the latest version of Prometheus from the [Prometheus Website](https://prometheus.io/download/). Here we take version 2.3.2-linux-amd64 as an example.

  2. Unzip the downloaded tar file on the machine that is ready to run the monitoring service.

  3. Open the configuration file prometheus.yml. Here we provide an example configuration and explain it (the file is in YAML format, so pay attention to consistent indentation and spaces):

    Here we use the simplest approach, a static configuration file. Prometheus also supports a variety of [service discovery](https://prometheus.io/docs/prometheus/latest/configuration/configuration/) mechanisms, which can dynamically detect the addition and removal of nodes.

    # my global config
    global:
      scrape_interval: 15s # Global scrape interval; the default is 1m, set to 15s here
      evaluation_interval: 15s # Global rule evaluation interval; the default is 1m, set to 15s here

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'PALO_CLUSTER' # Each Doris cluster is one job. The job name serves as the name of the Doris cluster in the monitoring system.
        metrics_path: '/metrics' # The RESTful API path that serves the monitoring items. Combined with host:port in the targets below, Prometheus collects monitoring items from host:port/metrics_path.
        static_configs: # The target addresses of FE and BE. All FEs and all BEs are written into their respective groups.
          - targets: ['fe_host1:8030', 'fe_host2:8030', 'fe_host3:8030']
            labels:
              group: fe # The fe group, which contains three Frontends
          - targets: ['be_host1:8040', 'be_host2:8040', 'be_host3:8040']
            labels:
              group: be # The be group, which contains three Backends
      - job_name: 'PALO_CLUSTER_2' # One Prometheus can monitor multiple Doris clusters. This starts the configuration of another Doris cluster; it follows the same pattern as above.
        metrics_path: '/metrics'
        static_configs:
          - targets: ['fe_host1:8030', 'fe_host2:8030', 'fe_host3:8030']
            labels:
              group: fe
          - targets: ['be_host1:8040', 'be_host2:8040', 'be_host3:8040']
            labels:
              group: be
  4. Start Prometheus

    Start Prometheus with the following command:

    nohup ./prometheus --web.listen-address="0.0.0.0:8181" &

    This command will run Prometheus in the background and specify its Web port as 8181. After startup, data is collected and stored in the data directory.

  5. Stop Prometheus

    At present, there is no formal stop procedure; terminate the process directly with kill -9. Alternatively, Prometheus can be registered as a system service and then started and stopped as a service.

  6. Access Prometheus

    Prometheus provides a web interface, which you can reach by opening port 8181 in a browser. In the navigation bar, click Status -> Targets to see all monitored nodes, grouped by Job. Normally, all nodes should be UP, indicating that data collection is working. Click on an Endpoint to see its current monitoring values. If a node is not UP, first access Doris’s metrics interface (see the Monitoring data section above) to check whether it is reachable, or consult the Prometheus documentation to troubleshoot.

  7. A simple Prometheus setup is now built and configured. For more advanced usage, see the [Official Documents](https://prometheus.io/docs/introduction/overview/).
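The Status -> Targets check described in step 6 can also be scripted. Prometheus 2.x exposes the same information over its HTTP API at `/api/v1/targets`; the sketch below assumes that endpoint is available in your Prometheus version and that Prometheus listens on port 8181 as configured above. The function names are illustrative.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return (job, group, scrape URL, health) for every target that is not up.

    `payload` is the decoded JSON body returned by /api/v1/targets.
    """
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append((
                labels.get("job"),
                labels.get("group"),
                target.get("scrapeUrl"),
                target.get("health"),
            ))
    return problems


def fetch_targets(prometheus_url: str, timeout: float = 5.0) -> dict:
    """Fetch the target list from a running Prometheus server."""
    with urlopen(f"{prometheus_url}/api/v1/targets", timeout=timeout) as response:
        return json.load(response)


# Usage (placeholder host, requires a running Prometheus):
#   for problem in unhealthy_targets(fetch_targets("http://host:8181")):
#       print(problem)
```

The `job` and `group` values come from the `job_name` and `labels` entries in the configuration file, so a non-empty result points directly at the cluster and node type that need attention.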

Grafana

  1. Download the latest version of Grafana from [Grafana’s official website](https://grafana.com/grafana/download). Here we take version 5.2.1.linux-amd64 as an example.

  2. Unzip the downloaded tar file on the machine that is ready to run the monitoring service.

  3. Open the configuration file conf/defaults.ini. Here we list only the configuration items that need to be changed; all other items can keep their default values.

    # Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
    data = data
    # Directory where grafana can store logs
    logs = data/log
    # Protocol (http, https, socket)
    protocol = http
    # The ip address to bind to, empty will bind to all interfaces
    http_addr =
    # The http port to use
    http_port = 8182
  4. Start Grafana

    Start Grafana with the following command:

    nohup ./bin/grafana-server &

    This command runs Grafana in the background, and the access port is 8182 configured above.

  5. Stop Grafana

    At present, there is no formal stop procedure; terminate the process directly with kill -9. Alternatively, Grafana can be registered as a system service and then started and stopped as a service.

  6. Access Grafana

    Open port 8182 in a browser to access the Grafana page. The default username and password are both admin.

  7. Configure Grafana

    On first login, set up the data source as prompted. Our data source here is Prometheus, which was configured in the previous step.

    The Setting page of the data source configuration is described as follows:

    1. Name: Name of the data source, customized, such as doris_monitor_data_source
    2. Type: Select Prometheus
    3. URL: Fill in the web address of Prometheus, such as http://host:8181
    4. Access: Choose Server mode, in which Prometheus is accessed through the server where the Grafana process runs.
    5. The other options can keep their default values.
    6. Click Save & Test at the bottom. If "Data source is working" appears, the data source is available.
    7. After confirming that the data source is available, click the + in the left navigation bar to start adding a Dashboard. We have prepared a Doris dashboard template (see the beginning of this document). After downloading it, click New dashboard -> Import dashboard -> Upload .json File to import the downloaded JSON file.
    8. After importing, you can name the Dashboard; the default name is Doris Overview. You also need to select the data source; choose the doris_monitor_data_source created earlier.
    9. Click Import to complete the import. You can then see the Doris dashboard.
  8. A simple Grafana setup is now built and configured. For more advanced usage, see the [Official Documents](http://docs.grafana.org/).
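The data source settings from step 7 can also be submitted through Grafana's HTTP API instead of the web page. The following is a sketch under some assumptions: it uses the `/api/datasources` endpoint with basic authentication, and it maps the Server access mode to the API value `proxy`. The function names are illustrative, and the credentials shown in the usage comment are the defaults mentioned above.

```python
import base64
import json
from urllib.request import Request, urlopen


def datasource_payload(name: str, prometheus_url: str) -> dict:
    """Build the request body that mirrors the Name, Type, URL and Access fields."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # "proxy" corresponds to Server mode in the web page
    }


def create_datasource(grafana_url, user, password, payload, timeout=5.0):
    """POST the payload to Grafana and return the decoded JSON response."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (placeholder hosts, requires a running Grafana):
#   body = datasource_payload("doris_monitor_data_source", "http://host:8181")
#   create_datasource("http://host:8182", "admin", "admin", body)
```

Scripting this step is mainly useful when the monitoring stack is rebuilt often; for a one-time setup the web page is simpler.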

Dashboard

Here we briefly introduce the Doris Dashboard. Its content may change as versions are upgraded, so this description is not guaranteed to match the latest Dashboard.

  1. Top Bar

    Monitoring and alarming - Figure 4

    • The upper left corner is the name of Dashboard.
    • The upper right corner shows the current monitoring time range. You can choose a different time range from the drop-down menu, and you can also set an interval for automatic page refresh.
    • Cluster name: Each job name in the Prometheus configuration file represents a Doris cluster. Select a different cluster, and the chart below shows the monitoring information for the corresponding cluster.
    • fe_master: The Master Frontend node corresponding to the cluster.
    • fe_instance: All Frontend nodes corresponding to the cluster. Select a different Frontend, and the chart below shows the monitoring information for the Frontend.
    • be_instance: All Backend nodes corresponding to the cluster. Select a different Backend, and the chart below shows the monitoring information for the Backend.
    • Interval: Some charts show rate-related monitoring items. Here you can choose the interval over which samples are taken and the rate is calculated (Note: a 15s interval may leave some charts unable to display).
  2. Rows

    Monitoring and alarming - Figure 5

    In Grafana, a Row is a set of graphs. As shown in the figure above, Overview and Cluster Overview are two different Rows. A Row can be collapsed by clicking on it. The Dashboard currently has the following Rows (updated continuously):

    1. Overview: A summary display of all Doris clusters.
    2. Cluster Overview: A summary display of the selected cluster.
    3. Query Statistic: Query-related monitoring of the selected cluster.
    4. FE JVM: JVM monitoring of the selected Frontend.
    5. BE: A summary display of the Backends of the selected cluster.
    6. BE Task: Task information of the Backends of the selected cluster.
  3. Charts

    Monitoring and alarming - Figure 6

    A typical chart is divided into the following parts:

    1. Hover the mouse over the i icon in the upper left corner to see the description of the chart.
    2. Click an entry in the legend below the chart to view that monitoring item alone. Click it again to display all items.
    3. Drag within the chart to select a time range.
    4. The selected cluster name is displayed in [] in the title.
    5. Some values correspond to the Y-axis on the left and some to the Y-axis on the right; the latter are marked by -right at the end of the legend entry.
    6. Click the name of the chart -> Edit to edit the chart.

Dashboard Update

  1. Click + in the left navigation bar of Grafana, then click Dashboard.
  2. Click New dashboard in the upper left corner; Import dashboard appears on the right.
  3. Click Upload .json File and select the latest template file.
  4. Select the data source.
  5. Click Import (Overwrite) to complete the template update.
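For repeated updates, the overwrite step can be approximated with Grafana's HTTP API, which accepts a dashboard at `/api/dashboards/db` together with an `overwrite` flag. The sketch below only builds the request body. It rests on assumptions that should be checked against the actual template file: that the template declares its data source in an `__inputs` list and refers to it as `${NAME}`, which is what the import page resolves when you select the data source. The function names and the tiny template in the example are hypothetical.

```python
import json


def resolve_datasource_inputs(template: dict, datasource_name: str) -> dict:
    """Replace ${NAME} data source placeholders declared under "__inputs".

    Assumption: the template follows the export layout in which each
    data source input is listed in "__inputs" with type "datasource".
    """
    text = json.dumps(template)
    # Escape the name so that it stays valid inside the JSON text.
    escaped_name = json.dumps(datasource_name)[1:-1]
    for item in template.get("__inputs", []):
        if item.get("type") == "datasource":
            text = text.replace("${" + item["name"] + "}", escaped_name)
    resolved = json.loads(text)
    resolved.pop("__inputs", None)
    return resolved


def dashboard_update_payload(template: dict, datasource_name: str) -> dict:
    """Build a request body that overwrites an existing dashboard."""
    dashboard = resolve_datasource_inputs(template, datasource_name)
    # A null id lets Grafana match the existing dashboard by uid or title.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": True}


# Hypothetical, heavily shortened template used only for illustration.
EXAMPLE_TEMPLATE = {
    "__inputs": [{"name": "DS_EXAMPLE", "type": "datasource"}],
    "title": "Doris Overview",
    "id": 7,
    "panels": [{"datasource": "${DS_EXAMPLE}"}],
}
payload = dashboard_update_payload(EXAMPLE_TEMPLATE, "doris_monitor_data_source")
```

If the template does not use this placeholder layout, the import page remains the safer route, since it handles the data source selection itself.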