Alerting

Overview

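The IoTDB alerting feature is expected to support two modes:

  • Write-triggered: the user writes raw data to the original time series, and every inserted data point runs the trigger's evaluation logic. If the alert condition is met, an alert is sent to the downstream data sink, which forwards it to external endpoints. This mode:

    • Suits scenarios that need to monitor every data point immediately.
    • Suits scenarios that are not sensitive to raw-data write performance, because the computation inside the trigger slows down writes.
  • Continuous query: the user writes raw data to the original time series, and a ContinuousQuery periodically queries that series and writes the results to a new time series. Each of those writes runs the trigger's evaluation logic. If the alert condition is met, an alert is sent to the downstream data sink, which forwards it to external endpoints. This mode:

    • Suits scenarios that need to check periodically how the data behaved over a time window.
    • Suits scenarios that need to downsample the raw data and persist the result.
    • Suits scenarios that are sensitive to raw-data write performance, because the periodic query has almost no effect on writes to the original time series.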

With the trigger module and the sink module now available, users can combine these two modules with AlertManager to implement alerting in the write-triggered mode.

Deploying AlertManager

Installation and startup

Binary files

Precompiled binaries can be downloaded from the official Prometheus download page.

To run:

    ./alertmanager --config.file=<your_file>

Docker image

Images are available on Quay.io or Docker Hub.

To run:

    docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

Configuration

The following example covers most of the configuration rules. For the full set of rules, see the AlertManager configuration documentation.

Example:

    # alertmanager.yml

    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'

    # The root route on which each incoming alert enters.
    route:
      # The root route must not have any matchers as it is the entry point for
      # all alerts. It needs to have a receiver configured so alerts that do not
      # match any of the sub-routes are sent to someone.
      receiver: 'team-X-mails'

      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This ensures that multiple alerts for the same group that start
      # firing shortly after one another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend it.
      repeat_interval: 3h

      # All the above attributes are inherited by all child routes and can
      # be overwritten on each.

      # The child route trees.
      routes:
        # This route performs a regular expression match on alert labels to
        # catch alerts that are related to a list of services.
        - match_re:
            service: ^(foo1|foo2|baz)$
          receiver: team-X-mails

          # The service has a sub-route for critical alerts, any alerts
          # that do not match, i.e. severity != critical, fall-back to the
          # parent node and are sent to 'team-X-mails'
          routes:
            - match:
                severity: critical
              receiver: team-X-pager

        - match:
            service: files
          receiver: team-Y-mails

          routes:
            - match:
                severity: critical
              receiver: team-Y-pager

        # This route handles all alerts coming from a database service. If there's
        # no team to handle it, it defaults to the DB team.
        - match:
            service: database
          receiver: team-DB-pager
          # Also group alerts by affected database.
          group_by: [alertname, cluster, database]

          routes:
            - match:
                owner: team-X
              receiver: team-X-pager

            - match:
                owner: team-Y
              receiver: team-Y-pager

    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        # Apply inhibition if the alertname is the same.
        # CAUTION:
        #   If all label names listed in `equal` are missing
        #   from both the source and target alerts,
        #   the inhibition rule will apply!
        equal: ['alertname']

    receivers:
      - name: 'team-X-mails'
        email_configs:
          - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'

      - name: 'team-X-pager'
        email_configs:
          - to: 'team-X+alerts-critical@example.org'
        pagerduty_configs:
          - routing_key: <team-X-key>

      - name: 'team-Y-mails'
        email_configs:
          - to: 'team-Y+alerts@example.org'

      - name: 'team-Y-pager'
        pagerduty_configs:
          - routing_key: <team-Y-key>

      - name: 'team-DB-pager'
        pagerduty_configs:
          - routing_key: <team-DB-key>

The examples that follow use this configuration:

    # alertmanager.yml

    global:
      smtp_smarthost: ''
      smtp_from: ''
      smtp_auth_username: ''
      smtp_auth_password: ''
      smtp_require_tls: false

    route:
      group_by: ['alertname']
      group_wait: 1m
      group_interval: 10m
      repeat_interval: 10h
      receiver: 'email'

    receivers:
      - name: 'email'
        email_configs:
          - to: ''

    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname']

API

The AlertManager API comes in two versions, v1 and v2. The current version is v2 (for its specification, see api/v2/openapi.yaml).

By default the API prefixes are /api/v1 and /api/v2, and the endpoints for sending alerts are /api/v1/alerts and /api/v2/alerts. If the user specifies --web.route-prefix, for example --web.route-prefix=/alertmanager/, the prefixes become /alertmanager/api/v1 and /alertmanager/api/v2, and the endpoints for sending alerts become /alertmanager/api/v1/alerts and /alertmanager/api/v2/alerts.
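The prefix rule above can be sketched as a small helper. This is an illustration only: the class name AlertEndpoints and the method alertsEndpoint are hypothetical and are not part of IoTDB or AlertManager.

```java
import java.net.URI;

/** Sketch: derive the alert endpoint from a base address and an optional route prefix. */
final class AlertEndpoints {

    private AlertEndpoints() {}

    /**
     * @param baseUrl     scheme, host and port, e.g. "http://127.0.0.1:9093"
     * @param routePrefix value given to --web.route-prefix, or null/empty if unset
     * @param apiVersion  "v1" or "v2"
     */
    static URI alertsEndpoint(String baseUrl, String routePrefix, String apiVersion) {
        StringBuilder sb = new StringBuilder(trimSlashes(baseUrl));
        String prefix = routePrefix == null ? "" : trimSlashes(routePrefix);
        if (!prefix.isEmpty()) {
            sb.append('/').append(prefix);
        }
        sb.append("/api/").append(apiVersion).append("/alerts");
        return URI.create(sb.toString());
    }

    /** Removes leading and trailing '/' characters so segments can be joined with one slash. */
    private static String trimSlashes(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == '/') {
            start++;
        }
        while (end > start && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(alertsEndpoint("http://127.0.0.1:9093/", null, "v2"));
        System.out.println(alertsEndpoint("http://127.0.0.1:9093", "/alertmanager/", "v2"));
    }
}
```

Normalizing the slashes in one place keeps the result stable whether or not the prefix is written with leading or trailing slashes.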

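Creating a trigger

Writing the trigger class

Users define a trigger by creating a Java class and implementing the logic in its hooks. For the configuration procedure, and for how to use the AlertManagerSink utility classes provided by the Sink module, see Triggers.

The following example creates the class org.apache.iotdb.trigger.AlertingExample, whose alertManagerHandler member can send alerts to the AlertManager instance at http://127.0.0.1:9093/.

When value > 100.0, it sends an alert whose severity is critical; when 50.0 < value <= 100.0, it sends an alert whose severity is warning.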
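The threshold rule can be isolated from the trigger plumbing as a pure function, which makes the boundary cases easy to check. This is a minimal sketch; the class SeverityRule and its members are hypothetical names, not part of the IoTDB API.

```java
import java.util.Optional;

/** Sketch of the threshold rule: maps a value to a severity, or to nothing. */
final class SeverityRule {

    static final double CRITICAL_THRESHOLD = 100.0;
    static final double WARNING_THRESHOLD = 50.0;

    private SeverityRule() {}

    /** Returns "critical" above 100.0, "warning" in (50.0, 100.0], empty otherwise. */
    static Optional<String> severityFor(double value) {
        if (value > CRITICAL_THRESHOLD) {
            return Optional.of("critical");
        }
        if (value > WARNING_THRESHOLD) {
            return Optional.of("warning");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (double v : new double[] {0, 30, 60, 90, 120}) {
            System.out.println(v + " -> " + severityFor(v).orElse("no alert"));
        }
    }
}
```

Both comparisons are strict, so a value of exactly 100.0 is a warning and a value of exactly 50.0 raises no alert.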

    package org.apache.iotdb.trigger;

    /*
     Package imports are omitted here.
    */

    public class AlertingExample implements Trigger {

      private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler();

      private final AlertManagerConfiguration alertManagerConfiguration =
          new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts");

      private String alertname;

      private final HashMap<String, String> labels = new HashMap<>();

      private final HashMap<String, String> annotations = new HashMap<>();

      @Override
      public void onCreate(TriggerAttributes attributes) throws Exception {
        alertManagerHandler.open(alertManagerConfiguration);

        alertname = "alert_test";

        labels.put("series", "root.ln.wf01.wt01.temperature");
        labels.put("value", "");
        labels.put("severity", "");

        annotations.put("summary", "high temperature");
        annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}");
      }

      @Override
      public void onDrop() throws IOException {
        alertManagerHandler.close();
      }

      @Override
      public void onStart() {
        alertManagerHandler.open(alertManagerConfiguration);
      }

      @Override
      public void onStop() throws Exception {
        alertManagerHandler.close();
      }

      // Called for each single inserted data point.
      @Override
      public Double fire(long timestamp, Double value) throws Exception {
        if (value > 100.0) {
          labels.put("value", String.valueOf(value));
          labels.put("severity", "critical");
          AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
          alertManagerHandler.onEvent(alertManagerEvent);
        } else if (value > 50.0) {
          labels.put("value", String.valueOf(value));
          labels.put("severity", "warning");
          AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
          alertManagerHandler.onEvent(alertManagerEvent);
        }

        return value;
      }

      // Called for batch inserts; applies the same rule to every value.
      @Override
      public double[] fire(long[] timestamps, double[] values) throws Exception {
        for (double value : values) {
          if (value > 100.0) {
            labels.put("value", String.valueOf(value));
            labels.put("severity", "critical");
            AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          } else if (value > 50.0) {
            labels.put("value", String.valueOf(value));
            labels.put("severity", "warning");
            AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          }
        }
        return values;
      }
    }
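AlertManagerHandler hides the HTTP details. For orientation, the v2 alerts endpoint accepts a JSON array of alert objects, each carrying a labels map and an annotations map. The sketch below builds such a body by hand. It is not IoTDB's implementation: the class AlertPayloadSketch is hypothetical, placing alertname among the labels follows AlertManager's convention and is assumed here, and the {{...}} annotation templates are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: serialize one alert into the JSON array shape expected by /api/v2/alerts. */
final class AlertPayloadSketch {

    private AlertPayloadSketch() {}

    static String toJson(String alertname, Map<String, String> labels, Map<String, String> annotations) {
        // Assumption: the alert name travels as the "alertname" label.
        Map<String, String> allLabels = new LinkedHashMap<>();
        allLabels.put("alertname", alertname);
        allLabels.putAll(labels);
        return "[{\"labels\":" + object(allLabels) + ",\"annotations\":" + object(annotations) + "}]";
    }

    /** Renders a string-to-string map as a JSON object, keeping iteration order. */
    private static String object(Map<String, String> map) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : map.entrySet()) {
            joiner.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return joiner.toString();
    }

    /** Minimal JSON string escaping for quotes, backslashes and control characters. */
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("series", "root.ln.wf01.wt01.temperature");
        labels.put("value", "120.0");
        labels.put("severity", "critical");
        Map<String, String> annotations = new LinkedHashMap<>();
        annotations.put("summary", "high temperature");
        System.out.println(toJson("alert_test", labels, annotations));
    }
}
```

A production sink would normally use a JSON library instead of hand-written escaping; the manual version only keeps the sketch free of dependencies.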

Registering the trigger

The following SQL statement registers a trigger named root-ln-wf01-wt01-alert on the time series root.ln.wf01.wt01.temperature. Its logic is defined by the class org.apache.iotdb.trigger.AlertingExample.

    CREATE TRIGGER `root-ln-wf01-wt01-alert`
    AFTER INSERT
    ON root.ln.wf01.wt01.temperature
    AS "org.apache.iotdb.trigger.AlertingExample"

Writing data

Once AlertManager is deployed and running and the trigger has been created, the alerting feature can be tested by writing data to the time series.

    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90);
    INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);

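After these statements are executed, an alert email is received. Because the AlertManager configuration used here specifies that alerts with severity critical inhibit alerts with severity warning, the email contains only the alert triggered by writing (5, 120).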
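The effect of the inhibit rule on these five writes can be modeled in a few lines. The sketch below is a deliberately simplified model, not AlertManager's implementation: it ignores timing and grouping and assumes all alerts are firing when the notification is assembled. The class InhibitionSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of: critical inhibits warning when alertname is equal. */
final class InhibitionSketch {

    static final class Alert {
        final String alertname;
        final String severity;
        final double value;

        Alert(String alertname, String severity, double value) {
            this.alertname = alertname;
            this.severity = severity;
            this.value = value;
        }
    }

    private InhibitionSketch() {}

    /** Returns the alerts that would still be notified after inhibition. */
    static List<Alert> applyInhibition(List<Alert> firing) {
        List<Alert> kept = new ArrayList<>();
        for (Alert target : firing) {
            if (!isInhibited(target, firing)) {
                kept.add(target);
            }
        }
        return kept;
    }

    /** A warning is muted if any firing critical alert shares its alertname. */
    private static boolean isInhibited(Alert target, List<Alert> firing) {
        if (!"warning".equals(target.severity)) {
            return false;
        }
        for (Alert source : firing) {
            if ("critical".equals(source.severity) && source.alertname.equals(target.alertname)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Values 0 and 30 raise no alert; 60 and 90 are warnings; 120 is critical.
        List<Alert> firing = new ArrayList<>();
        firing.add(new Alert("alert_test", "warning", 60.0));
        firing.add(new Alert("alert_test", "warning", 90.0));
        firing.add(new Alert("alert_test", "critical", 120.0));
        for (Alert a : applyInhibition(firing)) {
            System.out.println(a.severity + " " + a.value); // prints "critical 120.0"
        }
    }
}
```

Without the critical alert in the list, both warnings would pass through unchanged, which matches what the rule does when only values between 50.0 and 100.0 are written.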
