Enable Prometheus Monitoring

This article describes how to monitor all running Linkis services with Prometheus.

Prometheus is a Cloud Native Computing Foundation project and a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

In a microservice context, it provides service discovery, which can dynamically find scrape targets from a service registry such as Eureka or Consul and pull metrics from API endpoints over HTTP.
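As a quick sanity check of what Eureka exposes for discovery, you can query its REST API directly. A minimal sketch, assuming Eureka listens on the default Linkis port 20303 (replace linkishost with your actual host):

```bash
# List all registered application instances; the JSON response includes
# each instance's metadata, such as the prometheus.path used for scraping.
curl -s -H "Accept: application/json" http://linkishost:20303/eureka/apps
```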

The following diagram illustrates the architecture of Prometheus and some of its ecosystem components:

(Figure 1: Prometheus architecture and ecosystem components)

Prometheus scrapes metrics directly, or receives metrics from short-lived jobs indirectly via a push gateway. It stores all scraped samples locally and runs rules over this data, either to aggregate and record new time series from existing data or to generate alerts. Grafana or other API consumers can be used to visualize the collected data.

(Figure 2)

In Linkis, we use Prometheus's Eureka service discovery (Eureka SD) to query scrape targets through the Eureka REST API. Prometheus periodically checks the REST endpoint and creates a scrape target for every application instance.

In the installation script, Prometheus monitoring can be enabled with a switch.

Modify PROMETHEUS_ENABLE in the installation script linkis-env.sh:

```bash
export PROMETHEUS_ENABLE=true
```

After running install.sh to install Linkis, the Prometheus-related configuration appears in the following files:

```yaml
## application-linkis.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:${prometheus.endpoint}}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```yaml
## application-eureka.yml ##
eureka:
  instance:
    metadata-map:
      prometheus.path: ${prometheus.path:/actuator/prometheus}
...
management:
  endpoints:
    web:
      exposure:
        include: refresh,info,health,metrics,prometheus
```
```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
wds.linkis.server.user.restful.uri.pass.auth=/api/rest_j/v1/actuator/prometheus,
...
```

For engines such as Spark, Flink, or Hive, the same configuration needs to be added manually:

```properties
## linkis-engineconn.properties ##
...
wds.linkis.prometheus.enable=true
wds.linkis.server.user.restful.uri.pass.auth=/api/rest_j/v1/actuator/prometheus,
...
```

Alternatively, you can enable monitoring manually. Modify the endpoints configuration in ${LINKIS_HOME}/conf/application-linkis.yml to add prometheus:

```yaml
## application-linkis.yml ##
management:
  endpoints:
    web:
      exposure:
        # add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify the endpoints configuration in ${LINKIS_HOME}/conf/application-eureka.yml to add prometheus:

```yaml
## application-eureka.yml ##
management:
  endpoints:
    web:
      exposure:
        # add prometheus
        include: refresh,info,health,metrics,prometheus
```

Modify ${LINKIS_HOME}/conf/linkis.properties and uncomment the prometheus.enable line:

```properties
## linkis.properties ##
...
wds.linkis.prometheus.enable=true
...
```

Then restart all Linkis services:

```bash
$ bash linkis-start-all.sh
```

After Linkis starts, the Prometheus endpoint of each microservice can be accessed directly, for example: http://linkishost:9103/api/rest_j/v1/actuator/prometheus

Note: the Prometheus endpoints of the gateway/eureka services do not have the api/rest_j/v1 prefix, e.g. http://linkishost:9001/actuator/prometheus
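To verify that the endpoints are reachable, you can pull them with curl (linkishost is a placeholder for your actual host):

```bash
# Regular microservice endpoint (with the api/rest_j/v1 prefix)
curl -s http://linkishost:9103/api/rest_j/v1/actuator/prometheus | head

# gateway/eureka endpoint (no prefix)
curl -s http://linkishost:9001/actuator/prometheus | head
```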

Typically, the monitoring setup for a cloud-native application is deployed on Kubernetes with service discovery and high availability (for example, using a Kubernetes operator like Prometheus Operator). To quickly prototype the monitoring dashboards and experiment with different chart types (histogram/gauge), a simple local build is enough. This section explains how to set up the Prometheus/Alertmanager/Grafana monitoring stack locally with Docker Compose.

First, let's define the common components of this stack, as listed below:

- The Alertmanager container exposes its UI on port 9093 and reads its configuration from alertmanager.yml;
- The Prometheus container exposes its UI on port 9090, reads its configuration from prometheus.yml, and reads the alerting rules from alertrule.yml;
- The Grafana container exposes its UI on port 3000; the metrics data source is defined in grafana_datasources.yml and the configuration in grafana_config.ini.

The following docker-compose.yml file summarizes the configuration of the above components:
```yaml
## docker-compose.yml ##
version: "3"
networks:
  default:
    external: true
    name: my-network
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/alertrule.yml:/etc/prometheus/alertrule.yml
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=123456
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
```

Then, to define alerts based on the Prometheus metrics, you can group them in an alertrule.yml file, so that you can validate that the alerts fire correctly in the local setup before configuring them on production instances. For example, the following configuration covers common metrics used to monitor Linkis services, alerting on:

- a. Down instance
- b. High CPU for each JVM instance (>80%)
- c. High heap memory for each JVM instance (>50%)
- d. High non-heap memory for each JVM instance (>60%)
- e. High waiting-thread count for each JVM instance (>100)
```yaml
## alertrule.yml ##
groups:
  - name: LinkisAlert
    rules:
      - alert: LinkisNodeDown
        expr: last_over_time(up{job="linkis", application=~"LINKIS.*", application!="LINKIS-CG-ENGINECONN"}[1m]) == 0
        for: 15s
        labels:
          severity: critical
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} down"
          description: "Linkis instance(s) is/are down in last 1m"
          value: "{{ $value }}"
      - alert: LinkisNodeCpuHigh
        expr: system_cpu_usage{job="linkis", application=~"LINKIS.*"} >= 0.8
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} cpu overload"
          description: "CPU usage is over 80% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisNodeHeapMemoryHigh
        expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) * 100 / sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="heap"}) by(instance) >= 50
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} memory(heap) overload"
          description: "Memory usage(heap) is over 50% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisNodeNonHeapMemoryHigh
        expr: sum(jvm_memory_used_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) * 100 / sum(jvm_memory_max_bytes{job="linkis", application=~"LINKIS.*", area="nonheap"}) by(instance) >= 60
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} memory(nonheap) overload"
          description: "Memory usage(nonheap) is over 60% for over 1min"
          value: "{{ $value }}"
      - alert: LinkisWaitingThreadHigh
        expr: jvm_threads_states_threads{job="linkis", application=~"LINKIS.*", state="waiting"} >= 100
        for: 1m
        labels:
          severity: warning
          service: Linkis
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} waiting threads is high"
          description: "waiting threads is over 100 for over 1min"
          value: "{{ $value }}"
```

Please note: once a service instance goes down, it is no longer one of the targets of Prometheus Eureka SD, and the up metric stops returning data shortly afterwards. The rule therefore checks whether up was 0 at any point within the last minute to determine whether the service is alive.
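Before loading the rules into Prometheus, you can validate the file with promtool, which ships with Prometheus. A sketch using the Docker image, assuming the file sits in ./config:

```bash
docker run --rm -v "$(pwd)/config:/config" --entrypoint promtool \
  prom/prometheus:latest check rules /config/alertrule.yml
```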

Third, and most importantly, define the Prometheus configuration in the prometheus.yml file. It defines:

- global settings, such as the metrics scrape interval and the rule evaluation interval;
- the connection information for Alertmanager and the path to the alerting rule definitions;
- the connection information for the application metrics endpoints.

Here is a sample configuration file for Linkis:

```yaml
## prometheus.yml ##
# my global config
global:
  scrape_interval: 30s     # How frequently to scrape targets (default: 15s).
  evaluation_interval: 30s # How frequently to evaluate rules (default: 15s).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alertrule.yml"
scrape_configs:
  # A scrape configuration containing exactly one endpoint to scrape:
  # here it's Prometheus itself.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: linkis
    eureka_sd_configs:
      # the endpoint of your eureka instance
      - server: http://{{linkis-host}}:20303/eureka
    relabel_configs:
      # Use the Eureka application name as the 'application' label.
      - source_labels: [__meta_eureka_app_name]
        target_label: application
      # Scrape each instance at the path advertised in its Eureka metadata.
      - source_labels: [__meta_eureka_app_instance_metadata_prometheus_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
Fourth, the following configuration defines how alerts are sent to an external webhook:

```yaml
## alertmanager.yml ##
global:
  resolve_timeout: 5m
route:
  receiver: 'webhook'
  group_by: ['alertname']
  # How long to wait to buffer alerts of the same group before sending a notification initially.
  group_wait: 1m
  # How long to wait before sending an alert that has been added to a group
  # for which there has already been a notification.
  group_interval: 5m
  # How long to wait before re-sending a given alert that has already been sent in a notification.
  repeat_interval: 12h
receivers:
  - name: 'webhook'
    webhook_configs:
      - send_resolved: true
        url: {{your-webhook-url}}
```
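Alertmanager ships with amtool, which can validate this file as well. A sketch via the Docker image (fill in the placeholder URL first, so the file is valid YAML):

```bash
docker run --rm -v "$(pwd)/config:/config" --entrypoint amtool \
  prom/alertmanager:latest check-config /config/alertmanager.yml
```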

Finally, with all the configuration files and the docker compose file defined, we can start the monitoring stack with docker-compose up, as shown below.
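Note that the compose file declares the external network my-network, so create it once before the first start:

```bash
# Create the external network referenced by docker-compose.yml (one-time step)
docker network create my-network

# Start Prometheus, Alertmanager, and Grafana in the background
docker-compose up -d
```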

On the Prometheus targets page, you should expect to see all Linkis service instances, as shown below:

(Figure 5: Linkis service instances on the Prometheus targets page)
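The discovered targets can also be checked from the command line through Prometheus's HTTP API (assuming the local stack above):

```bash
# List all active scrape targets and their health
curl -s 'http://localhost:9090/api/v1/targets?state=active'
```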

Once Grafana is accessible, add Prometheus as a data source in Grafana and import the dashboard template with ID 11378, which is commonly used for Spring Boot services (2.1+). You can then view a live dashboard for Linkis there.

(Figure 6: Live Linkis dashboard in Grafana)
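As an alternative to adding the data source manually, you can provision it through the datasources directory mounted in the compose file. A minimal sketch (the file name grafana_datasources.yml and the prometheus hostname follow the setup above):

```yaml
## grafana/provisioning/datasources/grafana_datasources.yml ##
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # 'prometheus' resolves through the shared Docker network from docker-compose.yml
    url: http://prometheus:9090
    isDefault: true
```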

You can also try integrating Prometheus Alertmanager with your own webhook, where you can check whether alert messages are triggered.