Building a Monitoring System at a High-Tech Startup

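贝联珠贯科技 is a B2B technology company that helps customers substantially raise resource utilization and thereby significantly reduce their spending on IT machines.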

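Preface

We currently run about 100 machines. We had no monitoring, so we only found out a machine was down after it had already failed, and business problems surfaced only when testers reported them. Internally we operate several K8s clusters, and each environment had its own monitoring and log collection stack, so we needed an all-in-one operations monitoring system.

We first tried Grafana + Mimir + Loki. The custom development cost was too high and we could not get effective alerting in the short term, so we dropped it and tried Nightingale (夜莺) V5 next.

Nightingale spared us the cost of developing alert notifications. With plain Grafana or Alertmanager you have to integrate your own IM yourself. Nightingale supports business groups and departments, which let us route alerts at a fine granularity without a separate IM integration. Its alert configuration is also more detailed and easier to use. It works out of the box, with almost no learning curve.

The rest of this article walks through what we built, covering system operations, business operations, and database operations.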

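Monitoring setup

Nightingale setup

https://github.com/ccfos/nightingale

We used the simplest option, Docker Compose, to bring up Nightingale. As the documentation itself says, this approach is not recommended unless you are a Docker expert.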

[Screenshot: Nightingale documentation]

The startup commands are:

    git clone https://gitlink.org.cn/ccfos/nightingale.git
    cd nightingale/docker
    docker compose up -d

Once the services are up, open the nwebapi port, 18000, in a browser. The default user is root and the password is root.2020.

Host monitoring installation

For the host agent we chose grafana-agent. It bundles most of the exporters we need, which makes it all-in-one. It also supports push mode, which simplifies the workflow: grafana-agent is preinstalled when a host is provisioned and pushes metrics to the central server on its own.

The install script follows. Two things to note:

  1. Change the remote_write address to match your own Nightingale deployment; replace x.x.x.x with your IP.

  2. $_hostip: we recommend setting this to the host IP, because for operators the IP is the most direct identifier.

    function InstallMonitor(){
    # _hostip must be set by the caller; it is expanded when the config below is written.
    : "${_hostip:?set _hostip to the host IP before calling InstallMonitor}"
    # Download the agent binary only if it is not already installed.
    [ ! -f /usr/local/bin/grafana-agent ] && wget -O /usr/local/bin/grafana-agent https://lcc-init.oss-cn-hangzhou-internal.aliyuncs.com/grafana-agent
    chmod +x /usr/local/bin/grafana-agent
    mkdir -p /metrics /etc/grafana-agent
    # systemd unit for the agent.
    cat >/etc/systemd/system/grafana-agent.service <<EOF
    [Unit]
    Description="grafana-agent"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/grafana-agent -config.file /etc/grafana-agent/grafana-agent.yml
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=grafana-agent
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/grafana-agent.service
    # Agent configuration; the unquoted heredoc expands $_hostip at install time.
    cat >/etc/grafana-agent/grafana-agent.yml <<EOF
    server:
      log_level: info
      http_listen_port: 12345
    metrics:
      wal_directory: /metrics
      global:
        scrape_interval: 15s
        scrape_timeout: 10s
        remote_write:
          # Switch the remote write address depending on whether the host is in the cloud or on-premises.
          - url: http://x.x.x.x:19000/prometheus/v1/write
    integrations:
      agent:
        enabled: true
      node_exporter:
        enabled: true
        instance: "$_hostip"
        include_exporter_metrics: true
      process_exporter:
        enabled: true
        instance: "$_hostip"
        process_names:
          - name: "{{.Comm}}"
            cmdline:
              - '.+'
    EOF
    systemctl daemon-reload
    systemctl enable --now grafana-agent
    }
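The script expects `$_hostip` to be supplied by the caller. One possible way to choose it, sketched here as an assumption rather than as part of the original setup, is to ask the kernel which local address it would use for outbound traffic. The function name `primary_ip` and the probe address are hypothetical.

```python
import ipaddress
import socket


def primary_ip(probe_addr: str = "10.255.255.255", fallback: str = "127.0.0.1") -> str:
    """Return the local IPv4 address the kernel would use to reach probe_addr.

    Connecting a UDP socket sends no packets; it only makes the kernel pick
    a route and a source address. probe_addr is an arbitrary destination
    chosen for illustration and does not need to be reachable.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_addr, 1))
        # Validate the result before handing it back.
        return str(ipaddress.ip_address(sock.getsockname()[0]))
    except OSError:
        # No usable route, for example a host with only loopback configured.
        return fallback
    finally:
        sock.close()


if __name__ == "__main__":
    # Prints a shell assignment, e.g. for: eval "$(python3 primary_ip.py)"
    print(f"_hostip={primary_ip()}")
```

On hosts with several network interfaces the result depends on the routing table, so an explicitly configured IP may still be the safer choice.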

BlackBox Exporter

Download: https://github.com/prometheus/blackbox_exporter/releases

Download the binary and extract it to /usr/local/bin/.

The install script is:

    function InstallBlackboxExporter(){
    cat >/etc/systemd/system/blackbox_exporter.service <<EOF
    [Unit]
    Description="blackbox_exporter"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox-exporter/blackbox.yml
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=blackbox_exporter
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/blackbox_exporter.service
    # The config directory must exist before the file is written.
    mkdir -p /etc/blackbox-exporter
    # Quoted heredoc: keeps the shell from expanding ${1} in the irc_banner module.
    cat >/etc/blackbox-exporter/blackbox.yml <<'EOF'
    modules:
      http_2xx:
        prober: http
      http_post_2xx:
        prober: http
        http:
          method: POST
      tcp_connect:
        prober: tcp
      pop3s_banner:
        prober: tcp
        tcp:
          query_response:
            - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      grpc:
        prober: grpc
        grpc:
          tls: true
          preferred_ip_protocol: "ip4"
      grpc_plain:
        prober: grpc
        grpc:
          tls: false
          service: "service1"
      ssh_banner:
        prober: tcp
        tcp:
          query_response:
            - expect: "^SSH-2.0-"
            - send: "SSH-2.0-blackbox-ssh-check"
      irc_banner:
        prober: tcp
        tcp:
          query_response:
            - send: "NICK prober"
            - send: "USER prober prober prober :prober"
            - expect: "PING :([^ ]+)"
              send: "PONG ${1}"
            - expect: "^:[^ ]+ 001"
      icmp:
        prober: icmp
    EOF
    systemctl daemon-reload
    systemctl enable --now blackbox_exporter
    }
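Once the exporter is running, a probe can be triggered by hand through its /probe endpoint, which takes a `module` and a `target` query parameter. The sketch below builds such a URL and reads the `probe_success` gauge from the response text. The helper names and the example target are hypothetical; 9115 is the port used elsewhere in this article.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, module: str, target: str) -> str:
    """Build the URL used to ask blackbox_exporter to probe one target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def probe_succeeded(metrics_text: str) -> bool:
    """Return True if the exposition text contains 'probe_success 1'."""
    for line in metrics_text.splitlines():
        # Comment lines start with '#', so only the sample line matches.
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False


if __name__ == "__main__":
    # Hypothetical target; fetch the printed URL with curl or urllib to run the probe.
    print(probe_url("127.0.0.1:9115", "http_2xx", "http://example.com/health"))
```

Fetching the printed URL returns the probe metrics as plain text, which can be passed to `probe_succeeded` for a quick check before wiring the exporter into Prometheus.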

Mysqld Exporter

Download: https://github.com/prometheus/mysqld_exporter

Download the binary and extract it to /usr/local/bin/.

Run the following SQL on the database to be monitored, replacing xxxxx with the password you choose:

    create user 'exporter'@'%' identified by 'xxxxx';
    GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%' WITH MAX_USER_CONNECTIONS 3;
    flush privileges;

The install script follows. The user and password in mysqld_exporter.cnf are the ones created by the SQL above.

    function InstallMysqldExporter(){
    cat >/etc/systemd/system/mysqld_exporter.service <<EOF
    [Unit]
    Description="mysqld_exporter"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysqld_exporter.cnf --collect.auto_increment.columns --collect.binlog_size --collect.global_status --collect.global_variables --collect.info_schema.innodb_metrics --collect.info_schema.innodb_cmp --collect.info_schema.innodb_cmpmem --collect.info_schema.processlist --collect.info_schema.query_response_time --collect.info_schema.tables --collect.info_schema.tablestats --collect.info_schema.userstats --collect.perf_schema.eventswaits --collect.perf_schema.file_events --collect.perf_schema.indexiowaits --collect.perf_schema.tableiowaits --collect.perf_schema.tablelocks --collect.slave_status
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=mysqld_exporter
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/mysqld_exporter.service
    # Credentials of the exporter user created above; host is the database address.
    cat >/etc/mysqld_exporter.cnf <<EOF
    [client]
    user=exporter
    password=xxxxx
    host=x.x.x.x
    port=3306
    EOF
    # The file holds a password, so keep it readable by its owner only.
    chmod 0600 /etc/mysqld_exporter.cnf
    systemctl daemon-reload
    systemctl enable --now mysqld_exporter
    }

Generating configuration dynamically with consul + consul-template

Installing Consul

Replace the -bind and -client addresses with this host's IP.

    function InstallConsul(){
    yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
    yum -y install consul
    mkdir -p /data/consul
    # Single-node server with the web UI enabled.
    cat >/etc/systemd/system/consul.service <<EOF
    [Unit]
    Description="consul"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/bin/consul agent -server -bootstrap-expect 1 -bind=x.x.x.x -client=x.x.x.x -data-dir=/data/consul -node=agent-one -config-dir=/etc/consul.d -ui
    WorkingDirectory=/usr/bin/
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=consul
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    chmod 0644 /etc/systemd/system/consul.service
    systemctl daemon-reload
    systemctl enable --now consul
    }

Installing consul-template

The install script follows. Replace x.x.x.x with the Nightingale address and a.b.c.d with the address where Consul is deployed.

    wget https://releases.hashicorp.com/consul-template/0.29.0/consul-template_0.29.0_linux_amd64.zip
    unzip consul-template_0.29.0_linux_amd64.zip
    chmod +x consul-template
    mv consul-template /usr/local/bin/consul-template
    # Directory name must match the template "source" paths below.
    mkdir -p /etc/consul-template/templates
    cat > /etc/consul-template/consul-template.conf << EOF
    log_level = "warn"
    syslog {
      # This enables syslog logging.
      enabled = true
      # This is the name of the syslog facility to log to.
      facility = "LOCAL5"
    }
    consul {
      # auth {
      #   enabled = true
      #   username = "test"
      #   password = "test"
      # }
      # Replace with your Consul address
      address = "a.b.c.d:8500"
      retry {
        enabled = true
        attempts = 12
        backoff = "250ms"
        # If max_backoff is set to 10s and backoff is set to 1s, sleep times
        # would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
        max_backoff = "3m"
      }
    }
    template {
      source = "/etc/consul-template/templates/url-monitor.ctmpl"
      destination = "/home/nightingale-main/docker/prometc/conf.d/url/url.yaml"
      command = "curl -X POST http://x.x.x.x:9090/-/reload"
      command_timeout = "60s"
      backup = true
      wait {
        min = "2s"
        max = "20s"
      }
    }
    template {
      source = "/etc/consul-template/templates/icmp-monitor.ctmpl"
      destination = "/home/nightingale-main/docker/prometc/conf.d/icmp/icmp.yaml"
      command = ""
      command_timeout = "60s"
      backup = true
      wait {
        min = "2s"
        max = "20s"
      }
    }
    EOF
    # Template for URL probes: one target per key under blackbox/url/http200.
    cat > /etc/consul-template/templates/url-monitor.ctmpl <<'EOF'
    - targets:
      {{- range ls "blackbox/url/http200" }}
      - http://{{ .Key }}{{ .Value }}
      {{- end }}
    EOF
    # Template for ICMP probes: one target group per key under blackbox/icmp.
    cat > /etc/consul-template/templates/icmp-monitor.ctmpl <<'EOF'
    {{- range ls "blackbox/icmp" }}
    - targets:
      - {{ .Key }}
      labels:
        instance: {{ .Key }}
    {{- end }}
    EOF
    cat > /etc/systemd/system/consul-template.service <<EOF
    [Unit]
    Description="consul-template"
    After=network.target
    [Service]
    Type=simple
    ExecStart=/usr/local/bin/consul-template -config /etc/consul-template/consul-template.conf
    WorkingDirectory=/usr/local/bin
    SuccessExitStatus=0
    LimitNOFILE=65536
    StandardOutput=syslog
    StandardError=syslog
    SyslogIdentifier=consul-template
    KillMode=process
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload
    systemctl enable --now consul-template.service

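Configuring Consul K/V to generate URL monitoring dynamically

Add K/V entries as follows. They correspond to the paths rendered in the *.ctmpl files above: the key is the domain name and the value is the URL path.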

[Screenshot: Consul configuration]
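The same entries can also be written through Consul's HTTP API, which accepts a PUT on /v1/kv/<key> with the value as the request body. The sketch below builds such a request without sending it, and mirrors what url-monitor.ctmpl renders for a set of entries. The helper names, the Consul address, and the example domains are hypothetical.

```python
import urllib.request


def kv_put_request(consul_addr: str, key: str, value: str) -> urllib.request.Request:
    """Build (but do not send) a PUT against Consul's /v1/kv/<key> endpoint."""
    return urllib.request.Request(
        url=f"http://{consul_addr}/v1/kv/{key}",
        data=value.encode("utf-8"),
        method="PUT",
    )


def rendered_targets(entries: dict[str, str]) -> list[str]:
    """Mirror url-monitor.ctmpl: emit http://<Key><Value> for each pair, in key order."""
    return [f"http://{domain}{path}" for domain, path in sorted(entries.items())]


if __name__ == "__main__":
    # Hypothetical entries: key is the domain, value is the path to probe.
    entries = {"www.example.com": "/healthz", "api.example.com": "/ping"}
    for domain, path in entries.items():
        req = kv_put_request("127.0.0.1:8500", f"blackbox/url/http200/{domain}", path)
        print(req.get_method(), req.full_url)
        # To actually write the key: urllib.request.urlopen(req)
    print(rendered_targets(entries))
```

Because the template concatenates key and value directly, the value should start with a slash.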

Modifying the Prometheus configuration

Append the following to nightingale-main/docker/prometc/prometheus.yml:

      - job_name: MySQL
        static_configs:
          - targets:
              - x.x.x.x:9104
            labels:
              instance: MySQL-dev
      - job_name: process
        static_configs:
          - targets:
              - x.x.x.x:9256
      - job_name: 'blackbox-url-monitor'
        metrics_path: /probe
        params:
          module: [http_2xx] # Look for a HTTP 200 response.
        file_sd_configs:
          - refresh_interval: 1m
            files:
              - ./conf.d/url/*.yaml
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: x.x.x.x:9115
      - job_name: 'blackbox-icmp-monitor'
        scrape_interval: 1m
        metrics_path: /probe
        params:
          module: [icmp]
        file_sd_configs:
          - refresh_interval: 1m
            files:
              - ./conf.d/icmp/*.yaml
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: x.x.x.x:9115
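The three relabel rules of the blackbox-url-monitor job are easier to follow when traced on one target. The sketch below is a simplified simulation of those rules, not Prometheus code; the function name and the example target are hypothetical.

```python
def relabel_blackbox(address: str, exporter: str) -> dict[str, str]:
    """Apply the blackbox-url-monitor relabel rules to one discovered target."""
    labels = {"__address__": address}
    # Rule 1: copy the discovered address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the human-readable instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at blackbox_exporter instead of the target.
    labels["__address__"] = exporter
    return labels


if __name__ == "__main__":
    for name, value in relabel_blackbox("http://www.example.com/healthz", "x.x.x.x:9115").items():
        print(f"{name} = {value}")
```

The net effect is that Prometheus scrapes x.x.x.x:9115/probe with the original URL as the `target` parameter, while dashboards and alerts still show the probed URL as `instance`. The ICMP job omits the second rule because its file_sd entries already carry an `instance` label.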

Create the conf.d directory under nightingale-main/docker/prometc/:

    cd nightingale-main/docker/prometc/
    mkdir -p conf.d/{icmp,url}

Restart Prometheus:

    docker restart prometheus

After the restart, check the Prometheus status.

[Screenshot: Prometheus status]

Log monitoring setup

Thanks to the Nightingale community for their support.

  1. Prerequisite: Nightingale version 5.9.2 or later.
  2. Loki is already deployed and already supports multi-tenancy.

Loki configuration is not covered here; there are plenty of tutorials online.

Append the following to docker-compose.yml, at the same level as nserver:

      lokinserver:
        image: registry.cn-hangzhou.aliyuncs.com/lcc-middleware/nightingale:5.9.2
        container_name: lokinserver
        hostname: nserver
        restart: always
        environment:
          GIN_MODE: release
          TZ: Asia/Shanghai
          WAIT_HOSTS: mysql:3306, redis:6379
        volumes:
          - ./lokin9eetc:/app/etc
        ports:
          - "20000:20000"
        networks:
          - nightingale
        depends_on:
          - mysql
          - redis
          - prometheus
          - ibex
        links:
          - mysql:mysql
          - redis:redis
          - prometheus:prometheus
          - ibex:ibex
        command: >
          sh -c "/wait && /app/n9e server"

Generate the configuration files for the lokinserver container:

    cp -r n9eetc lokin9eetc
    cd lokin9eetc

Edit the Reader section of lokin9eetc/server.conf as shown below. If multi-tenancy is enabled, remember to pass Headers; if it is not, remove the Headers field. Loki's API paths under the loki prefix are the Prometheus-compatible ones, so the prefix must be included. Replace the Url field with your own domain.

    [Reader]
    # prometheus base url
    Url = "http://loki.xxx.xxx/loki/"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    Headers = ["X-Scope-OrgID","lcc-loki"]

Edit nightingale-main/docker/n9eetc/webapi.conf and append the following. The same rules apply: if multi-tenancy is enabled, remember to pass Headers; if it is not, remove the Headers field. The loki prefix is required because those paths are the Prometheus-compatible API. Replace the Prom field with your own domain.

    [[Clusters]]
    # Prometheus cluster name
    Name = "Loki"
    # Prometheus APIs base url
    Prom = "http://loki.xxx.xxx/loki/"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    Headers = ["X-Scope-OrgID","lcc-loki"]
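To see why both the /loki/ prefix and the tenant header matter, it helps to look at the request that results from this configuration. The sketch below assumes that Headers is a flat list of alternating names and values, which is how the setting above reads, and that the query path api/v1/query is appended to the base URL. The function names and the LogQL expression are hypothetical.

```python
from urllib.parse import urlencode


def header_pairs(flat: list[str]) -> dict[str, str]:
    """Turn ["Name1", "Value1", "Name2", "Value2"] into a header mapping."""
    if len(flat) % 2 != 0:
        raise ValueError("Headers must contain name/value pairs")
    return dict(zip(flat[0::2], flat[1::2]))


def instant_query(base_url: str, logql: str, headers: list[str]) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for an instant query under the /loki/ prefix."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": logql})
    return url, header_pairs(headers)


if __name__ == "__main__":
    url, headers = instant_query(
        "http://loki.xxx.xxx/loki/",
        'count_over_time({app="demo"}[1m])',  # hypothetical label selector
        ["X-Scope-OrgID", "lcc-loki"],
    )
    print(url)
    print(headers)
```

With multi-tenancy enabled, Loki uses the X-Scope-OrgID header to decide which tenant's logs the query runs against, so a missing or wrong value typically yields an error or empty results rather than data.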

Restart Nightingale:

    docker-compose up -d

Alert rule configuration

System operations

CPU utilization > 90

    (100-(avg by (mode, instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])))*100) > 90

Inode utilization > 90

    (100 - ((node_filesystem_files_free * 100) / node_filesystem_files))>90

sshd service is down

    (namedprocess_namegroup_num_procs{groupname="sshd"}) == 0

Memory utilization > 95

    (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - (node_memory_Cached_bytes + node_memory_Buffers_bytes))/node_memory_MemTotal_bytes*100 > 95
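The memory rule counts page cache and buffers as available, so only memory that is neither free nor reclaimable cache counts as used. The arithmetic can be checked with a short sketch; the function name and the sample figures (in MiB) are hypothetical.

```python
def memory_used_percent(total: int, free: int, cached: int, buffers: int) -> float:
    """Mirror the alert expression: (total - free - (cached + buffers)) / total * 100."""
    used = total - free - (cached + buffers)
    return used / total * 100


if __name__ == "__main__":
    # A 16 GiB host with 1 GiB free, 3 GiB cached, 1 GiB buffers: below the threshold.
    print(memory_used_percent(16384, 1024, 3072, 1024))
    # The same host with almost nothing free or cached: above 95, the rule fires.
    print(memory_used_percent(16384, 256, 256, 0))
```

A host whose memory is mostly page cache therefore does not trigger the alert, which matches how Linux reclaims cache under pressure.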

File handle usage > 90

    (node_filefd_allocated{}/node_filefd_maximum{}*100) > 90

IO wait > 30%

    avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 30

IO utilization over the past minute > 80

    (rate(node_disk_io_time_seconds_total{} [1m]) *100) > 80

Ping > 1s

    avg_over_time(probe_icmp_duration_seconds[1m]) > 1

Load average per CPU core > 2

    (avg(node_load1) by(instance)/count by (instance)(node_cpu_seconds_total{mode='idle'})) >2

TCP retransmission rate > 5%

    (rate(node_netstat_Tcp_RetransSegs{}[5m])/ rate(node_netstat_Tcp_OutSegs{}[5m])*100) > 5

Disk utilization > 85%

    (100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) ) > 85

Node reboot required

    node_reboot_required > 0

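Business operations

Our services are Go applications; adjust the rules as needed for other kinds of applications.

More than 10 ERROR log entries within one minute

For log rules, select the Loki cluster we added above.

[Screenshot: error log rule]

URL probe failing

    probe_http_status_code <= 199 or probe_http_status_code >= 400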
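The URL rule treats every final status code outside the 2xx and 3xx ranges as a failure. The sketch below restates that predicate so the boundary cases are explicit; the function name is hypothetical, and the treatment of 0 rests on the assumption that blackbox_exporter leaves the status code at 0 when no HTTP response was received.

```python
def url_probe_fails(status_code: int) -> bool:
    """Mirror the alert expression: status <= 199 or status >= 400."""
    return status_code <= 199 or status_code >= 400


if __name__ == "__main__":
    for code in (0, 200, 301, 404, 503):
        print(code, url_probe_fails(code))
```

Under that assumption, connection failures and timeouts are covered by the same rule as 4xx and 5xx responses.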

Panic in the past minute

[Screenshot: Panic log rule]

Database operations

Only some rules are listed here; more can be found among the importable rules.

[Screenshot: MySQL rules]

Database restarted

    mysql_global_status_uptime < 60

Connections above 80%

    avg by (instance) (mysql_global_status_threads_connected) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80

Slow queries in the last minute

    increase(mysql_global_status_slow_queries[1m]) > 0