Prometheus Monitoring

Canal server performance metrics monitoring is implemented on top of Prometheus.

For background on Prometheus, see the official website.

Sample dashboard

[Image: dashboard screenshot]

Quick start

  • Install and deploy Prometheus for your platform; see the official guide.

  • Edit prometheus.yml and add a job for Canal, for example:

        - job_name: 'canal'
          static_configs:
            - targets: ['localhost:11112']  # the port is canal.metrics.pull.port from canal.properties

  • Start Prometheus and the Canal server.

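  • Install and deploy Grafana; a recent version (5.2) is recommended.

  • Start grafana-server and log in at localhost:3000 with user admin and password admin (under the default configuration).

  • Configure the Prometheus datasource.

  • Import the template (canal/conf/metrics/Canal_instances_tmpl.json); for the import procedure, see the linked reference.

  • Open the 'Canal instances' dashboard and select the Prometheus datasource you just configured in the 'datasource' dropdown. You can then switch between instances in the 'destination' dropdown (refresh the page if the instance list does not appear).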


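Raw Canal monitoring metrics:

Metric | Description | Unit | Precision
canal_instance_transactions | Count of transactions received by the instance | - | -
canal_instance | Basic instance information | - | -
canal_instance_subscriptions | Number of instance subscriptions | - | -
canal_instance_publish_blocking_time | Time the instance dump thread spends blocked while submitting to the asynchronous parsing queue (parallel parsing mode only) | ms | ns
canal_instance_received_binlog_bytes | Bytes of binlog received by the instance | byte | -
canal_instance_parser_mode | Instance parsing mode (whether parallel parsing is enabled) | - | -
canal_instance_client_packets | Count of instance client requests | - | -
canal_instance_client_bytes | Bytes of data packets sent to the instance client | byte | -
canal_instance_client_empty_batches | Count of empty results returned to the instance client by the get interface | - | -
canal_instance_client_request_error | Count of failed instance client requests | - | -
canal_instance_client_request_latency | Response time profile of instance client requests | - | -
canal_instance_sink_blocking_time | Time the instance sink thread spends blocked while putting data into the store | ms | ns
canal_instance_store_produce_seq | Sequence number of events received by the instance store | - | -
canal_instance_store_consume_seq | Sequence number of events successfully consumed from the instance store | - | -
canal_instance_store | Basic instance store information | - | -
canal_instance_store_produce_mem | Total memory occupied by all events received by the instance store | byte | -
canal_instance_store_consume_mem | Total memory occupied by all events successfully consumed from the instance store | byte | -
canal_instance_put_rows | Table rows completed by store put operations | - | -
canal_instance_get_rows | Table rows returned by client get requests | - | -
canal_instance_ack_rows | Table rows released by client ack operations | - | -
canal_instance_traffic_delay | Delay between the server and the MySQL master | ms | ms
canal_instance_put_delay | Delay of events at store put | ms | ms
canal_instance_get_delay | Delay of events returned by client get requests | ms | ms
canal_instance_ack_delay | Delay of events released by client ack operations | ms | ms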

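Dashboard metrics

Metric | Summary | Multiple metrics
Basic | Basic Canal instance information. |
Network bandwidth | Network bandwidth. Includes inbound (bandwidth used by the Canal server to read binlog) and outbound (bandwidth used by the Canal server to return data to Canal clients). |
Delay | Delay between the Canal server and the master; delays for the store put, get, and ack operations. |
Blocking | Blocking ratio of the sink thread; blocking ratio of the dump thread (parallel mode only). |
TPS(transaction) | TPS at which the Canal instance processes binlog, counted in MySQL transactions. |
TPS(tableRows) | TPS of changed table rows for the store put, get, and ack operations respectively. |
Client requests | Count of Canal client requests to the server, broken down by request type (for example get/ack/sub/rollback). |
Response time | Response time statistics for Canal client requests to the server. |
Empty packets | Count of empty results returned by the server to Canal client requests. |
Store remain events | Number of events backlogged in the Canal instance ringbuffer. |
Store remain mem | Memory used by events backlogged in the Canal instance ringbuffer. |
Client QPS | QPS of requests sent by clients, broken down into GET and CLIENTACK. |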

JVM information

The Java client includes collectors for garbage collection, memory pools, JMX, classloading, and thread counts. These can be added individually or just use the DefaultExports to conveniently register them.

DefaultExports.initialize();

For details, see prometheus/client_java.

Metric details and use cases

Blocking

    clamp_max(rate(canal_instance_sink_blocking_time{destination="example"}[2m]), 1000) / 10

Share of time the sink thread spends blocked (while putting events into the store). If this ratio is high, the store is full most of the time: the client consumes more slowly than the server produces.

    clamp_max(rate(canal_instance_publish_blocking_time{destination="example"}[2m]), 1000) / 10

Share of time the dump thread spends blocked (parallel mode only, while the dump thread publishes events to the disruptor). If this ratio is high:

1. If the sink blocking ratio is also high, the bottleneck is the relatively slow consume speed of the client.

2. If the sink blocking ratio is low, the server-side parser is the bottleneck; see Performance for tuning.
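
The arithmetic behind the two expressions above can be sketched in Python. This is a simplified illustration, not how Prometheus evaluates rate() internally (it ignores counter resets and extrapolation). It assumes the counters are read in milliseconds, as the metrics table states; the function names and the `high` threshold are hypothetical.

```python
def blocking_percent(start_ms, end_ms, window_seconds):
    """Approximate clamp_max(rate(counter[window]), 1000) / 10 from two samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # rate(): milliseconds of blocking accumulated per second of wall time
    rate_ms_per_s = (end_ms - start_ms) / window_seconds
    # clamp_max(..., 1000): at most 1000 ms of blocking fit into one second
    clamped = min(rate_ms_per_s, 1000.0)
    # dividing "ms per 1000 ms" by 10 yields a percentage
    return clamped / 10.0


def diagnose(sink_pct, dump_pct, high=50.0):
    """Apply the decision rules from the text; `high` is an illustrative threshold."""
    if dump_pct >= high:
        if sink_pct >= high:
            return "client consume speed is the bottleneck"
        return "server-side parser is the bottleneck"
    if sink_pct >= high:
        return "store is mostly full; client consumes slower than server produces"
    return "no blocking bottleneck indicated"
```

For example, a counter that grows by 60000 ms over a 120-second window corresponds to the thread being blocked half of the time.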


Delay(seconds)

    canal_instance_traffic_delay{destination="example"} / 1000

Delay between the server and the MySQL master.

    canal_instance_put_delay{destination="example"} / 1000

Delay at the time of the store put operation.

    canal_instance_get_delay{destination="example"} / 1000

Delay at the time of the client get operation.

    canal_instance_ack_delay{destination="example"} / 1000

Delay at the time of the client ack operation.

Note: the accuracy of the delay depends on NTP synchronization between the master and the Canal server. When the binlog execTime is later than the Canal server's current timestamp, the delay is reported as 0.
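
The note above suggests that a delay is the gap between the Canal server's clock and the binlog execTime, floored at zero. A minimal Python sketch under that assumption follows; the function names are hypothetical and the actual computation inside Canal may differ.

```python
def delay_ms(server_now_ms, exec_time_ms):
    """Delay in milliseconds, never negative (assumed formula: now - execTime)."""
    # If clock skew puts execTime ahead of the server clock, report 0
    return max(0, server_now_ms - exec_time_ms)


def delay_seconds(server_now_ms, exec_time_ms):
    """Same value in seconds, matching the '/ 1000' in the dashboard expressions."""
    return delay_ms(server_now_ms, exec_time_ms) / 1000
```

With clocks that drift apart, a master running ahead of the Canal server hides real delay behind a reading of 0, which is why NTP synchronization matters here.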


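Network bandwidth (KB/s)

    rate(canal_instance_received_binlog_bytes{destination="example"}[2m]) / 1024

Bandwidth used by the dump thread to read binlog. If the sink thread and dump thread blocking ratios are both low but the delay is still high, check whether the binlog receive rate is as expected.

    rate(canal_instance_client_bytes{destination="example"}[2m]) / 1024

Bandwidth used to send formatted binlog to the instance client. Under low MySQL load, the empty packets returned by client get requests can still consume considerable bandwidth.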


TPS(MySQL transaction)

    rate(canal_instance_transactions{destination="example"}[2m])

TPS at which the Canal instance processes transactions, counted by TRANSACTION_END events.


TPS(Table row)

    rate(canal_instance_put_rows{destination="example"}[2m])

tableRows TPS of the store put operation.

    rate(canal_instance_get_rows{destination="example"}[2m])

tableRows TPS of the client get operation.

    rate(canal_instance_ack_rows{destination="example"}[2m])

tableRows TPS of the client ack operation.


Client requests

    canal_instance_client_packets{destination="example"}

Client requests handled by the Netty server, broken down by the packetType label.


Empty packets

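    rate(canal_instance_client_empty_batches{destination="example"}[2m])

Number of empty packets returned by client get per second. If this value is large under normal traffic, consider using the connector's timeout mechanism to save resources.

    rate(canal_instance_client_packets{destination="example", packetType="GET"}[2m])

The GET request rate (the 'nonempty' series), shown as a reference for the empty rate.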
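
To judge whether the empty rate is "large", one option is to relate it to the GET rate. The Python sketch below assumes that canal_instance_client_packets with packetType="GET" counts every get request, empty or not; the function name is hypothetical.

```python
def empty_share(empty_per_second, get_per_second):
    """Fraction of GET requests that returned an empty batch, in [0.0, 1.0]."""
    if get_per_second <= 0:
        # No GET traffic in the window: nothing to compare against
        return 0.0
    # Both rates come from separate scrapes, so cap small inconsistencies at 1.0
    return min(1.0, empty_per_second / get_per_second)
```

A share close to 1.0 means almost every get comes back empty, which is the situation where a client-side timeout would save the most round trips.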


Response time

    canal_instance_client_request_latency_bucket{destination="example"}

Histogram of client request response times. For background on histograms, see the linked reference.
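
The _bucket series holds cumulative counts per upper bound, so a latency quantile has to be estimated from them. The Python sketch below mimics the linear interpolation that PromQL's histogram_quantile function applies, in simplified form. The bucket bounds and counts used with it are made up for illustration; the real bucket layout of canal_instance_client_request_latency is not given here.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with math.inf as the last upper bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must have upper bound +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf or count == prev_count:
                # Nothing to interpolate towards: fall back to the lower edge
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets: 50 requests <= 10 ms, 90 <= 50 ms, 100 <= 100 ms
sample = [(10.0, 50), (50.0, 90), (100.0, 100), (math.inf, 100)]
median = estimate_quantile(0.5, sample)
```

The estimate can never be finer than the bucket boundaries allow, so a quantile falling into a wide bucket should be read as approximate.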


Event store usage

    canal_instance_store_produce_seq{destination="example"} - canal_instance_store_consume_seq{destination="example"}

Number of events in the event store that have not been acked yet. How current the value is depends on scrape_interval.


Event store memory usage (KB, memory mode only)

    (canal_instance_store_produce_mem{destination="example"} - canal_instance_store_consume_mem{destination="example"}) / 1024

Memory occupied by events in the event store that have not been acked yet. How current the value is depends on scrape_interval.


Client QPS

    rate(canal_instance_client_packets{destination="example",packetType="GET"}[2m])

QPS of GET requests.

    rate(canal_instance_client_packets{destination="example",packetType="CLIENTACK"}[2m])

QPS of CLIENTACK requests.


Status information

    canal_instance{destination="example"}
    canal_instance_parser_mode{destination="example"}
    canal_instance_store{destination="example"}

Status information is exposed through the labels of these metrics.