3.4 Monitoring Summary Tables

Because a TiDB cluster exposes a very large number of monitoring metrics, a convenient way is needed to pick out the abnormal ones. TiDB 4.0 provides two monitoring summary tables, information_schema.metrics_summary and information_schema.metrics_summary_by_label, which aggregate all monitoring data to make it more efficient to investigate individual metrics. The difference between the two is that information_schema.metrics_summary_by_label additionally breaks the statistics down by label.
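
Before querying these tables, it can help to check which columns they expose. A minimal sketch (output omitted; the columns are the ones that appear in the examples below, such as METRICS_NAME, QUANTILE, AVG_VALUE, and COMMENT, plus INSTANCE and LABEL in the by-label table):

  mysql> desc information_schema.metrics_summary;
  mysql> desc information_schema.metrics_summary_by_label;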

1. Query Example

As an example, consider finding the 3 metrics with the highest average duration in the TiDB cluster within the time range ['2020-03-08 13:23:00', '2020-03-08 13:33:00'). Query the information_schema.metrics_summary table directly, using the /*+ time_range() */ hint to specify the time range:

  mysql> select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */ *
         from information_schema.`METRICS_SUMMARY`
         where metrics_name like 'tidb%duration'
           and avg_value > 0
           and quantile = 0.99
         order by avg_value desc
         limit 3\G
  ***************************[ 1. row ]***************************
  METRICS_NAME | tidb_get_token_duration
  QUANTILE     | 0.99
  SUM_VALUE    | 8.972509
  AVG_VALUE    | 0.996945
  MIN_VALUE    | 0.996515
  MAX_VALUE    | 0.997458
  COMMENT      | The quantile of Duration (us) for getting token, it should be small until concurrency limit is reached(second)
  ***************************[ 2. row ]***************************
  METRICS_NAME | tidb_query_duration
  QUANTILE     | 0.99
  SUM_VALUE    | 0.269079
  AVG_VALUE    | 0.007272
  MIN_VALUE    | 0.000667
  MAX_VALUE    | 0.01554
  COMMENT      | The quantile of TiDB query durations(second)
  ***************************[ 3. row ]***************************
  METRICS_NAME | tidb_kv_request_duration
  QUANTILE     | 0.99
  SUM_VALUE    | 0.170232
  AVG_VALUE    | 0.004601
  MIN_VALUE    | 0.000975
  MAX_VALUE    | 0.013
  COMMENT      | The quantile of kv requests durations by store

Note: the COMMENT column of tidb_get_token_duration states that the value's unit is microseconds (us).
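
Because QUANTILE is an ordinary column, the same summary can also be pulled at several percentiles to inspect tail latency. A sketch, assuming the underlying Prometheus data is available at these quantiles:

  mysql> select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */
         metrics_name, quantile, avg_value, max_value
         from information_schema.metrics_summary
         where metrics_name = 'tidb_query_duration'
           and quantile in (0.90, 0.99, 0.999);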

Similarly, querying the metrics_summary_by_label summary table gives the following result:

  mysql> select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */ *
         from information_schema.`METRICS_SUMMARY_BY_LABEL`
         where metrics_name like 'tidb%duration'
           and avg_value > 0
           and quantile = 0.99
         order by avg_value desc
         limit 10\G
  ***************************[ 1. row ]***************************
  INSTANCE     | 172.16.5.40:10089
  METRICS_NAME | tidb_get_token_duration
  LABEL        |
  QUANTILE     | 0.99
  SUM_VALUE    | 8.972509
  AVG_VALUE    | 0.996945
  MIN_VALUE    | 0.996515
  MAX_VALUE    | 0.997458
  COMMENT      | The quantile of Duration (us) for getting token, it should be small until concurrency limit is reached(second)
  ***************************[ 2. row ]***************************
  INSTANCE     | 172.16.5.40:10089
  METRICS_NAME | tidb_query_duration
  LABEL        | Select
  QUANTILE     | 0.99
  SUM_VALUE    | 0.072083
  AVG_VALUE    | 0.008009
  MIN_VALUE    | 0.007905
  MAX_VALUE    | 0.008241
  COMMENT      | The quantile of TiDB query durations(second)
  ***************************[ 3. row ]***************************
  INSTANCE     | 172.16.5.40:10089
  METRICS_NAME | tidb_query_duration
  LABEL        | Rollback
  QUANTILE     | 0.99
  SUM_VALUE    | 0.072083
  AVG_VALUE    | 0.008009
  MIN_VALUE    | 0.007905
  MAX_VALUE    | 0.008241
  COMMENT      | The quantile of TiDB query durations(second)

As mentioned earlier, compared with metrics_summary, the metrics_summary_by_label table has an extra LABEL column. Taking rows 2 and 3 of the query result above as an example: they show that within tidb_query_duration, the average durations of Select and Rollback statements are both very high.
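
Once a suspicious label is spotted, the LABEL column can be filtered directly to drill into a single statement type. A sketch reusing the time range from the example above:

  mysql> select /*+ time_range('2020-03-08 13:23:00','2020-03-08 13:33:00') */ *
         from information_schema.metrics_summary_by_label
         where metrics_name = 'tidb_query_duration'
           and label = 'Select'
           and quantile = 0.99\G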

2. Recommended Usage

Beyond the example above, the summary tables can compare the full-link monitoring data of two time ranges, quickly finding the modules whose metrics changed the most and locating the bottleneck. The following compares all metrics across two time ranges (where t1 is the baseline) and sorts them by the largest difference:

  • Time range t1: ("2020-03-03 17:08:00", "2020-03-03 17:11:00")
  • Time range t2: ("2020-03-03 17:18:00", "2020-03-03 17:21:00")

Join the metrics of the two time ranges on METRICS_NAME and sort by the difference, using the /*+ time_range() */ hint to specify the query time range of each subquery.

  1. Query the 10 metrics with the largest t1.avg_value / t2.avg_value ratio:
  mysql> SELECT
             t1.avg_value / t2.avg_value AS ratio,
             t1.metrics_name,
             t1.avg_value,
             t2.avg_value,
             t2.comment
         FROM
             (
                 SELECT /*+ time_range("2020-03-03 17:08:00", "2020-03-03 17:11:00")*/ *
                 FROM information_schema.metrics_summary
             ) t1
             JOIN
             (
                 SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00")*/ *
                 FROM information_schema.metrics_summary
             ) t2
             ON t1.metrics_name = t2.metrics_name
         ORDER BY ratio DESC
         LIMIT 10;
  +----------------+-----------------------------------+-------------------+-------------------+--------------------------------------------------------------------------+
  | ratio          | metrics_name                      | avg_value         | avg_value         | comment                                                                  |
  +----------------+-----------------------------------+-------------------+-------------------+--------------------------------------------------------------------------+
  | 17.6439787379  | tikv_region_average_written_bytes | 30816827.0953     | 1746591.71568     | The average rate of writing bytes to Regions per TiKV instance           |
  | 8.88407551364  | tikv_region_average_written_keys  | 108557.034612     | 12219.283193      | The average rate of written keys to Regions per TiKV instance            |
  | 6.4105335594   | tidb_kv_write_num                 | 4493.293654       | 700.923505        | The quantile of kv write times per transaction execution                 |
  | 2.99993333333  | tidb_gc_total_count               | 1.0               | 0.333341          | The total count of kv storage garbage collection time durations          |
  | 2.66412165823  | tikv_engine_avg_seek_duration     | 6569.879007       | 2466.05818        | The time consumed when executing seek operation, the unit is microsecond |
  | 2.66412165823  | tikv_engine_max_seek_duration     | 6569.879007       | 2466.05818        | The time consumed when executing seek operation, the unit is microsecond |
  | 2.49994444321  | tikv_region_change                | -0.277778         | -0.111114         | The count of region change per TiKV instance                             |
  | 2.16063829787  | etcd_wal_fsync_duration           | 0.002539          | 0.001175          | The quantile time consumed of writing WAL into the persistent storage    |
  | 2.06089264604  | node_memory_free                  | 4541448192.0      | 2203631616.0      |                                                                          |
  | 1.96749064186  | tidb_kv_write_size                | 514489.28         | 261495.159902     | The quantile of kv write size per transaction execution                  |
  +----------------+-----------------------------------+-------------------+-------------------+--------------------------------------------------------------------------+

The query result shows:

  • During t1, tikv_region_average_written_bytes (the average number of bytes written per Region) was 17.6 times higher than during t2
  • During t1, tikv_region_average_written_keys (the average number of keys written per Region) was 8.8 times higher than during t2
  • During t1, tidb_kv_write_size (the KV size written per TiDB transaction) was 1.96 times higher than during t2

From these results it is easy to see that the write load during t1 was much heavier than during t2.
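
Instead of a fixed LIMIT, the same join can filter on the ratio itself, listing every metric that changed by more than a chosen factor. A sketch using a hypothetical threshold of 2x:

  mysql> SELECT
             t1.avg_value / t2.avg_value AS ratio,
             t1.metrics_name,
             t1.avg_value,
             t2.avg_value
         FROM
             (SELECT /*+ time_range("2020-03-03 17:08:00", "2020-03-03 17:11:00")*/ *
              FROM information_schema.metrics_summary) t1
             JOIN
             (SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00")*/ *
              FROM information_schema.metrics_summary) t2
             ON t1.metrics_name = t2.metrics_name
         WHERE t1.avg_value / t2.avg_value > 2
         ORDER BY ratio DESC;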

  2. Conversely, query the 10 metrics with the largest t2.avg_value / t1.avg_value ratio:
  mysql> SELECT
             t2.avg_value / t1.avg_value AS ratio,
             t1.metrics_name,
             t1.avg_value,
             t2.avg_value,
             t2.comment
         FROM
             (
                 SELECT /*+ time_range("2020-03-03 17:08:00", "2020-03-03 17:11:00")*/ *
                 FROM information_schema.metrics_summary
             ) t1
             JOIN
             (
                 SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00")*/ *
                 FROM information_schema.metrics_summary
             ) t2
             ON t1.metrics_name = t2.metrics_name
         ORDER BY ratio DESC
         LIMIT 10;
  +----------------+-----------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+
  | ratio          | metrics_name                            | avg_value      | avg_value        | comment                                                                                     |
  +----------------+-----------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+
  | 5865.59537065  | tidb_slow_query_cop_process_total_time  | 0.016333       | 95.804724        | The total time of TiDB slow query statistics with slow query total cop process time(second) |
  | 3648.74109023  | tidb_distsql_partial_scan_key_total_num | 10865.666667   | 39646004.4394    | The total num of distsql partial scan key numbers                                           |
  | 267.002351165  | tidb_slow_query_cop_wait_total_time     | 0.003333       | 0.890008         | The total time of TiDB slow query statistics with slow query total cop wait time(second)    |
  | 192.43267836   | tikv_cop_total_response_total_size      | 2515333.66667  | 484032394.445    |                                                                                             |
  | 192.43267836   | tikv_cop_total_response_size            | 41922.227778   | 8067206.57408    |                                                                                             |
  | 152.780296296  | tidb_distsql_scan_key_total_num         | 5304.333333    | 810397.618317    | The total num of distsql scan numbers                                                       |
  | 126.042290167  | tidb_distsql_execution_total_time       | 0.421622       | 53.142143        | The total time of distsql execution(second)                                                 |
  | 105.164020657  | tikv_cop_scan_details                   | 134.450733     | 14139.379665     |                                                                                             |
  | 105.164020657  | tikv_cop_scan_details_total             | 8067.043981    | 848362.77991     |                                                                                             |
  | 101.635495394  | tikv_cop_total_kv_cursor_operations     | 1070.875       | 108838.91113     |                                                                                             |
  +----------------+-----------------------------------------+----------------+------------------+---------------------------------------------------------------------------------------------+

The query result above shows:

  • During t2, tidb_slow_query_cop_process_total_time (the total cop process time recorded for TiDB slow queries) was 5865 times higher than during t1
  • During t2, tidb_distsql_partial_scan_key_total_num (the number of keys scanned by TiDB distsql requests) was 3648 times higher than during t1
  • During t2, tikv_cop_total_response_size (the size of TiKV cop request responses) was 192 times higher than during t1
  • During t2, tikv_cop_scan_details (the scan details of TiKV cop requests) was 105 times higher than during t1

Comparing these two time ranges gives a rough picture of the cluster's workload in each. There were far more cop requests during t2 than during t1, which overloaded the TiKV Coprocessor and caused cop tasks to queue; a reasonable guess is that some large queries, or a more query-heavy workload, appeared during t2.

In fact, a go-ycsb stress test was running throughout the whole t1 ~ t2 period, and 20 TPC-H queries were run during t2, so it was the large TPC-H queries that generated the large number of cop requests.
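
As a follow-up, metrics_summary_by_label can attribute such a change to specific instances or statement types. A sketch over the t2 window, again sorting by average duration:

  mysql> select /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00") */
         instance, label, avg_value
         from information_schema.metrics_summary_by_label
         where metrics_name = 'tidb_query_duration'
           and quantile = 0.99
         order by avg_value desc;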