cAdvisor Exporter

grafana-agent embeds cAdvisor, so it can collect container metrics directly. cAdvisor does need certain permissions on the host, however; see the cAdvisor docs for details.

Configure and enable cadvisor_exporter

Generate the grafana-agent-cfg.yaml configuration file with the cadvisor integration enabled. For example:

  cat <<EOF > /tmp/grafana-agent-cfg.yaml
  server:
    log_level: info
    http_listen_port: 12345
  metrics:
    global:
      scrape_interval: 15s
      remote_write:
        - url: 'https://n9e-server:19000/prometheus/v1/write'
          basic_auth:
            username: ${FC_USERNAME}
            password: ${FC_PASSWORD}
  integrations:
    cadvisor:
      enabled: true
  EOF

Start grafana-agent in Docker and mount the required paths:

  docker run \
    -v /tmp/agent:/etc/agent/data \
    -v /tmp/grafana-agent-cfg.yaml:/etc/agent/agent.yaml \
    -p 12345:12345 \
    -d \
    --privileged \
    grafana/agent:v0.23.0 \
    --config.file=/etc/agent/agent.yaml \
    --metrics.wal-directory=/etc/agent/data

Run curl http://localhost:12345/agent/api/v1/targets | jq. The output should include a target whose target_group is integrations/cadvisor, for example:

  {
    "status": "success",
    "data": [
      {
        "instance": "7f383657f506f53a739e2df61be58891",
        "target_group": "integrations/cadvisor",
        "endpoint": "http://127.0.0.1:12345/integrations/cadvisor/metrics",
        "state": "up",
        "labels": {
          "agent_hostname": "509c1284c59c",
          "instance": "509c1284c59c:12345",
          "job": "integrations/cadvisor"
        },
        "discovered_labels": {
          "__address__": "127.0.0.1:12345",
          "__metrics_path__": "/integrations/cadvisor/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "15s",
          "__scrape_timeout__": "10s",
          "agent_hostname": "509c1284c59c",
          "job": "integrations/cadvisor"
        },
        "last_scrape": "2022-02-17T14:54:50.652267586Z",
        "scrape_duration_ms": 30,
        "scrape_error": ""
      }
    ]
  }
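
The same check can be scripted instead of read by eye. The following is a minimal sketch, assuming the response has the shape shown above; the function name cadvisor_target_states and the trimmed sample payload are illustrative and not part of grafana-agent.

```python
import json


def cadvisor_target_states(targets_json):
    """Return the state of every target in the integrations/cadvisor group."""
    payload = json.loads(targets_json)
    if payload.get("status") != "success":
        raise ValueError("unexpected status: %r" % payload.get("status"))
    return [
        target["state"]
        for target in payload.get("data", [])
        if target.get("target_group") == "integrations/cadvisor"
    ]


# Trimmed copy of the response shown above (hypothetical test input).
sample = """{
  "status": "success",
  "data": [
    {
      "target_group": "integrations/cadvisor",
      "endpoint": "http://127.0.0.1:12345/integrations/cadvisor/metrics",
      "state": "up",
      "scrape_error": ""
    }
  ]
}"""

print(cadvisor_target_states(sample))  # ['up']
```

An empty list would mean the integration is not registered as a scrape target, which usually points back to the integrations block of the configuration file.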

Run curl http://localhost:12345/integrations/cadvisor/metrics. The expected output looks like this:

  cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="",kernelVersion="5.10.76-linuxkit",osVersion="Debian GNU/Linux 10 (buster)"} 1
  container_blkio_device_usage_total{device="/dev/vda",id="/",major="254",minor="0",operation="Read"} 4.6509056e+07 1645109878135
  container_blkio_device_usage_total{device="/dev/vda",id="/",major="254",minor="0",operation="Write"} 3.13243648e+09 1645109878135
  container_cpu_load_average_10s{id="/"} 0 1645109878135
  container_cpu_system_seconds_total{id="/"} 57.789 1645109878135
  container_cpu_usage_seconds_total{cpu="total",id="/"} 91.57 1645109878135
  container_cpu_user_seconds_total{id="/"} 33.781 1645109878135
  container_fs_inodes_free{device="/dev",id="/"} 254415 1645109878135
  container_fs_inodes_free{device="/dev/shm",id="/"} 254551 1645109878135
  container_fs_inodes_free{device="/dev/vda1",id="/"} 3.890602e+06 1645109878135
  container_fs_inodes_free{device="/rootfs/dev/shm",id="/"} 254551 1645109878135
  ...
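
Each line of this output is a metric name, an optional label set in braces, a value, and an optional millisecond timestamp. The sketch below splits one such line into its parts. It is a deliberately simplified parser written for illustration (parse_sample is a hypothetical helper, escape sequences inside label values are left as-is, and comment lines are not handled); for real use, a full parser for the Prometheus text format is the safer choice.

```python
import re

# name="value" pairs; the value may contain backslash-escaped characters.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels, value)."""
    if "{" in line:
        name, rest = line.split("{", 1)
        # The label set ends at the last closing brace on the line.
        label_text, tail = rest.rsplit("}", 1)
        labels = dict(_LABEL_RE.findall(label_text))
    else:
        name, _, tail = line.partition(" ")
        labels = {}
    value = float(tail.split()[0])  # a trailing timestamp, if any, is ignored
    return name, labels, value


line = 'container_cpu_usage_seconds_total{cpu="total",id="/"} 91.57 1645109878135'
name, labels, value = parse_sample(line)
print(name, labels, value)
```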

Collected metrics

  # CPU
  # Total number of CFS periods that elapsed while the container was running
  container_cpu_cfs_periods_total: Number of elapsed enforcement period intervals
  # Total number of CFS periods in which the container was throttled
  container_cpu_cfs_throttled_periods_total: Number of throttled period intervals
  # Total time the container's CPU was throttled
  container_cpu_cfs_throttled_seconds_total: Total time duration the container has been throttled
  container_cpu_load_average_10s: Value of container cpu load average over the last 10 seconds
  container_cpu_system_seconds_total: Cumulative system cpu time consumed
  container_cpu_usage_seconds_total: Cumulative cpu time consumed
  container_cpu_user_seconds_total: Cumulative user cpu time consumed
  # CPU period configured in the container spec
  container_spec_cpu_period: CPU period of the container
  # CPU quota configured in the container spec
  container_spec_cpu_quota: CPU quota of the container
  # CPU shares (weight) configured in the container spec
  container_spec_cpu_shares: CPU share of the container
  # MEM
  container_memory_cache: Total page cache memory
  container_memory_failcnt: Number of memory usage hits limits
  container_memory_failures_total: Cumulative count of memory allocation failures
  container_memory_mapped_file: Size of memory mapped files
  container_memory_max_usage_bytes: Maximum memory usage recorded
  container_memory_rss: Size of RSS
  container_memory_swap: Container swap usage
  container_memory_usage_bytes: Current memory usage, including all memory regardless of when it was accessed
  container_oom_events_total: Count of out of memory events observed for the container
  container_spec_memory_limit_bytes: Memory limit for the container
  container_spec_memory_reservation_limit_bytes: Memory reservation limit for the container
  container_spec_memory_swap_limit_bytes: Memory swap limit for the container
  # Disk
  # Total device I/O usage
  container_blkio_device_usage_total: Blkio device bytes usage
  container_fs_inodes_free: Number of available Inodes
  container_fs_inodes_total: Total number of Inodes
  container_fs_io_current: Number of I/Os currently in progress
  # Total time the container spent doing I/O
  container_fs_io_time_seconds_total: Cumulative count of seconds spent doing I/Os
  container_fs_io_time_weighted_seconds_total: Cumulative weighted I/O time
  container_fs_limit_bytes: Number of bytes that can be consumed by the container on this filesystem
  container_fs_reads_bytes_total: Cumulative count of bytes read
  container_fs_read_seconds_total: Cumulative count of seconds spent reading
  container_fs_reads_merged_total: Cumulative count of reads merged
  container_fs_reads_total: Cumulative count of reads completed
  container_fs_sector_reads_total: Cumulative count of sector reads completed
  container_fs_sector_writes_total: Cumulative count of sector writes completed
  container_fs_usage_bytes: Number of bytes that are consumed by the container on this filesystem
  container_fs_writes_bytes_total: Cumulative count of bytes written
  container_fs_write_seconds_total: Cumulative count of seconds spent writing
  container_fs_writes_merged_total: Cumulative count of writes merged
  container_fs_writes_total: Cumulative count of writes completed
  # Network
  container_network_receive_bytes_total: Cumulative count of bytes received
  container_network_receive_errors_total: Cumulative count of errors encountered while receiving
  container_network_receive_packets_dropped_total: Cumulative count of packets dropped while receiving
  container_network_receive_packets_total: Cumulative count of packets received
  container_network_transmit_bytes_total: Cumulative count of bytes transmitted
  container_network_transmit_errors_total: Cumulative count of errors encountered while transmitting
  container_network_transmit_packets_dropped_total: Cumulative count of packets dropped while transmitting
  container_network_transmit_packets_total: Cumulative count of packets transmitted
  # System
  container_tasks_state: Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting)
  # Others
  container_last_seen: Last time a container was seen by the exporter
  container_start_time_seconds: Start time of the container since unix epoch
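
Most of these metrics are cumulative counters, so they are usually read as the difference between two scrapes and not as absolute values. As one illustration, the two CFS counters above can be combined into a throttling ratio. This is a sketch only: throttle_ratio is a hypothetical helper and the counter values in the example are made up.

```python
def throttle_ratio(periods_before, throttled_before, periods_after, throttled_after):
    """Fraction of CFS periods that were throttled between two scrapes.

    periods_*   : samples of container_cpu_cfs_periods_total
    throttled_* : samples of container_cpu_cfs_throttled_periods_total
    """
    elapsed = periods_after - periods_before
    throttled = throttled_after - throttled_before
    if elapsed <= 0 or throttled < 0:
        # No new periods, or a counter was reset between the two scrapes.
        return 0.0
    return throttled / elapsed


# Made-up values: 600 periods elapsed, 150 of them throttled.
print(throttle_ratio(1000, 50, 1600, 200))  # 0.25
```

A ratio that stays well above zero suggests the container regularly hits its CPU quota (container_spec_cpu_quota relative to container_spec_cpu_period).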

Full configuration reference

  # Enables the cadvisor integration, allowing the Agent to automatically
  # collect container metrics.
  [enabled: <boolean> | default = false]

  # Sets an explicit value for the instance label when the integration is
  # self-scraped. Overrides inferred values.
  [instance: <string> | default = <integrations_config.instance>]

  # Automatically collect metrics from this integration. If disabled,
  # the cadvisor integration will be run but not scraped and thus not
  # remote-written. Metrics for the integration will be exposed at
  # /integrations/cadvisor/metrics and can be scraped by an external
  # process.
  [scrape_integration: <boolean> | default = <integrations_config.scrape_integrations>]

  # How often should the metrics be collected? Defaults to
  # metrics.global.scrape_interval.
  [scrape_interval: <duration> | default = <global_config.scrape_interval>]

  # The timeout before considering the scrape a failure. Defaults to
  # metrics.global.scrape_timeout.
  [scrape_timeout: <duration> | default = <global_config.scrape_timeout>]

  # Allows for relabeling labels on the target.
  relabel_configs:
    [ - <relabel_config> ... ]

  # Relabel metrics coming from the integration, allowing to drop series
  # from the integration that you don't care about.
  metric_relabel_configs:
    [ - <relabel_config> ... ]

  # How frequently to truncate the WAL for this integration.
  [wal_truncate_frequency: <duration> | default = "60m"]

  #
  # cAdvisor-specific configuration options
  #

  # Convert container labels and environment variables into labels on
  # prometheus metrics for each container. If false, then only metrics
  # exported are container name, first alias, and image name.
  [store_container_labels: <boolean> | default = true]

  # List of container labels to be converted to labels on prometheus metrics
  # for each container. store_container_labels must be set to false for this
  # to take effect.
  allowlisted_container_labels:
    [ - <string> ]

  # List of environment variable keys matched with specified prefix that need
  # to be collected for containers. Only the containerd and docker runtimes
  # are supported for now.
  env_metadata_allowlist:
    [ - <string> ]

  # List of cgroup path prefixes that need to be collected even when
  # docker_only is specified.
  raw_cgroup_prefix_allowlist:
    [ - <string> ]

  # Path to a JSON file containing configuration of perf events to measure.
  # An empty value disables perf events measuring.
  [perf_events_config: <string>]

  # resctrl mon groups updating interval. Zero value disables updating mon groups.
  [resctrl_interval: <int> | default = 0]

  # List of `metrics` to be disabled. If set, overrides the default disabled metrics.
  disabled_metrics:
    [ - <string> ]

  # List of `metrics` to be enabled. If set, overrides disabled_metrics.
  enabled_metrics:
    [ - <string> ]

  # Length of time to keep data stored in memory
  [storage_duration: <duration> | default = "2m"]

  # Containerd endpoint
  [containerd: <string> | default = "/run/containerd/containerd.sock"]

  # Containerd namespace
  [containerd_namespace: <string> | default = "k8s.io"]

  # Docker endpoint
  [docker: <string> | default = "unix:///var/run/docker.sock"]

  # Use TLS to connect to docker
  [docker_tls: <boolean> | default = false]

  # Path to client certificate for TLS connection to docker
  [docker_tls_cert: <string> | default = "cert.pem"]

  # Path to private key for TLS connection to docker
  [docker_tls_key: <string> | default = "key.pem"]

  # Path to a trusted CA for TLS connection to docker
  [docker_tls_ca: <string> | default = "ca.pem"]

  # Only report docker containers in addition to root stats
  [docker_only: <boolean> | default = false]