cAdvisor Exporter

grafana-agent embeds cAdvisor, so it can collect container metrics directly. cAdvisor does need certain permissions on the host, however; see the cAdvisor docs for details.

Configure and enable cadvisor_exporter

Generate the grafana-agent-cfg.yaml configuration file with the cadvisor integration enabled. For example:

cat <<EOF > /tmp/grafana-agent-cfg.yaml
server:
  log_level: info
  http_listen_port: 12345
metrics:
  global:
    scrape_interval: 15s
    remote_write:
      - url: 'https://n9e-server:19000/prometheus/v1/write'
        basic_auth:
          username: ${FC_USERNAME}
          password: ${FC_PASSWORD}
integrations:
  cadvisor:
    enabled: true
EOF
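Because the heredoc delimiter above is unquoted, the shell substitutes ${FC_USERNAME} and ${FC_PASSWORD} when the file is written, so an unset variable ends up as an empty credential in the file. A minimal guard could look like the sketch below; `require_env` is a hypothetical helper name, not part of grafana-agent.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage: run this before the `cat <<EOF` step.
# require_env FC_USERNAME FC_PASSWORD || exit 1
```

With the guard in place, the generation step stops before writing a config file that contains blank credentials.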

Start grafana-agent in Docker, mounting the required directories:

docker run \
  -v /tmp/agent:/etc/agent/data \
  -v /tmp/grafana-agent-cfg.yaml:/etc/agent/agent.yaml \
  -p 12345:12345 \
  -d \
  --privileged \
  grafana/agent:v0.23.0 \
  --config.file=/etc/agent/agent.yaml \
  --metrics.wal-directory=/etc/agent/data
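The command above mounts only the agent's data directory and config file. cAdvisor's own Docker instructions also mount several host paths read-only. Whether the embedded integration needs all of them is an assumption to verify for your environment. The sketch below prints such a command for review instead of running it; `cadvisor_mounts` and `print_agent_command` are hypothetical helper names.

```shell
# Host paths taken from cAdvisor's own Docker instructions.
# Treat the list as an assumption to adapt, not as an agent requirement.
cadvisor_mounts() {
  printf '%s\n' \
    '/:/rootfs:ro' \
    '/var/run:/var/run:ro' \
    '/sys:/sys:ro' \
    '/var/lib/docker/:/var/lib/docker:ro' \
    '/dev/disk/:/dev/disk:ro'
}

# Print the full docker command (dry run) so it can be reviewed first.
print_agent_command() {
  printf 'docker run -d --privileged -p 12345:12345'
  printf ' -v /tmp/agent:/etc/agent/data'
  printf ' -v /tmp/grafana-agent-cfg.yaml:/etc/agent/agent.yaml'
  cadvisor_mounts | while IFS= read -r mount; do
    printf ' -v %s' "$mount"
  done
  printf ' grafana/agent:v0.23.0'
  printf ' --config.file=/etc/agent/agent.yaml'
  printf ' --metrics.wal-directory=/etc/agent/data\n'
}

print_agent_command
```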

Run curl http://localhost:12345/agent/api/v1/targets | jq. The output should include an entry whose target_group is integrations/cadvisor, for example:

{
  "status": "success",
  "data": [
    {
      "instance": "7f383657f506f53a739e2df61be58891",
      "target_group": "integrations/cadvisor",
      "endpoint": "http://127.0.0.1:12345/integrations/cadvisor/metrics",
      "state": "up",
      "labels": {
        "agent_hostname": "509c1284c59c",
        "instance": "509c1284c59c:12345",
        "job": "integrations/cadvisor"
      },
      "discovered_labels": {
        "__address__": "127.0.0.1:12345",
        "__metrics_path__": "/integrations/cadvisor/metrics",
        "__scheme__": "http",
        "__scrape_interval__": "15s",
        "__scrape_timeout__": "10s",
        "agent_hostname": "509c1284c59c",
        "job": "integrations/cadvisor"
      },
      "last_scrape": "2022-02-17T14:54:50.652267586Z",
      "scrape_duration_ms": 30,
      "scrape_error": ""
    }
  ]
}
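Rather than reading the JSON by eye, the check can be scripted. The sketch below assumes jq is installed and that the response has the shape shown above; `cadvisor_target_state` is a hypothetical helper name.

```shell
# Read the targets API response on stdin and print the state of the
# integrations/cadvisor target ("up" in the sample response above).
cadvisor_target_state() {
  jq -r '.data[] | select(.target_group == "integrations/cadvisor") | .state'
}

# Usage against a running agent (same URL as above):
# curl -s http://localhost:12345/agent/api/v1/targets | cadvisor_target_state
```

An empty result means no cadvisor target was registered, which usually points at the integration not being enabled in the config file.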

Run curl http://localhost:12345/integrations/cadvisor/metrics. The expected output looks like this:

cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="",kernelVersion="5.10.76-linuxkit",osVersion="Debian GNU/Linux 10 (buster)"} 1
container_blkio_device_usage_total{device="/dev/vda",id="/",major="254",minor="0",operation="Read"} 4.6509056e+07 1645109878135
container_blkio_device_usage_total{device="/dev/vda",id="/",major="254",minor="0",operation="Write"} 3.13243648e+09 1645109878135
container_cpu_load_average_10s{id="/"} 0 1645109878135
container_cpu_system_seconds_total{id="/"} 57.789 1645109878135
container_cpu_usage_seconds_total{cpu="total",id="/"} 91.57 1645109878135
container_cpu_user_seconds_total{id="/"} 33.781 1645109878135
container_fs_inodes_free{device="/dev",id="/"} 254415 1645109878135
container_fs_inodes_free{device="/dev/shm",id="/"} 254551 1645109878135
container_fs_inodes_free{device="/dev/vda1",id="/"} 3.890602e+06 1645109878135
container_fs_inodes_free{device="/rootfs/dev/shm",id="/"} 254551 1645109878135
...
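To pull a single metric out of this output, a small awk filter is enough. This is a sketch that assumes label values never contain "}", which holds for the sample above; `metric_samples` is a hypothetical helper name.

```shell
# Print "<labels> <value>" for every sample of one metric read from stdin.
# Assumption: label values never contain a closing brace.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                                  # skip HELP and TYPE comments
    {
      brace = index($0, "{")
      metric = (brace > 0) ? substr($0, 1, brace - 1) : $1
      if (metric != name) next
      labels = ""
      if (brace > 0) {
        close_at = index($0, "}")
        labels = substr($0, brace, close_at - brace + 1)
        rest = substr($0, close_at + 1)
      } else {
        rest = substr($0, length($1) + 1)
      }
      split(rest, fields, " ")                     # fields[1] is the value, fields[2] the timestamp
      if (labels == "") print fields[1]
      else print labels, fields[1]
    }
  '
}

# Usage against a running agent:
# curl -s http://localhost:12345/integrations/cadvisor/metrics | metric_samples container_cpu_usage_seconds_total
```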

Collected metrics

# CPU
# Total number of CFS periods elapsed while the container was running
container_cpu_cfs_periods_total: Number of elapsed enforcement period intervals
# Total number of CFS periods in which the container was throttled
container_cpu_cfs_throttled_periods_total: Number of throttled period intervals
# Total time the container spent CPU-throttled
container_cpu_cfs_throttled_seconds_total: Total time duration the container has been throttled
container_cpu_load_average_10s: Value of container cpu load average over the last 10 seconds
container_cpu_system_seconds_total: Cumulative system cpu time consumed
container_cpu_usage_seconds_total: Cumulative cpu time consumed
container_cpu_user_seconds_total: Cumulative user cpu time consumed
# CPU period configured in the container spec
container_spec_cpu_period: CPU period of the container
# CPU quota configured in the container spec
container_spec_cpu_quota: CPU quota of the container
# CPU shares (weight) configured in the container spec
container_spec_cpu_shares: CPU share of the container
# MEM
container_memory_cache: Total page cache memory
container_memory_failcnt: Number of memory usage hits limits
container_memory_failures_total: Cumulative count of memory allocation failures
container_memory_mapped_file: Size of memory mapped files
container_memory_max_usage_bytes: Maximum memory usage recorded
container_memory_rss: Size of RSS
container_memory_swap: Container swap usage
container_memory_usage_bytes: Current memory usage, including all memory regardless of when it was accessed
container_oom_events_total: Count of out of memory events observed for the container
container_spec_memory_limit_bytes: Memory limit for the container
container_spec_memory_reservation_limit_bytes: Memory reservation limit for the container
container_spec_memory_swap_limit_bytes: Memory swap limit for the container
# Disk
# Total block device I/O usage
container_blkio_device_usage_total: Blkio device bytes usage
container_fs_inodes_free: Number of available Inodes
container_fs_inodes_total: Total number of Inodes
container_fs_io_current: Number of I/Os currently in progress
# Total time the container spent doing I/O
container_fs_io_time_seconds_total: Cumulative count of seconds spent doing I/Os
container_fs_io_time_weighted_seconds_total: Cumulative weighted I/O time
container_fs_limit_bytes: Number of bytes that can be consumed by the container on this filesystem
container_fs_reads_bytes_total: Cumulative count of bytes read
container_fs_read_seconds_total: Cumulative count of seconds spent reading
container_fs_reads_merged_total: Cumulative count of reads merged
container_fs_reads_total: Cumulative count of reads completed
container_fs_sector_reads_total: Cumulative count of sector reads completed
container_fs_sector_writes_total: Cumulative count of sector writes completed
container_fs_usage_bytes: Number of bytes that are consumed by the container on this filesystem
container_fs_writes_bytes_total: Cumulative count of bytes written
container_fs_write_seconds_total: Cumulative count of seconds spent writing
container_fs_writes_merged_total: Cumulative count of writes merged
container_fs_writes_total: Cumulative count of writes completed
# Network
container_network_receive_bytes_total: Cumulative count of bytes received
container_network_receive_errors_total: Cumulative count of errors encountered while receiving
container_network_receive_packets_dropped_total: Cumulative count of packets dropped while receiving
container_network_receive_packets_total: Cumulative count of packets received
container_network_transmit_bytes_total: Cumulative count of bytes transmitted
container_network_transmit_errors_total: Cumulative count of errors encountered while transmitting
container_network_transmit_packets_dropped_total: Cumulative count of packets dropped while transmitting
container_network_transmit_packets_total: Cumulative count of packets transmitted
# System
container_tasks_state: Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting)
# Others
container_last_seen: Last time a container was seen by the exporter
container_start_time_seconds: Start time of the container since unix epoch

Full configuration reference

  # Enables the cadvisor integration, allowing the Agent to automatically
  # collect container metrics.
  [enabled: <boolean> | default = false]
  # Sets an explicit value for the instance label when the integration is
  # self-scraped. Overrides inferred values.
  [instance: <string> | default = <integrations_config.instance>]
  # Automatically collect metrics from this integration. If disabled,
  # the cadvisor integration will be run but not scraped and thus not
  # remote-written. Metrics for the integration will be exposed at
  # /integrations/cadvisor/metrics and can be scraped by an external
  # process.
  [scrape_integration: <boolean> | default = <integrations_config.scrape_integrations>]
  # How often should the metrics be collected? Defaults to
  # metrics.global.scrape_interval.
  [scrape_interval: <duration> | default = <global_config.scrape_interval>]
  # The timeout before considering the scrape a failure. Defaults to
  # metrics.global.scrape_timeout.
  [scrape_timeout: <duration> | default = <global_config.scrape_timeout>]
  # Allows for relabeling labels on the target.
  relabel_configs:
    [- <relabel_config> ... ]
  # Relabel metrics coming from the integration, allowing to drop series
  # from the integration that you don't care about.
  metric_relabel_configs:
    [ - <relabel_config> ... ]
  # How frequently to truncate the WAL for this integration.
  [wal_truncate_frequency: <duration> | default = "60m"]
  #
  # cAdvisor-specific configuration options
  #
  # Convert container labels and environment variables into labels on Prometheus metrics for each container. If false, the only labels exported are container name, first alias, and image name.
  [store_container_labels: <boolean> | default = true]
  # List of container labels to be converted to labels on prometheus metrics for each container. store_container_labels must be set to false for this to take effect.
  allowlisted_container_labels:
    [ - <string> ]
  # List of environment variable key prefixes; matching variables are collected for containers. Only the containerd and docker runtimes are supported for now.
  env_metadata_allowlist:
    [ - <string> ]
  # List of cgroup path prefixes to collect even when docker_only is set.
  raw_cgroup_prefix_allowlist:
    [ - <string> ]
  # Path to a JSON file containing configuration of perf events to measure. An empty value disables perf events measuring.
  [perf_events_config: <string>]
  # resctrl mon groups updating interval. Zero value disables updating mon groups.
  [resctrl_interval: <int> | default = 0]
  # List of `metrics` to be disabled. If set, overrides the default disabled metrics.
  disabled_metrics:
    [ - <string> ]
  # List of `metrics` to be enabled. If set, overrides disabled_metrics
  enabled_metrics:
    [ - <string> ]
  # Length of time to keep data stored in memory
  [storage_duration: <duration> | default = "2m"]
  # Containerd endpoint
  [containerd: <string> | default = "/run/containerd/containerd.sock"]
  # Containerd namespace
  [containerd_namespace: <string> | default = "k8s.io"]
  # Docker endpoint
  [docker: <string> | default = "unix:///var/run/docker.sock"]
  # Use TLS to connect to docker
  [docker_tls: <boolean> | default = false]
  # Path to client certificate for TLS connection to docker
  [docker_tls_cert: <string> | default = "cert.pem"]
  # Path to private key for TLS connection to docker
  [docker_tls_key: <string> | default = "key.pem"]
  # Path to a trusted CA for TLS connection to docker
  [docker_tls_ca: <string> | default = "ca.pem"]
  # Only report docker containers in addition to root stats
  [docker_only: <boolean> | default = false]
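As an illustration of how a few of these options combine, the sketch below writes an integrations fragment that limits collection to Docker containers, keeps one container label, and drops one metric family. The output path, the label name, and the dropped metric are illustrative assumptions. The fragment is meant to be merged into a full agent config rather than used on its own.

```shell
# Quoted delimiter: nothing inside the heredoc is expanded by the shell.
cat <<'EOF' > /tmp/cadvisor-integration-example.yaml
integrations:
  cadvisor:
    enabled: true
    # Report only Docker containers plus the root stats.
    docker_only: true
    # Turn off blanket label conversion so the allowlist below takes effect.
    store_container_labels: false
    allowlisted_container_labels:
      - com.docker.compose.service   # hypothetical label name
    # Drop series that are not needed before they are remote-written.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: container_tasks_state
        action: drop
    storage_duration: 2m
EOF
```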