1. What are metrics?

While IoTDB is running, it continuously collects metrics that reflect the current status of the system. These metrics provide useful information for resolving system problems and detecting potential risks.

2. When to use metrics?

Below are some typical application scenarios.

  1. System is running slowly

    When the system is running slowly, we want as much detail as possible about its running status, such as:

    • JVM: Is there any full GC (FGC)? How long does it take? How much does memory usage decrease after GC? Are there lots of threads?
    • System: Is the CPU usage too high? Are there many disk I/Os?
    • Connections: How many connections are there at the moment?
    • Interface: What are the TPS and latency of each interface?
    • ThreadPool: Are there many pending tasks?
    • Cache: What is the cache hit ratio?
  2. No space left on device

    When we meet a “no space left on device” error, we want to know which kind of data file has grown rapidly in the past hours.

  3. Is the system running in abnormal status

    We can use the count of error logs, the alive status of nodes in the cluster, etc., to determine whether the system is running abnormally.

3. Who will use metrics?

Anyone who cares about the system’s status, including but not limited to RD, QA, SRE, and DBA, can use metrics to work more efficiently.

4. What metrics does IoTDB have?

For now, we provide metrics for several core modules of IoTDB. More metrics will be added or updated as new features are developed and the architecture is optimized or refactored.

4.1. Key Concept

Before going further, let’s look at some key concepts about metrics.

Every metric has two properties:

  • Metric Name

    The name of the metric. For example, logback_events_total indicates the total count of log events.

  • Tag

    Each metric can have zero or more subclasses (tags). In the same example, the logback_events_total metric has a tag named level, which distinguishes the total count of log events at each specific level.
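
To make the name/tag split concrete, here is a minimal Python sketch that separates one Prometheus-format sample line into its metric name, tags, and value. The function name parse_sample and the regular expressions are illustrative assumptions, not part of IoTDB, and the pattern only covers the simple sample shapes shown in this document.

```python
import re

# One sample line looks like: logback_events_total{level="warn",} 0.0
# The tag block in braces is optional (for example: jvm_threads_live_threads 25.0).
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<tags>[^}]*)\})?\s+(?P<value>\S+)$'
)
TAG_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, tag dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    tags = dict(TAG_RE.findall(match.group("tags") or ""))
    return match.group("name"), tags, float(match.group("value"))


print(parse_sample('logback_events_total{level="warn",} 0.0'))
```

Here the metric name is logback_events_total and level is its only tag; a metric without tags yields an empty dictionary.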

4.2. Data Format

IoTDB provides metrics data in both JMX and Prometheus formats. For JMX, you can get these metrics via org.apache.iotdb.metrics.

Next, we use Prometheus-format data as samples to describe each kind of metric.

4.3. IoTDB Metrics

4.3.1. API

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| entry_seconds_count | name="interface name" | important | The total request count of the interface | entry_seconds_count{name="openSession",} 1.0 |
| entry_seconds_sum | name="interface name" | important | The total time in seconds spent in the interface | entry_seconds_sum{name="openSession",} 0.024 |
| entry_seconds_max | name="interface name" | important | The max latency of the interface | entry_seconds_max{name="openSession",} 0.024 |
| quantity_total | name="pointsIn" | important | The total points inserted into IoTDB | quantity_total{name="pointsIn",} 1.0 |
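
The count and sum samples can be combined into the TPS and latency figures mentioned in the scenarios earlier. The following is a minimal sketch of that arithmetic, assuming two scrapes of entry_seconds_count taken a known number of seconds apart; the function names and the values of the second scrape are hypothetical.

```python
def average_latency_seconds(seconds_sum, request_count):
    """Mean latency per request: entry_seconds_sum / entry_seconds_count."""
    if request_count == 0:
        return 0.0  # no requests yet, so there is no latency to report
    return seconds_sum / request_count


def requests_per_second(count_before, count_after, interval_seconds):
    """Approximate TPS from two entry_seconds_count samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (count_after - count_before) / interval_seconds


# Values from the sample column: one openSession call that took 0.024 s.
print(average_latency_seconds(0.024, 1.0))
# Hypothetical second scrape 15 s later, with 31 calls in total.
print(requests_per_second(1.0, 31.0, 15.0))
```

Note that entry_seconds_max is reported directly and cannot be derived from the other two samples.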

4.3.2. Task

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| queue | name="compaction_inner/compaction_cross/flush", status="running/waiting" | important | The count of current tasks in running and waiting status | queue{name="flush",status="waiting",} 0.0; queue{name="flush",status="running",} 0.0 |
| cost_task_seconds_count | name="compaction/flush" | important | The total count of tasks that have occurred so far | cost_task_seconds_count{name="flush",} 1.0 |
| cost_task_seconds_max | name="compaction/flush" | important | The duration in seconds of the longest task so far | cost_task_seconds_max{name="flush",} 0.363 |
| cost_task_seconds_sum | name="compaction/flush" | important | The total time in seconds of all tasks so far | cost_task_seconds_sum{name="flush",} 0.363 |
| data_written | name="compaction", type="aligned/not-aligned/total" | important | The size of data written in compaction | data_written{name="compaction",type="total",} 10240 |
| data_read | name="compaction" | important | The size of data read in compaction | data_read{name="compaction",} 10240 |

4.3.3. Memory Usage

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| mem | name="chunkMetaData/storageGroup/mtree" | important | Current memory size of chunkMetaData/storageGroup/mtree data in bytes | mem{name="chunkMetaData",} 2050.0 |

4.3.4. Cache Hit Ratio

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cache_hit | name="chunk/timeSeriesMeta/bloomFilter" | important | Cache hit ratio of chunk/timeSeriesMeta and prevention ratio of bloom filter | cache_hit{name="chunk",} 80 |

4.3.5. Business Data

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| quantity | name="timeSeries/storageGroup/device" | important | The current count of timeSeries/storageGroup/devices in IoTDB | quantity{name="timeSeries",} 1.0 |

4.3.6. Cluster

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| cluster_node_leader_count | name="" | important | The count of dataGroupLeader on each node, which reflects the distribution of leaders | cluster_node_leader_count{name="127.0.0.1",} 2.0 |
| cluster_uncommitted_log | name="" | important | The count of uncommitted_log on each node in the data groups it belongs to | cluster_uncommitted_log{name="127.0.0.1_Data-127.0.0.1-40010-raftId-0",} 0.0 |
| cluster_node_status | name="" | important | The current node status, 1=online 2=offline | cluster_node_status{name="127.0.0.1",} 1.0 |
| cluster_elect_total | name="",status="fail/win" | important | The count and result (won or failed) of elections the node participated in | cluster_elect_total{name="127.0.0.1",status="win",} 1.0 |
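
As one way to use cluster_node_status for the "abnormal status" scenario, the sketch below lists the nodes that are not reported as online. It assumes the readings have already been collected into a dictionary keyed by the name tag; the helper name offline_nodes and the second node address are hypothetical.

```python
ONLINE = 1.0   # cluster_node_status value for an online node
OFFLINE = 2.0  # cluster_node_status value for an offline node


def offline_nodes(node_status):
    """Return the sorted names of nodes whose status is not ONLINE."""
    return sorted(name for name, status in node_status.items() if status != ONLINE)


# Hypothetical readings; only 127.0.0.1 appears in the table above.
readings = {"127.0.0.1": ONLINE, "127.0.0.2": OFFLINE}
print(offline_nodes(readings))
```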

4.4. IoTDB Predefined Metrics Set

Users can modify the value of predefinedMetrics in the iotdb-metric.yml file to enable predefined sets of metrics. JVM, LOGBACK, FILE, PROCESS, and SYSTEM are currently supported.

4.4.1. JVM

4.4.1.1. Threads
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_threads_live_threads | None | Important | The current count of threads | jvm_threads_live_threads 25.0 |
| jvm_threads_daemon_threads | None | Important | The current count of daemon threads | jvm_threads_daemon_threads 12.0 |
| jvm_threads_peak_threads | None | Important | The max count of threads so far | jvm_threads_peak_threads 28.0 |
| jvm_threads_states_threads | state="runnable/blocked/waiting/timed-waiting/new/terminated" | Important | The count of threads in each state | jvm_threads_states_threads{state="runnable",} 10.0 |
4.4.1.2. GC
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_gc_pause_seconds_count | action="end of major GC/end of minor GC",cause="xxxx" | Important | The total count of YGC/FGC events and their cause | jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0 |
| jvm_gc_pause_seconds_sum | action="end of major GC/end of minor GC",cause="xxxx" | Important | The total time in seconds spent in YGC/FGC and its cause | jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.03 |
| jvm_gc_pause_seconds_max | action="end of major GC",cause="Metadata GC Threshold" | Important | The max time in seconds of a YGC/FGC so far and its cause | jvm_gc_pause_seconds_max{action="end of major GC",cause="Metadata GC Threshold",} 0.0 |
| jvm_gc_memory_promoted_bytes_total | None | Important | Count of positive increases in the size of the old generation memory pool before GC to after GC | jvm_gc_memory_promoted_bytes_total 8425512.0 |
| jvm_gc_max_data_size_bytes | None | Important | Max size of long-lived heap memory pool | jvm_gc_max_data_size_bytes 2.863661056E9 |
| jvm_gc_live_data_size_bytes | None | Important | Size of long-lived heap memory pool after reclamation | jvm_gc_live_data_size_bytes 8450088.0 |
| jvm_gc_memory_allocated_bytes_total | None | Important | Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next | jvm_gc_memory_allocated_bytes_total 4.2979144E7 |
4.4.1.3. Memory
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_buffer_memory_used_bytes | id="direct/mapped" | Important | An estimate of the memory that the Java virtual machine is using for this buffer pool | jvm_buffer_memory_used_bytes{id="direct",} 3.46728099E8 |
| jvm_buffer_total_capacity_bytes | id="direct/mapped" | Important | An estimate of the total capacity of the buffers in this pool | jvm_buffer_total_capacity_bytes{id="mapped",} 0.0 |
| jvm_buffer_count_buffers | id="direct/mapped" | Important | An estimate of the number of buffers in the pool | jvm_buffer_count_buffers{id="direct",} 183.0 |
| jvm_memory_committed_bytes | area="heap/nonheap",id="xxx" | Important | The amount of memory in bytes that is committed for the Java virtual machine to use | jvm_memory_committed_bytes{area="heap",id="Par Survivor Space",} 2.44252672E8; jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 3.9051264E7 |
| jvm_memory_max_bytes | area="heap/nonheap",id="xxx" | Important | The maximum amount of memory in bytes that can be used for memory management | jvm_memory_max_bytes{area="heap",id="Par Survivor Space",} 2.44252672E8; jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9 |
| jvm_memory_used_bytes | area="heap/nonheap",id="xxx" | Important | The amount of used memory | jvm_memory_used_bytes{area="heap",id="Par Eden Space",} 1.000128376E9; jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.9783808E7 |
4.4.1.4. Classes
| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| jvm_classes_unloaded_classes_total | None | Important | The total number of classes unloaded since the Java virtual machine has started execution | jvm_classes_unloaded_classes_total 680.0 |
| jvm_classes_loaded_classes | None | Important | The number of classes that are currently loaded in the Java virtual machine | jvm_classes_loaded_classes 5975.0 |
| jvm_compilation_time_ms_total | compiler="HotSpot 64-Bit Tiered Compilers" | Important | The approximate accumulated elapsed time spent in compilation | jvm_compilation_time_ms_total{compiler="HotSpot 64-Bit Tiered Compilers",} 107092.0 |

4.4.2. File

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| file_size | name="wal/seq/unseq" | important | The current file size of wal/seq/unseq in bytes | file_size{name="wal",} 67.0 |
| file_count | name="wal/seq/unseq" | important | The current count of wal/seq/unseq files | file_count{name="seq",} 1.0 |

4.4.3. Logback

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| logback_events_total | level="trace/debug/info/warn/error" | Important | The count of trace/debug/info/warn/error log events so far | logback_events_total{level="warn",} 0.0 |

4.4.4. Process

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| process_cpu_load | name="cpu" | core | Current CPU usage of the process (%) | process_cpu_load{name="process",} 5.0 |
| process_cpu_time | name="cpu" | core | Total CPU time occupied by the process (ns) | process_cpu_time{name="process",} 3.265625E9 |
| process_max_mem | name="memory" | core | The maximum available memory for the JVM | process_max_mem{name="process",} 3.545759744E9 |
| process_used_mem | name="memory" | core | The memory currently used by the JVM | process_used_mem{name="process",} 4.6065456E7 |
| process_total_mem | name="memory" | core | The memory currently requested by the JVM | process_total_mem{name="process",} 2.39599616E8 |
| process_free_mem | name="memory" | core | The free available memory for the JVM | process_free_mem{name="process",} 1.94035584E8 |
| process_mem_ratio | name="memory" | core | Memory footprint ratio of the process | process_mem_ratio{name="process",} 0.0 |
| process_threads_count | name="process" | core | The current number of threads | process_threads_count{name="process",} 11.0 |
| process_status | name="process" | core | Whether the process is alive: 1.0 means alive, 0.0 means terminated | process_status{name="process",} 1.0 |

4.4.5. System

| Metric | Tag | Level | Description | Sample |
| --- | --- | --- | --- | --- |
| sys_cpu_load | name="cpu" | core | Current system CPU usage (%) | sys_cpu_load{name="system",} 15.0 |
| sys_cpu_cores | name="cpu" | core | Available CPU cores | sys_cpu_cores{name="system",} 16.0 |
| sys_total_physical_memory_size | name="memory" | core | Maximum physical memory of the system | sys_total_physical_memory_size{name="system",} 1.5950999552E10 |
| sys_free_physical_memory_size | name="memory" | core | The current available memory of the system | sys_free_physical_memory_size{name="system",} 4.532396032E9 |
| sys_total_swap_space_size | name="memory" | core | The maximum swap area of the system | sys_total_swap_space_size{name="system",} 2.1051273216E10 |
| sys_free_swap_space_size | name="memory" | core | The available swap area of the system | sys_free_swap_space_size{name="system",} 2.931576832E9 |
| sys_committed_vm_size | name="memory" | important | The amount of virtual memory available to running processes | sys_committed_vm_size{name="system",} 5.04344576E8 |
| sys_disk_total_space | name="disk" | core | The total disk space | sys_disk_total_space{name="system",} 5.10770798592E11 |
| sys_disk_free_space | name="disk" | core | The available disk space | sys_disk_free_space{name="system",} 3.63467845632E11 |
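
The two disk metrics can be combined into an early warning for the "no space left on device" scenario. The sketch below is only an illustration: the function names and the 90% threshold are assumptions, not IoTDB defaults.

```python
def disk_used_ratio(total_bytes, free_bytes):
    """Fraction of disk space in use, from sys_disk_total_space and sys_disk_free_space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - free_bytes / total_bytes


def is_disk_nearly_full(total_bytes, free_bytes, threshold=0.9):
    """True when the used fraction reaches the (hypothetical) alert threshold."""
    return disk_used_ratio(total_bytes, free_bytes) >= threshold


# Values taken from the sample column of the table above.
total = 5.10770798592E11
free = 3.63467845632E11
print(f"{disk_used_ratio(total, free):.1%}", is_disk_nearly_full(total, free))
```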

4.5. Add custom metrics

  • If you want to add your own metrics data in IoTDB, please see the [IoTDB Metric Framework](https://github.com/apache/iotdb/tree/master/metrics) document.
  • Rules for defining metric instrumentation points
    • Metric: The name of the monitoring item. For example, entry_seconds_count is the cumulative number of accesses to the interface, and file_size is the total size of files.
    • Tags: Key-value pairs used to identify monitored items; optional.
      • name = xxx: The name of the monitored item. For example, for the monitoring item entry_seconds_count, name is the name of the monitored interface.
      • status = xxx: Subdivides the monitored item by status. For example, a task monitoring item can use this tag to separate running tasks from stopped tasks.
      • user = xxx: The monitored item is related to a specific user, such as the total number of writes by the root user.
      • Customize for the situation…
  • Monitoring indicator level meaning:
    • The default startup level for online operation is Important, the default startup level for offline debugging is Normal, and the audit strictness is Core > Important > Normal > All.
    • Core: The core indicators of the system, used by operation and maintenance personnel. They relate to the performance, stability, and security of the system, such as the status of the instance and the load of the system.
    • Important: The important indicators of a module, used by operation and maintenance personnel and testers. They relate directly to the running status of each module, such as the number of merged files and execution status.
    • Normal: The general indicators of a module, used by developers to locate the module when problems occur, such as the status of specific key operations during merging.
    • All: All indicators of a module, used by module developers, often when reproducing a problem so that it can be solved quickly.
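
The level ordering can be read as a filter: a metric is collected when its own level is at least as strict as the configured level. The sketch below illustrates that reading; it is an assumption drawn from the ordering Core > Important > Normal > All, not a copy of IoTDB's implementation, and the names MetricLevel and is_collected are hypothetical.

```python
from enum import IntEnum


class MetricLevel(IntEnum):
    """Levels ordered by strictness: Core > Important > Normal > All."""
    ALL = 0
    NORMAL = 1
    IMPORTANT = 2
    CORE = 3


def is_collected(metric_level, configured_level):
    """A metric is kept when it is at least as strict as the configured level."""
    return metric_level >= configured_level


# With the online default (Important), Core metrics are kept and Normal ones are not.
print(is_collected(MetricLevel.CORE, MetricLevel.IMPORTANT))
print(is_collected(MetricLevel.NORMAL, MetricLevel.IMPORTANT))
```

Under this reading, configuring the level All keeps every metric, and configuring Core keeps only the core ones.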

5. How to get these metrics?

Metrics collection is disabled by default; you need to enable it in conf/iotdb-metric.yml. Hot loading via load configuration after startup is also supported.

5.1. iotdb-metric.yml

    # Whether to enable the metrics module
    enableMetric: false
    # Whether to collect performance statistics of operation latency
    enablePerformanceStat: false
    # Multiple reporters, options: [JMX, PROMETHEUS, IOTDB], IOTDB is off by default
    metricReporterList:
      - JMX
      - PROMETHEUS
    # Type of monitoring framework, options: [MICROMETER, DROPWIZARD]
    monitorType: MICROMETER
    # Metric level, options: [CORE, IMPORTANT, NORMAL, ALL]
    metricLevel: IMPORTANT
    # Predefined metrics, options: [JVM, LOGBACK, FILE, PROCESS, SYSTEM]
    predefinedMetrics:
      - JVM
      - FILE
    # The HTTP server port from which the Prometheus exporter serves metric data
    prometheusExporterPort: 9091
    # The configuration of the IoTDB reporter
    ioTDBReporterConfig:
      host: 127.0.0.1
      port: 6667
      username: root
      password: root
      database: _metric
      pushPeriodInSecond: 15
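
Several of these keys only accept the options listed in their comments, so a misspelled value is easy to miss. The following sketch checks a configuration against those option lists. It assumes the file has already been loaded into a Python dictionary (for example with a YAML library, which is not shown here); the helper find_invalid_options is hypothetical and is not part of IoTDB.

```python
# Options copied from the comments in iotdb-metric.yml.
ALLOWED_OPTIONS = {
    "metricReporterList": {"JMX", "PROMETHEUS", "IOTDB"},
    "monitorType": {"MICROMETER", "DROPWIZARD"},
    "metricLevel": {"CORE", "IMPORTANT", "NORMAL", "ALL"},
    "predefinedMetrics": {"JVM", "LOGBACK", "FILE", "PROCESS", "SYSTEM"},
}


def find_invalid_options(config):
    """Return (key, value) pairs whose value is not a documented option."""
    problems = []
    for key, allowed in ALLOWED_OPTIONS.items():
        value = config.get(key)
        if value is None:
            continue  # key not set; nothing to check
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item not in allowed:
                problems.append((key, item))
    return problems


config = {
    "enableMetric": True,
    "metricReporterList": ["JMX", "PROMETHEUS"],
    "monitorType": "MICROMETER",
    "metricLevel": "IMPORTANT",
    "predefinedMetrics": ["JVM", "SYSYTEM"],  # deliberate misspelling of SYSTEM
}
print(find_invalid_options(config))
```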

With the configuration file in place, you can get metrics data as follows:

  1. Enable the metrics switch in iotdb-metric.yml.
  2. You can leave the other configuration parameters at their defaults.
  3. Start or restart your IoTDB server/cluster.
  4. Open your browser or use the curl command to request http://server_ip:9091/metrics, and you will get metrics data like the following:
    ...
    # HELP file_count
    # TYPE file_count gauge
    file_count{name="wal",} 0.0
    file_count{name="unseq",} 0.0
    file_count{name="seq",} 2.0
    # HELP file_size
    # TYPE file_size gauge
    file_size{name="wal",} 0.0
    file_size{name="unseq",} 0.0
    file_size{name="seq",} 560.0
    # HELP queue
    # TYPE queue gauge
    queue{name="flush",status="waiting",} 0.0
    queue{name="flush",status="running",} 0.0
    # HELP quantity
    # TYPE quantity gauge
    quantity{name="timeSeries",} 1.0
    quantity{name="storageGroup",} 1.0
    quantity{name="device",} 1.0
    # HELP logback_events_total Number of error level events that made it to the logs
    # TYPE logback_events_total counter
    logback_events_total{level="warn",} 0.0
    logback_events_total{level="debug",} 2760.0
    logback_events_total{level="error",} 0.0
    logback_events_total{level="trace",} 0.0
    logback_events_total{level="info",} 71.0
    # HELP mem
    # TYPE mem gauge
    mem{name="storageGroup",} 0.0
    mem{name="mtree",} 1328.0
    ...
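
The same request can be made from a script. The sketch below uses only the Python standard library; the function names are hypothetical, and the URL assumes the default prometheusExporterPort of 9091 with server_ip replaced by your own host. The download function is defined but not called, so the example runs without a live server.

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5.0):
    """Download the Prometheus-format text from the exporter endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def select_samples(text, metric_name):
    """Return the sample lines of one metric, skipping # HELP and # TYPE comments."""
    selected = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Match the exact metric name, followed by a tag block or the value.
        if line.startswith(metric_name + "{") or line.startswith(metric_name + " "):
            selected.append(line)
    return selected


# A fragment of the output shown above, used here instead of a live request.
sample_text = """# HELP file_count
# TYPE file_count gauge
file_count{name="wal",} 0.0
file_count{name="seq",} 2.0
# HELP file_size
# TYPE file_size gauge
file_size{name="seq",} 560.0
"""
print(select_samples(sample_text, "file_count"))
```

Against a running server, select_samples(fetch_metrics("http://server_ip:9091/metrics"), "file_count") would return the live values.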

5.2. Integrating with Prometheus and Grafana

As described above, IoTDB provides metrics data in the standard Prometheus format, so it can be integrated with Prometheus and Grafana directly.

The following picture describes the relationships among IoTDB, Prometheus and Grafana:

[Figure: iotdb_prometheus_grafana]

  1. While running, IoTDB continuously collects its metrics.
  2. Prometheus scrapes metrics from IoTDB at a constant, configurable interval.
  3. Prometheus saves these metrics to its internal TSDB.
  4. Grafana queries metrics from Prometheus at a constant, configurable interval and then presents them on graphs.

So, some additional work is needed to configure and deploy Prometheus and Grafana.

For instance, you can configure Prometheus as follows to get metrics data from IoTDB:

    job_name: pull-metrics
    honor_labels: true
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets:
          - localhost:9091
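
Once Prometheus is scraping IoTDB, one way to confirm that data is arriving is an instant query against the Prometheus HTTP API (GET /api/v1/query). The sketch below only builds the request URL. It assumes Prometheus listens on localhost:9090, its default port, which may differ in your deployment; the helper name instant_query_url is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_base_url, promql):
    """Build the URL of a Prometheus instant query for the given PromQL expression."""
    query_string = urlencode({"query": promql})
    return prometheus_base_url.rstrip("/") + "/api/v1/query?" + query_string


# Ask for the current size of sequence files, one of the metrics listed earlier.
print(instant_query_url("http://localhost:9090", 'file_size{name="seq"}'))
```

Requesting the printed URL with a browser or curl returns a JSON document; an empty result list suggests that the scrape job has not collected this metric yet.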

The following documents may help you get started with Prometheus and Grafana:

Prometheus getting_started

Prometheus scrape metrics

Grafana getting_started

Grafana query metrics from Prometheus

5.3. Apache IoTDB Dashboard

We provide the Apache IoTDB Dashboard. Its rendering in Grafana is as follows:

[Figure: Apache IoTDB Dashboard]

How to get Apache IoTDB Dashboard:

  1. You can obtain the JSON files of the dashboards corresponding to different IoTDB versions in the grafana-metrics-example folder.
  2. You can visit the official Grafana Dashboard website, search for Apache IoTDB Dashboard, and use it.

When creating the dashboard in Grafana, you can import the JSON file you just downloaded and select the corresponding target data source for the Apache IoTDB Dashboard.