While IoTDB is running, we want to observe its status so that we can troubleshoot system problems or discover potential system risks in time. System monitoring metrics are the set of metrics that reflect the operating status of the system.

1. When to use the metric framework?

Below are some typical application scenarios:

  1. System is running slowly

    When the system is running slowly, we want information about its running status in as much detail as possible, such as:

    • JVM: Is there any Full GC (FGC)? How long does it take? How much does memory usage decrease after GC? Are there many threads?
    • System: Is the CPU usage too high? Is there heavy disk I/O?
    • Connections: How many connections are there at the moment?
    • Interface: What are the TPS and latency of each interface?
    • Thread Pool: Are there many pending tasks?
    • Cache: What is the cache hit ratio?
  2. No space left on device

    When a “no space left on device” error occurs, we want to know which kind of data file has grown rapidly in the past hours.

  3. Is the system running in an abnormal status?

    We can use the count of error logs, the alive status of nodes in the cluster, etc., to determine whether the system is running abnormally.

2. Who will use the metric framework?

Anyone who cares about the system’s status, including but not limited to RD, QA, SRE, and DBA, can use the metrics to work more efficiently.

3. What are metrics?

3.1. Key Concept

In IoTDB’s metric module, each metric is uniquely identified by its Metric Name and Tags.

  • Metric Name: The name of the metric type. For example, logback_events represents log events.
  • Tags: Metric classification, in the form of Key-Value pairs. Each metric can have 0 or more tags. Common Key-Value pairs:
    • name = xxx: The name of the monitored object, which describes the business logic. For example, for a metric with Metric Name = entry_seconds_count, name refers to the monitored business interface.
    • type = xxx: A subdivision of the metric type, which describes the metric itself. For example, for a metric with Metric Name = point, type refers to the specific type of the monitored points.
    • status = xxx: The status of the monitored object, which describes the business logic. For example, for a metric with Metric Name = Task, this tag can be used to distinguish the status of the monitored object.
    • user = xxx: The user related to the monitored object, which describes the business logic. For example, counting the total points written by the root user.
    • Custom tags for specific situations: For example, logback_events_total has a level tag, which indicates the number of logs at a specific level.
  • Metric Level: The management level of a metric. The default startup level is Core, the recommended startup level is Important, and the audit strictness is Core > Important > Normal > All.
    • Core: Core metrics of the system, used by operation and maintenance personnel. They relate to the performance, stability, and security of the system, such as the status of the instance and the load of the system.
    • Important: Important metrics of a module, used by operation and maintenance personnel and testers. They relate directly to the running status of each module, such as the number of merged files and the execution status.
    • Normal: Normal metrics of a module, used by developers to locate the module when problems occur, such as specific key operations during a merge.
    • All: All metrics of a module, used by module developers, often while reproducing a problem so that it can be solved quickly.
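
The identity rule and the level ordering described above can be sketched in a few lines of Python. This is an illustrative model only, not IoTDB's implementation: the names `MetricLevel`, `metric_key` and `is_enabled` are hypothetical, and the rule that a startup level enables every metric of the same or a stricter level is an assumption drawn from the ordering Core > Important > Normal > All.

```python
from enum import IntEnum


class MetricLevel(IntEnum):
    # Smaller value = stricter level, mirroring Core > Important > Normal > All.
    CORE = 0
    IMPORTANT = 1
    NORMAL = 2
    ALL = 3


def metric_key(name, tags=None):
    """Build a hashable identity from a Metric Name and its Tags.

    Tags are sorted so that the same Key-Value pairs always give the same key.
    """
    return (name, tuple(sorted((tags or {}).items())))


def is_enabled(metric_level, startup_level):
    """Assumed rule: a metric is collected if its level is at least as strict
    as the configured startup level."""
    return metric_level <= startup_level


# Hypothetical registry: same Metric Name, different Tags -> different metrics.
registry = {
    metric_key("logback_events", {"level": "error"}): MetricLevel.CORE,
    metric_key("logback_events", {"level": "warn"}): MetricLevel.CORE,
    metric_key("cache_hit", {"name": "chunk"}): MetricLevel.IMPORTANT,
}

enabled_at_core = [k for k, lvl in registry.items() if is_enabled(lvl, MetricLevel.CORE)]
print(len(enabled_at_core))
```

Under this model, starting at the Core level collects only the two logback_events entries, while starting at the Important level would also collect cache_hit.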

3.2. External data format for metrics

  • IoTDB provides metrics in JMX, Prometheus and IoTDB formats:
    • For JMX, metrics can be obtained through org.apache.iotdb.metrics.
    • For Prometheus, metric values can be obtained through the externally exposed port.
    • For IoTDB, metrics can be obtained by executing IoTDB queries.

4. The detail of metrics

Currently, IoTDB exposes metrics for some main modules. As new functions are developed and the system is optimized or refactored, metrics will be added and updated accordingly.

If you want to add your own metrics to IoTDB, please see the [IoTDB Metric Framework](https://github.com/apache/iotdb/tree/master/metrics) document.

4.1. Core level metrics

Core-level metrics are enabled by default during system operation. Each new Core-level metric must be carefully evaluated before it is added. The current Core-level metrics are as follows:

4.1.1. Cluster

Metric | Tags | Type | Description
config_node | name="total", status="Registered/Online/Unknown" | AutoGauge | The number of registered/online/unknown ConfigNodes
data_node | name="total", status="Registered/Online/Unknown" | AutoGauge | The number of registered/online/unknown DataNodes
cluster_node_leader_count | name="{ip}:{port}" | Gauge | The number of consensus group leaders on each node
cluster_node_status | name="{ip}:{port}", type="ConfigNode/DataNode" | Gauge | The current node status, 0=Unknown 1=Online
entry | name="{interface}" | Timer | The time consumed by thrift operations
mem | name="IoTConsensus" | AutoGauge | The memory usage of IoTConsensus, Unit: byte

4.1.2. Interface

Metric | Tags | Type | Description
thrift_connections | name="ConfigNodeRPC" | AutoGauge | The number of thrift internal connections in ConfigNode
thrift_connections | name="InternalRPC" | AutoGauge | The number of thrift internal connections in DataNode
thrift_connections | name="MPPDataExchangeRPC" | AutoGauge | The number of thrift internal connections in MPP
thrift_connections | name="ClientRPC" | AutoGauge | The number of thrift connections of Client
thrift_active_threads | name="ConfigNodeRPC-Service" | AutoGauge | The number of thrift active internal connections in ConfigNode
thrift_active_threads | name="DataNodeInternalRPC-Service" | AutoGauge | The number of thrift active internal connections in DataNode
thrift_active_threads | name="MPPDataExchangeRPC-Service" | AutoGauge | The number of thrift active internal connections in MPP
thrift_active_threads | name="ClientRPC-Service" | AutoGauge | The number of thrift active connections of Client
session_idle_time | name="sessionId" | Histogram | The distribution of idle time of different sessions

4.1.3. Node Statistics

Metric | Tags | Type | Description
quantity | name="database" | AutoGauge | The number of databases
quantity | name="timeSeries" | AutoGauge | The number of timeseries
quantity | name="pointsIn" | Counter | The number of written points
points | database="{database}", type="flush" | Gauge | The point number of the last flushed memtable

4.1.4. Cluster Tracing

Metric | Tags | Type | Description
performance_overview | interface="{interface}", type="{statement_type}" | Timer | The time consumed by operations in client
performance_overview_detail | stage="authority" | Timer | The time consumed on authority authentication
performance_overview_detail | stage="parser" | Timer | The time consumed on parsing statement
performance_overview_detail | stage="analyzer" | Timer | The time consumed on analyzing statement
performance_overview_detail | stage="planner" | Timer | The time consumed on planning
performance_overview_detail | stage="scheduler" | Timer | The time consumed on scheduling
performance_overview_schedule_detail | stage="local_scheduler" | Timer | The time consumed on local scheduler
performance_overview_schedule_detail | stage="remote_scheduler" | Timer | The time consumed on remote scheduler
performance_overview_local_detail | stage="schema_validate" | Timer | The time consumed on schema validation
performance_overview_local_detail | stage="trigger" | Timer | The time consumed on trigger
performance_overview_local_detail | stage="storage" | Timer | The time consumed on consensus
performance_overview_storage_detail | stage="engine" | Timer | The time consumed on write stateMachine
performance_overview_engine_detail | stage="lock" | Timer | The time consumed on grabbing lock in DataRegion
performance_overview_engine_detail | stage="create_memtable_block" | Timer | The time consumed on creating new memtable
performance_overview_engine_detail | stage="memory_block" | Timer | The time consumed on insert memory control
performance_overview_engine_detail | stage="wal" | Timer | The time consumed on writing wal
performance_overview_engine_detail | stage="memtable" | Timer | The time consumed on writing memtable
performance_overview_engine_detail | stage="last_cache" | Timer | The time consumed on updating last cache

4.1.5. Task Statistics

Metric | Tags | Type | Description
queue | name="compaction_inner", status="running/waiting" | Gauge | The number of inner compaction tasks
queue | name="compaction_cross", status="running/waiting" | Gauge | The number of cross compaction tasks
queue | name="flush", status="running/waiting" | AutoGauge | The number of flush tasks
cost_task | name="inner_compaction/cross_compaction/flush" | Gauge | The time consumed by compaction tasks

4.1.6. IoTDB process

Metric | Tags | Type | Description
process_cpu_load | name="process" | AutoGauge | The current CPU usage of IoTDB process, Unit: %
process_cpu_time | name="process" | AutoGauge | The total CPU time occupied by IoTDB process, Unit: ns
process_max_mem | name="memory" | AutoGauge | The maximum available memory of IoTDB process
process_total_mem | name="memory" | AutoGauge | The current requested memory for IoTDB process
process_free_mem | name="memory" | AutoGauge | The free available memory of IoTDB process

4.1.7. System

Metric | Tags | Type | Description
sys_cpu_load | name="system" | AutoGauge | The current CPU usage of system, Unit: %
sys_cpu_cores | name="system" | Gauge | The available number of CPU cores
sys_total_physical_memory_size | name="memory" | Gauge | The maximum physical memory of system
sys_free_physical_memory_size | name="memory" | AutoGauge | The current available memory of system
sys_total_swap_space_size | name="memory" | AutoGauge | The maximum swap space of system
sys_free_swap_space_size | name="memory" | AutoGauge | The available swap space of system
sys_committed_vm_size | name="memory" | AutoGauge | The space of virtual memory available to running processes
sys_disk_total_space | name="disk" | AutoGauge | The total disk space
sys_disk_free_space | name="disk" | AutoGauge | The available disk space

4.1.8. Log

Metric | Tags | Type | Description
logback_events | level="trace/debug/info/warn/error" | Counter | The number of log events
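
Because logback_events is a Counter, its raw value only grows; what usually matters for spotting an abnormal system is how much it grew between two observations. The following Python sketch shows one way to compute that growth. It assumes, as is usual for counters, that the value restarts from zero when the process restarts; the function names and the sample numbers are hypothetical.

```python
def counter_increase(previous, current):
    """Growth of a counter between two observations.

    Assumption: if the current value is smaller than the previous one,
    the process restarted and the counter began again from zero.
    """
    if current >= previous:
        return current - previous
    return current


def increase_per_second(previous, current, interval_seconds):
    """Average growth per second over the observation interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_increase(previous, current) / interval_seconds


# Hypothetical readings of logback_events{level="error"} taken 60 s apart.
print(counter_increase(10, 25))         # growth without a restart
print(counter_increase(25, 4))          # growth after an assumed restart
print(increase_per_second(10, 25, 60))  # errors per second
```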

4.1.9. File

Metric | Tags | Type | Description
file_size | name="wal" | AutoGauge | The size of WAL files, Unit: byte
file_size | name="seq" | AutoGauge | The size of sequence TsFiles, Unit: byte
file_size | name="unseq" | AutoGauge | The size of unsequence TsFiles, Unit: byte
file_size | name="inner-seq-temp" | AutoGauge | The size of inner sequence space compaction temporary files
file_size | name="inner-unseq-temp" | AutoGauge | The size of inner unsequence space compaction temporary files
file_size | name="cross-temp" | AutoGauge | The size of cross space compaction temporary files
file_size | name="mods" | AutoGauge | The size of modification files
file_count | name="wal" | AutoGauge | The count of WAL files
file_count | name="seq" | AutoGauge | The count of sequence TsFiles
file_count | name="unseq" | AutoGauge | The count of unsequence TsFiles
file_count | name="inner-seq-temp" | AutoGauge | The count of inner sequence space compaction temporary files
file_count | name="inner-unseq-temp" | AutoGauge | The count of inner unsequence space compaction temporary files
file_count | name="cross-temp" | AutoGauge | The count of cross space compaction temporary files
file_count | name="open_file_handlers" | AutoGauge | The count of open files of the IoTDB process, only supported on Linux and macOS
file_count | name="mods" | AutoGauge | The count of modification files

4.1.10. JVM Memory

Metric | Tags | Type | Description
jvm_buffer_memory_used_bytes | id="direct/mapped" | AutoGauge | The used size of buffer
jvm_buffer_total_capacity_bytes | id="direct/mapped" | AutoGauge | The max size of buffer
jvm_buffer_count_buffers | id="direct/mapped" | AutoGauge | The number of buffers
jvm_memory_committed_bytes | - | AutoGauge | The committed memory of JVM
jvm_memory_max_bytes | - | AutoGauge | The max memory of JVM
jvm_memory_used_bytes | - | AutoGauge | The used memory of JVM

4.1.11. JVM Thread

Metric | Tags | Type | Description
jvm_threads_live_threads | - | AutoGauge | The number of live threads
jvm_threads_daemon_threads | - | AutoGauge | The number of daemon threads
jvm_threads_peak_threads | - | AutoGauge | The peak number of threads
jvm_threads_states_threads | state="runnable/blocked/waiting/timed-waiting/new/terminated" | AutoGauge | The number of threads in different states

4.1.12. JVM GC

Metric | Tags | Type | Description
jvm_gc_pause | action="end of major GC/end of minor GC", cause="xxxx" | Timer | The number and time consumed of Young GC/Full GC caused by different reasons
jvm_gc_concurrent_phase_time | action="{action}", cause="{cause}" | Timer | The number and time consumed of Young GC/Full GC caused by different reasons
jvm_gc_max_data_size_bytes | - | AutoGauge | The historical maximum value of old memory
jvm_gc_live_data_size_bytes | - | AutoGauge | The usage of old memory
jvm_gc_memory_promoted_bytes | - | Counter | The accumulative value of positive memory growth of old memory
jvm_gc_memory_allocated_bytes | - | Counter | The accumulative value of positive memory growth of allocated memory

4.2. Important level metrics

4.2.1. Node

Metric | Tags | Type | Description
region | name="total", type="SchemaRegion" | AutoGauge | The total number of SchemaRegions in PartitionTable
region | name="total", type="DataRegion" | AutoGauge | The total number of DataRegions in PartitionTable
region | name="{ip}:{port}", type="SchemaRegion" | Gauge | The number of SchemaRegions in PartitionTable of a specific node
region | name="{ip}:{port}", type="DataRegion" | Gauge | The number of DataRegions in PartitionTable of a specific node

4.2.2. RatisConsensus

Metric | Tags | Type | Description
ratis_consensus_write | stage="writeLocally" | Timer | The time cost of writing locally stage
ratis_consensus_write | stage="writeRemotely" | Timer | The time cost of writing remotely stage
ratis_consensus_write | stage="writeStateMachine" | Timer | The time cost of writing state machine stage
ratis_server | clientWriteRequest | Timer | Time taken to process write requests from client
ratis_server | followerAppendEntryLatency | Timer | Time taken for followers to append log entries
ratis_log_worker | appendEntryLatency | Timer | Total time taken to append a Raft log entry
ratis_log_worker | queueingDelay | Timer | Time taken for a Raft log operation to get into the queue after being requested, waiting for the queue to be non-full
ratis_log_worker | enqueuedTime | Timer | Time spent by a Raft log operation in the queue
ratis_log_worker | writelogExecutionTime | Timer | Time taken for a Raft log write operation to complete execution
ratis_log_worker | flushTime | Timer | Time taken to flush log
ratis_log_worker | closedSegmentsSizeInBytes | Gauge | Size of closed Raft log segments in bytes
ratis_log_worker | openSegmentSizeInBytes | Gauge | Size of open Raft log segment in bytes

4.2.3. IoTConsensus

Metric | Tags | Type | Description
mutli_leader | name="logDispatcher-{IP}:{Port}", region="{region}", type="currentSyncIndex" | AutoGauge | The sync index of synchronization thread in replica group
mutli_leader | name="logDispatcher-{IP}:{Port}", region="{region}", type="cachedRequestInMemoryQueue" | AutoGauge | The size of cache requests of synchronization thread in replica group
mutli_leader | name="IoTConsensusServerImpl", region="{region}", type="searchIndex" | AutoGauge | The write process of main process in replica group
mutli_leader | name="IoTConsensusServerImpl", region="{region}", type="safeIndex" | AutoGauge | The sync index of replica group
mutli_leader | name="IoTConsensusServerImpl", region="{region}", type="syncLag" | AutoGauge | The sync lag of replica group
mutli_leader | name="IoTConsensusServerImpl", region="{region}", type="LogEntriesFromWAL" | AutoGauge | The number of logEntries from wal in Batch
mutli_leader | name="IoTConsensusServerImpl", region="{region}", type="LogEntriesFromQueue" | AutoGauge | The number of logEntries from queue in Batch
stage | name="iot_consensus", region="{region}", type="getStateMachineLock" | Histogram | The time consumed to get statemachine lock in main process
stage | name="iot_consensus", region="{region}", type="checkingBeforeWrite" | Histogram | The time consumed to precheck before write in main process
stage | name="iot_consensus", region="{region}", type="writeStateMachine" | Histogram | The time consumed to write statemachine in main process
stage | name="iot_consensus", region="{region}", type="offerRequestToQueue" | Histogram | The time consumed to try to offer request to queue in main process
stage | name="iot_consensus", region="{region}", type="consensusWrite" | Histogram | The time consumed by the whole write in main process
stage | name="iot_consensus", region="{region}", type="constructBatch" | Histogram | The time consumed to construct batch in synchronization thread
stage | name="iot_consensus", region="{region}", type="syncLogTimePerRequest" | Histogram | The time consumed to sync log in asynchronous callback process

4.2.4. Cache

Metric | Tags | Type | Description
cache_hit | name="chunk" | AutoGauge | The cache hit ratio of ChunkCache, Unit: %
cache_hit | name="schema" | AutoGauge | The cache hit ratio of SchemaCache, Unit: %
cache_hit | name="timeSeriesMeta" | AutoGauge | The cache hit ratio of TimeseriesMetadataCache, Unit: %
cache_hit | name="bloomFilter" | AutoGauge | The interception rate of bloomFilter in TimeseriesMetadataCache, Unit: %
cache | name="Database", type="hit" | Counter | The hit number of Database Cache
cache | name="Database", type="all" | Counter | The access number of Database Cache
cache | name="SchemaPartition", type="hit" | Counter | The hit number of SchemaPartition Cache
cache | name="SchemaPartition", type="all" | Counter | The access number of SchemaPartition Cache
cache | name="DataPartition", type="hit" | Counter | The hit number of DataPartition Cache
cache | name="DataPartition", type="all" | Counter | The access number of DataPartition Cache
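
The cache counters above come in hit/all pairs, so a hit ratio comparable to the cache_hit gauges can be derived from them. A minimal Python sketch follows, assuming the ratio is simply hits divided by accesses and expressed in percent; the function name and the sample numbers are hypothetical.

```python
def hit_ratio_percent(hits, accesses):
    """Hit ratio in percent from a type="hit" and a type="all" counter value.

    Returns 0.0 when the cache has not been accessed yet, to avoid dividing by zero.
    """
    if accesses <= 0:
        return 0.0
    return 100.0 * hits / accesses


# Hypothetical values of cache{name="DataPartition", type="hit"} and type="all".
print(hit_ratio_percent(90, 120))
print(hit_ratio_percent(0, 0))
```

To measure the ratio over a recent window instead of since startup, the same function can be applied to the differences between two readings of each counter.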

4.2.5. Memory

Metric | Tags | Type | Description
mem | name="database{name}" | AutoGauge | The memory usage of DataRegion in DataNode, Unit: byte
mem | name="chunkMetaData{name}" | AutoGauge | The memory usage of chunkMetaData when writing TsFile, Unit: byte
mem | name="IoTConsensus" | AutoGauge | The memory usage of IoTConsensus, Unit: byte
mem | name="IoTConsensusQueue" | AutoGauge | The memory usage of IoTConsensus Queue, Unit: byte
mem | name="IoTConsensusSync" | AutoGauge | The memory usage of IoTConsensus SyncStatus, Unit: byte
mem | name="schema_region_total_usage" | AutoGauge | The memory usage of all SchemaRegions, Unit: byte

4.2.6. Compaction

Metric | Tags | Type | Description
data_written | name="compaction", type="aligned/not-aligned/total" | Counter | The written size of compaction
data_read | name="compaction" | Counter | The read size of compaction
compaction_task_count | name="inner_compaction", type="sequence" | Counter | The number of inner sequence compactions
compaction_task_count | name="inner_compaction", type="unsequence" | Counter | The number of inner unsequence compactions
compaction_task_count | name="cross_compaction", type="cross" | Counter | The number of cross compactions

4.2.7. IoTDB Process

Metric | Tags | Type | Description
process_used_mem | name="memory" | AutoGauge | The used memory of IoTDB process
process_mem_ratio | name="memory" | AutoGauge | The used memory ratio of IoTDB process
process_threads_count | name="process" | AutoGauge | The number of threads of IoTDB process
process_status | name="process" | AutoGauge | The status of IoTDB process, 1=live, 0=dead

4.2.8. JVM Class

Metric | Tags | Type | Description
jvm_classes_unloaded_classes | - | AutoGauge | The number of unloaded classes
jvm_classes_loaded_classes | - | AutoGauge | The number of loaded classes

4.2.9. JVM Compilation

Metric | Tags | Type | Description
jvm_compilation_time_ms | - | AutoGauge | The time consumed in compilation

4.2.10. Query Planning

Metric | Tags | Type | Description
query_plan_cost | stage="analyzer" | Timer | The time consumed by query statement analysis
query_plan_cost | stage="logical_planner" | Timer | The time consumed by query logical planning
query_plan_cost | stage="distribution_planner" | Timer | The time consumed by query distribution planning
query_plan_cost | stage="partition_fetcher" | Timer | The time consumed by fetching partition information
query_plan_cost | stage="schema_fetcher" | Timer | The time consumed by fetching schema information

4.2.11. Plan Dispatcher

Metric | Tags | Type | Description
dispatcher | stage="wait_for_dispatch" | Timer | The time consumed by dispatching the distribution plan
dispatcher | stage="dispatch_read" | Timer | The time consumed by dispatching the distribution plan (queries only)

4.2.12. Query Resource

Metric | Tags | Type | Description
query_resource | type="sequence_tsfile" | Rate | The access frequency of sequence tsfiles
query_resource | type="unsequence_tsfile" | Rate | The access frequency of unsequence tsfiles
query_resource | type="flushing_memtable" | Rate | The access frequency of flushing memtables
query_resource | type="working_memtable" | Rate | The access frequency of working memtables

4.2.13. Data Exchange

Metric | Tags | Type | Description
data_exchange_cost | operation="source_handle_get_tsblock", type="local/remote" | Timer | The time consumed by source handles receiving TsBlock
data_exchange_cost | operation="source_handle_deserialize_tsblock", type="local/remote" | Timer | The time consumed by source handles deserializing TsBlock
data_exchange_cost | operation="sink_handle_send_tsblock", type="local/remote" | Timer | The time consumed by sink handles sending TsBlock
data_exchange_cost | operation="send_new_data_block_event_task", type="server/caller" | Timer | The RPC time consumed by sink handles sending TsBlock
data_exchange_cost | operation="get_data_block_task", type="server/caller" | Timer | The RPC time consumed by source handles receiving TsBlock
data_exchange_cost | operation="on_acknowledge_data_block_event_task", type="server/caller" | Timer | The RPC time consumed by source handles acknowledging received TsBlock
data_exchange_count | name="send_new_data_block_num", type="server/caller" | Histogram | The number of TsBlocks sent by sink handles
data_exchange_count | name="get_data_block_num", type="server/caller" | Histogram | The number of TsBlocks received by source handles
data_exchange_count | name="on_acknowledge_data_block_num", type="server/caller" | Histogram | The number of TsBlocks acknowledged by source handles

4.2.14. Query Task Schedule

Metric | Tags | Type | Description
driver_scheduler | name="ready_queued_time" | Timer | The queuing time of ready queue
driver_scheduler | name="block_queued_time" | Timer | The queuing time of blocking queue
driver_scheduler | name="ready_queue_task_count" | AutoGauge | The number of tasks queued in the ready queue
driver_scheduler | name="block_queued_task_count" | AutoGauge | The number of tasks queued in the blocking queue

4.2.15. Query Execution

Metric | Tags | Type | Description
query_execution | stage="local_execution_planner" | Timer | The time consumed by operator tree construction
query_execution | stage="query_resource_init" | Timer | The time consumed by query resource initialization
query_execution | stage="get_query_resource_from_mem" | Timer | The time consumed by query resource memory query and construction
query_execution | stage="driver_internal_process" | Timer | The time consumed by driver execution
query_execution | stage="wait_for_result" | Timer | The time consumed by getting query result from result handle
operator_execution_cost | name="{operator_name}" | Timer | The operator execution time
operator_execution_count | name="{operator_name}" | Counter | The number of operator calls (counted by the number of next method calls)
aggregation | from="raw_data" | Timer | The time consumed by performing an aggregation calculation from a batch of raw data
aggregation | from="statistics" | Timer | The time consumed by updating an aggregated value with statistics
series_scan_cost | stage="load_timeseries_metadata", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by loading TimeseriesMetadata
series_scan_cost | stage="read_timeseries_metadata", type="", from="cache/file" | Timer | The time consumed by reading TimeseriesMetadata of a tsfile
series_scan_cost | stage="timeseries_metadata_modification", type="aligned/non_aligned", from="null" | Timer | The time consumed by filtering TimeseriesMetadata by mods
series_scan_cost | stage="load_chunk_metadata_list", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by loading ChunkMetadata list
series_scan_cost | stage="chunk_metadata_modification", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by filtering ChunkMetadata by mods
series_scan_cost | stage="chunk_metadata_filter", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by filtering ChunkMetadata by query filter
series_scan_cost | stage="construct_chunk_reader", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by constructing ChunkReader
series_scan_cost | stage="read_chunk", type="", from="cache/file" | Timer | The time consumed by reading Chunk
series_scan_cost | stage="init_chunk_reader", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by initializing ChunkReader (constructing PageReader)
series_scan_cost | stage="build_tsblock_from_page_reader", type="aligned/non_aligned", from="mem/disk" | Timer | The time consumed by constructing TsBlock from PageReader
series_scan_cost | stage="build_tsblock_from_merge_reader", type="aligned/non_aligned", from="null" | Timer | The time consumed by constructing TsBlock from MergeReader (handling overlapping data)

4.2.16. Schema Engine

Metric | Tags | Type | Description
schema_engine | name="schema_region_total_mem_usage" | AutoGauge | Memory usage for all SchemaRegions
schema_engine | name="schema_region_mem_capacity" | AutoGauge | Memory capacity for all SchemaRegions
schema_engine | name="schema_engine_mode" | Gauge | Mode of SchemaEngine
schema_engine | name="schema_region_consensus" | Gauge | Consensus protocol of SchemaRegion
schema_engine | name="schema_region_number" | AutoGauge | Number of SchemaRegions
quantity | name="template_series_cnt" | AutoGauge | Number of template series
schema_region | name="schema_region_mem_usage", region="SchemaRegion[{regionId}]" | AutoGauge | Memory usage for each SchemaRegion
schema_region | name="schema_region_series_cnt", region="SchemaRegion[{regionId}]" | AutoGauge | Number of total timeseries for each SchemaRegion
schema_region | name="activated_template_cnt", region="SchemaRegion[{regionId}]" | AutoGauge | Number of activated templates for each SchemaRegion
schema_region | name="template_series_cnt", region="SchemaRegion[{regionId}]" | AutoGauge | Number of template series for each SchemaRegion

4.2.17. Write Performance

Metric | Tags | Type | Description
wal_node_num | name="wal_nodes_num" | AutoGauge | Number of WALNodes
wal_cost | stage="make_checkpoint", type="<checkpoint_type>" | Timer | Time cost of making checkpoints for each checkpoint type
wal_cost | type="serialize_one_wal_info_entry" | Timer | Time cost of serializing one WALInfoEntry
wal_cost | stage="sync_wal_buffer", type="<force_flag>" | Timer | Time cost of syncing WALBuffer
wal_buffer | name="used_ratio" | Histogram | Used ratio of WALBuffer
wal_cost | stage="serialize_wal_entry", type="serialize_wal_entry_total" | Timer | Time cost of WALBuffer serialize task
wal_node_info | name="effective_info_ratio", type="<wal_node_id>" | Histogram | Effective info ratio of WALNode
wal_node_info | name="oldest_mem_table_ram_when_cause_snapshot", type="<wal_node_id>" | Histogram | RAM of the oldest memTable when it causes a snapshot
wal_node_info | name="oldest_mem_table_ram_when_cause_flush", type="<wal_node_id>" | Histogram | RAM of the oldest memTable when it causes a flush
flush_sub_task_cost | type="sort_task" | Timer | Time cost of sorting series in flush sort stage
flush_sub_task_cost | type="encoding_task" | Timer | Time cost of sub encoding task in flush encoding stage
flush_sub_task_cost | type="io_task" | Timer | Time cost of sub io task in flush io stage
flush_cost | stage="write_plan_indices" | Timer | Time cost of writing plan indices
flush_cost | stage="sort" | Timer | Time cost of flush sort stage
flush_cost | stage="encoding" | Timer | Time cost of flush encoding stage
flush_cost | stage="io" | Timer | Time cost of flush io stage
pending_flush_task | type="pending_task_num" | AutoGauge | Number of pending flush tasks
pending_flush_task | type="pending_sub_task_num" | AutoGauge | Number of pending flush sub tasks
flushing_mem_table_status | name="mem_table_size", region="DataRegion[<data_region_id>]" | Histogram | Size of flushing memTable
flushing_mem_table_status | name="total_point_num", region="DataRegion[<data_region_id>]" | Histogram | Point number of flushing memTable
flushing_mem_table_status | name="series_num", region="DataRegion[<data_region_id>]" | Histogram | Series number of flushing memTable
flushing_mem_table_status | name="avg_series_points_num", region="DataRegion[<data_region_id>]" | Histogram | Point number of flushing memChunk
flushing_mem_table_status | name="tsfile_compression_ratio", region="DataRegion[<data_region_id>]" | Histogram | TsFile compression ratio of flushing memTable
flushing_mem_table_status | name="flush_tsfile_size", region="DataRegion[<data_region_id>]" | Histogram | TsFile size of flushing memTable

4.3. Normal level metrics

4.3.1. Cluster

Metric | Tags | Type | Description
region | name="{DatabaseName}", type="SchemaRegion/DataRegion" | AutoGauge | The number of DataRegions/SchemaRegions of a database on a specific node
slot | name="{DatabaseName}", type="schemaSlotNumber/dataSlotNumber" | AutoGauge | The number of DataSlots/SchemaSlots of a database on a specific node

4.4. All level metrics

Currently there are no All-level metrics; they will be added in the future.

5. How to get these metrics?

The relevant configuration of the metric module is in conf/iotdb-{datanode/confignode}.properties, and all configuration items support hot loading through the load configuration command.

5.1. JMX

Metrics exposed through JMX can be viewed with JConsole. After entering the JConsole monitoring page, you will first see an overview of the running conditions of IoTDB, including heap memory information, thread information, class information, and the server’s CPU usage.

5.1.1. Obtain metric data

After connecting to JMX, you can find the “MBean” named “org.apache.iotdb.metrics” through the “MBeans” tab, and you can view the specific values of all monitoring metrics in the sidebar.

metric-jmx

5.1.2. Get other relevant data

After connecting to JMX, you can find the “MBean” named “org.apache.iotdb.service” through the “MBeans” tab, as shown in the image below, to understand the basic status of the service.

[Image: the org.apache.iotdb.service MBean]

In order to improve query performance, IoTDB caches ChunkMetaData and TsFileMetaData. Users can use MXBean and expand the sidebar org.apache.iotdb.db.service to view the cache hit ratio:

[Image: cache hit ratio under org.apache.iotdb.db.service]

5.2. Prometheus

5.2.1. The mapping from metric type to prometheus format

For metrics whose Metric Name is name and Tags are K1=V1, …, Kn=Vn, the mapping is as follows, where value is a specific value

Metric TypeMapping
Counter:
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value

AutoGauge, Gauge:
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value

Histogram:
  name_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.0"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.5"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.99"} value
  name{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.999"} value

Rate:
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", rate="m1"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", rate="m5"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", rate="m15"} value
  name_total{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", rate="mean"} value

Timer:
  name_seconds_max{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_seconds_sum{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_seconds_count{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.0"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.5"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.99"} value
  name_seconds{cluster="clusterName", nodeType="nodeType", nodeId="nodeId", K1="V1", …, Kn="Vn", quantile="0.999"} value
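The mapping above can be illustrated with a small Python sketch that renders sample lines in the same shape. This is only an illustration of the naming pattern, not IoTDB's actual reporter code; the function names `format_sample` and `timer_lines` and the example tag values are hypothetical.

```python
def format_sample(name, tags, value, suffix="", extra=None):
    """Render one sample line: name[suffix]{k="v", ...} value."""
    labels = dict(tags)
    if extra:
        # Extra labels such as quantile="0.5" or rate="m1" come last.
        labels.update(extra)
    body = ", ".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{suffix}{{{body}}} {value}"


def timer_lines(name, tags, count, total, maximum, quantiles):
    """Render the lines a Timer maps to, following the mapping above."""
    lines = [
        format_sample(name, tags, maximum, "_seconds_max"),
        format_sample(name, tags, total, "_seconds_sum"),
        format_sample(name, tags, count, "_seconds_count"),
    ]
    for q, v in quantiles.items():
        lines.append(format_sample(name, tags, v, "_seconds", {"quantile": q}))
    return lines


if __name__ == "__main__":
    # Hypothetical tag values, used only for illustration.
    tags = {"cluster": "c1", "nodeType": "datanode", "nodeId": "1"}
    for line in timer_lines("entry", tags, 3, 0.6, 0.4, {"0.5": 0.2, "0.99": 0.4}):
        print(line)
```

A Counter or Rate line would be produced the same way with the `_total` suffix, and a Rate's windowed values with an extra `rate` label.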

5.2.2. Config File

  1. Taking DataNode as an example, modify the iotdb-datanode.properties configuration file as follows:

     dn_metric_reporter_list=PROMETHEUS
     dn_metric_level=CORE
     dn_metric_prometheus_reporter_port=9091

Then you can obtain the metrics data as follows:

  1. Start the IoTDB DataNodes.
  2. Open a browser or use curl to visit http://server_ip:9091/metrics. You will get metric data such as the following:

     ...
     # HELP file_count
     # TYPE file_count gauge
     file_count{name="wal",} 0.0
     file_count{name="unseq",} 0.0
     file_count{name="seq",} 2.0
     ...
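For scripted checks, the same endpoint can be read programmatically. The following Python sketch fetches the text and parses sample lines of the shape shown above. It is a minimal illustration, not a complete Prometheus text-format parser: it ignores comment lines and does not handle optional timestamps. The names `fetch_metrics` and `parse_metrics` are hypothetical, and the URL in the usage section assumes a DataNode reachable on localhost with the port configured above.

```python
import re
from urllib.request import urlopen

# name{labels} value  (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url, timeout=5.0):
    """Download the raw metrics text from a reporter endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simple pattern does not cover
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    # Assumed address; replace localhost with your server's IP.
    text = fetch_metrics("http://localhost:9091/metrics")
    for name, labels, value in parse_metrics(text):
        if name == "file_count":
            print(labels.get("name"), value)
```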

5.2.3. Prometheus + Grafana

As shown above, IoTDB exposes monitoring metrics data in the standard Prometheus format to the outside world. Prometheus can be used to collect and store monitoring indicators, and Grafana can be used to visualize monitoring indicators.

The following picture describes the relationships among IoTDB, Prometheus and Grafana

iotdb_prometheus_grafana

  1. While running, IoTDB continuously collects its metrics.
  2. Prometheus scrapes metrics from IoTDB at a fixed, configurable interval.
  3. Prometheus saves these metrics to its internal TSDB.
  4. Grafana queries metrics from Prometheus at a fixed, configurable interval and then presents them in graphs.

Some additional work is therefore needed to configure and deploy Prometheus and Grafana.

For instance, you can configure Prometheus as follows to get metrics data from IoTDB:

    job_name: pull-metrics
    honor_labels: true
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets:
          - localhost:9091
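To make explicit how this job definition relates to the endpoint visited earlier, the sketch below derives the scrape URL from the `scheme`, `targets`, and `metrics_path` fields. It only mirrors how those three fields combine; it is not Prometheus code. The function name `scrape_urls` is hypothetical, and the job is written as a plain Python dict to keep the example dependency-free.

```python
def scrape_urls(job):
    """Combine scheme, each static target, and metrics_path into URLs."""
    scheme = job.get("scheme", "http")            # Prometheus defaults to http
    path = job.get("metrics_path", "/metrics")    # and to /metrics
    urls = []
    for group in job.get("static_configs", []):
        for target in group.get("targets", []):
            urls.append(f"{scheme}://{target}{path}")
    return urls


# The job from the configuration above, expressed as a dict.
JOB = {
    "job_name": "pull-metrics",
    "scrape_interval": "15s",
    "scrape_timeout": "10s",
    "metrics_path": "/metrics",
    "scheme": "http",
    "static_configs": [{"targets": ["localhost:9091"]}],
}

if __name__ == "__main__":
    for url in scrape_urls(JOB):
        print(url)
```

With the values shown, the target is the DataNode's Prometheus reporter port from the configuration file section (9091).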

The following documents may help you get started with Prometheus and Grafana:

Prometheus getting_started

Prometheus scrape metrics

Grafana getting_started

Grafana query metrics from Prometheus

5.2.4. Apache IoTDB Dashboard

We provide the Apache IoTDB Dashboard, and the rendering shown in Grafana is as follows:

Apache IoTDB Dashboard

The JSON files of the dashboards are available in the enterprise version.

5.3. IoTDB

5.3.1. IoTDB mapping relationship of metrics

For metrics whose Metric Name is name and whose Tags are K1=V1, …, Kn=Vn, the mapping is as follows, using the default prefix root.__system.metric.clusterName.nodeType.nodeId as an example:

Metric Type: Mapping

Counter:
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.value

AutoGauge, Gauge:
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.value

Histogram:
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.count
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.max
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.sum
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p0
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p50
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p75
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p99
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p999

Rate:
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.count
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.mean
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m1
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m5
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m15

Timer:
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.count
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.max
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.mean
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.sum
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p0
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p50
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p75
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p99
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.p999
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m1
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m5
  root.__system.metric.clusterName.nodeType.nodeId.name.K1=V1.….Kn=Vn.m15
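The mapping can be summarized as: prefix, then the metric name, then one path node per tag, then a per-type suffix. The Python sketch below enumerates the resulting paths under that reading. It rests on several assumptions: the prefix is taken to be root.__system.metric, tag nodes are joined with dots, and no quoting or escaping of node names is applied (a real deployment may require it). The names `SUFFIXES` and `series_paths` are hypothetical.

```python
PREFIX = "root.__system.metric"  # assumed default prefix

# Suffixes per metric type, following the mapping above.
SUFFIXES = {
    "counter": ["value"],
    "gauge": ["value"],  # AutoGauge and Gauge
    "histogram": ["count", "max", "sum", "p0", "p50", "p75", "p99", "p999"],
    "rate": ["count", "mean", "m1", "m5", "m15"],
    "timer": ["count", "max", "mean", "sum",
              "p0", "p50", "p75", "p99", "p999", "m1", "m5", "m15"],
}


def series_paths(metric_type, cluster, node_type, node_id, name, tags):
    """List the time-series paths one metric maps to (no escaping applied)."""
    nodes = [PREFIX, cluster, node_type, node_id, name]
    nodes += [f"{key}={value}" for key, value in tags.items()]
    base = ".".join(nodes)
    return [f"{base}.{suffix}" for suffix in SUFFIXES[metric_type]]


if __name__ == "__main__":
    # Hypothetical cluster, node, and tag values.
    for path in series_paths("rate", "c1", "datanode", "1", "entry", {"name": "query"}):
        print(path)
```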

5.3.2. Obtain metrics

According to the above mapping relationship, related IoTDB query statements can be formed to obtain metrics