Alarm

EMQX offers built-in monitoring and alarm functionality for tracking internal state changes, such as CPU occupancy, system and process memory occupancy, number of processes, rule engine resource status, and cluster partition and healing. EMQX triggers and records an alarm when a monitored value exceeds a threshold or deviates from expectations, and removes the alarm from the active list once the value returns to normal.

This page introduces which alarm information EMQX provides, how you can obtain and check the detailed alarm information, and how to configure the alarm settings and thresholds in EMQX. The monitoring and alarm function keeps you notified of potential problems during operation. By configuring alarms and setting appropriate thresholds, you can make sure that EMQX remains secure, stable, and reliable.

Alarm List

The following table lists the alarms that can be triggered to indicate potential problems during system monitoring.

TIP

Depending on their severity and impact on the system, alarms have 3 levels:

  • Error: Errors caused by user presets. The client can perceive the error and retry.

  • Warning: Occasional errors that need to be taken seriously if they occur frequently.

  • Critical: Irreversible data loss between the client and server, causing communication and business interruption.

The levels are defined from a development perspective and are only a recommendation. You can define your own alarm levels according to your business needs.

| Alarm | Level | Description | Details | Threshold |
| --- | --- | --- | --- | --- |
| high_system_memory_usage | Warning | System memory usage is too high | System memory usage is higher than ~p% | os_mon.sysmem_high_watermark = 70% |
| high_process_memory_usage | Warning | Single Erlang process memory usage is too high (percentage of system memory usage) | Process memory usage is higher than ~p% | os_mon.procmem_high_watermark = 5% |
| high_cpu_usage | Warning | CPU usage is too high | ~p% cpu usage | os_mon.cpu_high_watermark = 80%; os_mon.cpu_low_watermark = 60% |
| too_many_processes | Warning | Too many processes | ~p% process usage | vm_mon.process_high_watermark = 80%; vm_mon.process_low_watermark = 60% |
| partition | Critical | Partition occurs at node | Partition occurs at node ~s | - |
| resource | Critical | Resource is disconnected | Resource ~s(~s) is down | - |
| conn_congestion | Critical | Connection process congestion | connection congested | - |

Get Alarms

EMQX provides you with various ways to get alarms and check detailed alarm information. One way is to view the alarms on EMQX Dashboard, where you can view a list of active or historical alarms. However, it is only a central place for easy access to an overview of alarms that have been triggered. Another way is to subscribe to system topics through MQTT to receive real-time notifications of alarms with detailed alarm information. Alarms can also be accessed from the log or via REST API.

View Alarms on Dashboard

On EMQX Dashboard, click Monitoring -> Alarms. Select the Active or History tab, and you can see the list of currently active alarms and historical alarms.

view-alarms

Get Alarms via System Topic

EMQX will publish an MQTT message to system topics $SYS/brokers/<Node>/alarms/activate or $SYS/brokers/<Node>/alarms/deactivate when an alarm is triggered or cleared. Users can subscribe to the topic to receive alarm notifications.
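As a rough illustration of the topic layout, the Python helpers below build the two alarm topics for a node, produce topic filters that use the standard MQTT single-level wildcard `+` to cover every node, and split an incoming topic back into node and event. The function names are hypothetical and are not part of EMQX; only the topic pattern comes from this page.

```python
from typing import List, Optional, Tuple

ALARM_EVENTS = ("activate", "deactivate")


def alarm_topic(node: str, event: str) -> str:
    """Build the system topic for one node and one alarm event."""
    if event not in ALARM_EVENTS:
        raise ValueError(f"unknown alarm event: {event!r}")
    return f"$SYS/brokers/{node}/alarms/{event}"


def alarm_topic_filters() -> List[str]:
    """Topic filters matching alarm events from every node ('+' = one level)."""
    return [alarm_topic("+", event) for event in ALARM_EVENTS]


def parse_alarm_topic(topic: str) -> Optional[Tuple[str, str]]:
    """Return (node, event) for an alarm topic, or None for any other topic."""
    parts = topic.split("/")
    if (len(parts) == 5 and parts[0] == "$SYS" and parts[1] == "brokers"
            and parts[3] == "alarms" and parts[4] in ALARM_EVENTS):
        return parts[2], parts[4]
    return None
```

The filters returned by `alarm_topic_filters()` can be passed to the subscribe call of whichever MQTT client library you use.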

The payload in the alarm notification message is in JSON format and contains the following fields:

| Field | Type | Description |
| --- | --- | --- |
| name | string | Alarm name |
| details | object | Alarm details |
| message | string | Human-readable alarm instructions |
| activate_at | integer | A UNIX timestamp in microseconds representing the activation time of the alarm |
| deactivate_at | integer / string | A UNIX timestamp in microseconds representing the deactivation time of the alarm. The value of this field for the activated alarm is infinity. |
| activated | boolean | Whether the alarm is activated |

Taking the alarm of high system memory usage as an example, you will receive an alarm message like below:

alarm message
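As a sketch of how a subscriber might decode such a payload, the Python snippet below maps the documented fields onto a dataclass, converts the microsecond timestamps to UTC datetimes, and treats the `infinity` value of `deactivate_at` as "not yet deactivated". The sample payload is made up for illustration, including the contents of `details`; it is not captured from a real broker.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Alarm:
    name: str
    details: dict
    message: str
    activate_at: datetime
    deactivate_at: Optional[datetime]  # None while the alarm is still active
    activated: bool


def _from_micros(value: int) -> datetime:
    """Convert a UNIX timestamp in microseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(value / 1_000_000, tz=timezone.utc)


def parse_alarm(payload: bytes) -> Alarm:
    """Decode the JSON payload of an alarm notification message."""
    data = json.loads(payload)
    raw_end = data["deactivate_at"]
    return Alarm(
        name=data["name"],
        details=data["details"],
        message=data["message"],
        activate_at=_from_micros(data["activate_at"]),
        deactivate_at=None if raw_end == "infinity" else _from_micros(raw_end),
        activated=data["activated"],
    )


# Hypothetical payload, made up for illustration only.
sample = json.dumps({
    "name": "high_system_memory_usage",
    "details": {"high_watermark": 70},
    "message": "System memory usage is higher than 70%",
    "activate_at": 1700000000000000,
    "deactivate_at": "infinity",
    "activated": True,
}).encode()

alarm = parse_alarm(sample)
print(alarm.name, alarm.activate_at.isoformat(), alarm.deactivate_at)
```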

A single system malfunction will not be reported repeatedly. That is, if an alarm on high CPU usage is already active, the system will not generate another alarm of the same type. The alarm is automatically deactivated when the monitored metric returns to normal, or you can deactivate it manually.
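The deduplication behavior described above can be pictured with a small Python sketch that keys active alarms by name. This is an illustration of the idea only, under the assumption that the alarm name identifies the alarm type; the class is hypothetical and does not reflect how EMQX implements it internally.

```python
class ActiveAlarms:
    """Tracks active alarms by name so that each name is reported once."""

    def __init__(self):
        self._active = {}

    def activate(self, name: str, details: dict) -> bool:
        """Return True if a new alarm was raised, False if already active."""
        if name in self._active:
            return False
        self._active[name] = details
        return True

    def deactivate(self, name: str) -> bool:
        """Return True if an active alarm was cleared, False otherwise."""
        if name not in self._active:
            return False
        del self._active[name]
        return True
```

A second `activate("high_cpu_usage", ...)` call returns `False` until the alarm has been deactivated, after which the same alarm can be raised again.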

Get Alarms from Log

The activation and deactivation of alarms can be written to the log (console or file). When failures occur during message transmission or event processing, detailed information can be logged, and the logging system can also be used to capture alarms through log analysis. The following example shows the detailed alarm information printed in the log: the log level is warning, and the msg field is alarm_is_activated or alarm_is_deactivated.

view-alarms-log
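For log analysis, a minimal Python sketch such as the one below can pick out alarm records by searching each line for the two msg values named above. It deliberately makes no assumption about the rest of the line, because the exact layout depends on the log formatter in use; the sample lines are invented for illustration.

```python
import re

ALARM_MSG = re.compile(r"\b(alarm_is_activated|alarm_is_deactivated)\b")


def alarm_events(lines):
    """Yield (event, line) for each log line that records an alarm change."""
    for line in lines:
        match = ALARM_MSG.search(line)
        if match:
            yield match.group(1), line.rstrip("\n")


# Hypothetical log lines; a real EMQX log line will look different.
sample_lines = [
    "2024-01-01T00:00:00 [warning] msg: alarm_is_activated, name: high_cpu_usage\n",
    "2024-01-01T00:00:05 [info] msg: something_else\n",
    "2024-01-01T00:05:00 [warning] msg: alarm_is_deactivated, name: high_cpu_usage\n",
]

for event, line in alarm_events(sample_lines):
    print(event, "|", line)
```

To scan a real log, pass an open file object instead of `sample_lines`.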

Get Alarms via REST API

You can query and manage alarms through the API. Click Alarms on the left navigation menu on the UI to execute this API request. For how to work with EMQX API, see REST API.

view-alarms-api
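The Python sketch below prepares a request for the alarm list using only the standard library. Several details are assumptions to verify against the REST API reference for your EMQX version: the `/alarms` path, the `activated` query parameter, and HTTP Basic authentication with an API key and secret. The base URL and credentials in the usage note are placeholders.

```python
import base64
import urllib.parse
import urllib.request


def build_alarms_request(base_url: str, api_key: str, api_secret: str,
                         activated: bool = True) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for the alarm list."""
    # Assumed query parameter: activated=true for active, false for history.
    query = urllib.parse.urlencode({"activated": str(activated).lower()})
    # Assumed authentication scheme: HTTP Basic with API key and secret.
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/alarms?{query}",
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )
```

To send it, pass the result to `urllib.request.urlopen`, for example with `build_alarms_request("http://localhost:18083/api/v5", "<api-key>", "<api-secret>")`, and decode the response body with `json.load`.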

Alarm Configuration

Alarm configuration includes configuring alarm settings and thresholds. Alarm settings determine how the alarm message is displayed and stored, while alarm thresholds establish limits or values that trigger the alarm when potential problems are detected. The alarm configuration feature allows you to customize the alarm settings and thresholds to meet your business needs.

Configure Alarm Settings

The settings for alarms can only be configured by modifying the configuration items in emqx.conf file. The following table lists the configuration items available for alarm setting configuration.

| Configuration Item | Description | Default Value | Optional Values |
| --- | --- | --- | --- |
| alarm.actions | Actions of writing the alarm to log (console or file) and publishing the alarm as an MQTT message to the system topics $SYS/brokers/<node_name>/alarms/activate and $SYS/brokers/<node_name>/alarms/deactivate. The actions are triggered when the alarm is activated or deactivated. | ["log", "publish"] | - |
| alarm.size_limit | The maximum total number of deactivated alarms to be kept as history. When this limit is exceeded, the oldest deactivated alarms are deleted. | 1000 | 1-3000 |
| alarm.validity_period | Retention time of deactivated alarms. Alarms are not deleted immediately when deactivated but after a period of time. | 24h | - |
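To show how `alarm.size_limit` and `alarm.validity_period` interact, here is a minimal Python sketch of a pruning step over the alarm history. It assumes the retention period is counted from the deactivation time and that the history list is ordered oldest first; the function and its data layout are hypothetical, not EMQX internals.

```python
from datetime import datetime, timedelta


def prune_history(history, now, size_limit=1000,
                  validity_period=timedelta(hours=24)):
    """Return the deactivated alarms that should still be kept.

    `history` is a list of (name, deactivate_at) tuples, oldest first.
    """
    # Drop entries whose retention time has expired.
    fresh = [(name, at) for (name, at) in history if now - at <= validity_period]
    # Keep only the newest `size_limit` entries; the oldest are dropped first.
    return fresh[-size_limit:] if size_limit > 0 else []
```

With the defaults, an alarm deactivated 30 hours ago is removed, while up to 1000 more recent entries are kept.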

Configure Alarm Thresholds via Dashboard

Alarm thresholds can be configured on EMQX Dashboard. There are two ways to launch the Monitoring page for configuring the alarm thresholds:

  1. On the Alarms page, click the Setting button and you will be led to the Monitoring page.
  2. From the left navigation menu, click Management -> Monitoring.

On the Monitoring page, click the Erlang tab. Here you can configure the following items for the system performance of the Erlang Virtual Machine:

monitoring-system

  • Process limit check interval: Specify the time interval for the periodic process limit check. The default value is 30 seconds.

  • Process high watermark: Specify the threshold value of processes that can simultaneously exist at the local node. When the percentage exceeds the specified number, an alarm is raised. The default value is 80 percent.

  • Process low watermark: Specify the threshold value of processes that can simultaneously exist at the local node. When the percentage is lowered to the specified number, an alarm is cleared. The default value is 60 percent.

  • Enable Long GC monitoring: Disabled by default. When enabled, a warning-level log long_gc is emitted and an MQTT message is published to the system topic $SYS/sysmon/long_gc when an Erlang process spends a long time performing garbage collection.

  • Enable Long Schedule monitoring: Enabled by default, which means that when the Erlang VM detects a task that has been scheduled for too long, a warning-level log long_schedule is emitted. You can set the maximum scheduled time for a task in the text box. The default value is 240 milliseconds.

  • Enable Large Heap monitoring: Enabled by default, which means that when an Erlang process consumes a large amount of memory for its heap space, a warning-level log large_heap is emitted, and an MQTT message is published to the system topic $SYS/sysmon/large_heap. You can set the heap size limit in bytes in the text box. The default value is 32 MB.

  • Enable Busy Distribution Port monitoring: Enabled by default, which means that when the Remote Procedure Call (RPC) connection used to communicate with other nodes in the cluster is overloaded, a warning-level log busy_dist_port is emitted, and an MQTT message is published to the system topic $SYS/sysmon/busy_dist_port.

  • Enable Busy Port monitoring: Enabled by default, which means that when a port is overloaded, a warning-level log busy_port is emitted, and an MQTT message is published to the system topic $SYS/sysmon/busy_port.

After you complete the configurations, click Save Changes.

Click the Operating System tab. Here you can configure the following items for the system performance:

monitoring-operating-system-ee

  • The time interval of the periodic CPU check: Specify the time interval for checking the CPU usage. The default value is 60 seconds.

  • CPU high watermark: Specify the threshold value of how much system CPU can be used. When the percentage exceeds the specified value, a corresponding alarm is raised. The default value is 80 percent.

  • CPU low watermark: Specify the threshold value of how much system CPU can be used. When the percentage is lowered to the specified value, the corresponding alarm is cleared. The default value is 60 percent.

  • Mem check interval: Enabled by default. You can specify the time interval for periodic memory checks. The default value is 60 seconds.

  • SysMem high watermark: Specify the threshold for how much system memory can be allocated. When the percentage exceeds the specified value, a corresponding alarm is raised. The default value is 70%.

  • ProcMem high watermark: Specify the threshold for how much system memory can be allocated by one Erlang process. When the percentage exceeds the specified value, a corresponding alarm is raised. The default value is 5%.

After you complete the configurations, click Save Changes.
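The high and low watermark pairs described above (for processes and for CPU) act as a hysteresis band: the alarm is raised once usage exceeds the high watermark and stays active until usage drops below the low watermark. The Python sketch below illustrates that behavior with the default values of 80 and 60 percent. The class and helper are hypothetical, and the exact handling of a reading that equals a watermark is an assumption.

```python
class WatermarkAlarm:
    """Raises above the high watermark and clears below the low watermark."""

    def __init__(self, high: float, low: float):
        if not 0 <= low < high <= 100:
            raise ValueError("expected 0 <= low < high <= 100")
        self.high = high
        self.low = low
        self.active = False

    def update(self, usage_percent: float) -> bool:
        """Feed one reading and return whether the alarm is now active."""
        if not self.active and usage_percent > self.high:
            self.active = True
        elif self.active and usage_percent < self.low:
            self.active = False
        return self.active


def process_usage_percent(process_count: int, process_limit: int) -> float:
    """Process usage as created processes over the maximum limit, in percent."""
    return 100.0 * process_count / process_limit


cpu = WatermarkAlarm(high=80, low=60)
states = [cpu.update(reading) for reading in [50, 85, 70, 65, 59, 70]]
print(states)
```

Readings of 70 and 65 percent keep the alarm active because they are still above the low watermark, which avoids rapid toggling when usage hovers around a single threshold.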

Configure Alarm Thresholds via Configuration Items

You can also configure alarm thresholds by modifying the corresponding configuration items. The following configuration items can currently be modified in the emqx.conf file:

| Configuration Item | Description | Default Value |
| --- | --- | --- |
| sysmon.os.cpu_check_interval | Check interval for CPU usage. | 60s |
| sysmon.os.cpu_high_watermark | The high watermark of the CPU usage, the threshold to activate the alarm. | 80% |
| sysmon.os.cpu_low_watermark | The low watermark of the CPU usage, the threshold to deactivate the alarm. | 60% |
| sysmon.os.mem_check_interval | Check interval for memory usage. | 60s |
| sysmon.os.sysmem_high_watermark | The high watermark of the system memory usage. The alarm will be activated when the total memory occupied reaches this value. | 70% |
| sysmon.os.procmem_high_watermark | The high watermark of the process memory usage. The alarm will be activated when the memory occupied by a single process reaches this value. | 5% |
| sysmon.vm.process_check_interval | Check interval for the number of processes. | 30s |
| sysmon.vm.process_high_watermark | The high watermark of the process occupancy rate. The alarm will be activated when this threshold is reached. Measured as the ratio of the number of created processes to the maximum number limit. | 80% |
| sysmon.vm.process_low_watermark | The low watermark of the process occupancy rate. The alarm will be deactivated when it goes below this threshold. Measured as the ratio of the number of created processes to the maximum number limit. | 60% |
| sysmon.vm.long_gc | Whether to enable Long GC monitoring | disabled |
| sysmon.vm.long_schedule | Whether to enable Long Schedule monitoring | disabled |
| sysmon.vm.large_heap | Whether to enable Large Heap monitoring | disabled |
| sysmon.vm.busy_port | Whether to enable Busy Distribution Port monitoring | true |
| sysmon.top.num_items | Number of top processes per monitoring group | 10 |
| sysmon.top.sample_interval | Check interval for top processes | 2s |
| sysmon.top.max_procs | Stop collecting data when the number of processes in the VM exceeds this value | 1000000 |