Monitoring Checkpointing

Overview

Flink’s web interface provides a tab to monitor the checkpoints of jobs. These stats are also available after the job has terminated. There are four different tabs to display information about your checkpoints: Overview, History, Summary, and Configuration. The following sections will cover all of these in turn.

Monitoring

Overview Tab

The overview tabs lists the following statistics. Note that these statistics don’t survive a JobManager loss and are reset to if your JobManager fails over.

  • Checkpoint Counts
    • Triggered: The total number of checkpoints that have been triggered since the job started.
    • In Progress: The current number of checkpoints that are in progress.
    • Completed: The total number of successfully completed checkpoints since the job started.
    • Failed: The total number of failed checkpoints since the job started.
    • Restored: The number of restore operations since the job started. This also tells you how many times the job has restarted since submission. Note that the initial submission with a savepoint also counts as a restore and the count is reset if the JobManager was lost during operation.
  • Latest Completed Checkpoint: The latest successfully completed checkpoints. Clicking on More details gives you detailed statistics down to the subtask level.
  • Latest Failed Checkpoint: The latest failed checkpoint. Clicking on More details gives you detailed statistics down to the subtask level.
  • Latest Savepoint: The latest triggered savepoint with its external path. Clicking on More details gives you detailed statistics down to the subtask level.
  • Latest Restore: There are two types of restore operations.
    • Restore from Checkpoint: We restored from a regular periodic checkpoint.
    • Restore from Savepoint: We restored from a savepoint.

History Tab

The checkpoint history keeps statistics about recently triggered checkpoints, including those that are currently in progress.

Checkpoint Monitoring: History
- ID: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.- Status: The current status of the checkpoint, which is either In Progress (), Completed (), or Failed (). If the triggered checkpoint is a savepoint, you will see a symbol.- Trigger Time: The time when the checkpoint was triggered at the JobManager.- Latest Acknowledgement: The time when the latest acknowledged for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).- End to End Duration: The duration from the trigger timestamp until the latest acknowledgement (or n/a if no acknowledgement received yet). This end to end duration for a complete checkpoint is determined by the last subtask that acknowledges the checkpoint. This time is usually larger than single subtasks need to actually checkpoint the state.- State Size: The state size over all acknowledged subtasks.- Buffered During Alignment: The number of bytes buffered during alignment over all acknowledged subtasks. This is only > 0 if a stream alignment takes place during checkpointing. If the checkpointing mode is ATLEAST_ONCE this will always be zero as at least once mode does not require stream alignment.#### History Size ConfigurationYou can configure the number of recent checkpoints that are remembered for the history via the following configuration key. The default is 10.
  1. # Number of recent checkpoints that are rememberedweb.checkpoints.history: 15
### Summary TabThe summary computes a simple min/average/maximum statistics over all completed checkpoints for the End to End Duration, State Size, and Bytes Buffered During Alignment (see History for details about what these mean).
Checkpoint Monitoring: Summary
Note that these statistics don’t survive a JobManager loss and are reset to if your JobManager fails over.### Configuration TabThe configuration list your streaming configuration:- Checkpointing Mode: Either _Exactly Once
or At least Once.- Interval: The configured checkpointing interval. Trigger checkpoints in this interval.- Timeout: Timeout after which a checkpoint is cancelled by the JobManager and a new checkpoint is triggered.- Minimum Pause Between Checkpoints: Minimum required pause between checkpoints. After a checkpoint has completed successfully, we wait at least for this amount of time before triggering the next one, potentially delaying the regular interval.- Maximum Concurrent Checkpoints: The maximum number of checkpoints that can be in progress concurrently.- Persist Checkpoints Externally: Enabled or Disabled. If enabled, furthermore lists the cleanup config for externalized checkpoints (delete or retain on cancellation).### Checkpoint DetailsWhen you click on a More details link for a checkpoint, you get a Minimum/Average/Maximum summary over all its operators and also the detailed numbers per single subtask.
Checkpoint Monitoring: Details

Summary per Operator

Checkpoint Monitoring: Details Summary

All Subtask Statistics

Checkpoint Monitoring: Subtasks