Monitoring Checkpointing
Overview
Flink’s web interface provides a tab to monitor the checkpoints of jobs. These stats are also available after the job has terminated. There are four different tabs to display information about your checkpoints: Overview, History, Summary, and Configuration. The following sections will cover all of these in turn.
Monitoring
Overview Tab
The Overview tab lists the following statistics. Note that these statistics don't survive a JobManager loss and are reset if your JobManager fails over.
- Checkpoint Counts
  - Triggered: The total number of checkpoints that have been triggered since the job started.
  - In Progress: The current number of checkpoints that are in progress.
  - Completed: The total number of successfully completed checkpoints since the job started.
  - Failed: The total number of failed checkpoints since the job started.
  - Restored: The number of restore operations since the job started. This also tells you how many times the job has restarted since submission. Note that the initial submission with a savepoint also counts as a restore, and the count is reset if the JobManager was lost during operation.
- Latest Completed Checkpoint: The latest successfully completed checkpoint. Clicking on "More details" gives you detailed statistics down to the subtask level.
- Latest Failed Checkpoint: The latest failed checkpoint. Clicking on "More details" gives you detailed statistics down to the subtask level.
- Latest Savepoint: The latest triggered savepoint with its external path. Clicking on "More details" gives you detailed statistics down to the subtask level.
- Latest Restore: There are two types of restore operations.
  - Restore from Checkpoint: We restored from a regular periodic checkpoint.
  - Restore from Savepoint: We restored from a savepoint.
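The web interface reads these figures from the JobManager's REST API, so you can also collect them from a script. The following is a minimal Python sketch under some assumptions. It assumes the endpoint `/jobs/<job_id>/checkpoints` and a `counts` object with the fields `restored`, `total`, `in_progress`, `completed` and `failed`; verify both against the REST API reference for your Flink version. The helpers `fetch_checkpoint_stats` and `summarize_counts` are hypothetical names used only for illustration. The sketch also derives a restart count from the Restored counter, following the note above that a submission with a savepoint counts as a restore.

```python
import json
import urllib.request


def fetch_checkpoint_stats(rest_url: str, job_id: str) -> dict:
    """Fetch one job's checkpoint statistics as parsed JSON (hypothetical helper).

    Assumption: the JobManager serves them at /jobs/<job_id>/checkpoints,
    e.g. with rest_url="http://localhost:8081".
    """
    url = f"{rest_url.rstrip('/')}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def summarize_counts(stats: dict, submitted_with_savepoint: bool = False) -> dict:
    """Map the assumed 'counts' object onto the Overview tab's checkpoint counts."""
    counts = stats["counts"]  # assumed field names, see the lead-in
    restored = counts["restored"]
    # An initial submission from a savepoint is itself counted as a restore,
    # so it is not a restart. The counter is also reset on JobManager failover.
    restarts = max(restored - (1 if submitted_with_savepoint else 0), 0)
    return {
        "triggered": counts["total"],
        "in_progress": counts["in_progress"],
        "completed": counts["completed"],
        "failed": counts["failed"],
        "restored": restored,
        "restarts": restarts,
    }


if __name__ == "__main__":
    # Offline example with a hand-written response, so no cluster is needed.
    sample = {"counts": {"restored": 2, "total": 12, "in_progress": 1,
                         "completed": 10, "failed": 1}}
    print(summarize_counts(sample, submitted_with_savepoint=True))
```

Against a live cluster, you would pass the result of `fetch_checkpoint_stats("http://localhost:8081", job_id)` to `summarize_counts`. The address and job ID are placeholders.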
History Tab
The checkpoint history keeps statistics about recently triggered checkpoints, including those that are currently in progress.
If the checkpointing mode is AT_LEAST_ONCE, the number of bytes buffered during alignment will always be zero, as at-least-once mode does not require stream alignment.
History Size Configuration
You can configure the number of recent checkpoints that are remembered for the history via the following configuration key. The default is 10.
```yaml
# Number of recent checkpoints that are remembered
web.checkpoints.history: 15
```
Summary Tab
The summary computes simple min/average/maximum statistics over all completed checkpoints for the End to End Duration, State Size, and Bytes Buffered During Alignment (see History for details about what these mean).
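The history and the summary have different scopes. The history keeps only a bounded number of recent checkpoints, while the summary covers all completed checkpoints. A simplified Python model makes the difference concrete. The class and field names (`CheckpointRecord`, `MinAvgMax`, `CheckpointStatsModel`) are hypothetical and are not Flink's internal classes. The `maxlen` of the deque stands in for `web.checkpoints.history`. Running min/sum/count/max accumulators let the summary cover every completed checkpoint without storing them all.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointRecord:
    """Hypothetical per-checkpoint record; field names are illustrative only."""
    checkpoint_id: int
    completed: bool
    end_to_end_ms: int
    state_size_bytes: int
    buffered_bytes: int  # stays 0 in AT_LEAST_ONCE mode (no alignment)


class MinAvgMax:
    """Running min/average/max that uses constant memory."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0
        self.min: Optional[int] = None
        self.max: Optional[int] = None

    def add(self, value: int) -> None:
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def avg(self) -> float:
        return self.total / self.count if self.count else 0.0


class CheckpointStatsModel:
    """Hypothetical model: bounded history plus an unbounded summary."""

    def __init__(self, history_size: int = 10) -> None:
        # Plays the role of web.checkpoints.history (default 10).
        self.history = deque(maxlen=history_size)
        self.summary = {
            "end_to_end_ms": MinAvgMax(),
            "state_size_bytes": MinAvgMax(),
            "buffered_bytes": MinAvgMax(),
        }

    def report(self, record: CheckpointRecord) -> None:
        # When the deque is full, appending evicts the oldest entry.
        self.history.append(record)
        # Only completed checkpoints contribute to the summary statistics.
        if record.completed:
            for field_name, stat in self.summary.items():
                stat.add(getattr(record, field_name))
```

Consider `history_size=2` with three reported checkpoints: 1 (completed), 2 (failed) and 3 (completed). The history then holds only 2 and 3, but the end-to-end summary still includes checkpoint 1 (min 100, max 300, average 200 for durations of 100 and 300 ms). The model reports each checkpoint once, in its final state. The real history also shows in-progress checkpoints and updates them as they finish.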