Metrics System


Metrics provide insight into what is going on in the cluster. They are an invaluable resource for monitoring and debugging. Alluxio has a configurable metrics system based on the Coda Hale Metrics Library. In the metrics system, sources generate metrics, and sinks consume these metrics. The metrics system polls sources periodically and passes metric records to sinks.
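The source-and-sink flow described above can be pictured with a small sketch. It is purely illustrative and is not Alluxio's implementation, which builds on the Java Coda Hale library; the names `MetricsSystem` and `poll_once`, and the use of plain callables as sources and sinks, are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are not Alluxio classes.
Source = Callable[[], Dict[str, float]]    # returns the current metric values
Sink = Callable[[Dict[str, float]], None]  # consumes one batch of metric records


class MetricsSystem:
    """Polls every registered source and hands the merged records to every sink."""

    def __init__(self) -> None:
        self.sources: List[Source] = []
        self.sinks: List[Sink] = []

    def poll_once(self) -> Dict[str, float]:
        records: Dict[str, float] = {}
        for source in self.sources:
            records.update(source())
        for sink in self.sinks:
            sink(dict(records))  # each sink receives its own copy
        return records


# One source reporting a single gauge, one sink that just stores what it receives.
system = MetricsSystem()
system.sources.append(lambda: {"Master.Workers": 3.0})
received: List[Dict[str, float]] = []
system.sinks.append(received.append)
system.poll_once()
```

A real metrics system would call something like `poll_once` on a timer, once per configured polling period.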

Alluxio’s metrics are partitioned into different instances corresponding to Alluxio components. Within each instance, users can configure a set of sinks to which metrics are reported. The following instances are currently supported:

  • Client: Any process with the Alluxio client library.
  • Master: The Alluxio master process.
  • Worker: The Alluxio worker process.

A sink specifies where metrics are delivered. Each instance can report to zero or more sinks:

  • ConsoleSink: Outputs metrics values to the console.
  • CsvSink: Exports metrics data to CSV files at regular intervals.
  • JmxSink: Registers metrics for viewing in a JMX console.
  • GraphiteSink: Sends metrics to a Graphite server.
  • MetricsServlet: Adds a servlet to the web UI that serves metrics data as JSON.

Configuration

The metrics system is configured via a configuration file that Alluxio expects to be present at $ALLUXIO_HOME/conf/metrics.properties. A custom file location can be specified via the alluxio.metrics.conf.file configuration property. Alluxio provides a metrics.properties.template under the conf directory, which includes all configurable properties and guidance on how to specify each property.

Default HTTP JSON Sink

By default, MetricsServlet is enabled on the Alluxio leading master and workers.

You can send an HTTP request to /metrics/json/ of the Alluxio leading master to get a snapshot of all metrics in JSON format. Metrics on the Alluxio leading master contain its own instance metrics and a summary of the cluster-wide aggregated metrics.

  # Get the metrics in JSON format from the Alluxio leading master
  $ curl <LEADING_MASTER_HOSTNAME>:<MASTER_WEB_PORT>/metrics/json
  # For example, get the metrics from a master process running locally with the default web port
  $ curl 127.0.0.1:19999/metrics/json/

Send an HTTP request to /metrics/json/ of the active Alluxio workers to get per-worker metrics.

  # Get the metrics in JSON format from an active Alluxio worker
  $ curl <WORKER_HOSTNAME>:<WORKER_WEB_PORT>/metrics/json
  # For example, get the metrics from a worker process running locally with the default web port
  $ curl 127.0.0.1:30000/metrics/json/
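For programmatic access, the same endpoint can be read from a script. The sketch below assumes the response follows the Coda Hale JSON layout, with top-level "gauges" and "counters" objects whose entries look like {"value": ...} and {"count": ...}; verify this against your own cluster's output. The helper names `fetch_metrics` and `flatten` are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_metrics(host: str, port: int, timeout: float = 5.0) -> Dict[str, Any]:
    """GET /metrics/json/ from a master or worker web port and decode the body."""
    with urlopen(f"http://{host}:{port}/metrics/json/", timeout=timeout) as resp:
        return json.load(resp)


def flatten(snapshot: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse gauges and counters into one {metric name: value} mapping.

    Assumes a gauge is {"value": ...} and a counter is {"count": ...};
    adjust the keys if your payload differs.
    """
    flat: Dict[str, Any] = {}
    for name, body in snapshot.get("gauges", {}).items():
        flat[name] = body.get("value")
    for name, body in snapshot.get("counters", {}).items():
        flat[name] = body.get("count")
    return flat


# Hand-written sample payload (not real cluster output).
sample = json.loads(
    '{"gauges": {"Master.CapacityTotal": {"value": 1024}},'
    ' "counters": {"Master.CreateFileOps": {"count": 7}}}'
)
flat_sample = flatten(sample)
```

Against a local master with the default web port, this would be called as `flatten(fetch_metrics("127.0.0.1", 19999))`.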

Sample CSV Sink Setup

This section gives an example of writing collected metrics to a CSV file.

First, create the polling directory for CsvSink (if it does not already exist):

  $ mkdir /tmp/alluxio-metrics

In the metrics properties file ($ALLUXIO_HOME/conf/metrics.properties by default), add the following properties:

  # Enable CsvSink
  sink.csv.class=alluxio.metrics.sink.CsvSink
  # Polling period for CsvSink
  sink.csv.period=1
  sink.csv.unit=seconds
  # Polling directory for CsvSink, ensure this directory exists!
  sink.csv.directory=/tmp/alluxio-metrics

If Alluxio is deployed in a cluster, this file needs to be distributed to all nodes.

After starting Alluxio, the CSV files containing metrics can be found in the directory given by sink.csv.directory. Each filename corresponds to a metric name.

Refer to metrics.properties.template for all possible sink-specific configurations.
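Once the CSV files exist, a script can pick up the most recent sample of a metric. The sketch below assumes each file starts with a header row and appends one row per polling period, with columns such as "t" and "value"; check the actual header in your files. The helper `latest_row` and the sample file contents are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Optional


def latest_row(csv_path: Path) -> Optional[Dict[str, str]]:
    """Return the last data row of a metric CSV as {column: value}, or None if empty."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    return rows[-1] if rows else None


# Demonstrate on a hand-written file; the column names here are assumptions.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Master.Workers.csv"
    sample.write_text("t,value\n1700000000,2\n1700000001,3\n")
    last = latest_row(sample)
```

This reads the whole file into memory, which is fine for a quick check but may be worth replacing with a tail-style read for files that have been growing for a long time.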

Metric Types

Each metric falls into one of the following metric types:

  • Gauge: Records a value
  • Meter: Measures the rate of events over time (e.g., “requests per second”)
  • Counter: Measures the number of times an event occurs
  • Timer: Measures both the rate that a particular event is called and the distribution of its duration

For more details about the metric types, please refer to the metrics library documentation.
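To make the difference between a counter and a meter concrete, here is a minimal sketch. It is not the metrics library's implementation: this `Meter` reports only a mean rate since creation, and the injectable `clock` parameter is an assumption added so the example is deterministic.

```python
import time
from typing import Callable


class Counter:
    """Counts how many times an event has occurred."""

    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n


class Meter:
    """Tracks the mean rate of events (per second) since the meter was created."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n: int = 1) -> None:
        self.count += n

    def mean_rate(self) -> float:
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


# Drive the meter with a fake clock: 10 events over 5 seconds.
now = [0.0]
requests = Meter(clock=lambda: now[0])
requests.mark(10)
now[0] = 5.0
rate = requests.mean_rate()

created = Counter()
created.inc()
created.inc(2)
```

A gauge would simply return the current value of some quantity when read, and a timer combines a meter for call rate with a distribution of durations.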

Alluxio Metrics

There are two types of metrics in Alluxio: cluster-wide aggregated metrics and per-process detailed metrics.

Cluster metrics are collected by the leading master and displayed in the metrics tab of the web UI. These metrics are designed to provide a snapshot of the cluster state and the overall amount of data and metadata served by Alluxio.

Process metrics are collected by each Alluxio process and exposed in a machine-readable format through any configured sinks. Process metrics are highly detailed and are intended to be consumed by third-party monitoring tools. Users can then view fine-grained dashboards with time-series graphs of each metric, such as data transferred or the number of RPC invocations.

Metrics in Alluxio have the following format for master node metrics:

master.[metricName].[tag1].[tag2]…

Metrics in Alluxio have the following format for non-master node metrics:

[processType].[hostName].[metricName].[tag1].[tag2]…

There is generally an Alluxio metric for every RPC invocation, to Alluxio or to the under store.

Tags are additional pieces of metadata for the metric such as user name or under storage location. Tags can be used to further filter or aggregate on various characteristics.
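The two name formats above can be assembled with a small helper. This is a sketch of the naming scheme as described here, not code from Alluxio; the function `metric_name` and the example host and tag values are hypothetical.

```python
from typing import Optional, Sequence


def metric_name(process_type: str, name: str,
                host: Optional[str] = None, tags: Sequence[str] = ()) -> str:
    """Join the parts with dots; master metrics omit the host name, others include it."""
    parts = [process_type]
    if host is not None:
        parts.append(host)
    parts.append(name)
    parts.extend(tags)
    return ".".join(parts)


# master.[metricName]
master_name = metric_name("master", "CreateFileOps")
# [processType].[hostName].[metricName].[tag1] with a made-up host and tag
worker_name = metric_name("worker", "BytesReadAlluxio",
                          host="host-1", tags=["User:alice"])
```

Because host names and tag values can themselves contain dots, splitting a full name back into its parts is ambiguous in general; building names is the safer direction.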

Cluster Metrics

Master Metrics

Workers and clients send metrics data to the Alluxio master through heartbeats. The heartbeat intervals are defined by the properties alluxio.master.worker.heartbeat.interval and alluxio.user.metrics.heartbeat.interval, respectively.

Each client will be assigned an application id. All the metrics sent by this client contain the client application id information. By default, this will be in the form of ‘app-[random number]’. This value can be configured through the property alluxio.user.app.id, so multiple clients can be combined into a logical application.

  • Alluxio cluster information
Metric Name | Description
Master.Workers | Total number of active Alluxio workers in this cluster
  • Alluxio storage capacity
Metric Name | Description
Master.CapacityTotal | Total capacity of the Alluxio file system in bytes
Master.CapacityTotalTier | Total capacity in tier of the Alluxio file system in bytes
Master.CapacityUsed | Used capacity of the file system in bytes
Master.CapacityUsedTier | Used capacity in tier of the Alluxio file system in bytes
Master.CapacityFree | Free capacity of the Alluxio file system in bytes
Master.CapacityFreeTier | Free capacity in tier of the Alluxio file system in bytes
  • Under storage capacity
Metric Name | Description
Master.UfsCapacityTotal | Total capacity of the under file system in bytes
Master.UfsCapacityUsed | Used capacity of the under file system in bytes
Master.UfsCapacityFree | Free capacity of the under file system in bytes
  • Total amount of data transferred through Alluxio and I/O throughput estimates (meter statistics)
Metric Name | Description
cluster.BytesReadAlluxio | Total number of bytes read from Alluxio storage. This does not include UFS reads
cluster.BytesReadAlluxioThroughput | Bytes read throughput from Alluxio storage
cluster.BytesReadDomain | Total number of bytes read from Alluxio storage via domain socket
cluster.BytesReadDomainThroughput | Bytes read throughput from Alluxio storage via domain socket
cluster.BytesReadLocal | Total number of bytes read from local filesystem
cluster.BytesReadLocalThroughput | Bytes read throughput from local filesystem
cluster.BytesWrittenAlluxio | Total number of bytes written to Alluxio storage. This does not include UFS writes
cluster.BytesWrittenAlluxioThroughput | Bytes write throughput to Alluxio storage
cluster.BytesWrittenDomain | Total number of bytes written to Alluxio storage via domain socket
cluster.BytesWrittenDomainThroughput | Throughput of bytes written to Alluxio storage via domain socket
  • I/O to under storages
Metric Name | Description
cluster.BytesReadUfsAll | Total number of bytes read from all Alluxio UFSes
cluster.BytesReadUfsThroughput | Bytes read throughput from all Alluxio UFSes
cluster.BytesWrittenUfsAll | Total number of bytes written to all Alluxio UFSes
cluster.BytesWrittenUfsThroughput | Bytes write throughput to all Alluxio UFSes
  • Under storage RPCs

For all UFS RPCs (e.g., create file, delete file, get file status), a timer metric is recorded for each RPC, along with a failure counter if any failures occur.

For example, cluster.UfsOp<RPC_NAME>.UFS:<UFS_ADDRESS> records the number of times the given UFS operation ran on the given UFS.
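As one way to consume the capacity gauges listed above, the sketch below derives a utilization fraction from a pair of "Total" and "Used" values. The function `utilization` and its `prefix` parameter are hypothetical, and the input is assumed to be a plain mapping of metric names to numbers that you have already collected.

```python
from typing import Dict, Optional


def utilization(gauges: Dict[str, float],
                prefix: str = "Master.Capacity") -> Optional[float]:
    """Return used/total as a fraction, or None if total is missing or zero."""
    total = gauges.get(f"{prefix}Total")
    used = gauges.get(f"{prefix}Used")
    if not total or used is None:
        return None
    return used / total


# Made-up numbers for illustration.
alluxio_util = utilization({"Master.CapacityTotal": 200, "Master.CapacityUsed": 50})
ufs_util = utilization(
    {"Master.UfsCapacityTotal": 0, "Master.UfsCapacityUsed": 0},
    prefix="Master.UfsCapacity",
)
```

Returning None instead of raising keeps a dashboard script running when an under store reports no capacity information.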

Master Metrics

  • Master summary information
Metric Name | Description
Master.TotalPaths | Total number of files and directories in the Alluxio namespace
Master.UfsSessionCount-Ufs: | The total number of currently opened UFS sessions to connect to the given
  • Master Logical operations and results
Metric Name | Description
Master.CreateFileOps | Total number of CreateFile operations
Master.FilesCreated | Total number of successful CreateFile operations
Master.CompleteFileOps | Total number of CompleteFile operations
Master.FilesCompleted | Total number of successful CompleteFile operations
Master.GetFileInfoOps | Total number of GetFileInfo operations
Master.FileInfosGot | Total number of successful GetFileInfo operations
Master.GetFileBlockInfoOps | Total number of GetFileBlockInfo operations
Master.FileBlockInfosGot | Total number of successful GetFileBlockInfo operations
Master.FreeFileOps | Total number of FreeFile operations
Master.FilesFreed | Total number of successful FreeFile operations
Master.FilesPersisted | Total number of successfully persisted files
Master.FilesPinned | Total number of currently pinned files
Master.CreateDirectoryOps | Total number of CreateDirectory operations
Master.DirectoriesCreated | Total number of successful CreateDirectory operations
Master.DeletePathOps | Total number of Delete operations
Master.PathsDeleted | Total number of successful Delete operations
Master.GetNewBlockOps | Total number of GetNewBlock operations
Master.NewBlocksGot | Total number of successful GetNewBlock operations
Master.MountOps | Total number of Mount operations
Master.PathsMounted | Total number of successful Mount operations
Master.UnmountOps | Total number of Unmount operations
Master.PathsUnmounted | Total number of successful Unmount operations
Master.RenamePathOps | Total number of Rename operations
Master.PathsRenamed | Total number of successful Rename operations
Master.SetAclOps | Total number of SetAcl operations
Master.SetAttributeOps | Total number of SetAttribute operations

All Alluxio filesystem client operations come with a retry mechanism. Master metrics record how many times an operation was retried (in the format Master.<RPC_NAME>Retries) and how many failures it ran into (in the format Master.<RPC_NAME>Failures).

  • Master timer metrics
Metric Name | Description
Master.blockHeartbeat.User | The duration statistics of BlockHeartbeat operations
Master.ConnectFromMaster.UFS:.UFS_TYPE: | The duration statistics of connecting from master to UFS
Master.GetSpace.UFS:.UFS_TYPE: | The duration statistics of getting space of UFS
Master.getConfigHash | The duration statistics of getting hashes of cluster and path level configuration
Master.getConfiguration | The duration statistics of getting cluster level and path level configuration
Master.getPinnedFileIds | The duration statistics of getting the ids of pinned files
Master.getWorkerId | The duration statistics of getting worker id
Master.registerWorker | The duration statistics of registering worker to master
  • Other Master metrics
Metric Name | Description
Master.LastBackupEntriesCount | The total number of entries written in last leading master metadata backup
Master.BackupEntriesProcessTime | The process time of the last backup
Master.LastBackupRestoreCount | The total number of entries restored from backup when a leading master initializes its metadata
Master.BackupRestoreProcessTime | The process time of the last restore from backup
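Several of the logical operation metrics above come in pairs: a count of attempted operations and a count of successful ones. The sketch below subtracts one from the other to estimate how many attempts did not succeed. Treat the result as a rough figure, since it assumes both counters were sampled at the same moment and it may include operations still in flight. The names `OP_PAIRS` and `unsuccessful_ops` are hypothetical, and only three pairs are shown.

```python
from typing import Dict, Tuple

# (attempt counter, success counter) pairs taken from the table above.
OP_PAIRS: Dict[str, Tuple[str, str]] = {
    "CreateFile": ("Master.CreateFileOps", "Master.FilesCreated"),
    "CompleteFile": ("Master.CompleteFileOps", "Master.FilesCompleted"),
    "Mount": ("Master.MountOps", "Master.PathsMounted"),
}


def unsuccessful_ops(counters: Dict[str, int]) -> Dict[str, int]:
    """Attempts minus successes per operation; skips pairs missing from the input."""
    result: Dict[str, int] = {}
    for op, (attempts, successes) in OP_PAIRS.items():
        if attempts in counters and successes in counters:
            result[op] = counters[attempts] - counters[successes]
    return result


# Made-up counter values; MountOps has no matching success counter here.
gaps = unsuccessful_ops({
    "Master.CreateFileOps": 10,
    "Master.FilesCreated": 8,
    "Master.MountOps": 2,
})
```

Where the Master.<RPC_NAME>Failures counters are available, they are the more direct source for failure counts.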

Worker Metrics

Metric Name | Description
Worker..CapacityTotal | Total capacity of this worker in bytes
Worker..CapacityUsed | Used capacity of this worker in bytes
Worker..CapacityFree | Free capacity of this worker in bytes
Worker..BlocksCached | Total number of blocks in Alluxio worker storage
Worker..BlocksAccessed | Total number of times blocks in this worker are accessed
Worker..BlocksCanceled | Total number of aborted temporary blocks
Worker..BlocksDeleted | Total number of deleted blocks
Worker..BlocksEvicted | Total number of blocks removed by this worker
Worker..BlocksLost | Total number of lost blocks
Worker..BlocksPromoted | Total number of blocks moved by clients from one location to another in this worker
Worker..AsyncCacheRequests | Total number of async cache requests
Worker..AsyncCacheDuplicateRequests | Total number of duplicate requests to cache the same block asynchronously
Worker..AsyncCacheSucceededBlocks | Total number of blocks successfully cached asynchronously
Worker..AsyncCacheFailedBlocks | Total number of blocks that failed to be cached asynchronously
Worker..AsyncCacheUfsBlocks | Total number of async cache blocks which have local source
Worker..AsyncCacheRemoteBlocks | Total number of remote blocks to async cache locally
Process Common Metrics

The following metrics are collected on each instance (Master, Worker or Client).

  • JVM attributes
Metric Name | Description
name | The name of the JVM
uptime | The uptime of the JVM
vendor | The current JVM vendor
  • Garbage collector statistics
Metric Name | Description
PS-MarkSweep.count | Total number of mark-and-sweep collections
PS-MarkSweep.time | The time spent on mark-and-sweep collections
PS-Scavenge.count | Total number of scavenge collections
PS-Scavenge.time | The time spent on scavenge collections
  • Memory usage

Alluxio provides overall and detailed memory usage information. Detailed memory usage information of code cache, compressed class space, metaspace, PS Eden space, PS old gen, and PS survivor space is collected in each process.

A subset of the memory usage metrics is listed below:

Metric Name | Description
total.committed | The amount of memory in bytes that is guaranteed to be available for use by the JVM
total.init | The amount of memory in bytes that is available for use by the JVM
total.max | The maximum amount of memory in bytes that is available for use by the JVM
total.used | The amount of memory currently used in bytes
heap.committed | The amount of memory from heap area guaranteed to be available
heap.init | The amount of memory from heap area available at initialization
heap.max | The maximum amount of memory from heap area that is available
heap.usage | The amount of memory from heap area currently used in GB
heap.used | The amount of memory from heap area that has been used
pools.Code-Cache.used | Used memory of collection usage from the pool from which memory is used for compilation and storage of native code
pools.Compressed-Class-Space.used | Used memory of collection usage from the pool from which memory is used for class metadata
pools.PS-Eden-Space.used | Used memory of collection usage from the pool from which memory is initially allocated for most objects
pools.PS-Survivor-Space.used | Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden space