List of Metrics

Alluxio has two types of metrics: cluster-wide aggregated metrics and per-process detailed metrics.

  • Cluster metrics are collected and calculated by the leading master and displayed in the metrics tab of the web UI. These metrics are designed to provide a snapshot of the cluster state and the overall amount of data and metadata served by Alluxio.

  • Process metrics are collected by each Alluxio process and exposed in a machine-readable format through any configured sinks. Process metrics are highly detailed and are intended to be consumed by third-party monitoring tools. Users can then view fine-grained dashboards with time-series graphs of each metric, such as data transferred or the number of RPC invocations.

Master node metrics have the following format:

  Master.[metricName].[tag1].[tag2]...

Metrics from all other processes have the following format:

  [processType].[metricName].[tag1].[tag2]...[hostName]

There is generally an Alluxio metric for every RPC invocation, to Alluxio or to the under store.

Tags are additional pieces of metadata for the metric, such as the user name or the under storage location. Tags can be used to filter or aggregate metrics by these characteristics.
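The naming scheme above can be illustrated with a small parser. This is only a sketch under stated assumptions: it assumes that tags are written as `key:value` (as the `UFS:` and `User:` fragments in the dynamically generated metric names suggest), that no component contains a literal dot, and that only non-master metrics end with a host name. The `ParsedMetric` class and `parse_metric` function are hypothetical helpers, not part of Alluxio.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ParsedMetric:
    process_type: str
    metric_name: str
    tags: Dict[str, str] = field(default_factory=dict)
    host_name: Optional[str] = None


def parse_metric(full_name: str) -> ParsedMetric:
    """Split a dotted metric name into process type, name, tags, and host."""
    parts = full_name.split(".")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a metric name: {full_name!r}")
    process_type, metric_name, rest = parts[0], parts[1], parts[2:]

    tags: Dict[str, str] = {}
    host_name: Optional[str] = None
    for index, part in enumerate(rest):
        if ":" in part:
            # Assumption: tags look like "key:value".
            key, value = part.split(":", 1)
            tags[key] = value
        elif process_type != "Master" and index == len(rest) - 1:
            # Assumption: the trailing untagged component is the host name.
            host_name = part
        else:
            raise ValueError(f"unexpected component {part!r} in {full_name!r}")
    return ParsedMetric(process_type, metric_name, tags, host_name)


if __name__ == "__main__":
    print(parse_metric("Worker.BytesReadPerUfs.UFS:hdfs_namenode.worker-1"))
    print(parse_metric("Master.CreateFileOps"))
```

The example names passed to `parse_metric` are made up for illustration; real tag values and host names may be escaped differently.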

Cluster Metrics

Workers and clients send metrics data to the Alluxio master through heartbeats. The heartbeat intervals are defined by the properties alluxio.master.worker.heartbeat.interval and alluxio.user.metrics.heartbeat.interval, respectively.

Bytes metrics are values aggregated from workers or clients. Bytes throughput metrics are calculated on the leading master. The value of a bytes throughput metric equals the corresponding bytes counter value divided by the metrics record time, and is shown in bytes per minute.
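The throughput calculation described above can be sketched as follows. The function name `bytes_per_minute` is hypothetical, and the sketch assumes that the "metrics record time" is the elapsed time, in seconds, over which the counter was accumulated; the source does not spell out the exact definition.

```python
def bytes_per_minute(total_bytes: int, elapsed_seconds: float) -> float:
    """Return throughput in bytes per minute for a counter value.

    Assumption: elapsed_seconds is the time span over which total_bytes
    was accumulated.
    """
    if elapsed_seconds <= 0:
        # No time has elapsed yet, so no meaningful rate can be reported.
        return 0.0
    elapsed_minutes = elapsed_seconds / 60.0
    return total_bytes / elapsed_minutes


if __name__ == "__main__":
    # 600 bytes recorded over 2 minutes.
    print(bytes_per_minute(600, 120))
```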

Name | Type | Description
Cluster.BytesReadDomain | COUNTER | Total number of bytes read from Alluxio storage via domain socket, reported by all workers
Cluster.BytesReadDomainThroughput | GAUGE | Bytes read throughput from Alluxio storage via domain socket by all workers
Cluster.BytesReadLocal | COUNTER | Total number of bytes short-circuit read from local storage by all clients
Cluster.BytesReadLocalThroughput | GAUGE | Throughput of bytes short-circuit read from local storage by all clients
Cluster.BytesReadPerUfs | COUNTER | Total number of bytes read from a specific UFS by all workers
Cluster.BytesReadRemote | COUNTER | Total number of bytes read from Alluxio storage, or from the underlying UFS if the data does not exist in Alluxio storage, reported by all workers. This does not include short-circuit local reads and domain socket reads.
Cluster.BytesReadRemoteThroughput | GAUGE | Bytes read throughput from Alluxio storage, or from the underlying UFS if the data does not exist in Alluxio storage, reported by all workers. This does not include short-circuit local reads and domain socket reads.
Cluster.BytesReadUfsAll | COUNTER | Total number of bytes read from all Alluxio UFSes by all workers
Cluster.BytesReadUfsThroughput | GAUGE | Bytes read throughput from all Alluxio UFSes by all workers
Cluster.BytesWrittenDomain | COUNTER | Total number of bytes written to Alluxio storage via domain socket by all workers
Cluster.BytesWrittenDomainThroughput | GAUGE | Throughput of bytes written to Alluxio storage via domain socket by all workers
Cluster.BytesWrittenLocal | COUNTER | Total number of bytes short-circuit written to local storage by all clients
Cluster.BytesWrittenLocalThroughput | GAUGE | Throughput of bytes written to local storage by all clients
Cluster.BytesWrittenPerUfs | COUNTER | Total number of bytes written to a specific Alluxio UFS by all workers
Cluster.BytesWrittenRemote | COUNTER | Total number of bytes written to Alluxio storage in all workers or to the underlying UFS. This does not include short-circuit local writes and domain socket writes.
Cluster.BytesWrittenRemoteThroughput | GAUGE | Bytes write throughput to Alluxio storage in all workers or to the underlying UFS. This does not include short-circuit local writes and domain socket writes.
Cluster.BytesWrittenUfsAll | COUNTER | Total number of bytes written to all Alluxio UFSes by all workers
Cluster.BytesWrittenUfsThroughput | GAUGE | Bytes write throughput to all Alluxio UFSes by all workers
Cluster.CapacityFree | GAUGE | Total free bytes on all tiers, on all workers of Alluxio
Cluster.CapacityTotal | GAUGE | Total capacity (in bytes) on all tiers, on all workers of Alluxio
Cluster.CapacityUsed | GAUGE | Total used bytes on all tiers, on all workers of Alluxio
Cluster.RootUfsCapacityFree | GAUGE | Free capacity of the Alluxio root UFS in bytes
Cluster.RootUfsCapacityTotal | GAUGE | Total capacity of the Alluxio root UFS in bytes
Cluster.RootUfsCapacityUsed | GAUGE | Used capacity of the Alluxio root UFS in bytes
Cluster.Workers | GAUGE | Total number of active workers inside the cluster
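The capacity gauges listed above are often combined into a utilization figure. The following is a minimal sketch, assuming the gauge values have already been collected into a plain dictionary keyed by metric name; the `utilization` helper and the sample values are hypothetical.

```python
from typing import Mapping


def utilization(gauges: Mapping[str, float], prefix: str = "Cluster.Capacity") -> float:
    """Return used/total for a pair of capacity gauges sharing a prefix.

    With the default prefix this reads Cluster.CapacityUsed and
    Cluster.CapacityTotal; pass "Cluster.RootUfsCapacity" for the root UFS.
    """
    total = gauges[prefix + "Total"]
    used = gauges[prefix + "Used"]
    # Guard against an empty cluster that reports zero capacity.
    return used / total if total > 0 else 0.0


if __name__ == "__main__":
    sample = {
        "Cluster.CapacityTotal": 1000,
        "Cluster.CapacityUsed": 250,
        "Cluster.RootUfsCapacityTotal": 4000,
        "Cluster.RootUfsCapacityUsed": 1000,
    }
    print(utilization(sample))
    print(utilization(sample, prefix="Cluster.RootUfsCapacity"))
```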

Master Metrics

Default master metrics:

Name | Type | Description
Master.CompleteFileOps | COUNTER | Total number of CompleteFile operations
Master.CreateDirectoryOps | COUNTER | Total number of CreateDirectory operations
Master.CreateFileOps | COUNTER | Total number of CreateFile operations
Master.DeletePathOps | COUNTER | Total number of Delete operations
Master.DirectoriesCreated | COUNTER | Total number of successful CreateDirectory operations
Master.EdgeCacheEvictions | GAUGE | Total number of edges (inode metadata) that were evicted from the cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.
Master.EdgeCacheHits | GAUGE | Total number of hits in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.
Master.EdgeCacheLoadTimes | GAUGE | Total load times in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.
Master.EdgeCacheMisses | GAUGE | Total number of misses in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.
Master.EdgeCacheSize | GAUGE | Total number of edges (inode metadata) cached. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.
Master.EdgeLockPoolSize | GAUGE | The size of the master edge lock pool
Master.FileBlockInfosGot | COUNTER | Total number of successful GetFileBlockInfo operations
Master.FileInfosGot | COUNTER | Total number of successful GetFileInfo operations
Master.FilesCompleted | COUNTER | Total number of successful CompleteFile operations
Master.FilesCreated | COUNTER | Total number of successful CreateFile operations
Master.FilesFreed | COUNTER | Total number of successful FreeFile operations
Master.FilesPersisted | COUNTER | Total number of successfully persisted files
Master.FilesPinned | GAUGE | Total number of currently pinned files
Master.FreeFileOps | COUNTER | Total number of FreeFile operations
Master.GetFileBlockInfoOps | COUNTER | Total number of GetFileBlockInfo operations
Master.GetFileInfoOps | COUNTER | Total number of GetFileInfo operations
Master.GetNewBlockOps | COUNTER | Total number of GetNewBlock operations
Master.InodeCacheEvictions | GAUGE | Total number of inodes that were evicted from the cache.
Master.InodeCacheHits | GAUGE | Total number of hits in the inodes (inode metadata) cache.
Master.InodeCacheLoadTimes | GAUGE | Total load times in the inodes (inode metadata) cache.
Master.InodeCacheMisses | GAUGE | Total number of misses in the inodes (inode metadata) cache.
Master.InodeCacheSize | GAUGE | Total number of inodes (inode metadata) cached.
Master.InodeLockPoolSize | GAUGE | The size of the master inode lock pool
Master.JournalFlushFailure | COUNTER | Total number of failed journal flushes
Master.JournalFlushTimer | TIMER | The timer statistics of journal flush
Master.JournalGainPrimacyTimer | TIMER | The timer statistics of journal gain primacy
Master.LastBackupEntriesCount | GAUGE | The total number of entries written in the last leading master metadata backup
Master.LastBackupRestoreCount | GAUGE | The total number of entries restored from backup when a leading master initializes its metadata
Master.LastBackupRestoreTimeMs | GAUGE | The process time of the last restore from backup
Master.LastBackupTimeMs | GAUGE | The process time of the last backup
Master.ListingCacheSize | GAUGE | The size of the master listing cache
Master.MountOps | COUNTER | Total number of Mount operations
Master.NewBlocksGot | COUNTER | Total number of successful GetNewBlock operations
Master.PathsDeleted | COUNTER | Total number of successful Delete operations
Master.PathsMounted | COUNTER | Total number of successful Mount operations
Master.PathsRenamed | COUNTER | Total number of successful Rename operations
Master.PathsUnmounted | COUNTER | Total number of successful Unmount operations
Master.RenamePathOps | COUNTER | Total number of Rename operations
Master.SetAclOps | COUNTER | Total number of SetAcl operations
Master.SetAttributeOps | COUNTER | Total number of SetAttribute operations
Master.TotalPaths | GAUGE | Total number of files and directories in the Alluxio namespace
Master.UfsJournalCatchupTimer | TIMER | The timer statistics of journal catchup
Master.UfsJournalFailureRecoverTimer | TIMER | The timer statistics of UFS journal failure recovery
Master.UfsJournalInitialReplayTimeMs | GAUGE | The process time of the UFS journal initial replay
Master.UnmountOps | COUNTER | Total number of Unmount operations
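Several master counters come in pairs: one counts all invocations of an operation and the other counts only the successful ones. The sketch below derives a success ratio from a few such pairs. It assumes each successful operation increments its success counter exactly once, which the descriptions suggest but which may not hold for operations that touch several paths at once. `SUCCESS_PAIRS` and `success_ratios` are hypothetical names, and the input is assumed to be a dictionary of counter values keyed by metric name.

```python
from typing import Dict, Mapping

# Attempt counter -> matching success counter, taken from the table above.
SUCCESS_PAIRS = {
    "Master.CompleteFileOps": "Master.FilesCompleted",
    "Master.CreateDirectoryOps": "Master.DirectoriesCreated",
    "Master.CreateFileOps": "Master.FilesCreated",
    "Master.GetFileInfoOps": "Master.FileInfosGot",
}


def success_ratios(counters: Mapping[str, int]) -> Dict[str, float]:
    """Return successes/attempts for every pair with at least one attempt."""
    ratios: Dict[str, float] = {}
    for attempts_name, successes_name in SUCCESS_PAIRS.items():
        attempts = counters.get(attempts_name, 0)
        if attempts == 0:
            # Skip operations that were never invoked to avoid dividing by zero.
            continue
        ratios[attempts_name] = counters.get(successes_name, 0) / attempts
    return ratios


if __name__ == "__main__":
    snapshot = {
        "Master.CreateFileOps": 10,
        "Master.FilesCreated": 9,
        "Master.GetFileInfoOps": 0,
    }
    print(success_ratios(snapshot))
```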

Dynamically generated master metrics:

Metric Name | Description
Master.CapacityTotalTier | Total capacity in tier of the Alluxio file system in bytes
Master.CapacityUsedTier | Used capacity in tier of the Alluxio file system in bytes
Master.CapacityFreeTier | Free capacity in tier of the Alluxio file system in bytes
Master.UfsSessionCount-Ufs: | The total number of currently opened UFS sessions to connect to the given
Master..UFS:.UFS_TYPE:.User: | The details of the UFS RPC operations done by the current master
Master.PerUfsOp.UFS: | The aggregated number of UFS operations run on the UFS by the leading master
Master. | The duration statistics of RPC calls exposed on the leading master

Worker Metrics

Default worker metrics:

Name | Type | Description
Worker.AsyncCacheDuplicateRequests | COUNTER | Total number of duplicated async cache requests received by this worker
Worker.AsyncCacheFailedBlocks | COUNTER | Total number of async cache failed blocks in this worker
Worker.AsyncCacheRemoteBlocks | COUNTER | Total number of blocks that need to be async cached from remote source
Worker.AsyncCacheRequests | COUNTER | Total number of async cache requests received by this worker
Worker.AsyncCacheSucceededBlocks | COUNTER | Total number of async cache succeeded blocks in this worker
Worker.AsyncCacheUfsBlocks | COUNTER | Total number of blocks that need to be async cached from local source
Worker.BlockRemoverBlocksToRemovedCount | COUNTER | The total number of blocks removed from this worker by the asynchronous block remover.
Worker.BlockRemoverRemovingBlocksSize | GAUGE | The size of blocks being removed from this worker by the asynchronous block remover.
Worker.BlockRemoverTryRemoveBlocksSize | GAUGE | The size of blocks to be removed from this worker by the asynchronous block remover.
Worker.BlockRemoverTryRemoveCount | COUNTER | The total number of blocks that the asynchronous block remover tried to remove from this worker.
Worker.BlocksAccessed | COUNTER | Total number of times any one of the blocks in this worker is accessed.
Worker.BlocksCached | GAUGE | Total number of blocks used for caching data in an Alluxio worker
Worker.BlocksCancelled | COUNTER | Total number of aborted temporary blocks in this worker.
Worker.BlocksDeleted | COUNTER | Total number of blocks in this worker deleted by external requests.
Worker.BlocksEvicted | COUNTER | Total number of evicted blocks in this worker.
Worker.BlocksLost | COUNTER | Total number of lost blocks in this worker.
Worker.BlocksPromoted | COUNTER | Total number of times any one of the blocks in this worker moved to a new tier.
Worker.BytesReadDomain | COUNTER | Total number of bytes read from Alluxio storage via domain socket by this worker
Worker.BytesReadDomainThroughput | METER | Bytes read throughput from Alluxio storage via domain socket by this worker
Worker.BytesReadPerUfs | COUNTER | Total number of bytes read from a specific Alluxio UFS by this worker
Worker.BytesReadRemote | COUNTER | Total number of bytes read from Alluxio storage managed by this worker, or from the underlying UFS if data cannot be found in Alluxio storage. This does not include short-circuit local reads and domain socket reads.
Worker.BytesReadRemoteThroughput | METER | Bytes read throughput from Alluxio storage managed by this worker, or from the underlying UFS if data cannot be found in Alluxio storage. This does not include short-circuit local reads and domain socket reads.
Worker.BytesReadUfsThroughput | METER | Bytes read throughput from all Alluxio UFSes by this worker
Worker.BytesWrittenDomain | COUNTER | Total number of bytes written to Alluxio storage via domain socket by this worker
Worker.BytesWrittenDomainThroughput | METER | Throughput of bytes written to Alluxio storage via domain socket by this worker
Worker.BytesWrittenPerUfs | COUNTER | Total number of bytes written to a specific Alluxio UFS by this worker
Worker.BytesWrittenRemote | COUNTER | Total number of bytes written to Alluxio storage or the underlying UFS by this worker. This does not include short-circuit local writes and domain socket writes.
Worker.BytesWrittenRemoteThroughput | METER | Bytes write throughput to Alluxio storage or the underlying UFS by this worker. This does not include short-circuit local writes and domain socket writes.
Worker.BytesWrittenUfsThroughput | METER | Bytes write throughput to all Alluxio UFSes by this worker
Worker.CapacityFree | GAUGE | Total free bytes on all tiers of a specific Alluxio worker
Worker.CapacityTotal | GAUGE | Total capacity (in bytes) on all tiers of a specific Alluxio worker
Worker.CapacityUsed | GAUGE | Total used bytes on all tiers of a specific Alluxio worker

Dynamically generated worker metrics:

Metric Name | Description
Worker.UfsSessionCount-Ufs: | The total number of currently opened UFS sessions to connect to the given
Worker. | The duration statistics of RPC calls exposed on workers

Client Metrics

Each client metric is recorded with the client's local hostname, or with alluxio.user.app.id if that property is configured. If alluxio.user.app.id is configured, multiple clients can be combined into a logical application.

Name | Type | Description
Client.BytesReadLocal | COUNTER | Total number of bytes short-circuit read from local storage by this client
Client.BytesReadLocalThroughput | METER | Throughput of bytes short-circuit read from local storage by this client
Client.BytesWrittenLocal | COUNTER | Total number of bytes short-circuit written to local storage by this client
Client.BytesWrittenLocalThroughput | METER | Throughput of bytes short-circuit written to local storage by this client
Client.BytesWrittenUfs | COUNTER | Total number of bytes written to Alluxio UFS by this client
Client.CacheBytesEvicted | METER | Total number of bytes evicted from the client cache.
Client.CacheBytesReadCache | METER | Total number of bytes read from the client cache.
Client.CacheBytesReadExternal | METER | Total number of bytes read from external storage due to a cache miss on the client cache.
Client.CacheBytesRequestedExternal | METER | Total number of bytes the user requested to read which resulted in a cache miss. This number may be smaller than Client.CacheBytesReadExternal due to chunk reads.
Client.CacheBytesWrittenCache | METER | Total number of bytes written to the client cache.
Client.CacheCleanupGetErrors | COUNTER | Number of failures when cleaning up a failed cache read.
Client.CacheCleanupPutErrors | COUNTER | Number of failures when cleaning up a failed cache write.
Client.CacheCreateErrors | COUNTER | Number of failures when creating a cache in the client cache.
Client.CacheDeleteErrors | COUNTER | Number of failures when deleting cached data in the client cache.
Client.CacheDeleteNonExistingPageErrors | COUNTER | Number of failures when deleting pages due to absence.
Client.CacheDeleteNotReadyErrors | COUNTER | Number of failures when the cache is not ready to delete pages.
Client.CacheDeleteStoreDeleteErrors | COUNTER | Number of failures when deleting pages due to failed deletes in page stores.
Client.CacheGetErrors | COUNTER | Number of failures when getting cached data in the client cache.
Client.CacheGetNotReadyErrors | COUNTER | Number of failures when the cache is not ready to get pages.
Client.CacheGetStoreReadErrors | COUNTER | Number of failures when getting cached data in the client cache due to failed reads from page stores.
Client.CacheHitRate | GAUGE | Cache hit rate: (# bytes read from cache) / (# bytes requested).
Client.CachePages | COUNTER | Total number of pages in the client cache.
Client.CachePagesEvicted | METER | Total number of pages evicted from the client cache.
Client.CachePutAsyncRejectionErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed injection to the async write queue.
Client.CachePutBenignRacingErrors | COUNTER | Number of failures when adding pages due to racing eviction. This error is benign.
Client.CachePutErrors | COUNTER | Number of failures when putting cached data in the client cache.
Client.CachePutEvictionErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed eviction.
Client.CachePutInsufficientSpaceErrors | COUNTER | Number of failures when putting cached data in the client cache due to insufficient space made after eviction.
Client.CachePutNotReadyErrors | COUNTER | Number of failures when the cache is not ready to add pages.
Client.CachePutStoreDeleteErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed deletes in the page store.
Client.CachePutStoreWriteErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed writes to the page store.
Client.CacheSpaceAvailable | GAUGE | Amount of bytes available in the client cache.
Client.CacheSpaceUsed | GAUGE | Amount of bytes used by the client cache.
Client.CacheSpaceUsedCount | COUNTER | Amount of bytes used by the client cache as a counter.
Client.CacheState | COUNTER | State of the cache: 0 (NOT_IN_USE), 1 (READ_ONLY) and 2 (READ_WRITE)
Client.CacheStoreDeleteTimeout | COUNTER | Number of timeouts when deleting pages from the page store.
Client.CacheStoreGetTimeout | COUNTER | Number of timeouts when reading pages from the page store.
Client.CacheStorePutTimeout | COUNTER | Number of timeouts when writing new pages to the page store.
Client.CacheStoreThreadsRejected | COUNTER | Number of I/O thread rejections when submitting tasks to the thread pool, likely due to an unresponsive local file system.
Client.CacheUnremovableFiles | COUNTER | Amount of unusable bytes managed by the client cache.
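The hit rate formula given for Client.CacheHitRate can be reproduced from the byte meters in the same table. This is a sketch under one labeled assumption: the number of bytes requested is taken to be the sum of Client.CacheBytesReadCache and Client.CacheBytesRequestedExternal, which the table does not state explicitly. The function name `cache_hit_rate` is hypothetical.

```python
def cache_hit_rate(bytes_read_cache: int, bytes_requested_external: int) -> float:
    """Return (# bytes read from cache) / (# bytes requested).

    Assumption: bytes requested = bytes served from the cache plus bytes
    whose request resulted in a cache miss.
    """
    bytes_requested = bytes_read_cache + bytes_requested_external
    if bytes_requested == 0:
        # Nothing has been requested yet, so report a hit rate of zero.
        return 0.0
    return bytes_read_cache / bytes_requested


if __name__ == "__main__":
    # 750 bytes served from the cache, 250 bytes requested on cache misses.
    print(cache_hit_rate(750, 250))
```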

Process Common Metrics

The following metrics are collected on each instance (Master, Worker or Client).

JVM Attributes

Metric Name | Description
name | The name of the JVM
uptime | The uptime of the JVM
vendor | The current JVM vendor

Garbage Collector Statistics

Metric Name | Description
PS-MarkSweep.count | Total number of mark-and-sweep collections
PS-MarkSweep.time | The time spent on mark-and-sweep collections
PS-Scavenge.count | Total number of scavenge collections
PS-Scavenge.time | The time spent on scavenge collections

Memory Usage

Alluxio provides overall and detailed memory usage information. Detailed memory usage information of code cache, compressed class space, metaspace, PS Eden space, PS old gen, and PS survivor space is collected in each process.

A subset of the memory usage metrics is listed below:

Metric Name | Description
total.committed | The amount of memory in bytes that is guaranteed to be available for use by the JVM
total.init | The amount of memory in bytes that is available for use by the JVM
total.max | The maximum amount of memory in bytes that is available for use by the JVM
total.used | The amount of memory currently used in bytes
heap.committed | The amount of memory from the heap area guaranteed to be available
heap.init | The amount of memory from the heap area available at initialization
heap.max | The maximum amount of memory from the heap area that is available
heap.usage | The amount of memory from the heap area currently used in GB
heap.used | The amount of memory from the heap area that has been used
pools.Code-Cache.used | Used memory of collection usage from the pool from which memory is used for compilation and storage of native code
pools.Compressed-Class-Space.used | Used memory of collection usage from the pool from which memory is used for class metadata
pools.PS-Eden-Space.used | Used memory of collection usage from the pool from which memory is initially allocated for most objects
pools.PS-Survivor-Space.used | Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden space