Observability

Ozone provides multiple tools to get more information about the current state of the cluster.

Prometheus

Ozone has native Prometheus support. All internal metrics (collected by the Hadoop metrics framework) are published under the /prom HTTP context (for example, under http://localhost:9876/prom for SCM).

The Prometheus endpoint is turned on by default but can be turned off by the hdds.prometheus.endpoint.enabled configuration variable.

In a secure environment the page is guarded with SPNEGO authentication, which is not supported by Prometheus. To enable monitoring in a secure environment, a specific authentication token can be configured.

Example ozone-site.xml:

  <property>
    <name>hdds.prometheus.endpoint.token</name>
    <value>putyourtokenhere</value>
  </property>

Example prometheus configuration:

  scrape_configs:
    - job_name: ozone
      bearer_token: <putyourtokenhere>
      metrics_path: /prom
      static_configs:
        - targets:
            - "127.0.0.1:9876"
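
To check the token setup by hand, the request that Prometheus sends can be imitated from a small script. The following is a minimal sketch, assuming the token is passed the way Prometheus passes a bearer_token (an "Authorization: Bearer ..." header). The helper names build_prom_request and fetch_prom are hypothetical and not part of Ozone; the address and token are placeholders.

```python
import urllib.request


def build_prom_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the /prom endpoint carrying a bearer token.

    Hypothetical helper: it only mirrors the header that Prometheus sends
    when `bearer_token` is set in a scrape config.
    """
    url = base_url.rstrip("/") + "/prom"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_prom(base_url: str, token: str, timeout: float = 5.0) -> str:
    """Fetch the metrics text from a running service (needs network access)."""
    request = build_prom_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Against a running SCM, fetch_prom("http://127.0.0.1:9876", "putyourtokenhere") should return the same text that Prometheus scrapes, provided the token matches the one in ozone-site.xml.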

Distributed tracing

Distributed tracing helps to identify performance bottlenecks by visualizing the end-to-end path of a request.

Ozone uses the Jaeger tracing library to collect traces, which can send tracing data to any compatible backend (Zipkin, …).

Tracing is turned off by default, but can be turned on with hdds.tracing.enabled in ozone-site.xml:

  <property>
    <name>hdds.tracing.enabled</name>
    <value>true</value>
  </property>

The Jaeger client can be configured with environment variables, as described in the Jaeger client documentation.

For example:

  JAEGER_SAMPLER_PARAM=0.01
  JAEGER_SAMPLER_TYPE=probabilistic
  JAEGER_AGENT_HOST=jaeger

This configuration will record 1% of the requests to limit the performance overhead. For more information about Jaeger sampling, see the Jaeger documentation.
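
The effect of a probabilistic sampler with parameter 0.01 can be illustrated with a short simulation. This is only a sketch of the general idea (compare a random trace ID with a boundary derived from the rate); it is not Jaeger's code, and the function names and the 63-bit ID width are assumptions made for the example.

```python
import random


def is_sampled(trace_id: int, rate: float, id_bits: int = 63) -> bool:
    """Return True if a trace with this ID would be recorded.

    Illustrative only: the decision depends solely on the trace ID, so every
    span belonging to the same trace gets the same answer.
    """
    boundary = int(rate * (1 << id_bits))
    return trace_id < boundary


def sampled_fraction(rate: float, requests: int, seed: int = 42) -> float:
    """Simulate `requests` random trace IDs and return the recorded fraction."""
    rng = random.Random(seed)
    sampled = sum(is_sampled(rng.getrandbits(63), rate) for _ in range(requests))
    return sampled / requests


if __name__ == "__main__":
    # With rate 0.01 the printed fraction should be close to 0.01.
    print(sampled_fraction(0.01, 100_000))
```

Raising the rate records more traces at the cost of more overhead; a rate of 1.0 would record every request in this model.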

ozone insight

Ozone insight is a Swiss Army knife tool for checking the current state of an Ozone cluster. It can show logging, metrics and configuration for a particular component.

To check the available components use ozone insight list:

  > ozone insight list
  Available insight points:

    scm.node-manager                     SCM Datanode management related information.
    scm.replica-manager                  SCM closed container replication manager
    scm.event-queue                      Information about the internal async event delivery
    scm.protocol.block-location          SCM Block location protocol endpoint
    scm.protocol.container-location      SCM Container location protocol endpoint
    scm.protocol.security                SCM Block location protocol endpoint
    om.key-manager                       OM Key Manager
    om.protocol.client                   Ozone Manager RPC endpoint
    datanode.pipeline                    More information about one ratis datanode ring.

Configuration

ozone insight config can show configuration related to a specific component (supported only for selected components).

  > ozone insight config scm.replica-manager
  Configuration for `scm.replica-manager` (SCM closed container replication manager)

  >>> hdds.scm.replication.thread.interval
         default: 300s
         current: 300s

  There is a replication monitor thread running inside SCM which takes care of replicating the containers in the cluster. This property is used to configure the interval in which that thread runs.

  >>> hdds.scm.replication.event.timeout
         default: 30m
         current: 30m

  Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
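
The current values can also be read without the tool, since the services expose their configuration over HTTP on the /conf endpoint. The following is a minimal sketch, assuming that endpoint returns the Hadoop-style XML configuration format (a configuration element containing property entries with name and value children). The function name parse_conf_xml and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET


def parse_conf_xml(xml_text: str, prefix: str = "") -> dict:
    """Return {name: value} for every property whose name starts with `prefix`.

    Assumes Hadoop-style XML: <configuration><property><name/><value/>...
    """
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None and name.startswith(prefix):
            result[name] = prop.findtext("value")
    return result


# Sample input for illustration; the last property name is made up.
SAMPLE_CONF = """<configuration>
  <property><name>hdds.scm.replication.thread.interval</name><value>300s</value></property>
  <property><name>hdds.scm.replication.event.timeout</name><value>30m</value></property>
  <property><name>example.unrelated.key</name><value>x</value></property>
</configuration>"""

if __name__ == "__main__":
    for key, value in parse_conf_xml(SAMPLE_CONF, "hdds.scm.replication.").items():
        print(f"{key} = {value}")
```

Unlike ozone insight config, this sketch shows only the current values; the defaults and descriptions in the output above come from the tool itself.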

Metrics

ozone insight metrics can show metrics related to a specific component (supported only for selected components).

  > ozone insight metrics scm.protocol.block-location
  Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)

  RPC connections

    Open connections: 0
    Dropped connections: 0
    Received bytes: 1267
    Sent bytes: 2420

  RPC queue

    RPC average queue time: 0.0
    RPC call queue length: 0

  RPC performance

    RPC processing time average: 0.0
    Number of slow calls: 0

  Message type counters

    Number of AllocateScmBlock: ???
    Number of DeleteScmKeyBlocks: ???
    Number of GetScmInfo: ???
    Number of SortDatanodes: ???

Logs

ozone insight logs can connect to the required service and show the DEBUG/TRACE logs related to one specific component. For example, to display RPC messages:

  > ozone insight logs om.protocol.client
  [OM] 2020-07-28 12:31:49,988 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol ServiceList request is received
  [OM] 2020-07-28 12:31:50,095 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol CreateVolume request is received

With the -v flag, the content of the protobuf messages can also be displayed (TRACE level log):

  > ozone insight logs -v om.protocol.client
  [OM] 2020-07-28 12:33:28,463 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is received:
  cmdType: CreateVolume
  traceID: ""
  clientId: "client-A31DF5C6ECF2"
  createVolumeRequest {
    volumeInfo {
      adminName: "hadoop"
      ownerName: "hadoop"
      volume: "vol1"
      quotaInBytes: 1152921504606846976
      volumeAcls {
        type: USER
        name: "hadoop"
        rights: "200"
        aclScope: ACCESS
      }
      volumeAcls {
        type: GROUP
        name: "users"
        rights: "200"
        aclScope: ACCESS
      }
      creationTime: 1595939608460
      objectID: 0
      updateID: 0
      modificationTime: 0
    }
  }
  [OM] 2020-07-28 12:33:28,474 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is processed. Response:
  cmdType: CreateVolume
  traceID: ""
  success: false
  message: "Volume already exists"
  status: VOLUME_ALREADY_EXISTS

Under the hood, ozone insight uses HTTP endpoints (/conf, /prom and /logLevel) to retrieve the required information. It is not yet supported in secure environments.
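
Because these are plain HTTP endpoints, their raw output can also be inspected directly in a non-secure cluster. The following is a minimal sketch of reading the /prom output, assuming it follows the Prometheus text exposition format (comment lines starting with #, then one "name{labels} value" sample per line). The function name parse_prom_text and the metric names in the sample are invented for the example and are not real Ozone metrics.

```python
def parse_prom_text(text: str, prefix: str = "") -> dict:
    """Parse Prometheus text-format samples into {series: float value}.

    Only samples whose series name starts with `prefix` are kept.
    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        end_of_labels = line.rfind("}")
        if end_of_labels != -1:
            # Labels may contain spaces, so split after the closing brace.
            series = line[: end_of_labels + 1]
            rest = line[end_of_labels + 1:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest and series.startswith(prefix):
            # rest[0] is the value; an optional timestamp may follow it.
            metrics[series] = float(rest[0])
    return metrics


# Invented sample data, for illustration only.
SAMPLE_PROM = """# HELP example_rpc_received_bytes Total received bytes
# TYPE example_rpc_received_bytes counter
example_rpc_received_bytes{port="9863",context="rpc server"} 1267
example_rpc_sent_bytes{port="9863"} 2420 1595939608460
example_other_gauge 0.5
"""

if __name__ == "__main__":
    for series, value in parse_prom_text(SAMPLE_PROM, "example_rpc_").items():
        print(series, value)
```

Filtering by a name prefix is a simplification chosen for this sketch; how the tool itself selects the metrics for a component is not described here.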