Monitoring replication slave

Note: this recipe works with ArangoDB 2.5. You need a collectd curl_json plugin with correct boolean type mapping.

Problem

How to monitor the slave status using the collectd curl_json plugin.

Solution

Since ArangoDB reports the replication status in JSON, integrating it with the collectd curl_json plugin should be an easy exercise. However, only very recent versions of collectd handle boolean flags correctly.

Our test master/slave setup runs with the master listening on tcp://127.0.0.1:8529 and the slave (which we query) listening on tcp://127.0.0.1:8530. They replicate a database named testDatabase.

Since replication appliers are active per database and our example doesn’t use the default _system database, we need to specify the database name in the URL like this: _db/testDatabase.

We need to parse a document from a request like this:

    curl --dump-header - http://localhost:8530/_db/testDatabase/_api/replication/applier-state
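The same request can be assembled and sent from a script. The following Python sketch is illustrative only: the helper names applier_state_url and fetch_applier_state are hypothetical, and it assumes the slave is reachable without authentication, as in the test setup above.

```python
import json
import urllib.request
from urllib.parse import quote


def applier_state_url(host, port, database):
    """Build the applier-state URL for one database (hypothetical helper)."""
    return "http://{}:{}/_db/{}/_api/replication/applier-state".format(
        host, port, quote(database, safe="")
    )


def fetch_applier_state(url, timeout=5.0):
    """Fetch and decode the applier-state document.

    Assumption: the endpoint needs no authentication.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Values taken from the test setup: slave on localhost:8530, database testDatabase.
url = applier_state_url("localhost", 8530, "testDatabase")
```

Calling fetch_applier_state(url) against a live slave should return the JSON document as a Python dictionary.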

If the replication is not running, the document will look like this:

    {
      "state": {
        "running": false,
        "lastAppliedContinuousTick": null,
        "lastProcessedContinuousTick": null,
        "lastAvailableContinuousTick": null,
        "safeResumeTick": null,
        "progress": {
          "time": "2015-11-02T13:24:07Z",
          "message": "applier shut down",
          "failedConnects": 0
        },
        "totalRequests": 1,
        "totalFailedConnects": 0,
        "totalEvents": 0,
        "totalOperationsExcluded": 0,
        "lastError": {
          "time": "2015-11-02T13:24:07Z",
          "errorMessage": "no start tick",
          "errorNum": 1413
        },
        "time": "2015-11-02T13:31:53Z"
      },
      "server": {
        "version": "2.7.0",
        "serverId": "175584498800385"
      },
      "endpoint": "tcp://127.0.0.1:8529",
      "database": "testDatabase"
    }

A running replication will return something like this:

    {
      "state": {
        "running": true,
        "lastAppliedContinuousTick": "1150610894145",
        "lastProcessedContinuousTick": "1150610894145",
        "lastAvailableContinuousTick": "1151639153985",
        "safeResumeTick": "1150610894145",
        "progress": {
          "time": "2015-11-02T13:49:56Z",
          "message": "fetching master log from tick 1150610894145",
          "failedConnects": 0
        },
        "totalRequests": 12,
        "totalFailedConnects": 0,
        "totalEvents": 2,
        "totalOperationsExcluded": 0,
        "lastError": {
          "errorNum": 0
        },
        "time": "2015-11-02T13:49:57Z"
      },
      "server": {
        "version": "2.7.0",
        "serverId": "175584498800385"
      },
      "endpoint": "tcp://127.0.0.1:8529",
      "database": "testDatabase"
    }
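To make the difference between the two documents concrete, the following Python sketch reduces an applier-state document to the fields that distinguish a stopped applier from a running one. The function name summarize_applier_state is hypothetical, and the two sample dictionaries are trimmed copies of the documents above, not full responses.

```python
def summarize_applier_state(document):
    """Reduce an applier-state document to a few fields of interest."""
    state = document["state"]
    last_error = state.get("lastError", {})
    return {
        "running": bool(state["running"]),
        "error_num": last_error.get("errorNum", 0),
        # errorMessage is absent when errorNum is 0, so default to "".
        "error_message": last_error.get("errorMessage", ""),
        "progress": state.get("progress", {}).get("message", ""),
    }


# Trimmed versions of the two documents shown above.
stopped_doc = {
    "state": {
        "running": False,
        "progress": {"message": "applier shut down"},
        "lastError": {"errorMessage": "no start tick", "errorNum": 1413},
    }
}
running_doc = {
    "state": {
        "running": True,
        "progress": {"message": "fetching master log from tick 1150610894145"},
        "lastError": {"errorNum": 0},
    }
}

stopped_summary = summarize_applier_state(stopped_doc)
running_summary = summarize_applier_state(running_doc)
```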

We create a simple collectd configuration in /etc/collectd/collectd.conf.d/slave_testDatabase.conf that matches our API:

    TypesDB "/etc/collectd/collectd.conf.d/slavestate_types.db"
    <Plugin curl_json>
      # Adjust the URL so collectd can reach your arangod slave instance:
      <URL "http://localhost:8530/_db/testDatabase/_api/replication/applier-state">
        # Set your authentication to that database here:
        # User "foo"
        # Password "bar"
        <Key "state/running">
          Type "boolean"
        </Key>
        <Key "state/totalOperationsExcluded">
          Type "counter"
        </Key>
        <Key "state/totalRequests">
          Type "counter"
        </Key>
        <Key "state/totalFailedConnects">
          Type "counter"
        </Key>
      </URL>
    </Plugin>
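Each Key in the configuration above is a slash-separated path into the JSON document. The following Python sketch mimics that lookup to show which values the four keys select. It is an illustration only, not collectd's implementation; the names lookup and to_metric are hypothetical, and the sample dictionary is a trimmed copy of the running-replication document.

```python
def lookup(document, key_path):
    """Follow a slash-separated key path such as "state/running" into nested dicts."""
    value = document
    for part in key_path.split("/"):
        value = value[part]
    return value


def to_metric(value):
    """Map a JSON value to a number; booleans become 1 (true) or 0 (false)."""
    if isinstance(value, bool):
        return 1 if value else 0
    return value


# The four key paths used in the collectd configuration above.
KEYS = [
    "state/running",
    "state/totalOperationsExcluded",
    "state/totalRequests",
    "state/totalFailedConnects",
]

sample = {
    "state": {
        "running": True,
        "totalRequests": 12,
        "totalFailedConnects": 0,
        "totalOperationsExcluded": 0,
    }
}

metrics = {key: to_metric(lookup(sample, key)) for key in KEYS}
```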

To get nice metric names, we specify our own types.db file in /etc/collectd/collectd.conf.d/slavestate_types.db:

    boolean value:ABSOLUTE:0:1
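A types.db line consists of a type name followed by a data source specification of the form name:type:min:max. The following Python sketch splits the line above into those fields. It assumes a line with exactly one data source; the function name parse_types_db_line is hypothetical.

```python
def parse_types_db_line(line):
    """Split a single-data-source types.db line into its fields.

    Assumption: exactly one data source follows the type name.
    Minimum and maximum are kept as strings because types.db also
    allows "U" (unknown) in place of a number.
    """
    type_name, spec = line.split(None, 1)
    ds_name, ds_type, minimum, maximum = spec.strip().split(":")
    return {
        "type": type_name,
        "ds_name": ds_name,
        "ds_type": ds_type,
        "min": minimum,
        "max": maximum,
    }


parsed = parse_types_db_line("boolean value:ABSOLUTE:0:1")
```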

In effect, collectd reports state/running as 1 if the replication applier is running and as 0 if it is not.