Troubleshooting

The Gateway Won’t Start

If you cannot start the gateway (i.e., there is no existing pid), check to see if there is an existing .asok file from another user. If an .asok file from another user exists and there is no running pid, remove the .asok file and try to start the process again. This may occur when you start the process as a root user and the startup script is trying to start the process as a www-data or apache user and an existing .asok is preventing the script from starting the daemon.
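The check described above can be sketched as a small shell helper. It only reports what it finds and never deletes anything. The function name `check_stale_asok` is hypothetical, and the default directory /var/run/ceph is an assumption based on the default admin socket location; pass a different directory if your sockets live elsewhere.

```shell
# List .asok files in a run directory with their owners, then report
# whether a radosgw process is currently running.
# Assumption: sockets live in /var/run/ceph unless a directory is given.
check_stale_asok() {
    run_dir="${1:-/var/run/ceph}"
    for sock in "$run_dir"/*.asok; do
        # Skip the unexpanded glob when no .asok files exist.
        [ -e "$sock" ] || continue
        # The third field of `ls -ld` is the owning user.
        owner=$(ls -ld "$sock" | awk '{ print $3 }')
        echo "$sock owner=$owner"
    done
    if pgrep -x radosgw >/dev/null 2>&1; then
        echo "radosgw is running; leave the socket files alone"
    else
        echo "no running radosgw process found"
    fi
}
```

If the output shows an .asok file owned by an unexpected user and no running radosgw process, that file is the candidate for removal.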

The radosgw init script (/etc/init.d/radosgw) also has a verbose argument that can provide some insight as to what could be the issue:

    /etc/init.d/radosgw start -v

or

    /etc/init.d/radosgw start --verbose

HTTP Request Errors

Examining the access and error logs for the web server itself is probably the first step in identifying what is going on. If there is a 500 error, that usually indicates a problem communicating with the radosgw daemon. Ensure the daemon is running, its socket path is configured, and that the web server is looking for it in the proper location.
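As a starting point for reading the access log, the following sketch counts responses per 5xx status code. It assumes the web server writes Common or Combined Log Format, where the status code is the ninth whitespace-separated field. The function name `count_5xx` is hypothetical, and the log path in the usage line is only an example.

```shell
# Print "<status> <count>" for every 5xx status code in an access log.
# Assumption: Common/Combined Log Format, status code in field 9.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n[$9]++ }
         END { for (s in n) print s, n[s] }' "$1" | sort
}
```

For example, `count_5xx /var/log/apache2/access.log` would summarize the errors on a host where Apache logs to that path.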

Crashed radosgw Process

If the radosgw process dies, you will normally see a 500 error from the web server (apache, nginx, etc.). In that situation, simply restarting radosgw will restore service.

To diagnose the cause of the crash, check the log in /var/log/ceph and/or the core file (if one was generated).

Blocked radosgw Requests

If some (or all) radosgw requests appear to be blocked, you can get some insight into the internal state of the radosgw daemon via its admin socket. By default, there will be a socket configured to reside in /var/run/ceph, and the daemon can be queried with:

    ceph daemon /var/run/ceph/client.rgw help

    help                list available commands
    objecter_requests   show in-progress osd requests
    perfcounters_dump   dump perfcounters value
    perfcounters_schema dump perfcounters schema
    version             get protocol version

Of particular interest:

    ceph daemon /var/run/ceph/client.rgw objecter_requests
    ...

will dump information about current in-progress requests with the RADOS cluster. This allows one to identify if any requests are blocked by a non-responsive OSD. For example, one might see:

    { "ops": [
          { "tid": 1858,
            "pg": "2.d2041a48",
            "osd": 1,
            "last_sent": "2012-03-08 14:56:37.949872",
            "attempts": 1,
            "object_id": "fatty_25647_object1857",
            "object_locator": "@2",
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2012-03-08 14:56:37.949813",
            "osd_ops": [
                  "write 0~4096"]},
          { "tid": 1873,
            "pg": "2.695e9f8e",
            "osd": 1,
            "last_sent": "2012-03-08 14:56:37.970615",
            "attempts": 1,
            "object_id": "fatty_25647_object1872",
            "object_locator": "@2",
            "snapid": "head",
            "snap_context": "0=[]",
            "mtime": "2012-03-08 14:56:37.970555",
            "osd_ops": [
                  "write 0~4096"]}],
      "linger_ops": [],
      "pool_ops": [],
      "pool_stat_ops": [],
      "statfs_ops": []}

In this dump, two requests are in progress. The last_sent field is the time the RADOS request was sent. If this is a while ago, it suggests that the OSD is not responding. For example, for request 1858, you could check the OSD status with:

    ceph pg map 2.d2041a48

    osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]

This tells us to look at osd.1, the primary copy for this PG:

    ceph daemon osd.1 ops

    { "num_ops": 651,
      "ops": [
            { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
              "received_at": "1331247573.344650",
              "age": "25.606449",
              "flag_point": "waiting for sub ops",
              "client_info": { "client": "client.4124",
                  "tid": 1858}},
    ...

The flag_point field indicates that the OSD is currently waiting for replicas to respond, in this case osd.0.
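Two steps of the walk-through above lend themselves to small text filters: spotting which OSD the in-progress requests are piling up on, and reading the primary OSD out of the `ceph pg map` output. The sketch below uses only grep, awk, and sed. The function names `ops_per_osd` and `primary_osd` are hypothetical, and both functions assume the output formats shown in the examples above.

```shell
# Count in-progress requests per OSD from an objecter_requests dump on stdin.
# Matches only the "osd" key, not "osd_ops".
ops_per_osd() {
    grep -o '"osd": *[0-9][0-9]*' \
        | awk -F: '{ gsub(/[^0-9]/, "", $2); n[$2]++ }
                   END { for (id in n) print "osd." id, n[id] }' \
        | sort
}

# Print the primary OSD (the first entry of the acting set) from
# `ceph pg map` output on stdin.
primary_osd() {
    sed -n 's/.*acting \[\([0-9][0-9]*\).*/osd.\1/p'
}
```

Typical use would be `ceph daemon /var/run/ceph/client.rgw objecter_requests | ops_per_osd`, followed by `ceph pg map 2.d2041a48 | primary_osd` for a request that looks stuck.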

Java S3 API Troubleshooting

Peer Not Authenticated

You may receive an error that looks like this:

    [java] INFO: Unable to execute HTTP request: peer not authenticated

The Java SDK for S3 requires a valid certificate from a recognized certificate authority, because it uses HTTPS by default. If you are just testing the Ceph Object Storage services, you can resolve this problem in a few ways:

  • Prepend the IP address or hostname with http://. For example, change this:

        conn.setEndpoint("myserver");

    to:

        conn.setEndpoint("http://myserver");

  • After setting your credentials, add a client configuration and set the protocol to Protocol.HTTP:

        AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

        ClientConfiguration clientConfig = new ClientConfiguration();
        clientConfig.setProtocol(Protocol.HTTP);

        AmazonS3 conn = new AmazonS3Client(credentials, clientConfig);

405 MethodNotAllowed

If you receive a 405 error, check to see if you have the S3 subdomain set up correctly. You will need to have a wildcard setting in your DNS record for subdomain functionality to work properly.

Also, check to ensure that the default site is disabled.

    [java] Exception in thread "main" Status Code: 405, AWS Service: Amazon S3, AWS Request ID: null, AWS Error Code: MethodNotAllowed, AWS Error Message: null, S3 Extended Request ID: null

Numerous Objects in default.rgw.meta Pool

Clusters created prior to jewel have a metadata archival feature enabled by default, using the default.rgw.meta pool. This archive keeps all old versions of user and bucket metadata, resulting in large numbers of objects in the default.rgw.meta pool.

Disabling the Metadata Heap

Users who want to disable this feature going forward should set the metadata_heap field to an empty string "":

    $ radosgw-admin zone get --rgw-zone=default > zone.json
    [edit zone.json, setting "metadata_heap": ""]
    $ radosgw-admin zone set --rgw-zone=default --infile=zone.json
    $ radosgw-admin period update --commit

This will stop new metadata from being written to the default.rgw.meta pool, but does not remove any existing objects or pool.
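The manual edit of zone.json can also be done with a one-line filter. This is a sketch rather than the documented procedure. It assumes the "metadata_heap" key and its string value appear on a single line of the JSON, and the function name `clear_metadata_heap` is hypothetical.

```shell
# Rewrite the metadata_heap value to an empty string in zone JSON on stdin.
# Assumption: the key and its quoted value sit on the same line.
clear_metadata_heap() {
    sed 's/"metadata_heap": *"[^"]*"/"metadata_heap": ""/'
}
```

One possible use is `radosgw-admin zone get --rgw-zone=default | clear_metadata_heap > zone.json`. Review the resulting file before passing it to `radosgw-admin zone set`.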

Cleaning the Metadata Heap Pool

Clusters created prior to jewel normally use default.rgw.meta only for the metadata archival feature.

However, from luminous onwards, radosgw uses Pool Namespaces within default.rgw.meta for an entirely different purpose, that is, to store user_keys and other critical metadata.

Users should check the zone configuration before proceeding with any cleanup procedures:

    $ radosgw-admin zone get --rgw-zone=default | grep default.rgw.meta
    [should not match any strings]

Having confirmed that the pool is not used for any purpose, users may safely delete all objects in the default.rgw.meta pool, or optionally, delete the entire pool itself.
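To make the zone check harder to skip, it can be wrapped in a guard that fails when the pool name still appears in the zone configuration. This is a minimal sketch. The function name `pool_unreferenced` is hypothetical, and it performs only the same plain text match as the grep above.

```shell
# Succeed (exit status 0) only if the zone JSON on stdin does not mention
# the given pool. The pool name defaults to default.rgw.meta.
pool_unreferenced() {
    pool="${1:-default.rgw.meta}"
    if grep -F -q -- "$pool"; then
        echo "zone still references $pool; do not delete it" >&2
        return 1
    fi
    echo "zone does not reference $pool"
}
```

For example, run `radosgw-admin zone get --rgw-zone=default | pool_unreferenced` and continue with the cleanup only if it succeeds.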