Notes on RGW System Object State

The RGW raw object state has the following structure:

// rgw/rgw_rados.h
struct RGWRawObjState {
  rgw_raw_obj obj;
  bool has_attrs{false};
  bool exists{false};
  uint64_t size{0};
  ceph::real_time mtime;
  uint64_t epoch;
  bufferlist obj_tag;
  bool has_data{false};
  bufferlist data;
  bool prefetch_data{false};
  uint64_t pg_ver{0};

  /* important! don't forget to update copy constructor */

  RGWObjVersionTracker objv_tracker;

  map<string, bufferlist> attrset;
  RGWRawObjState() {}

Written with StackEdit.


Notes on RGW Request Path

The principal class is RGWOp. It holds the request state and an RGWRados store pointer.
The RGW request struct req_state holds:

  • Ceph context
  • op type info
  • account, bucket info
  • zonegroup name
  • RGWBucketInfo bucket_info
  • RGWUserInfo *user
Op Execution

RGWGetObj::execute() is the primary execution context of the class RGWGetObj. It uses the interfaces of class RGWRados::Object to perform I/O ops. The read op carries information such as the zone id, pg version, mod_ptr, and object size.
Next, RGWRados::Object::Read::prepare() is called on the read op.
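
The call order described above can be sketched with simplified stand-in types. This is only an illustration of "build the read op, prepare it, then read". The Object and Object::Read structs and the get_obj_execute() helper below are hypothetical and are not the real Ceph classes or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins, NOT the real Ceph classes: only the call order
// (prepare, then read) mirrors the notes above.
struct Object {
  std::string name;
  uint64_t size;  // object size that the read op learns during prepare()

  struct Read {
    Object* target;
    bool prepared = false;
    uint64_t obj_size = 0;
    std::vector<std::string> trace;  // records the call order for inspection

    explicit Read(Object* t) : target(t) {}

    // Stand-in for the prepare step: resolve object state before any I/O.
    int prepare() {
      obj_size = target->size;
      prepared = true;
      trace.push_back("prepare");
      return 0;
    }

    // Stand-in for the data read: refuses to run before prepare().
    // Returns the number of bytes that would be read, or -1 on misuse.
    int read(uint64_t ofs, uint64_t len) {
      if (!prepared) return -1;
      if (ofs >= obj_size) return 0;
      uint64_t n = std::min(len, obj_size - ofs);
      trace.push_back("read");
      return static_cast<int>(n);
    }
  };
};

// Hypothetical analogue of RGWGetObj::execute(): build the read op,
// prepare it, then issue the read.
inline int get_obj_execute(Object& obj, uint64_t ofs, uint64_t len) {
  Object::Read read_op(&obj);
  int r = read_op.prepare();
  if (r < 0) return r;
  return read_op.read(ofs, len);
}
```

Calling get_obj_execute() on a 10-byte object with ofs=8 and len=4 returns 2, because the read is clamped to the size learned in prepare().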


Notes on RGW Manifest

RGW maintains a manifest for each object. The class RGWObjManifest implements the details, including the placement of the object's head and tail.
The manifest is written as an XATTR, along with the object's other attributes, in RGWRados::Object::Write::_do_write_meta().
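
To make the head/tail idea concrete, here is a toy model of how a logical offset could be resolved to the part that stores it. It assumes one head object followed by fixed-size tail stripes. ToyManifest, Location, and locate() are hypothetical names, the sizes are picked by the caller, and the real RGWObjManifest has more rules (for example for multipart uploads) that are not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Toy manifest, NOT RGWObjManifest: the first head_size bytes live in the
// head object, the rest is cut into fixed-size tail stripes.
struct ToyManifest {
  uint64_t obj_size;
  uint64_t head_size;    // assumed value, chosen by the caller
  uint64_t stripe_size;  // assumed value, chosen by the caller
};

struct Location {
  bool in_head;         // true if the byte is stored in the head object
  uint64_t stripe_idx;  // tail stripe index (0-based), 0 when in_head
  uint64_t stripe_ofs;  // offset inside the head or inside the tail stripe
};

// Map a logical object offset to the part that stores it.
// Precondition: ofs < m.obj_size and m.stripe_size > 0.
inline Location locate(const ToyManifest& m, uint64_t ofs) {
  if (ofs < m.head_size) return {true, 0, ofs};
  uint64_t tail_ofs = ofs - m.head_size;
  return {false, tail_ofs / m.stripe_size, tail_ofs % m.stripe_size};
}
```

With a 4 MiB head and 4 MiB stripes (assumed values), byte 9 MiB of a 10 MiB object lands in tail stripe 1 at offset 1 MiB.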

/**
 * Write/overwrite an object to the bucket storage.
 * bucket: the bucket to store the object in
 * obj: the object name/key
 * data: the object contents/value
 * size: the amount of data to write (data must be this long)
 * accounted_size: original size of data before compression, encryption
 * mtime: if non-NULL, writes the given mtime to the bucket storage
 * attrs: all the given attrs are written to bucket storage for the given object
 * exclusive: create object exclusively
 * Returns: 0 on success, -ERR# otherwise.
 */


Notes on Ceph librados Client

Cluster Connection

  • A client is an application that uses librados to connect to a Ceph cluster.

  • It needs a cluster object populated with cluster info (cluster name, info from ceph.conf).

  • Then the client calls rados_connect and the cluster handle is populated.

  • A cluster handle can bind with different pools.

Cluster IO context

  • The I/O happens on a pool so the connection needs to bind to a pool.

  • The connection to a pool gives the client an I/O context.

  • The client only specifies an object name/xattr, and librados maps it to a PG and OSD in the cluster.

  • An object write to RADOS requires a key, a value, and the value size.

  • librados::bufferlist is primarily used for storing object value.
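
As an illustration of the name-to-PG step mentioned in the list above, the sketch below hashes an object name and folds the hash into a PG number. stable_mod() follows the logic of Ceph's ceph_stable_mod(). Everything else is an assumption made for the example: std::hash stands in for Ceph's object-name hash (rjenkins by default), pg_mask() and object_to_pg() are hypothetical helper names, and the PG-to-OSD step (CRUSH) is not modelled at all.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Same logic as Ceph's ceph_stable_mod(): behaves like x % b, but gives a
// stable mapping as b grows over time. bmask is the smallest power of two
// that is >= b, minus one.
inline uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  if ((x & bmask) < b) return x & bmask;
  return x & (bmask >> 1);
}

// Hypothetical helper: compute the mask for a given PG count (pg_num >= 1).
inline uint32_t pg_mask(uint32_t pg_num) {
  uint32_t p = 1;
  while (p < pg_num) p <<= 1;
  return p - 1;
}

// Toy placement: std::hash is a stand-in for Ceph's object-name hash, so
// the PG numbers produced here will NOT match a real cluster.
inline uint32_t object_to_pg(const std::string& name, uint32_t pg_num) {
  uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(name));
  return stable_mod(h, pg_num, pg_mask(pg_num));
}
```

The point of the sketch is that the client never names a PG or an OSD: the mapping is a pure function of the object name and the pool's PG count, so every client computes the same answer.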



Ceph Outage with OSD Heartbeat Failures on Hammer (0.94.6)

Symptoms

  • The cluster went down after 24 OSDs were added and marked in simultaneously.
  • This was an erasure coded (10+5) RGW cluster on Hammer.
  • All the OSDs started failing and eventually 50% of the OSDs were down.
  • Manual efforts to bring them up failed, and we saw heartbeat failures in the OSD logs.
  • All OSDs were consuming ~15G of RAM and hitting out-of-memory errors.
2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
  • OSD maps were out of sync.
2018-07-18 08:56:52.668789 7f4886d7b700  0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
  • Each OSD had ~3000 threads, most of them sleeping.
  • Using GDB to get a backtrace of all threads, we found that most of the active threads were just SimpleMessenger Pipe readers.
  • We suspected a memory leak in the Ceph code.
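
The heartbeat_check lines in the logs above can be read as a simple comparison: a peer is reported when its last reply on either the back or the front interface is older than a cutoff. A minimal sketch of that check follows. PeerHeartbeat and heartbeat_failed() are hypothetical names rather than the OSD's actual code, and the 60 second grace is inferred from the gap between the log timestamp and the printed cutoff, not from the cluster's configuration.

```cpp
#include <cassert>
#include <cstdint>

// Last time (in seconds) a heartbeat reply was received from a peer on
// each interface. Hypothetical struct for illustration only.
struct PeerHeartbeat {
  int64_t last_rx_back;
  int64_t last_rx_front;
};

// A peer is considered failed if either interface has been silent since
// before the cutoff, where cutoff = now - grace.
inline bool heartbeat_failed(const PeerHeartbeat& p, int64_t now,
                             int64_t grace) {
  int64_t cutoff = now - grace;
  return p.last_rx_back < cutoff || p.last_rx_front < cutoff;
}
```

Using seconds since midnight for the first log line: now is 08:58:12 (32292), the cutoff is 08:57:12 (32232), and osd.55 last replied at 08:45:13 (31513), which is well before the cutoff, so the peer is reported.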

Band-aid Fixes

  • Setting the norebalance, norecover, and nobackfill flags

  • Adding swap memory on the OSD nodes

  • Tuning heartbeat interval

  • Tuning OSD map sync and setting noout, nodown to let OSDs sync their maps.

$ sudo ceph daemon osd.148 status
{
    "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
    "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
    "whoami": 148,
    "state": "booting",
    "oldest_map": 156518,
    "newest_map": 221059,
    "num_pgs": 1295
}
  • Tuning OSD map cache size to 20
  • Finding processes other than Ceph
    • Processes consuming network, CPU, and RAM
    • Killing them
  • Starting OSDs one by one – that worked for us 🙂

RCA

  • The major culprit was a rogue process that was consuming massive network bandwidth on the OSD nodes.
  • As the network bandwidth was not enough, many messenger threads were just waiting.
  • The SimpleMessenger threads are synchronous and wait until their I/O gets through.
  • That is one of the reasons each OSD had ~3000 threads and consumed ~15G of memory.
  • As the network was saturated, the OSDs' heartbeat signals were blocked too, and the OSDs were either committing suicide or dying of OOM.
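
A rough way to relate the thread count to the number of open connections: SimpleMessenger runs one reader and one writer thread per Pipe, so the messenger's thread count grows at twice the connection count. The helper names below are hypothetical, and the estimate ignores every non-messenger thread in the OSD, so it is only an upper bound on the number of pipes.

```cpp
#include <cassert>
#include <cstdint>

// SimpleMessenger uses a reader thread and a writer thread per Pipe.
constexpr uint64_t kThreadsPerPipe = 2;

// Messenger threads needed for a given number of open pipes.
inline uint64_t messenger_threads(uint64_t pipes) {
  return pipes * kThreadsPerPipe;
}

// Upper bound on open pipes given an observed thread count, assuming
// (unrealistically) that every thread belongs to the messenger.
inline uint64_t max_pipes_for_threads(uint64_t threads) {
  return threads / kThreadsPerPipe;
}
```

Under that assumption, ~3000 threads correspond to at most ~1500 pipes per OSD, each of which can sit blocked while the network is saturated.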



Ceph Luminous build

struct RGWObjectCtx {
  RGWRados *store;
  void *user_ctx;

  RGWObjectCtxImpl<rgw_obj, RGWObjState> obj;
  RGWObjectCtxImpl<rgw_raw_obj, RGWRawObjState> raw;

  explicit RGWObjectCtx(RGWRados *_store) : store(_store), user_ctx(NULL), obj(store), raw(store) { }
  RGWObjectCtx(RGWRados *_store, void *_user_ctx) : store(_store), user_ctx(_user_ctx), obj(store), raw(store) { }
};
