Ceph RGW Internals: Cache Coherence & Bucket Life Cycles

RGW Cache Coherence

Why does RGW have a control pool? We will try to understand its use case and purpose in RGW cache synchronization.


RGW creates watcher objects in the RGW control pool. You can list them with:

$ sudo rados ls -p .in-abc-1.rgw.control

These objects are watched for changes; when one of them is notified, the RGW threads synchronize their caches.


class RGWPutLC : public RGWOp
class RGWPutLC_ObjStore_S3 : public RGWPutLC_ObjStore

class RGWHandler_REST_Bucket_S3 : public RGWHandler_REST_S3 {
  bool is_acl_op() {
    return s->info.args.exists("acl");
  }
  bool is_cors_op() {
    return s->info.args.exists("cors");
  }
  bool is_lc_op() {
    return s->info.args.exists("lifecycle");
  }
  bool is_obj_update_op() override {
    return is_acl_op() || is_cors_op();
  }
  bool is_request_payment_op() {
    return s->info.args.exists("requestPayment");
  }
  bool is_policy_op() {
    return s->info.args.exists("policy");
  }
};

// The "lifecycle" query parameter routes a PUT to the LC op:
      if (is_lc_op()) {
        return new RGWPutLC_ObjStore_S3;
      }

RGW OP Handler

RGWOp* RGWHandler_REST::get_op(RGWRados* store)
{
  RGWOp *op;
  switch (s->op) {
   case OP_GET:
     op = op_get();
     break;
   case OP_PUT:
     op = op_put();
     break;
   case OP_DELETE:
     op = op_delete();
     break;
   case OP_HEAD:
     op = op_head();
     break;
   case OP_POST:
     op = op_post();
     break;
   case OP_COPY:
     op = op_copy();
     break;
   case OP_OPTIONS:
     op = op_options();
     break;
   default:
     return NULL;
  }

  if (op) {
    op->init(store, s, this);
  }
  return op;
} /* get_op */

The LC code takes an exclusive lock on the LC object for a given bucket shard. Next, it sets the bucket's OID in OMAP using rgw_cls_lc_set_entry().

RGWLC Invocation

Entry point for LC: RGWRados::init_complete()

This function reads the zone and zonegroup config and creates a
connection to the zone endpoint. It also creates IO contexts for the
root, GC, LC, objexp (log) and reshard pools.
GC uses the RGWObjectExpirer object, which uses the objexp (log) pool.

 lc = new RGWLC();
 lc->initialize(cct, this);

 if (use_lc_thread)
   lc->start_processor();

The RGWLC class contains an LCWorker class.

void RGWLC::initialize(CephContext *_cct, RGWRados *_store)

- creates LC object names as lc.0, lc.1,...,lc.31
- creates a cookie buffer
void RGWLC::start_processor()
  • Spawns LCWorker threads
  • Each thread calls lc->process()

// src/cls/rgw/cls_rgw_const.h
// This file declares the CLS functions of RGW.

It stores the operation meta in struct cls_rgw_lc_obj_head.
This structure has two fields: time and a marker string.

The list of LC entries is retrieved from OMAP in rgw_cls_lc_list_entries().
The input is an op (cls_rgw_lc_list_entries_op) carrying a marker, a filter
prefix and max entries. The function rgw_cls_lc_list_entries() gets the list.
MAX_LC_LIST_ENTRIES in one read is 100.

The listed entries carry the bucket ID, and for each entry the bucket
state is set to "uninitial" in OMAP.

RGWLC::bucket_lc_process(string& shard_id)

A shard ID contains the tenant, bucket name and bucket ID.
A bucket must have RGW_ATTR_LC set for LC processing.

Design: Brave Device Sync

The Brave Browser offers a sync facility that keeps bookmarks and browsing history consistent across Brave installations. Your data from phone, laptop and iPad becomes one view, all while respecting your privacy.

The current design of sync uses a device ID to identify a client. A client calls Brave server to store data. The first client creates the seed phrase of the sync chain.

The Brave server listens to these requests using a serverless component.

Brave Sync Design


What is TCP BBR?


BBR is a congestion-control algorithm based on measuring the two
parameters that characterize a path: bottleneck bandwidth and
round-trip propagation time (hence the name BBR).

Why Use BBR?

  • A better congestion-control algorithm for TCP.
  • No need to change the client.
  • More effective in a high-packet-loss network!
    • Classic loss-based TCP backs off exponentially on loss and makes requests slow.

How Does BBR Work?

Instead of treating a packet loss as congestion, BBR measures two parameters and forms a control loop:

  1. Bottleneck bandwidth
  2. Round-trip time (RTT) of a packet

Bottleneck bandwidth is the maximum available bandwidth of a connection.

BBR runs a control loop that raises or lowers its estimate of the bottleneck bandwidth depending on the average payload delivered over a time t.

On each ACK: it updates the bottleneck bandwidth and round-trip time estimates.
On each send: it adjusts the pacing_rate to probe the bottleneck bandwidth up or down.

BBR is a simple instance of a Max-plus control system, a new approach
to control based on nonstandard algebra. This approach allows the
adaptation rate [controlled by the max gain] to be independent of
the queue growth [controlled by the average gain].

Since the server can instrument packets, monitor delivery and decide the right load for a connection, the client does not need any change.


CUBIC also increases bottleneck bandwidth but doesn't back off after hitting the plateau. BBR actively learns the packet RTT and adjusts bandwidth, so the queue length stays small and latencies stay lower than CUBIC's.


Understanding Ruby Symbols

Ruby is an interpreted, dynamically typed language that allocates new memory for each variable. A variable has a name and a value. A symbol is an optimized variable that holds a single instance in memory. Symbols are good for values that repeat across the program, such as hash-table keys.

h = {'my_key' => 123}

The storage for my_key is allocated each time the literal is used. That wastes memory and creates bookkeeping work for the Ruby interpreter.

So declaring the key as a symbol makes sense as only one copy of my_key is kept in memory.

h = {:my_key => 123}

You have to use the : prefix with each usage of a symbol.

irb(main):003:0> new_hash={:my_key => 123}
=> {:my_key=>123}

irb(main):004:0> new_hash[:my_key]
=> 123

# You must use the :

irb(main):005:0> new_hash[my_key]
NameError: undefined local variable or method `my_key' for main:Object
    from (irb):5
irb(main):006:0> new_hash['my_key']
=> nil
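You can see the single-instance behavior directly: every use of a symbol literal yields the same object, while each string literal allocates a fresh object (assuming frozen string literals are not enabled):

```ruby
# Symbols: one object shared by every use of the literal.
puts :my_key.object_id == :my_key.object_id    # true

# Strings: a new object per literal, hence new storage each time.
puts 'my_key'.object_id == 'my_key'.object_id  # false
```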


Why Is Docker a Long-Term Future for Platforms?

What makes Docker so popular and long-lasting?

  • A container is essentially OS-level virtualization. Each application gets the illusion of its own OS, with almost absolute control over it. Another advantage is that the host OS knows about the container processes and hence can share its resources among hosted containers.
  • The concept of containers was started by FreeBSD, refined by Solaris and re-implemented by Linux.
  • Containers are better than two other levels of virtualization:
    • ABI/platform level, where application integrates with the platform (Google App Engine), doesn’t scale well.
    • Hardware level, where a virtual hardware runs the OS (e.g. virtual machines, hypervisors).
  • Docker containers run close to the real hardware, and host OS has knowledge of resource usage. Hence it’s an optimal sweet spot for virtualization.
  • Joyent SmartOS is built on OpenSolaris and provides Solaris features to Linux Docker containers. It achieves that by translating Linux APIs to Solaris APIs. Everything hence runs on bare metal.
  • SmartOS containers get ZFS, Dtrace by default 🙂
  • SmartOS containers are very secure as they run in zones.

I hope to cover SmartOS design internals and Docker-on-Linux details in later posts.


Linux Memory Management Tricks

Tips to Improve Dynamic Memory Performance

Instead of using memset() to zero-initialize malloc()'ed memory, use calloc(). When you call memset(), the VM system has to map the pages into memory in order to zero them. That is expensive and wasteful if you don't intend to use the pages right away.

calloc() reserves the needed address space but does not populate pages until the memory is touched. Hence it postpones the need to load pages into memory, and lets the system zero pages as they're used instead of all at once.

  • Lazy allocation: a global (plain variable or buffer) can be replaced with a static plus a couple of accessor functions.

  • memcpy() & memmove() need both blocks to be memory-resident. Use them when the blocks are small (< 16 KB), you will use the blocks right away, the blocks are not page-aligned, or the blocks overlap (memmove() only).
    But if you intend to postpone the use, you would increase the working set of the application. For small amounts of data, use memcpy().

  • To check for dysfunctional heap behavior: $ MALLOC_CHECK_=1 ./a.out
    It reports an address for each violation of the dynamic-memory routines.

  • Electric fence : Works very well with gdb

  • Libsafe: for libc routines

One More Reason to Avoid Ruby Language

Ruby is a type-unsafe language, but it goes a step further and avoids checking types dynamically too.

Consider this code

x = :abc
if x != 'abc'   # the cross-class comparison is allowed; it returns false
  puts "Symbol and String are two different classes"
  puts x.class, 'abc'.class
end

# puts can print a symbol and a string alike.
puts x

My Complaints

  • I’m new to Ruby. How could Ruby let a Symbol and a String be compared, despite being aware of their types? It could throw an error instead.
  • How can puts print a Symbol as readily as a String?
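To make the complaint concrete: the cross-class comparison is silently allowed and simply returns false; converting explicitly makes the intent visible:

```ruby
x = :abc

# Comparing a Symbol against a String is allowed and returns false.
puts x == 'abc'          # false

# Convert when you actually mean to compare the contents.
puts x.to_s == 'abc'     # true
puts x == 'abc'.to_sym   # true
```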


Internals of Linux Process Signals

Linux Signals

  • A process in Linux is represented as task_struct.
  • A process A sends a signal to process B using the kill() system call or kill -<sig num> <pid>.
  • The kernel updates the task_struct of the receiving process B, marking the signal as pending in its signal field.
  • The scheduler checks the pending signals of a process before running it. If a signal is pending, it arranges for the signal handler to run.
  • A process can define its handlers for all signals except SIGKILL and SIGSTOP.
  • There is always a default signal handler for each signal.


Height of a Binary Tree: Recursion Unrolled & Explained


The height of a binary tree is the length of the longest path from the root down to a leaf.

int getHeight(node *root)
{
    if (root == NULL) {
        return -1;
    }
    int leftHeight = getHeight(root->left);
    int rightHeight = getHeight(root->right);

    return max(leftHeight, rightHeight) + 1;
}

Suppose the tree is as follows:

         [10]
        /    \
      [8]    [12]
             /  \
           [4]  [5]

The recursion flow is as follows:

1. getHeight(10)
2. getHeight(left of 10) ==> getHeight(8)
3. getHeight(left of 8) ==> getHeight(NULL)
4. getHeight(NULL) returns -1
5. leftHeight = -1 and the flow goes back to step 3
6. Now, it calls the right of 8: getHeight(right of 8)
7. getHeight(NULL) returns -1 into rightHeight
8. Both subtrees of 8 are traversed: leftHeight = -1, rightHeight = -1
9. It compares both values and returns the max + 1
10. max(-1, -1) + 1 = 0
11. Node 8 is the left subtree of 10, so control returns to step 2
12. At node 10, leftHeight = 0
13. Now, rightHeight is calculated by moving to the right of 10
14. getHeight(right of 10) ==> getHeight(12)
15. Similar to the above, the height at node 12 is calculated as 1
16. At the end, node 12 returns its height to step 14
17. Again we compute max(0, 1) + 1
18. The answer is 2

Usually, recursion steps are not visualized and are just assumed to work. This post walks through a recursion flow exhaustively to help understand it better.

Notes on Dockerfile and Build Cache

A Dockerfile is an instruction set for building a new container image. It looks like a BASH script that serially runs all the mentioned commands. The commands are defined by the Dockerfile syntax.

Unlike a BASH script, a Dockerfile applies the effect of each command to the output of the previous step. Each step of a Dockerfile creates, by default, an intermediate image which is kept hidden. You can list these intermediate images by running the following command:

$ docker images -a

All images with the <none> name are intermediate.

Why Does Docker Need Ephemeral Images?

  • Each ephemeral image acts as the cached output of a step in the Dockerfile.
  • The next image build uses the cached output instead of running the step again.
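A minimal sketch of how caching lines up with steps (the image, file and package names here are illustrative):

```dockerfile
FROM ubuntu:20.04               # base layer
RUN apt-get update              # cached until this line changes
COPY app.py /app/app.py         # cache busts whenever app.py changes
CMD ["python3", "/app/app.py"]  # steps after a cache miss are rebuilt
```

Once one step misses the cache, every later step is rebuilt as well, so it pays to put rarely changing steps first.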

How Does It Work?

  • Each step, starting from the FROM base image, checks whether the next step has a cached output.
  • The check compares the requested instruction with the instruction that produced the cached output.
  • If instructions do not match, the cache is invalidated. The step is built normally.
  • To disable caching, provide the no-cache option.
    $ docker build --no-cache ...