Height of a Binary Tree: Recursion Unrolled & Explained

tags: >-
development, C++, tree, height, programming, recursion, internals, complete
categories: development

The height of a binary tree is the number of edges on the longest path from the root down to a leaf. An empty tree has height -1, and a single-node tree has height 0.

int getHeight(node *root) {
    if (root == NULL) {
        return -1;
    }
    int leftHeight = getHeight(root->left);
    int rightHeight = getHeight(root->right);

    return max(leftHeight, rightHeight) + 1;
}

Suppose the tree is as follows:

           [10]
          /    \
       [8]    [12]
              /  \
            [4]  [5]

The recursion flow is as follows:

1. getHeight(10)
2. getHeight(left of 10) ==> getHeight(8)
3. getHeight(left of 8) ==> getHeight(NULL)
4. getHeight(NULL) returns -1
5. leftHeight = -1 and the flow returns to step 3
6. Now, it calls getHeight(right of 8)
7. getHeight(NULL) returns -1 to rightHeight
8. Both subtrees of 8 are traversed: leftHeight = -1, rightHeight = -1
9. It compares both values and returns the max + 1
10. max(-1, -1) + 1 = 0
11. Node 8 is the left subtree of 10, so control returns to step 2
12. At node 10, leftHeight = 0
13. Now, rightHeight is calculated by moving to the right of 10
14. getHeight(right of 10) ==> getHeight(12)
15. Similar to the above, the height at node 12 is calculated as 1
16. At the end, node 12 returns its height to step 14
17. Again we compute max(0, 1) + 1
18. The answer is 2

Usually, recursion steps are not visualized and are just assumed to work. This post walks through a complete recursion flow to help understand it better.


Notes on Dockerfile and Build Cache

A Dockerfile is an instruction set for building a new container image. It looks like a Bash script that serially runs all the mentioned commands. The commands are predefined by the Dockerfile syntax.

Unlike a Bash script, each Dockerfile step applies its effect on top of the output of the previous step. Each step of a Dockerfile creates, by default, an intermediate image which is kept hidden. You can list such intermediate images by running the following command:

$ docker images -a 

All images listed with <none> as the name are intermediate images.

Why Docker Needs Intermediate Images

  • Each intermediate image acts as the cached output of a step in the Dockerfile.
  • The next container build uses the cached output instead of running the step again.

How Does it Work?

  • Starting from the base image (FROM), each step checks whether a cached output exists for the next instruction.
  • The check compares the requested instruction with the instruction that produced the cached output.
  • If the instructions do not match, the cache is invalidated and the step is built normally.
  • To disable caching, provide the --no-cache option:
    $ docker build --no-cache ...
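As a hypothetical sketch, consider a two-step Dockerfile. A second `docker build` serves both steps from cache; editing the RUN line invalidates the cache from that step onwards:

```dockerfile
# Hypothetical example to illustrate the build cache.
FROM alpine
# Changing this line invalidates the cache for this and all later steps.
RUN echo hello > /greeting
```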


xargs: ls: terminated by signal 13

The following command fails with signal 13 (SIGPIPE).

$ find . -type f|xargs ls -altr|head
-rw-r--r-- 1 root root  254111 Mar 17  2018 ./60/62
-rw-r--r-- 1 root root  135111 Mar 17  2018 ./60/66
xargs: ls: terminated by signal 13

Why did it fail?

$ strace find . -type f|xargs ls -altr|head
newfstatat(AT_FDCWD, "xyxyx", {st_mode=S_IFREG|0644, st_size=80241, ...}, AT_SYMLINK_NOFOLLOW) = 0
write(1, "40ee30e7c76a7541d61acb6ec4\n./60/"..., 4096) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=376666, si_uid=1190} ---
+++ killed by SIGPIPE +++

The xargs process is writing to a closed pipe; it gets a SIGPIPE and is killed.

The head command finishes before xargs does. head closes the read end of the pipe, and xargs gets a SIGPIPE because the pipe is no longer being read.

Written with StackEdit.

Go Runtime Scheduler Design Internals

Concurrency is one of the most exciting features of the Go language. A single-threaded program runs serially, but if you have tasks that can run concurrently, you create threads for them. Threads execute independently and progress concurrently. Go supports the creation of thousands of threads in an application! How is that possible? The answer is Go’s runtime. Go programs are compiled, and the executable is self-contained (the runtime is linked with the binary).

Let’s understand the design motivation of the Go runtime. The runtime must work within the system’s resource constraints. The system must run multiple threads, but a CPU core can run only one thread at a time; if there are more threads than available cores, threads are paused and resumed (context switched). During a context switch, the execution state of a thread is preserved and another thread is loaded. Creating a thread also requires resources, hence there is a limit on their number.

Under these constraints, the Go runtime maximises CPU utilisation and minimises latencies and memory footprint.

Go provides concurrency with the language primitives of Goroutines and channels. Using Goroutines, applications can grow dynamically (by forking new Goroutines). Channels are internal to the Go runtime, and the system has no knowledge of channels.

Let’s understand Goroutines in detail. A Goroutine is essentially a lightweight thread that exists in user space. Goroutines are frugal with resources: unlike system threads, a Goroutine has a small stack that grows as needed. This is one of the reasons Go can support the creation of thousands of Goroutines.

So how are thousands of Goroutines managed by the runtime? Instead of delegating the responsibility to the system scheduler, Go uses its own scheduler, written in Go itself.

How does the Go scheduler work? Let’s first understand the threading models available to applications. An application can use system threads, managed by the OS. These threads can make system calls and access system resources (e.g. the CPU). However, they are expensive: each consumes kernel resources such as signal masks, a PID, cgroup accounting, etc., and context switches are costly because they trap into the kernel. In contrast, user threads are created and managed by the application; they consume fewer resources, and context switching is fast because it does not go through the kernel. A user thread still needs a system thread to execute code on a CPU or to access any other system resource.

The next decision is the ratio of user threads to system threads. The first model uses N user threads on one system thread. It gives fast context switching but cannot use multiple cores (if available); also, if a user thread blocks, there is no other system thread available, so all other user threads wait. The second scheme is a 1:1 mapping of user threads to system threads. It provides good CPU utilisation, but context switching is slow. The third option is a many-to-many mapping.

Go takes the third option. Goroutines are distributed over a set of system/OS threads. There are three major entities in the Go scheduler: machines (M), Goroutines (G) and processors (P). There are also minor entities such as the global and local run queues and the thread cache.

Let’s understand “MGP”. A machine (M) is a representation of an OS thread; on Linux, it is a standard POSIX thread, part of a pool of worker threads. An M runs a set of Gs. A G represents a user-space Goroutine; it has its own instruction pointer, stack and blocking info. A P represents a logical entity for an available processor, i.e. a scheduling context. Every worker (M) needs a P to run a G. The number of Ps is pre-decided (GOMAXPROCS) and fixed during the run.


Consider a setup with 2 processors, defined by GOMAXPROCS, and two worker machines (M). Each M has a local run queue. As new Goroutines are created and become runnable, they are added to the local run queue. If the local run queue is full, new Gs are added to the global run queue. Idle Gs are kept in an idle queue.

What happens when a G makes a blocking system call? The scheduler knows that the G is blocked, and hence its M is blocked too. The P is not utilised and can be used by some other M.

The scheduler takes back the P and assigns it, along with the runnable queue, to a new worker machine (M’). When the blocked M returns from the syscall, it tries to find an idle P; if none is available, it moves its G to the global queue and parks itself in the thread cache. The scheduler makes sure there are enough threads to run all contexts; there can be more than one M even for P=1, because a worker might get stuck in a syscall.

A scenario can occur in which a P is idle because its run queue is exhausted. In such a case it tries to pick Gs from the global queue. If the global queue is also empty, what then? There are two major scheduling paradigms for distributing work. The first is work-sharing, in which Gs get distributed to other Ps. In contrast, in a work-stealing scheduler an idle processor steals Gs from another P’s run queue. As long as every P is busy, there is no G movement; an idle P steals about half of the Gs from another P. Work stealing has better resource utilisation and lower migration of Gs.

So an idle P whose local run queue and the global run queue are both empty randomly picks another P and steals half of its Gs.

Go has a sophisticated scheduler: it offers massive concurrency and intelligent scheduling, and always tries to achieve maximum utilisation and minimum latency.

Design Problems of PostGres - Part II

The problems in PostGres DB are solved in MySQL to an extent. The design merits of MySQL are the following:

  • The primary index has a mapping of key to an offset in the disk. But all secondary index tables have a mapping of key to the primary index’s key.
Primary index  
| key | Disk offset|  
Secondary index  
| key | primary key |  
  • The drawback is that a lookup via a secondary index needs two hops.
  • On the other hand, a change in a row only needs modification of the primary index. It avoids changing every secondary index for each change in the row’s data.
  • MySQL replication uses logical information, instead of the physical information used in PostGres.
  • Logs are compact compared to PG’s.
  • MySQL manages its own cache, unlike PG, which uses the OS buffer cache.
  • PG’s use of the buffer cache is inefficient, since a read operation has to land in the buffer cache and hence incurs a context switch.
  • PG uses a process per connection, which is expensive. MySQL uses a thread per connection and scales without too much resource consumption.

Design Problems of PostGres - Part I

This post is a quick summary of Why Uber moved from PostGres to MySQL.

PostGres Rows and CTID

  • PostGres provides transactions. Transactions need multiple versions of data, so PG is a multi-versioned (MVCC) DB.
  • PG considers each row immutable. Any change to a row creates a new row version.
  • A row version is represented by its offset on disk, called the ctid.
  • Every row version has a unique ctid because each occupies different space. However, multiple ctids can share the same disk block (e.g. multiple versions of a row).
| Row 0 (ctid 1) |
                       ----------> Disk block offset x
| Row 1 (ctid 2) |

PostGres Index Management

  • Each index maps a key to a value, where the value is a CTID.
  • Any change in a row’s data creates a new CTID, which requires updating all indexes.
  • This is expensive because:
    • PG uses the WAL (write-ahead log) technique, so each write happens at least twice.
    • PG replicates the WAL to secondary nodes.
    • The WAL carries CTID- and disk-offset-level information.
    • Replication across geographies is expensive, since the data volume is high.

PostGres Replication

  • At a secondary, while a transaction is in progress on a row, applying the WAL copy has to wait.
  • If the transaction runs for a long time, PG will terminate it after a timeout.
  • So there are two problems:
    • Transactions can terminate unexpectedly.
    • Replicas may lag the master more than expected.


Opera Browser: Architecture

Opera is built on the Chromium project. Chromium is an open-source project that uses WebKit, a free and open-source rendering engine open-sourced by Apple.

So, Google Chrome and Opera resemble each other a lot. Both use a process per tab and look similar. The benefits of using processes to handle tabs are:

  1. Better security for the browser, as processes can’t easily circumvent security rules.
  2. Inactive tabs can be identified and swapped out of memory, keeping the browser light-weight.

To understand more about rendering and Chromium architecture, please visit the following:

  1. https://www.chromium.org/developers/design-documents/multi-process-architecture
  2. https://www.chromium.org/developers/design-documents/displaying-a-web-page-in-chrome

File system in userspace (FUSE) on Linux : C implementation

  • BBFS is a good starting point for developing a file system in C.
  • The application should keep the file offset in mind before issuing a request. A write should be followed by a seek back to offset zero before issuing a read.
  • FUSE 2.7 reads 8K of data by default, in two 4K chunks.
  • The read happens on the last 4K block first, and then on the first 4K data block.


I had set the file size to 4MB in the getattr() implementation.

int bb_getattr(const char *path, struct stat *statbuf)
{
    int retstat = 0;
    char fpath[PATH_MAX];
    bb_fullpath(fpath, path);

    memset(statbuf, 0, sizeof(struct stat));
    if (strcmp(path, "/") == 0) {
        statbuf->st_mode = S_IFDIR | 0755;
        statbuf->st_nlink = 2;
    } else {
        statbuf->st_mode = S_IFREG | 0444;
        statbuf->st_nlink = 1;
        statbuf->st_size = 4 * 1024 * 1024;
    }
    return retstat;
}

The sequence of calls and their arguments is as follows:

bb_getattr(path="/abcd.txt", statbuf=0xc5387960)
    bb_fullpath:  rootdir = "/tmp", path = "/abcd.txt", fpath = "/tmp/abcd.txt"
bb_open(path="/abcd.txt", fi=0xc5daaa50)
    bb_fullpath:  rootdir = "/tmp", path = "/abcd.txt", fpath = "/tmp/abcd.txt"
    flags = 0x00008002
    fh_old = 0x00000000
    writepage = 0
    direct_io = 0
    keep_cache = 0
    fh = 0x0000000000000001
    lock_owner = 0x0000000000000000
bb_write(path="/abcd.txt", buf=0xc4966050, size=10, offset=0, fi=0xc5387a50)
    flags = 0x00000000
    fh_old = 0x00000001
    writepage = 0
    direct_io = 0
    keep_cache = 0
    fh = 0x0000000000000001
    lock_owner = 0x0000000000000000
bb_read(path="/abcd.txt", buf=0x06ccbd90, size=12288, offset=4096, fi=0xc5daaa50)  <- Here
    flags = 0x00000000
    fh_old = 0x00000001
    writepage = 0
    direct_io = 0
    keep_cache = 0
    fh = 0x0000000000000001
    lock_owner = 0x0000000000000000

bb_read(path="/abcd.txt", buf=0x06ccbd90, size=4096, offset=0, fi=0xc5daaa50)
    flags = 0x00000000
    fh_old = 0x00000001
    writepage = 0
    direct_io = 0
    keep_cache = 0
    fh = 0x0000000000000001
    lock_owner = 0x0000000000000000

Dissecting Python List allocator (CPython)

List is one of the most frequently used data structures in Python applications. The code for the Python list object allocator is in Python-2.7.5/Objects/listobject.c.

Python allocates list objects on the heap using the underlying malloc-based APIs. A list is a container of Python object references; in plain C terminology, it is a dynamic array of pointers.

Let’s understand basic creation and insertion APIs.

  • list_resize(): A list is of type PyListObject, which can be treated as a PyObject. Every resize request allocates more than the asked size.

Allocation unit size is: new_allocated = size_requested + (size_requested >> 3) + (size_requested < 9 ? 3 : 6).

It gives a growth pattern of (0, 4, 8, 16, 25, 35, 46, 58, 72, 88).

  • free_list: Python keeps up to PyList_MAXFREELIST (defined as 80) freed lists cached, to save on list allocation and deallocation.
  • PyList_New(): A new list allocation request is served either from a free list or by creating a new GC-tracked object of type PyListObject. Next, a malloc request is made for the requested list size.

Size calculated: nbytes = size * sizeof(PyObject *);

Object allocation: PyListObject *op = free_list[numfree] OR PyObject_GC_New(PyListObject, &PyList_Type);

Memory allocation for references: op->ob_item = (PyObject **) PyMem_MALLOC(nbytes);