Why ‘cd’ is not an external command like ls?

cd command changes the current directory to desired valid path. The shell is a process that runs user commands. Each command is run by forking a new process and exec’ing. After exec, the child process dies.

cd changes the current environment. If cd is an external command, the usual flow of fork->exec would cause environment change in the child process; the parent remains unchanged.

Hence cd is built and implemented by the shell itself.

Written with StackEdit.


Go Runtime Scheduler Design Internals

Concurrency is one of the most exciting features of Go language. A single threaded program runs serially, but if you have tasks that can run concurrently, you create threads for such tasks. Threads execute independently and progress concurrently. Go supports creation of thousands of threads in an application!! How is it possible? It’s Go’s runtime. Go programs are compiled and the executable is self-contained (runtime is linked with the binary).

Let’s understand the design motivation of the Go runtime. The runtime must follow the resource constraints. The system must run multiple threads. CPU core can run only one thread a time and if there are more threads than available cores, threads are paused/resumes (context switched). During a context switch, thread execution state is preserved and another thread is loaded. Creation of a thread requires resources and hence there is a limit.

Under the constraints, Go runtime maximise CPU utilisation, minimises latencies and memory footprint.

Go provides concurrency with language primitives of Goroutines and channels. Using Goroutines applications can grow dynamically (forking new Goroutines). Channels are internal to Go runtime and system has no knowledge of channels.

Let’s understand Goroutines in detail. It is essentially a light weight thread, exists on the user space. Goroutines are frugal with resources. Unlike system threads, a Goroutine has a small stack and it grows as needed. It is one of the reasons that Go can support creation of thousands of Goroutines.

So how are thousands of Goroutines managed by the runtime? Instead of delegating the responsibility to the system scheduler, Go uses its own scheduler. It is coded in Go itself.

How does Go scheduler work? Let’s understand thread model used in applications. An application can use system thread, managed by the OS. These threads can make system call and access system resources (e.g. CPU). However, these threads are expensive as they consume system resources such as signal masks, PID, cgroup etc. Context switch is expensive as they trap in to kernel. In contrast, user threads are created and managed by the application, consume less resources and context switch is fast because it does not go through the kernel. A user thread needs a system thread to execute the code on CPU or accessing any other system resources.  

Now the next decision is to maintain a ratio of user threads and system threads. The first model is using N user threads and one system threads. It gives fast context switching but can’t use multiple cores (if available). Also if a user thread is blocked hence there is no available system thread, other user threads will wait. Other scheme is to have 1-on-1 mapping of user threads to system threads. It provides good CPU utilisation but context switching is slow. Then third option is to create many-to-many mapping.

Go takes the third option. Goroutines are distributed on a set of system/OS threads. There are three major entities in the Go scheduler: A set of machines (M), Goroutines (G) and processors (P). There are minor entities such as global and local run queue and thread cache.

Let’s understand “MGP”: Machines are a representation for OS threads. An OS thread is part of a pool of worker threads. On Linux, it is a standard POSIX thread. A M runs a set of Gs. A G Goroutine represents a user space Goroutine; it has its own IP, stack and blocking info. The P represents a logical entity for available processors or a scheduling context. Every worker (M) needs a P to run G. The number of P is pre-decided (GOMAXPROCS) and fixed during the run. 

A M runs a set of Gs. A G Goroutine represents a user space Goroutine; it has its own IP, stack and blocking info. The P represents a logical entity for available processors or a scheduling context. Every worker (M) needs a P to run G. The number of P is pre-decided (GOMAXPROCS) and fixed during the run. 

In this diagram, we have a setup with 2 Processors, defined by GOMAXPROCS. There are two worker machine (M). Each M has a local run queue. As new Goroutines are created and is runnable, they are added to the local run queue. If local run queue is full, new G’s are added to the global run queue. Idle G are kept in an idle queue.

What happens when a G makes a system call? The scheduler would know that G is blocked and hence its M is also blocked. P is not utilised, and can be used by some other M.

The scheduler takes back the P, and creates a new M, and assign to the new worker machine (M). The runnable queue is also moved to the new worker. When M’ comes back, it would try to find an idle P. If not possible, it would move the G’ to the global queue and park itself in Thread cache. The scheduler makes sure there are enough threads to run all contexts. There can be more than M even for a P=1, because a worker might get stuck in a syscall.

A scenario can occur in which a P is idle as its run queue is exhausted. In such was it would try to pick G’s from global queue. If global queue is also empty, what to do?  There are two major scheduling paradigms for distributing work. The first is work-sharing in which G would get distributed to other P’s. In contrast, work-stealing scheduler an idle processor would steal G’s from other P’s run queue. Till a P is busy, there is no G movement. So, an idle P would steal about half of Gs from another P. Work Stealing has better resource utilisation and lower migration of Gs.

In this diagram, we have an idle P, local and global run queue are empty, so it will randomly pick a P and steal half of Gs.

Go has a sophisticated scheduler, it offers massive concurrency and intelligent scheduling and always tries to achieve maximum utilisation, minimum latencies.

NAT Protocol Simplified Explanation

  • The purpose of NAT protocol is to reduce usage of public IPs
  • A host needs a public IP to connect to Internet
  • If the host is part of a LAN with a gateway router, a host can use private IP to make requests to public Internet.
  • The public Internet would see that all the requests are originating from a LAN (i.e. the gateway router)
  • A router has a local LAN IP and a public IP.
  • The request flows as following:
    • A local host in the LAN can make a request to a web server on Internet.
    • The host request goes frpm local host IP and port to the local gateway.
    • The gateway maintains a NAT table.
      • An entry in the NAT table will have the source and destination mapping
        Local Host IP | Local Host port | Gateway public IP | Gateway port |
        • The gateway creates a port that maps request to and from the local host to the public Internet web server.
        • Hence the public server would always see the gateway IP and port and LAN host would be anonymous.
      • NAT is essentially a kind of multiplexing local hosts requests over gateways single IP and multiple ports, assigned to each local host.


Written with StackEdit.

Design Problems of PostGres- Part II

The problems in PostGres DB are solved in MySQL to an extent. The design merits of MySQL are the following:

  • The primary index has a mapping of key to an offset in the disk. But all secondary index tables have a mapping of key to the primary index’s key.
Primary index  
| key | Disk offset|  
Secondary index  
| key | primary key |  
  • The drawback is that a lookup of the secondary table needs two hops.
  • On the other hand, a change in a row only needs modification of the primary index. It avoids changing the secondary index for every change in row’s data
  • MySQL replication uses logical information instead of data with physical information used in PostGres.
  • Logs are compact compared to PG.
  • MySQL manages its own cache, unlike PG that uses buffer cache.
  • PG buffer cache use is inefficient since a read operation would have to land in the buffer cache and hence a context switch.
  • PG uses a process per connection. It is expensive. MySQL uses threads for a connection and scales without too much resource consumption.

Design Problems of PostGres- Part I

This post is a quick summary of Why Uber moved from PostGres to MySQL.

PostGres Rows and CTID

  • PostGres provides transactions. Transactions need multiple versions of data. So PG is Multi Versioned DB.
  • PG considers each row immutable. Any change to a row creates a new row version.
  • A row is represented as an offset in disk, called ctid
  • Every row will have a unique ctid because they occupy a different space. However, multiple rows (ctid) can share same disk offset (e.g. multiple verisons of a row)
| Row 0 (ctid 1) |
                       ----------> Disk block offset x
| Row 1 (ctid 2) |

PostGres Index Management

  • Each index has key and values as CTID
  • Any change in a row’s data creates a new CTID. That would need changing all indexes.
  • This is expensive because
    • PG uses WAL technique so each write is at least twice
    • PG replicates WAL to secondary nodes.
    • The WAL has CTID and disk offset level information.
    • Replication across geographies is expensive since data volume is high.

PostGres Replication

  • At a secondary, while a transaction is in progress on a row, the WAL copy will wait.
  • If transaction runs for a long time, PG will terminate the transaction after a timeout.
  • So there are two problems:
    • Transactions can terminate unexpectedly
    • Replicas may be lagging the master more than exptected

Written with StackEdit.

Design Internals of UNIX Non Blocking Sockets

Fundamental of asynchronous programming: Non Blocking Sockets

Why async socket programming

  • An application can create thousands of sockets but creating a thread per socket becomes infeasible for a large number (e.g. 100k sockets)
  • An async framework consists non-blocking sockets and an event loop
    * The kernel allows applications to seek for an event
    * Application registers a callback to be called when the event occur
    * Callback runs by the same application thread after kernel intimates
      that an event happened on a socket
  • This model uses a single thread to poll many sockets for various events and calls the callback.
  • It is concurrent that is runs many tasks but not parallel execution of tasks.

How do a socket turn into non-blocking?

  • A socket is a file, read and written with standard POSIX calls.
  • A system call for read operation uses the kernel BIO (Block I/O). A BIO provides an interface to access a block device.

    FIONBIO int
    Set non-blocking I/O mode if the argument is non-zero. In non-
    blocking mode, read(2) or write(2) calls return -1 and set errno
    to EAGAIN immediately when no data is available (This is
    equivalent to fcntl() F_SETFL O_NONBLOCK and the fcntl() form
    should be preferred)

Written with StackEdit.

What is ETag?

A web client (a browser) requests a resource from a web server.
Multiple calls for a resource would hit the server everytime and
server sends the response with return code 200.

What if the requested data is unchanged for most of the calls? Could the client somehow help server with the previous value of the requested resource?


  • The purpose of ETag is to Save bandwidth and utilize client side caching.
  • The client gets a hash of content for the first time request of a resource.
  • The next call from the client has the ETag=Received hash value.
  • Server checks if the value of the requested respurce has modified. So it recalculates the hash and compares it against the received hash.
  • If Etag is same, server returns 304.

Written with StackEdit.

Golang Runtime and Concurrency

  • Golang uses a user-space component (runtime) linked to the executable.
  • The runtime is written in C.
  • It has implementation of scheduler, goroutine management and OS-threads management.
  • Per go process, there is a max limit of OS threads.
  • Go runtime schedules N goroutines on M OS threads
  • One goroutine runs exactly on one thread.
  • A goroutine can get blocked (e.g. on a syscall) and blocks the OS-thread too.


Linux Device Driver Development: Block Device Driver

It is my very first interaction with Linux kernel at device driver level. My objective is to develop a block device driver, very simple, that just forward I/O requests to a virtual device. This post explains my observations limited to attacking the problem.

Block v/s Character Device

Linux support block and character device drivers. Only block devices can host and support a filesystem. Block devices support random read/write operations. Each block is composed of sectors, usually 512 bytes long and uniquely addressable. Block is a logical entity. Filesystems usually use 4096 bytes blocks (8*512) or 8 sectors. In Linux kernel, a block device is represented as a logical entity (actually just a C structure). So, we can export anything as a device as long as we can facilitate read/writes operations on sector level.

Device driver is the layer that glues Linux kernel and the device. Kernel receives device targeted I/O requests from an application. All I/O requests pass through buffer cache and I/O scheduler. The latter arranges I/O requests optimally to improve seek time, assuming requests would run on a disk. In fact, Linux kernel has various I/O schedulers and hence multiple type of I/O request order could exist.

A device driver always implement a request queue. The Linux I/O scheduler enqueues requests in driver’s queue. How to serve these requests? That is device driver’s headache. The request queue is represented by the request_queue structure and is defined in “blkdev.h". Driver dequeues requests from this queue and send them to device. It then acknowledgement to each requests with error status.

If a device do not need optimal I/O order, it may opt for direct handing of I/O requests. An excellent example of such driver is loopback driver (loop.c, loop,h). It handles struct bio that stands for block I/O. A bio structure is a scatter gather list of page aligned buffer (usually 4K). Handling of bio structure is almost same as a struct req.

What are requirements for my driver


  • Runs on flash storage drives
  • Perform plain I/O forwarding
  • Minimal overhead, minimal code size

In my next post, I will discuss design of my driver.

Mongo DB: Good to know things

  • Mongo DB is a No-SQL, free, open-source solution that is highly scalable, highly available and high performance solution.
  • Engine is coded in C++
  • Works in a client-server model
  • Major components:
    • mongod: The storage server
    • mongos: The sharding server
    • config server(s):
      • Stores metadata that accomplish sharding
      • Is actually a mongod process
  • Mongo provides write operations durability with journaling (write ahead logging)
  • User data is seen as a database of collection of records
    • Collection is roughly similar to a table in RDBMS
    • Record could be map to a row in a table (incorrect but helps understanding)
  • Mongo stores data in BSON format (on-wire and on-disk)