Netlink Sockets: Linux Kernel-User communication (PART I)

  • Ubuntu 14.04, Kernel version 3.11
  • Netlink sockets provide a full-duplex, asynchronous, low-overhead communication channel between user-space processes and the kernel.
  • Other solutions such as ioctl(), sysfs, and UDP sockets are either blocking (hence expensive), slow (UDP has more overhead than Netlink), or complex.
  • Netlink can carry data buffers in both directions: from kernel to user space and vice versa.
  • By nature, netlink sockets are non-blocking.
  • It provides sender and receiver queues to absorb bursts of messages.
  • The user-space API is the ordinary socket API. You only have to specify the socket family as AF_NETLINK.
  • On the kernel side, the socket is created with netlink_kernel_create().
  • The default netlink queue size is 208K (212992 bytes). To raise it, run the following commands:
#default buffer size = 212992
echo 425984 > /proc/sys/net/core/wmem_max
echo 425984 > /proc/sys/net/core/wmem_default  
echo 425984 > /proc/sys/net/core/rmem_default
echo 425984 > /proc/sys/net/core/rmem_max
  • Netlink does not care about the contents of the data buffer you send or receive; the payload is opaque to it.
  • It allows unicasting, multicasting and broadcasting of messages.
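
As a rough sketch of the user-space side described above: the two helpers below (nl_open() and nl_build_msg() are my own hypothetical names, not part of any library) open an AF_NETLINK socket and wrap an opaque payload in a netlink message header. The protocol number you pass must match whatever the kernel module registered with netlink_kernel_create(); that part is an assumption and is not shown here.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Hypothetical helper: open a netlink socket for the given protocol and
 * bind it, using our PID as the unicast address. Returns the fd or -1. */
int nl_open(int protocol)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, protocol);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = (unsigned int)getpid(); /* unicast address of this process */
    addr.nl_groups = 0;                   /* no multicast groups */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: allocate a netlink message made of the standard
 * header followed by an opaque payload. The caller frees the result. */
struct nlmsghdr *nl_build_msg(const void *payload, size_t len)
{
    struct nlmsghdr *nlh = calloc(1, NLMSG_SPACE(len));

    if (nlh == NULL)
        return NULL;

    nlh->nlmsg_len = (unsigned int)NLMSG_LENGTH(len); /* header + payload */
    nlh->nlmsg_pid = (unsigned int)getpid();          /* sender's address */
    nlh->nlmsg_flags = 0;
    memcpy(NLMSG_DATA(nlh), payload, len);
    return nlh;
}
```

A message built this way would then be handed to sendmsg() on the socket returned by nl_open(), with a destination sockaddr_nl whose nl_pid is 0 to address the kernel.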

Linux Device Driver Development: Block Device Driver

This is my very first interaction with the Linux kernel at the device driver level. My objective is to develop a very simple block device driver that just forwards I/O requests to a virtual device. This post records my observations, limited to what I needed to attack the problem.

Block v/s Character Device

Linux supports block and character device drivers. Only block devices can host a filesystem. Block devices support random read/write operations. The smallest addressable unit is the sector, usually 512 bytes long, and each sector is uniquely addressable. A block is a logical entity composed of sectors: filesystems usually use 4096-byte blocks, i.e. 8 sectors (8*512). In the Linux kernel, a block device is represented as a logical entity (actually just a C structure). So we can export anything as a block device as long as we can support read/write operations at the sector level.
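
The sector and block arithmetic above can be sketched in a few lines. This is only an illustration with the sizes quoted in the text (512-byte sectors, 4096-byte blocks); the function names are mine and do not come from the kernel.

```c
#include <stdint.h>

#define SECTOR_SIZE   512u  /* bytes per sector, as assumed in the text */
#define FS_BLOCK_SIZE 4096u /* bytes per filesystem block */

/* Sector that contains the given byte offset. */
uint64_t offset_to_sector(uint64_t byte_offset)
{
    return byte_offset / SECTOR_SIZE;
}

/* First sector of the given filesystem block (8 sectors per block here). */
uint64_t block_to_sector(uint64_t block)
{
    return block * (FS_BLOCK_SIZE / SECTOR_SIZE);
}

/* Number of sectors touched by an I/O of len bytes starting at byte_offset. */
uint64_t sectors_spanned(uint64_t byte_offset, uint64_t len)
{
    uint64_t first, last;

    if (len == 0)
        return 0;
    first = byte_offset / SECTOR_SIZE;
    last = (byte_offset + len - 1) / SECTOR_SIZE;
    return last - first + 1;
}
```

Note that an unaligned request can touch one more sector than its length suggests: two bytes starting at offset 511 span two sectors.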

The device driver is the layer that glues the Linux kernel to the device. The kernel receives I/O requests targeted at the device from an application. All I/O requests pass through the buffer cache and the I/O scheduler. The latter reorders I/O requests to reduce seek time, assuming the requests will run on a disk. In fact, the Linux kernel has several I/O schedulers, so more than one request ordering is possible.

A block device driver typically implements a request queue. The Linux I/O scheduler enqueues requests in the driver’s queue; how to serve them is the driver’s headache. The request queue is represented by the request_queue structure, which is defined in “blkdev.h”. The driver dequeues requests from this queue and sends them to the device. It then acknowledges each request with a completion status (success or error).

If a device does not need an optimal I/O order, its driver may opt to handle I/O requests directly. An excellent example of such a driver is the loopback driver (loop.c, loop.h). It handles struct bio, which stands for block I/O. A bio structure is a scatter-gather list of page-aligned buffers (usually 4K). Handling a bio structure is almost the same as handling a struct request.

Requirements for my driver


  • Runs on flash storage drives
  • Performs plain I/O forwarding
  • Minimal overhead, minimal code size

In my next post, I will discuss the design of my driver.

Linux FUSE Internals for developers

In this post, I will cover FUSE internals for FUSE 2.9.3.

  • Install the fuse and fuse-devel packages on CentOS.
  • getattr() is a must in a FUSE file system. Even a minimal implementation is okay.
    • Just be careful with the file size in the stat structure. If you forget to compile your file system with the 64-bit flags (-D_FILE_OFFSET_BITS=64), stat.st_size is a signed 32-bit integer (31 usable bits).
    • In that case the file size must not exceed 2GB; otherwise it overflows to zero.
  • In the user application, be careful with file I/O operations. A read() issued immediately after a write() fetches nothing, because the file offset already points past the data just written. You should first lseek() to the beginning of the file in your application.
  • FUSE has two modes of operation:
    • Single thread (very low performance, easy to debug)
    • Multi-thread (default operation)
  • Multi-thread mode spawns multiple threads during read operations. For writes, I observed almost single-thread-like behavior.
  • The code for the multi-threaded I/O implementation is in lib/fuse_loop_mt.c.
  • FUSE uses worker threads to handle I/O requests, each described by a struct fuse_worker. A worker thread is created in fuse_start_thread().
  • Each worker runs the fuse_do_work() function. This is an infinite loop that terminates only on session exit or when the number of active threads exceeds what is required.
  • The user's implementations of the file system APIs are registered in a const struct fuse_operations, which holds the address of every implemented API. FUSE ultimately calls these functions to perform file system operations.
  • FUSE 2.7 reads 8K of data by default, in two 4K chunks.
    • The reads are issued for the last 4K block and for the first 4K block.
  • An example:

    I set the file size to 4MB in my getattr() implementation. If you forget to compile with the 64-bit flags, you will get zero-length files.

    #include <string.h>
    #include <sys/stat.h>

    int bb_getattr(const char *path, struct stat *statbuf)
    {
        int retstat = 0;

        memset(statbuf, 0, sizeof(struct stat));
        if (strcmp(path, "/") == 0) {
            /* the root of the file system is a directory */
            statbuf->st_mode = S_IFDIR | 0755;
            statbuf->st_nlink = 2;
        } else {
            /* every other path is a read-only regular file of 4MB */
            statbuf->st_mode = S_IFREG | 0444;
            statbuf->st_nlink = 1;
            statbuf->st_size = 4 * 1024 * 1024;
        }
        return retstat;
    }

    The sequence of calls and their arguments is as follows:

    bb_getattr(path="/abcd.txt", statbuf=0xc5387960)
        rootdir = "/tmp", path = "/abcd.txt"
    bb_open(path"/abcd.txt", fi=0xc5daaa50)
        rootdir = "/tmp", path = "/abcd.txt"
        flags = 0x00008002
        fh_old = 0x00000000
        writepage = 0
        direct_io = 0
        keep_cache = 0
        fh = 0x0000000000000001
        lock_owner = 0x0000000000000000
    bb_write(path="/abcd.txt", buf=0xc4966050, size=10, offset=0, fi=0xc5387a50)
        flags = 0x00000000
        fh_old = 0x00000001
        writepage = 0
        direct_io = 0
        keep_cache = 0
        fh = 0x0000000000000001
        lock_owner = 0x0000000000000000
    bb_read(path="/abcd.txt", buf=0x06ccbd90, size=12288, offset=4096, fi=0xc5daaa50)  <- Here
        flags = 0x00000000
        fh_old = 0x00000001
        writepage = 0
        direct_io = 0
        keep_cache = 0
        fh = 0x0000000000000001
        lock_owner = 0x0000000000000000
    bb_read(path="/abcd.txt", buf=0x06ccbd90, size=4096, offset=0, fi=0xc5daaa50)
        flags = 0x00000000
        fh_old = 0x00000001
        writepage = 0
        direct_io = 0
        keep_cache = 0
        fh = 0x0000000000000001
        lock_owner = 0x0000000000000000
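
The read-after-write pitfall from the list above is not specific to FUSE and can be reproduced with any file descriptor. Here is a minimal sketch; write_then_read() is a hypothetical helper of mine that writes a string, rewinds with lseek(), and only then reads the data back.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write text to fd, rewind, and read it back into out.
 * Returns the number of bytes read, or -1 on error. */
ssize_t write_then_read(int fd, const char *text, char *out, size_t out_len)
{
    size_t len = strlen(text);

    if (write(fd, text, len) != (ssize_t)len)
        return -1;

    /* The file offset now points just past the data we wrote, so a read()
     * at this point would return 0 bytes. Rewind first. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;

    return read(fd, out, out_len);
}
```

Without the lseek() call, the read() returns 0 on a freshly written file, which is exactly the "fetches nothing" symptom.
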
WRITE stack trace
(gdb) bt
#0  bb_write (path=0x7ffc68000990 "/test_file.0", buf=0x7ffff6f42060 "", size=4096, offset=4096, fi=0x7ffff6f40550) at bbfs.c:136
#1  0x00007ffff7dc885f in fuse_fs_write_buf (fs=0x280f090, path=0x7ffc68000990 "/test_file.0", buf=0x7ffff6f40580, off=4096, fi=0x7ffff6f40550)
    at fuse.c:1878
#2  0x00007ffff7dccb37 in fuse_lib_write_buf (req=0x7ffc680008c0, ino=2, buf=0x7ffff6f40580, off=4096, fi=0x7ffff6f40550) at fuse.c:3278
#3  0x00007ffff7dd461b in do_write_buf (req=0x7ffc680008c0, nodeid=2, inarg=0x7ffff6f42038, ibuf=0x7ffff6f40800) at fuse_lowlevel.c:1300
#4  0x00007ffff7dd7369 in fuse_ll_process_buf (data=0x280f220, buf=0x7ffff6f40800, ch=0x280ece0) at fuse_lowlevel.c:2437
#5  0x00007ffff7dd9aa5 in fuse_session_process_buf (se=0x280ed30, buf=0x7ffff6f40800, ch=0x280ece0) at fuse_session.c:87
#6  0x00007ffff7dd0f6a in fuse_do_work (data=0x7ffff00008c0) at fuse_loop_mt.c:117
#7  0x00000037bc2079d1 in start_thread () from /lib64/
#8  0x00000037bbee8b6d in clone () from /lib64/

READ stack trace
(gdb) bt
#0  bb_read (path=0x7ffff38c55b0 "/test_file.0", buf=0x7ffff38c56c0 "", size=4096, offset=8192, fi=0x7ffff79635d0) at bbfs.c:111
#1  0x00007ffff7dc841e in fuse_fs_read_buf (fs=0x280f090, path=0x7ffff38c55b0 "/test_file.0", bufp=0x7ffff7963578, size=4096, off=8192, 
    fi=0x7ffff79635d0) at fuse.c:1794
#2  0x00007ffff7dcca1d in fuse_lib_read (req=0x7ffff002a1e0, ino=2, size=4096, off=8192, fi=0x7ffff79635d0) at fuse.c:3252
#3  0x00007ffff7dd42c7 in do_read (req=0x7ffff002a1e0, nodeid=2, inarg=0x7ffff7965038) at fuse_lowlevel.c:1232
#4  0x00007ffff7dd73ce in fuse_ll_process_buf (data=0x280f220, buf=0x7ffff7963800, ch=0x280ece0) at fuse_lowlevel.c:2441
#5  0x00007ffff7dd9aa5 in fuse_session_process_buf (se=0x280ed30, buf=0x7ffff7963800, ch=0x280ece0) at fuse_session.c:87
#6  0x00007ffff7dd0f6a in fuse_do_work (data=0x280ee30) at fuse_loop_mt.c:117
#7  0x00000037bc2079d1 in start_thread () from /lib64/
#8  0x00000037bbee8b6d in clone () from /lib64/

Compiling FUSE based file system with your FUSE build

Suppose hello.c contains the implementation of the file system APIs and your FUSE installation resides in /home/k/Desktop/my_fuse_2.9.3.

$ gcc -g hello.c -o hi -D_FILE_OFFSET_BITS=64 -I/home/k/Desktop/my_fuse_2.9.3/include -lpthread \
    -L/home/k/Desktop/my_fuse_2.9.3/lib -lfuse -LLIBDIR=/home/k/my_fuse_2.9.3/lib \
    -Wl,-rpath -Wl,/home/k/my_fuse_2.9.3/lib
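
To get a feel for why -D_FILE_OFFSET_BITS=64 matters, here is a small sketch that checks whether a file size still fits in a signed 32-bit offset and prints the width of off_t in the current build. Both function names are mine, made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Returns 1 if size can be stored in a signed 32-bit st_size (up to
 * 2GB - 1), and 0 if it would overflow. */
int fits_32bit_offset(uint64_t size)
{
    return size <= (uint64_t)INT32_MAX;
}

/* Print how wide off_t is in this build. On a 32-bit build this is where
 * -D_FILE_OFFSET_BITS=64 makes the difference. */
void report_off_t_width(void)
{
    printf("off_t is %zu bits wide\n", sizeof(off_t) * 8);
}
```

A 4MB file fits either way, but anything of 2GB or more needs the 64-bit off_t.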

Linux kernel: interesting optimizations


Every process has a read-only page mapped into its address space. This page is at a fixed location and holds the current time, which is updated at every clock tick. Since the page is mapped into the process's memory, no system call is required to read the time: the process can read the page directly. This page is called the vsyscall page and is filled in by the kernel at start time.

It has been part of the kernel since version 2.6.


Ubuntu on a Windows host: Alternative to VirtualBox and VMWare

Wubi is a cool alternative to VMPlayer and VirtualBox for running Ubuntu “almost” natively on your Windows system. It gives you a dual-boot machine without repartitioning your disk.

How to do it
• Install Wubi
• Place your Ubuntu ISO in the _same_ place where your Wubi binaries are
• Install Ubuntu from the Wubi installer

After the Ubuntu installation and a reboot, you will see a dual-boot menu offering Windows and Ubuntu. Ubuntu runs on bare hardware, except for disk accesses.

How it works
Wubi is based on loopback devices in Linux. A loopback device exports a file as a block device. You can create a file system on this “file” and mount it.
Wubi creates a file (“root.disk”) in your Windows NTFS file system and exposes it to Ubuntu as a loopback device. This file is formatted with a file system and used by Ubuntu.

In my Ubuntu system:

kanaujia@ubuntu:/tmp$ sudo mount
[sudo] password for kanaujia: 
/dev/loop0 on / type ext4 (rw,errors=remount-ro)

kanaujia@ubuntu:/tmp$ sudo losetup -a
/dev/loop0: [0801]:115068 (/host/ubuntu/disks/root.disk)

kanaujia@ubuntu:/tmp$ cat !$
cat /etc/fstab
# /etc/fstab: static file system information.
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    nodev,noexec,nosuid 0       0
/host/ubuntu/disks/root.disk /               ext4    loop,errors=remount-ro 0       1
/host/ubuntu/disks/swap.disk none            swap    loop,sw         0       0

That’s it! It is a simple concept used beautifully. I wonder whether this setup has a negative performance impact; I will find that out later.

Anyway, for fun, I experimented with creating my own file system on a loopback device:

Create a file with random data
kanaujia@ubuntu:/tmp$ dd if=/dev/urandom of=/home/kanaujia/Desktop/myfs bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.16245 s, 9.0 MB/s

Create a mount point
kanaujia@ubuntu:/tmp$ sudo mkdir /mnt/myfs

Update /etc/fstab
kanaujia@ubuntu:/tmp$ sudo vi /etc/fstab

Setup the loopback device
kanaujia@ubuntu:/tmp$ sudo losetup /dev/loop1 /home/kanaujia/Desktop/myfs

Format the device as a file-system
kanaujia@ubuntu:/tmp$ mkfs.ext3 -c /dev/loop1
mke2fs 1.42 (29-Nov-2011)
mkfs.ext3: Permission denied while trying to determine filesystem size
kanaujia@ubuntu:/tmp$ sudo mkfs.ext3 -c /dev/loop1
mke2fs 1.42 (29-Nov-2011)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
2560 inodes, 10240 blocks
512 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=10485760
2 block groups
8192 blocks per group, 8192 fragments per group
1280 inodes per group
Superblock backups stored on blocks: 

Checking for bad blocks (read-only test): done                                                 
Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

kanaujia@ubuntu:/tmp$ sudo mount /dev/loop1
kanaujia@ubuntu:/tmp$ mount
/dev/loop0 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sda1 on /host type fuseblk (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
gvfs-fuse-daemon on /home/kanaujia/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=kanaujia)
/dev/loop1 on /mnt/myfs type ext3 (rw,noexec,nosuid,nodev)

kanaujia@ubuntu:/tmp$ cd /mnt/
kanaujia@ubuntu:/mnt$ ls

kanaujia@ubuntu:/mnt$ cd myfs/

kanaujia@ubuntu:/mnt/myfs$ ls

kanaujia@ubuntu:/mnt/myfs$ ll
total 17
drwxr-xr-x 3 root root  1024 Jul 11 13:32 ./
drwxr-xr-x 3 root root  4096 Jul 11 13:30 ../
drwx------ 2 root root 12288 Jul 11 13:32 lost+found/

kanaujia@ubuntu:/mnt/myfs$ sudo touch hh
kanaujia@ubuntu:/mnt/myfs$ ls
hh  lost+found
kanaujia@ubuntu:/mnt/myfs$ ll
total 17
drwxr-xr-x 3 root root  1024 Jul 11 13:34 ./
drwxr-xr-x 3 root root  4096 Jul 11 13:30 ../
-rw-r--r-- 1 root root     0 Jul 11 13:34 hh
drwx------ 2 root root 12288 Jul 11 13:32 lost+found/

Loopback Devices in Linux

Android OS: Google DVM and Virtual machines

Published in LinuxForYou, Jun 2011


With the proliferation of heterogeneous systems, there is a strong need for scalable software that does not drive up the cost of development and maintenance. A virtual machine (VM) abstracts away the heterogeneity and offers a low-cost, scalable software development environment. Modern VM-based programming languages such as Java are a prime example of this.

In this article, we will try to understand the fundamental concepts of a VM by taking the example of the Dalvik VM, one of the critical components of Google’s Android operating system software stack.

The complete article is available here

How does tail -f work?

“tail -f” is special in that it watches the specified file for changes and prints new data on the fly. It is very helpful for observing logs and any event-based data.

Ever wondered how tail achieves this?

“tail” opens the given file and obtains a file descriptor. It opens the file through the call chain xfreopen() -> freopen() -> fopen(). It also does its first round of fstat() on the file.

Once it has the fd, it loops forever and does the following:

It calls fstat() on the file and checks the mtime value. If the mtime has changed since the last check, it dumps the new data. To print only the latest data, it calls lseek() to move to the last reported file size.
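
One iteration of that polling loop can be sketched as follows. This is not the coreutils code: read_appended() is a hypothetical helper of mine, and to keep it short it compares st_size against the last reported size instead of looking at mtime.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy whatever was appended to fd since *last_size
 * into out. Returns the number of bytes read (0 if the file has not
 * grown), or -1 on error. *last_size is updated for the next call. */
ssize_t read_appended(int fd, off_t *last_size, char *out, size_t out_len)
{
    struct stat st;
    ssize_t n;

    if (fstat(fd, &st) != 0)
        return -1;

    if (st.st_size <= *last_size) {
        /* Nothing new (or the file was truncated): remember the size. */
        *last_size = st.st_size;
        return 0;
    }

    /* Jump to the last reported size and read only the new data. */
    if (lseek(fd, *last_size, SEEK_SET) == (off_t)-1)
        return -1;

    n = read(fd, out, out_len);
    if (n > 0)
        *last_size += n;
    return n;
}
```

A tail-like program would call this in a loop, print the returned bytes, and sleep briefly whenever it gets 0 back.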


A strace of tail on my Ubuntu 11.04 is as follows:

kanaujia@ubuntu:~/Desktop/FUSE/dedup$ strace tail -f ./ 
execve("/usr/bin/tail", ["tail", "-f", "./"], [/* 38 vars */]) = 0
brk(0)                                  = 0x9567000
open("./", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=103, ...}) = 0
_llseek(3, 0, [0], SEEK_CUR)            = 0
_llseek(3, 0, [103], SEEK_END)          = 0
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "FILE = open('./myfs/dood',\"w\")\np"..., 103) = 103
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "FILE = open('./myfs/dood',\"w\")\np"..., 103) = 103
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb775c000
fstat64(3, {st_mode=S_IFREG|0644, st_size=103, ...}) = 0
fstatfs64(3, 84, {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=4902319,
         f_bfree=3748629, f_bavail=3499605, f_files=1245184, f_ffree=1041484,
         f_fsid={598995932, 149996801}, f_namelen=255, f_frsize=4096}) = 0
inotify_init()                          = 4
inotify_add_watch(4, "./", IN_MODIFY|IN_ATTRIB|IN_DELETE_SELF|IN_MOVE_SELF) = 1
fstat64(3, {st_mode=S_IFREG|0644, st_size=103, ...}) = 0
read(4, ....

So, besides fstat(), it uses the inotify API: it adds a watch on the file to an initialized inotify instance and then blocks in read() on the inotify descriptor until the file changes.
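
The inotify part of that trace can be reproduced with a few calls. The sketch below uses the same event mask as the strace output; watch_file() and file_was_modified() are hypothetical helper names of mine, and I use poll() with a timeout instead of a plain blocking read() so the example cannot hang.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an inotify instance that watches path.
 * Returns the inotify fd, or -1 on error. */
int watch_file(const char *path)
{
    int ifd = inotify_init();

    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, path,
                          IN_MODIFY | IN_ATTRIB | IN_DELETE_SELF | IN_MOVE_SELF) < 0) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Hypothetical helper: wait up to timeout_ms for an event. Returns 1 if
 * the first queued event is IN_MODIFY, 0 on timeout or any other event,
 * and -1 on error. */
int file_was_modified(int ifd, int timeout_ms)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };
    const struct inotify_event *ev;
    ssize_t n;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0)
        return ready; /* 0: timeout, -1: error */

    n = read(ifd, buf, sizeof(buf));
    if (n < (ssize_t)sizeof(struct inotify_event))
        return -1;

    ev = (const struct inotify_event *)buf;
    return (ev->mask & IN_MODIFY) ? 1 : 0;
}
```

Compared with the fstat() polling loop, this lets the process sleep until the kernel reports a change.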

Linux Scheduling: A Few Facts

– The Linux scheduler favors I/O-bound processes. It uses dynamic priorities to schedule processes, so a process that has not had the CPU for a long time gets its priority increased, and vice versa.

– Processes are moved between different queues, and all processes on the ready queue are assigned an ‘epoch’. The epoch is relevant only for processes in the ready queue.

– Each process is assigned a quantum, which is the CPU time allotted to it. If a process is blocked, it does not use up its quantum, and the unused quantum is carried forward to the next epoch. An epoch completes as soon as all processes in the ready queue have used up their quantum.

– The dynamic priority (“goodness”) of a process is calculated from its base priority and its quantum. Hence an I/O-bound process that is blocked for a long time gets its priority improved every time it saves part of its quantum while blocked.
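
Here is a toy model of the quantum and goodness bookkeeping described above. It is loosely modeled on the old epoch-based (2.4-era) scheduler, where, as far as I remember, half of the unused quantum is carried into the next epoch. The struct and function names are mine and the numbers are arbitrary, so take it as an illustration and not as kernel code.

```c
/* Toy model of a schedulable task. */
struct task {
    int base_priority; /* ticks granted at the start of every epoch */
    int quantum;       /* ticks still unused in the current epoch */
};

/* Epoch rollover: assumed rule is "half of the unused quantum carries
 * forward, plus the base allotment". */
void start_new_epoch(struct task *t)
{
    t->quantum = t->quantum / 2 + t->base_priority;
}

/* Dynamic priority: 0 when the quantum is exhausted, otherwise the
 * remaining quantum plus the base priority. */
int goodness(const struct task *t)
{
    if (t->quantum == 0)
        return 0;
    return t->quantum + t->base_priority;
}
```

With a base priority of 6, a CPU-bound task that burned its whole quantum starts the next epoch with 6 ticks, while an I/O-bound task that slept through the epoch starts with 9 and therefore has the higher goodness.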