Skip to content


How to Select a Random Key from a Hash Map in Constant Time?


A hashmap has a time complexity of O(1) for all operations.


You have to find a constant time method to uniformly random selection of a key in a Hash map.


  1. The map can grow to memory size.
  2. You can use any readymade hash map.


I’m discussing the pseudo code for a Python solution. All operations work in constant time except remove().

randMap = {}

def insert(k, v):
    randMap[k] = v    

The random must pick a random key. But how? We can store all the keys in a list and then run random() on the list indexes.

randMap = {}
indexMap = {}
keyList = []
last = 0
def insert(k, v):
    randMap[k] = v
    last += 1
    # Maintain index of each key
    indexMap[k] = len(keyList)    
def getRandom():
    start = 0
    end = len(keyList) - 1
    randomIndex = random(start, end)
    key = keyList[randomIndex]
    return key

How do you delete a key?

This is the crux of the problem. Deletion of a key would need us to change the index array too. That means we need to shift all elements of the array and complexity would shoot to O(n).

def remove(k):
    # Remove from the hash map
    randMap.pop(k, None)
    # Remove from key list
    index = indexMap[k]
    # This is O(n) operation

    # Pop the index too

How to improve deletion?

  1. Delete the entry and mark the entry invalid. The probability distribution does not change, however with more and more deletions, you have a sparse array. Thus you would need multiple getRandom() to get a valid key.
  2. Move the last element to the deleted element. Adjust the index of the last element.
def remove(k):
    # Remove from the hash map
    randMap.pop(k, None)

    index = indexMap[k]
    indexMap[last] = index
    # This is O(1) operation
    keyList[index] = keyLast[last]

    # Pop the index too
    last -= 1 

This is a constant time solution.

How to Write a Conference Scheduler

The conference schedule is an NP problem since it is O(2**n). A greedy approach is pragmatic to solve the problem. We can pick various strategies to prepare a schedule such as the biggest talk first, smallest talk first, random selection, etc.

The Python code uses the Longest talk first and random selection to prepare schedules.

Github Code

Written with StackEdit.

Notes on Python Decorator

Notes on Python Decorator

Python decorator is a function wrapper. It helps you wrap essential tasks around your functions such as logging, ratelimits, metrics updates etc. However, it has a peculiar behavior due to its dynamic implementation.

If you run the following code, what is the expected output?

def register(func):
    print("hello to decorator!")
    def another():
    ┆   print("wrapped now!")
    ┆   func()
    return another

def test():
    print("I am a test function")
    return ""

The interesting thing about this code is that it prints the line “hello to decorator” even without any function invocation.

But Why?

The reason is that during the decorator application to the function test interpreter would run all the code inside decorator and prepare a new function object. Decorator essentially creates new functions. Since the new function is created dynamically, Python VM runs all the code inside the wrapper (decorator) function and ultimately retunrs a new function object.

Another example

import sys

def first(func):
    print("hello sir!")
    return func

def second(func):
    for i in range(1,5):
    ┆   print(i)

    def another():
    ┆   print("hello sir!")
    ┆   func()
    return another

def test():
    print("I am a test function1")
    return ""

# Should cause conflict, but doesn't
def test():
    print("I am a test function1")
    return ""

# Another function with the same name
# Python VM does not complain of name conflict as
# the decorated function would assume a new name
def test():
    print("I am a test function2")
    return ""


$ python
hello sir!
hello sir!

Things to look for

  • There is no name collison problem for a decorated function
  • A decorated function get a name of parent function and and offset
  • Decorator code is run even before main() is called


Sample code

Written with StackEdit.

MySQL & Python- Error: 2006 mysql has gone away

MySQL & Python: Error: 2006 mysql has gone away

This problem occurs for multiple reasons such as DB connection problem. In our code, we hit this issue due to a subtle problem with DB cursor.

The code was as following:

with conn as cur:
        print "hello"
        print "sorry"

The above code would throw the error 2016 mysql has gone away exception. The problem lies in with conn as cur. This statement creates a cursor on the DB and the cursor object autmatically gets destroyed.
Here, we are closing the DB connection before the automatic destruction happened.

So since connection was invalid(closed), cursor deletion hit an exception.

The solution is to close the connection after cursor object deletion.

    with conn as cur:
        print "hello"
    print "sorry"

Written with StackEdit.

Google App Engine: Data store with async calls

  • Google Data Store provides single core CPU  (F class) for applications.
  • Hence heavily multi-threaded applications can’t scale enough
  • DataStoreprovidesasync calls for DB access. It has:
    • Async queries
    • Async transactions
    • Async read/writes
  • Documentation is good but lacks clarity on usage and benefits on async calls. It says that async queries can run in parallel after immediately returning a temporary result object. But how?
  • I think Data Store can time share multiple requests, hence improving responsiveness and fairness.
  • How to convert sync calls to async:
  • Please refer Source: Google Developer website
    • I got confused at async queries. It uses run() menthod on a query to make it async. A get() method is a sync call.
    • Difference between run() and get() is that:
      • run() returns an iterator over results
      • run() pre-fetches results with a batch read, so access to elements in results is faster
      • get() is a sync call and returns the first result from result set
      • run(limit=1) == get()


q1 = Salesperson.all().filter('date_of_hire <', one_month_ago)
# Returns instantly, query is executing in the background.
recent_hires =

# q1 and recent_hires are both iterable objects. Your application should iterate over 'recent_hires' to enjoy benefits of pre-fetched reads.
##### for h in recent_hires:
# instead of
##### for h in q1:

Dissecting Python List allocator (CPython)

List is one of the most frequently used data structure in Python applications. The code for Python list object allocator is present in Python-2.7.5/Objects/listobject.c.

Python allocates list objects in heap and uses underlying malloc based APIs. List is a container of Python object references. In plain C terminology. it is a dynamic array of pointers.

Let’s understand basic creation and insertion APIs.

  • list_resize(): A list is of type PyListObject and is equivalent to PyObject. Every resize request allocates more than asked size.

Allocation unit size is : (size_requested + (size_requested << 3 + (size_requested < 9 : 3 or 6) ).

It gives a growth pattern of (0, 4, 8, 16, 25, 35, 46, 58, 72, 88).

  • free_list: Python keeps PyList_MAXFREELIST (defined to value 80) of free lists to save on lists allocation and deallocation.
  • PyList_New(): A new list allocation requests is served from either a free list or a new GC tracked object creation of type PyListObject. Next, a malloc request is made for requested list size.

Size calculated: nbytes = size * sizeof(PyObject *);

Object allocation: PyListObject *op = free_list[numfree] OR PyObject_GC_New(PyListObject, &PyList_Type);

Memory allocation for references: op->ob_item = (PyObject **) PyMem_MALLOC(nbytes);

memory_profiler: NameError: name ‘profile’ is not defined

I was experimenting with Python’s memory_profiler module and suddenly started getting following error:

$ python -m memory_profiler
Traceback (most recent call last):
  File “/usr/lib/python2.7/”, line 162, in _run_module_as_main
    “__main__”, fname, loader, pkg_name)
  File “/usr/lib/python2.7/”, line 72, in _run_code
    exec code in run_globals
  File “/usr/local/lib/python2.7/dist-packages/”, line 14, in <module>
    import subprocess
  File “/usr/lib/python2.7/”, line 432, in <module>
    import pickle
  File “”, line 8, in <module>
NameError: name ‘profile’ is not defined
Error in sys.excepthook:
Traceback (most recent call last):
  File “/usr/lib/python2.7/dist-packages/”, line 66, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File “/usr/lib/python2.7/dist-packages/apport/”, line 1, in <module>
    from import Report
  File “/usr/lib/python2.7/dist-packages/apport/”, line 12, in <module>
    import subprocess, tempfile, os.path, urllib, re, pwd, grp, os
  File “/usr/lib/python2.7/”, line 432, in <module>
    import pickle
  File “”, line 8, in <module>
NameError: name ‘profile’ is not defined

Original exception was:
Traceback (most recent call last):
  File “/usr/lib/python2.7/”, line 162, in _run_module_as_main
    “__main__”, fname, loader, pkg_name)
  File “/usr/lib/python2.7/”, line 72, in _run_code
    exec code in run_globals
  File “/usr/local/lib/python2.7/dist-packages/”, line 14, in <module>
    import subprocess
  File “/usr/lib/python2.7/”, line 432, in <module>
    import pickle
  File “”, line 8, in <module>
NameError: name ‘profile’ is not defined

I checked the installation and validated the run-time environment, but everything looked fine. Then I took a closer look at the error traces.

  File “”, line 8, in <module>

NameError: name ‘profile’ is not defined

There was no pickling in my code! Why is pickle module in the error log?

Then I checked the current directory and saw a file with name “”, Bingo! I had created this file for another test and mistakenly named as “” that coincides with standard Python pickle module.

So, fix was to rename this file to a modest name and delete its “pyc” file.

Python Internals: Understanding Python data model (I)

Python sees everything as object. Every object has an identity, value and a type. Object identity and type are invariable.

Python Data Model
Python Data Model

Object type determines if value is mutable or otherwise.

Lifetime of object is based on reference count mechanism.

Object Container

Object containers are: list, dictionary, tuple, set. Containers keep reference (object identity) to objects. Mutability of container is based on references. The value of the referred object could be mutable.

Let’s see how they behave:

from sys import getrefcount

a = 1
b = 1

list1 = [] 
list2 = [] 

t1 = (a, b)
t2 = (a, b)

# a and b share reference to same object ID 
print "a=", id(a)
print "b=", id(b)

# Constant 1 has ref count of +2 (a and b)
print "getrefcount(1)=", getrefcount(1)

# Constant 1 has ref count of +3 now (a,b and c)
c = 1
print "getrefcount(1)=", getrefcount(1)

# Decrement the object ref count
del c
print "getrefcount(1)=", getrefcount(1)

# Default ref count of an unused new integer object is 3. But, why?
print "getrefcount(999999)=", getrefcount(999999)

print ""

# Mutable objects like list do not refer to same object ID, even
# though value of objects are same!
print "list1=", id(list1)
print "list2=", id(list2)
print "getrefcount(list1)=", getrefcount(list1)
print "getrefcount(list2)=", getrefcount(list2)

print ""

print "t1=", id(t1)
print "t2=" ,id(t2)

# Containers have default ref count of 2
print "getrefcount(t1)=", getrefcount(t1)
print "getrefcount(t2)=", getrefcount(t2)

# Changing contained object values do not modiy immutable container
a = 3
b = 10

print "t1=", id(t1)
print "t2=" ,id(t2)

# String literals are constant and are referred to like numbers
s1 = "hello"
s2 = "hello"

# s1 and s2 refer to same object
print "s1=", id(s1)
print "s2=" ,id(s2)
print "getrefcount(hello)=", getrefcount("hello")

# The first use of a literal uses 3 ref count
print "getrefcount(hello!!)=", getrefcount("hello!!")

s3 = "hello!"
print "getrefcount(hello!)=", getrefcount("hello!")

Parenthesize an expression in Python

    def pref(op):
        print "called with op", op
        ret = -1
        if op == '+':
            print "matched +"
            ret = 1
        if op == '-':
            print "matched -"
            ret = 2
        if op == '*':
            print "matched *"
            ret = 3
        if op == '/':
            print "matched /"
            ret = 4
        return ret
    def evaluate(expr, operand_stack, operator_stack):
        print "**In evaluate**"
        print operator_stack
        print operand_stack
        expr1 = operand_stack.pop()
        expr2 = operand_stack.pop()
        op    = operator_stack.pop()
        # Parenthesize the expression
        expr = "(" + expr2 + op + expr1 + ")"
        print "expr1", expr1
        print "expr2", expr2
        print "expr", expr
        # Push the result back on the stack
        print operator_stack
        print operand_stack
        print "**Out evaluate**"
        return expr
    def looper(str, expr, operator_stack, operand_stack):
        l = 0
        cnt = len(str)
        # Loop over the input string
        while  l < cnt:
            if str[l] in ('+', '-', '*', '/'):
                print "operator found: op, index", str[l], l
                print operator_stack, len(operator_stack)
                x = len(operator_stack) - 1
                if x > 0:
                    print "Comparing:", operator_stack[x], str[l]
                    # If op on stack has higher preference than the op in question
                    if (pref(operator_stack[x]) > pref(str[l])):
                        expr = evaluate(expr, operand_stack, operator_stack)
                # Add the operand to operand stack
            l += 1
        print operator_stack
        print operand_stack
        print "Take care of last elements"
        op_cnt = len(operator_stack)
        while op_cnt:
            expr = evaluate(expr, operand_stack, operator_stack)
            op_cnt -= 1
        print operator_stack
        print operand_stack
    if __name__ == '__main__':
        str = "a+c*d-e/w*x+a-s"
        cnt = len(str)
        operand_stack  = []
        operator_stack  = []
        expr = ""
        looper(str, expr, operator_stack, operand_stack)
        print "Output=>", operand_stack[0]

Pylucene- Part II: Searching index

In the last post, we discussed how to create an index over a directory. Now, let’s search our index.

from lucene import \
            QueryParser, IndexSearcher, IndexReader, StandardAnalyzer, \
        TermPositionVector, SimpleFSDirectory, File, MoreLikeThis, \
            VERSION, initVM, Version
import sys

FIELD_CONTENTS = "contents"
FIELD_PATH = "path"

QUERY_STRING = "lucene and restored"

STORE_DIR = "/home/kanaujia/lucene_index"

if __name__ == '__main__':
    print 'lucene', VERSION

    # Get handle to index directory
    directory = SimpleFSDirectory(File(STORE_DIR))

    # Creates a searcher searching the provided index.
    ireader  =, True)

    # Implements search over a single IndexReader.
    # Use a single instance and use it across queries
    # to improve performance.
    searcher = IndexSearcher(ireader)

    # Get the analyzer
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

    # Constructs a query parser. We specify what field to search into.
    queryParser = QueryParser(Version.LUCENE_CURRENT,
                              FIELD_CONTENTS, analyzer)

    # Create the query
    query = queryParser.parse(QUERY_STRING)

    # Run the query and get top 50 results
    topDocs =, 50)

    # Get top hits
    scoreDocs = topDocs.scoreDocs
    print "%s total matching documents." % len(scoreDocs)

    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
        print doc.get(FIELD_PATH)

Pylucene- Part I: Creating index

How to write a simple index generator with pylucene

  1 import lucene
  3 if __name__ == '__main__':
  4     INDEX_DIR = "/home/kanaujia/lucene_index"
  6     # Initialize lucene and JVM
  7     lucene.initVM()
  9     print "lucene version is:", lucene.VERSION
 11     # Get the analyzer
 12     analyzer = lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT)
 14     # Get index storage
 15     store = lucene.SimpleFSDirectory(lucene.File(INDEX_DIR))
 17     # Get index writer
 18     writer = lucene.IndexWriter(store, analyzer, True, lucene.IndexWriter.MaxFieldLength.LIMITED)
 20     try:
 21         # create a document that would we added to the index
 22         doc = lucene.Document()
 24         # Add a field to this document
 25         field = lucene.Field("title", "India", lucene.Field.Store.YES, lucene.Field.Index.ANALYZED)
 27         # Add this field to the document
 28         doc.add(field)
 30         # Add the document to the index
 31         writer.addDocument(doc)
 32     except Exception, e:
 33         print "Failed in indexDocs:", e


  • An index is created with an IndexWriter
  • An index is a collection of documents
  • A document represents a file, or data in terms of fields
  • A field is a tuple of field name, data

Let’s understand the above program:

  1. We provide a location of index as INDEX_DIR = “/home/kanaujia/lucene_index”
  2. Start and initialize the Java VM
  3. Get the lucene’s standard analyzer for fields
  4. This example keeps the index on disk, so the SimpleFSDirectory class is used to get a handle to this index.
  5. IndexWriter creates and maintains an index. The constructor is as follows:

IndexWriter(Directory d, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl)

  • Directory is handle to index location
  • ‘create’ tells if a new index object is created for every user request
# Get index writer
    writer = lucene.IndexWriter(store, analyzer, True, lucene.IndexWriter.MaxFieldLength.LIMITED)
  • Create a document that would become part in the index
  • Create a field, add it to a document.
  • Add the document to the index.
  • Run the program
kanaujia@ubuntu:~/work/Py$ python
lucene version is: 3.6.1
kanaujia@ubuntu:~/work/Py$ ls /home/kanaujia/lucene_index/
_0.fdt  _0.fdx  write.lock

Python goof ups of a C/C++ programmer

Python is a new age language when compared to good old C. Writing code in Python needs a subtle shift from C mindset. Python offers so many things ready-made that make you feel that you wrote very less code. Anyway, I goofed up while using a very common feature of C: pass by reference.

Python too offers it, but with a caveat: you should know mutable and immutable data objects. When you pass a mutable object like list, dictionary; they can be changed. But, if you pass an immutable data object like string, tuple; they are kept unchanged in caller’s scope. This is very similar to passing “references” (C++) or constant pointers.

  1 class Node:
  2     def __init__(self, value, data=[]):
  3         self.char = value
  4         # We intend to keep a list of values for a key
  5 = data
  6         # XXX: List of Node
  7         self.children = []
 10 n = Node('a')
 11 m = Node('b')
 13 ("north")
 14 ("south")
 16 print
 17 print

The output of the above code is:

ubuntu@ubuntu:~/Desktop/ToKeep/cprogs/pycon$ python
['north', 'south']
['north', 'south']

I don’t quite understand why Python keeps a single instance of default argument(list). Nonetheless, it is interesting.

This thread on Stackoverflow is very informational.

  • Omitting “this” pointer

In C++, “this” which is pointer to the object is passed implicitly. Python has similar keyword “self” for this. But, contrary to an implicit understanding, “self” should be an argument to a function, else you would see an error:

NameError: global name ‘self’ is not defined

  1 class pydbg:
  2     def __init__(self, modname=None):
  3         self.level = 0
  4         self.modname = modname
  6     def DLOG(self, lvl=None, fmt=None, *args):
  7         self.level = lvl
  8         print '%s' %self.modname
  9         print fmt
 10         for arg in args:
 11             print arg

So, always add the “self” argument while adding a function in your class.

%d bloggers like this: