Pylucene- Part III: Highlighting in Search

The search result can be customized to highlight the phrases that contain the requested keyword. The following code uses the Highlighter class from Pylucene and emits the result as HTML-formatted text.

from lucene import \
    QueryParser, IndexSearcher, IndexReader, StandardAnalyzer, \
    TermPositionVector, SimpleFSDirectory, File, SimpleSpanFragmenter, Highlighter, \
    QueryScorer, StringReader, SimpleHTMLFormatter, \
    VERSION, initVM, Version
import sys

FIELD_CONTENTS = "contents"
FIELD_PATH = "path"
#QUERY_STRING = "lucene and restored"
QUERY_STRING = sys.argv[1]
STORE_DIR = "/home/kanaujia/lucene_index"

if __name__ == '__main__':
    initVM()
    print 'lucene', VERSION

    # Get handle to index directory
    directory = SimpleFSDirectory(File(STORE_DIR))

    # Open a read-only reader on the index.
    ireader = IndexReader.open(directory, True)

    # Implements search over a single IndexReader.
    # Use a single instance and use it across queries
    # to improve performance.
    searcher = IndexSearcher(ireader)

    # Get the analyzer
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

    # Constructs a query parser.
    queryParser = QueryParser(Version.LUCENE_CURRENT, FIELD_CONTENTS, analyzer)

    # Create a query
    query = queryParser.parse(QUERY_STRING)

    topDocs = searcher.search(query, 50)

    # Get top hits
    scoreDocs = topDocs.scoreDocs
    print "%s total matching documents." % len(scoreDocs)

    HighlightFormatter = SimpleHTMLFormatter()
    query_score = QueryScorer(query)

    highlighter = Highlighter(HighlightFormatter, query_score)

    # Set the fragment size. We break the text into fragments of 64 characters
    fragmenter = SimpleSpanFragmenter(query_score, 64)
    highlighter.setTextFragmenter(fragmenter)

    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
        text = doc.get(FIELD_CONTENTS)
        ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
        print doc.get(FIELD_PATH)
        print highlighter.getBestFragments(ts, text, 3, "...")
        print ""

This code extends the search code discussed in Part II.

We create an HTML formatter with SimpleHTMLFormatter, which wraps each matching term in HTML tags. We then create a QueryScorer, which scores text fragments by the query terms they contain so that the highlighter can pick the best ones.

HighlightFormatter = SimpleHTMLFormatter()
query_score = QueryScorer(query)

highlighter = Highlighter(HighlightFormatter, query_score)

We break the text content into fragments of about 64 characters.

fragmenter = SimpleSpanFragmenter(query_score, 64)
highlighter.setTextFragmenter(fragmenter)

For each hit, we fetch the stored document, turn its contents back into a token stream, and print its path.

for scoreDoc in scoreDocs:
    doc = searcher.doc(scoreDoc.doc)
    text = doc.get(FIELD_CONTENTS)
    ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
    print doc.get(FIELD_PATH)

Finally, still inside the loop, we ask the highlighter for at most 3 of the best fragments per document, joined by "...".

    print highlighter.getBestFragments(ts, text, 3, "...")
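
To make the roles of the formatter, the fragmenter and the fragment scoring more concrete, here is a small pure-Python sketch of the same idea. It does not use Pylucene and is not how Lucene implements highlighting: the function name highlight_fragments, the fixed-size cutting and the scoring by hit count are simplifying assumptions made for illustration. The <B>...</B> tags mirror the default tags of SimpleHTMLFormatter.

```python
import re


def highlight_fragments(text, term, fragment_size=64, max_fragments=3,
                        separator="..."):
    """Toy stand-in for the highlighting step (illustration only)."""
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)

    # Cut the text into fixed-size fragments. This naive cut can split a
    # word at a fragment boundary, which a real fragmenter avoids.
    fragments = [text[i:i + fragment_size]
                 for i in range(0, len(text), fragment_size)]

    # Score each fragment by its number of hits; drop fragments without any.
    scored = [(len(pattern.findall(frag)), idx, frag)
              for idx, frag in enumerate(fragments)]
    scored = [item for item in scored if item[0] > 0]

    # Keep the highest-scoring fragments; earlier fragments win ties.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_fragments]

    # Wrap every hit in <B> tags and join the fragments with the separator.
    marked = [pattern.sub(lambda m: "<B>" + m.group(0) + "</B>", frag)
              for _, _, frag in best]
    return separator.join(marked)


if __name__ == "__main__":
    print(highlight_fragments("hi hello", "hello"))
```

Run on the text "hi hello" with the term "hello", the sketch returns hi <B>hello</B>. A text with no hit returns an empty string.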

Results

kanaujia@ubuntu:~/work/Py/pylucy2/pylucy$ python searcher_highlight.py hello
lucene 3.6.1
50 total matching documents.
/home/kanaujia/Dropbox/PyConIndia/fsMgr/root/hello
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v4/root/hello
hi hello

/home/kanaujia/Dropbox/.dropbox.cache/2012-09-27/hello (deleted 505bda23-9-e8756a51)
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v1/root/hello.html

Hello htmls

{% module Hello() %}

Pylucene- Part I: Creating index

How to write a simple index generator with pylucene

import lucene

if __name__ == '__main__':
    INDEX_DIR = "/home/kanaujia/lucene_index"

    # Initialize lucene and JVM
    lucene.initVM()

    print "lucene version is:", lucene.VERSION

    # Get the analyzer
    analyzer = lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT)

    # Get index storage
    store = lucene.SimpleFSDirectory(lucene.File(INDEX_DIR))

    # Get index writer
    writer = lucene.IndexWriter(store, analyzer, True, lucene.IndexWriter.MaxFieldLength.LIMITED)

    try:
        # Create a document that will be added to the index
        doc = lucene.Document()

        # Create a field for this document
        field = lucene.Field("title", "India", lucene.Field.Store.YES, lucene.Field.Index.ANALYZED)

        # Add this field to the document
        doc.add(field)

        # Add the document to the index
        writer.addDocument(doc)
    except Exception, e:
        print "Failed in indexDocs:", e

Fundamentals

  • An index is created with an IndexWriter
  • An index is a collection of documents
  • A document represents a file, or data in terms of fields
  • A field is a tuple of field name, data
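
The relationship between these terms can be sketched in plain Python. This is only a mental model, not how Lucene stores data: the names ToyIndex, add_document and search are hypothetical, and the lowercase-and-split tokenizer merely stands in for a real analyzer.

```python
from collections import defaultdict


class ToyIndex(object):
    """An index is a collection of documents; a document is a dict of fields."""

    def __init__(self):
        self.documents = []                 # doc id -> {field name: data}
        self.postings = defaultdict(set)    # (field name, token) -> doc ids

    def add_document(self, fields):
        """Store the document and index the lowercased tokens of each field."""
        doc_id = len(self.documents)
        self.documents.append(dict(fields))
        for name, data in fields.items():
            for token in data.lower().split():
                self.postings[(name, token)].add(doc_id)
        return doc_id

    def search(self, field_name, word):
        """Return the documents whose given field contains the word."""
        doc_ids = sorted(self.postings.get((field_name, word.lower()), ()))
        return [self.documents[i] for i in doc_ids]


if __name__ == "__main__":
    index = ToyIndex()
    index.add_document({"title": "India"})
    print(index.search("title", "india"))
```

The usage at the bottom mirrors the program above: one document with a single "title" field holding "India", which a lowercase search for "india" then finds.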

Let’s understand the above program:

  1. We provide the location of the index as INDEX_DIR = "/home/kanaujia/lucene_index"
  2. Start and initialize the Java VM
  3. Get Lucene’s standard analyzer for fields
  4. This example keeps the index on disk, so the SimpleFSDirectory class is used to get a handle to this index.
  5. IndexWriter creates and maintains an index. The constructor used here is:

IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)

  • Directory is the handle to the index location
  • ‘create’ is true to create a new index or overwrite an existing one, and false to append to an existing index

# Get index writer
writer = lucene.IndexWriter(store, analyzer, True, lucene.IndexWriter.MaxFieldLength.LIMITED)

  • Create a document that will become part of the index
  • Create a field, add it to a document.
  • Add the document to the index.
  • Run the program
kanaujia@ubuntu:~/work/Py$ python example1.py
lucene version is: 3.6.1
kanaujia@ubuntu:~/work/Py$ ls /home/kanaujia/lucene_index/
_0.fdt  _0.fdx  write.lock

Pylucene: Installation on Ubuntu

If you want to install pylucene automatically, use the Synaptic package manager or apt-get. This gives you Pylucene 2.3, which is old. To get pylucene 3.6 or higher, follow the manual installation discussed later in this post.

Automatic installation (pylucene 2.3)

  • Install everything mentioned here
  • sudo apt-get install pylucene
  • sudo apt-get install python-dev
  • I borrowed a test program from here. The immediate error you would see if you try to run a pylucene based program is:
kanaujia@ubuntu:~/work/Py$ python myluce.py 
Traceback (most recent call last):
  File "myluce.py", line 11, in <module>
    import lucene
  File "/usr/lib/python2.7/dist-packages/lucene/__init__.py", line 2, in <module>
    import os, _lucene
ImportError: libjvm.so: cannot open shared object file: No such file or directory
  • Run the following:
    $ ldconfig -p | grep libjvm

    If you find nothing, see the next point.

  • Make sure you have Java JDK/ JRE available on your machine.
    root@ubuntu:~# find / -type f -name libjvm.so
    /usr/lib/jvm/java-6-openjdk-i386/jre/lib/i386/cacao/libjvm.so
    /usr/lib/jvm/java-6-openjdk-i386/jre/lib/i386/server/libjvm.so
    /usr/lib/jvm/java-6-openjdk-i386/jre/lib/i386/client/libjvm.so
    /usr/lib/jvm/java-6-openjdk-i386/jre/lib/i386/jamvm/libjvm.so
  • Export the path to your library.
    root@ubuntu:~/work/Py$ export LD_LIBRARY_PATH=/usr/lib/jvm/java-6-openjdk-i386/jre/lib/i386/server:$LD_LIBRARY_PATH
  • And, you are done 🙂
    root@ubuntu:~/work/Py$ python myluce.py
    Usage:
    myluce.py <field_name> <index_url>

Installing Pylucene 3.0 or higher version manually

  1. Install everything mentioned here
  2. wget https://bootstrap.pypa.io/ez_setup.py -O - | python
  3. sudo apt-get install python-dev
  4. Get the package from http://lucene.apache.org/pylucene
  5. Find out the JVM path on your machine
    $ sudo update-java-alternatives -l
    java-1.6.0-openjdk-i386 1061 /usr/lib/jvm/java-1.6.0-openjdk-i386
  6. Now go to your downloaded pylucene package and extract it (unzip and untar).

Note: I am following instructions at: http://lucene.apache.org/pylucene/install.html

  • cd ./pylucene-3.6.1-2/
  • pushd jcc
  • <edit setup.py to match your environment>

Open setup.py and search for “JDK”. Update the JDK path for your platform to the one you found in step 5.

    JDK = {
        'darwin': JAVAHOME,
        'ipod': '/usr/include/gcc',
        #'linux2': '/usr/lib/jvm/java-6-openjdk',
        'linux2': '/usr/lib/jvm/java-1.6.0-openjdk-i386',
        'sunos5': '/usr/jdk/instances/jdk1.6.0',
        'win32': JAVAHOME,
        'mingw32': JAVAHOME,
        'freebsd7': '/usr/local/diablo-jdk1.6.0'
    }
  • python setup.py build
  • sudo python setup.py install
  • popd
  • <edit Makefile to match your environment>
    $ vi Makefile

    Uncomment the following:

    # Mac OS X 10.6 (64-bit Python 2.6, Java 1.6)
    PREFIX_PYTHON=/usr
    ANT=ant
    PYTHON=$(PREFIX_PYTHON)/bin/python
    JCC=$(PYTHON) -m jcc.__main__ --shared --arch x86_64
    NUM_FILES=4
  • make

    This ‘make’ is very slow. On my dual-core laptop, it took about 15 minutes to complete.

  • sudo make install
  • make test (look for failures)
  • Done… sigh (Thanks a bunch to this blog)
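
Once the install finishes, a quick way to confirm that Python can see the new package is to try importing it. This is a minimal sketch, not part of the official instructions; the function name pylucene_version is made up for this example, and it deliberately does not start the JVM.

```python
def pylucene_version():
    """Return lucene.VERSION if Pylucene can be imported, else None."""
    try:
        import lucene
    except ImportError:
        # This is also the error raised when libjvm.so cannot be loaded.
        return None
    return getattr(lucene, "VERSION", "unknown")


if __name__ == "__main__":
    version = pylucene_version()
    if version is None:
        print("Pylucene is not importable; check the build and LD_LIBRARY_PATH")
    else:
        print("Pylucene is installed, lucene version: %s" % version)
```

If the import fails with the libjvm.so error shown earlier in this post, export LD_LIBRARY_PATH as described there and try again.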