PyLucene in Action - Part I

Written by Jaganadh Gopinadhan

PyLucene is a Python wrapper aroung the Java Lucene. The goal of this tool is use Lucene’s text indexing and searching capabilities from Python. It is compatible with the latest version of Java Lucene. PyLucene is embeds a Java VM with Lucene into Python process. More details on PyLucene can be found at http://lucene.apache.org/pylucene/.

In this blog post I am going to demonstrate how to build a search index and query a search index with PyLucene. You can see the installation instruction for PyLucene im my previous blog post …. 1) Creating an index with Pylucene

I am using the below given code to create an index with the PyLucene

#!/usr/bin/env python
import os,sys,glob
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
StandardAnalyzer, IndexWriter, Version
"""
Example of Indexing with PyLucene 3.0
"""
def luceneIndexer(docdir,indir):

	"""

	Index Documents from a dirrcory

	"""

	lucene.initVM()

	DIRTOINDEX = docdir

	INDEXIDR = indir

	indexdir = SimpleFSDirectory(File(INDEXIDR))

	analyzer = StandardAnalyzer(Version.LUCENE_30)

	index_writer = IndexWriter(indexdir,analyzer,True,\

	IndexWriter.MaxFieldLength(512))

	for tfile in glob.glob(os.path.join(DIRTOINDEX,'*.txt')):

		print "Indexing: ", tfile

		document = Document()

		content = open(tfile,'r').read()

		document.add(Field("text",content,Field.Store.YES,\

		Field.Index.ANALYZED))

		index_writer.addDocument(document)

		print "Done: ", tfile

	index_writer.optimize()

	print index_writer.numDocs()

	index_writer.close()

You have to supply two parameter to the luceneIndexer(). a) A path to the directory to where the documents for indexing is stored b) A path to the directory where the index can be saved

2) Querying an index with Pylucene

The below given code is for querying an index with PyLucene

#!/usr/bin/env python
import sys
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field,\
StandardAnalyzer, IndexSearcher, Version, QueryParser
"""
PyLucene retriver simple example
"""
INDEXDIR = "./MyIndex"
def luceneRetriver(query):

	lucene.initVM()

	indir = SimpleFSDirectory(File(INDEXDIR))

	lucene_analyzer = StandardAnalyzer(Version.LUCENE_30)

	lucene_searcher = IndexSearcher(indir)

	my_query = QueryParser(Version.LUCENE_30,"text",\

	lucene_analyzer).parse(query)

	MAX = 1000

	total_hits = lucene_searcher.search(my_query,MAX)

	print "Hits: ",total_hits.totalHits

	for hit in total_hits.scoreDocs:

		print "Hit Score: ",hit.score, "Hit Doc:",hit.doc, "Hit String:",hit.toString()

		doc = lucene_searcher.doc(hit.doc)

		print doc.get("text").encode("utf-8")

luceneRetriver("really cool restaurant")

In the code I have manually specified the index dir “"”INDEXDIR = “./MyIndex” “””. Instead of this one can receive the index directory as a command line parameter (sys.argv) too.

When using the function luceneRetriver() you have to give a query as parameter.

The source code is available in bitbucket https://bitbucket.org/jaganadhg/pyluceneia

Migrated from my old blog jaganadhg.freeflux.net

Written on September 1, 2010
The Opinions Expressed In This Post Are My Own And Not Necessarily Those Of My Employer.
[ Python  Natural Language Processing  ]