PyLucene in Action - Part I
Written by Jaganadh GopinadhanPyLucene is a Python wrapper aroung the Java Lucene. The goal of this tool is use Lucene’s text indexing and searching capabilities from Python. It is compatible with the latest version of Java Lucene. PyLucene is embeds a Java VM with Lucene into Python process. More details on PyLucene can be found at http://lucene.apache.org/pylucene/.
In this blog post I am going to demonstrate how to build a search index and query a search index with PyLucene. You can see the installation instruction for PyLucene im my previous blog post …. 1) Creating an index with Pylucene
I am using the below given code to create an index with the PyLucene
#!/usr/bin/env python
import os,sys,glob
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field, \
StandardAnalyzer, IndexWriter, Version
"""
Example of Indexing with PyLucene 3.0
"""
def luceneIndexer(docdir,indir):
"""
Index Documents from a dirrcory
"""
lucene.initVM()
DIRTOINDEX = docdir
INDEXIDR = indir
indexdir = SimpleFSDirectory(File(INDEXIDR))
analyzer = StandardAnalyzer(Version.LUCENE_30)
index_writer = IndexWriter(indexdir,analyzer,True,\
IndexWriter.MaxFieldLength(512))
for tfile in glob.glob(os.path.join(DIRTOINDEX,'*.txt')):
print "Indexing: ", tfile
document = Document()
content = open(tfile,'r').read()
document.add(Field("text",content,Field.Store.YES,\
Field.Index.ANALYZED))
index_writer.addDocument(document)
print "Done: ", tfile
index_writer.optimize()
print index_writer.numDocs()
index_writer.close()
You have to supply two parameter to the luceneIndexer(). a) A path to the directory to where the documents for indexing is stored b) A path to the directory where the index can be saved
2) Querying an index with Pylucene
The below given code is for querying an index with PyLucene
#!/usr/bin/env python
import sys
import lucene
from lucene import SimpleFSDirectory, System, File, Document, Field,\
StandardAnalyzer, IndexSearcher, Version, QueryParser
"""
PyLucene retriver simple example
"""
INDEXDIR = "./MyIndex"
def luceneRetriver(query):
lucene.initVM()
indir = SimpleFSDirectory(File(INDEXDIR))
lucene_analyzer = StandardAnalyzer(Version.LUCENE_30)
lucene_searcher = IndexSearcher(indir)
my_query = QueryParser(Version.LUCENE_30,"text",\
lucene_analyzer).parse(query)
MAX = 1000
total_hits = lucene_searcher.search(my_query,MAX)
print "Hits: ",total_hits.totalHits
for hit in total_hits.scoreDocs:
print "Hit Score: ",hit.score, "Hit Doc:",hit.doc, "Hit String:",hit.toString()
doc = lucene_searcher.doc(hit.doc)
print doc.get("text").encode("utf-8")
luceneRetriver("really cool restaurant")
In the code I have manually specified the index dir “"”INDEXDIR = “./MyIndex” “””. Instead of this one can receive the index directory as a command line parameter (sys.argv) too.
When using the function luceneRetriver() you have to give a query as parameter.
The source code is available in bitbucket https://bitbucket.org/jaganadhg/pyluceneia
Migrated from my old blog jaganadhg.freeflux.net
Python
Natural Language Processing
]