Book Review- Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins

Python Text Processing with NLTK 2.0 Cookbook by Jacob Perkins is one of the latest books published by Packt in the Open Source series. The book is meant for people who started learning and practicing the Natural Language Tool Kit(NLTK).NLTK is an Open Source Python library to learn practice and implement Natural Language Processing techniques. The software is licensed under the Apache Software license. It is one of the most widely recommended tool kit for beginners in NLP to make their hands dirty. The toolkit is part of syllabus for many institutions around the globe where Natural Language Processing/ Computational Linguistics courses are offered. Perkins book work is the second book published on the toolkit NLTK. The first book is written by core developers of NLTK; Steven Bird, Ewan Klein, and Edward Loper, published by O’rielly. Steven et.all’s book is a comprehensive introduction to the toolkit with basic Python lessons. People who has gone through the book may definitely like the new book by Perkin. The book is must have desktop reference for students, professionals, and faculty members interested in the area of NLP, Computational Linguistics and NLTK. Perkins handles the topic in an elegant way. Most of the people who searched for some NLTK tips might have gone through the author’s blog. He maintains same simplicity and explanation style and hands-om approach throughout the book; which makes the reader to digest the topic with much easiness. The book is a collection of practical and working recipes related to NLTK.

The first chapter of the book “Tokenizing Text and WordNet Basics” deals with tokenizing text in to words sentences and paragraphs. The chapter also deals with tips and tricks with WordNet module in NLTK. Perkin discusses about Word Sense Disambiguation(WSD) techniques in this chapter. The missile part in WordNet is the use of wordnet ‘ic’ function. Tips for extracting collocations from a corpus is also included in the first chapter. The chapter “Replacing and Correcting Words”(IInd chapter) discusses stemming , lemmatization and spelling correction. He introduces another Python module named Python-Enchant for discussing about the spell checking technique. The chapter also discusses techniques like replaces negation with antonyms and replacement of repeating characters. The third chapter deals with Corpora. This chapter mainly discusses how to load user generated corpora in to NLTK with corpus readers implemented in NTLK. The most attracting part of this chapter is discussion about MonngoDB blackened for corpus reader in NLTK. MongoDB is a text based DB, which belongs to the NoSQL family. This part will be very useful for students in NLP and working professionals. The fourth chapter deals with POS Tagging techniques. It discusses mainly about training different POS taggers and using it. It is also quiet useful for people who would like to extend the functionality in NLTK for their projects and people who is interested to extend POS taggers in language other than English. Some part of this chapter content was published in the authors blog before one year. Chapter five of the book deals with Chunking and Chinking techniques with NLTK. Named Entity Identification and Extraction techniques are also discussed in this chapter. It gaves good insight to train NLTK chunking module for custom chunking tasks. With the help of this chapter I was able to create a small named entity extraction script with some Indian names. The sixth chapter is named as “Transforming Chunks and Trees” which deals with verb form correction, plural to singular correction, word filtering, and playing trees structures. Many time I saw that people used to raise question about handling tree data in NLTK. I think people can refer this chapter for getting good insight to play with NLTK parse tree data. The seventh chapter deals with most wanted topic of the time “Text Classification”. Some part of this chapter appeared as blog post in Perkin’s blog. There was many requests in freelancing web sites for text classification with NLTK. I found that some of them were not bided too. The chapter discusses the task of Text Classification in details with all the classification implementations available in NLTK. Training NLTK classifier is discussed very clearly. Apart form the classifier training, classification the chapter discusses classifier evaluation and tuning too. The eight chapter a revolutionary one which deals with Distributed data processing and handling large scale data with NLTK. I was not able to fully play with the total code in this chapter (Yes I worked out the code in other chapters and it was quite exciting. It contributed to my professional life too) . This chapter will be really helpful for industry people who is looking for to adopt NLTK in to NLP projects. Some basic insights of the contents in this chapter was also published in Perkin’s blog. After Nithin Madanini’s talk in US Python Conference on corpus processing with Dumbo and NLTK I think this is the only existing resource for practical large scale data processing with NLTK. The ninth and last chapter is about Parsing Scientific data with Python. This chapter deals with some Python modules rather than the NLTK tool. It discusses about URL extraction, timezone look-up, character conversion etc.. This chapter is good for people who plays with web data processing like harvesting. There is an appendix for the book which contains “Penn Treebank”. It give list of all tags with its frequency in treebank corpus.

For the last three four years I am using NLTK to teach and develop prototypes of NLP applications. I was very much when I went through each of the recipes in this book. The author provides UML diagrams for the modules in NLTK which helps the reader to get good insight on the functionality of each module. This will be a good book not only for students and practitioners but also for people would like to contribute to NLTK project too. Also this book will help students in NLP and Computational Linguistics to do their projects with NLTK and Python. I give 9 out of 10 for the book. Natural Language Processing students, teachers, professional hurry and bag a copy of this book.

Thanks to Packt publihsres for the review copy of the book.

Migrated from my old blog jaganadhg.freeflux.net

Written on December 14, 2010
The Opinions Expressed In This Post Are My Own And Not Necessarily Those Of My Employer.
[ life  ]