Thoughts on text mining

Written by Jaganadh Gopinadhan

I was reading the book ‘Practical Text Mining with Perl’ by Roger Bilisoly, published by Wiley. It is a nice book for beginners to learn text mining through the Perl programming language. The text gives many practical examples. Most of them are familiar to me, because I have been using Perl and Python for many years. Suddenly I thought, why can’t I work out the examples in Python too!! Practical text mining with Python. I am not going to write a textbook :-) . I am just working out the examples in Perl and Python.

On page 22 of the book the author gives Perl code for extracting words from text. The resulting output contains no punctuation marks. I am reproducing the code here (not the exact code from the textbook).

#!/usr/bin/env perl

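use strict;
use warnings;

# Open the file named on the command line.
my $f = $ARGV[0];
open( my $fh, '<', $f ) or die("File not found or can't read\n");
while (<$fh>) {
    chomp;
    my @words = split(/\s+/);
    foreach my $word (@words) {
        # Print the first run of word characters; tokens without any are skipped.
        if ( $word =~ /(\w+)/ ) {
            print "$1 \n";
        }
    }
}
close($fh);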

The same thing can be implemented in Python in two different ways.

#!/usr/bin/env python

import sys
import re

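# Read the whole file named on the command line.
with open(sys.argv[1], 'r') as f:
    txt = f.read()
words = txt.split()

for word in words:
    # Take the first run of word characters in each token.
    cword = re.search(r'\w+', word)
    print(cword.group())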

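This code has a problem. If the text contains a token with no letters, digits or underscores, such as a stand-alone “--”, re.search returns None and the program throws the error “AttributeError: 'NoneType' object has no attribute 'group'”. A word like “gr8one” is matched without trouble, because \w covers digits as well as letters. The Perl version avoids the error because it prints only when the match succeeds.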
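One way to avoid the AttributeError is to check the result of re.search before calling group(), much as the Perl version prints only inside its if. Below is a minimal Python 3 sketch of that idea, not code from the book. The function name first_words and the sample sentence are made up for illustration.

```python
import re

# Pre-compiled pattern: one or more letters, digits or underscores.
WORD = re.compile(r"\w+")


def first_words(text):
    """Return the first run of word characters from each whitespace-separated token."""
    words = []
    for token in text.split():
        match = WORD.search(token)
        # search() returns None when a token has no word characters
        # (a stand-alone "--", for example), so skip it instead of
        # calling group() on None.
        if match is not None:
            words.append(match.group())
    return words


# Hypothetical sample text, only for demonstration.
for word in first_words("Hello, world -- this is gr8one !"):
    print(word)
```

Like the Perl code, this keeps only the first match in each token, so “don't” comes out as “don”.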

So the second implementation is given below.

#!/usr/bin/env python

import sys
import string

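# Read the whole file named on the command line.
with open(sys.argv[1], 'r') as f:
    txt = f.read()

# Replace every ASCII punctuation character with a space.
for punct in string.punctuation:
    txt = txt.replace(punct, " ")

words = txt.split()

for word in words:
    print(word)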
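The loop over string.punctuation scans the text once per punctuation character. As a possible variation (my assumption, not something from the book), the same replacement can be done in a single pass with a translation table. This is a Python 3 sketch. The names PUNCT_TO_SPACE and split_words are made up for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_words(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(PUNCT_TO_SPACE).split()


# Hypothetical sample text, only for demonstration.
for word in split_words("Hello, world -- don't stop!"):
    print(word)
```

Note that this approach, like the replace loop, does not give exactly the same output as the regular expression version. Here “don't” becomes the two words “don” and “t”, and “snake_case” is split at the underscore, because both the apostrophe and the underscore are in string.punctuation.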

Hey, if you have any suggestions on the programs, please leave a comment.

Happy Hacking !!!!!!!

Migrated from my old blog jaganadhg.freeflux.net

Written on September 18, 2009
The Opinions Expressed In This Post Are My Own And Not Necessarily Those Of My Employer.
[ Python  Perl  Text Mining  Natural Language Processing  ]