Finding Bigrams with t-score in Perl

In this post I will tell how we can find bigrams with tscore from an English text. If you are interested to read more on Bigram see the wiki article at http://en.wikipedia.org/wiki/Bigram.

There is a module called Lingua::EN::Bigram in the CPAN directory. I know that if you are a Perl NLP man you might have created your own perl code for doing the same. Let it be there. We can see how it can be made possible with the above said module.

To install the module run ‘cpan’ as root(Assumes that you are connected to Internet). Type ‘install Lingua::EN::Bigram’ in the cpan terminal and follow the instructions. If the installation is ok then copy and paste below given code to your favourite editor and save it as ‘bigram.pl’ .

#!/usr/bin/env perl
use Lingua::EN::Bigram;

$bigram = new Lingua::EN::Bigram;

while (<>) {
    $text = $_;
    $text =~ tr/A-Z/a-z/;
    $text =~ s/\.//g;
    $text =~ s/\,//g;
    $text =~ s/\?//g;
    $text =~ s/\!//g;
    $text =~ s/\://g;
    $text =~ s/\;//g;
    $bigram->text($text);
    $tscore = $bigram->tscore;
    foreach ( sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore ) {
        print "$$tscore{$_} \t " . "$_\n";
    }
}

To run the program, open terminal and go to the directory where the bigram.pl is located. Type perl bigram.pl <your_textfile> .

Migrated from my old blog jaganadhg.freeflux.net

Written on July 13, 2009

The Opinions Expressed In This Post Are My Own And Not Necessarily Those Of My Employer.

[ Text Processing Natural Language Processing ]