Small Mosaic


Sun, 04 Jan 2009

Simple Stemming with Perl
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.
-- Wikipedia article on Stemming

Ever used a website that allowed you to tag content? Ever ended up accidentally using slightly different tags for the same thing? Something like graphs and graphing, or blog and blogs? (I hope so, otherwise it's just me...) To spot some of the more obvious overlaps you can stem each of the tags and look for a common base. Where one is found there is a possibility of accidental duplication. For example, if you passed hunts, hunted and hunting through a stemmer, each would return 'hunt'. If you want to try it for yourself there are online stemmers available.
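
To make that concrete, here is a minimal sketch of the hunt example, assuming Lingua::Stem is installed from CPAN and that its English stemmer behaves as described above. The word list is just the one from the previous paragraph.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem;

# Same locale as the rest of this post; the word list is illustrative.
my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );

my @words = qw(hunts hunted hunting);

# stem() takes a list of words and returns a reference to an array
# holding one stem per input word, in the same order.
my $stems = $stemmer->stem(@words);

for my $i ( 0 .. $#words ) {
    print "$words[$i] -> $stems->[$i]\n";
}
```

All three words should come back with the same stem, and that shared stem is what makes overlapping tags easy to spot.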

As a more concrete example let's look at the wonderful service del.icio.us. You upload your bookmarks, tag them with a number of keywords, and can then group, sort and search them by your own terms. Except I have a habit of tagging articles about similar topics with nearly, but not quite, the same tag.

The Perl code below shows how easy it is (using Lingua::Stem and Net::Delicious from CPAN) to run your own data through a stemmer and look for overlaps. There are implementations in most languages (PyStemmer is also very nice), and the Wikipedia article is an easy-to-follow introduction.


#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem;
use Net::Delicious;

# replace these with your own del.icio.us credentials
my $del = Net::Delicious->new(
                               {
                                 user => "username",
                                 pswd => "password"
                               }
                             );

my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );

# map each stem to the list of tags that reduce to it
my %stems;
for my $tag ( $del->tags() ) {
  # stem() returns an array reference, one stem per word passed in
  my $stemmed = $stemmer->stem( $tag->tag );

  push( @{ $stems{$stemmed->[0]} },  $tag->tag );
}

for my $stemmed (sort keys %stems ) {
  # we only care about base words with more than one tag associated
  next unless ( scalar @{ $stems{$stemmed} } > 1);

  print "Possible duplicates -\n";
  print "  --  ";
  print join(" : ", @{ $stems{$stemmed} }), "\n";
}
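
If you'd rather try the grouping step without a del.icio.us account, something like the following sketch should work on any list of tags. The find_overlaps sub and the sample tags are my own illustrative names and data, not part of Lingua::Stem or Net::Delicious.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem;

# Hypothetical helper: group tags by their stem and return a hash
# reference containing only the stems shared by two or more tags.
sub find_overlaps {
    my @tags = @_;

    my $stemmer = Lingua::Stem->new( -locale => 'EN-UK' );
    my $stems   = $stemmer->stem(@tags);    # one stem per tag, same order

    my %by_stem;
    for my $i ( 0 .. $#tags ) {
        push @{ $by_stem{ $stems->[$i] } }, $tags[$i];
    }

    # drop stems that only a single tag reduced to
    for my $stem ( keys %by_stem ) {
        delete $by_stem{$stem} if @{ $by_stem{$stem} } < 2;
    }

    return \%by_stem;
}

# Made-up sample tags, standing in for a real tag export.
my @sample_tags = qw(graphs graphing blog blogs perl);

my $overlaps = find_overlaps(@sample_tags);
for my $stem ( sort keys %$overlaps ) {
    print "Possible duplicates - ", join( " : ", @{ $overlaps->{$stem} } ), "\n";
}
```

Stemming the whole list in one call saves a method call per tag, and keeping the grouping in its own sub means the same check can be pointed at tags from any source.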



Posted: 2009/01/04 19:32 | /perl



Copyright © 2000-2005 Dean Wilson