Thursday, January 3, 2008

NLTK

I used this toolkit for my NLP project, and although many features did not work as I expected them to, I found it really useful. The toolkit is written in Python, which is an easy and user-friendly language to learn.

Although I knew a bit of Python and used it extensively in the first semester for all the NLP assignments, I only realised the actual utility and convenience of Python for NLP tasks when I read the guide provided with NLTK.

Although I have yet to use all the features NLTK provides, I have used the stemmers and the different types of probability distributions. The learning curve for me was around a week, including the time spent learning Python. Initially, I wondered whether it was really worth all the effort, as I could easily have implemented the algorithms myself in Python or any other language.
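To give an idea of what a probability distribution over words looks like, here is a minimal sketch in plain Python. It does not use NLTK's own classes; the function names (mle_prob, laplace_prob) and the toy sentence are my own assumptions for illustration. It shows a maximum-likelihood estimate next to an add-one (Laplace) smoothed estimate.

```python
from collections import Counter


def mle_prob(counts, word):
    """Maximum-likelihood estimate: relative frequency of the word."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def laplace_prob(counts, word, bins):
    """Add-one smoothed estimate; `bins` is the assumed number of possible outcomes."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + bins)


# Hypothetical toy corpus, used only for illustration.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)

print(mle_prob(counts, "the"))              # seen word
print(mle_prob(counts, "dog"))              # unseen word gets zero probability
print(laplace_prob(counts, "dog", bins=6))  # smoothing gives it a small non-zero mass
```

The point of the smoothed version is that unseen words no longer get probability zero, which matters as soon as the probabilities are multiplied together in a classifier.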

The plus point was that once I learnt how to use the toolkit, making enhancements took no longer than 5 minutes, and in the end I could get quite a lot done.

The clean_html API of NLTK did not work for me. The output either still contained the HTML tags or the text had disappeared! Further, since it uses the underlying HTML parser, it's not resilient to malformed pages on the internet.
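For comparison, here is a rough sketch of stripping tags with Python's standard html.parser module. This is not how clean_html is implemented; the names TextExtractor and strip_html are hypothetical, and skipping script and style content is my own assumption about what "clean" text should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

A parser like this tolerates some unclosed tags, but I would still test it on real pages before relying on it.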

I found it easier to write my own code for the Naive Bayes method, although NLTK provides many methods too. I would say it's definitely been worth trying out the Natural Language Toolkit, and I recommend it!
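To show how little code a hand-written Naive Bayes needs, here is a minimal sketch of a multinomial classifier with add-one smoothing. It is not the code from my project and not NLTK's classifier; the class name, method names and training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, documents):
        """`documents` is an iterable of (tokens, label) pairs."""
        for tokens, label in documents:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # Work in log space to avoid underflow on long documents.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data.
classifier = NaiveBayes()
classifier.train([
    ("good great fun".split(), "pos"),
    ("bad awful boring".split(), "neg"),
])
print(classifier.classify(["great", "fun"]))
```

Classifying before any training returns None, since there are no labels to choose from.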

You can download NLTK at the following site:
http://nltk.sourceforge.net/

2 comments:

Paul Bone said...

Hi Nisha,

I'm an NLTK developer investigating the bug with clean_html() you refer to in this post.

I'm unable to reproduce the problem you've reported. If you could check that this problem occurs for you on a recent version of NLTK (some time has passed since you made this post) and send us some test cases that highlight the problems you're seeing, it would be most helpful.

See http://code.google.com/p/nltk/issues/detail?id=23

Thanks.

Mariana Soffer said...

I loved this tool, it's great. You can use it now with Jython.