Friday, January 25, 2008

Managed to clean the blogs sufficiently

Well, for those blogs that could not be cleaned even after using HTML Tidy, I fell back on regular expressions. These regular expressions were present in the NLTK guide, and they seem to work. I can use another regular expression to get rid of the extra spaces and the like. However, my problem is now separating posts from each other...
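
Roughly the kind of regex fallback I mean (a minimal sketch, not the exact patterns from the NLTK guide; the file name is made up):

    import re

    def strip_html(page):
        # Remove script/style blocks and comments wholesale, then any remaining tags.
        page = re.sub(r'(?is)<(script|style).*?</\1>', '', page)
        page = re.sub(r'(?s)<!--.*?-->', '', page)
        page = re.sub(r'(?s)<[^>]*>', ' ', page)
        # Collapse the runs of whitespace left behind by the removed markup.
        page = re.sub(r'\s+', ' ', page)
        return page.strip()

    print(strip_html(open('blogpost.html').read()))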

I also have another issue, which is that some of these blogs contain only links to their entries, and hence I am not able to harvest data from them directly... I need to look further into how I can harvest data from these blogs; a rough idea is sketched below.
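
One direction I could take (a rough sketch in Python 2, with a made-up blog URL and a deliberately naive href pattern): pull the front page, collect the links that look like entry pages, and fetch each one separately.

    import re
    import urllib2

    def fetch(url):
        return urllib2.urlopen(url).read()

    base = 'http://someblog.example.com/'   # hypothetical blog
    index = fetch(base)
    # Naively treat any same-site link ending in .html as an entry page.
    links = re.findall(r'href="([^"]+)"', index)
    entries = [l for l in links if l.startswith(base) and l.endswith('.html')]
    for link in set(entries):
        post_html = fetch(link)   # harvest each entry page on its own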

Some blogs get updated pretty often and others don't. I do not want to end up with duplicated data, and right now all I am thinking about is how to separate this duplicate data from the raw text that I have. I could have tried to use the parser, but unfortunately each node is named differently in different blogs, and I simply can't enumerate all the possible options.
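
One idea I might try (a rough sketch, nothing blog-specific about it): fingerprint each cleaned post and skip anything whose fingerprint has already been seen.

    import hashlib

    seen = set()

    def is_duplicate(post_text):
        # Normalise whitespace first so trivial formatting changes
        # don't defeat the comparison.
        normalised = ' '.join(post_text.split())
        fingerprint = hashlib.md5(normalised.encode('utf-8')).hexdigest()
        if fingerprint in seen:
            return True
        seen.add(fingerprint)
        return False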

I also want to get rid of all the advertisements and archive dates, since these will end up being spurious features. How do I get rid of those? The data collection and cleaning part is really the toughest part of all. Without the data I really cannot analyze anything or run the machine learning algorithms.

Thursday, January 24, 2008

Extracting text from HTML

This is a task I have been at for many months: trying to find the perfect solution for extracting text from an HTML webpage. I have tried many options, of which, on Windows, Emsa HTMLRem is definitely good. However, since most of my work is on Linux, I was not too thrilled with the idea of extracting data on Windows and thereafter FTPing it to Linux.

Yesterday was therefore spent looking at many options. The NLTK toolkit's clean_html API works for a few websites; I also tried running HTML Tidy before using clean_html. This approach worked for some websites and did not for others.
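
The pipeline, roughly (a sketch; it assumes the tidy command-line tool is on the path, and clean_html as it is exposed in the version of NLTK I am using):

    import os
    import nltk

    def tidy_then_clean(infile):
        # Let HTML Tidy repair the markup first; tidy exits nonzero even on
        # mere warnings, so its return value is not checked here.
        os.system('tidy -q -asxhtml -o tidied.html %s 2>/dev/null' % infile)
        return nltk.clean_html(open('tidied.html').read())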

I now have to try some other technique, probably regular expressions... As they say, the data collection and cleaning part is the most difficult part of any task.

Thursday, January 3, 2008

NLTK

I used this toolkit for my NLP project, and although there were many features that did not work as I expected, I found it really useful. The toolkit is written in Python, and Python is a very easy and user-friendly language to learn.

Although I knew a bit of Python and used it extensively in the first semester for all the NLP assignments, I realised the actual utility and convenience of Python for NLP tasks only when I read the guide provided with the NLTK toolkit.

Although I am yet to use all the features provided in NLTK, I have used the stemmers and different types of probability distributions. The learning curve for me was around a week, including the time spent learning Python. Initially, I wondered if it was really worth all the effort, as I could easily have implemented the algorithms in Python or any other language.
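
As a small taste of the kind of thing I mean (assuming the class names as they appear in the NLTK version I am using):

    import nltk

    words = ['cleaning', 'cleaned', 'cleans', 'blogs', 'blogging']
    stemmer = nltk.PorterStemmer()
    stems = [stemmer.stem(w) for w in words]

    # Frequency distribution over the stems, and its maximum-likelihood estimate.
    fdist = nltk.FreqDist(stems)
    probdist = nltk.MLEProbDist(fdist)
    print(probdist.prob('clean'))   # 3 of the 5 stems are 'clean' -> 0.6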

The plus point was that once I learnt how to use the toolkit, making enhancements took no longer than five minutes, and in the end I could get quite a lot done.

The clean_html API of NLTK did not work for me: either the output still contained the HTML tags, or the text had disappeared! Further, since it uses the underlying HTML parser, it is not resilient to the malformed pages found on the internet.

I found it easier to write my own code implementing the Naive Bayes method, though NLTK provides its own methods too. I would say it has definitely been worth trying out the Natural Language Toolkit, and I recommend it!
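
For what it is worth, the core of a hand-rolled Naive Bayes text classifier is quite short. Here is a sketch of the general method with add-one smoothing (not my project code):

    from collections import defaultdict
    import math

    def train(labelled_docs):
        # labelled_docs: list of (list_of_words, label) pairs
        class_count = defaultdict(int)
        word_count = defaultdict(lambda: defaultdict(int))
        vocab = set()
        for words, label in labelled_docs:
            class_count[label] += 1
            for w in words:
                word_count[label][w] += 1
                vocab.add(w)
        return class_count, word_count, vocab

    def classify(words, class_count, word_count, vocab):
        total_docs = sum(class_count.values())
        best, best_score = None, None
        for label in class_count:
            # log P(label) + sum of log P(word | label), add-one smoothed
            score = math.log(class_count[label] / float(total_docs))
            denom = sum(word_count[label].values()) + len(vocab)
            for w in words:
                score += math.log((word_count[label][w] + 1.0) / denom)
            if best_score is None or score > best_score:
                best, best_score = label, score
        return best

    # Tiny made-up example: docs are (word list, label) pairs.
    data = [(['good', 'great'], 'pos'), (['bad', 'awful'], 'neg')]
    model = train(data)
    print(classify(['great', 'movie'], *model))   # -> 'pos'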

You can download NLTK at the following site:
http://nltk.sourceforge.net/