Friday, January 16, 2009

wget Web Crawler

I was looking for a web crawler that would be really easy to use, Linux-based, and have zero learning curve, and found wget really good. I remember trying out all those other open-source crawlers and getting lost in them.
http://en.wikipedia.org/wiki/Wget
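
For a simple mirror, one command is about all it takes. This is just a sketch: example.com stands in for the real site, and the depth and politeness delay are placeholder values.

    wget --recursive --level=2 --wait=1 --no-parent \
         --convert-links --page-requisites http://example.com/

--no-parent keeps the crawl from wandering above the starting directory, and --convert-links rewrites the saved pages so they browse locally.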

HTMLParser

Of late, I have been trying to collect data and, hey presto, I am back to the same old problem of cleaning it up. I decided that regular expressions would not serve me well in the task of dividing pages into subposts, extracting tags and titles, and storing them.

I decided to try out Python's HTMLParser, which lets me build on it and extract the posts. Everything was fine until a malformed tag was discovered. Ideally I would rather the tag were ignored, since it really doesn't matter to me if I miss collecting one post. The problem is that Python stops processing, and the rest of the posts are lost.
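
Something like this minimal sketch is what I have in mind; the tag and class names ('div', class 'post') and the file name are made up for illustration, since the real pages are structured differently, and note the import moves to html.parser in Python 3.0.

    from HTMLParser import HTMLParser, HTMLParseError

    class PostExtractor(HTMLParser):
        # Collects the text inside <div class="post"> blocks.
        def __init__(self):
            HTMLParser.__init__(self)
            self.depth = 0    # > 0 while inside a post div
            self.posts = []

        def handle_starttag(self, tag, attrs):
            if self.depth:
                if tag == 'div':          # track nested divs inside a post
                    self.depth += 1
            elif tag == 'div' and ('class', 'post') in attrs:
                self.depth = 1
                self.posts.append('')

        def handle_endtag(self, tag):
            if tag == 'div' and self.depth:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth:
                self.posts[-1] += data

    parser = PostExtractor()
    try:
        parser.feed(open('archive.html').read())
        parser.close()
    except HTMLParseError as err:
        # One bad tag aborts feed() here and the parser cannot resume,
        # so the best I can do is skip the whole file, not just one post.
        print('skipped malformed file: %s' % err)

The try/except at least keeps one bad file from killing the whole run, but it still loses every post in that file, which is exactly the annoyance.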

I need to see if tidy can correct these malformed tags. Some of them are as simple as a missing space between two attributes. A human eye can spot that with syntax highlighting on, but I suppose it's too much intelligence to expect from an HTML parser.
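
Worth trying before anything manual: pipe each file through the tidy command line and parse the cleaned output. A sketch, assuming the tidy tool is installed; --force-output makes it emit repaired markup even when it finds errors.

    import subprocess

    def tidy_up(raw_html):
        # Run HTML Tidy over the raw markup so HTMLParser
        # sees well-formed tags.
        p = subprocess.Popen(
            ['tidy', '-q', '-asxhtml', '--force-output', 'yes'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)
        cleaned, _ = p.communicate(raw_html)
        return cleaned

Tidy exits non-zero whenever it had to repair anything, so the return code alone cannot be treated as failure.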

What have I ended up doing? Manual correction, which I need to find a way out of. I certainly can't spend all my time doing this. I should look at the HTML correction tools available out there.

That reminds me - I should download Python 3.0 and check it out. It's not supposed to be backward compatible, and NLTK has not yet been ported to Python 3.0.