Friday, January 16, 2009

HTMLParser

Of late, I have been trying to collect data and, hey presto, I am back to the same old problem of cleaning it up. I decided that regular expressions would not serve me well for splitting the pages into individual posts, extracting their tags and titles, and storing them.

I decided to try out Python's HTMLParser, which I can subclass to extract the posts. Everything was fine until it hit a malformed tag. Ideally the tag would simply be ignored, since it really doesn't matter to me if I miss one post. The problem is that Python stops processing the entire set of posts.
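A minimal sketch of this kind of subclass, written for Python 3's html.parser module. The markup it looks for (an h3 with class "post-title" for titles, links with rel="tag" for tags) and the names PostExtractor and extract_from_chunks are assumptions made for illustration, not the layout of the actual pages. Each chunk is parsed separately so that a failure costs one post rather than the whole run. Note that the parser in current Python 3 releases is far more lenient than the 2009 one, so the except branch is mostly a safeguard.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect post titles and tags (hypothetical markup, see lead-in)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # one dict per post: {"title": str, "tags": [str]}
        self._field = None   # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "post-title":
            self.posts.append({"title": "", "tags": []})
            self._field = "title"
        elif tag == "a" and attrs.get("rel") == "tag" and self.posts:
            self.posts[-1]["tags"].append("")
            self._field = "tag"

    def handle_endtag(self, tag):
        if (tag == "h3" and self._field == "title") or (
            tag == "a" and self._field == "tag"
        ):
            self._field = None

    def handle_data(self, data):
        if self._field == "title":
            self.posts[-1]["title"] += data
        elif self._field == "tag":
            self.posts[-1]["tags"][-1] += data


def extract_from_chunks(chunks):
    """Parse each chunk with a fresh parser; skip any chunk that fails."""
    posts, skipped = [], 0
    for chunk in chunks:
        parser = PostExtractor()
        try:
            parser.feed(chunk)
            parser.close()
        except Exception:  # broad on purpose: losing one post is acceptable
            skipped += 1
            continue
        posts.extend(parser.posts)
    return posts, skipped
```

How the page is cut into chunks is left to the caller; splitting on whatever wrapper element surrounds each post would be one option.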

I need to see if tidy can correct these malformed tags. Some of the errors are as simple as a missing space between two attributes. A human eye can spot that with syntax highlighting on, but I suppose it's too much intelligence to expect from an HTML parser.
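For the missing-space case specifically, a small pre-processing pass may be enough. The sketch below is an assumption about what such a repair could look like, not something tidy does; the function name add_missing_attribute_spaces is made up. It only looks inside tags and inserts a space after a quoted attribute value that runs straight into the next attribute. It is a blunt tool: a ">" inside an attribute value, or script content that resembles a tag, would confuse it.

```python
import re

# Anything that looks like a tag: "<", no nested angle brackets, ">".
TAG_RE = re.compile(r"<[^<>]+>")

# name="value" or name='value' followed immediately by something other
# than whitespace, "/" or ">", i.e. the next attribute with no space.
ATTR_RE = re.compile(
    r"""([^\s"'=<>/]+\s*=\s*(?:"[^"]*"|'[^']*'))(?=[^\s/>])"""
)


def add_missing_attribute_spaces(html):
    """Insert the space missing between adjacent attributes inside tags."""

    def fix_tag(match):
        return ATTR_RE.sub(r"\1 ", match.group(0))

    return TAG_RE.sub(fix_tag, html)
```

Running the cleaned text through the parser afterwards keeps the repair step separate from the extraction step, so either can be replaced later.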

What have I ended up doing? Manual correction, for which I need to find a better solution. I certainly can't spend all my time doing this. I should look at the HTML correction tools available out there.

That reminds me: I should download Python 3.0 and check it out. It is not backward compatible with Python 2, and NLTK has not yet been ported to Python 3.0.

2 comments:

Anonymous said...

If you don't mind Java, have a look at Jericho HTML parser. I use it and it is quite decent: http://jerichohtml.sourceforge.net/doc/index.html

Nisha said...

Thanks Alex. It definitely sounds like something useful and may work for me!