Of late, I have been trying to collect data and, hey presto, I am back to the same old problem of cleaning it up. I decided that regular expressions would not serve me well for splitting pages into individual posts, extracting their tags and titles, and storing them.
I decided to try out the Python HTMLParser, which lets me subclass it and extract the posts. Everything was fine until it hit a malformed tag. Ideally, I would rather the tag were simply ignored, for it really doesn't matter to me if I miss collecting one post. The problem is that Python raises an error and stops processing all the remaining posts.
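A minimal sketch of the "skip the bad post and carry on" behaviour I would like, assuming the page has already been split into one HTML chunk per post. The markup it looks for (`<h3 class="post-title">` for titles, `<a rel="tag">` for tags) and the names `PostParser` and `parse_posts` are made up for illustration, and it uses the Python 3 module name `html.parser`, not the Python 2 `HTMLParser` module I was actually using.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collects the title and tags of a single post (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.tags = []
        self._in_title = False
        self._in_tag = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumed markup: titles in <h3 class="post-title">, tags in <a rel="tag">.
        if tag == "h3" and attrs.get("class") == "post-title":
            self._in_title = True
        elif tag == "a" and attrs.get("rel") == "tag":
            self._in_tag = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False
        elif tag == "a":
            self._in_tag = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_tag:
            self.tags.append(data.strip())


def parse_posts(chunks):
    """Parse each post on its own so one bad post cannot abort the rest."""
    results, skipped = [], 0
    for chunk in chunks:
        parser = PostParser()  # fresh parser per post
        try:
            parser.feed(chunk)
            parser.close()
        except Exception:
            # Give up on this post only and move on to the next one.
            skipped += 1
            continue
        results.append(("".join(parser.title_parts).strip(), parser.tags))
    return results, skipped
```

The design choice here is simply one parser instance and one `try` block per post, so a failure costs a single post instead of the whole run. As far as I know, recent Python 3 releases of `html.parser` are also much more forgiving of malformed tags than the version I was fighting with, so the `except` branch may rarely fire there.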
I need to see if tidy can correct these malformed tags. Some of them are as simple as a missing space between two attributes. A human eye can spot it with syntax highlighting on, but I suppose that is too much intelligence to expect from an HTML parser.
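For the specific case of a missing space between two attributes, a narrow pre-processing pass might be enough before the HTML ever reaches the parser. This is only a heuristic sketch, not what tidy does: the function name `fix_missing_attr_space` is hypothetical, and the pattern assumes quoted attribute values and could in principle also touch look-alike text outside tags.

```python
import re

# A quoted attribute value (="..." or ='...') immediately followed by a
# character that could start the next attribute name, with no space between.
_MISSING_SPACE = re.compile(r"""(=\s*(?:"[^"<>]*"|'[^'<>]*'))(?=[A-Za-z_:])""")


def fix_missing_attr_space(html):
    """Insert the space that is missing between two adjacent attributes."""
    return _MISSING_SPACE.sub(r"\1 ", html)
```

For example, `fix_missing_attr_space('<a href="x"class="y">')` gives `<a href="x" class="y">`, while markup that already has the space is left unchanged.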
What have I ended up doing? Manual correction, for which I need to find a better solution. I certainly can't spend all my time doing this. I should look at the HTML correction tools available out there.
That reminds me: I should download Python 3.0 and check it out. It is not supposed to be backward compatible, and NLTK has not yet been ported to Python 3.0.
Friday, January 16, 2009
2 comments:
If you don't mind Java, have a look at Jericho HTML parser. I use it and it is quite decent: http://jerichohtml.sourceforge.net/doc/index.html
Thanks Alex. It definitely sounds like something useful and may work for me!