Thursday, January 24, 2008

Extracting text from HTML

I have been at this task for many months: trying to find a good way to extract text from an HTML web page. I have tried many options, and for Windows, Emsa HTMLRem is definitely good. However, since most of my work is on Linux, I was not too thrilled with the idea of extracting data on Windows and then FTPing it over to Linux.

Yesterday was therefore spent looking at many options. The NLTK toolkit's clean_html API works for a few websites, and I also tried running pages through HTML Tidy before calling clean_html. This approach worked for some websites and failed for others.
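
For reference, a minimal sketch of that approach, assuming the 2008-era NLTK (clean_html was removed in NLTK 3.0) and Python 2's urllib2; the URL is only an example:

import urllib2
import nltk

# fetch the page and strip its markup with NLTK's clean_html
html = urllib2.urlopen("http://news.bbc.co.uk/").read()
text = nltk.clean_html(html)
print text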

I now have to try some other technique, probably regular expressions (a rough sketch below)... As they say, data collection and cleaning is the most difficult part of any task.
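
One possibility is a crude regex-based stripper like the following. The function name is mine, and regexes are famously brittle on real-world HTML, so treat this as a starting point only:

import re

def strip_html(html):
    # drop script/style blocks wholesale; their contents are not visible text
    html = re.sub(r'(?is)<(script|style)[^>]*>.*?</\1>', '', html)
    # drop HTML comments
    html = re.sub(r'(?s)<!--.*?-->', '', html)
    # replace any remaining tag with a space
    text = re.sub(r'(?s)<[^>]+>', ' ', html)
    # collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()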

2 comments:

Unknown said...

not exactly sure what you are looking for, but if you want only the visible text, you might want to check out lynx as a starting point:

lynx -dump -nonumbers http://news.bbc.co.uk/

to support utf-8, edit the lynx.cfg file by adding/modifying the following:

CHARACTER_SET:utf-8
ASSUME_CHARSET:utf-8

then try something like:

lynx -dump -nonumbers http://news.bbc.co.uk/hi/arabic/news/

it's a simple and painless way to grab text without the headaches associated with cleaning and parsing HTML, and it's scripting-friendly. let their browser do the dirty work.
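
for instance, a minimal way to script it from Python (assuming lynx is on the PATH; the function name is just illustrative):

import subprocess

def page_text(url):
    # run lynx and capture the rendered text dump from stdout
    p = subprocess.Popen(["lynx", "-dump", "-nonumbers", url],
                         stdout=subprocess.PIPE)
    out, _ = p.communicate()
    return out

print page_text("http://news.bbc.co.uk/")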

i'm trying to do something similar and am just getting started. if you have any tips, etc., they are more than welcome!

mike garbus

atlas245 said...

Nice post on extracting data, simple and to the point :). For simple things I use Python, but data extraction can be a time-consuming process. For larger projects involving documents, the web, or files, I tried "extracting data", which worked great; they build quick custom screen scrapers and data-parsing programs.