Friday, January 25, 2008

Managed to clean the blogs sufficiently

Well, for those blogs that could not be cleaned even with Tidy, I fell back on regular expressions. These regular expressions come from the NLTK guide and they seem to work. I can use another regular expression to get rid of the extra spaces and the like. However, my problem now is separating posts from each other...
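The regex fallback plus whitespace cleanup can be sketched roughly like this (a minimal sketch, not the exact expressions from the NLTK guide; the function name and patterns are my own):

```python
import re

def clean_html(raw):
    """Strip tags and collapse whitespace from raw blog HTML
    using regular expressions (a rough fallback when Tidy fails)."""
    # Drop script/style blocks wholesale, since their contents are not prose.
    text = re.sub(r'<script.*?</script>', ' ', raw, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<style.*?</style>', ' ', text, flags=re.DOTALL | re.IGNORECASE)
    # Drop any remaining tags and simple HTML entities.
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'&\w+;', ' ', text)
    # The "get rid of the spaces" step: collapse runs of whitespace.
    return re.sub(r'\s+', ' ', text).strip()
```

For example, `clean_html('<p>Hello <b>world</b></p>')` gives `'Hello world'`.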

I also have another issue: some of these blogs contain only links to their entries, so I am not able to harvest data from them directly. I need to look further into how I can harvest data from these blogs.

Some blogs get updated pretty often and others don't. I do not want to end up with duplicated data, and right now all I am thinking about is how to separate this duplicate data from the raw text that I have. I could have tried to use the parser, but unfortunately each node is named differently on different blogs and I simply can't enumerate all the possible options.
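One way to weed out duplicates without parsing the markup is to key each post on a hash of its normalized text and keep only the first copy. A minimal sketch, assuming the posts have already been split into a list of raw strings:

```python
import hashlib

def dedupe_posts(posts):
    """Keep the first copy of each post, keyed by an MD5 hash of its
    whitespace-normalized, lowercased text."""
    seen = set()
    unique = []
    for post in posts:
        normalized = ' '.join(post.split()).lower()
        key = hashlib.md5(normalized.encode('utf-8')).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Normalizing before hashing means the same post re-fetched with slightly different whitespace still counts as a duplicate; posts that changed in wording, though, would slip through as new.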

I also want to get rid of all the advertisements and archive dates, for these will end up being spurious features. How do I get rid of those? The data collection and cleaning part is really the toughest part of all. Without the data I really cannot analyze it or run the machine learning algorithms on it.
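One crude option for the ads and archive dates is a line filter over the cleaned text: drop any line that looks like a bare "Month Year" archive link or contains an ad marker. The patterns below are hypothetical examples, not an exhaustive list:

```python
import re

# Hypothetical patterns for lines that are probably navigation noise
# rather than post content: bare archive dates and obvious ad markers.
NOISE = re.compile(
    r'^(January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s+\d{4}$'
    r'|advert|sponsored',
    re.IGNORECASE)

def drop_noise(lines):
    """Return only the lines that do not match a noise pattern."""
    return [line for line in lines if not NOISE.search(line.strip())]
```

This would still miss ads that read like ordinary prose, but it removes the archive-date lists that would otherwise show up as spurious features.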
