Wednesday, October 8, 2008

Topic Detection

One of the sub-tasks in determining the topic is determining who is being talked about in a particular text. I tried Named Entity Recognition using the OpenNLP toolkit, and it looks like it did not identify all the "names" in the text. A simple frequency-based approach seems to have worked better. What did I do? Well, I eliminated all the stopwords (there are numerous stopword lists available online), then created a frequency distribution table of the remaining words and looked at the really frequent ones: words occurring more often than mean + 5 SD. These were mostly proper nouns and were indeed about the person being talked about. The named-entity recognition tool should therefore have detected all these words too, although I still need to look into this further.
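
Here is a minimal sketch of the frequency approach described above, in Python. It is only an illustration under a few assumptions: the stopword list is a tiny placeholder (a real run would load one of the published lists), tokenizing is a plain regex, the names frequent_words and k are made up for this example, and the SD is the population standard deviation, since the post does not say which one was used.

```python
import re
import statistics
from collections import Counter

# Placeholder stopword list (assumption); swap in a fuller published list.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "he", "her", "his", "i", "in", "is", "it", "of", "on", "or", "she",
    "that", "the", "their", "they", "this", "to", "was", "were", "with",
}


def frequent_words(text, stopwords=STOPWORDS, k=5.0):
    """Return (word, count) pairs whose count exceeds mean + k * SD."""
    # Lowercase and split into alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())

    # Frequency distribution of everything that is not a stopword.
    counts = Counter(t for t in tokens if t not in stopwords)
    if len(counts) < 2:
        return []  # not enough distinct words to talk about a spread

    freqs = list(counts.values())
    threshold = statistics.mean(freqs) + k * statistics.pstdev(freqs)

    # most_common() already sorts by descending count.
    return [(word, n) for word, n in counts.most_common() if n > threshold]
```

One thing to keep in mind: with the population SD, no count can lie more than sqrt(n - 1) SDs above the mean of n counts, so a text needs at least 27 distinct non-stopword words before anything can clear a 5 SD cutoff. On short texts a smaller k is the only way to get candidates.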

I tried using the summarization project that I did last semester, but it does not suit this task. Perhaps I should look at the text and try to find a distinguishing feature... I am yet to try out the topic toolbox.

Wednesday, October 1, 2008

Topic Detection

Of late, I have been trying to extract topics from a given text. I found a few tools that are available, but I am yet to figure out how to get them working. The problem with these tools is that they are written in different languages, so every time one wants to use a new tool, one needs to learn a new language. Matlab Topic Modelling Toolbox 1.3.2 (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm) seems to be an interesting tool to explore. The issue I face with most of the tools is that they were developed with certain assumptions in mind, so we are never sure whether the results are accurate for our problem. So here I am, trying to find a simple solution, being the lazy person that I am.