Wednesday, October 8, 2008

Topic Detection

One of the sub-tasks in determining the topic is determining who is being talked about in a particular text. I tried Named Entity Recognition using the Open-NLP toolkit, and it looks like it did not identify all the "names" in the text. A simple frequency-based approach seems to have worked better. What did I do? Well, I eliminated all the stopwords (there are numerous stopword lists available online), then created a frequency distribution table of the remaining words and looked at the really frequent ones: words whose count was above the mean + 5 SD. These were mostly proper nouns and were indeed about the person being talked about. So the named-entity recognition tool should have detected all these words too, although I still need to look into this further.
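
The frequency-based approach described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the code actually used for this post: the tiny stopword set, the regex tokenizer, the use of the population standard deviation, and the function name `frequent_words` are all placeholders chosen for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Assumption: a tiny illustrative stopword set. A real run would load one of
# the fuller published stopword lists instead.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "she", "that", "the", "to",
    "was", "were", "will", "with",
})


def frequent_words(text, stopwords=STOPWORDS, num_sd=5.0):
    """Return (word, count) pairs whose count exceeds mean + num_sd * SD."""
    # Assumption: a simple lowercase alphabetic tokenizer is good enough here.
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in stopwords]
    counts = Counter(tokens)
    if not counts:
        return []
    values = list(counts.values())
    # Threshold over the per-word counts, using the population SD.
    threshold = mean(values) + num_sd * pstdev(values)
    return [(word, n) for word, n in counts.most_common() if n > threshold]
```

Calling `frequent_words(article_text)` on a long article returns only the words that are extreme outliers in the frequency table. Because the sketch lowercases everything, it cannot tell proper nouns from other words, so the result still has to be inspected by hand. On short texts with few distinct words the mean + 5 SD cutoff may leave nothing at all, in which case a smaller `num_sd` is worth trying.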

I tried using the summarization project that I did last semester, but it does not suit this task. Perhaps I should look at the text and try to find a distinguishing feature... I have yet to try out the topic toolbox.

5 comments:

Unknown said...

Hi
I read your blog. You wrote about summary generation of a document. I am a final-year student of Software Engineering and I have a similar project to implement. Can you please guide me on this?
Waiting for your response before 10 March '09, please :(...
Thanks
faffe
P.S. Please reply to my e-mail id: farheena87@hotmail.com

alind sharma said...

That's a nice collection of NLP work. I am also interested in doing NLP for text categorisation. As is evident from your posts, you are using some Python. Just want to know whether Python is good enough in terms of speed for NLP. What is your experience with regard to this (if you are using Python for core NLP)? You can PM me at alindsharma@gmail.com if you have time.

Nisha said...

Mehwish - Well, I hope your project went well. I was tied up with my research and it's been some time since I looked this blog up.

Alind - I use Python extensively and it is definitely slower. Luckily for me, processing speed and storage space are not too much of an issue. I find Python more intuitive for NLP.

Abhishek said...

Have you tried using the OpenCalais web service, which is free? I am sure it does a pretty good job of identifying names.

Nisha said...

Thanks Abhishek. I will take a look at it.