Natural Language Processing: Support Vector Machine for Topic Detection

Wednesday, February 11, 2009

Support Vector Machine for Topic Detection

I decided to use a support vector machine for topic detection and, well, the results are not too good. The problem I am facing is the long-standing one of imbalanced data sets: the ratio of positive to negative examples can be 1:1000 or worse. I was curious to see what the result would be, and in the process managed to fill up 2 TB just to get a training file. I am now doing some research on how to deal with the imbalance in the data. There are two techniques that are best avoided because their results are not so good: oversampling (replicating the positive samples) and undersampling (using fewer of the negative samples).
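
For anyone who has not seen the two techniques, here is a minimal sketch of what they do to a training set. This is not the code I ran; the function names `undersample` and `oversample`, the toy documents, and the 5:5000 class sizes are all made up for illustration, and it uses plain random sampling from the standard library.

```python
import random

def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    `ratio` is the number of negatives kept per positive (hypothetical knob).
    """
    rng = random.Random(seed)
    k = min(len(negatives), int(ratio * len(positives)))
    return positives + rng.sample(negatives, k)

def oversample(positives, negatives, seed=0):
    """Replicate positives (sampling with replacement) until both classes match in size."""
    rng = random.Random(seed)
    extra = max(0, len(negatives) - len(positives))
    replicated = [rng.choice(positives) for _ in range(extra)]
    return positives + replicated + negatives

# Toy data with a 1:1000 positive:negative ratio; each item is (document id, label).
positives = [("pos_doc_%d" % i, 1) for i in range(5)]
negatives = [("neg_doc_%d" % i, 0) for i in range(5000)]

under = undersample(positives, negatives)
over = oversample(positives, negatives)
print(len(under), len(over))  # 10 10000
```

Undersampling throws away almost all of the negative examples, while oversampling only repeats the same few positives, which is one way to see why neither adds any new information for the classifier.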

3 comments:

Jason M. Adams said...: Hey, you can always achieve 99.9% accuracy by just choosing negative all the time.. :) Of course, your recall is nil in that case.

Sounds like a good time to get to know your data -- maybe there are some unusual features that you can uncover that would be more helpful.; February 11, 2009 at 9:26 PM
Unknown said...: The reason for your dilemma is due to the absence of data matching. Differentiate the data and match positive components with negative components. You will have to invert the odd data by viewing its source or source family.; March 2, 2009 at 7:58 AM
Nisha said...: Jason and Research & Solutions - Thank you for the comments. I am looking into new features that will hopefully distinguish; April 12, 2009 at 9:44 AM
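
Jason's point about accuracy versus recall can be checked with a little arithmetic. The sketch below assumes a toy set of 1 positive and 1000 negative examples and a classifier that always answers negative; the helper `accuracy_and_recall` is a hypothetical name, not part of any library.

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

# 1:1000 positive:negative ratio, and a "classifier" that always predicts negative.
y_true = [1] + [0] * 1000
y_pred = [0] * 1001

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.4f} recall={rec:.1f}")  # accuracy=0.9990 recall=0.0
```

So accuracy alone says very little at this ratio; recall (and precision) on the positive class is the number to watch.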
