Wednesday, February 11, 2009

Support Vector Machine for Topic Detection

I decided to use a support vector machine for topic detection and, well, the results are not too good. The problem I am facing is the long-standing one of imbalanced data sets: I can find a ratio of 1:1000 or more for positive:negative examples. I was curious to know what the result would be, and in the process managed to fill up 2 TB just to get a training file. I am now doing some research on how to deal with the imbalance in the data. There are two techniques that apparently do not give very good results: oversampling (replicating the positive samples) and undersampling (using fewer of the negative samples).
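For reference, the two resampling ideas can be sketched in a few lines of Python. This is only a toy illustration, not the pipeline used for the experiments here: the helper names `undersample` and `oversample`, the `ratio` parameter, and the made-up 5 vs. 5000 example lists are all assumptions for the sake of the example.

```python
import random

def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    ratio is the number of negatives kept per positive (hypothetical knob).
    """
    rng = random.Random(seed)
    k = min(len(negatives), int(len(positives) * ratio))
    return positives, rng.sample(negatives, k)

def oversample(positives, negatives, ratio=1.0, seed=0):
    """Keep every negative and replicate positives (drawn with replacement)
    until there are roughly ratio * len(negatives) of them."""
    rng = random.Random(seed)
    target = max(len(positives), int(len(negatives) * ratio))
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return positives + extra, negatives

# Toy data with a 1:1000 positive:negative imbalance.
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(5000)]

under_pos, under_neg = undersample(positives, negatives)
over_pos, over_neg = oversample(positives, negatives)

print(len(under_pos), len(under_neg))  # 5 5
print(len(over_pos), len(over_neg))    # 5000 5000
```

Undersampling throws away most of the negative examples, while oversampling adds no new information, only copies of the same few positives. That is one plausible reason why neither helps much on its own.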

3 comments:

Jason M. Adams said...

Hey, you can always achieve 99.9% accuracy by just choosing negative all the time.. :) Of course, your recall is nil in that case.

Sounds like a good time to get to know your data -- maybe there are some unusual features that you can uncover that would be more helpful.

Unknown said...

The reason for your dilemma is the absence of data matching. Differentiate the data and match positive components with negative components. You will have to invert the odd data by viewing its source or source family.

Nisha said...

Jason and Research & Solutions - Thank you for the comments. I am looking into new features that will hopefully distinguish