Wednesday, February 11, 2009

Support Vector Machine for Topic Detection

I decided to use a support vector machine for topic detection, and the results are not very good. The problem I am facing is the long-standing one of imbalanced data sets: the ratio of positive to negative examples can be 1:1000 or worse. I was curious to see what the result would be anyway, and in the process managed to fill up 2 TB just to produce a training file. I am now doing some research on how to deal with the imbalance in the data. Two techniques are commonly used, though reportedly without very good results: oversampling (replicating the positive samples) and undersampling (using fewer of the negative samples).
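
To make the two resampling ideas concrete, here is a minimal sketch in plain Python. It is not the pipeline I actually ran: the function names `oversample` and `undersample`, the toy data, and the choice to balance the classes to exactly 1:1 are all assumptions made for illustration.

```python
import random


def oversample(positives, negatives, seed=0):
    """Replicate positive examples (drawn with replacement) until both classes are the same size."""
    rng = random.Random(seed)
    shortfall = max(0, len(negatives) - len(positives))
    extra = [rng.choice(positives) for _ in range(shortfall)]
    return positives + extra, negatives


def undersample(positives, negatives, seed=0):
    """Keep a random subset of the negatives, as many as there are positives."""
    rng = random.Random(seed)
    keep = min(len(positives), len(negatives))
    return positives, rng.sample(negatives, keep)


# Toy data with a 1:1000 positive:negative ratio (hypothetical, for illustration only).
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(5000)]

over_pos, over_neg = oversample(positives, negatives)
under_pos, under_neg = undersample(positives, negatives)

print(len(over_pos), len(over_neg))    # 5000 5000
print(len(under_pos), len(under_neg))  # 5 5
```

Oversampling leaves the training set as large as before (or larger) and only repeats the same few positives, while undersampling throws away most of the negatives, which may be part of why neither is reported to work especially well.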