Monday, April 5, 2010


I finally got a paper published, and the bonus is that it was my first submission. The paper is titled "Automatic Detection of Tags in Political Blogs" and will appear in the NAACL Social Media Conference 2010.

Thursday, October 22, 2009

Choosing an advisor

Although this is not really related to NLP, I am putting it out in the hope that it will help students who are facing the quandary of choosing the right advisor.

Choosing the right advisor can sometimes make the difference between completing your PhD and not. It therefore pays to be very careful and to get the choice right the first time.

Steps to consider
1) Make a list of areas that you are interested in doing research in. Make a list of the faculty members that work in those areas.
2) Find out which students work under these professors. They are the ones who can actually tell you how the advisor works and what his advising style is. In particular, ask them:
   1) How is his funding? Does he have many grant proposals in the pipeline? When is his funding going to end? If the professor is not working on any grant proposals, you may end up paying for your PhD yourself!
   2) Has the professor made significant progress in the last two years? If he hasn't published papers in the last two years, it's quite likely that he either does not work towards publishing papers or does not advise his students enough. There may be rare cases where the advisor is very particular about the work. However, two years is definitely enough time to publish one paper.
   3) Does the professor have any students? Have students left this advisor? If the professor has not managed to retain any PhD students, steer clear: it means something is wrong.
   4) Find out the professor's attitude: does he bother to be available for his students, to answer questions, to review students' work? Believe me, I have had the worst experience working with a professor who never bothered to find out whether his students were making progress, or working at all. This kind of feedback normally travels on the grapevine, so get in touch with other PhD students, preferably many, and someone will have heard something.
   5) How long do the professor's students take to graduate? That gives you an idea of the amount of time required to do your PhD under the professor. Also, look at the quality of the journals and conferences that the professor has published in. That will give you an idea of the quality of work you would need to produce to survive with the professor.
   6) Find an advisor whose style works with yours, for example in how closely he supervises. I always think that choosing an advisor is kind of like getting married. The decision should be given a lot of thought.

Now, just as marriages go bad, what do you do when you end up with a bad advisor? Do you stick around or look for another advisor? In most situations, one can come to a compromise and still complete the PhD. There's no advisor who is totally perfect.

However, if you see the warning signs I mentioned above, the best time to switch is immediately, in the second semester, and at the latest within the first year. If switching to another advisor by yourself does not work, get the Department Head or the Dean involved. After all, your aim is to complete your PhD.

Wednesday, August 26, 2009

Opinion Mining and Sentiment Analysis by Bo Pang and Lillian Lee

For those of you interested in opinion mining and sentiment analysis, the book Opinion Mining and Sentiment Analysis by Bo Pang and Lillian Lee seems to be interesting and explains the concepts very simply.

There's an author-formatted version available at the following link:

Google Books also gives access to parts of the book. Worth reading.

Friday, August 21, 2009

Writing for Computer Science by Justin Zobel

I have absolutely fallen in love with the book Writing for Computer Science by Justin Zobel. All these years I have been going about my research learning through experience the very things that are spelled out in the book. Had I read the book earlier, it would have been a much easier task for me.

This book is good for newcomers to research as well as experienced researchers. It has many checklists, which I am already using and which make my research easier. Although I have yet to read the entire book, it clearly provides guidance on how to write a computer science paper. In my opinion, it's a must-have for anyone who needs to conduct research and write about it in computer science.

Wednesday, February 11, 2009

Support Vector Machine for Topic Detection

I decided to use a support vector machine for topic detection and, well, the results are not too good. The problem I am facing is the long-standing problem of imbalanced data sets: I can find a ratio of 1:1000 or more for positive:negative examples. I was curious to know what the result would be, and in the process managed to fill up 2 TB just to get a training file. I am now doing some research on how to go ahead and deal with the imbalance in the data. There are two commonly used techniques, neither of which is reported to give very good results: oversampling (replicating the positive samples) and undersampling (using fewer of the negative samples).
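To make the two resampling ideas concrete, here is a minimal sketch in plain Python. It is not what I ran for the experiments above; the function names, the `ratio` parameter and the `(features, label)` tuples are all made up for illustration, and it assumes the examples fit in memory as two lists.

```python
import random


def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    ratio is the desired number of negatives per positive (hypothetical knob).
    """
    rng = random.Random(seed)
    keep = min(len(negatives), int(len(positives) * ratio))
    return positives + rng.sample(negatives, keep)


def oversample(positives, negatives, ratio=1.0, seed=0):
    """Keep every negative and replicate positives by sampling with replacement.

    ratio is the desired number of positives per negative (hypothetical knob).
    """
    rng = random.Random(seed)
    target = max(len(positives), int(len(negatives) * ratio))
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return positives + extra + negatives
```

Undersampling throws away most of the negative examples, while oversampling keeps everything but makes the training file even bigger, which matters when the file is already filling up disks.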

Friday, January 16, 2009

wget Web Crawler

I was looking for a web crawler that would be really easy to use, Linux-based and have a zero learning curve, and found wget really good. I remember trying out all those other open source crawlers and getting lost in them.


Of late, I have been trying to collect data and, hey presto, I am back to the same old problem of cleaning up my data. I decided that regular expressions would not serve me well for the task of dividing pages into subposts, extracting tags and titles, and storing them.

I decided to try out the Python HTMLParser, which I can build on to extract the posts. Everything was fine until it hit a malformed tag. Ideally I would rather the tag were simply ignored, for it really doesn't matter to me if I miss collecting one post. The problem is that Python stops processing all of the posts.
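One way to get the "skip the bad post and carry on" behaviour is to parse each post separately and catch whatever the parser throws. The sketch below is written against Python 3's `html.parser`, which as far as I know is a lot more forgiving of malformed markup than the Python 2 parser I am using. The `TitleExtractor` class, the `extract_titles` function and the `<h3 class="post-title">` markup are all hypothetical, and it assumes the page has already been split into one HTML string per post.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside <h3 class="post-title"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and ("class", "post-title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def extract_titles(posts):
    """Parse each post on its own so one bad post cannot abort the whole run."""
    titles, skipped = [], 0
    for html in posts:
        parser = TitleExtractor()  # fresh parser per post, no shared state
        try:
            parser.feed(html)
            parser.close()
        except Exception:
            skipped += 1  # give up on this post only and move on
            continue
        titles.extend(t.strip() for t in parser.titles)
    return titles, skipped
```

The count of skipped posts is returned so that I can at least see how much data is being lost this way.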

I need to see if HTML Tidy can correct these malformed tags. Some of them are as simple as a missing space between two attributes. A human eye can spot that easily with syntax highlighting on, but I suppose it's too much intelligence to expect from an HTML parser.
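For the specific case of a missing space between two attributes, a small pre-processing pass before parsing might be enough. This is only a heuristic sketch: `fix_glued_attributes` and both patterns are my own invention, it only looks inside things that resemble start tags, and it will get confused by attribute values that themselves contain `<`, `>` or quoted `=` fragments.

```python
import re

# Anything that looks like a start tag: "<", a letter, then no angle brackets.
TAG = re.compile(r"<[A-Za-z][^<>]*>")

# A quoted attribute value immediately followed by the next attribute name.
GLUED = re.compile(r'''(=\s*(?:"[^"]*"|'[^']*'))(?=[A-Za-z_:])''')


def fix_glued_attributes(html):
    """Insert the missing space in e.g. href="x"class="y" inside tags only."""
    return TAG.sub(lambda m: GLUED.sub(r"\1 ", m.group(0)), html)
```

Text outside tags is left untouched, so a sentence that happens to contain something like ="a"b is not rewritten.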

What have I ended up doing? Manual correction, for which I need to find an alternative. I certainly can't spend all my time doing this. I should look at the HTML correction tools available out there.

That reminds me: I should download Python 3.0 and check it out. It's not backward compatible with Python 2, and NLTK has not yet been ported to Python 3.0.