Tuesday, April 22, 2008

OpenNLP tools

I used the OpenNLP tools for the following:
1) Tokenization
2) Sentence detection
3) Shallow parsing
4) Named entity recognition

Although I didn't find it very accurate for the named entity recognition task, I found that it did a pretty good job on tokenization, sentence detection and shallow parsing.

However, the model files don't seem to work on Windows Vista: the tools are unable to uncompress or read them. On Linux, everything works perfectly.
Since I have not modified the models, using the tools from the command line or calling them from my program serves my purpose.

http://opennlp.sourceforge.net/
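
To give an idea of the "calling from my program" route, here is a minimal sketch of wrapping the command-line sentence detector from Python. The classpath, class name and model path below are placeholders for whatever your OpenNLP installation uses; they are assumptions, not verified values.

# Minimal sketch: pipe text through the OpenNLP sentence detector via its
# command-line interface. All paths and class names below are placeholders.
import subprocess

OPENNLP_CLASSPATH = "opennlp-tools.jar:lib/maxent.jar"                # assumed jar layout
SENT_DETECT_CLASS = "opennlp.tools.lang.english.SentenceDetector"     # assumed class name
SENT_MODEL = "models/english/sentdetect/EnglishSD.bin.gz"             # assumed model path

def detect_sentences(text):
    """Return one sentence per line of the tool's standard output."""
    proc = subprocess.run(
        ["java", "-cp", OPENNLP_CLASSPATH, SENT_DETECT_CLASS, SENT_MODEL],
        input=text, capture_output=True, text=True, check=True)
    return [line for line in proc.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for s in detect_sentences("Dr. Smith went to Washington. He arrived on Tuesday."):
        print(s)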

Summarization

As part of another course I am taking, I decided to do a project on summarization. Why summarization? Well, grade-wise there were lots of things one could choose to do or not do. Submitting a project proposal was therefore the easier option, and the instructor was happy once we told him what we were planning to do and how we would proceed with it.

However, summarization is more than that. In today's world we have an explosion of information, and almost all of us are either too busy or too lazy to go through entire documents. When we do a Google search, how many of us even bother to go to the second page, let alone the tenth? Google search uses the PageRank algorithm, which is based on the number of trusted links: how many websites link to you! The information could well be on a website that no one links to. However, I suppose it all works out because most of the time the information is present on the so-called "trusted" websites.

Looking at it in this context i see the following uses for summarization:
1) Automatic generation of abstracts
For all those scientific papers we write, it would be wonderful if we could just feed in the content and out comes an abstract ready to be inserted into the paper. However, no fully general technique is available yet. Further, precision, recall and accuracy in the field of NLP have always been quite low.
2) Summarization of news articles
Feeds are the best example of this: they give you the so-called "important" news, and should you be more interested you can go ahead and read the entire article.
3) Information retrieval
Summarization helps us perform information retrieval better: we enter keywords and retrieve a list of documents, then extract summaries from them and use those to further refine the keywords.

Summarization can broadly be divided into
1) Single document summarization
2) Multi-document summarization

Of the two, multi-document summarization is more difficult to achieve. The reason is that we would like to ensure there is no overlap, yet the same thing can be written in many ways, so how do we figure out which sentences say the same thing?
Single-document summarization, as the name suggests, involves extracting a summary from a single document.

Summarization can be further divided into
1) Extractive summarization
2) Abstractive summarization

Extractive summarization is where we simply extract sentences and do not modify them. The key to this kind of summarization is extracting the "right" sentences, the ones that convey most of the information about the topic. With single-document summarization this is easier, whereas with multi-document summarization we also need to take care not to have redundancies in the summary. Then there is the question of ordering these sentences: deciding the order is a major task. Which sentence should come first, and from which document? In the case of single-document summarization it is generally the order in which the sentences appear. Abstractive summarization, by contrast, generates new sentences that paraphrase the source rather than reusing them verbatim.
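
To make the extractive idea concrete, here is a minimal sketch (not our exact implementation) of the select-then-reorder step for a single document: score each sentence with whatever scheme you like, keep the top k, and output them in their original order.

# Sketch of single-document extractive summarization: score sentences,
# keep the top k, and restore the original document order.
def summarize(sentences, score_sentence, k=3):
    # Remember each sentence's position so the summary can keep document order.
    ranked = sorted(enumerate(sentences),
                    key=lambda pair: score_sentence(pair[1]),
                    reverse=True)
    chosen = sorted(ranked[:k])            # sort the selected sentences by position
    return " ".join(sent for _, sent in chosen)

# Toy example: a scorer that simply prefers longer sentences.
if __name__ == "__main__":
    doc = ["NLP is fun.",
           "Summarization selects the most informative sentences from a document.",
           "The weather was nice."]
    print(summarize(doc, score_sentence=len, k=2))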

Another category of summarization would be:
1) Summary in the same language from the source
2) Summary in a different language from the source

The latter is similar to machine translation, where we need to represent the source documents as concepts, find a summary and express it in another language.

Further, summaries are also classified by the way they are displayed:
1) A paragraph
2) Key sentences highlighted
3) A list

That pretty much covers an overview of summarization. We decided to go for single-document, extractive, paragraph-style summarization in the same language as the source.

Therefore the task before us was to select the most relevant sentences from the document and display them in paragraph form. Reviewing the current literature, the main techniques used are:
1) Key words
These are words that occur frequently in the text. Basically, if a word is important, it appears many times in the text.
2) Position of the sentence
In general the important information is expected to be towards the beginning of the text, especially in newspaper articles. Therefore, extracting earlier sentences is almost always a good bet for a summary.
3) Title words
The logic behind using title words is that they are in general important words, since they are supposed to convey the gist or topic of the story. Therefore, sentences that contain these words should be given importance.
4) Cue words
Certain words and phrases indicate importance, for example "more importantly". Selecting sentences that contain these cue words is a good bet.

In our project we tried approaches based on key words and title words alone, and a combined approach based on key words, title words and position. We found that the combined approach in general worked much better.

We chose as key words those words whose frequency exceeded mean + k * standard deviation.
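
A rough sketch of how such a combined score can be computed is below. The feature weights, the value of k and the position bonus are illustrative assumptions, not the exact values we used.

# Sketch of the combined scoring: key words chosen with a mean + k*std frequency
# threshold, plus title-word overlap and a position bonus. Weights are illustrative.
import re
from collections import Counter
from statistics import mean, pstdev

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def pick_keywords(sentences, k=1.0):
    counts = Counter(w for s in sentences for w in tokenize(s))
    freqs = list(counts.values())
    threshold = mean(freqs) + k * pstdev(freqs)
    return {w for w, c in counts.items() if c > threshold}

def score_sentences(sentences, title, k=1.0, w_key=1.0, w_title=1.0, w_pos=1.0):
    keywords = pick_keywords(sentences, k)
    title_words = set(tokenize(title))
    scores = []
    for i, sent in enumerate(sentences):
        words = tokenize(sent)
        key_score = sum(1 for w in words if w in keywords)
        title_score = sum(1 for w in words if w in title_words)
        pos_score = 1.0 / (i + 1)          # earlier sentences get a larger bonus
        scores.append(w_key * key_score + w_title * title_score + w_pos * pos_score)
    return scores

These scores can be fed straight into the select-then-reorder step sketched earlier.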

The other approach is based on the concept of lexical chains. Lexical chains are groups of related words, where the words can be related by a hyponymy-hypernymy relationship (an "is-a" relationship), a meronymy-holonymy relationship (a part-whole relationship), or a synonymy/antonymy relationship.

A cohesive text has successive sentences talking about the same topic, and therefore a sentence that contains many words related to that topic is a good bet. In order to implement this we used WordNet and the NLTK package.
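
As an illustration of the WordNet side, here is a small sketch that checks whether two nouns are related through a hypernym/hyponym path, restricted to the first few senses. It uses the current nltk.corpus.wordnet interface, which is not the same WordNet API that NLTK shipped back then.

# Sketch: are two nouns related in WordNet through hypernymy/hyponymy?
# Requires the WordNet data (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def hypernym_related(noun1, noun2, max_senses=3):
    senses1 = wn.synsets(noun1, pos=wn.NOUN)[:max_senses]
    senses2 = wn.synsets(noun2, pos=wn.NOUN)[:max_senses]
    # All synsets lying on any hypernym path of the first few senses of each noun.
    ancestors1 = {h for s in senses1 for path in s.hypernym_paths() for h in path}
    ancestors2 = {h for s in senses2 for path in s.hypernym_paths() for h in path}
    # Related if some sense of one noun is an ancestor (hypernym) of the other.
    return bool(set(senses1) & ancestors2) or bool(set(senses2) & ancestors1)

print(hypernym_related("dog", "animal"))   # True: animal is a hypernym of dog
print(hypernym_related("dog", "banana"))   # False for the first few senses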

I also used the OpenNLP toolkit to extract the nouns. WordNet is mostly noun-based, and therefore the relationships were found between nouns. We followed the algorithm described in "Using Lexical Chains for Text Summarization" by Regina Barzilay et al.

However, there were a few variations: firstly, we considered only the first three semantic senses of a word, and secondly, we considered only the hypernymy-hyponymy relationship.
Our threshold for selecting strong lexical chains was mean + standard deviation.
The heuristic we followed was to extract the first sentence that contained a representative member of each strong chain.
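
Putting those pieces together, a simplified sketch of the chain-based selection might look like the following. The greedy chaining and the length-only chain score are simplifications I am assuming for illustration; they are not the exact scoring from the paper. Noun extraction is assumed to have been done already (we used OpenNLP for that), and the hypernym check is the same idea as the helper sketched above.

# Simplified sketch: greedily chain related nouns, keep "strong" chains whose
# score exceeds mean + std, then pick the first sentence containing the most
# frequent (representative) member of each strong chain.
from collections import Counter
from statistics import mean, pstdev
from nltk.corpus import wordnet as wn

def hypernym_related(noun1, noun2, max_senses=3):
    # Related if one noun's sense is a hypernym ancestor of the other's,
    # checking only the first few senses (as in our project).
    s1 = wn.synsets(noun1, pos=wn.NOUN)[:max_senses]
    s2 = wn.synsets(noun2, pos=wn.NOUN)[:max_senses]
    a1 = {h for s in s1 for p in s.hypernym_paths() for h in p}
    a2 = {h for s in s2 for p in s.hypernym_paths() for h in p}
    return bool(set(s1) & a2) or bool(set(s2) & a1)

def build_chains(nouns):
    # Greedy chaining: attach each noun to the first chain with a related member.
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(noun == member or hypernym_related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

def strong_chains(chains):
    scores = [len(c) for c in chains]       # chain length as a stand-in for the score
    threshold = mean(scores) + pstdev(scores)
    return [c for c, s in zip(chains, scores) if s > threshold]

def chain_summary(sentences, sentence_nouns):
    # sentence_nouns[i] holds the nouns extracted from sentences[i].
    all_nouns = [n for nouns in sentence_nouns for n in nouns]
    picked = []
    for chain in strong_chains(build_chains(all_nouns)):
        representative = Counter(chain).most_common(1)[0][0]
        for i, nouns in enumerate(sentence_nouns):
            if representative in nouns and i not in picked:
                picked.append(i)
                break
    return " ".join(sentences[i] for i in sorted(picked))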

Surprisingly, or maybe not so surprisingly, our summaries were much shorter. We compared them with the Microsoft Word AutoSummarize feature, and the summaries generated by Word are much longer.

In general, the project was a good experience. I learnt a few things along the way:
1) The OpenNLP model files do not work on Windows, whereas they work perfectly on Linux.
2) NLTK does not have a ready-to-use chunker or parser; I need to write my own rules.
3) NLTK has a WordNet package that is useful for getting the distances between hypernyms and hyponyms, and not much else.

You may take a look at the ppt here
http://www.utdallas.edu/~khassanali/summarization.ppt