One of the sub-tasks in determining the topic is determining who is being talked about in a particular text. I tried Named Entity Recognition using the OpenNLP toolkit, and it did not identify all the "names" in the text. A simple frequency-based approach seems to have worked better. What did I do? I eliminated all the stopwords (there are numerous stopword lists available online), created a frequency distribution table of the remaining words, and looked at the really frequent words: those occurring more often than mean + 5 SD. These were mostly proper nouns and were indeed about the person being talked about. The named-entity recognition tool should therefore have detected all these words too, although I still need to look into this further.
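The frequency idea can be sketched in a few lines of Python. This is only a minimal illustration, not the code I actually ran: the tiny `STOPWORDS` set, the regex tokenizer and the function name `frequent_words` are all assumptions made for the example, and a real run would use a full stopword list.

```python
import re
import statistics
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "he",
             "she", "it", "on", "for", "with", "that", "this", "at", "by"}

def frequent_words(text, k=5.0, stopwords=STOPWORDS):
    """Return words whose count exceeds mean + k * SD of all word counts."""
    # Crude tokenizer: lowercase alphabetic runs (an assumption for the sketch).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    if len(counts) < 2:
        return []
    freqs = list(counts.values())
    threshold = statistics.mean(freqs) + k * statistics.pstdev(freqs)
    # Most frequent first.
    return sorted((w for w, c in counts.items() if c > threshold),
                  key=lambda w: -counts[w])
```

Note that with a threshold as strict as mean + 5 SD, the text needs a fairly large vocabulary before any word can clear it, so this only makes sense on longer documents.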
I tried using the summarization project that I did last semester, but it does not suit this task. Perhaps I should look at the text and try to find a distinguishing feature... I am yet to try out the topic toolbox.
Wednesday, October 8, 2008
Wednesday, October 1, 2008
Topic Detection
Of late, I have been trying to extract topics from a given text. I found a few tools that are available, but I am yet to figure out how to get them working. The problem with these tools is that they are written in different languages, so every time one wants to use a new tool one needs to learn a new language. The Matlab Topic Modelling Toolbox 1.3.2 (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm) seems to be an interesting tool to explore. The issue I face with most of the tools is that they were developed with certain assumptions in mind, so we are never sure whether the results are accurate for our problem. So here I am, trying to find a simple solution, being the lazy person that I am.
Tuesday, April 22, 2008
OpenNLP tools
I used the OpenNLP tools for the following:
1) Tokenizing
2) Sentence Detection
3) Shallow parsing
4) Named Entity recognition
Although I didn't find it very accurate for the Named Entity recognition task, I found that it did a pretty good job on tokenizing, sentence detection and shallow parsing.
However, the model files don't seem to work on Windows Vista: the tools can neither uncompress nor read them. On Linux, it works perfectly.
Since I have not modified the tools, using them from the command line or calling them from my program serves my purpose.
http://opennlp.sourceforge.net/
Summarization
As part of another course that I am taking, I decided to do a project on summarization. Why did I choose summarization? Well, grade-wise there were lots of things that one could choose to do or not to do. Submitting a project proposal was therefore easier, and the instructor was happy as we told him what we were planning to do and how we would proceed with it.
However, summarization is more than that. In today's world we have an explosion of information, and almost all of us are either too busy or too lazy to go through entire documents. If we do a Google search, how many of us even bother to go to the second page, let alone the tenth? Google search uses the PageRank algorithm, which is based on the number of trusted links: how many websites link to you! Well, the information could ideally be on a website that no one links to. However, I suppose it all works out because most of the time the information is present on the so-called "trusted websites".
Looking at it in this context, I see the following uses for summarization:
1) Automatic generation of abstracts
For all those scientific papers that we write, it would be wonderful if we could just feed in the content and out came the abstract, ready to insert into the paper. However, there is no general technique available for this. Further, precision, recall and accuracy in the field of NLP have always been quite low.
2) Summarization of news articles
Feeds are the best example of these. They take the so-called "important" news and give it to you, and should you be more interested you can go ahead and read the entire article.
3) Information retrieval
Summarization helps us perform information retrieval better: we could enter key words and retrieve a list of documents, then extract a summary and use it to refine the key words.
Summarization can broadly be divided into
1) Single document summarization
2) Multi-document summarization
Of the two, multi-document summarization is more difficult to achieve. The reason is that we would like to ensure that there is no overlap, yet the same thing can be written in many ways, so how do we figure out which sentences say the same thing?
Single document summarization, as the name suggests, involves extracting a summary of a single document.
Again, summarization can be further divided into
1) Extractive summarization
2) Abstractive summarization
Extractive summarization is where we simply extract sentences and do not modify them. The key to this kind of summarization is extracting the "right" sentences, the ones that convey most of the information about the topic. This is easier with single-document summarization; with multi-document summarization we also need to take care that there are no redundancies in the summary. Then there is the question of how to order these sentences. Deciding the order is a major task: which sentence should come first, and from which document? In single-document summarization it is generally the order in which the sentences appear in the source.
Another category of summarization would be:
1) Summary in the same language as the source
2) Summary in a different language from the source
The second case is similar to machine translation, where we need to represent the source documents as concepts, find a summary and represent it in another language.
Further, summaries are also classified by the way they are displayed
1) A paragraph
2) Key sentences highlighted
3) A list
That pretty much covers an overview of summarization. We decided to go for single-document, extractive, paragraph-style summarization in the same language.
Therefore the task before us was to select the most relevant sentences from the document and display them in paragraph form. Reviewing the current literature, the main techniques used are
1) Key words
These are the words that occur frequently in the text. Basically, if a word is important, it appears many times in the text.
2) Position of the sentence
In general the information is expected to be towards the beginning of the text, especially in newspaper texts. Therefore, extracting earlier sentences is almost always a good bet for a summary.
3) Title words
The logic behind using title words is that they are in general important words, since they are supposed to convey the gist or topic of the story. Therefore, the sentences that contain these words should be given importance.
4) Cue words
There are certain words and phrases that indicate importance, for example "more importantly". Selecting sentences which contain these cue words would be a good bet.
In our project we tried approaches based on key words and title words alone, and a combined approach based on key words, title words and position. We found that the combined approach generally worked much better.
We chose as key words those words whose frequency exceeded mean + k * standard deviation.
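A rough sketch of how such a combined score might look is given below. It is not our project code: the equal weights, the small `STOPWORDS` set, the regex tokenizer and the linear position score are all assumptions made for illustration.

```python
import re
import statistics
from collections import Counter

# Small illustrative stopword list (an assumption, not the list we used).
STOPWORDS = {"a", "an", "the", "on", "is", "to", "of", "and", "in"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def key_words(sentences, k=1.0):
    """Words whose frequency exceeds mean + k * SD over the document."""
    counts = Counter(w for s in sentences for w in tokenize(s)
                     if w not in STOPWORDS)
    freqs = list(counts.values())
    threshold = statistics.mean(freqs) + k * statistics.pstdev(freqs)
    return {w for w, c in counts.items() if c > threshold}

def summarize(title, sentences, n=2, k=1.0, weights=(1.0, 1.0, 1.0)):
    """Pick the n best sentences by key words, title words and position."""
    keys = key_words(sentences, k)
    title_words = set(tokenize(title))
    w_key, w_title, w_pos = weights  # hypothetical weights
    scored = []
    for i, s in enumerate(sentences):
        words = set(tokenize(s))
        score = (w_key * len(words & keys)
                 + w_title * len(words & title_words)
                 + w_pos * (1.0 - i / len(sentences)))  # earlier is better
        scored.append((score, i))
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n]
    # Present the chosen sentences in their original order, as a paragraph.
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda t: t[1]))
```

The value of k controls how many key words survive: a larger k keeps only the very frequent words.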
The other approach is based on the concept of lexical chains. Lexical chains are groups of related words. These words could be related by a hyponymy-hypernymy relationship (is-a relationship), a meronymy-holonymy relationship (part-whole relationship) or a synonymy-antonymy relationship.
A cohesive text would have successive sentences talking about the same topic, and therefore a sentence that contains many words related to the same topic would be a good bet. In order to implement this we used WordNet and the NLTK package.
I also used the OpenNLP toolkit to extract the nouns. WordNet is mostly noun-based, and therefore the relationships were found between nouns. We followed the algorithm in "Using Lexical Chains for Text Summarization" by Regina Barzilay et al.
However, there were a few variations: first, we considered only the first three semantic senses of a word, and second, we considered only the hypernymy-hyponymy relationship.
Our threshold for selecting lexical chains was mean + standard deviation.
The heuristic we followed was to extract the first sentence that contained a representative member.
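To make the idea concrete, here is a much simplified sketch. It is not our implementation and not the full algorithm from the paper: the toy `HYPERNYM` table stands in for WordNet lookups, there is no sense disambiguation, and the chain score (length times a homogeneity term) and the rule for picking representative members are assumptions chosen for illustration.

```python
import re
import statistics
from collections import Counter, defaultdict

# Hypothetical toy taxonomy standing in for WordNet hypernym lookups.
HYPERNYM = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "car": "vehicle", "bus": "vehicle",
    "bank": "institution",
}

def words_of(sentence):
    return re.findall(r"[a-z]+", sentence.lower())

def build_chains(sentences):
    """Group nouns sharing a hypernym: {hypernym: Counter(word -> count)}."""
    chains = defaultdict(Counter)
    for s in sentences:
        for w in words_of(s):
            if w in HYPERNYM:
                chains[HYPERNYM[w]][w] += 1
    return chains

def chain_score(members):
    """Assumed score: chain length times (1 - distinct words / length)."""
    length = sum(members.values())
    return length * (1.0 - len(members) / length)

def summarize(sentences):
    chains = build_chains(sentences)
    if not chains:
        return []
    scores = {h: chain_score(m) for h, m in chains.items()}
    values = list(scores.values())
    threshold = statistics.mean(values) + statistics.pstdev(values)
    picked = set()
    for h, members in chains.items():
        if scores[h] <= threshold:
            continue  # keep only the strong chains
        # Representative members: at least as frequent as the chain average.
        avg = sum(members.values()) / len(members)
        reps = {w for w, c in members.items() if c >= avg}
        for i, s in enumerate(sentences):
            if reps & set(words_of(s)):
                picked.add(i)  # first sentence with a representative member
                break
    return [sentences[i] for i in sorted(picked)]
```

Since each strong chain contributes only one sentence, a summary built this way tends to be short.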
Surprisingly, or maybe not so surprisingly, the summaries were much shorter. We compared them with the Microsoft Word AutoSummarize feature, and the summaries generated by Word are much longer.
In general, the project was a good experience. I learnt a few things along the way:
1) The OpenNLP model files do not work on Windows, whereas they work perfectly on Linux.
2) NLTK does not have a ready-to-use chunker or parser; I need to write my own rules.
3) NLTK has a WordNet package which is useful for getting the distances between hypernyms and hyponyms and nothing else.
You may take a look at the ppt here
http://www.utdallas.edu/~khassanali/summarization.ppt
Friday, February 15, 2008
Named Entity Recognition Tools
For the past few weeks I have been experimenting with Named Entity Recognition tools. In particular, I tried out the OpenNLP tool suite, and the name recognizer was pretty dismal. It really didn't recognise everything well, and I wasn't sure whether I should use it for my research. I guess I will either have to develop my own tools or use something else.
A shame, since I did spend some effort figuring out how to use these tools, only to see that they don't perform as well as I expected. Perhaps I expected too much, for I do know that named entity recognition is not easy, and of course there will always be ambiguity in recognizing names.
Let's see how it works out. My adviser has asked me to read a few papers on named entity recognition, and I need to see whether they give me an overall idea of the field and of how easy or difficult it is to get the kind of results that I am expecting.
Friday, January 25, 2008
Managed to clean the blogs sufficiently
Well, for those blogs that could not be cleaned even after using HTML Tidy, I fell back on regular expressions. These regular expressions are given in the NLTK guide, and they seem to work. I can use another regular expression to get rid of the extra spaces and the like. However, my problem now is separating posts from each other...
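For illustration, a regex-based cleaner along these lines might look like the sketch below. These are not the expressions from the NLTK guide; the patterns and the function name `strip_html` are my own assumptions, and regexes like these will still trip on sufficiently malformed pages.

```python
import html
import re

def strip_html(page):
    """Crude HTML-to-text conversion using regular expressions only."""
    # Drop script and style blocks together with their contents.
    text = re.sub(r"<(script|style)\b.*?>.*?</\1\s*>", " ", page,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop comments, then any remaining tags.
    text = re.sub(r"<!--.*?-->", " ", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    # Turn entities such as &amp; back into characters.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `strip_html("<p>Hello &amp; <b>world</b></p>")` gives `Hello & world`.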
I also have another issue: some of these blogs contain links to their entries, and hence I am not able to harvest data from them... I need to look further into how I can harvest data from these blogs.
Some blogs get updated pretty often and others don't. I do not want to end up with duplicated data, and right now all I am thinking about is how to separate the duplicates from the raw text that I have. I could have tried to use the parser, but unfortunately each node is named differently in different blogs and I simply can't enumerate all the possible options.
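One simple possibility, assuming the posts have already been separated somehow (which is exactly the part I have not solved), is to fingerprint each post after normalizing whitespace and case, and keep only the first copy. The function names below are hypothetical, and this only catches exact repeats, not near-duplicates.

```python
import hashlib
import re

def fingerprint(post):
    """Hash of the post text with whitespace and case normalized."""
    normalized = re.sub(r"\s+", " ", post).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(posts):
    """Keep the first occurrence of each post, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        key = fingerprint(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Storing the set of fingerprints between crawls would also let a later harvest skip posts that were already collected.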
I also want to get rid of all the advertisements and archive dates, for these will end up being spurious features. How do I get rid of them? The data collection and cleaning part is really the toughest part of all. Without the data I can neither analyze it nor run the machine learning algorithms on it.
Thursday, January 24, 2008
Extracting text from HTML
This is a task that I have been at for many months: trying to find the perfect way to extract text from an HTML webpage. I have tried many options, of which Emsa HTMLRem is definitely good on Windows. However, since most of my work is on Linux, I was not too thrilled with the idea of extracting data on Windows and then FTPing it to Linux.
Yesterday was therefore spent looking at many options. The NLTK toolkit's clean_html API works for a few websites; I also tried running HTML Tidy before using the clean_html API. This approach worked for some websites and not for others.
I now have to try some other technique, probably regular expressions... As they say, data collection and cleaning is the most difficult part of any task.
Thursday, January 3, 2008
NLTK
I used this toolkit for my NLP project, and although many features did not work as I expected, I found it really useful. The toolkit is written in Python, and Python is a very easy and user-friendly language to learn.
Although I knew a bit of Python and used it extensively in the first semester for all the NLP assignments, I realised the actual utility and convenience of Python for NLP tasks only when I read the guide provided with NLTK.
Although I am yet to use all the features provided in NLTK, I have used the stemmers and the different types of probability distributions. The learning curve for me was around a week, including learning Python. Initially, I wondered whether it really was worth all the effort, as I could easily have implemented the algorithms in Python or any other language.
The plus point was that once I learnt how to use the toolkit, making enhancements took no longer than 5 minutes, and in the end I could get quite a lot done.
The clean_html API of NLTK did not work for me. The output either still contained the HTML tags or the text had disappeared! Further, since it uses the underlying HTML parser, it's not resilient to malformed pages on the internet.
I found it easier to write my own code for implementing the Naive Bayes method, though NLTK provides many methods too. I would say it's definitely been worth trying out the Natural Language Toolkit, and I recommend it!
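For the curious, a hand-written Naive Bayes classifier can be quite small. The sketch below is not my project code; it assumes a multinomial model over word counts with add-one smoothing, and the class and method names are made up for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, documents):
        """documents: iterable of (list_of_words, label) pairs."""
        for words, label in documents:
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Work in log space to avoid underflow on long documents.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is just `nb = NaiveBayes()`, then `nb.train(...)` with labelled word lists, then `nb.classify(words)` on a new document.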
You can download NLTK at the following site:
http://nltk.sourceforge.net/