Sentiment analysis using OpenNLP document categorizer

We return to sentiment analysis, this time solving the problem with a different approach. Instead of naive Bayes, we will use Apache OpenNLP, more precisely its Document Categorizer.

If you need to know more about sentiment analysis, you can read the following article: Sentiment analysis using Mahout naive Bayes. If you are new to Apache OpenNLP, you can read the following article: Getting started with Apache OpenNLP.

About Apache OpenNLP Document Categorizer

The Apache OpenNLP Document Categorizer can be used to classify text into pre-defined categories. It does this using the maximum entropy algorithm, also known as MaxEnt. The algorithm builds a model from the same information as the naive Bayes algorithm, but takes a different approach: while naive Bayes assumes that the features are independent, MaxEnt uses multinomial logistic regression to determine the right category for a given text. To understand how regression works, you can see the following article: Simple linear regression using JFreeChart. For logistic regression, see: Logistic regression using Apache Mahout.
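As a rough illustration of the regression idea (this is not OpenNLP's implementation, and the feature weights below are made up), a multinomial logistic regression model scores each category as a weighted sum of the active features, then turns the scores into probabilities with the softmax function:

import java.util.Arrays;

public class MaxEntSketch {

	// Scores each category as the sum of the weights of the active features,
	// then normalizes with softmax to obtain one probability per category.
	static double[] softmaxScores(double[][] weights, boolean[] activeFeatures) {
		double[] scores = new double[weights.length];
		for (int category = 0; category < weights.length; category++) {
			for (int feature = 0; feature < activeFeatures.length; feature++) {
				if (activeFeatures[feature]) {
					scores[category] += weights[category][feature];
				}
			}
		}
		// Softmax: exponentiate and normalize so the probabilities sum to 1
		double sum = 0;
		for (int i = 0; i < scores.length; i++) {
			scores[i] = Math.exp(scores[i]);
			sum += scores[i];
		}
		for (int i = 0; i < scores.length; i++) {
			scores[i] /= sum;
		}
		return scores;
	}

	public static void main(String[] args) {
		// Hypothetical weights for two categories (negative, positive)
		// over three features, e.g. the words "bad", "nice", "day"
		double[][] weights = { { 1.2, -0.8, 0.1 }, { -0.9, 1.1, 0.2 } };
		boolean[] features = { false, true, true }; // "nice" and "day" present
		System.out.println(Arrays.toString(softmaxScores(weights, features)));
		// Prints roughly [0.11..., 0.88...] - category 1 (positive) wins
	}
}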

Entropy is a term from information theory; it measures the uncertainty of an information source. Let's consider the example of a coin toss (source: Wikipedia). When the coin is fair, that is, when the probability of heads is the same as the probability of tails, the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2. Such a coin toss has one bit of entropy, since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one bit of information. Conversely, a coin toss with a coin that has two heads and no tails has zero entropy, since the coin will always come up heads and the outcome can be predicted perfectly.

The Maximum Entropy principle can be formulated as follows: given a collection of facts, choose the model which is consistent with all the facts, but otherwise as uniform as possible. For the coin toss, if the only fact we know is that heads and tails are the two possible outcomes, the maximum entropy model assigns each a probability of 1/2. The same principle is used by this OpenNLP algorithm: from all the models that fit our training data, it selects the one with the largest entropy.
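To make the coin-toss numbers concrete, here is a small standalone Java illustration (not part of the project below) that computes the Shannon entropy of a two-outcome event:

public class EntropySketch {

	// Shannon entropy in bits of a two-outcome event, where p is the
	// probability of heads
	static double entropy(double p) {
		if (p == 0 || p == 1) {
			return 0; // a certain outcome carries no information
		}
		return -p * log2(p) - (1 - p) * log2(1 - p);
	}

	static double log2(double x) {
		return Math.log(x) / Math.log(2);
	}

	public static void main(String[] args) {
		System.out.println(entropy(0.5)); // fair coin: 1.0 bit, the maximum
		System.out.println(entropy(0.9)); // biased coin: ~0.469 bits
		System.out.println(entropy(1.0)); // two-headed coin: 0.0 bits
	}
}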

Java project for sentiment analysis using OpenNLP Document Categorizer

This project will use the same input file as in Sentiment analysis using Mahout naive Bayes. The tweets file contains 100 lines; each line starts with the category (1 for positive, 0 for negative) followed by the tweet text.
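This is exactly the format OpenNLP's DocumentSampleStream expects: one sample per line, with the category as the first whitespace-delimited token, then the text. The two lines below are made-up illustrations, not actual entries from the file:

1	Such a wonderful day, loving it
0	Worst service I have ever experienced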

Prerequisites: Java and Apache Maven installed.

Create the Maven project:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=com.technobium \
-DartifactId=opennlp-categorizer \
-DinteractiveMode=false

Rename the default created App class to OpenNLPCategorizer using the following command:

mv opennlp-categorizer/src/main/java/com/technobium/App.java \
opennlp-categorizer/src/main/java/com/technobium/OpenNLPCategorizer.java

Add the SLF4J and OpenNLP libraries to this project:

cd opennlp-categorizer
nano pom.xml

Add the following lines to the dependencies section:

<dependencies>
    ...
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>1.5.3</version>
    </dependency>
</dependencies>

Create an input folder and copy the file containing the training data, tweets.txt.

mkdir input
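Then copy the training file into the new folder. For example, assuming tweets.txt has been downloaded into the project root (it is also available in the GitHub repository linked at the end):

cp tweets.txt input/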

Edit the OpenNLPCategorizer class and replace its contents with the following code:

package com.technobium;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class OpenNLPCategorizer {
	DoccatModel model;

	public static void main(String[] args) {
		OpenNLPCategorizer twitterCategorizer = new OpenNLPCategorizer();
		twitterCategorizer.trainModel();
		twitterCategorizer.classifyNewTweet("Have a nice day!");
	}

	public void trainModel() {
		InputStream dataIn = null;
		try {
			dataIn = new FileInputStream("input/tweets.txt");
			ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
			ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
			// Specifies the minimum number of times a feature must be seen
			int cutoff = 2;
			int trainingIterations = 30;
			model = DocumentCategorizerME.train("en", sampleStream, cutoff,
					trainingIterations);
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			if (dataIn != null) {
				try {
					dataIn.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}

	public void classifyNewTweet(String tweet) {
		DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);
		// Scores for each of the pre-defined categories ("0" and "1")
		double[] outcomes = myCategorizer.categorize(tweet);
		String category = myCategorizer.getBestCategory(outcomes);

		if (category.equalsIgnoreCase("1")) {
			System.out.println("The tweet is positive :) ");
		} else {
			System.out.println("The tweet is negative :( ");
		}
	}
}

Compile and run the class using the following commands:

mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.OpenNLPCategorizer"

The output should be something like this:

 29:  ... loglikelihood=-25.45302428490664	0.9310344827586207
 30:  ... loglikelihood=-25.13375829600653	0.9310344827586207
The tweet is positive :)
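Note that the model is retrained from scratch on every run. DoccatModel extends OpenNLP's BaseModel, so a trained model can also be serialized to disk and reloaded later. Below is a minimal sketch of two helper methods that could be added to the class above; the method names and path argument are our own, and an extra import for java.io.FileOutputStream is needed:

	// Persist the trained model so that later runs do not need to retrain
	public void saveModel(String path) throws IOException {
		FileOutputStream modelOut = new FileOutputStream(path);
		try {
			model.serialize(modelOut); // DoccatModel extends BaseModel
		} finally {
			modelOut.close();
		}
	}

	// Reload a previously saved model from disk
	public void loadModel(String path) throws IOException {
		InputStream modelIn = new FileInputStream(path);
		try {
			model = new DoccatModel(modelIn);
		} finally {
			modelIn.close();
		}
	}

With these in place, trainModel and saveModel need to run only once; subsequent runs can call loadModel followed by classifyNewTweet.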

GitHub repository for this project: https://github.com/technobium/opennlp-categorizer

Conclusion

The OpenNLP document categorizer can be successfully used to classify texts according to their sentiment polarity. While naive Bayes assumes that the features (words) are independent, the maximum entropy algorithm behind the document categorizer makes no such assumption and instead fits a multinomial logistic regression model.

References

https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html

“Taming Text”, Ingersoll et al., Manning Publications, 2013 – http://manning.com/ingersoll/

http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node2.html

http://en.wikipedia.org/wiki/Entropy_(information_theory)
