Sentiment analysis using Mahout naive Bayes

Sentiment analysis, or opinion mining, is the identification of subjective information in text. This tutorial shows how to do sentiment analysis on Twitter feeds using the naive Bayes classification algorithm available in Apache Mahout. Although far from a production-ready implementation, this simple demo Java application will help you understand how to use Mahout’s naive Bayes algorithm to classify text. We start by explaining the problem, then see how naive Bayes can help us solve it, and at the end we build a working sample to see the algorithm in action. Basic Java programming knowledge is required for this tutorial.

Naive Bayes for sentiment analysis

Sentiment analysis aims to detect the attitude of a text. A simple subtask of sentiment analysis is to determine the polarity of the text: positive, negative or neutral. In this tutorial we concentrate on detecting if a short text like a Twitter message is positive or negative.  For example:

  • for the tweet “Have a nice day!” the algorithm should tell us that this is a positive message.
  • for the tweet “I had a bad day” the algorithm should tell us that this is a negative message.

From the machine learning point of view this can be seen as a classification task, and naive Bayes is an algorithm well suited to this kind of task.

The naive Bayes algorithm uses probabilities to decide which class best matches a given input text. The classification decision is based on a model obtained after the training process. Model training is done by analysing the relationship between the words in the training text and their classification categories. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features (Naive Bayes classifier on Wikipedia).

Each text we want to classify contains words, denoted Wi (i = 1..n). For each word Wi from the training data set we can extract the following probabilities (denoted P):

P(Wi given Positive) = (The number of positive Texts containing Wi) / (The number of positive Texts)

P(Wi given Negative) = (The number of negative Texts containing Wi) / (The number of negative Texts)

For the entire training set we will have:

P(Positive) = (The number of positive Texts) / The total number of Texts

P(Negative) = (The number of negative Texts) / The total number of Texts

To calculate the probability of a Text being positive or negative given the words it contains, we apply Bayes’ theorem:

P(Positive given Text) = P(Text given Positive) x P(Positive) / P(Text)

P(Negative given Text) = P(Text given Negative) x P(Negative) / P(Text)

Since P(Text) is the same in both expressions, it can be ignored when comparing the two classes. Applying the naive independence assumption to the words of the Text, we get:

P(Positive given Text) = P(Text given Positive) x P(Positive) = P(W1 given Positive) x P(W2 given Positive) x … x P(Wn given Positive) x P(Positive)

P(Negative given Text) = P(Text given Negative) x P(Negative) = P(W1 given Negative) x P(W2 given Negative) x … x P(Wn given Negative) x P(Negative)

At the end we compare P(Positive given Text) and P(Negative given Text): the class with the higher score decides whether the text is positive or negative. To increase the quality of the classifier, instead of using raw term frequency we will use TF-IDF weighting. This way the least significant words carry very little weight in the calculation.
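The comparison above can be sketched with a toy example. Note that multiplying many small probabilities quickly underflows floating-point precision, so implementations work with sums of logarithms instead; this is also why the scores printed by the demo later are negative numbers. All the counts below are made up purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyNaiveBayes {

    public static void main(String[] args) {
        // Hypothetical counts: 60 positive and 40 negative training tweets,
        // so P(Positive) = 0.6 and P(Negative) = 0.4.
        double logPositive = Math.log(0.6);
        double logNegative = Math.log(0.4);

        // For each word: { P(word given Positive), P(word given Negative) },
        // e.g. "nice" appears in 30 of the 60 positive tweets.
        Map<String, double[]> wordProbs = new HashMap<String, double[]>();
        wordProbs.put("have", new double[] { 20 / 60.0, 10 / 40.0 });
        wordProbs.put("nice", new double[] { 30 / 60.0, 2 / 40.0 });
        wordProbs.put("day", new double[] { 25 / 60.0, 15 / 40.0 });

        // Sum log probabilities instead of multiplying raw probabilities,
        // to avoid numeric underflow on long texts.
        for (String word : new String[] { "have", "nice", "day" }) {
            logPositive += Math.log(wordProbs.get(word)[0]);
            logNegative += Math.log(wordProbs.get(word)[1]);
        }

        System.out.println("log score positive: " + logPositive);
        System.out.println("log score negative: " + logNegative);
        System.out.println(logPositive > logNegative ? "positive" : "negative");
    }
}
```

With these invented counts the positive log score wins, so the tweet “Have a nice day” would be classified as positive.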

Java project for sentiment analysis

The project will use 100 tweets as input for the training phase. The input file will contain 100 lines, each line holding the category (1 for positive, 0 for negative) and the tweet text, separated by a tab.

In a real-world project this dataset would need millions of tweets for accurate results. The initial data is usually split into training and test sets, but for this simple demo we will use all the data for training.
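For reference, the input file is expected to look like the snippet below: one tweet per line, with the category and the text separated by a TAB character. These lines are made-up examples illustrating the format, not the actual dataset.

```
1	Have a nice day!
1	This is awesome, I really love it
0	I had a bad day
0	Worst experience ever, very disappointed
```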

Prerequisites: a Java JDK (1.7 or later) and Apache Maven installed on your machine.

Create the Maven project:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=com.technobium \
-DartifactId=mahout-naive-bayes \
-DinteractiveMode=false

Rename the default created App class to NaiveBayes using the following command:

mv mahout-naive-bayes/src/main/java/com/technobium/App.java \
mahout-naive-bayes/src/main/java/com/technobium/NaiveBayes.java

Add the Mahout and SLF4J libraries to this project:

cd mahout-naive-bayes
nano pom.xml

Add the following lines to the dependencies section:

<dependencies>
    ...
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>0.9</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.7</version>
    </dependency>
</dependencies>

In the same file, after the dependencies section, add the following configuration, which makes sure the code is compiled with Java 1.7:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
    </plugins>
</build>

Create an input folder and copy the file containing the training data, tweets.txt.

mkdir input

Edit the NaiveBayes class file and add the following code:

package com.technobium;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class NaiveBayes {

	Configuration configuration = new Configuration();

	String inputFilePath = "input/tweets.txt";
	String sequenceFilePath = "input/tweets-seq";
	String labelIndexPath = "input/labelindex";
	String modelPath = "input/model";
	String vectorsPath = "input/tweets-vectors";
	String dictionaryPath = "input/tweets-vectors/dictionary.file-0";
	String documentFrequencyPath = "input/tweets-vectors/df-count/part-r-00000";

	public static void main(String[] args) throws Throwable {
		NaiveBayes nb = new NaiveBayes();
		nb.inputDataToSequenceFile();
		nb.sequenceFileToSparseVector();
		nb.trainNaiveBayesModel();
		nb.classifyNewTweet("Have a nice day!");
	}

	public void inputDataToSequenceFile() throws Exception {
		BufferedReader reader = new BufferedReader(
				new FileReader(inputFilePath));
		FileSystem fs = FileSystem.getLocal(configuration);
		Path seqFilePath = new Path(sequenceFilePath);
		fs.delete(seqFilePath, false);
		SequenceFile.Writer writer = SequenceFile.createWriter(fs,
				configuration, seqFilePath, Text.class, Text.class);
		int count = 0;
		try {
			String line;
			while ((line = reader.readLine()) != null) {
				String[] tokens = line.split("\t");
				writer.append(new Text("/" + tokens[0] + "/tweet" + count++),
						new Text(tokens[1]));
			}
		} finally {
			reader.close();
			writer.close();
		}
	}

	void sequenceFileToSparseVector() throws Exception {
		SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();
		svfsf.run(new String[] { "-i", sequenceFilePath, "-o", vectorsPath,
				"-ow" });
	}

	void trainNaiveBayesModel() throws Exception {
		TrainNaiveBayesJob trainNaiveBayes = new TrainNaiveBayesJob();
		trainNaiveBayes.setConf(configuration);
		trainNaiveBayes.run(new String[] { "-i",
				vectorsPath + "/tfidf-vectors", "-o", modelPath, "-li",
				labelIndexPath, "-el", "-c", "-ow" });
	}

	private void classifyNewTweet(String tweet) throws IOException {
		System.out.println("Tweet: " + tweet);

		Map<String, Integer> dictionary = readDictionary(configuration,
				new Path(dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(
				configuration, new Path(documentFrequencyPath));

		Multiset<String> words = ConcurrentHashMultiset.create();

		// Extract the words from the new tweet using Lucene
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
		TokenStream tokenStream = analyzer.tokenStream("text",
				new StringReader(tweet));
		CharTermAttribute termAttribute = tokenStream
				.addAttribute(CharTermAttribute.class);
		tokenStream.reset();
		int wordCount = 0;
		while (tokenStream.incrementToken()) {
			if (termAttribute.length() > 0) {
				String word = termAttribute.toString();
				Integer wordId = dictionary.get(word);
				// If the word is not in the dictionary, skip it
				if (wordId != null) {
					words.add(word);
					wordCount++;
				}
			}
		}
		tokenStream.end();
		tokenStream.close();

		// The key -1 in the df-count output holds the total document count
		int documentCount = documentFrequency.get(-1).intValue();

		// Create a vector for the new tweet (wordId => TFIDF weight)
		Vector vector = new RandomAccessSparseVector(10000);
		TFIDF tfidf = new TFIDF();
		for (Multiset.Entry<String> entry : words.entrySet()) {
			String word = entry.getElement();
			int count = entry.getCount();
			Integer wordId = dictionary.get(word);
			Long freq = documentFrequency.get(wordId);
			double tfIdfValue = tfidf.calculate(count, freq.intValue(),
					wordCount, documentCount);
			vector.setQuick(wordId, tfIdfValue);
		}

		// Model is a matrix (wordId, labelId) => probability score
		NaiveBayesModel model = NaiveBayesModel.materialize(
				new Path(modelPath), configuration);
		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(
				model);

		// With the classifier, we get one score for each label. The label with
		// the highest score is the one the tweet is most likely associated with
		Vector resultVector = classifier.classifyFull(vector);
		double bestScore = -Double.MAX_VALUE;
		int bestCategoryId = -1;
		for (Element element : resultVector.all()) {
			int categoryId = element.index();
			double score = element.get();
			if (score > bestScore) {
				bestScore = score;
				bestCategoryId = categoryId;
			}
			if (categoryId == 1) {
				System.out.println("Probability of being positive: " + score);
			} else {
				System.out.println("Probability of being negative: " + score);
			}
		}
		if (bestCategoryId == 1) {
			System.out.println("The tweet is positive :) ");
		} else {
			System.out.println("The tweet is negative :( ");
		}
		analyzer.close();
	}

	public static Map<String, Integer> readDictionary(Configuration conf,
			Path dictionaryPath) {
		Map<String, Integer> dictionary = new HashMap<String, Integer>();
		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(
				dictionaryPath, true, conf)) {
			dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
		}
		return dictionary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf,
			Path documentFrequencyPath) {
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(
				documentFrequencyPath, true, conf)) {
			documentFrequency
					.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}
}

Compile and run the class using the following commands:

mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.NaiveBayes"

The output should be something like this:

Tweet: Have a nice day!
Probability of being negative: -40.40761445098948
Probability of being positive: -27.75439166678417
The tweet is positive :)

As you can see in the main method, we start by transforming the input file to the sequence file format. This is the file format used by Hadoop, which Mahout builds on for parallel processing.

The next method, sequenceFileToSparseVector(), uses the previously created sequence file to create sparse vectors. These vectors contain the TF-IDF weights for the words in the tweets and will be used to train the classifier.
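The TF-IDF weight combines how often a word appears in a tweet with how rare it is across the whole training set. The sketch below uses the classic sqrt/log form from Lucene’s similarity, which Mahout’s TFIDF class delegates to; treat the exact formula as an assumption that may vary between versions.

```java
public class TfIdfSketch {

    // Sub-linear term frequency damping times a smoothed inverse document
    // frequency. This mirrors Lucene's classic similarity (an assumption,
    // not verified against this exact Mahout version).
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return tf * idf;
    }

    public static void main(String[] args) {
        // A word appearing once in the tweet but in only 2 of 100 training
        // documents gets a much higher weight than one found in 90 of them.
        System.out.println("rare word:   " + tfIdf(1, 2, 100));
        System.out.println("common word: " + tfIdf(1, 90, 100));
    }
}
```

This is why, as noted earlier, the least significant (most common) words contribute very little to the classification decision.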

The trainNaiveBayesModel() method creates the model file, starting from the TF-IDF vectors.

The last method, classifyNewTweet(), takes a new tweet, creates the TF-IDF vector from its words and calculates the score of this tweet being positive or negative. The higher score decides the polarity of the tweet.
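The decision step at the end of classifyNewTweet() boils down to an argmax over the per-label scores returned by classifyFull(). The standalone sketch below shows just that selection, using hypothetical scores shaped like the sample output above (index 0 = negative, index 1 = positive).

```java
public class BestLabel {

    // classifyFull() returns one (unnormalized, log-domain) score per label;
    // the predicted label is simply the index with the highest score.
    static int bestCategory(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical log scores, similar to the sample run above.
        double[] scores = { -40.4, -27.8 };
        System.out.println(bestCategory(scores) == 1 ? "positive" : "negative");
    }
}
```

Because the scores are log values they are all negative; only their relative order matters, which is why the demo compares them rather than expecting values between 0 and 1.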

GitHub repository for this project: https://github.com/technobium/mahout-naive-bayes

Conclusion

For every business it is important to gather feedback about its own products and services. Reviews, ratings, comments, recommendations, tweets, blogs and so on are a rich source of information which can help a company improve and evolve. In this context, sentiment analysis is a valuable tool which automates the process of extracting sentiment from different content sources. Besides companies, political parties are also increasingly interested in this kind of analysis to extract opinion polarity from tweets, Facebook messages and blogs.

References

“Taming Text”, Ingersoll et al., Manning Publications, 2013 – http://manning.com/ingersoll/

https://mahout.apache.org/users/classification/bayesian.html

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
