Sentiment analysis using Mahout naive Bayes

Sentiment analysis, or opinion mining, is the identification of subjective information in text. This tutorial shows how to do sentiment analysis on Twitter feeds using the naive Bayes classification algorithm available in Apache Mahout. Although far from a production-ready implementation, this simple demo Java application will help you understand how to use Mahout’s naive Bayes algorithm to classify text. We start by explaining the problem, then see how naive Bayes can help us solve it, and at the end we build a working sample to see the algorithm in action. Basic Java programming knowledge is required for this tutorial.

Naive Bayes for sentiment analysis

Sentiment analysis aims to detect the attitude of a text. A simple subtask of sentiment analysis is to determine the polarity of the text: positive, negative or neutral. In this tutorial we concentrate on detecting if a short text like a Twitter message is positive or negative.  For example:

  • for the tweet “Have a nice day!” the algorithm should tell us that this is a positive message.
  • for the tweet “I had a bad day” the algorithm should tell us that this is a negative message.

From the machine learning point of view this can be seen as a classification task, and naive Bayes is an algorithm well suited to this kind of task.

The naive Bayes algorithm uses probabilities to decide which class best matches a given input text. The classification decision is based on a model obtained after the training process. Model training is done by analysing the relationship between the words in the training text and their classification categories. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features (Naive Bayes classifier on Wikipedia).

Each text we want to classify contains words, denoted Wi (i = 1..n). For each word Wi from the training data set we can extract the following probabilities (denoted P):

P(Wi given Positive) = (The number of positive Texts containing Wi) / (The number of positive Texts)

P(Wi given Negative) = (The number of negative Texts containing Wi) / (The number of negative Texts)

For the entire training set we will have:

P(Positive) = (The number of positive Texts) / The total number of Texts

P(Negative) = (The number of negative Texts) / The total number of Texts

To calculate the probability of a Text being positive or negative given the words it contains, we apply Bayes’ theorem:

P(Positive given Text) = P(Text given Positive) x P(Positive) / P(Text)

P(Negative given Text) = P(Text given Negative) x P(Negative) / P(Text)

Since P(Text) is the same in both expressions, it can be ignored when comparing the two classes. Applying the naive independence assumption to the words of the Text, we get:

P(Positive given Text) = P(Text given Positive) x P(Positive) = P(W1 given Positive) x P(W2 given Positive) x … x P(Wn given Positive) x P(Positive)

P(Negative given Text) = P(Text given Negative) x P(Negative) = P(W1 given Negative) x P(W2 given Negative) x … x P(Wn given Negative) x P(Negative)

At the end we compare P(Positive given Text) and P(Negative given Text): the class with the higher score decides whether the text is positive or negative. To increase the quality of the classifier, instead of using raw term frequency we will use TF-IDF weighting. This way the least significant words carry very little weight in the calculation.
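The comparison above can be sketched with a toy example. Note that multiplying many small probabilities quickly underflows floating-point precision, so implementations work with sums of logarithms instead; this is also why the scores printed by the demo later are negative numbers. All the counts below are made up purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyNaiveBayes {

    public static void main(String[] args) {
        // Hypothetical counts: 60 positive and 40 negative training tweets,
        // so P(Positive) = 0.6 and P(Negative) = 0.4.
        double logPositive = Math.log(0.6);
        double logNegative = Math.log(0.4);

        // For each word: { P(word given Positive), P(word given Negative) },
        // e.g. "nice" appears in 30 of the 60 positive tweets.
        Map<String, double[]> wordProbs = new HashMap<String, double[]>();
        wordProbs.put("have", new double[] { 20 / 60.0, 10 / 40.0 });
        wordProbs.put("nice", new double[] { 30 / 60.0, 2 / 40.0 });
        wordProbs.put("day", new double[] { 25 / 60.0, 15 / 40.0 });

        // Sum log probabilities instead of multiplying raw probabilities,
        // to avoid numeric underflow on long texts.
        for (String word : new String[] { "have", "nice", "day" }) {
            logPositive += Math.log(wordProbs.get(word)[0]);
            logNegative += Math.log(wordProbs.get(word)[1]);
        }

        System.out.println("log score positive: " + logPositive);
        System.out.println("log score negative: " + logNegative);
        System.out.println(logPositive > logNegative ? "positive" : "negative");
    }
}
```

With these invented counts the positive log score wins, so the tweet “Have a nice day” would be classified as positive.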

Java project for sentiment analysis

The project will use 100 tweets as input for the training phase. The input file will contain 100 lines, each line holding the category (1 for positive, 0 for negative) and the tweet text, separated by a tab.

In a real-world project this dataset would need millions of tweets for accurate results. The initial data is usually split into training and test sets, but for this simple demo we will use all the data for training.
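For reference, the input file is expected to look like the snippet below: one tweet per line, with the category and the text separated by a TAB character. These lines are made-up examples illustrating the format, not the actual dataset.

```
1	Have a nice day!
1	This is awesome, I really love it
0	I had a bad day
0	Worst experience ever, very disappointed
```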

Prerequisites: a Java JDK (1.7 or later) and Apache Maven installed on your machine.

Create the Maven project:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.maven.archetypes \
-DgroupId=com.technobium \
-DartifactId=mahout-naive-bayes \
-DinteractiveMode=false

Rename the default created App class to NaiveBayes using the following command:

mv mahout-naive-bayes/src/main/java/com/technobium/App.java \
mahout-naive-bayes/src/main/java/com/technobium/NaiveBayes.java

Add the Mahout and SLF4J libraries to this project:

cd mahout-naive-bayes
nano pom.xml

Add the following lines to the dependencies section:

<dependencies>
    ...
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>0.9</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.7</version>
    </dependency>
</dependencies>

In the same file, after the dependencies section, add the following configuration, which makes sure the code is compiled with Java 1.7:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
    </plugins>
</build>

Create an input folder and copy the file containing the training data, tweets.txt.

mkdir input

Edit the NaiveBayes class file and add the following code:

package com.technobium;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class NaiveBayes {

	Configuration configuration = new Configuration();

	String inputFilePath = "input/tweets.txt";
	String sequenceFilePath = "input/tweets-seq";
	String labelIndexPath = "input/labelindex";
	String modelPath = "input/model";
	String vectorsPath = "input/tweets-vectors";
	String dictionaryPath = "input/tweets-vectors/dictionary.file-0";
	String documentFrequencyPath = "input/tweets-vectors/df-count/part-r-00000";

	public static void main(String[] args) throws Throwable {
		NaiveBayes nb = new NaiveBayes();
		nb.inputDataToSequenceFile();
		nb.sequenceFileToSparseVector();
		nb.trainNaiveBayesModel();
		nb.classifyNewTweet("Have a nice day!");
	}

	public void inputDataToSequenceFile() throws Exception {
		BufferedReader reader = new BufferedReader(
				new FileReader(inputFilePath));
		FileSystem fs = FileSystem.getLocal(configuration);
		Path seqFilePath = new Path(sequenceFilePath);
		fs.delete(seqFilePath, false);
		SequenceFile.Writer writer = SequenceFile.createWriter(fs,
				configuration, seqFilePath, Text.class, Text.class);
		int count = 0;
		try {
			String line;
			while ((line = reader.readLine()) != null) {
				String[] tokens = line.split("\t");
				writer.append(new Text("/" + tokens[0] + "/tweet" + count++),
						new Text(tokens[1]));
			}
		} finally {
			reader.close();
			writer.close();
		}
	}

	void sequenceFileToSparseVector() throws Exception {
		SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();
		svfsf.run(new String[] { "-i", sequenceFilePath, "-o", vectorsPath,
				"-ow" });
	}

	void trainNaiveBayesModel() throws Exception {
		TrainNaiveBayesJob trainNaiveBayes = new TrainNaiveBayesJob();
		trainNaiveBayes.setConf(configuration);
		trainNaiveBayes.run(new String[] { "-i",
				vectorsPath + "/tfidf-vectors", "-o", modelPath, "-li",
				labelIndexPath, "-el", "-c", "-ow" });
	}

	private void classifyNewTweet(String tweet) throws IOException {
		System.out.println("Tweet: " + tweet);

		Map<String, Integer> dictionary = readDictionary(configuration,
				new Path(dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(
				configuration, new Path(documentFrequencyPath));

		Multiset<String> words = ConcurrentHashMultiset.create();

		// Extract the words from the new tweet using Lucene
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
		TokenStream tokenStream = analyzer.tokenStream("text",
				new StringReader(tweet));
		CharTermAttribute termAttribute = tokenStream
				.addAttribute(CharTermAttribute.class);
		tokenStream.reset();
		int wordCount = 0;
		while (tokenStream.incrementToken()) {
			if (termAttribute.length() > 0) {
				String word = termAttribute.toString();
				Integer wordId = dictionary.get(word);
				// If the word is not in the dictionary, skip it
				if (wordId != null) {
					words.add(word);
					wordCount++;
				}
			}
		}
		tokenStream.end();
		tokenStream.close();

		// The key -1 in the df-count output holds the total document count
		int documentCount = documentFrequency.get(-1).intValue();

		// Create a vector for the new tweet (wordId => TFIDF weight)
		Vector vector = new RandomAccessSparseVector(10000);
		TFIDF tfidf = new TFIDF();
		for (Multiset.Entry<String> entry : words.entrySet()) {
			String word = entry.getElement();
			int count = entry.getCount();
			Integer wordId = dictionary.get(word);
			Long freq = documentFrequency.get(wordId);
			double tfIdfValue = tfidf.calculate(count, freq.intValue(),
					wordCount, documentCount);
			vector.setQuick(wordId, tfIdfValue);
		}

		// Model is a matrix (wordId, labelId) => probability score
		NaiveBayesModel model = NaiveBayesModel.materialize(
				new Path(modelPath), configuration);
		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(
				model);

		// With the classifier, we get one score for each label. The label with
		// the highest score is the one the tweet is most likely associated with
		Vector resultVector = classifier.classifyFull(vector);
		double bestScore = -Double.MAX_VALUE;
		int bestCategoryId = -1;
		for (Element element : resultVector.all()) {
			int categoryId = element.index();
			double score = element.get();
			if (score > bestScore) {
				bestScore = score;
				bestCategoryId = categoryId;
			}
			if (categoryId == 1) {
				System.out.println("Probability of being positive: " + score);
			} else {
				System.out.println("Probability of being negative: " + score);
			}
		}
		if (bestCategoryId == 1) {
			System.out.println("The tweet is positive :) ");
		} else {
			System.out.println("The tweet is negative :( ");
		}
		analyzer.close();
	}

	public static Map<String, Integer> readDictionary(Configuration conf,
			Path dictionaryPath) {
		Map<String, Integer> dictionary = new HashMap<String, Integer>();
		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(
				dictionaryPath, true, conf)) {
			dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
		}
		return dictionary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf,
			Path documentFrequencyPath) {
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(
				documentFrequencyPath, true, conf)) {
			documentFrequency
					.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}
}

Compile and run the class using the following commands:

mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.NaiveBayes"

The output should be something like this:

Tweet: Have a nice day!
Probability of being negative: -40.40761445098948
Probability of being positive: -27.75439166678417
The tweet is positive :)

As you can see in the main method, we start by transforming the input file to the sequence file format. This is the file format used by Hadoop, which Mahout builds on for parallel processing.

The next method, sequenceFileToSparseVector(), uses the previously created sequence file to create sparse vectors. These vectors contain the TF-IDF weights for the words in the tweets and will be used to train the classifier.
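The TF-IDF weight combines how often a word appears in a tweet with how rare it is across the whole training set. The sketch below uses the classic sqrt/log form from Lucene’s similarity, which Mahout’s TFIDF class delegates to; treat the exact formula as an assumption that may vary between versions.

```java
public class TfIdfSketch {

    // Sub-linear term frequency damping times a smoothed inverse document
    // frequency. This mirrors Lucene's classic similarity (an assumption,
    // not verified against this exact Mahout version).
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return tf * idf;
    }

    public static void main(String[] args) {
        // A word appearing once in the tweet but in only 2 of 100 training
        // documents gets a much higher weight than one found in 90 of them.
        System.out.println("rare word:   " + tfIdf(1, 2, 100));
        System.out.println("common word: " + tfIdf(1, 90, 100));
    }
}
```

This is why, as noted earlier, the least significant (most common) words contribute very little to the classification decision.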

The trainNaiveBayesModel() method creates the model file, starting from the TF-IDF vectors.

The last method, classifyNewTweet(), takes a new tweet, creates the TF-IDF vector from its words and calculates the score of this tweet being positive or negative. The higher score decides the polarity of the tweet.
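The decision step at the end of classifyNewTweet() boils down to an argmax over the per-label scores returned by classifyFull(). The standalone sketch below shows just that selection, using hypothetical scores shaped like the sample output above (index 0 = negative, index 1 = positive).

```java
public class BestLabel {

    // classifyFull() returns one (unnormalized, log-domain) score per label;
    // the predicted label is simply the index with the highest score.
    static int bestCategory(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical log scores, similar to the sample run above.
        double[] scores = { -40.4, -27.8 };
        System.out.println(bestCategory(scores) == 1 ? "positive" : "negative");
    }
}
```

Because the scores are log values they are all negative; only their relative order matters, which is why the demo compares them rather than expecting values between 0 and 1.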

GitHub repository for this project: https://github.com/technobium/mahout-naive-bayes

Conclusion

For every business it is important to gather feedback about its own products and services. Reviews, ratings, comments, recommendations, tweets, blogs and so on are a rich source of information which can help a company improve and evolve. In this context, sentiment analysis is a valuable tool which automates the process of extracting sentiment from different content sources. Besides companies, political parties are also increasingly interested in this kind of analysis to extract opinion polarity from tweets, Facebook messages and blogs.

References

“Taming Text”, Ingersoll et al., Manning Publications, 2013 – http://manning.com/ingersoll/

https://mahout.apache.org/users/classification/bayesian.html

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
