Sentiment analysis using Mahout naive Bayes
Sentiment analysis, or opinion mining, is the identification of subjective information in text. This tutorial will show how to do sentiment analysis on Twitter feeds using the naive Bayes classification algorithm available in Apache Mahout. Although far from a production-ready implementation, this simple demo Java application will help you understand how to use Mahout’s naive Bayes algorithm to classify text. We will start by explaining the problem, then see how naive Bayes can help us solve it, and finally build a working sample to see the algorithm in action. Basic Java programming knowledge is required for this tutorial.
Naive Bayes for sentiment analysis
Sentiment analysis aims to detect the attitude of a text. A simple subtask of sentiment analysis is to determine the polarity of the text: positive, negative or neutral. In this tutorial we concentrate on detecting if a short text like a Twitter message is positive or negative. For example:
- for the tweet “Have a nice day!” the algorithm should tell us that this is a positive message.
- for the tweet “I had a bad day” the algorithm should tell us that this is a negative message.
From the machine learning point of view this can be seen as a classification task, and naive Bayes is an algorithm well suited to this kind of task.
The naive Bayes algorithm uses probabilities to decide which class best matches a given input text. The classification decision is based on a model obtained from a training process. Model training is done by analysing the relationship between the words in the training texts and their classification categories. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features (Naive Bayes classifier on Wikipedia).
Each text we will classify contains words denoted Wi (i = 1..n). For each word Wi in the training data set we can extract the following probabilities (denoted P):
P(Wi given Positive) = (The number of positive Texts with the Wi) / The number of positive Texts
P(Wi given Negative) = (The number of negative Texts with the Wi) / The number of negative Texts
For the entire training set we will have:
P(Positive) = (The number of positive Texts) / The total number of Texts
P(Negative) = (The number of negative Texts) / The total number of Texts
For calculating the probability of a Text being positive or negative, given the words it contains, we will use Bayes’ theorem:
P(Positive given Text) = P(Text given Positive) x P(Positive) / P(Text)
P(Negative given Text) = P(Text given Negative) x P(Negative) / P(Text)
As P(Text) is the same for both classes, it can be dropped when comparing the two probabilities. Using the independence assumption for the words, we have:
P(Positive given Text) ∝ P(Text given Positive) x P(Positive) = P(W1 given Positive) x P(W2 given Positive) x … x P(Wn given Positive) x P(Positive)
P(Negative given Text) ∝ P(Text given Negative) x P(Negative) = P(W1 given Negative) x P(W2 given Negative) x … x P(Wn given Negative) x P(Negative)
Finally, we compare P(Positive given Text) and P(Negative given Text); the class with the higher probability decides whether the text is positive or negative. To increase the quality of the classifier, instead of raw term frequencies we will use TF-IDF weights. This way the least significant words carry little weight when calculating the probabilities.
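To make the decision rule concrete, here is a self-contained toy sketch (hypothetical training data; it uses plain term frequencies with add-one smoothing and sums of log probabilities rather than Mahout's TF-IDF pipeline, and assumes equal class priors so P(class) cancels out):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyNaiveBayes {

    // Word counts per class, built from hypothetical toy training data.
    static Map<String, Integer> positive = new HashMap<>();
    static Map<String, Integer> negative = new HashMap<>();
    static int positiveTotal, negativeTotal, vocabularySize;

    static void train(List<String> positiveTexts, List<String> negativeTexts) {
        for (String text : positiveTexts)
            for (String w : text.toLowerCase().split("\\s+")) {
                positive.merge(w, 1, Integer::sum);
                positiveTotal++;
            }
        for (String text : negativeTexts)
            for (String w : text.toLowerCase().split("\\s+")) {
                negative.merge(w, 1, Integer::sum);
                negativeTotal++;
            }
        Map<String, Integer> vocab = new HashMap<>(positive);
        negative.forEach((w, c) -> vocab.merge(w, c, Integer::sum));
        vocabularySize = vocab.size();
    }

    // Sum of log P(Wi given class), with add-one (Laplace) smoothing to avoid log(0).
    static double logLikelihood(String text, Map<String, Integer> counts, int total) {
        double sum = 0.0;
        for (String w : text.toLowerCase().split("\\s+")) {
            int c = counts.getOrDefault(w, 0);
            sum += Math.log((c + 1.0) / (total + vocabularySize));
        }
        return sum;
    }

    static String classify(String text) {
        double pos = logLikelihood(text, positive, positiveTotal);
        double neg = logLikelihood(text, negative, negativeTotal);
        return pos > neg ? "positive" : "negative";
    }

    public static void main(String[] args) {
        train(Arrays.asList("have a nice day", "what a great game"),
              Arrays.asList("i had a bad day", "what a terrible game"));
        System.out.println(classify("nice great day"));   // positive
        System.out.println(classify("bad terrible day")); // negative
    }
}
```

The Mahout classifier used later in this tutorial works on the same principle, only with TF-IDF-weighted vectors and a trained model instead of raw counts.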
Java project for sentiment analysis
The project will use 100 tweets as input for the training phase. The input file will contain 100 lines, each line having the category (1 for positive and 0 for negative) and the tweet text.
In a real-world project this dataset would need to contain millions of tweets for accurate results. The initial data is usually split into training and test sets, but for this simple demo we will use all the data for training.
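For illustration, the first few lines of such a file might look like this (hypothetical sample tweets; the category and the tweet text must be separated by a tab character, since the code later splits each line on \t):

```text
1	Have a nice day!
1	What a great game last night
0	I had a bad day
0	Worst service ever, never coming back
```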
Prerequisites:
- Linux or Mac
- Java 1.7
- Apache Maven 3
Create the Maven project:
```shell
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DgroupId=com.technobium \
  -DartifactId=mahout-naive-bayes \
  -DinteractiveMode=false
```
Rename the default App class to NaiveBayes using the following command:
```shell
mv mahout-naive-bayes/src/main/java/com/technobium/App.java \
   mahout-naive-bayes/src/main/java/com/technobium/NaiveBayes.java
```
Add the Mahout and SLF4J libraries to this project:
```shell
cd mahout-naive-bayes
nano pom.xml
```
Add the following lines to the dependencies section:
```xml
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.7</version>
  </dependency>
</dependencies>
```
In the same file, after the dependencies section, add the following configuration, which makes sure the code is compiled with Java 1.7:
```xml
<build>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```
Create an input folder and copy the file containing the training data, tweets.txt.
```shell
mkdir input
```
Edit the NaiveBayes class file and add the following code:
```java
package com.technobium;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class NaiveBayes {

    Configuration configuration = new Configuration();

    String inputFilePath = "input/tweets.txt";
    String sequenceFilePath = "input/tweets-seq";
    String labelIndexPath = "input/labelindex";
    String modelPath = "input/model";
    String vectorsPath = "input/tweets-vectors";
    String dictionaryPath = "input/tweets-vectors/dictionary.file-0";
    String documentFrequencyPath = "input/tweets-vectors/df-count/part-r-00000";

    public static void main(String[] args) throws Throwable {
        NaiveBayes nb = new NaiveBayes();
        nb.inputDataToSequenceFile();
        nb.sequenceFileToSparseVector();
        nb.trainNaiveBayesModel();
        nb.classifyNewTweet("Have a nice day!");
    }

    public void inputDataToSequenceFile() throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader(inputFilePath));
        FileSystem fs = FileSystem.getLocal(configuration);
        Path seqFilePath = new Path(sequenceFilePath);
        fs.delete(seqFilePath, false);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, configuration,
                seqFilePath, Text.class, Text.class);
        int count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split("\t");
                writer.append(new Text("/" + tokens[0] + "/tweet" + count++),
                        new Text(tokens[1]));
            }
        } finally {
            reader.close();
            writer.close();
        }
    }

    void sequenceFileToSparseVector() throws Exception {
        SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();
        svfsf.run(new String[] { "-i", sequenceFilePath, "-o", vectorsPath, "-ow" });
    }

    void trainNaiveBayesModel() throws Exception {
        TrainNaiveBayesJob trainNaiveBayes = new TrainNaiveBayesJob();
        trainNaiveBayes.setConf(configuration);
        trainNaiveBayes.run(new String[] { "-i", vectorsPath + "/tfidf-vectors",
                "-o", modelPath, "-li", labelIndexPath, "-el", "-c", "-ow" });
    }

    private void classifyNewTweet(String tweet) throws IOException {
        System.out.println("Tweet: " + tweet);

        Map<String, Integer> dictionary = readDictionary(configuration,
                new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(
                configuration, new Path(documentFrequencyPath));

        Multiset<String> words = ConcurrentHashMultiset.create();

        // Extract the words from the new tweet using Lucene
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        TokenStream tokenStream = analyzer.tokenStream("text",
                new StringReader(tweet));
        CharTermAttribute termAttribute = tokenStream
                .addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        int wordCount = 0;
        while (tokenStream.incrementToken()) {
            if (termAttribute.length() > 0) {
                String word = tokenStream.getAttribute(CharTermAttribute.class)
                        .toString();
                Integer wordId = dictionary.get(word);
                // If the word is not in the dictionary, skip it
                if (wordId != null) {
                    words.add(word);
                    wordCount++;
                }
            }
        }
        tokenStream.end();
        tokenStream.close();

        int documentCount = documentFrequency.get(-1).intValue();

        // Create a vector for the new tweet (wordId => TF-IDF weight)
        Vector vector = new RandomAccessSparseVector(10000);
        TFIDF tfidf = new TFIDF();
        for (Multiset.Entry<String> entry : words.entrySet()) {
            String word = entry.getElement();
            int count = entry.getCount();
            Integer wordId = dictionary.get(word);
            Long freq = documentFrequency.get(wordId);
            double tfIdfValue = tfidf.calculate(count, freq.intValue(),
                    wordCount, documentCount);
            vector.setQuick(wordId, tfIdfValue);
        }

        // The model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(
                new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(
                model);

        // The classifier returns one score per label. The label with the
        // highest score is the one the tweet is most likely associated with.
        Vector resultVector = classifier.classifyFull(vector);
        double bestScore = -Double.MAX_VALUE;
        int bestCategoryId = -1;
        for (Element element : resultVector.all()) {
            int categoryId = element.index();
            double score = element.get();
            if (score > bestScore) {
                bestScore = score;
                bestCategoryId = categoryId;
            }
            if (categoryId == 1) {
                System.out.println("Probability of being positive: " + score);
            } else {
                System.out.println("Probability of being negative: " + score);
            }
        }
        if (bestCategoryId == 1) {
            System.out.println("The tweet is positive :)");
        } else {
            System.out.println("The tweet is negative :(");
        }
        analyzer.close();
    }

    public static Map<String, Integer> readDictionary(Configuration conf,
            Path dictionaryPath) {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(
                dictionaryPath, true, conf)) {
            dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionary;
    }

    public static Map<Integer, Long> readDocumentFrequency(Configuration conf,
            Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(
                documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }
}
```
Run the class by using the following command:
```shell
mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.NaiveBayes"
```
The output should be something like this:
```text
Tweet: Have a nice day!
Probability of being negative: -40.40761445098948
Probability of being positive: -27.75439166678417
The tweet is positive :)
```
As you can see in the main method, we start by transforming the input file to the sequence file format. This is the file format used by Hadoop, which Mahout relies on for parallel processing.
The next method, sequenceFileToSparseVector(), uses the previously created sequence file to create SparseVectors. These vectors contain the TF-IDF weights for the words in the tweets and will be used to train the classifier.
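As a rough sketch of what such a weight looks like, here is the classic tf × idf formulation (Mahout's own TFIDF class may apply additional scaling and smoothing, so treat the exact formula as an assumption for illustration only):

```java
public class TfIdfSketch {

    // Classic TF-IDF: term frequency times log of inverse document frequency.
    // Rare words get large weights; words appearing in most documents get
    // weights near zero, so they barely influence classification.
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A word appearing twice in a tweet but in only 5 of 100 tweets
        // is weighted far higher than one present in 90 of them.
        System.out.println(tfIdf(2, 5, 100));  // rare word: high weight
        System.out.println(tfIdf(2, 90, 100)); // common word: near zero
    }
}
```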
The trainNaiveBayesModel() method creates the model file, starting from the TF-IDF vectors.
The last method, classifyNewTweet, takes a new tweet, creates the TF-IDF vector from its words and calculates the probability of this tweet being positive or negative. The highest probability decides the polarity of the tweet.
GitHub repository for this project: https://github.com/technobium/mahout-naive-bayes
Conclusion
For every business it is important to gather feedback about its own products and services. Reviews, ratings, comments, recommendations, tweets, blogs etc. are a rich source of information which can help a company improve and evolve. In this context, sentiment analysis is a valuable tool that automates the process of extracting sentiment from different content sources. Besides companies, political parties are also increasingly interested in this kind of analysis, to extract opinion polarity from tweets, Facebook messages and blogs.
References
“Taming Text”, Ingersoll et. al., Manning Pub. 2013 – http://manning.com/ingersoll/
https://mahout.apache.org/users/classification/bayesian.html
Can you please help me with this?
When I run the code I get the following error:
```text
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ mahout-naive-bayes ---
Sep 03, 2018 3:27:29 AM org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sep 03, 2018 3:27:29 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
[WARNING]
java.lang.ArrayIndexOutOfBoundsException: 1
    at com.technobium.NaiveBayes.inputDataToSequenceFile(NaiveBayes.java:69)
    at com.technobium.NaiveBayes.main(NaiveBayes.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.559 s
[INFO] Finished at: 2018-09-03T03:27:29+03:00
[INFO] Final Memory: 15M/174M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project mahout-naive-bayes: An exception occured while executing the Java class. 1 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
```
I have solved the issue, the input file had an extra line.
Thanks a lot for the code
Is it possible to have 3 tweet categories (positive, negative, neutral) instead of just 2 (positive, negative)? If so, which part should change?
And what is the code for splitting the data into training and test sets?
Hello There,
I am developing a system for sentiment analysis and I am using the above code to classify Facebook comments. The code above is not working on the Windows platform; all the methods throw chmod exceptions. Could you please tell me how to execute these methods in the Windows environment and then start using the classifier?
Thanks
Hi Ankit,
In the article’s prerequisites section I mentioned that the code will run only under Linux or MacOS. The problem is that Hadoop (which in turn is used by the Mahout algorithm) needs special setup under Windows. You can follow this tutorial for Windows: http://alans.se/blog/2010/mahout-on-hadoop-in-cygwin/. Otherwise you can run the sample code on Linux or a Linux virtual machine.
Regards,
Leo
Is it possible to use HDFS instead of the local FS?
Good work, thanks!
Hi Osco,
According to this list Naive Bayes is one of the algorithms that uses Hadoop MapReduce: https://mahout.apache.org/users/basics/algorithms.html
Here is also a step by step guide for running Naive Bayes classification in a Hadoop cluster: https://mahout.apache.org/users/classification/twenty-newsgroups.html
Hope this helps.
Regards,
Leo
Why are we getting negative values for probability?
Hi Basil,
Good observation! The probability of a document belonging to a category should be between 0 and 1.
To compute the probability of a document belonging to a category, we compute the product of the probabilities of each word belonging to that category. As these probabilities are small, multiplying them loses precision. For this reason the naive Bayes implementation in Mahout uses the logarithmic function: Sum(log(probabilities)). Since the logarithm of a value between 0 and 1 is negative, the sum of these log probabilities will also be negative.
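A quick toy illustration of this point (arbitrary numbers, not Mahout's actual scores): multiplying many small probabilities underflows to zero in double precision, while summing their logarithms stays representable and comparable:

```java
public class LogProbDemo {

    // Returns {product of the probabilities, sum of their logarithms}.
    static double[] demo(int words, double p) {
        double product = 1.0, logSum = 0.0;
        for (int i = 0; i < words; i++) {
            product *= p;          // shrinks toward zero and underflows
            logSum += Math.log(p); // stays a finite negative number
        }
        return new double[] { product, logSum };
    }

    public static void main(String[] args) {
        // 500 words, each with probability 0.01 under the class
        double[] r = demo(500, 0.01);
        System.out.println(r[0]); // underflows to 0.0
        System.out.println(r[1]); // about -2302.6, still usable for comparison
    }
}
```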
The explanation can be found also here: https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/comment-page-1/#comment-874
Regards,
Leo