Sentiment analysis using Mahout naive Bayes

Sentiment analysis or opinion mining is the identification of subjective information from text. This tutorial will show how to do sentiment analysis on Twitter feeds using the naive Bayes classification algorithm available on Apache Mahout. Although far from a production ready implementation, this simple demo Java application will help you understand how to use Mahout’s naive Bayes algorithm to classify text. We will start by explaining the problem, we will then see how naive Bayes can help us solve this problem and at the end we will build a working sample to see the algorithm in action. Basic Java programming knowledge is required for this tutorial.

Naive Bayes for sentiment analysis

Sentiment analysis aims to detect the attitude of a text. A simple subtask of sentiment analysis is to determine the polarity of the text: positive, negative or neutral. In this tutorial we concentrate on detecting if a short text like a Twitter message is positive or negative.  For example:

  • for the tweet “Have a nice day!” the algorithm should tell us that this is a positive message.
  • for the tweet “I had a bad day” the algorithm should tell us that this is a negative message.

From the machine learning domain point of view this can be seen as a classification task and naive Bayes is an algorithm which suits well this kind of task.

The naive Bayes algorithm uses probabilities to decide which class best matches for a given input text. The classification decision is based on a model obtained after the training process. Model training is done by analysing the relationship between the words in the training text and their classification categories. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable.  For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features (Naive Bayes classifier on Wikipedia).

Each text we will classify contains words noted with Wi (i=1..n) . For each word Wi from the training data set we can extract the following probabilities (noted with P):

P(Wi given Positive) = (The number of positive Texts with the Wi) / The number of positive Texts

P(Wi given Negative) = (The number of negative Texts with the Wi) / The number of negative Texts

For the entire test set we will have:

P(Positive) = (The number of positive Texts) / The total number of Texts

P(Negative) = (The number of negative Texts) / The number of Texts

For calculating the probability of a Text being positive or negative, given the containing words we will use the following theorem:

P(Positive given Text) = P(Text given Positive) x P(Positive) / P(Text)

P(Negative given Text) = P(Text given Negative) x P(Negative) / P(Text)

As P(Text) is 1, as each Text will be present once in the training set, we will have:

P(Positive given Text) = P(Text given Positive) x P(Positive) = P(W1 give Positive) x P(W2 given Positive) x … x P(Wn given Positive ) x P(Positive)

P(Negative given Text) = P(Text given Negative) x P(Negative) = P(W1 give Negative) x P(W2 given Negative) x … x P(Wn given Negative ) x P(Negative)

At the end one will compare P(Positive given Text) and P(Negative given Text) and the term with the higher probability will decide if the text is positive or negative. To increase the quality of the classifier, instead of using raw term frequency we will use TF-IDF. This way the least significant words are ignored when calculating the probabilities.

Java project for sentiment analysis

The project will use 100 tweets as input for the training phase. The input file will contain 100 lines, each line having the category (1 for positive and 0 for negative) and the tweet text.

In a real world project this dataset must contain millions of tweets for a accurate results. The initial data is usually split in training and test data, but for this simple demo we will use all the data for training.

Prerequisites:

Create the Maven project:

Rename the default created App class to NaiveBayes using the following command:

Add the Mahout and SLF4J libraries to this project:

Add the following lines to the dependencies section:

In the same file, after the dependencies section, add the following configuration which makes sure the code is compiled using Java 1.7

Create an input folder and copy the file containing the training data, tweets.txt.

Edit the NaiveBaiyes class file and add the following code:

Run the class by using the following command:

The output should be something like this:

As you can see in the mail method, we start by transforming the input file to sequence file format. This file format is used by Hadoop which is further used by Mahout for parallel processing.

The next method, sequenceFileToSparseVector(), uses the previously created sequence file to create SparseVectors. These vectors contain the TFIDF measurement for the words in the tweets and will be used to train the classifier.

The trainNaiveBayesModel() method creates the model file, staring from the TFIDF vectors.

The last method, classifyNewTweet, takes a new tweet, creates the TFIDF vector from the words and calculates the probability for this tweet of being positive or negative. The highest probability decides the polarity of the tweet.

GitHub repository for this project: https://github.com/technobium/mahout-naive-bayes

Conclusion

For every business it is important to gather feedback about the own product and services. Reviews, ratings, comments, recommendations, tweets, blogs etc. are a rich source of information which can help a company improve and evolve. In this context, sentiment analysis is a valuable tool which helps by automating the process of extracting the sentiment from different content sources. Beside companies, the political parties are also increasingly interested in this kind of analysis to extract opinion polarity from tweets, Facebook messages and blogs.

References

“Taming Text”, Ingersoll et. al., Manning Pub. 2013 – http://manning.com/ingersoll/

https://mahout.apache.org/users/classification/bayesian.html

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

9 Comments

Leave a Reply to Osco Cancel reply

Your email address will not be published. Required fields are marked *