Introduction to clustering using Apache Mahout

In machine learning, clustering is a category of unsupervised learning algorithms whose main purpose is to find structure in unstructured input data. This tutorial demonstrates how Apache Mahout can be used to cluster a small set of documents according to their content. We will start by formulating a simple clustering problem, then describe the processing steps, and finally create a Java project that uses Apache Mahout to solve the problem. Basic Java programming knowledge is required for this tutorial.

Text Clustering explained

A simple clustering problem is to automatically split a set of documents into distinct categories according to their content. As input we will have a set of documents that contain the word “red” or the word “blue”. The expected output of our demo is the documents split into two categories: one containing only “blue” documents and one containing only “red” documents.

The following processing steps are used:

– the documents are transformed into vectors using the TF-IDF weighting scheme. For more details, see the previous post TFIDF explained using Apache Mahout.

– the documents are initially clustered using the Canopy algorithm. You can find out more about the algorithm on the Mahout project site.

– a second clustering is applied using the FuzzyKMeans algorithm. You can find out more about the algorithm on the Mahout project site.
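To make the Canopy step more concrete, here is a minimal plain-Java sketch of the basic idea, written without Mahout. It assumes two distance thresholds, T1 (loose) and T2 (tight, with T1 > T2), Euclidean distance, and that the first remaining point is always picked as the next canopy center (the basic algorithm picks one at random). The class name CanopySketch, its methods, and the sample vectors are made up for illustration and are not part of Mahout's API or of the demo project.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Groups points into canopies and returns each canopy as a list of
     * point indices. t1 is the loose threshold, t2 the tight one (t1 > t2).
     */
    static List<List<Integer>> canopies(double[][] points, double t1, double t2) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            candidates.add(i);
        }

        List<List<Integer>> result = new ArrayList<>();
        while (!candidates.isEmpty()) {
            // Simplification: take the first remaining point as the center.
            int center = candidates.remove(0);
            List<Integer> canopy = new ArrayList<>();
            canopy.add(center);

            List<Integer> remaining = new ArrayList<>();
            for (int idx : candidates) {
                double d = distance(points[center], points[idx]);
                if (d < t1) {
                    canopy.add(idx);    // close enough to join this canopy
                }
                if (d >= t2) {
                    remaining.add(idx); // not tightly bound, stays a candidate
                }
            }
            candidates = remaining;
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D document vectors: two "red"-like, two "blue"-like.
        double[][] docs = { {1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}, {0.1, 0.9} };
        System.out.println(canopies(docs, 0.6, 0.3)); // prints [[0, 1], [2, 3]]
    }
}
```

With these sample vectors the first two points fall into one canopy and the last two into another. In a pipeline like the one described above, the canopy centers would then serve as the starting clusters for the second, finer clustering step.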

The code for this demo is very similar to the one used in the post TFIDF explained using Apache Mahout. What is new in this post is the clusterDocs() method, which does the actual clustering.
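As a rough illustration of what a fuzzy k-means step computes, the plain-Java sketch below derives the degree of membership of one point in each cluster from its distances to the cluster centers, using the standard fuzzy c-means membership formula with a fuzziness parameter m > 1. It does not use Mahout; the class FuzzyMembershipSketch, its method, and the sample distances are assumptions made for this example, and all distances are assumed to be strictly positive.

```java
public class FuzzyMembershipSketch {

    /**
     * Computes membership degrees of a single point in each cluster.
     *
     * @param distances distance from the point to each cluster center
     *                  (assumed strictly positive)
     * @param m         fuzziness parameter, must be greater than 1
     * @return one membership value per cluster; the values sum to 1
     */
    static double[] memberships(double[] distances, double m) {
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int k = 0; k < distances.length; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum; // closer centers get a larger share
        }
        return u;
    }

    public static void main(String[] args) {
        // Hypothetical distances from one document to two cluster centers.
        double[] u = memberships(new double[] {0.2, 0.8}, 2.0);
        System.out.println(u[0] + " " + u[1]);
    }
}
```

For distances 0.2 and 0.8 with m = 2, the point belongs to the first cluster with a degree of about 0.94 and to the second with about 0.06. Unlike hard k-means, every point keeps some membership in every cluster, and a document is typically reported under the cluster where its membership is highest.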

Simple clustering Java project

Let’s see the practical part.

Prerequisites:

Create the Maven project:

Rename the generated App class to ClusteringDemo using the following command:

Add the Mahout and SLF4J libraries to this project:

Add the following lines to the dependencies section:

Edit the ClusteringDemo class file and add the following code:

You can run the ClusteringDemo class by using the following commands:

At the end of the console log you should see the following results:

As you can see, all the documents containing the word “red” are grouped in cluster 7 and all the documents containing the word “blue” are grouped in cluster 9. You can extend the test by adding other documents in the createTestDocuments() method.

GitHub repository for this project: https://github.com/technobium/mahout-clustering

Conclusion

Clustering is a powerful machine learning mechanism for grouping unstructured data. This makes it very useful for analyzing large sets of unstructured data.

References

http://mahout.apache.org/users/clustering/canopy-clustering.html

http://mahout.apache.org/users/clustering/fuzzy-k-means.html

“Mahout in Action”, Owen et al., Manning Publications, 2011 – http://manning.com/owen/
