Introduction to clustering using Apache Mahout
In machine learning, clustering is the name of a category of unsupervised learning algorithms. The problem these algorithms solve is finding structure in unstructured input data. The scope of this tutorial is to demonstrate how Apache Mahout can be used to cluster a small set of documents according to their content. We will start by formulating a simple clustering problem, describe the processing steps, then create a Java project that uses Apache Mahout to solve it. Basic Java programming knowledge is required for this tutorial.
Text Clustering explained
The simplest clustering problem is to automatically split a set of documents into distinct categories according to their content. As input we will have a set of documents, each containing the word “red” or the word “blue”. The expected output of our demo is the documents split into two categories: one containing only the “blue” documents and one containing only the “red” documents.
The following processing steps are used:
– the documents are transformed into vectors using the TF-IDF weighting scheme. For more details, see the previous post TFIDF explained using Apache Mahout.
– the documents are initially clustered using the Canopy algorithm. You can find out more about the algorithm on the Mahout project site.
– a second clustering pass is applied using the FuzzyKMeans algorithm. You can find out more about the algorithm on the Mahout project site.
The code for this demo is very similar to the one used in the post TFIDF explained using Apache Mahout. What is new in this post is the clusterDocs() method, which does the actual clustering.
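As a preview, the heart of that method is just two driver calls: Canopy to seed the centroids and FuzzyKMeans to refine them. This is a condensed view of the clusterDocs() method listed in full below; the path variables are defined there:

// Condensed view of clusterDocs(): Canopy seeds the centroids,
// FuzzyKMeans refines them into the final clusters.
CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids),
        new EuclideanDistanceMeasure(), 20, 5, true, 0, true);
FuzzyKMeansDriver.run(new Path(vectorsFolder),
        new Path(canopyCentroids, "clusters-0-final"),
        new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);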
Simple clustering Java project
Let’s move on to the practical part.
Prerequisites:
- Linux or Mac
- Java 1.7
- Apache Maven 3
Create the Maven project:
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DgroupId=com.technobium \
  -DartifactId=mahout-clustering \
  -DinteractiveMode=false
Rename the default App class created by the archetype to ClusteringDemo using the following command:
mv mahout-clustering/src/main/java/com/technobium/App.java \
   mahout-clustering/src/main/java/com/technobium/ClusteringDemo.java
Add the Mahout and SLF4J libraries to this project:
cd mahout-clustering
nano pom.xml
Add the following lines to the dependencies section:
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.7</version>
  </dependency>
</dependencies>
Edit the ClusteringDemo class file and add the following code:
package com.technobium;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.common.PartialVectorMerger;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class ClusteringDemo {

    String outputFolder;
    Configuration configuration;
    FileSystem fileSystem;
    Path documentsSequencePath;
    Path tokenizedDocumentsPath;
    Path tfidfPath;
    Path termFrequencyVectorsPath;

    public static void main(String args[]) throws Exception {
        ClusteringDemo tester = new ClusteringDemo();
        tester.createTestDocuments();
        tester.calculateTfIdf();
        tester.clusterDocs();
        tester.printSequenceFile(tester.documentsSequencePath);

        System.out.println("\n Clusters: ");
        tester.printSequenceFile(new Path(tester.outputFolder
                + "clusters/clusteredPoints/part-m-00000"));
    }

    public ClusteringDemo() throws IOException {
        configuration = new Configuration();
        fileSystem = FileSystem.get(configuration);
        outputFolder = "output/";
        documentsSequencePath = new Path(outputFolder, "sequence");
        tokenizedDocumentsPath = new Path(outputFolder,
                DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        tfidfPath = new Path(outputFolder + "tfidf");
        termFrequencyVectorsPath = new Path(outputFolder
                + DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER);
    }

    // Write the test documents as a (document id, text) SequenceFile.
    public void createTestDocuments() throws IOException {
        SequenceFile.Writer writer = new SequenceFile.Writer(fileSystem,
                configuration, documentsSequencePath, Text.class, Text.class);

        Text id1 = new Text("Document 1");
        Text text1 = new Text("John saw a red car.");
        writer.append(id1, text1);

        Text id2 = new Text("Document 2");
        Text text2 = new Text("Marta found a red bike.");
        writer.append(id2, text2);

        Text id3 = new Text("Document 3");
        Text text3 = new Text("Don need a blue coat.");
        writer.append(id3, text3);

        Text id4 = new Text("Document 4");
        Text text4 = new Text("Mike bought a blue boat.");
        writer.append(id4, text4);

        Text id5 = new Text("Document 5");
        Text text5 = new Text("Albert wants a blue dish.");
        writer.append(id5, text5);

        Text id6 = new Text("Document 6");
        Text text6 = new Text("Lara likes blue glasses.");
        writer.append(id6, text6);

        Text id7 = new Text("Document 7");
        Text text7 = new Text("Donna, do you have red apples?");
        writer.append(id7, text7);

        Text id8 = new Text("Document 8");
        Text text8 = new Text("Sonia needs blue books.");
        writer.append(id8, text8);

        Text id9 = new Text("Document 9");
        Text text9 = new Text("I like blue eyes.");
        writer.append(id9, text9);

        Text id10 = new Text("Document 10");
        Text text10 = new Text("Arleen has a red carpet.");
        writer.append(id10, text10);

        writer.close();
    }

    // Tokenize the documents, build term frequency vectors, then weight
    // them with TF-IDF.
    public void calculateTfIdf() throws ClassNotFoundException, IOException,
            InterruptedException {
        DocumentProcessor.tokenizeDocuments(documentsSequencePath,
                StandardAnalyzer.class, tokenizedDocumentsPath, configuration);

        DictionaryVectorizer.createTermFrequencyVectors(tokenizedDocumentsPath,
                new Path(outputFolder),
                DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
                configuration, 1, 1, 0.0f, PartialVectorMerger.NO_NORMALIZING,
                true, 1, 100, false, false);

        Pair<Long[], List<Path>> documentFrequencies = TFIDFConverter
                .calculateDF(termFrequencyVectorsPath, tfidfPath,
                        configuration, 100);

        TFIDFConverter.processTfIdf(termFrequencyVectorsPath, tfidfPath,
                configuration, documentFrequencies, 1, 100,
                PartialVectorMerger.NO_NORMALIZING, false, false, false, 1);
    }

    // Cluster the TF-IDF vectors: Canopy provides the initial centroids,
    // FuzzyKMeans refines them into the final clusters.
    void clusterDocs() throws ClassNotFoundException, IOException,
            InterruptedException {
        String vectorsFolder = outputFolder + "tfidf/tfidf-vectors/";
        String canopyCentroids = outputFolder + "canopy-centroids";
        String clusterOutput = outputFolder + "clusters";

        FileSystem fs = FileSystem.get(configuration);
        Path oldClusterPath = new Path(clusterOutput);
        if (fs.exists(oldClusterPath)) {
            fs.delete(oldClusterPath, true);
        }

        // T1 = 20 and T2 = 5 are the Canopy distance thresholds.
        CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids),
                new EuclideanDistanceMeasure(), 20, 5, true, 0, true);

        // convergenceDelta = 0.01, maxIterations = 20, fuzziness = 2.
        FuzzyKMeansDriver.run(new Path(vectorsFolder),
                new Path(canopyCentroids, "clusters-0-final"),
                new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);
    }

    // Print every (key, value) pair found in a SequenceFile.
    void printSequenceFile(Path path) {
        SequenceFileIterable<Writable, Writable> iterable =
                new SequenceFileIterable<Writable, Writable>(path, configuration);
        for (Pair<Writable, Writable> pair : iterable) {
            System.out.format("%10s -> %s\n", pair.getFirst(), pair.getSecond());
        }
    }
}
You can run the ClusteringDemo class by using the following commands:
mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.ClusteringDemo"
At the end of the console log you should see the following results:
Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.

 Clusters: 
7 -> wt: 1.0 distance: 4.4960791719810365 vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
7 -> wt: 1.0 distance: 4.496079376645949 vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
7 -> wt: 1.0 distance: 4.496079576525459 vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
9 -> wt: 1.0 distance: 4.389955960700927 vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
9 -> wt: 1.0 distance: 4.389956011306051 vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
9 -> wt: 1.0 distance: 4.3899560687101395 vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
9 -> wt: 1.0 distance: 4.389956137136399 vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
7 -> wt: 1.0 distance: 5.577549042707083 vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
9 -> wt: 1.0 distance: 4.389956708176695 vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
9 -> wt: 1.0 distance: 4.389471924190491 vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]
As you can see, all the documents containing the word “red” are grouped in cluster number 7 and all the documents containing the word “blue” are grouped in cluster number 9. You can extend the test by adding other documents in the createTestDocuments() method, as shown below.
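For example, one more document is just another writer.append() call before writer.close() in createTestDocuments(); the id and sentence below are made up:

// Hypothetical extra test document; append before writer.close().
Text id11 = new Text("Document 11");
Text text11 = new Text("Nina bought a red scarf.");
writer.append(id11, text11);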
GitHub repository for this project: https://github.com/technobium/mahout-clustering
Conclusion
Clustering is a powerful machine learning mechanism for finding groups in unstructured data. This makes it very useful when analyzing large sets of unstructured data.
References
http://mahout.apache.org/users/clustering/canopy-clustering.html
http://mahout.apache.org/users/clustering/fuzzy-k-means.html
“Mahout in Action”, Owen et al., Manning Publications, 2011 – http://manning.com/owen/
Thank you so much for this tutorial. It is pretty cool. Please clarify something for me, because I am new to Hadoop and Mahout. I have thousands of texts in an Excel .csv file. The csv file contains only two columns (ID, Text). I want to cluster the texts using the single csv file in Mahout. Is it possible to do this, or do I have to extract every single row of my csv to Notepad and save it separately as a document? Thanks.
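One way to do this, sketched below, is to skip the intermediate text files entirely: read the CSV and write each (ID, Text) row into the same kind of SequenceFile that createTestDocuments() produces, then feed it to the TF-IDF and clustering steps above. The file name documents.csv and the split on the first comma are assumptions; a real CSV with quoted fields would need a proper CSV parser.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CsvToSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        Path sequencePath = new Path("output/sequence");

        // Same (Text id, Text document) layout as createTestDocuments().
        SequenceFile.Writer writer = new SequenceFile.Writer(fileSystem,
                configuration, sequencePath, Text.class, Text.class);

        BufferedReader reader = new BufferedReader(new FileReader(
                "documents.csv")); // hypothetical input file
        String line;
        while ((line = reader.readLine()) != null) {
            // Split on the first comma only: column 1 is the ID,
            // everything after the first comma is the text.
            String[] columns = line.split(",", 2);
            if (columns.length == 2) {
                writer.append(new Text(columns[0].trim()),
                        new Text(columns[1].trim()));
            }
        }
        reader.close();
        writer.close();
    }
}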
Thank you so much!!! It runs well!!
Thanks for the good example; it was helpful to understand Mahout clustering using a simple example. A couple of things though:
1) Your website is not allowing any copy of the code fragments (I had to view source to get at the content).
2) mvn archetype:create was superseded by mvn archetype:generate (create no longer works)
3) The mvn exec line should be: exec:java -Dexec.mainClass="com.technobium.ClusteringDemo"
Other than the above, this was a very helpful example. Thanks!
Hi Sam,
Thank you very much for the useful observations. The article is now updated according to your suggestions. You can also find the code on GitHub: https://github.com/technobium/mahout-clustering
Regards,
Leo
Hi Leo,
As I understand the results, cluster number 7 corresponds to “red” and cluster number 9 to “blue”. In this case it is known because the example is clear, but is there any way to get the word that the cluster ID is referring to? In this case 7 -> red, 9 -> blue.
And how exactly do you interpret the results? I mean, in the vector for Document 1, for example, index 29 corresponds to 29 -> red in the word dictionary that was created, and it has the lowest value (1.693), so the document is assigned to cluster 7 (red).
Thank you,
Erik.
Hi Erik,
Thanks for the question. The fact that cluster number 7 corresponds to “red” is relative to the inputs. If you add more test sentences you will see more clusters in the output, and maybe the “red” documents will be grouped in cluster number 20. It depends on the inputs.
For analyzing the resulting clusters, one can use the Cluster Dumper tool. It prints the clusters and the top terms found in each cluster. For this short demo it is difficult to see a clear term separation, but when you have more than 10,000 input documents it should be more obvious which words are relevant for a certain cluster.
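For example, assuming you have the Mahout 0.9 distribution's bin/mahout script available, a clusterdump invocation could look roughly like this. The paths are illustrative and based on the demo's output layout; the exact clusters-N-final folder name depends on how many iterations the run needed, so check the output/clusters directory first:

bin/mahout clusterdump \
  -i output/clusters/clusters-1-final \
  -p output/clusters/clusteredPoints \
  -d output/dictionary.file-0 \
  -dt sequencefile \
  -n 10 \
  -o cluster-analysis.txt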
What is important to see in the current demo is that the clusters are formed according to the documents' distance to the cluster center point. For example, all the “blue” documents have a distance of about 4.389 to their cluster center.
Hope this helps,
Leo
Thank you for this good example.
Do you have any idea how to automatically generate the metric values for Canopy (e.g. 20 and 5 in your example)? Or how to automatically generate K (the number of clusters) for K-Means?
Thank you
Hi Darsh,
This is a very good question. I will start with the second part. I am not using a predefined or generated number of clusters K. I am using instead the canopy generation, or canopy clustering, algorithm. The key to this algorithm is the two distance threshold parameters T1 and T2 (20 and 5 in the example), where T1 should be greater than T2.
I started with two extreme values, T1=100 and T2=1, then I narrowed the interval to T1=20 and T2=5. The initial values are of course related to the resulting distances. In my example the distances have values between 4.3 and 5.5, so I didn't start with initial values like T1=2000 and T2=1500. If you have a larger dataset, you can try to find the optimal parameters on a smaller subset, then run the algorithm on the entire data.
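For illustration, a rough sketch of what that trial-and-error can look like in code; the candidate T1/T2 pairs and output paths below are made up, and vectorsFolder/outputFolder are the variables from the demo class:

// Hypothetical sweep: run Canopy with a few T1/T2 pairs and inspect
// how many canopies each produces before settling on final values.
double[][] candidates = { { 100, 1 }, { 50, 10 }, { 20, 5 } };
for (double[] t : candidates) {
    Path out = new Path(outputFolder + "canopy-t" + (int) t[0] + "-" + (int) t[1]);
    CanopyDriver.run(new Path(vectorsFolder), out,
            new EuclideanDistanceMeasure(), t[0], t[1], true, 0, true);
    // Inspect out/clusters-0-final to see how many canopies were generated.
}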
Keep in touch,
Leo
Hi Leo,
Thanks for your reply.
Canopy clustering was promoted by people at Google, but in the real world it does not work well; for this reason, Canopy is being deprecated in Mahout.
Your idea is helpful if you are working with a specific dataset, but if you have dynamic data it will be pretty difficult to find the optimal K, or T1 and T2, automatically.
Cheers,
Darsh
Hi Darsh,
Totally agree, there is no universal solution for finding the number of clusters K or the T1 and T2 parameters.
As for Mahout Canopy clustering, I saw that it is deprecated in the latest release and will be replaced by Streaming K-Means. I will definitely give it a try. Of course, this also requires an approximate number of clusters K, but if the input data domain is well known and investigated, I am pretty sure one can determine a meaningful number.
Keep in touch,
Leo
Thank you so much for sharing this example. Very helpful. Your blog is superb. It has all the information that I've been looking for.