Introduction to clustering using Apache Mahout

In machine learning, clustering is the name of a category of unsupervised learning algorithms. These algorithms find structure in unstructured input data. The scope of this tutorial is to demonstrate how Apache Mahout can be used to cluster a small set of documents according to their content. We will start by formulating a simple clustering problem, describe the processing steps, and then create a Java project that uses Apache Mahout to solve the problem. Basic Java programming knowledge is required for this tutorial.

Text Clustering explained

The simplest clustering problem is to automatically split a set of documents into distinct categories according to their content. As input we will use a set of documents, each containing either the word “red” or the word “blue”. The expected output of our demo is to have the documents split into two categories: one containing only the “blue” documents and one containing only the “red” documents.

The following processing steps are used:

– the documents are transformed into vectors using the TF-IDF weighting scheme. For more details, see the previous post TFIDF explained using Apache Mahout.

– the documents are initially clustered using the Canopy algorithm. You can find out more about the algorithm on the Mahout project site.

– a second clustering pass is applied using the FuzzyKMeans algorithm. You can find out more about the algorithm on the Mahout project site.

The code for this demo is very similar to the one used in the post TFIDF explained using Apache Mahout. What is new in this post is the clusterDocs() method, which does the actual clustering.

Simple clustering Java project

Let’s see the practical part.

Prerequisites: a working Java (JDK) and Apache Maven installation.

Create the Maven project:

mvn archetype:generate \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DgroupId=com.technobium \
    -DartifactId=mahout-clustering \
    -DinteractiveMode=false

Rename the generated App class to ClusteringDemo using the following command:

mv mahout-clustering/src/main/java/com/technobium/App.java \
   mahout-clustering/src/main/java/com/technobium/ClusteringDemo.java

Add the Mahout and SLF4J libraries to this project:

cd mahout-clustering
nano pom.xml

Add the following lines to the dependencies section:

<dependencies>
    ...
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>0.9</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.7</version>
    </dependency>
</dependencies>

Edit the ClusteringDemo class file and add the following code:

package com.technobium;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.common.PartialVectorMerger;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class ClusteringDemo {

    String outputFolder;
    Configuration configuration;
    FileSystem fileSystem;
    Path documentsSequencePath;
    Path tokenizedDocumentsPath;
    Path tfidfPath;
    Path termFrequencyVectorsPath;

    public static void main(String args[]) throws Exception {
        ClusteringDemo tester = new ClusteringDemo();

        tester.createTestDocuments();
        tester.calculateTfIdf();
        tester.clusterDocs();

        tester.printSequenceFile(tester.documentsSequencePath);

        System.out.println("\n Clusters: ");
        tester.printSequenceFile(new Path(tester.outputFolder
                + "clusters/clusteredPoints/part-m-00000"));
    }

    public ClusteringDemo() throws IOException {
        configuration = new Configuration();
        fileSystem = FileSystem.get(configuration);

        outputFolder = "output/";
        documentsSequencePath = new Path(outputFolder, "sequence");
        tokenizedDocumentsPath = new Path(outputFolder,
                DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        tfidfPath = new Path(outputFolder + "tfidf");
        termFrequencyVectorsPath = new Path(outputFolder
                + DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER);
    }

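    // Create ten small test documents and store them in a SequenceFile (document id -> text).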
    public void createTestDocuments() throws IOException {
        SequenceFile.Writer writer = new SequenceFile.Writer(fileSystem,
                configuration, documentsSequencePath, Text.class, Text.class);

        Text id1 = new Text("Document 1");
        Text text1 = new Text("John saw a red car.");
        writer.append(id1, text1);

        Text id2 = new Text("Document 2");
        Text text2 = new Text("Marta found a red bike.");
        writer.append(id2, text2);

        Text id3 = new Text("Document 3");
        Text text3 = new Text("Don need a blue coat.");
        writer.append(id3, text3);

        Text id4 = new Text("Document 4");
        Text text4 = new Text("Mike bought a blue boat.");
        writer.append(id4, text4);

        Text id5 = new Text("Document 5");
        Text text5 = new Text("Albert wants a blue dish.");
        writer.append(id5, text5);

        Text id6 = new Text("Document 6");
        Text text6 = new Text("Lara likes blue glasses.");
        writer.append(id6, text6);

        Text id7 = new Text("Document 7");
        Text text7 = new Text("Donna, do you have red apples?");
        writer.append(id7, text7);

        Text id8 = new Text("Document 8");
        Text text8 = new Text("Sonia needs blue books.");
        writer.append(id8, text8);

        Text id9 = new Text("Document 9");
        Text text9 = new Text("I like blue eyes.");
        writer.append(id9, text9);

        Text id10 = new Text("Document 10");
        Text text10 = new Text("Arleen has a red carpet.");
        writer.append(id10, text10);

        writer.close();
    }

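    // Tokenize the documents, create term frequency vectors and convert them to TF-IDF vectors.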
    public void calculateTfIdf() throws ClassNotFoundException, IOException,
            InterruptedException {
        DocumentProcessor.tokenizeDocuments(documentsSequencePath,
                StandardAnalyzer.class, tokenizedDocumentsPath, configuration);

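        // Create the term frequency vectors (minimum support 1, unigrams only, no normalization).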
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedDocumentsPath,
                new Path(outputFolder),
                DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
                configuration, 1, 1, 0.0f, PartialVectorMerger.NO_NORMALIZING,
                true, 1, 100, false, false);

        Pair<Long[], List<Path>> documentFrequencies = TFIDFConverter
                .calculateDF(termFrequencyVectorsPath, tfidfPath,
                        configuration, 100);

        TFIDFConverter.processTfIdf(termFrequencyVectorsPath, tfidfPath,
                configuration, documentFrequencies, 1, 100,
                PartialVectorMerger.NO_NORMALIZING, false, false, false, 1);
    }

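    // Cluster the TF-IDF vectors: Canopy produces the initial centroids, then Fuzzy K-Means refines the clusters.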
    void clusterDocs() throws ClassNotFoundException, IOException,
            InterruptedException {
        String vectorsFolder = outputFolder + "tfidf/tfidf-vectors/";
        String canopyCentroids = outputFolder + "canopy-centroids";
        String clusterOutput = outputFolder + "clusters";

        FileSystem fs = FileSystem.get(configuration);
        Path oldClusterPath = new Path(clusterOutput);

        if (fs.exists(oldClusterPath)) {
            fs.delete(oldClusterPath, true);
        }

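        // Canopy clustering with Euclidean distance; t1 = 20 and t2 = 5 are the loose and tight distance thresholds.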
        CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids),
                new EuclideanDistanceMeasure(), 20, 5, true, 0, true);

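        // Fuzzy K-Means seeded with the Canopy centroids: 0.01 = convergence delta, 20 = maximum iterations, 2 = fuzziness factor.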
        FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(
                canopyCentroids, "clusters-0-final"), new Path(clusterOutput),
                0.01, 20, 2, true, true, 0, false);
    }

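    // Print every key/value pair stored in a sequence file to the console.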
    void printSequenceFile(Path path) {
        SequenceFileIterable<Writable, Writable> iterable = new SequenceFileIterable<Writable, Writable>(
                path, configuration);
        for (Pair<Writable, Writable> pair : iterable) {
            System.out
                    .format("%10s -> %s\n", pair.getFirst(), pair.getSecond());
        }
    }
}

You can run the ClusteringDemo class by using the following commands:

mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.ClusteringDemo"

At the end of the console log you should see the following results:

Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.

 Clusters: 
         7 -> wt: 1.0 distance: 4.4960791719810365  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
         7 -> wt: 1.0 distance: 4.496079376645949  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
         7 -> wt: 1.0 distance: 4.496079576525459  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
         9 -> wt: 1.0 distance: 4.389955960700927  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
         9 -> wt: 1.0 distance: 4.389956011306051  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
         9 -> wt: 1.0 distance: 4.3899560687101395  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
         9 -> wt: 1.0 distance: 4.389956137136399  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
         7 -> wt: 1.0 distance: 5.577549042707083  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
         9 -> wt: 1.0 distance: 4.389956708176695  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
         9 -> wt: 1.0 distance: 4.389471924190491  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]

As you can see, all the documents containing the word “red” are grouped in cluster number 7 and all the documents containing the word “blue” are grouped in cluster number 9. You can extend the test by adding more documents in the createTestDocuments() method.
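For example, to add an eleventh document you could append another id/text pair before writer.close() in createTestDocuments() (the id and text below are just illustrative values):

Text id11 = new Text("Document 11");
Text text11 = new Text("Paul paints red fences.");
writer.append(id11, text11);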

GitHub repository for this project: https://github.com/technobium/mahout-clustering

Conclusion

Clustering is a powerful machine learning mechanism for grouping unstructured data. This makes it very useful when analyzing large sets of unstructured data.

References

http://mahout.apache.org/users/clustering/canopy-clustering.html

http://mahout.apache.org/users/clustering/fuzzy-k-means.html

“Mahout in Action”, Owen et al., Manning Publications, 2011 – http://manning.com/owen/
