Getting started with Apache Mahout
This article gives you a first idea of what the Apache Mahout library can do and how it can be used in real-life projects. You can see it as a “Hello World” project for Mahout. After a short introduction to Apache Mahout, we will see what a recommender is, then create a simple recommender using the library. As this is a Java-oriented article, you will need basic Java programming skills.
Apache Mahout short description
Mahout is an open source, scalable machine learning library from Apache, written in Java. The machine learning algorithms implemented by Mahout focus on clustering, classification and recommendations. For scalability, the algorithms were built on Apache Hadoop and the map/reduce paradigm. In April 2014 the project decided to move to Apache Spark. Mahout is well suited to machine learning problems that involve very large collections of data. For small amounts of data, other libraries or products can be faster and therefore a better fit. More details about the project: https://mahout.apache.org
Recommender explained
A recommender is an application that suggests products or services to a user, based on the preferences expressed by other users with similar tastes.
Let’s see an example. We have three users who have expressed their preferences for four books. Preferences are given as ratings from one to five, five being the highest. Our goal is to recommend a new book to User 1. As the following table shows, User 1 and User 2 have similar preferences, since they both gave Book 1 a rating of five. User 3, however, gave Book 1 a rating of one, which means User 1 and User 3 do not have similar preferences. Looking at the table, we expect the recommender to output Book 3 for User 1, since User 1 has not rated this book yet and User 2 rated it highly.
User | Item | Item ID | Preference |
---|---|---|---|
User 1 | Book 1 | 1 | 5.0 |
User 1 | Book 2 | 2 | 2.0 |
User 2 | Book 1 | 1 | 5.0 |
User 2 | Book 2 | 2 | 1.0 |
User 2 | Book 3 | 3 | 4.0 |
User 2 | Book 4 | 4 | 1.0 |
User 3 | Book 1 | 1 | 1.0 |
User 3 | Book 2 | 2 | 5.0 |
User 3 | Book 3 | 3 | 2.0 |
User 3 | Book 4 | 4 | 4.0 |
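The first half of that reasoning, finding the books User 1 has not rated yet, can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical, it does not use Mahout, and it ignores the rating values entirely.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout.
final class CandidateItems {

    // Item IDs each user has rated, copied from the table above.
    static Map<Long, Set<Long>> ratedItems() {
        Map<Long, Set<Long>> rated = new LinkedHashMap<>();
        rated.put(1L, new TreeSet<>(Arrays.asList(1L, 2L)));
        rated.put(2L, new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L)));
        rated.put(3L, new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L)));
        return rated;
    }

    // Items rated by at least one other user but not yet by the target user.
    static Set<Long> candidatesFor(long userId, Map<Long, Set<Long>> rated) {
        Set<Long> candidates = new TreeSet<>();
        for (Map.Entry<Long, Set<Long>> entry : rated.entrySet()) {
            if (entry.getKey() != userId) {
                candidates.addAll(entry.getValue());
            }
        }
        candidates.removeAll(rated.get(userId));
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println("Candidates for User 1: "
                + candidatesFor(1L, ratedItems()));
    }
}
```

Running it prints `Candidates for User 1: [3, 4]`. Deciding which of the two candidates to rank first is the job of the similarity computation inside the recommender.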
Creating the recommender using Apache Mahout
We will start by creating a Maven project, then add the Mahout libraries to it, and finally write a basic recommender. Prerequisites:
- Linux or Mac
- Java 1.7
- Apache Maven 3
Create a Maven project
From the command line, create a Maven project named “recommender”:
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DgroupId=com.technobium \
  -DartifactId=recommender \
  -DinteractiveMode=false
This will create a “recommender” project with a default class named “App”. Rename the default class file to “BasicRecommender”:
mv recommender/src/main/java/com/technobium/App.java \
  recommender/src/main/java/com/technobium/BasicRecommender.java
Navigate to the “recommender” project and edit the pom.xml file:
cd recommender
nano pom.xml
Add the dependencies for Apache Mahout 0.9 in the <dependencies> section. I also added slf4j-simple, a binding for the SLF4J logging facade, because Mahout logs through SLF4J.
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.7</version>
  </dependency>
</dependencies>
Optionally you can generate the Eclipse project:
mvn eclipse:eclipse
Create the recommender
The input for the recommender is a comma-separated values (CSV) file of user preferences, one preference per line, with the following fields:
- userID – user identification
- itemID – item identification
- value – the affinity or preference of the current user to the item
Create an input file with this preference data:
mkdir input
nano input/data.csv
Add the following sample content, which reflects the preferences table explained before:
1,1,5.0
1,2,2.0
2,1,5.0
2,2,1.0
2,3,4.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,2.0
3,4,4.0
Edit the renamed default class (BasicRecommender.java) and replace its content with the following:
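To make the record layout concrete, here is a minimal plain-Java sketch that parses one such line. The class and method names are hypothetical, and this is not how Mahout reads the file (FileDataModel takes care of that in the recommender class). It assumes numeric IDs held as long and a preference held as float.

```java
// Hypothetical helper, not part of Mahout.
final class PreferenceParser {

    /** One "userID,itemID,value" record. */
    static final class Preference {
        final long userId;
        final long itemId;
        final float value;

        Preference(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Splits a line on commas and converts the three fields.
    static Preference parseLine(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                    "Expected userID,itemID,value but got: " + line);
        }
        return new Preference(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Float.parseFloat(fields[2].trim()));
    }

    public static void main(String[] args) {
        String[] sample = {"1,1,5.0", "1,2,2.0", "2,3,4.0"};
        for (String line : sample) {
            Preference p = parseLine(line);
            System.out.println("user " + p.userId + " rated item "
                    + p.itemId + " with " + p.value);
        }
    }
}
```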
package com.technobium;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Hello Mahout world!
 */
public class BasicRecommender {

    public static void main(String[] args) throws IOException, TasteException {
        Logger log = LoggerFactory.getLogger(BasicRecommender.class);

        // Load historical data about user preferences
        DataModel model = new FileDataModel(new File("input/data.csv"));

        // Compute the similarity between users, according to their preferences
        UserSimilarity similarity = new EuclideanDistanceSimilarity(model);

        // Group the users with similar preferences
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1,
                similarity, model);

        // Create a recommender
        UserBasedRecommender recommender = new GenericUserBasedRecommender(
                model, neighborhood, similarity);

        // For the user with the id 1 get two recommendations
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem recommendation : recommendations) {
            log.info("User 1 might like the book with ID: "
                    + recommendation.getItemID()
                    + " (predicted preference :"
                    + recommendation.getValue() + ")");
        }
    }
}
Run the recommender:
mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.BasicRecommender"
The result should be the following:
...
User 1 might like the book with ID: 3 (predicted preference :3.4530818)
User 1 might like the book with ID: 4 (predicted preference :1.8203772)
As we can see, the recommender suggests both Book 3 and Book 4 to User 1, but Book 3 is more likely to be appreciated by this user, since its predicted preference is higher.
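The two predicted values can be checked by hand. The sketch below is plain Java with hypothetical names and no Mahout dependency. It rests on two assumptions about what the library does here: that the Euclidean similarity between two users is 1 / (1 + d / sqrt(n)), where d is the Euclidean distance over their n co-rated items, and that the predicted preference is the similarity-weighted average of the neighbours' ratings. With the data from input/data.csv these assumptions land on the same numbers as the output above, up to float rounding.

```java
// Hypothetical re-computation, not Mahout code.
final class ManualPrediction {

    // Ratings from input/data.csv as {userID, itemID, value}.
    static final double[][] DATA = {
        {1, 1, 5.0}, {1, 2, 2.0},
        {2, 1, 5.0}, {2, 2, 1.0}, {2, 3, 4.0}, {2, 4, 1.0},
        {3, 1, 1.0}, {3, 2, 5.0}, {3, 3, 2.0}, {3, 4, 4.0}
    };

    // Rating of a user for an item, or NaN when the user has not rated it.
    static double rating(int user, int item) {
        for (double[] row : DATA) {
            if (row[0] == user && row[1] == item) {
                return row[2];
            }
        }
        return Double.NaN;
    }

    // Assumed formula: 1 / (1 + d / sqrt(n)) over the n co-rated items.
    static double similarity(int a, int b) {
        double sumSquares = 0.0;
        int n = 0;
        for (int item = 1; item <= 4; item++) {
            double ra = rating(a, item);
            double rb = rating(b, item);
            if (!Double.isNaN(ra) && !Double.isNaN(rb)) {
                sumSquares += (ra - rb) * (ra - rb);
                n++;
            }
        }
        return n == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumSquares) / Math.sqrt(n));
    }

    // Similarity-weighted average over neighbours at or above the threshold.
    static double predict(int user, int item, double threshold) {
        double weighted = 0.0;
        double total = 0.0;
        for (int other = 1; other <= 3; other++) {
            if (other == user) {
                continue;
            }
            double sim = similarity(user, other);
            double r = rating(other, item);
            if (sim >= threshold && !Double.isNaN(r)) {
                weighted += sim * r;
                total += sim;
            }
        }
        return total == 0.0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        System.out.println("Book 3: " + predict(1, 3, 0.1));
        System.out.println("Book 4: " + predict(1, 4, 0.1));
    }
}
```

Under these assumptions User 2 is the closer neighbour (similarity about 0.59, against about 0.22 for User 3), so User 2's rating of 4.0 for Book 3 dominates and the prediction comes out near 3.45, while Book 4 ends up near 1.82.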
GitHub repository for this project: https://github.com/technobium/recommender
Conclusion
As you can see, Apache Mahout is a Java machine learning library that makes it easy to build a recommendation engine. In real life, the Apache Mahout recommendation engine is used by companies like LinkedIn, Yahoo, Twitter, Intel and Foursquare.
References
http://mahout.apache.org/users/recommender/userbased-5-minutes.html
“Mahout in Action”, Owen et al., Manning Pub. 2011: http://manning.com/owen/
An article describing recommender systems: http://spectrum.ieee.org/computing/software/deconstructing-recommender-systems