Getting started with Apache OpenNLP
This article is an introduction to the Apache OpenNLP library. We start with a short description of the library, describe a simple problem it can solve, and then build a small project that solves it. The tutorial requires basic Java programming skills.
Apache OpenNLP short description
OpenNLP is a Java library for natural language processing (NLP), released under the Apache license. NLP is a field that deals with the interaction between computers and human language. Its main goal is to enable computers to extract meaning from natural language.
The Apache OpenNLP toolkit supports the following tasks:
- tokenization
- sentence segmentation
- part-of-speech tagging
- named entity extraction
- chunking
- parsing
- coreference resolution
Named entity recognition problem
This tutorial gives a practical example of named entity extraction, also known as Named Entity Recognition (NER). The task involves classifying text elements into predefined categories such as names of persons, organizations and locations, time expressions, quantities, monetary values, and percentages.
We will demonstrate how to extract names of persons from a given text. As an example, given the text:
If President John F. Kennedy, after visiting France in 1961 with his immensely popular wife, famously described himself as “the man who had accompanied Jacqueline Kennedy to Paris,” Mr. Hollande has been most conspicuous on this state visit for traveling alone. (NYTimes article)
The NER algorithm should recognize three person entities: John F. Kennedy, Jacqueline Kennedy, and Hollande.
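To make the input and output of the task concrete, here is a minimal sketch of entity tagging with a hand-written lookup table. It is not how OpenNLP works (the library relies on a trained statistical model). The class name ToyEntityTagger, the GAZETTEER table, and its entries are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyEntityTagger {

    // Hypothetical hand-written lookup table (token -> category).
    // A trained model replaces this table in a real NER system.
    static final Map<String, String> GAZETTEER = new HashMap<>();
    static {
        GAZETTEER.put("Kennedy", "PERSON");
        GAZETTEER.put("Hollande", "PERSON");
        GAZETTEER.put("France", "LOCATION");
        GAZETTEER.put("Paris", "LOCATION");
    }

    // Return "token/CATEGORY" for every token found in the table, in order.
    static List<String> tag(String[] tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String category = GAZETTEER.get(token);
            if (category != null) {
                result.add(token + "/" + category);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "accompanied Jacqueline Kennedy to Paris".split(" ");
        // prints [Kennedy/PERSON, Paris/LOCATION]
        System.out.println(tag(tokens));
    }
}
```

The lookup misses "Jacqueline", cannot tell where a multi-word name starts or ends, and ignores context. These are the gaps a trained model, such as the one used later in this tutorial, is meant to close.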
Creating an entity extractor using Apache OpenNLP
Let’s put this simple task into practice: create a Maven project, add the Maven dependencies, download a pre-trained model file for person entity recognition, and finally write a small Java class that processes two demo sentences using the downloaded model.
Prerequisites:
- Linux or Mac
- Java 1.7
- Apache Maven 3
Create a Maven project
Create the Maven project using the following command:
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DgroupId=com.technobium \
  -DartifactId=opennlp-ner \
  -DinteractiveMode=false
Rename the generated App class file to BasicNameFinder.java using the following command:
mv opennlp-ner/src/main/java/com/technobium/App.java \
  opennlp-ner/src/main/java/com/technobium/BasicNameFinder.java
Add the OpenNLP and SLF4J libraries to this project:
cd opennlp-ner
nano pom.xml
Add the following lines to the dependencies section:
<dependencies>
  ...
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.7</version>
  </dependency>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.5.3</version>
  </dependency>
</dependencies>
Create the entity extractor
Create an input folder to hold the downloaded model file. Download the latest available model version; this example uses version 1.5.
mkdir input
cd input
curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
Edit the BasicNameFinder.java file and replace its content with the following:
package com.technobium;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Hello world OpenNLP!
 *
 */
public class BasicNameFinder {

    public static void main(String[] args) throws InvalidFormatException,
            IOException {
        Logger log = LoggerFactory.getLogger(BasicNameFinder.class);

        String[] sentences = {
                "If President John F. Kennedy, after visiting France in 1961 with his immensely popular wife,"
                        + " famously described himself as 'the man who had accompanied Jacqueline Kennedy to Paris,'"
                        + " Mr. Hollande has been most conspicuous on this state visit for traveling alone.",
                "Mr. Draghi spoke on the first day of an economic policy conference here organized by"
                        + " the E.C.B. as a sort of counterpart to the annual symposium held in Jackson"
                        + " Hole, Wyo., by the Federal Reserve Bank of Kansas City. " };

        // Load the model file downloaded from OpenNLP
        // http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
        TokenNameFinderModel model = new TokenNameFinderModel(new File(
                "input/en-ner-person.bin"));

        // Create a NameFinder using the model
        NameFinderME finder = new NameFinderME(model);

        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

        for (String sentence : sentences) {

            // Split the sentence into tokens
            String[] tokens = tokenizer.tokenize(sentence);

            // Find the names in the tokens and return Span objects
            Span[] nameSpans = finder.find(tokens);

            // Print the names extracted from the tokens using the Span data
            log.info(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
        }
    }
}
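The model path in the class above is relative, so it resolves against the directory from which the program is launched. As a hedged sketch, a small check like the following could report a clear message when the file is missing. The class name ModelFileCheck and the helper requireModelFile are hypothetical and not part of OpenNLP; only the standard java.io API is used.

```java
import java.io.File;
import java.io.FileNotFoundException;

public class ModelFileCheck {

    // Hypothetical helper: return the file if it exists, otherwise fail with
    // a message that shows the absolute path that was actually tried.
    static File requireModelFile(String relativePath) throws FileNotFoundException {
        File file = new File(relativePath);
        if (!file.isFile()) {
            throw new FileNotFoundException("Model file not found: "
                    + file.getAbsolutePath()
                    + " (is the program started from the project root?)");
        }
        return file;
    }

    public static void main(String[] args) {
        try {
            File model = requireModelFile("input/en-ner-person.bin");
            System.out.println("Using model: " + model.getAbsolutePath());
        } catch (FileNotFoundException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

In BasicNameFinder, the returned File could then be passed to the TokenNameFinderModel constructor in place of the inline new File(...) expression.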
Run the class from the project root (the opennlp-ner folder, not the input folder) using the following commands:
mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.BasicNameFinder"
In the console you should see the persons recognized in each sentence:
...
[John F . Kennedy, Jacqueline Kennedy, Hollande]
[Draghi]
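The space in "John F . Kennedy" comes from tokenization: the tokenizer splits "F." into two tokens, and the span is printed by joining its tokens with single spaces. The sketch below imitates this with plain Java. It is a simplified stand-in written for this explanation, not OpenNLP's actual implementation, and the names TokenSpanSketch, classify, tokenize, and join are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSpanSketch {

    // Coarse character classes used to decide where tokens begin and end.
    enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    static CharClass classify(char c) {
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        return CharClass.OTHER;
    }

    // Start a new token whenever the character class changes; drop whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || classify(text.charAt(i)) != classify(text.charAt(start));
            if (boundary) {
                if (classify(text.charAt(start)) != CharClass.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
        }
        return tokens;
    }

    // Join tokens[start, end) with single spaces (end index is exclusive).
    static String join(List<String> tokens, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) sb.append(' ');
            sb.append(tokens.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("If President John F. Kennedy, after visiting France");
        // The name covers tokens 2 to 5: John, F, ., Kennedy
        System.out.println(join(tokens, 2, 6)); // prints John F . Kennedy
    }
}
```

If the original spelling is needed, one option is to keep the character offsets of each token and cut the name directly out of the input sentence instead of joining tokens.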
GitHub repository for this project: https://github.com/technobium/opennlp-ner
Conclusion
This article offered a hands-on first experience with Apache OpenNLP. Natural language processing is a wide field of study that has grown rapidly in the last few years. One of the best-known examples of its successful use is IBM Watson, a computer system capable of answering questions posed in natural language.
If you want to find out more about IBM Watson you can read the following article: Getting started with IBM Watson
References:
https://opennlp.apache.org/documentation.html
“Taming Text”, Ingersoll et al., Manning Pub. 2013 – http://manning.com/ingersoll/