Getting started with Apache OpenNLP

This article is an introduction to the Apache OpenNLP library. We start with a short description of the library, describe a simple problem it can solve, and then build a small project that solves it. The tutorial requires basic Java programming skills.

Apache OpenNLP short description

OpenNLP is a Java library for natural language processing (NLP), released under the Apache License. NLP is the domain that deals with the interaction between computers and human language. Its main goal is to enable computers to extract meaning from natural language.

The Apache OpenNLP toolkit supports the following tasks:

  • tokenization
  • sentence segmentation
  • part-of-speech tagging
  • named entity extraction
  • chunking
  • parsing
  • coreference resolution

Named entity recognition problem

This tutorial gives a practical example of named entity extraction, also known as Named Entity Recognition (NER). The task consists of classifying text elements into predefined categories such as names of persons, organizations, locations, time expressions, quantities, monetary values and percentages.

We will demonstrate how to extract names of persons from a given text. As an example, given the text:

If President John F. Kennedy, after visiting France in 1961 with his immensely popular wife, famously described himself as “the man who had accompanied Jacqueline Kennedy to Paris,” Mr. Hollande has been most conspicuous on this state visit for traveling alone. (NYTimes article)

The NER algorithm should recognize three person entities: John F. Kennedy, Jacqueline Kennedy and Hollande.
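To make the expected result more concrete, here is a minimal sketch of how recognized entities can be represented as ranges over a token array and turned back into text. The TokenSpanDemo and TokenSpan names are hypothetical and written only for this illustration; they are not OpenNLP classes, although the idea is similar to the Span objects used later in this tutorial.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
class TokenSpanDemo {

    /** A half-open range [start, end) of token indexes plus an entity type. */
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins the tokens covered by each span with single spaces. */
    static List<String> spansToStrings(List<TokenSpan> spans, String[] tokens) {
        List<String> result = new ArrayList<>();
        for (TokenSpan span : spans) {
            String[] covered = Arrays.copyOfRange(tokens, span.start, span.end);
            result.add(String.join(" ", covered));
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens as a simple tokenizer might produce them: the period is its own token.
        String[] tokens = {"President", "John", "F", ".", "Kennedy", "visited", "France"};
        List<TokenSpan> spans = Arrays.asList(new TokenSpan(1, 5, "person"));
        System.out.println(spansToStrings(spans, tokens)); // prints [John F . Kennedy]
    }
}
```

Because the entity text is rebuilt by joining tokens with spaces, a name whose period is a separate token comes back as "John F . Kennedy" rather than "John F. Kennedy".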

Creating an entity extractor using Apache OpenNLP

Let’s put this simple task into practice. We will create a Maven project, add the Maven dependencies, download a pre-trained model file for person entity recognition and finally write a small Java class which parses two demo sentences using the downloaded model.

Prerequisites:

Create a Maven project

Create the Maven project using the following command:

mvn archetype:generate \
    -DarchetypeGroupId=org.apache.maven.archetypes \
    -DgroupId=com.technobium \
    -DartifactId=opennlp-ner \
    -DinteractiveMode=false

Rename the generated App.java file to BasicNameFinder.java using the following command (the file content is replaced in a later step):

mv opennlp-ner/src/main/java/com/technobium/App.java \
   opennlp-ner/src/main/java/com/technobium/BasicNameFinder.java

Add the OpenNLP and SLF4J libraries to this project:

cd opennlp-ner
nano pom.xml

Add the following lines to the dependencies section:

<dependencies>
    ...
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>1.5.3</version>
    </dependency>
</dependencies>

Create the entity extractor

Create an input folder inside the project to hold the downloaded model file. Prefer the latest model version available; this example uses the version 1.5 models.

mkdir input
cd input
curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
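The class in the next step loads the model through the relative path input/en-ner-person.bin, which Java resolves against the current working directory. The following sketch, using only the standard library, shows one way to check that the file is where the program expects it before loading. The ModelPathCheck class and requireModelFile method are hypothetical names introduced for this illustration.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of OpenNLP.
class ModelPathCheck {

    /** Resolves the path against the working directory and fails early if the file is missing. */
    static Path requireModelFile(String modelPath) throws FileNotFoundException {
        Path path = Paths.get(modelPath).toAbsolutePath().normalize();
        if (!Files.isRegularFile(path)) {
            throw new FileNotFoundException("Model file not found: " + path
                    + " (run from the project root, or download the model first)");
        }
        return path;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Same relative path that BasicNameFinder uses.
        System.out.println("Using model file: " + requireModelFile("input/en-ner-person.bin"));
    }
}
```

Printing the absolute path makes it easy to spot the case where the program is started from the input folder instead of the project root.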

Edit the BasicNameFinder class file and add the following content:

package com.technobium;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Hello world OpenNLP!
 * 
 */

public class BasicNameFinder {
    public static void main(String[] args) throws InvalidFormatException,
            IOException {

        Logger log = LoggerFactory.getLogger(BasicNameFinder.class);

        String[] sentences = {
                "If President John F. Kennedy, after visiting France in 1961 with his immensely popular wife,"
                        + " famously described himself as 'the man who had accompanied Jacqueline Kennedy to Paris,'"
                        + " Mr. Hollande has been most conspicuous on this state visit for traveling alone.",
                "Mr. Draghi spoke on the first day of an economic policy conference here organized by"
                        + " the E.C.B. as a sort of counterpart to the annual symposium held in Jackson"
                        + " Hole, Wyo., by the Federal Reserve Bank of Kansas City. " };

        // Load the model file downloaded from OpenNLP
        // http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
        TokenNameFinderModel model = new TokenNameFinderModel(new File(
                "input/en-ner-person.bin"));

        // Create a NameFinder using the model
        NameFinderME finder = new NameFinderME(model);

        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

        for (String sentence : sentences) {

            // Split the sentence into tokens
            String[] tokens = tokenizer.tokenize(sentence);

            // Find the names in the tokens and return Span objects
            Span[] nameSpans = finder.find(tokens);

            // Print the names extracted from the tokens using the Span data
            log.info(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
        }
    }
}

From the project root (the opennlp-ner folder, not the input folder), run the class using the following commands:

mvn compile
mvn exec:java -Dexec.mainClass="com.technobium.BasicNameFinder"

In the console you should see the persons recognized in each sentence:

...
[John F . Kennedy, Jacqueline Kennedy, Hollande]
[Draghi]
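The space in "John F . Kennedy" comes from tokenization: the period is split off as a separate token, and the name is rebuilt by joining tokens with spaces. The sketch below is a simplified, hypothetical tokenizer that splits text wherever the character class (letter, digit, other) changes. It assumes this is roughly how a character-class tokenizer such as SimpleTokenizer behaves and is not OpenNLP's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified tokenizer for illustration; not OpenNLP's implementation.
class CharClassTokenizerDemo {

    private static final int WHITESPACE = 0;
    private static final int LETTER = 1;
    private static final int DIGIT = 2;
    private static final int OTHER = 3; // punctuation and everything else

    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    /** Splits on whitespace and on every change of character class. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;           // start of the current token, -1 when between tokens
        int current = WHITESPACE; // character class of the current token
        for (int i = 0; i < text.length(); i++) {
            int cls = charClass(text.charAt(i));
            // In this sketch every OTHER character becomes a token of its own.
            boolean boundary = start >= 0 && (cls != current || cls == OTHER);
            if (boundary) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            if (cls != WHITESPACE && start < 0) {
                start = i;
                current = cls;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("President John F. Kennedy, in 1961");
        System.out.println(tokens);
        // prints [President, John, F, ., Kennedy, ,, in, 1961]
        System.out.println(String.join(" ", tokens.subList(1, 5)));
        // prints John F . Kennedy
    }
}
```

If the original spelling matters, one option is to keep character offsets for each token and cut the entity text out of the source sentence instead of joining tokens.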

GitHub repository for this project: https://github.com/technobium/opennlp-ner

Conclusion

This article offered a hands-on first experience with Apache OpenNLP. Natural language processing is a wide field of study which has boomed in the last few years. One of the best examples of successful usage of natural language processing is IBM Watson, a computer system capable of answering questions posed in natural language.

If you want to find out more about IBM Watson you can read the following article: Getting started with IBM Watson

References:

https://opennlp.apache.org/documentation.html

“Taming Text”, Ingersoll et al., Manning Publications, 2013 – http://manning.com/ingersoll/
