Getting started with Apache OpenNLP

This article is an introduction to the Apache OpenNLP library. We start with a short description of the library, describe a simple problem it can solve, then build a small project that solves it. The tutorial requires basic Java programming skills.

Apache OpenNLP short description

OpenNLP is a Java library for natural language processing (NLP), released under the Apache License. NLP as a domain deals with the interaction between computers and human language. Its main goal is to enable computers to extract meaning from natural language.

The Apache OpenNLP toolkit supports the following tasks:

  • tokenization
  • sentence segmentation
  • part-of-speech tagging
  • named entity extraction
  • chunking
  • parsing
  • coreference resolution

Named entity recognition problem

This tutorial gives a practical example of named entity extraction, also known as Named Entity Recognition (NER). The task involves classifying text elements into predefined categories such as names of persons, organizations and locations, expressions of time, quantities, monetary values and percentages.

We will demonstrate how to extract names of persons from a given text. As an example, given the text:

If President John F. Kennedy, after visiting France in 1961 with his immensely popular wife, famously described himself as “the man who had accompanied Jacqueline Kennedy to Paris,” Mr. Hollande has been most conspicuous on this state visit for traveling alone. (NYTimes article)

The NER algorithm should recognize three person entities: John F. Kennedy, Jacqueline Kennedy and Hollande.
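To make the expected result more concrete: a name finder usually works on a tokenized sentence and reports each entity as a range of token positions, with an inclusive start and an exclusive end, rather than as a ready-made string. The sketch below uses plain Java with no OpenNLP dependency. The class SpanSketch, the method spansToNames and the hand-written token positions are hypothetical and only illustrate how such ranges map back to names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: shows how token-index
// ranges (start inclusive, end exclusive) map back to entity strings.
public class SpanSketch {

    // Joins tokens[start..end) with single spaces for each {start, end} pair.
    public static List<String> spansToNames(String[] tokens, int[][] spans) {
        List<String> names = new ArrayList<>();
        for (int[] span : spans) {
            StringBuilder name = new StringBuilder();
            for (int i = span[0]; i < span[1]; i++) {
                if (name.length() > 0) {
                    name.append(' ');
                }
                name.append(tokens[i]);
            }
            names.add(name.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        // A shortened, pre-tokenized fragment of the example sentence.
        String[] tokens = {"President", "John", "F.", "Kennedy",
                "accompanied", "Jacqueline", "Kennedy", "to", "Paris"};
        // Hand-written spans standing in for what a person model might return.
        int[][] personSpans = {{1, 4}, {5, 7}};
        System.out.println(spansToNames(tokens, personSpans));
    }
}
```

In a real run the ranges come from the trained model instead of being written by hand, and their exact boundaries depend on how the tokenizer splits the sentence.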

Creating an entity extractor using Apache OpenNLP

Let’s put this simple task into practice. We will create a Maven project, add the Maven dependencies, download a pre-trained model file for person entity recognition and finally write a small Java class that runs the downloaded model on two demo sentences.

Prerequisites:

Create a Maven project

Create the Maven project using the following command:

Rename the default created App class to BasicNameFinder using the following command:

Add the OpenNLP and SLF4J libraries to this project:

Add the following lines to the dependencies section:

Create the entity extractor

Create an input folder to hold the downloaded model file. Download the latest available version of the model; this example uses version 1.5.

Edit the BasicNameFinder class file and add the following content:
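A minimal sketch of what such a class might look like is given here. It assumes that the person model was saved as input/en-ner-person.bin, that the opennlp-tools library is on the classpath and that the class lives in the default package. The two demo sentences are illustrative, and the sketch prints with System.out rather than through SLF4J, so treat it as a starting point rather than the project's exact code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class BasicNameFinder {

    public static void main(String[] args) throws IOException {
        // Assumption: location and file name of the downloaded person model.
        String modelPath = "input/en-ner-person.bin";

        // Illustrative demo sentences (assumptions, loosely based on the example text).
        String[] sentences = {
            "President John F. Kennedy visited France in 1961 with Jacqueline Kennedy.",
            "Mr. Hollande has been most conspicuous on this state visit for traveling alone."
        };

        try (InputStream modelIn = new FileInputStream(modelPath)) {
            // Load the pre-trained model and create the name finder.
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME nameFinder = new NameFinderME(model);

            for (String sentence : sentences) {
                // The name finder expects a tokenized sentence.
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize(sentence);

                // Each span is a token range (start inclusive, end exclusive).
                Span[] spans = nameFinder.find(tokens);

                System.out.println("Sentence: " + sentence);
                for (String name : Span.spansToStrings(spans, tokens)) {
                    System.out.println("  Person: " + name);
                }

                // Treat every sentence as an independent document.
                nameFinder.clearAdaptiveData();
            }
        }
    }
}
```

SimpleTokenizer needs no model file of its own, which keeps the sketch small; a trained tokenizer model may split the text differently and change which names are found.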

Run the class by using the following command:

In the console you should see the persons recognized in each sentence:

GitHub repository for this project: https://github.com/technobium/opennlp-ner

Conclusion

This article offered a hands-on first experience with Apache OpenNLP. Natural language processing is a wide field of study which has boomed in the last few years. One of the best examples of successful usage of natural language processing is IBM Watson, a computer system capable of answering questions posed in natural language.

If you want to find out more about IBM Watson you can read the following article: Getting started with IBM Watson

References:

https://opennlp.apache.org/documentation.html

“Taming Text”, Ingersoll et al., Manning Publications, 2013 – http://manning.com/ingersoll/
