A glimpse into SyntaxNet – the world’s most accurate parser

In spring 2016, Google released SyntaxNet as an open source project. Big news! But what is it? SyntaxNet is a syntactic parser powered by a neural network and built on TensorFlow, an open source software library for numerical computation using data flow graphs. A syntactic parser is a software component that takes a text as input and returns the same text enriched with part-of-speech (POS) tags and the dependency relations between words. For example, in the sentence "The cat sleeps", it labels "The" as a determiner (DET), "cat" as a noun (NOUN) and "sleeps" as a verb (VERB). For Natural Language Understanding (NLU) problems, a POS tagger helps in understanding the syntactic function of each word and the syntactic relationships between the words in a sentence.

Let’s say you want to build a chat bot. In order for the system to really understand the human input, it first needs to understand the meaning of that input. This is where NLU, and more precisely a POS tagger, is needed.

The SyntaxNet project includes the code needed to train SyntaxNet models on your own data, as well as Parsey McParseface, a pre-trained English parser. If you are interested in testing the English parser, you can follow this nice article: http://www.whycouch.com/2016/07/how-to-install-and-use-syntaxnet-and.html.
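
Once SyntaxNet is built from source as described in that article, parsing English is a one-liner. The demo.sh script ships with the SyntaxNet repository and runs Parsey McParseface on whatever you pipe into it (run it from the SyntaxNet working directory):

echo "Bob brought the pizza to Alice." | syntaxnet/demo.sh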

If you want to work with other languages, you can train your own model using data from the Universal Dependencies project. This is also a very interesting project, containing annotated training data for the following languages (a snippet after the list shows how to fetch a pre-trained model for any of them):

  • Ancient Greek
  • Arabic
  • Basque
  • Bulgarian
  • Catalan
  • Chinese
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Gothic
  • Greek
  • Hebrew
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Kazakh
  • Latin
  • Latvian
  • Norwegian
  • Old_Church_Slavonic
  • Persian
  • Polish
  • Portuguese
  • Portuguese-BR (Brazilian Portuguese)
  • Romanian
  • Russian
  • Slovenian
  • Spanish
  • Swedish
  • Tamil
  • Turkish
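
Google also published pre-trained Parsey Universal models for these languages. The German model used later in this post lives at http://download.tensorflow.org/models/parsey_universal/German.zip, and the other languages appear to follow the same naming pattern, with underscores for multi-word names; French is used here only as an example:

curl -O http://download.tensorflow.org/models/parsey_universal/French.zip
unzip French.zip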

German POS Tagger

The following steps will guide you through the setup of a SyntaxNet instance, so that you can decide for yourself whether this really is the world’s most accurate parser. For the current example I used German, but you can use any of the languages supported by Universal Dependencies.

Prerequisites: install Docker on Ubuntu (kernel version 3.10 or higher)

sudo apt-get install docker.io
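
To verify that Docker was installed correctly, you can run its standard test image (a quick sanity check, unrelated to SyntaxNet itself):

sudo docker run hello-world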

Next, prepare the Docker image. The build may take a while, depending on your machine.

mkdir build
cd build
# Fetch the official SyntaxNet Dockerfile from the tensorflow/models repository
wget https://raw.githubusercontent.com/tensorflow/models/master/syntaxnet/Dockerfile
# Build the image and tag it "syntaxnet"
sudo docker build -t syntaxnet .

Run the newly created image:

# -it gives you an interactive terminal inside the container
sudo docker run -it syntaxnet bash

You should now be at a bash prompt inside the Docker container. Run the following commands:

mkdir models
cd models
# Download and unpack the pre-trained Parsey Universal model for German
curl -O http://download.tensorflow.org/models/parsey_universal/German.zip
unzip German.zip
# Point parse.sh to the unpacked model
export MODEL_DIRECTORY=/opt/tensorflow/models/syntaxnet/models/German
cd ..
# Parse a sample sentence read from standard input
echo "Ich bin ein Berliner." | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY

The output should be something like this (CoNLL-style columns: token index, word form, universal POS tag, morphological features, index of the head token, and dependency relation):

1	Ich	_	PRON	_	fPOS=PRON++	4	nsubj	_	_
2	bin	_	VERB	_	fPOS=VERB++	4	cop	_	_
3	ein	_	DET	_	fPOS=DET++	4	det	_	_
4	Berliner.	_	NOUN	_	fPOS=NOUN++	0	ROOT	_	_
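
Since parse.sh reads from standard input, you can just as easily pipe in a whole file. Here input.txt stands for a hypothetical file with one sentence per line:

cat input.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > parsed.conll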

Enjoy POS tagging your preferred language and share your experience 🙂

References

https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

http://www.whycouch.com/2016/07/how-to-install-and-use-syntaxnet-and.html

http://universaldependencies.org/

 
