Find word similarities using Deeplearning4j Word2Vec
Word2Vec is a class of algorithms that solve the problem of word embedding. In plain English, these algorithms transform words into vectors of real numbers so that other NLP (Natural Language Processing) algorithms can work with them more easily. A remarkable quality of Word2Vec is its ability to find similarities between words. A simple example: if you tell it that the capital of France is Paris, Word2Vec can deduce the capital of Spain. Internally, the Word2Vec representation “knows” that Paris – France = Madrid – Spain. On a properly trained model, if we ask for Paris – France + Spain, the algorithm returns Madrid. For a very good introduction to word embeddings you can also read the following article: https://www.springboard.com/blog/introduction-word-embeddings/
In the following tutorial we will put the Word2Vec algorithm into practice using the implementation offered by the Deeplearning4j library.
Word2Vec overview
Neural networks are a modern computational approach that is revolutionising the current software landscape. By construction, neural networks need to work with real numbers, preferably with values between 0.0 and 1.0. In order to make use of neural networks in computational linguistics or NLP, we need a way to represent words as numbers in that range. This is where Word2Vec comes into play.
Word2Vec is implemented as a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors, one vector for each word found in the corpus. The vectors can be fed further into a deep-learning neural network or simply queried to detect relationships between words. The algorithm measures the similarity between words using cosine similarity: no similarity is expressed as a 90 degree angle, while total similarity is a 0 degree angle, meaning the words overlap.
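To make the cosine measure concrete, here is a minimal plain-Java sketch of the computation (the vectors would come from a trained model; the helper method and its name are just for illustration). Deeplearning4j exposes the same measure directly through its similarity() method, shown later in this article.

// Cosine similarity between two word vectors: dot product divided by
// the product of the vector norms. 1.0 means the vectors point the same
// way (total similarity), 0.0 means they are orthogonal (no similarity).
public static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}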
There are two main flavors for Word2Vec, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
- CBOW uses context to predict a target word.
- Skip-Gram uses a word to predict a target context. It was introduced by Mikolov et al. and is known to give better results than CBOW.
For a deeper understanding of the algorithms, see the following article: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/. This is one of the best descriptions I have found so far.
Deeplearning4j – Word2vec demo
Deeplearning4j is an open-source, distributed deep-learning library for the JVM (Java Virtual Machine). Supporting Java and Scala, and integrated with Hadoop and Spark, the library is designed to be used in business environments on distributed GPUs and CPUs. Skymind is its commercial support arm.
Now, we will use the Word2Vec Skip-Gram implementation from Deeplearning4j to find word similarities. We will train a model from scratch using the English translation of War and Peace by Leo Tolstoy. I chose this book because it has more than 570 000 words, a good corpus for our training. If you want to try the demo, you will need to do the following:
- clone the project from here: https://github.com/technobium/dl4j-word2vec/
- in the input/content.txt file paste the text of the book from here: http://www.textfiles.com/etext/FICTION/warpeace.txt
- run the main() method and enjoy experimenting
The part to notice here is the train() method:
public void train() throws IOException {
    SentenceIterator sentenceIterator = new FileSentenceIterator(new File(inputFilePath));

    TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(2)
            .layerSize(300)
            .windowSize(5)
            .seed(42)
            .epochs(3)
            .elementsLearningAlgorithm(new SkipGram<VocabWord>())
            .iterate(sentenceIterator)
            .tokenizerFactory(tokenizerFactory)
            .build();

    vec.fit();

    WordVectorSerializer.writeWordVectors(vec, "output/word2vec.bin");
}
You can tweak the learning algorithm as you want. For a reference of the options you can consult https://deeplearning4j.org/word2vec#training-the-model
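For example, switching from the Skip-Gram flavor to CBOW is, to the best of my knowledge, a one-line change in the builder; here is a sketch reusing the same iterator and tokenizer from the train() method above:

// Variation on the builder above: plug in CBOW instead of Skip-Gram.
// Both classes live in org.deeplearning4j.models.embeddings.learning.impl.elements.
Word2Vec cbowVec = new Word2Vec.Builder()
        .minWordFrequency(2)
        .layerSize(300)
        .windowSize(5)
        .seed(42)
        .epochs(3)
        .elementsLearningAlgorithm(new CBOW<VocabWord>()) // CBOW instead of SkipGram
        .iterate(sentenceIterator)
        .tokenizerFactory(tokenizerFactory)
        .build();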
Because we used a relatively small data set for training, we will not be able to demonstrate a full “equation” of dependent terms (like Paris – France = Madrid – Spain). We can see, though, that the algorithm extracted some useful insight from the given text.
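The queries below assume a word2VecModel variable holding the trained vectors. You can simply keep using the vec instance right after vec.fit(); alternatively, here is a sketch of reloading the serialized vectors (readWord2VecModel is the Deeplearning4j serializer call for this in recent versions):

// Reload the vectors written by train() so they can be queried.
Word2Vec word2VecModel = WordVectorSerializer.readWord2VecModel(new File("output/word2vec.bin"));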
Collection<String> list = word2VecModel.wordsNearest("boy", 10);
System.out.println(list);

list = word2VecModel.wordsNearest("girl", 10);
System.out.println(list);

Collection<String> stringList = word2VecModel.wordsNearest("day", 10);
System.out.println(stringList);
If you run the code above, you should get a result similar to the following:
boy: [monsieur*, lad, drummer, subaltern, child, darlings, karataev, frenchman, tikhon, angel]
girl: [black-eyed, woman, shes, blonde, wailed, creature, charming, foka, maid, lovely]
day: [morning, evening, affirm, night, september, sunday, after, dawning, departure, information]
As you can see, for the word “boy”, the algorithm could find similar words in the text such as: monsieur, lad, child. For “girl” we got: woman, blonde, charming, lovely and maid. Also, when given the word “day”, the results were even more promising: morning, evening, night, sunday, dawning, september. You can experiment with your own words, but don’t expect very accurate results, because the model is trained on a very small corpus.
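Besides wordsNearest(), the model also exposes the raw cosine similarity between two words, which is handy for checking how strong a particular relationship really is; here is a small sketch (the numeric values will vary from one training run to another):

// Cosine similarity scores from the trained model: values closer to 1.0
// mean the two words appear in more similar contexts.
double boyLad = word2VecModel.similarity("boy", "lad");
double boyDay = word2VecModel.similarity("boy", "day");
System.out.println("boy/lad: " + boyLad + ", boy/day: " + boyDay);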
If you want to experiment with a more reliable model, you can use one that was pre-trained on Google News: the Google News Corpus model. It includes word vectors for a vocabulary of approximately 3 million words. For comparison, the model we trained previously resulted in vectors for approximately 19 000 words. Most of the 570 000 words in the training corpus were either duplicates or were excluded because they didn’t meet the minimum word frequency (2 in our case).
After downloading the model file (1.6 Gb) you can load it like so:
File gModel = new File("/Path/to/your/file/GoogleNews-vectors-negative300.bin.gz");
Word2Vec word2VecModel = WordVectorSerializer.loadGoogleModel(gModel, true);
Before running the code for loading the Google News Corpus model, you should add the following options to your JVM:
-Xms1024m -Xmx10g -XX:MaxPermSize=2g |
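Once the Google News model is loaded, the “equation”-style queries from the introduction become feasible. The sketch below expresses Paris – France + Spain through wordsNearest() with positive and negative term lists (the exact ranking depends on the model, but Madrid should appear near the top):

// Analogy query: Paris - France + Spain ~ ?
// Positive terms are added, negative terms subtracted.
// Requires java.util.Arrays and java.util.Collection.
Collection<String> analogy = word2VecModel.wordsNearest(
        Arrays.asList("Paris", "Spain"),   // positive terms
        Arrays.asList("France"),           // negative terms
        5);
System.out.println(analogy);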
Conclusion
Beyond being useful for word embedding and extracting word similarities, Word2Vec extensions have been proposed for bioinformatics applications. Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in machine learning applications for proteomics and genomics (Wikipedia).