Introduction

Deep learning models based on the transformer architecture have taken the NLP world by storm in the last few years, achieving state-of-the-art results in several areas. An obvious example of this success is provided by the tremendous growth of the Hugging Face ecosystem, which provides access to a plethora of pre-trained models in a very user-friendly way.

However, we believe that models based on (static) word embeddings still have their place in the “transformer era”. Some reasons why this might be the case are the following:

Transformer-based models are usually much bigger (i.e. more parameters) than “standard” models.
Transformer models are not renowned for their (inference) speed—this is related to the previous point.
Models based on word embeddings still provide a solid baseline.

Gensim is a great library for word embeddings and several other NLP tasks, especially if you want to train the models yourself. There might be cases where you would like to train two NLP models and have them “speak the same language”, i.e. share the same vocabulary. For the sake of concreteness, let’s say these two models are LSI and word2vec. The “standard” way of doing this, however, requires the preparation of two vocabularies, one for each model. In this post, we’ll show how to avoid this by transferring the vocabulary of the LSI model to the word2vec model.

LSI and word2vec the “standard” way

We will now build these two models following the “standard” procedure found in the respective Gensim documentation. In what follows, we will work with the 20 Newsgroups dataset, which is a collection of ca. 20,000 newsgroup documents grouped into 20 classes. The details of the dataset are not important for our discussion. It can easily be obtained with the sklearn.datasets.fetch_20newsgroups function of the scikit-learn library, which downloads and caches the dataset.

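from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers and quotes so that only the message bodies remain
data, _ = fetch_20newsgroups(
    shuffle=True, random_state=123, remove=('headers', 'footers', 'quotes'), return_X_y=True
)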

LSI model

The first step in building an LSI model is to create a dictionary, which maps words to integer ids. This is easily achieved through the Dictionary class, to which we have to pass tokenised documents:

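from gensim.corpora import Dictionary
# `tokenizer` is assumed to be any callable mapping a string to a list of tokens
tokenized_data = [tokenizer(doc) for doc in data]
dct = Dictionary(tokenized_data)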

With the help of the dictionary we can then build our corpus using the .doc2bow() method. This returns documents in a bag-of-words (BoW) representation. We could proceed with it, but a TF-IDF representation is preferable, for which we can use the TfidfModel class.

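from gensim.models import TfidfModel

corpus = [dct.doc2bow(line) for line in tokenized_data]  # BoW: lists of (token_id, count)
tfidf_model = TfidfModel(corpus, id2word=dct)
tfidf_matrix = tfidf_model[corpus]  # Lazily re-weighted view of the corpus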

We have everything we need to build our LSI model, which is conveniently done by the LsiModel class. Without further motivating this arbitrary choice, we set the number of latent dimensions to 200.

%%time

from gensim.models import LsiModel

dim_lsi = 200  # Number of topics (latent dimensions)
lsi_model = LsiModel(corpus=tfidf_matrix, id2word=dct, num_topics=dim_lsi)

CPU times: user 12.6 s, sys: 625 ms, total: 13.2 s
Wall time: 10.5 s

We now have an LSI model ready to be used! Let’s move on to word2vec.

word2vec model

The quickest way to train a word2vec model is through the Word2Vec class.

import multiprocessing

dim_w2v = dim_lsi  # Dimensionality of word vectors
alpha = 0.025  # Initial learning rate
alpha_min = 0.0001  # Drop learning rate to this value
wnd = 5        # Window size (max. distance to predicted word)
mincount = 2   # Word frequency lower bound
sample = 1e-5  # Threshold for downsampling
sg = 1         # Index 1 => Skip-Gram algo.
ngt = 10       # No. noisy words for negative sampling
epochs = 5     # No. epochs for training
cpus = multiprocessing.cpu_count()  # Tot. no. of CPUs
threads = max(1, cpus - 1)  # Use this number of threads for training (at least one)

%%time

from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=tokenized_data, vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd,
    min_count=mincount, sample=sample, sg=sg, negative=ngt, epochs=epochs, workers=threads
)

CPU times: user 1min 31s, sys: 1.19 s, total: 1min 32s
Wall time: 53.7 s

Let’s double-check the number of words present in each of our two models:

print('Size of LSI vocab.:', len(dct.keys()))
print('Size of w2v vocab.:', len(w2v_model.wv.key_to_index.keys()))

Size of LSI vocab.: 42439
Size of w2v vocab.: 42439

We have thus built an LSI and a word2vec model whose vocabularies contain exactly the same words. Great! However, this came at an unnecessarily high price, and we’ll shortly see why. When we create a new instance of the Word2Vec class and pass it a corpus, three things happen behind the scenes. First, a quick sanity check of the corpus is performed; then the vocabulary is built using the .build_vocab() method; lastly, the .train() method is executed, which trains the model. In the second step, a new vocabulary is built from scratch by scanning the whole corpus, even though we already did this work when creating the dictionary for the LSI model. With small datasets this duplication might be acceptable, but with a very large corpus, avoiding it can save a lot of time!

LSI and word2vec the fast way

We will now see how we can build these models avoiding the above issue. To do that, we must split the creation of the model into three steps. We start by instantiating the model, but without passing it a corpus, i.e. leaving it uninitialised.

w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd, 
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)

Thankfully, for the second step, Gensim offers an easy workaround: one can build a vocabulary from a dictionary of word frequencies, instead of from a sequence of sentences as done by default by .build_vocab(). This can be done with the method .build_vocab_from_freq(), which requires a mapping from words to their frequencies. The latter can be obtained from the LSI dictionary, specifically from the dct.cfs attribute, which maps each token id to its collection frequency (the total number of occurrences of that token in the corpus).

%%time

# Step 2: borrow LSI vocab.
word_freq = {dct[k]: v for k,v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)
# Step 3: train model
num_samples = dct.num_docs
w2v_model.train(tokenized_data, total_examples=num_samples, epochs=epochs)

CPU times: user 1min 29s, sys: 812 ms, total: 1min 30s
Wall time: 37.1 s

(2609400, 6277620)

We have been successful in creating the word2vec model by borrowing the LSI vocabulary. This allowed us to skip an unnecessary step and hence to avoid wasting resources.

Note that in this case the speed-up is barely observable, which is due to the very small size of the dataset (about 15 MB). However, this becomes considerable when working with huge datasets. The dataset on which I base these conclusions exceeds 50 GB and this second approach saved me several hours!

Data streaming [optional]

This is a bonus section for those who have endured until this point. The motivation behind this post was to avoid unnecessary calculations, which makes particular sense when dealing with very large datasets. Very large datasets will most likely not fit in memory, but in the above code we have loaded everything into RAM. Ouch!

Gensim models are smart enough to accept iterables that stream the input data directly from disk. In this way, our corpus can be arbitrarily large (limited only by the size of our hard drive). We repeat here the steps of the above sections, but restructure our code to take advantage of data streaming. We assume the corpus to be stored in a single text file (20news.txt), with one document per line, which we get by writing the data list to file.

Note

Loading a corpus into memory and then dumping it into a file obviously doesn’t make much sense; we’re doing this only to work with the same data as before. You will not need this step as you will be starting directly from some data stored in a (potentially very large) file.
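
For completeness, dumping a list of documents to a file might look like the sketch below. The helper write_one_doc_per_line and the toy data are hypothetical and not taken from the original code. The sketch assumes that each document must occupy exactly one line, because the streaming classes in this section treat every line of the file as one document, so newlines inside a document are collapsed into spaces.

```python
import tempfile
from pathlib import Path


def write_one_doc_per_line(docs, path):
    """Write each document on its own line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            # Collapse all whitespace, including newlines, into single spaces
            fh.write(" ".join(doc.split()) + "\n")


# Toy stand-in for the `data` list used earlier
toy_data = ["first line\nsecond line", "another\tdocument"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_corpus.txt"
    write_one_doc_per_line(toy_data, toy_path)
    written_lines = toy_path.read_text(encoding="utf-8").splitlines()

print(written_lines)
```

With the real data one would call the helper on the data list and the 20news.txt path instead of the temporary file used here.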

LSI model

As we’ve seen before, the first step is to create a dictionary. Before we passed a list to Dictionary, now we pass it a generator:

from pathlib import Path
curpus_path = Path('20news.txt')
with open(curpus_path) as f:  # Close the file once the dictionary is built
    dct = Dictionary(tokenizer(line) for line in f)

Step two consists of creating a corpus and switching to a TF-IDF representation. Here is where things change a bit. We need to define an iterable that yields documents in BoW representation, which is done by the Corpus class below.

class Corpus:
    '''Iterable that yields BoW representations of documents.'''
    
    def __init__(self, curpus_path, dct_object):
        self.curpus_path = curpus_path
        self.dct_object = dct_object
        
    def __iter__(self):
        for line in sopen(self.curpus_path):
            yield self.dct_object.doc2bow(tokenizer(line))

We then use it to create our streamed corpus, which can be passed to TfidfModel. We’ll skip the explicit creation of the TF-IDF matrix because it can be very large.

corpus = Corpus(curpus_path, dct)
tfidf_model = TfidfModel(corpus, id2word=dct)

We are now ready to build our LSI model:

%%time

lsi_model = LsiModel(corpus=tfidf_model[corpus], id2word=dct, num_topics=dim_lsi)

CPU times: user 3min 58s, sys: 11 s, total: 4min 9s
Wall time: 3min 17s

Important

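We need to use restartable iterables and not generators, even though looping over either one involves an iterator. A generator is itself a one-shot iterator: once it has been exhausted, there is no more data available. In contrast, an iterable such as the Corpus class creates a new iterator every time it is looped over. This is exactly what we need when creating a model, because the dataset has to be iterated over more than once (for instance, word2vec makes one pass per training epoch). Passing a generator to Dictionary above was fine only because the dictionary is built in a single pass.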
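
The difference is easy to see in plain Python. The sketch below uses made-up names (token_stream, TokenCorpus) and an in-memory list instead of a file, purely for illustration: the generator yields nothing on the second pass, while the class-based iterable can be looped over again.

```python
def token_stream(lines):
    """Generator function: calling it returns a one-shot iterator."""
    for line in lines:
        yield line.split()


class TokenCorpus:
    """Restartable iterable: __iter__ builds a fresh iterator on every loop."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["a b", "c d"]

gen = token_stream(lines)
first_pass_gen = list(gen)   # consumes the generator
second_pass_gen = list(gen)  # nothing left to yield

corpus_iter = TokenCorpus(lines)
first_pass = list(corpus_iter)
second_pass = list(corpus_iter)  # a new iterator, hence the same data again

print(second_pass_gen, second_pass)
```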

word2vec model

Similar to what we did with the LSI model, we need to define an iterable that yields tokenized documents. This is provided by the CorpusW2V class below.

class CorpusW2V:
    '''Iterable that yields sentences (lists of str).'''

    def __init__(self, curpus_path):
        self.curpus_path = curpus_path

    def __iter__(self):
        for line in sopen(self.curpus_path):
            yield tokenizer(line)

corpus_w2v = CorpusW2V(curpus_path)

The rest follows exactly as above, with the only difference that now the .train() method receives an instance of the CorpusW2V class instead of a list (see tokenized_data above).

%%time

w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd, 
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)
# Borrow LSI vocab.
word_freq = {dct[k]: v for k,v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)
# Train model
num_samples = dct.num_docs
w2v_model.train(corpus_w2v, total_examples=num_samples, epochs=epochs)

CPU times: user 5min 52s, sys: 4.78 s, total: 5min 56s
Wall time: 6min 31s

(2608743, 6277620)

We conclude by noting that this approach based on data streaming is certainly slower than when we load everything into memory. However, it allows us to process arbitrarily large datasets. One can’t have it all, as they say.

Acknowledgements and references

We must thank the Gensim community and in particular Austen Mack-Crane, on whose suggestions the section LSI and word2vec the fast way is based. The section Data streaming instead draws inspiration from the Gensim documentation and from Radim Řehůřek’s blog post about data streaming in Python.