from sklearn.datasets import fetch_20newsgroups

data, _ = fetch_20newsgroups(
    shuffle=True, random_state=123, remove=('headers', 'footers', 'quotes'), return_X_y=True
)
Introduction
Deep learning models based on the transformer architecture have taken the NLP world by storm in the last few years, achieving state-of-the-art results in several areas. An obvious example of this success is provided by the tremendous growth of the Hugging Face ecosystem, which provides access to a plethora of pre-trained models in a very user-friendly way.
However, we believe that models based on (static) word embeddings still have their place in the “transformer era”. Some reasons why this might be the case are the following:
- Transformer-based models are usually much bigger (i.e. more parameters) than “standard” models.
- Transformer models are not renowned for their (inference) speed—this is related to the previous point.
- Models based on word embeddings still provide a solid baseline.
Gensim is a great library for word embeddings and several other NLP tasks, especially if you want to train the models yourself. There might be cases where you would like to train two NLP models and have them “speak the same language”, i.e. share the same vocabulary. For the sake of concreteness, let’s say these two models are LSI and word2vec. The “standard” way of doing this, however, requires building two vocabularies, one for each model. In this post, we’ll show how to avoid this by transferring the vocabulary of the LSI model to the word2vec model.
LSI and word2vec the “standard” way
We will now build these two models following the “standard” procedure found in the respective Gensim documentation. In what follows, we will work with the 20 Newsgroups dataset, a collection of ca. 20,000 newsgroup documents grouped into 20 classes. The details are not important for our discussion. The dataset can be obtained with the sklearn.datasets.fetch_20newsgroups function of the scikit-learn library, which downloads and caches it.
LSI model
The first step in building an LSI model is to create a dictionary, which maps words to integer ids. This is easily achieved through the Dictionary class, to which we have to pass tokenised documents:
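from gensim.corpora import Dictionary
# `tokenizer` is any callable that turns a string into a list of tokens
tokenized_data = [tokenizer(doc) for doc in data]
dct = Dictionary(tokenized_data)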
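To make the role of the dictionary concrete, here is a minimal pure-Python sketch of the kind of bookkeeping it performs: assigning an integer id to each distinct token and counting occurrences. The function `build_vocab_maps` is hypothetical, and the names `token2id`, `cfs` and `dfs` are chosen only to echo Gensim's attributes. This is an illustration, not Gensim's implementation, and the ids it assigns need not match those of `Dictionary`.

```python
from collections import Counter

def build_vocab_maps(tokenized_docs):
    """Toy stand-in for a dictionary: ids plus collection/document frequencies."""
    token2id = {}
    cfs = Counter()  # id -> total number of occurrences in the corpus
    dfs = Counter()  # id -> number of documents containing the token
    for doc in tokenized_docs:
        for token in doc:
            # Assign the next free id the first time a token is seen
            token_id = token2id.setdefault(token, len(token2id))
            cfs[token_id] += 1
        for token in set(doc):
            dfs[token2id[token]] += 1
    return token2id, dict(cfs), dict(dfs)

docs = [["cat", "sat", "mat"], ["cat", "cat", "ran"]]
token2id, cfs, dfs = build_vocab_maps(docs)
print(token2id)
print(cfs[token2id["cat"]], dfs[token2id["cat"]])
```

The collection frequencies (`cfs`) are the counts that matter later, when the vocabulary is handed over to word2vec.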
With the help of the dictionary we can then build our corpus using the .doc2bow() method, which returns documents in a bag-of-words (BoW) representation. We could proceed with it, but a TF-IDF representation is preferable, for which we can use the TfidfModel class.
from gensim.models import TfidfModel

corpus = [dct.doc2bow(line) for line in tokenized_data]
tfidf_model = TfidfModel(corpus, id2word=dct)
tfidf_matrix = tfidf_model[corpus]
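For readers who want to see what these two representations look like, the following pure-Python sketch builds BoW vectors and reweights them on a toy corpus. It assumes the weighting tf · log2(N / df) followed by scaling each document to unit length, which, as far as we are aware, corresponds to the TfidfModel defaults. The helper names `doc2bow` and `tfidf` are ours and the code is not Gensim's implementation.

```python
import math
from collections import Counter

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def tfidf(bow_corpus):
    """Weight each count by log2(N / df), then scale each document to unit length."""
    n_docs = len(bow_corpus)
    # Each id appears at most once per BoW vector, so this counts documents
    df = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
    weighted = []
    for bow in bow_corpus:
        vec = [(i, tf * math.log2(n_docs / df[i])) for i, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in vec))
        # Drop zero weights (tokens present in every document)
        weighted.append([(i, w / norm) for i, w in vec if w > 0] if norm else [])
    return weighted

token2id = {"cat": 0, "sat": 1, "ran": 2}
docs = [["cat", "sat", "sat"], ["cat", "ran"]]
bows = [doc2bow(d, token2id) for d in docs]
print(bows)
print(tfidf(bows))
```

Note how "cat", which occurs in every toy document, ends up with zero weight and disappears from the TF-IDF vectors.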
We have everything we need to build our LSI model, which is conveniently done by the LsiModel class. Without further motivating this arbitrary choice, we set the number of latent dimensions to 200.
%%time
from gensim.models import LsiModel

dim_lsi = 200  # Topic number (latent dimension)
lsi_model = LsiModel(corpus=tfidf_matrix, id2word=dct, num_topics=dim_lsi)
CPU times: user 12.6 s, sys: 625 ms, total: 13.2 s
Wall time: 10.5 s
We now have an LSI model ready to be used! Let’s move on to word2vec.
word2vec model
The quickest way to train a word2vec model is through the Word2Vec class.
import multiprocessing

dim_w2v = dim_lsi  # Dimensionality of word vectors
alpha = 0.025  # Initial learning rate
alpha_min = 0.0001  # Drop learning rate to this value
wnd = 5  # Window size (max. distance to predicted word)
mincount = 2  # Word frequency lower bound
sample = 1e-5  # Threshold for downsampling
sg = 1  # Index 1 => Skip-Gram algo.
ngt = 10  # No. noisy words for negative sampling
epochs = 5  # No. epochs for training
cpus = multiprocessing.cpu_count()  # Tot. no. of CPUs
threads = cpus - 1  # Use this number of threads for training
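On a single-core machine, `cpus - 1` evaluates to zero. A small guard avoids that case; it is shown here as an optional variation, not as part of the original setup.

```python
import multiprocessing

cpus = multiprocessing.cpu_count()  # Total number of CPUs
threads = max(1, cpus - 1)          # Leave one core free, but never go below one
print(threads)
```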
%%time
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=tokenized_data, vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd,
    min_count=mincount, sample=sample, sg=sg, negative=ngt, epochs=epochs, workers=threads
)
CPU times: user 1min 31s, sys: 1.19 s, total: 1min 32s
Wall time: 53.7 s
Let’s double-check the number of words present in each of our two models:
print('Size of LSI vocab.:', len(dct.keys()))
print('Size of w2v vocab.:', len(w2v_model.wv.key_to_index.keys()))
Size of LSI vocab.: 42439
Size of w2v vocab.: 42439
We’ve hence managed to build an LSI and a word2vec model whose vocabularies contain exactly the same words. Great! However, this came at an unnecessarily high price, and we’ll shortly see why. When we create a new instance of the Word2Vec class, the following happens behind the scenes. First, a quick sanity check of the corpus is performed; then the vocabulary is built using the .build_vocab() method; lastly, the .train() method is executed, which trains the model. In the second step, a new vocabulary is built from scratch, even though we already built one for the LSI model. With small datasets this might be acceptable, but when the corpus is very large, optimising this step can save a lot of time!
LSI and word2vec the fast way
We will now see how we can build these models avoiding the above issue. To do that, we must split the creation of the model into three steps. We start by instantiating the model, but without passing it a corpus, i.e. leaving it uninitialised.
w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd,
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)
Thankfully, for the second step, Gensim offers an easy workaround: one can build a vocabulary from a dictionary of word frequencies, instead of from a sequence of sentences as .build_vocab() does by default. This can be done with the .build_vocab_from_freq() method, which requires a mapping from words to frequencies. The latter can be obtained from the LSI dictionary, specifically from the dct.cfs attribute, which maps token ids to their frequencies in the corpus.
%%time
# Step 2: borrow LSI vocab.
word_freq = {dct[k]: v for k, v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)

# Step 3: train model
num_samples = dct.num_docs
w2v_model.train(tokenized_data, total_examples=num_samples, epochs=epochs)
CPU times: user 1min 29s, sys: 812 ms, total: 1min 30s
Wall time: 37.1 s
(2609400, 6277620)
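The dictionary comprehension in step 2 simply translates an id-to-frequency mapping into a word-to-frequency mapping. The toy sketch below mimics this with plain dictionaries (`id2token` and `cfs` are hypothetical stand-ins for the Gensim dictionary) and checks that the result agrees with counting tokens directly over the corpus.

```python
from collections import Counter

# Toy stand-ins for the dictionary's id -> token and id -> frequency mappings
id2token = {0: "cat", 1: "sat", 2: "ran"}
cfs = {0: 3, 1: 1, 2: 1}

# Same comprehension as in step 2, with id2token playing the role of dct
word_freq = {id2token[k]: v for k, v in cfs.items()}

# Counting tokens directly over the (toy) corpus gives the same mapping
docs = [["cat", "sat", "cat"], ["cat", "ran"]]
direct = dict(Counter(token for doc in docs for token in doc))
print(word_freq == direct)
```

In other words, the frequencies word2vec would otherwise recompute in a full pass over the corpus are already stored in the dictionary.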
We have successfully created the word2vec model by borrowing the LSI vocabulary. This allowed us to skip an unnecessary step and hence to avoid wasting resources.
Note that in this case the speed-up is modest (a wall time of 37.1 s against 53.7 s), which is due to the very small size of the dataset (about 15 MB). However, it becomes considerable when working with huge datasets. The dataset on which I base these conclusions exceeds 50 GB, and this second approach saved me several hours!
Data streaming [optional]
This is a bonus section for those who have endured until this point. The motivation behind this post was to avoid unnecessary calculations, which makes particular sense when dealing with very large datasets. Very large datasets will most likely not fit in memory, but in the above code we have loaded everything into RAM. Ouch!
Gensim models are smart enough to accept iterables that stream the input data directly from disk. In this way, our corpus can be arbitrarily large (limited only by the size of our hard drive). We repeat here the steps of the above sections, but restructure our code to take advantage of data streaming. We assume the corpus to be stored in a single text file (20news.txt), which we get by writing the data list to file.
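Since the streaming code below treats each line of the file as one document, the file should contain exactly one document per line. The following sketch shows one way to write such a file. The helper `write_corpus`, the file name `toy_corpus.txt` and the choice of collapsing internal whitespace (including newlines) into single spaces are assumptions made for illustration; they are not necessarily how 20news.txt was produced.

```python
import tempfile
from pathlib import Path

def write_corpus(docs, path):
    """Write one document per line, collapsing internal whitespace (including newlines)."""
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            fh.write(" ".join(doc.split()) + "\n")

docs = ["first document\nspanning two lines", "second document"]
with tempfile.TemporaryDirectory() as tmp:
    corpus_file = Path(tmp) / "toy_corpus.txt"
    write_corpus(docs, corpus_file)
    lines = corpus_file.read_text(encoding="utf-8").splitlines()
print(lines)
```

Without such a step, a document containing line breaks would be split into several "documents" when the file is read back line by line.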
LSI model
As we’ve seen before, the first step is to create a dictionary. Before, we passed a list to Dictionary; now we pass it a generator:
from pathlib import Path

curpus_path = Path('20news.txt')
dct = Dictionary((tokenizer(line) for line in open(curpus_path)))
Step two consists of creating a corpus and switching to a TF-IDF representation. Here is where things change a bit. We need to define an iterable that yields documents in BoW representation, which is done by the Corpus class below.
class Corpus:
    '''Iterable that yields BoW representations of documents.'''
    def __init__(self, curpus_path, dct_object):
        self.curpus_path = curpus_path
        self.dct_object = dct_object

    def __iter__(self):
        # `sopen` is a file opener whose import is not shown in this post
        for line in sopen(self.curpus_path):
            yield self.dct_object.doc2bow(tokenizer(line))
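The reason for wrapping the stream in a class with an `__iter__` method, rather than using a plain generator, is that a generator can be consumed only once, whereas training makes several passes over the corpus (one per epoch, for instance). The self-contained sketch below illustrates the difference on a toy file; `LineStream` is a hypothetical name and whitespace splitting stands in for `tokenizer`.

```python
import tempfile
from pathlib import Path

class LineStream:
    """Restartable iterable: every call to __iter__ reopens the file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.split()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.txt"
    path.write_text("a b\nc d\n", encoding="utf-8")

    # The class-based iterable can be traversed as many times as needed
    stream = LineStream(path)
    first_pass, second_pass = list(stream), list(stream)

    # A generator is exhausted after the first traversal
    gen = (line.split() for line in path.read_text(encoding="utf-8").splitlines())
    gen_first, gen_second = list(gen), list(gen)

print(first_pass == second_pass)
print(gen_second)
```

A generator is still fine for building the dictionary, as done above, because that step needs a single pass only.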
We then use it to create our streamed corpus, which can be passed to TfidfModel. We’ll skip the explicit creation of the TF-IDF matrix because it can be very large.
corpus = Corpus(curpus_path, dct)
tfidf_model = TfidfModel(corpus, id2word=dct)
We are now ready to build our LSI model:
%%time
lsi_model = LsiModel(corpus=tfidf_model[corpus], id2word=dct, num_topics=dim_lsi)
CPU times: user 3min 58s, sys: 11 s, total: 4min 9s
Wall time: 3min 17s
word2vec model
Similar to what we did with the LSI model, we need to define an iterable that yields tokenized documents. This is provided by the CorpusW2V class below.
class CorpusW2V:
    '''Iterable that yields sentences (lists of str).'''
    def __init__(self, curpus_path):
        self.curpus_path = curpus_path

    def __iter__(self):
        for line in sopen(self.curpus_path):
            yield tokenizer(line)

corpus_w2v = CorpusW2V(curpus_path)
The rest follows exactly as above, the only difference being that the .train() method now receives an instance of the CorpusW2V class instead of a list (see tokenized_data above).
%%time
w2v_model = Word2Vec(
    vector_size=dim_w2v, alpha=alpha, min_alpha=alpha_min, window=wnd,
    min_count=mincount, sample=sample, sg=sg, negative=ngt, workers=threads
)

# Borrow LSI vocab.
word_freq = {dct[k]: v for k, v in dct.cfs.items()}
w2v_model.build_vocab_from_freq(word_freq)

# Train model
num_samples = dct.num_docs
w2v_model.train(corpus_w2v, total_examples=num_samples, epochs=epochs)
CPU times: user 5min 52s, sys: 4.78 s, total: 5min 56s
Wall time: 6min 31s
(2608743, 6277620)
We conclude by noting that this approach based on data streaming is certainly slower than when we load everything into memory. However, it allows us to process arbitrarily large datasets. One can’t have it all, as they say.
Acknowledgements and references
We must thank the Gensim community and in particular Austen Mack-Crane, on whose suggestions the section LSI and word2vec the fast way is based. The section Data streaming instead takes inspiration from the Gensim documentation and from Radim Řehůřek’s blog post about data streaming in Python.