Bringing the power of neural nets to tree-based models.
Tags: regression, random forests, embeddings, pytorch
Author: Luca Papariello
Published: September 19, 2023
Introduction
Despite the buzz around generative AI, most applications in industry still originate from tabular datasets. In tabular data, some columns represent numerical variables, like atmospheric pressure, while others are categorical variables, like sex or product category, which can take only a limited number of values. The path to using numerical variables is relatively smooth and requires little preprocessing (for some algorithms, even none at all). In contrast, categorical variables must first be converted into numbers, since that is what a model can process, and there are many ways to do this.
Two conventional approaches to encode categorical variables are ordinal encoding and one-hot encoding. Ordinal encoding assigns a different integer to each unique value and, as such, assumes an ordering of the categories. For instance, “Size” could be the name of a categorical column with values small < medium < large, which would be mapped to 0, 1, and 2, respectively. One-hot encoding instead creates a new column for each possible value in the original data, indicating its presence or absence. If the first sample in the dataset were large, we would get the following one-hot representation: (0, 0, 1). Unlike ordinal encoding, one-hot encoding does not assume an ordering of the categories.
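To make the two encodings concrete, here is a minimal sketch on the “Size” example using pandas. The toy values and variable names are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy "Size" column (illustrative values only)
sizes = pd.Series(["large", "small", "medium", "large"], name="Size")

# Ordinal encoding: one integer per category, following small < medium < large
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
ordinal = ordered.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(pd.Series(ordered), dtype=int)

print(ordinal.tolist())          # [2, 0, 1, 2]
print(one_hot.iloc[0].tolist())  # [0, 0, 1] -> the first sample is "large"
```

The ordinal codes carry an order (0 < 1 < 2), whereas the one-hot rows are all equally distant from each other.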
However, both ordinal and one-hot encodings have their limitations. To name a few, in ordinal encoding, the assigned numerical values may introduce unintended relationships or orders between categories that do not exist in the original data. One-hot encoding, while solving the issue of introducing unintended order, can lead to high-dimensional and sparse representations for features with a large number of unique categories. This translates into increased computational complexity and memory usage.
In 2015, the Rossmann sales competition took place on Kaggle. The solution of one of the gold medal winners clearly diverged from the others by using a deep learning model, in one of the first known examples of a cutting-edge deep learning model for tabular data. Rather than using traditional encoding methods, the authors introduced the concept of Entity Embeddings. Entity Embeddings provide a way to represent categorical variables as low-dimensional continuous vectors, capturing the underlying semantic relationships between categories. This approach eliminates the limitations of ordinal encoding’s introduced order and one-hot encoding’s curse of dimensionality. Their approach is summarised in the paper Entity Embeddings of Categorical Variables, by Cheng Guo and Felix Berkhahn.
We’ll explore here the concepts and advantages of Entity Embeddings, diving deep and trying to replicate the main findings of the paper. Along the way, we’ll offer a comprehensive understanding of their applications in general machine learning models.
Dataset
We use the very same dataset of the Rossmann sales competition that was used by the authors of the paper; it is readily available on Kaggle. The dataset consists of two parts. The first one is train.csv and comprises daily sales data for several different stores, while the second one is store.csv and provides additional details about each of these stores. Since the focus here is not on obtaining the best possible result, but on the representation of features and the introduction of Entity Embeddings, we’ll restrict our attention to the base features provided in the first file. Here is a snippet of this dataset:
Most of the fields are self-explanatory and their description can be found on the competition webpage. What we need to know here is that Sales is our target variable and represents the turnover on a given day. Apart from the column Date, which is a type in its own right, and Customers, which is not available at test time and will hence not be considered, all the features of this dataset are categorical. It should be noted that while most of them have a low cardinality, Store, which is a unique ID identifying each shop, can take a whopping 1115 different values!
Before moving on, we get rid of closed stores, which have zero sales, as we’ll not make any predictions on them at test time.
Our training data spans approximately 2.5 years from 2013 to mid-2015. The test data covers instead the subsequent part of 2015, with no overlap with the training set dates. To test the generalisation capability of our model, we’ll try to put ourselves in the closest possible situation to this by sorting the training data by Date and keeping the last 10% of the samples for validation. In this way, we use the older samples for training and the most recent ones for validation. This should ensure that the performance observed on the validation set is as close as possible to that of the leaderboard after submission.
We’ll also drop samples from the validation set for which there are zero sales. This is to align ourselves with the competition, whose website states that “Any day and store with 0 sales is ignored in scoring.”[1]
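The filtering and splitting steps described above could be sketched as follows. The helper name `time_split`, its `valid_frac` argument, and the toy data are assumptions for illustration; only the column names Open, Date, and Sales come from the dataset.

```python
import pandas as pd

def time_split(df: pd.DataFrame, valid_frac: float = 0.1):
    "Drop closed stores, sort by date, and hold out the most recent rows (hypothetical helper)."
    df = df[df["Open"] != 0].sort_values("Date").reset_index(drop=True)
    split = len(df) - int(len(df) * valid_frac)
    train, valid = df.iloc[:split], df.iloc[split:]
    # Days with zero sales are ignored in the competition scoring
    valid = valid[valid["Sales"] > 0]
    return train, valid

# Toy data: one closed store-day and one open store-day with zero sales
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=21, freq="D"),
    "Open": [0] + [1] * 20,
    "Sales": [0] + [100] * 19 + [0],
}).sample(frac=1, random_state=0)  # shuffle to show that sorting matters

train_df, valid_df = time_split(toy)
print(len(train_df), len(valid_df))
```

Every training date precedes every validation date, which mimics the relation between the competition’s training and test periods.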
As a first (and only) operation to prepare the data, we enrich the representation of dates. We keep it simple and only create new columns for Day, Month, and Year out of Date. After all, there are already the DayOfWeek, StateHoliday, and SchoolHoliday columns providing additional information in this regard.
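```python
import pandas as pd

def proc_data(df: pd.DataFrame):
    "Process DataFrame and create date features inplace."
    # `Date` must already be a datetime column for the `.dt` accessor to work
    df['Day'] = df.Date.dt.day
    df['Month'] = df.Date.dt.month
    df['Year'] = df.Date.dt.year
    # Treat a missing `Open` flag as "open"
    df['Open'] = df.Open.fillna(1).astype(int)

proc_data(train_xs)
proc_data(valid_xs)
proc_data(test_df)
```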
We’re now ready to map our categorical variables to integers using scikit-learn’s OrdinalEncoder. In addition, we only select the relevant features.
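The encoding step is not shown in the post; a minimal sketch on toy data might look like this. The toy frames are assumptions, while the name `oe` matches the encoder referred to later in the post.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the real training and validation features (illustrative only)
toy_train = pd.DataFrame({"Store": [3, 1, 3], "StateHoliday": ["0", "a", "0"]})
toy_valid = pd.DataFrame({"Store": [1, 3], "StateHoliday": ["a", "a"]})

oe = OrdinalEncoder()
# Fit on the training data only, then reuse the same mapping elsewhere
train_enc = oe.fit_transform(toy_train).astype(int)
valid_enc = oe.transform(toy_valid).astype(int)

# Number of unique categories per column, needed later for the embedding layers
num_unique = [len(c) for c in oe.categories_]
print(train_enc.tolist(), num_unique)
```

With default settings the encoder raises an error on categories it has not seen during fitting, so how to handle unseen values at test time is a design choice left open here.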
Let’s have a look at our preprocessed dataframe, which is now ready to be passed to our first model.
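```python
train_xs.tail(3)
```

|        | Store | DayOfWeek | Day | Month | Year | Promo | StateHoliday | SchoolHoliday |
|--------|-------|-----------|-----|-------|------|-------|--------------|---------------|
| 759949 | 351   | 5         | 1   | 4     | 2    | 0     | 0            | 0             |
| 759950 | 350   | 5         | 1   | 4     | 2    | 0     | 0            | 0             |
| 759951 | 364   | 5         | 1   | 4     | 2    | 0     | 0            | 0             |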
Evaluation metric
Before moving on to the model, we want to take a look at one last important piece: the evaluation metric. Submissions for this Kaggle competition are evaluated in terms of Root Mean Square Percentage Error (RMSPE), so we’ll use the same metric here. Additionally, we pick another common evaluation metric, the Mean Absolute Percentage Error (MAPE). This is the metric used in the publication and will allow us to benchmark our results against theirs.
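Both metrics are short to write down. The sketch below uses NumPy and the standard definitions; the function names are assumptions rather than the post’s own code.

```python
import numpy as np

def rmspe(y_true, y_pred):
    "Root Mean Square Percentage Error; undefined when a target is zero."
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

def mape(y_true, y_pred):
    "Mean Absolute Percentage Error; undefined when a target is zero."
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Both predictions are off by 10% of the true value
print(rmspe([100, 200], [110, 180]), mape([100, 200], [110, 180]))
```

Because RMSPE squares the relative errors, it penalises a few large misses more heavily than MAPE does.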
Baseline model
We choose Random Forests, an ensemble of decision trees, as our starting point. The reason is that it requires little preprocessing, isn’t very sensitive to hyperparameters, and generally provides a strong baseline.
For convenience, we define an rf function that returns a fitted Random Forests regression model. We are now ready to fit our first model and check its performance, which is expressed in terms of RMSPE (the competition metric) and MAPE (the metric used in the paper).
```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=30, max_samples=200_000, max_depth=35, min_samples_leaf=5, **kwargs):
    "Return a fitted Random Forests regression model."
    return RandomForestRegressor(
        n_jobs=-1,  # use all available cores
        n_estimators=n_estimators,
        max_samples=max_samples,  # rows drawn to train each tree
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=34,
        **kwargs
    ).fit(xs, y)
```
Submitting the predictions of this model would give a score of about 0.303 on the private test set. Note how close this is to the score on our validation set! This is good news because it suggests that we can trust our validation set. However, this result would put us a long way from the top of the leaderboard, where the winner had an impressive score of 0.1.
Entity embeddings
Embeddings are at the heart of many recommender systems, particularly in cases where the possibility of employing a content-based approach is ruled out due to the lack of information about users and items. Collaborative filtering offers a (somewhat surprising) solution for predicting user preferences based only on the interests of other users. The key idea here is that of latent dimensions, which are features that describe users and items and are automatically discovered by a model. The resulting matrices containing the latent factors of users and items are exactly the user and item embedding matrices. Typically, they are determined via low-rank matrix factorisation or randomly initialised and improved during the training of a neural network.
The entity embeddings we discuss today generalise this same idea to any categorical variable, i.e. we want to learn a low-dimensional vector representation for each category in a given categorical feature. We’ll optimise these embeddings during the training process along with the rest of the model’s parameters. The underlying intuition is that by representing categories as continuous vectors, the model can learn meaningful relationships and similarities between categories through their proximity in the embedding space. This allows the model to capture complex interactions and dependencies among categorical variables, which can be very valuable for predictive tasks.
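One way to see what an embedding layer does: looking up a row of its weight matrix gives the same result as multiplying a one-hot vector by that matrix. The sketch below checks this with PyTorch; the sizes (3 categories, 2 dimensions) are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# 3 categories, each represented by a learnable 2-dimensional vector
emb = nn.Embedding(num_embeddings=3, embedding_dim=2)
idx = torch.tensor([2, 0])  # ordinal-encoded categories of two samples

looked_up = emb(idx)                              # direct row lookup
one_hot = F.one_hot(idx, num_classes=3).float()   # shape (2, 3)
via_matmul = one_hot @ emb.weight                 # same rows, via matrix product

print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever materialising the sparse one-hot matrix, which is what makes embeddings practical for high-cardinality features such as Store.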
Training the embeddings
The deep learning framework we’ll use to build and train our neural network is PyTorch. In PyTorch, a custom dataset is built by subclassing Dataset and overriding two dunder methods: __getitem__ and __len__. In the following, we implement our Dataset so that it returns one tensor for the features and one for the target for the training and validation sets, and only the features for the test set.
```python
import torch
from torch.utils.data import Dataset

class TabDataset(Dataset):
    "Tabular Dataset that yields categorical features and target."
    def __init__(self, feats, new_data=True, tgt=None):
        self.x_cat = torch.tensor(feats.values, dtype=torch.int32)
        self.n_samples = len(feats)
        self.new_data = new_data
        # Targets are only available for the training and validation sets
        if not new_data:
            self.y = torch.tensor(tgt.values, dtype=torch.float32).view(-1, 1)

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx: int):
        if not self.new_data:
            return self.x_cat[idx], self.y[idx]
        return self.x_cat[idx]
```
Next, we can build the training and validation DataLoaders, which take a Dataset and return an iterable that handles shuffling, batching, and all the rest for us.
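The DataLoader construction is not shown in the post. Here is a minimal sketch, assuming a batch size of 4 and using a TensorDataset with random data as a stand-in for the TabDataset above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 10 samples with 8 ordinal-encoded features and one target each
x_cat = torch.randint(0, 5, (10, 8))
y = torch.rand(10, 1)
train_ds = TensorDataset(x_cat, y)

# Shuffle the training batches; a validation loader would use shuffle=False
train_dl = DataLoader(train_ds, batch_size=4, shuffle=True)

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape, len(train_dl))
```

In practice the batch size is much larger and is tuned together with the learning rate.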
Now comes the main part: the construction of the neural network that will learn the entity embeddings. In PyTorch, we can easily create custom models by subclassing nn.Module. In a nutshell, this neural network consists of an initial part of embedding layers (these represent our matrices of latent factors) followed by a fully connected part.
```python
import torch
from torch import nn

# Note: `LinReLu` is a helper that is not shown in this post.
class TabularNeuralNet(nn.Module):
    "Neural network model for tabular data."
    def __init__(
        self,
        emb_szs: list,  # List of (unique_cats, embedding_dim)
        out_sz: int,    # Number of outputs for final layer
        layers: list,   # Size of the hidden layers
        act_cls=nn.ReLU(inplace=True),  # Activation type after `Linear` layers
    ):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb = n_emb
        sizes = [n_emb] + layers + [out_sz]
        # No activation after the final (output) layer
        actns = [act_cls for _ in range(len(sizes) - 2)] + [None]
        _layers = [LinReLu(sizes[i], sizes[i + 1], act=a) for i, a in enumerate(actns)]
        self.layers = nn.Sequential(*_layers)

    def forward(self, x_cat):
        # Embed each categorical column, then concatenate along the feature axis
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        x = torch.cat(x, 1)
        return self.layers(x)
```
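The model relies on a LinReLu building block whose definition does not appear in the post. Judging from how it is called (input size, output size, optional activation), it is probably a linear layer followed by an optional activation. The following is a hypothetical reconstruction under that assumption, not the author’s code.

```python
import torch
from torch import nn

class LinReLu(nn.Module):
    "Linear layer optionally followed by an activation (hypothetical reconstruction)."
    def __init__(self, n_in: int, n_out: int, act=None):
        super().__init__()
        self.lin = nn.Linear(n_in, n_out)
        self.act = act  # e.g. nn.ReLU(); None means a purely linear layer

    def forward(self, x):
        x = self.lin(x)
        return x if self.act is None else self.act(x)

torch.manual_seed(0)
hidden = LinReLu(4, 3, act=nn.ReLU())
head = LinReLu(3, 1)  # output layer without activation
out = head(hidden(torch.randn(5, 4)))
print(out.shape)
```

The original helper may well include extras such as batch normalisation or dropout; nothing in the post confirms or rules that out.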
The only thing that remains for us to do is to determine the emb_szs parameter. The part concerning the number of unique categories can easily be found using the OrdinalEncoder we fitted above, but the embedding size for each categorical feature is a free parameter. As such, there is no clear-cut way to determine it but it’s typically much smaller than the number of unique categories. We are thus thankful that the authors report these values in their paper.
```python
num_unique = [len(c) for c in oe.categories_]
emb_dims = [10, 6, 10, 6, 2, 1, 3, 1]
embed_sizes = [(u, e) for u, e in zip(num_unique, emb_dims)]
```
What follows is the classic PyTorch training loop, which we encapsulate for convenience in two functions, train_one_epoch and validate_one_epoch, that do what they say: train and validate the model over one epoch, respectively. We train our model for a few epochs using the standard Mean Squared Error (MSE) loss function and Adam optimiser.
```python
for ep in range(EPOCHS):
    print(f"Epoch {ep+1}\n" + "-" * 31)
    train_one_epoch(train_dl, nn_model, loss_func, optim)
    validate_one_epoch(valid_dl, nn_model, loss_func)
```
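The two helpers called in the loop are not shown in the post. Below is a minimal sketch of what they might look like, assuming the argument order used above. The bodies, the returned mean loss, and the toy model and data are assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(dl, model, loss_func, optim):
    "Run one pass over `dl`, updating `model`; return the mean batch loss."
    model.train()
    total = 0.0
    for xb, yb in dl:
        loss = loss_func(model(xb), yb)
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += loss.item()
    return total / len(dl)

def validate_one_epoch(dl, model, loss_func):
    "Evaluate `model` on `dl` without gradient tracking; return the mean batch loss."
    model.eval()
    total = 0.0
    with torch.no_grad():
        for xb, yb in dl:
            total += loss_func(model(xb), yb).item()
    return total / len(dl)

# Toy regression problem to exercise the two functions
torch.manual_seed(0)
x = torch.rand(64, 3)
y = x.sum(dim=1, keepdim=True)
dl = DataLoader(TensorDataset(x, y), batch_size=16)
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.MSELoss()

before = validate_one_epoch(dl, model, loss_func)
for _ in range(20):
    train_one_epoch(dl, model, loss_func, optim)
after = validate_one_epoch(dl, model, loss_func)
print(before, after)
```

Switching between `model.train()` and `model.eval()` has no effect on this toy network, but it matters as soon as layers such as dropout or batch normalisation are involved.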
That’s it! By training our neural network we obtained the entity embeddings. As a nice side-product we also got a fully functioning model that we can use to forecast the sales.
It turns out that this model works much better than our previous baseline, lowering both RMSPE and MAPE substantially. Submitting these results would bring the score on the private test set down to about 0.17. We are definitely moving in the right direction!
Random Forests with Entity Embeddings
Now that we have trained the neural network, we have the entity embeddings at our disposal. All that remains is to extract them from the embedding layers of our TabularNeuralNet. We write a function, embed_features, to do that since we’ll have to repeat this operation for the training, validation, and test sets.[2]
```python
def embed_features(model, xs, encoder):
    "Replace categorical columns in `xs` w/ embeddings extracted from `model`."
    xs = xs.copy()
    with torch.no_grad():
        for i, col in enumerate(encoder.feature_names_in_):
            # Get embedding matrix
            emb = model.embeds[i]
            emb_data = emb(torch.tensor(xs[col].values, dtype=torch.int32))
            emb_names = [f'{col}_{j}' for j in range(emb_data.shape[1])]
            # Replace old feature col. w/ new one(s)
            feat_df = pd.DataFrame(
                data=emb_data.cpu().numpy(), index=xs.index, columns=emb_names
            )
            xs = xs.drop(col, axis=1).join(feat_df)
    return xs
```
We are now ready to apply the mapping defined by the embeddings to our original features. We also show the first few rows to see what the result looks like.
We can see that our initial dataset, which had 8 columns representing categorical variables, now reaches 39 columns. This number amounts to the sum of the embedding sizes we have chosen.
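The column count can be checked with a few lines of arithmetic. The sketch assumes that the embedding sizes listed earlier are paired with the columns in the order shown in the preprocessed dataframe, and that new columns follow the `{col}_{j}` naming used by embed_features.

```python
# Assumed pairing of columns and embedding sizes (order taken from the dataframe preview)
cols = ["Store", "DayOfWeek", "Day", "Month", "Year", "Promo", "StateHoliday", "SchoolHoliday"]
emb_dims = [10, 6, 10, 6, 2, 1, 3, 1]

# One new column per embedding dimension of each original column
emb_names = [f"{col}_{j}" for col, dim in zip(cols, emb_dims) for j in range(dim)]

print(len(cols), "->", len(emb_names))  # 8 -> 39
```

Store and Day each expand into ten columns, while the binary features Promo and SchoolHoliday keep a single column.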
We can now retrain our initial model, i.e. the Random Forests, using these new features created by the neural network.
Since we now have more columns, the model will take longer to train. We therefore set the parameter max_features to 0.6, which defines how many columns to sample at each split point (i.e. 60% of the total).
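As an illustration of the max_features setting, the sketch below fits a small forest on synthetic data. The data and the remaining hyperparameters are assumptions; in the post, the same effect would be obtained by passing max_features=0.6 to the rf function defined earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only the first one is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# A float `max_features` is a fraction: 60% of the columns are sampled at each split
model = RandomForestRegressor(
    n_estimators=30, max_features=0.6, min_samples_leaf=5, n_jobs=-1, random_state=34
).fit(X, y)

preds = model.predict(X[:5])
print(preds.shape)
```

Sampling fewer columns per split speeds up training and decorrelates the trees, at the cost of occasionally missing the most informative feature at a given split.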
Finally, the moment we have all been anxiously waiting for has arrived. Will our model have benefited from the use of Entity Embeddings? 🥁
That’s definitely the case! We started from an RMSPE of about 0.31 using ordinal encoding and arrived at 0.14 with Entity Embeddings: we’ve halved the error! The score on the private test set confirms the results and the positive trend, bringing the RMSPE down to 0.142.
Conclusion
Similarly to word embeddings in NLP (e.g. word2vec or GloVe), entity embeddings offer a low-dimensional, continuous representation of categorical variables that captures their semantic relationships. We have seen that, although they are derived from the training of a neural network, entity embeddings can also be used by non-deep learning models, such as random forests.
In this blog post, we discussed in detail how to construct embeddings and use them as features for an ensemble of decision trees. Along the way, we were able to reproduce part of the results contained in the paper by Guo and Berkhahn. In particular, replacing the ordinal-encoded features in our random forests model with entity embeddings halved the error on the test set.
To wrap up, when should you give them a try? They are especially useful for datasets with features that have very high cardinality, where other methods often tend to overfit. Is the art of feature engineering dead? Certainly not. In particular, if you have domain knowledge, it would be a shame not to give it to your model via well-designed features.
Footnotes
[1] Note that our main evaluation metric, the RMSPE, would diverge when the target is zero.
[2] If you come from the fastai world, check out this blog on Medium where this process is done using fastai objects (TabularPandas, Learner, etc.).