Entity embeddings of categorical variables

Bringing the power of neural nets to tree-based models.
regression
random forests
embeddings
pytorch
Author

Luca Papariello

Published

September 19, 2023

Introduction

Despite the buzz around generative AI, most applications in industry are still built on tabular datasets. In tabular data, some columns may represent numerical variables, like atmospheric pressure, while others may be categorical variables, like sex or product categories, which can take only a limited number of values. The path to using numerical variables is relatively smooth and requires little preprocessing (for some algorithms, even none at all). In contrast, categorical variables must first be converted into numbers, as this is what a computer can process, and this can be done in many ways.

Two conventional approaches to encode categorical variables are ordinal encoding and one-hot encoding. Ordinal encoding assigns each unique value to a different integer and, as such, assumes an ordering of the categories. For instance, “Size” could be the name of a categorical column with values: small < medium < large, which would be mapped to 0, 1, and 2, respectively. One-hot encoding instead creates a new column for each possible value in the original data, indicating its presence or absence. If the first sample in the dataset were large, we would get the following one-hot representation: (0, 0, 1). Unlike ordinal encoding, one-hot encoding does not assume an ordering of the categories.
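
As a quick illustration, here is a minimal sketch of both encodings with scikit-learn, using a toy “Size” column (the data below is made up for this example and is not part of the Rossmann dataset):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Toy categorical column (hypothetical example)
df = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})

# Ordinal encoding: each category becomes an integer (the order is given explicitly)
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(oe.fit_transform(df[['Size']]).ravel())  # [0. 2. 1. 0.]

# One-hot encoding: one new column per category, indicating presence/absence
ohe = OneHotEncoder(categories=[['small', 'medium', 'large']])
print(ohe.fit_transform(df[['Size']]).toarray())  # e.g. 'large' -> [0. 0. 1.]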

However, both ordinal and one-hot encodings have their limitations. To name a few, in ordinal encoding, the assigned numerical values may introduce unintended relationships or orders between categories that do not exist in the original data. One-hot encoding, while solving the issue of introducing unintended order, can lead to high-dimensional and sparse representations for features with a large number of unique categories. This translates into increased computational complexity and memory usage.

In 2015, the Rossmann sales competition took place on Kaggle. The solution of one of the gold medal winners clearly diverged from the others by using a deep learning model, one of the first known examples of cutting-edge deep learning applied to tabular data. Rather than using traditional encoding methods, the authors introduced the concept of Entity Embeddings. Entity Embeddings provide a way to represent categorical variables as low-dimensional continuous vectors, capturing the underlying semantic relationships between categories. This approach avoids both the artificial ordering introduced by ordinal encoding and the curse of dimensionality that comes with one-hot encoding. Their approach is summarised in the paper Entity Embeddings of Categorical Variables, by Cheng Guo and Felix Berkhahn.

We’ll explore here the concepts and advantages of Entity Embeddings, diving deep and trying to replicate the main findings of the paper. Along the way, we’ll offer a comprehensive understanding of their applications in general machine learning models.

Dataset

We use the very same dataset of the Rossmann sales competition that was used by the authors of the paper—this is readily available on Kaggle. The dataset consists of two parts. The first one is train.csv and comprises daily sales data for several different stores, while the second one is store.csv and provides additional details about each of these stores. Since the focus here is not on obtaining the best possible result, but on the representation of features and the introduction of Entity Embeddings, we’ll restrict our attention to the base features provided in the first file. Here is a snippet of this dataset:

import pandas as pd

# `path` is a pathlib.Path pointing to the directory containing the competition files
train_data = path/'train.csv'
train_df = pd.read_csv(train_data, parse_dates=['Date'], low_memory=False)
train_df.head(3)
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1

Most of the fields are self-explanatory and their descriptions can be found on the competition webpage. What we need to know here is that Sales is our target variable and represents the turnover on a given day. Apart from the column Date, which is a datetime and thus a type of its own, and Customers, which is not available at test time and will hence not be considered, all the features of this dataset are categorical. It should be noted that while most of them have a low cardinality, Store, which is a unique ID identifying each shop, can take a whopping 1115 different values!

Before moving on, we get rid of closed stores, which have zero sales, as we’ll not make any predictions on them at test time.

to_keep = ~((train_df.Open == 0) & (train_df.Sales == 0))
train_df = train_df.loc[to_keep, :]

Train/Valid split

Our training data spans approximately 2.5 years, from 2013 to mid-2015. The test data instead covers the subsequent part of 2015, with no overlap with the training set dates. To test the generalisation capability of our model, we’ll mimic this setup as closely as possible by sorting the training data by Date and keeping the last 10% of the samples for validation. In this way, we use the older samples for training and the most recent ones for validation. This should ensure that the performance observed on the validation set is as close as possible to that of the leaderboard after submission.

We’ll also drop samples from the validation set for which there are zero sales. This is to align ourselves with the competition, whose website states that “Any day and store with 0 sales is ignored in scoring”.1
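
The drop_zero_sales helper used in the split below is not defined in the post; a minimal sketch of what it might look like:

def drop_zero_sales(xs: pd.DataFrame, y: pd.Series):
    "Drop rows whose target (sales) is zero, mirroring the competition's scoring rule."
    mask = y != 0
    return xs.loc[mask], y.loc[mask]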

from sklearn.model_selection import train_test_split

tgt = 'Sales'  # Target variable
train_df = train_df.sort_values(by=['Date']).reset_index(drop=True)

train_xs, valid_xs, train_y, valid_y = train_test_split(
    train_df.drop(columns=tgt), train_df[tgt], test_size=0.1, shuffle=False
)

valid_xs, valid_y = drop_zero_sales(valid_xs, valid_y)

print(f'Train size: {len(train_y):>6}')
print(f'Valid size: {len(valid_y):>6}')
Train size: 759952
Valid size:  84439

Ordinal encoding

As a first (and only) operation to prepare the data, we enrich the representation of dates. We keep it simple and only create new columns for Day, Month, and Year out of Date. After all, there are already the DayOfWeek, StateHoliday, and SchoolHoliday columns providing additional information in this regard.

def proc_data(df: pd.DataFrame):
    "Process DataFrame and create date features inplace."
    # Expand the Date column into separate day/month/year features
    df['Day'] = df.Date.dt.day
    df['Month'] = df.Date.dt.month
    df['Year'] = df.Date.dt.year
    # Fill missing `Open` flags (assume the store was open) and cast to int
    df['Open'] = df.Open.fillna(1).astype(int)

proc_data(train_xs)
proc_data(valid_xs)
proc_data(test_df)

We’re now ready to map our categorical variables to integers using scikit-learn’s OrdinalEncoder. In addition, we only select the relevant features.

cats = ['Store', 'DayOfWeek', 'Day', 'Month', 'Year', 'Promo', 'StateHoliday', 'SchoolHoliday']

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(dtype=int)
train_xs[cats] = oe.fit_transform(train_xs[cats])
valid_xs[cats] = oe.transform(valid_xs[cats])
test_df[cats] = oe.transform(test_df[cats])

train_xs, valid_xs, test_xs = train_xs[cats], valid_xs[cats], test_df[cats]

Let’s have a look at our preprocessed dataframe, which is now ready to be passed to our first model.

train_xs.tail(3)
Store DayOfWeek Day Month Year Promo StateHoliday SchoolHoliday
759949 351 5 1 4 2 0 0 0
759950 350 5 1 4 2 0 0 0
759951 364 5 1 4 2 0 0 0

Evaluation metric

Before moving on to the model, we want to take a look at one last important piece: the evaluation metric. Submissions for this Kaggle competition are evaluated in terms of the Root Mean Square Percentage Error (RMSPE), so we’ll use the same metric here. Additionally, we pick another common evaluation metric, the Mean Absolute Percentage Error (MAPE). This is the metric used in the publication and will allow us to benchmark our results against theirs.
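
The post does not spell these metrics out in code; below is a minimal sketch of the rmspe/mape functions and the model-based wrappers m_rmspe/m_mape used later (the names match the calls below, the implementations are assumed):

import numpy as np

def rmspe(preds, targs):
    "Root Mean Square Percentage Error."
    return np.sqrt(np.mean(((targs - preds) / targs) ** 2))

def mape(preds, targs):
    "Mean Absolute Percentage Error."
    return np.mean(np.abs((targs - preds) / targs))

def m_rmspe(model, xs, y):
    "RMSPE of a fitted model's predictions on `xs` against `y`."
    return rmspe(model.predict(xs), y)

def m_mape(model, xs, y):
    "MAPE of a fitted model's predictions on `xs` against `y`."
    return mape(model.predict(xs), y)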

Baseline model

We choose Random Forests, an ensemble of decision trees, as our starting point. The reason is that it requires little preprocessing, isn’t very sensitive to hyperparameters, and generally provides a strong baseline.

For convenience, we define an rf function that returns a fitted Random Forests regression model. We are now ready to fit our first model and check its performance, which is expressed in terms of RMSPE (the competition metric) and MAPE (the metric used in the paper).

from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=30, max_samples=200_000, max_depth=35, min_samples_leaf=5, **kwargs):
    "Return a fitted Random Forests regression model."
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators, max_samples=max_samples,
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=34, **kwargs
    ).fit(xs, y)
m = rf(train_xs.values, train_y.values)
print(f"Valid RMSPE: {m_rmspe(m, valid_xs.values, valid_y.values):.3f}")
print(f"Valid MAPE:  {m_mape(m, valid_xs.values, valid_y.values):.3f}")
Valid RMSPE: 0.310
Valid MAPE:  0.185

Submitting the results of this model would result in a score on the private test set of about 0.303—note how close this is to the score on our validation set! This is good news because it suggests that we can trust our validation set. However, this result would put us a long way from the top of the leaderboard, where the winner had an impressive score of 0.1.
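
The submission itself is not shown in the post; a minimal sketch of how such a file could be produced (assuming test_df still carries the competition’s Id column, and setting closed stores to zero sales as per the rules):

# Predict on the preprocessed test features and write a Kaggle submission file
test_preds = m.predict(test_xs.values)

submission = pd.DataFrame({'Id': test_df['Id'], 'Sales': test_preds})
submission.loc[test_df['Open'] == 0, 'Sales'] = 0  # closed stores sell nothing
submission.to_csv('submission.csv', index=False)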

Entity embeddings

Embeddings are at the heart of many recommender systems, particularly in cases where the possibility of employing a content-based approach is ruled out due to the lack of information about users and items. Collaborative filtering offers a (somewhat surprising) solution for predicting user preferences based only on the interests of other users. The key idea here is that of latent dimensions, which are features that describe users and items and are automatically discovered by a model. The resulting matrices containing the latent factors of users and items are exactly the user and item embedding matrices. Typically, they are determined via low-rank matrix factorisation or randomly initialised and improved during the training of a neural network.

The entity embeddings we discuss today generalise this same idea to any categorical variable, i.e. we want to learn a low-dimensional vector representation for each category in a given categorical feature. We’ll optimise these embeddings during the training process along with the rest of the model’s parameters. The underlying intuition is that by representing categories as continuous vectors, the model can learn meaningful relationships and similarities between categories through their proximity in the embedding space. This allows the model to capture complex interactions and dependencies among categorical variables, which can be very valuable for predictive tasks.
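
Concretely, an embedding layer is nothing more than a lookup table: each category index selects a row of a trainable weight matrix. A small PyTorch sketch for the Store feature (the embedding size of 10 is the one we’ll use later):

import torch
import torch.nn as nn

store_emb = nn.Embedding(num_embeddings=1115, embedding_dim=10)  # 1115 stores -> 10-d vectors

store_ids = torch.tensor([0, 42, 1114])  # three (ordinal-encoded) store IDs
vectors = store_emb(store_ids)           # shape: (3, 10), updated during training
print(vectors.shape)                     # torch.Size([3, 10])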

Training the embeddings

The deep learning framework we will use to build and train our neural network is PyTorch. In PyTorch, a dataset is constructed by subclassing Dataset and requires us to override two dunder methods: __getitem__ and __len__. In the following, we implement our Dataset so that it returns one tensor for the features and one for the target for the training and validation sets, and only the features for the test set.

import torch
from torch.utils.data import Dataset, DataLoader

class TabDataset(Dataset):
    "Tabular Dataset that yields categorical features and target."

    def __init__(self, feats, new_data=True, tgt=None):
        self.x_cat = torch.tensor(feats.values, dtype=torch.int32)
        self.n_samples = len(feats)
        self.new_data = new_data
        
        if not new_data:
            self.y = torch.tensor(tgt.values, dtype=torch.float32).view(-1, 1)

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx: int):
        if not self.new_data:
            return self.x_cat[idx], self.y[idx]
        return self.x_cat[idx]
train_ds = TabDataset(train_xs, new_data=False, tgt=train_y)
valid_ds = TabDataset(valid_xs, new_data=False, tgt=valid_y)

Next, we can build the training and validation DataLoaders, which take a Dataset and return an iterable that handles shuffling, batching, and all the rest for us.

bs = 1024  # Batch size
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=bs*2, shuffle=False)

Now comes the main part, the construction of the neural network that will learn the entity embeddings. In PyTorch, we can easily create custom models by subclassing nn.Module. In a nutshell, this neural network consists of an initial part of embedding layers—these represent our matrices of latent factors—followed by a fully connected part.

import torch.nn as nn

class TabularNeuralNet(nn.Module):
    "Neural network model for tabular data."

    def __init__(
        self,
        emb_szs: list,  # List of (unique_cats, embedding_dim)
        out_sz: int,    # Number of outputs for final layer
        layers: list,   # Size of the hidden layers
        act_cls=nn.ReLU(inplace=True),  # Activation type after `Linear` layers
    ):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(e.embedding_dim for e in self.embeds)
        self.n_emb = n_emb
        sizes = [n_emb] + layers + [out_sz]
        actns = [act_cls for _ in range(len(sizes) - 2)] + [None]
        _layers = [LinReLu(sizes[i], sizes[i + 1], act=a) for i, a in enumerate(actns)]
        self.layers = nn.Sequential(*_layers)

    def forward(self, x_cat):
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        x = torch.cat(x, 1)
        return self.layers(x)
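
The LinReLu building block used above is not defined in the post; a minimal sketch, assuming it simply chains a linear layer with an optional activation:

class LinReLu(nn.Module):
    "Linear layer followed by an optional activation."

    def __init__(self, n_in: int, n_out: int, act=None):
        super().__init__()
        layers = [nn.Linear(n_in, n_out)]
        if act is not None:
            layers.append(act)
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)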

The only thing that remains for us to do is to determine the emb_szs parameter. The number of unique categories can easily be found using the OrdinalEncoder we fitted above, but the embedding size for each categorical feature is a free parameter. As such, there is no clear-cut way to determine it, although it’s typically much smaller than the number of unique categories. We are thus thankful that the authors report these values in their paper.
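
Had those values not been available, one common heuristic (popularised by the fastai library) scales the embedding size with the cardinality and caps it; a sketch:

def emb_sz_rule(n_cat: int) -> int:
    "Heuristic embedding size for a feature with `n_cat` unique categories."
    return min(600, round(1.6 * n_cat ** 0.56))

emb_sz_rule(1115)  # Store -> 81 (the paper uses 10)
emb_sz_rule(7)     # DayOfWeek -> 5 (the paper uses 6)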

num_unique = [len(c) for c in oe.categories_]  # Cardinality of each categorical feature
emb_dims = [10, 6, 10, 6, 2, 1, 3, 1]          # Embedding sizes reported in the paper
embed_sizes = [(u, e) for u, e in zip(num_unique, emb_dims)]
nn_model = TabularNeuralNet(emb_szs=embed_sizes, out_sz=1, layers=[200, 100])

What follows is the classic PyTorch training loop, which we encapsulate for convenience in two functions, train_one_epoch and validate_one_epoch, that do what they say: train and validate the model over one epoch, respectively. We train our model for a few epochs using the standard Mean Squared Error (MSE) loss function and Adam optimiser.
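
Neither helper (nor the loss and optimiser set-up) is shown in the post; a minimal sketch of what they might look like, reusing the rmspe helper sketched earlier (the learning rate is an assumption):

loss_func = nn.MSELoss()
optim = torch.optim.Adam(nn_model.parameters(), lr=1e-3)  # learning rate is an assumption
EPOCHS = 3

def train_one_epoch(dl, model, loss_func, opt):
    "Train `model` for one epoch, printing the loss every 400 batches."
    model.train()
    for batch, (x_cat, y) in enumerate(dl):
        loss = loss_func(model(x_cat), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if batch % 400 == 0:
            print(f"Loss: {loss.item():>8.0f}  [{batch * len(x_cat):>6}/{len(dl.dataset):>6}]")

def validate_one_epoch(dl, model, loss_func):
    "Report the average validation loss and RMSPE."
    model.eval()
    tot_loss, tot_rmspe, n = 0.0, 0.0, len(dl)
    with torch.no_grad():
        for x_cat, y in dl:
            preds = model(x_cat)
            tot_loss += loss_func(preds, y).item()
            tot_rmspe += rmspe(preds.numpy(), y.numpy())
    print(f"Avg. valid. loss: {tot_loss / n:.0f}, Avg. RMSPE: {tot_rmspe / n:.4f}\n")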

for ep in range(EPOCHS):
    print(f"Epoch {ep+1}\n" + "-"*31)
    train_one_epoch(train_dl, nn_model, loss_func, optim)
    validate_one_epoch(valid_dl, nn_model, loss_func)
Epoch 1
-------------------------------
Loss: 56439536  [     0/759952]
Loss:  1829678  [409600/759952]
Loss:  2018285  [759952/759952]
Avg. valid. loss: 1823780, Avg. RMSPE: 0.2195

Epoch 2
-------------------------------
Loss:  1885142  [     0/759952]
Loss:  1538159  [409600/759952]
Loss:  1631149  [759952/759952]
Avg. valid. loss: 1681688, Avg. RMSPE: 0.2059

Epoch 3
-------------------------------
Loss:  1414050  [     0/759952]
Loss:  1237316  [409600/759952]
Loss:   899826  [759952/759952]
Avg. valid. loss: 1374336, Avg. RMSPE: 0.1720

That’s it! By training our neural network we obtained the entity embeddings. As a nice by-product, we also got a fully functioning model that we can use to forecast the sales.
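
The all_preds array evaluated below is not built explicitly in the post; a minimal sketch of collecting the validation predictions:

nn_model.eval()
with torch.no_grad():
    all_preds = torch.cat([nn_model(x_cat) for x_cat, _ in valid_dl]).numpy()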

print(f"Valid RMSPE: {rmspe(all_preds, valid_y.values.reshape(-1, 1)):.3f}")
print(f"Valid MAPE:  {mape(all_preds, valid_y.values.reshape(-1, 1)):.3f}")
Valid RMSPE: 0.179
Valid MAPE:  0.123

It turns out that this model works much better than our previous baseline, lowering both RMSPE and MAPE substantially. Submitting these results would bring the score on the private test set down to about 0.17. We are definitely moving in the right direction!

Random Forests with Entity Embeddings

Now that we have trained the neural network, we have the entity embeddings at our disposal. All that remains is to extract them from the embedding layers of our TabularNeuralNet. We write a function, embed_features, to do that, since we’ll have to repeat this operation for the training, validation, and test sets.2

def embed_features(model, xs, encoder):
    "Replace categorical columns in `xs` w/ embeddings extracted from `model`."
    xs = xs.copy()

    with torch.no_grad():
        for i, col in enumerate(encoder.feature_names_in_):
            # Get embedding matrix
            emb = model.embeds[i]
            emb_data = emb(torch.tensor(xs[col].values, dtype=torch.int32))
            emb_names = [f'{col}_{j}' for j in range(emb_data.shape[1])]

            # Replace old feature col. w/ new one(s)
            feat_df = pd.DataFrame(
                data=emb_data.cpu().numpy(), index=xs.index, columns=emb_names
            )
            xs = xs.drop(col, axis=1).join(feat_df)
    return xs

We are now ready to apply the mapping defined by the embeddings to our original features. We also show the first few rows to see what the result looks like.

# Train embeddings
emb_train_xs = embed_features(nn_model, train_xs, oe)

# Validation embeddings
emb_valid_xs = embed_features(nn_model, valid_xs, oe)
emb_train_xs.head(3)
Store_0 Store_1 Store_2 Store_3 Store_4 Store_5 Store_6 Store_7 Store_8 Store_9 ... Month_3 Month_4 Month_5 Year_0 Year_1 Promo_0 StateHoliday_0 StateHoliday_1 StateHoliday_2 SchoolHoliday_0
0 -0.464628 1.389546 0.031453 0.837222 -2.890536 -1.271165 0.351642 0.352165 -1.067315 0.432503 ... 0.912523 1.552763 -0.231177 -0.952061 0.123717 0.058928 0.693458 -1.862958 -0.175023 -1.131267
1 0.548922 0.104170 0.247237 0.384087 -1.878805 -1.581414 1.528581 -0.838187 -1.658705 -2.186104 ... 0.912523 1.552763 -0.231177 -0.952061 0.123717 0.058928 0.693458 -1.862958 -0.175023 -1.131267
2 -0.613926 -0.435071 -0.327684 2.248206 -0.140331 -0.281201 0.297878 2.848119 1.672459 1.167650 ... 0.912523 1.552763 -0.231177 -0.952061 0.123717 0.058928 0.693458 -1.862958 -0.175023 -1.131267

3 rows × 39 columns

We can see that our initial dataset, which had 8 columns representing categorical variables, now reaches 39 columns. This number is exactly the sum of the embedding sizes we chose: 10 + 6 + 10 + 6 + 2 + 1 + 3 + 1 = 39.

We can now retrain our initial model, i.e. the Random Forests, using these new features created by the neural network.

emb_m = rf(emb_train_xs.values, train_y.values, max_features=0.6)
Note

Since we now have more columns, the model will take longer to train. We therefore set the parameter max_features to 0.6, which defines the fraction of columns to consider at each split point (here, 60% of the total).

Finally, the moment we have all been anxiously waiting for has arrived. Will our model have benefited from the use of Entity Embeddings? 🥁

emb_valid_xs, valid_y_clean = drop_zero_sales(emb_valid_xs, valid_y)

print(f"Valid RMSPE: {m_rmspe(emb_m, emb_valid_xs.values, valid_y_clean.values):.3f}")
print(f"Valid MAPE:  {m_mape(emb_m, emb_valid_xs.values, valid_y_clean.values):.3f}")
Valid RMSPE: 0.139
Valid MAPE:  0.101

That’s definitely the case! We started from an RMSPE of about 0.31 using ordinal encoding and arrived at 0.14 with Entity Embeddings—we’ve halved the error! The score on the private test set confirms the results and the positive trend, bringing the RMSPE down to 0.142.

Conclusion

Similarly to word embeddings in NLP (e.g. word2vec or GloVe), entity embeddings offer a low-dimensional, continuous representation of categorical variables that captures their semantic relationships. We have seen that, although they are derived from the training of a neural network, entity embeddings can also be used by non-deep learning models, such as random forests.

In this blog post, we discussed in detail how to construct embeddings and use them as features for an ensemble of decision trees. Along the way, we were able to reproduce part of the results contained in the paper of Guo and Berkhahn. In particular, replacing the ordinal encoded features in our random forests model with entity embeddings resulted in halving the error on the test set.

To wrap up, when should you give them a try? They are especially useful for datasets with features that have very high cardinality, where other methods often tend to overfit. Is the art of feature engineering dead? Certainly not. In particular, if you have domain knowledge, it would be a shame not to give it to your model via well-designed features.

Footnotes

  1. Note that our main evaluation metric, the RMSPE, would diverge when the target is zero.

  2. If you come from the fastai world, check out this blog on Medium where this process is done using fastai objects (TabularPandas, Learner, etc.).