Even Further Beyond One-Hot: Feature Hashing

In the previous post on categorical encoding, we explored different methods for converting categorical variables into numeric features. In this post, we will explore another method: feature hashing.

Feature hashing, also known as the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient because it uses a standalone hash function and requires no pre-built dictionary of possible categories.

A simple implementation that lets the user pick the desired output dimensionality, N, is to hash the input value into an integer, divide by N, and take the remainder, R. The feature can then be encoded as a vector of N zeros with a one at index R.
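For a single value, that procedure can be sketched like this (a minimal illustration: the function name and the choice of md5 — used only so the example is stable across runs — are my own, not from the post):

```python
import hashlib

def one_hot_hash(value, n_components=8):
    """Hash a category string to an index in [0, n_components)
    and return a one-hot vector with a one at that index."""
    # md5 here is purely illustrative; any hash function works.
    h = int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)
    r = h % n_components           # R, the remainder
    vec = [0] * n_components
    vec[r] = 1
    return vec

print(one_hot_hash('mushroom'))    # a one-hot vector of length 8
```

Note that encoding a brand-new category needs no refitting: it simply hashes to some index in the same fixed-size vector.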

In pseudo-code, from Wikipedia's article:

function hashing_vectorizer(features : array of string, N : integer):
    x := new vector[N]
    for f in features:
        h := hash(f)
        x[h mod N] += 1
    return x

Or in Python, from the categorical encoding repository:

import pandas as pd

N = 8  # desired output dimensionality (chosen by the user)
cols = ['col_%d' % d for d in range(N)]

def hash_fn(x):
    tmp = [0 for _ in range(N)]
    for val in x.values:
        tmp[hash(val) % N] += 1
    return pd.Series(tmp, index=cols)

X = X.apply(hash_fn, axis=1)

How well does this work, though? Last time, we ran all of the encoders through the same model and compared their scores. Some encodings come out a little different every time (depending on how the initial ordinal transform goes), so we run the models 100 times and produce a boxplot of their scores. Using the mushroom dataset:

Categorical Encoding Scores

In this plot you can see a couple of interesting things. First, the binary encoding introduced in the last post, while it performs well on average, has pretty high run-to-run variance, which is not ideal. Also, you can see a gradual rise in performance as the number of components used by the hashing encoder increases, but it saturates at 16, because there just aren't enough categories in this dataset to need more than 16 components.

In the next post, I'll introduce the scikit-learn-compatible transformers for each of these, some examples using them, and some rules of thumb for building good models on high-dimensional categorical data without blowing out your memory.

Check out the updated repository to try your own encoders, use mine, or to try this out on your own data:

Will has a background in Mechanical Engineering from Auburn, but mostly just writes software now. He was the first employee at Predikto, and is currently building out the premier platform for predictive maintenance in heavy industry there as Chief Scientist. When not working on that, he is generally working on something related to Python, data science, or cycling.
