In machine learning, data are king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let's look at a little subset of those input data: categorical variables.

Categorical variables (wiki) are those that take on one of a fixed number of possible values, rather than a continuous range. Each value assigns the measurement to one of those finite groups, or categories. They differ from ordinal variables in that the distance between any two categories ought to be equal, whereas ordinal variables have some intrinsic ordering. As an example:

- Ordinal: low, medium, high
- Categorical: Georgia, Alabama, South Carolina, ... , New York

The machine learning algorithms we will later use tend to want numbers, and not strings, as their inputs, so we need some method of coding to convert the categories into numbers.

A quick interjection: there is one other concept that will come up frequently in this post, and that is the concept of dimensionality. In simplistic terms, it is just the number of columns in the dataset, but it has significant downstream effects on the eventual models. At the extremes, the concept of the "curse of dimensionality" discusses that in high-dimensional spaces some things just stop working properly. Even in relatively low dimensional problems, a dataset with more dimensions requires more parameters for the model to understand, and that means more rows to reliably learn those parameters. If the number of rows in the dataset is fixed, addition of extra dimensions without adding more information for the models to learn from can have a detrimental effect on the eventual model accuracy.

To circle back to the problem at hand: we want to code categorical variables into numbers, but we are concerned about this dimensionality problem. The obvious answer is to just assign an integer to each category (we are assuming we know all of the possible categories up front). This is called ordinal coding. It does not add any dimensions to the problem, but implies an order to the variable that may not actually exist.

#### Methodology

To find out how well this works, I put together a simple Python script to test different coding methods on common datasets. First, an overview of the process:

- We gather a dataset for a classification problem that has categorical variables
- We use some method of coding to convert the X dataset into numeric values
- We use scikit-learn's `cross_val_score` and a `BernoulliNB()` classifier to generate scores for the dataset. This is repeated 10x for each dataset and the mean of all scores is used.
- We store the dimensionality of the dataset, mean score, and time to code the data and generate the scores.
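The scoring step above can be sketched roughly as follows (the toy data here is a stand-in for a real encoded UCI dataset, and the loop structure is illustrative rather than the actual script):

```python
# Rough sketch of the evaluation loop: cross-validate a BernoulliNB
# classifier on an already-encoded dataset, repeat 10x, and record the
# dimensionality, mean score, and elapsed time.
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 10))  # stand-in for an encoded dataset
y = rng.randint(0, 2, size=200)

start = time.time()
scores = []
for _ in range(10):  # repeat and average
    scores.append(cross_val_score(BernoulliNB(), X, y, cv=5).mean())
elapsed = time.time() - start

print(X.shape[1], np.mean(scores), elapsed)  # dimensionality, mean score, time
```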

This is all repeated for a few different datasets from the UCI dataset repository: Mushroom, Cars, and Splice.

I tried 7 different encoding methods (the descriptions of methods 4-7 are taken from the statsmodels docs):

- Ordinal: as described above
- One-Hot: one column per category, with a 1 or 0 in each cell for if the row contained that column's category
- Binary: first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot, but with some distortion of the distances.
- Sum: compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
- Polynomial: The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing.
- Backward Difference: the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
- Helmert: the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name 'reverse' is sometimes applied to differentiate it from forward Helmert coding.

#### Results

##### Mushroom

Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time
---|---|---|---|---
Ordinal | Mushroom | 22 | 0.810919 | 3.653194
One-Hot Encoded | Mushroom | 117 | 0.813252 | 8.193983
Helmert Coding | Mushroom | 117 | 0.837997 | 5.431131
Backward Difference Coding | Mushroom | 117 | 0.846864 | 7.829706
Sum Coding | Mushroom | 117 | 0.850555 | 4.929640
Polynomial Coding | Mushroom | 117 | 0.855596 | 6.136916
Binary Encoded | Mushroom | 43 | 0.871493 | 3.948484

##### Cars

Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time
---|---|---|---|---
Sum Coding | Cars | 21 | 0.549347 | 1.456738
Helmert Coding | Cars | 21 | 0.577471 | 1.458556
Ordinal | Cars | 6 | 0.638522 | 1.466667
One-Hot Encoded | Cars | 21 | 0.648694 | 1.393966
Polynomial Coding | Cars | 21 | 0.666130 | 1.495264
Backward Difference Coding | Cars | 21 | 0.697274 | 1.499557
Binary Encoded | Cars | 12 | 0.697911 | 1.441609

##### Splice

Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time
---|---|---|---|---
Ordinal | Splice | 61 | 0.681816 | 5.107389
Sum Coding | Splice | 3465 | 0.922276 | 25.898854
Binary Encoded | Splice | 134 | 0.935744 | 3.352499
One-Hot Encoded | Splice | 3465 | 0.944839 | 2.563578

#### Conclusions

This is by no means an exhaustive study, but it seems that with decent consistency binary coding performs well, without a significant increase in dimensionality. Ordinal, as expected, performs consistently poorly.

If you'd like to look at the source, add or suggest new datasets, or new coding methods, I've put everything (including datasets) up on github: https://github.com/wdm0006/categorical_encoding.

Feel free to either contribute directly there, or comment with suggestions here.

It might be interesting to also include distributed representations such as learned embedding.

Thanks for the suggestion George, I totally agree. Do you know of a good paper or implementation of such methods in this context that I could use as a reference?

Interesting! Some suggestions:

Some indication of the variance of your estimates for avg score could be useful to get a feel for whether these differences are significant

It might be slightly fairer if you selected an optimal amount of regularisation for each coding using a validation set -- some codings might need more or less of it for optimal performance, especially given the differences in dimensionality.

Another suggestion to try would be a hashing-based approach (the 'hash trick' -- see e.g. http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing ), or other related Random Indexing approaches (e.g. just using a different gaussian random vector to represent each class)

That's a great idea, actually. It would make sense to do some kind of search for the classifier's hyperparameters, and maybe even try a few different common classifiers as well.

It would be interesting, I think, to see what adding a dimensionality reduction step does in these cases. The higher-dimensional encodings likely have many useless dimensions that can simply be dropped, which may narrow the gaps some.

I will for sure be doing a follow up with some extra techniques like hashing, and will likely add at least a grid search for alpha on BernoulliNB.

Hey Will,

Great post -- thank you. Do you happen to know of any papers or other resources that discuss binary coding? I was surprised to find a dearth of resources on the topic given the performance.

Thanks

It's actually a hack that I've just seen some success with in the past, so I haven't stumbled across it elsewhere. It does intuitively make some sense for truly unordered categories in cases where there are a huge number of categories and limited data, because in those cases the benefits of the lower dimension representation outweigh the distance distortion.

I'll continue to keep an eye out though, and if anyone else knows of a paper, I'd certainly be keen to read it as well.

Thanks for this interesting post. I'm really curious to try out binary encoding in my next project.

However, I didn't get the following (my apologies if these are a bit silly questions):

-- For Helmert, Backward Difference, Polynomial, and Sum coding, you refer to the mean of the dependent variable, so is it safe to assume that the context you're referring to is when the target variable is continuous?

-- Could you please elaborate more on the Polynomial coding with an e.g. perhaps? I didn't quite get how it works.

Can you explain why you chose to use a BernoulliNB classifier? The result will depend a lot on the chosen classifier. If you use a DecisionTree for example, you'll get close to perfect classification on the Mushroom data set without any need to encode the categorical variables.

This was intended as more of a demonstration of the differences between encoding methods rather than a search for the "best" one. Depending on your dataset, the assumptions you are comfortable making and the model or models used on the encoded data, different methods will be best. Your decision-tree/mushroom example is a great example of that. There is some tradeoff between dimensionality and appropriateness to the practitioner, which I tried to touch on as well.

As for why BernoulliNB: it's designed for discrete or binary/boolean features, which is what the majority of these encoders are outputting, and it also gave self-consistent results across multiple runs.

Cool, thanks for the clarification.

Looking at your source code, it doesn't look like one of the columns was dropped in the one-hot encoding, which means that there is multicollinearity in those features. Is that correct?

The one-hot encoder used is a direct pass-thru to the scikit-learn version: https://github.com/wdm0006/categorical_encoding/blob/master/category_encoders/one_hot.py

Is that the encoder you are referring to?

Hi Will, great post. My question is somehow similar to Daniel's. The Mushroom dataset contains about 2000 missing values in the 'stalk-root' feature. How do you deal with that? Do you encode also the missing values?

Great question: missing values ought to be encoded as a separate category if they make it to the encoder, so if you want them treated differently, handle that upstream.

One-hot and other similar encoding mechanisms will blow up the memory tremendously, and hence are not recommended for big data problems.

Yes, one hot can produce extremely large sparse datasets that in many cases aren't tractable. For that sort of problem, something like hashing is often a better first stab.

While running your example script from category_encoders, I got an error at `X_test = encoder().fit_transform(X)`:

`TypeError: 'module' object is not callable`

Can you help me with this, as I'm new to Python?

Will, Excellent work, particularly since there seems to be a lack of support in general for dealing with categorical variables, particularly for neural nets. I was trying to reproduce your binary coding for the Car dataset, without being able to run your Python code. How did you arrive at a dimensionality of 9? I would have thought you would need at least 12 columns, 2 (or more) for each of the six attributes.

Thanks very much.

David Wilt

Looking back on this, I also am not sure how binary could yield anything but 12 dimensions for the Cars dataset. I'm guessing it got fat-fingered, but it being about a year ago, I'm not sure. I'll update the post with respect to dimensionality, but this probably warrants an outright re-run. In the past year a good bit of development, bugfixes, and improvements have occurred in the library, so I would guess slightly different and hopefully more repeatable experiments could be run now.

Thanks for the catch!