Beyond One-Hot: incremental improvements in categorical encoding

The beyond-one-hot project has started to grow up. Last fall I wrote a couple of posts comparing different methods of encoding categorical variables for machine learning problems; you can check them out here and here.

Those posts were pretty well received, so the hacky little script used to make the plots got worked on a bit more and eventually grew into a pip-installable Python library built around scikit-learn-style transformer objects. I'm now happy to say that the library is being used in production in at least two large systems that I know of, and has reached something resembling stability.
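To show what "scikit-learn style" means in practice, here is a minimal sketch of the fit/transform contract such transformers follow. The `OrdinalSketch` class is a hypothetical stand-in for illustration, not the library's own implementation:

```python
import pandas as pd

# Minimal sketch of the scikit-learn transformer contract: learn state in
# fit(), apply it in transform(). OrdinalSketch is illustrative only.
class OrdinalSketch:
    def __init__(self, cols=None):
        self.cols = cols
        self.mapping_ = {}

    def fit(self, X):
        # Learn an integer code for each category value in the target columns.
        for col in self.cols:
            self.mapping_[col] = {v: i for i, v in enumerate(pd.unique(X[col]))}
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X):
        out = X.copy()
        for col, mapping in self.mapping_.items():
            out[col] = out[col].map(mapping)
        return out

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'n': [1, 2, 3]})
encoded = OrdinalSketch(cols=['color']).fit(df).transform(df)
# 'color' becomes integer codes in order of first appearance: [0, 1, 0]
```

Because the transformers follow this contract, they drop straight into scikit-learn pipelines alongside other preprocessing steps.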

Aside from stability, we've added some useful functionality in the past few months, including:

  • A drop_invariant option on all transformers, which checks for zero-variance features at the fit() step and reliably drops those features from the output at transform()
  • A return_df option on all transformers, which lets the user toggle between transform() returning a pandas DataFrame or a numpy array
  • If cols is passed as [], nothing is encoded and the dataset is passed through unchanged
  • If cols is passed as None, the dataset passed to fit() is inspected to infer which columns should be encoded, and those are used. Any column typed as 'object' in the pandas DataFrame representation is considered appropriate for encoding.
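The four behaviors above can be sketched as a toy encoder. This is a hypothetical illustration of the semantics, not the library's actual implementation; only the option names (cols, drop_invariant, return_df) mirror the library's:

```python
import pandas as pd

# Toy one-hot encoder illustrating the option semantics described above.
# OneHotSketch is hypothetical; only the option names mirror the library's.
class OneHotSketch:
    def __init__(self, cols=None, drop_invariant=False, return_df=True):
        self.cols = cols
        self.drop_invariant = drop_invariant
        self.return_df = return_df
        self.drop_cols_ = []

    def fit(self, X):
        if self.cols is None:
            # cols=None: infer columns; any 'object'-typed column is
            # considered appropriate for encoding.
            self.cols = [c for c in X.columns if X[c].dtype == object]
        if self.drop_invariant:
            # Remember zero-variance columns of the encoded output so
            # transform() can drop them reliably.
            encoded = self._encode(X)
            self.drop_cols_ = [c for c in encoded.columns
                               if encoded[c].nunique() <= 1]
        return self

    def transform(self, X):
        out = self._encode(X).drop(columns=self.drop_cols_)
        # return_df toggles between a DataFrame and a numpy array.
        return out if self.return_df else out.values

    def _encode(self, X):
        # cols=[]: encode nothing, pass the dataset through unchanged.
        return pd.get_dummies(X, columns=self.cols) if self.cols else X.copy()

df = pd.DataFrame({'color': ['red', 'blue'], 'const': ['x', 'x'], 'n': [1, 2]})
# 'const' encodes to a single invariant dummy column, which gets dropped.
out = OneHotSketch(cols=None, drop_invariant=True).fit(df).transform(df)
```

With `cols=[]` the same data would come back untouched, and `return_df=False` would hand back a numpy array instead of a DataFrame.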

In the past few months a few people have expressed interest in contributing, and of course there are still things to help with, so if you are interested, leave a comment below or find me on GitHub and get involved. We need help with documentation, adding new encoders, benchmarking computational performance, and most importantly, getting the library into production so we can find out where it's useful and where it's lacking.

So if you haven't already, check out categorical_encoding on GitHub, and let me know what you think.

https://github.com/wdm0006/categorical_encoding

Will

Will has a background in Mechanical Engineering from Auburn, but mostly just writes software now. He was the first employee at Predikto, where he is currently building out the premier platform for predictive maintenance in heavy industry. When not working on that, he is generally working on something related to Python, data science, or cycling.

3 Comments

  1. Hi Will, I'm a huge fan of your blog and your work, especially your Category Encoders module. I was just wondering if you have considered adding inverse_transform functionality for these encodings? It would be very valuable. Thanks!

    • Thanks Daniel, that's a great idea. Would you mind opening an issue on the project's GitHub with it? If you're interested in helping to implement it, that'd be great too.
