The beyond-one-hot project has started to grow up. Last fall, I did a couple of posts comparing different methods of encoding categorical variables for machine learning problems. You can check them out here and here respectively.
Those posts were pretty well received, so the hacky little script used to make the plots got some more work and eventually became a pip-installable python library with scikit-learn-style transformers. I'm now happy to say that the library is being used in production in at least two large systems that I know of, and has reached something that resembles stability.
Beyond stability, we've added some useful functionality in the past few months, including:
- Addition of a drop_invariant option to the transformers, which checks for features with zero variance at the fit() step and reliably drops those features from the output at transform()
- Addition of a return_df option to all transformers to allow the user to toggle between the transform() method returning a pandas DataFrame or a numpy array
- If cols is passed as an empty list, nothing is encoded and the dataset is passed through unchanged
- If cols is passed as None, then the dataset passed to fit() is inspected to infer which columns should be encoded, and those are used. Any column typed as 'object' in the pandas DataFrame representation is considered appropriate for encoding.
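To make the options above concrete, here's a minimal sketch of how a transformer with this interface might behave. This is illustrative code under my own assumptions (a toy ordinal-style encoder with a made-up name, `SketchEncoder`), not the library's actual implementation:

```python
import pandas as pd

class SketchEncoder:
    """Illustrative scikit-learn-style categorical encoder (hypothetical,
    not the library's code). Demonstrates cols inference, drop_invariant,
    and return_df as described above."""

    def __init__(self, cols=None, drop_invariant=False, return_df=True):
        self.cols = cols
        self.drop_invariant = drop_invariant
        self.return_df = return_df

    def fit(self, X):
        X = pd.DataFrame(X)
        if self.cols is None:
            # cols=None: infer columns to encode; any 'object'-typed
            # column is considered appropriate for encoding
            self.cols_ = [c for c in X.columns if X[c].dtype == object]
        else:
            # an explicit list (possibly empty) is used as-is;
            # an empty list means nothing gets encoded
            self.cols_ = list(self.cols)
        # build a per-column ordinal mapping (value -> integer code)
        self.mapping_ = {
            c: {v: i for i, v in enumerate(sorted(X[c].unique()))}
            for c in self.cols_
        }
        # drop_invariant: remember zero-variance output columns at fit()
        # so transform() can drop them reliably
        if self.drop_invariant:
            encoded = self._encode(X)
            self.invariant_ = [c for c in encoded.columns
                               if encoded[c].nunique() <= 1]
        else:
            self.invariant_ = []
        return self

    def _encode(self, X):
        out = X.copy()
        for c in self.cols_:
            out[c] = out[c].map(self.mapping_[c])
        return out

    def transform(self, X):
        out = self._encode(pd.DataFrame(X)).drop(columns=self.invariant_)
        # return_df toggles between a pandas DataFrame and a numpy array
        return out if self.return_df else out.values
```

For example, on a frame with an object-typed 'color' column, a constant 'const' column, and a numeric 'x' column, fitting with `drop_invariant=True` and `cols=None` would infer the two object columns, encode them, and drop 'const' from the output; passing `cols=[]` would return the data unchanged.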
In the past few months a few people have expressed interest in contributing, and of course there are still plenty of things to help with, so if you're interested, leave a comment below or find me on github and get involved. We need help with documentation, the addition of new encoders, computational benchmarking, and most importantly, getting the library into production so we can find out where it's useful and where it's lacking.
So if you haven't already, check out categorical_encoding on github, and let me know what you think.