In the past I've posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I've added a single scikit-learn-compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, or anywhere in between) and get consistently encoded categorical variables out. Note that base 1 and one-hot aren't really the same thing, but in this case it's convenient to consider them as such.
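To make the idea concrete, here's a minimal sketch of base-N encoding independent of the package internals (not the actual BaseNEncoder implementation): each category gets an ordinal index, and that index is written out as digits in the chosen base, one column per digit. This sketch assumes base >= 2; base 1, the one-hot-like case, is special-cased by the real encoder.

```python
def base_n_encode(categories, base):
    """Map each category to the digits of its ordinal index, written in `base`."""
    assert base >= 2, "base 1 (one-hot-like) is handled specially by the encoder"
    # assign each distinct category an ordinal index, in order of first appearance
    index = {cat: i for i, cat in enumerate(dict.fromkeys(categories))}
    # number of digit columns needed to represent the largest index
    width = 1
    while base ** width <= max(index.values()):
        width += 1
    encoded = []
    for cat in categories:
        n = index[cat]
        digits = []
        for _ in range(width):
            digits.append(n % base)
            n //= base
        encoded.append(digits[::-1])  # most significant digit first
    return encoded

print(base_n_encode(['red', 'green', 'blue', 'green'], base=2))
# three categories fit in two binary columns: 00, 01, 10
```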

Practically, this adds very little new functionality; people rarely use base 3 or base 8 or anything other than ordinal or binary in real problems. Where it becomes useful, however, is when this encoder is coupled with a grid search.

```python
from __future__ import print_function

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])

# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)

    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    # grid_scores_ entries are (parameters, mean_validation_score, cv_validation_scores)
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (params, mean_score * 2, cv_scores))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
```

This code, from HERE, uses a normal scikit-learn grid search to find the optimal base for encoding categorical variables. The trade-off between how well pairwise distances between categories are preserved and the final dataset's dimensionality is no longer a difficult parameter to tune by hand.
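That trade-off can be illustrated directly. In one-hot encoding every pair of categories sits at the same Hamming distance, at the cost of one column per category; in base 2 the column count drops to the logarithm of the category count, but some categories end up "closer" to each other than others. A quick sketch for eight categories:

```python
from itertools import combinations

def hamming(a, b):
    """Count the positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))

# one-hot: 8 categories -> 8 columns, a single 1 per row
one_hot = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
# base 2: 8 categories -> 3 columns, the index written in binary
base2 = [[(i >> s) & 1 for s in (2, 1, 0)] for i in range(8)]

oh_dists = {hamming(a, b) for a, b in combinations(one_hot, 2)}
b2_dists = {hamming(a, b) for a, b in combinations(base2, 2)}

print(len(one_hot[0]), sorted(oh_dists))  # 8 columns, every pair at distance 2
print(len(base2[0]), sorted(b2_dists))    # 3 columns, distances vary from 1 to 3
```

Which base wins depends on the dataset and the model, which is exactly why it makes a good grid-search parameter.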

By running the above script we get:

```
# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99905151856) for [ 1.          1.          1.          1.          0.9976247 ]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492   0.99763033  0.99621212  0.9964455   0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765  0.98387419  0.9651717   0.96970966  0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905  1.          1.          1.          0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035  0.98854962  0.99745547  1.          0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332  0.8547619   0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534  0.98657761  0.89642857  0.9800385   0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062
```

This shows that for this relatively simple problem, with a small dataset, the dimension-inefficient one-hot encoding (base=1) is the best option available.

We've got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, which will be the first release since the package was included in scikit-learn-contrib, so if you're interested in this kind of work, head over to GitHub or reach out here to get involved.

Will, excellent work, as always. I support your approach to hyper-parameter optimization: I think the nature of the problem determines many things, including the preferred model, validation, and category encoding.

I have two questions:

* How do I obtain your package?

* What else do I need to use it?

Thanks.

David Wilt

P.S. Are you aware of any neural network that fully supports categorical predictors? That is, converting such predictors to numerical values causes them to be treated as continuous, which is problematic, as discussed in the paper below (which I can send you if interested). Brouwer, who recently died, had a very clever approach to addressing this issue, but I don't know if it was ever implemented.

Thanks.

****************************************************

A feed-forward network for input that is both categorical and quantitative

Roelof K. Brouwer*

Department of Computing Science, University College of the Cariboo, Kamloops, BC, Canada V2C 5N3

Received 24 January 2001; accepted 29 April 2002

Abstract

The data on which a multi-layer perceptron (MLP) is to be trained to approximate a continuous function may have inputs that are categorical rather than numeric or quantitative such as color, gender, race, etc. A categorical variable causes a discontinuous relationship between an input variable and the output. A MLP, with connection matrices that multiply input values and sigmoid functions that further transform values, represents a continuous mapping in all input variables. A MLP therefore requires that all inputs correspond to numeric, continuously valued variables and represents a continuous function in all input variables. The way that this problem is usually dealt with is to replace the categorical values by numeric ones and treat them as if they were continuously valued. However, there is no meaningful correspondence between the continuous quantities generated this way and the original categorical values. Another approach is to encode the categorical portion of the input using 1-out-of-n encoding and include this code as input to the MLP.

The approach in this paper is to segregate categorical variables from the continuous independent variables completely. The MLP is trained with multiple outputs; a separate output unit for each of the allowed combination of values of the categorical independent variables. During training the categorical value or combination of categorical values determines which of the output units should have the target value on it, with the remaining outputs being 'do not care'. Three data sets were used for comparison of methods. Results show that this approach is much more effective than the conventional approach of assigning continuous variables to the categorical features. In case of the data set where there were several categorical variables the method proposed here is also more effective than the 1-out-of-n input method. © 2002 Elsevier Science

Hey David,

Thanks for the kind words. To use the package, you can get the most recent version off of pip (`pip install category_encoders`), but BaseNEncoder hasn't been released yet, so to get the bleeding-edge version do `pip install git+https://github.com/scikit-learn-contrib/categorical-encoding`. It will install all of its dependencies along with it. If you run into trouble, let me know here or on GitHub.

As for the neural network question, I don't work with NNs all that much, so I can't speak to that issue in too much detail, but the simple thing would be to encode the categorical variables before passing them to your NN. If you have a scikit-learn style predictor for your NN, then you can use a scikit-learn pipeline to accomplish this, very similarly to the code in this post.
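A hedged sketch of that suggestion, using scikit-learn's built-in `MLPClassifier` and `OneHotEncoder` as stand-ins for whatever network and encoder you actually use (any scikit-learn-compatible encoder, including BaseNEncoder, slots into the same pipeline; the toy data here is made up):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier

# tiny made-up dataset: two categorical features, binary target
X = [['red', 'small'], ['blue', 'large'], ['red', 'large'], ['blue', 'small']]
y = [0, 1, 1, 0]

# the encoder turns strings into numeric columns before they reach the net
ppl = Pipeline([
    ('enc', OneHotEncoder(handle_unknown='ignore')),
    ('nn', MLPClassifier(solver='lbfgs', hidden_layer_sizes=(8,),
                         max_iter=2000, random_state=0)),
])
ppl.fit(X, y)
print(ppl.predict([['red', 'large']]))
```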

"Note that base 1 and one-hot aren't really the same thing"

What's the difference between them again?

From a pragmatic standpoint, they're the same thing, but the 0th category in 'base 1' would be [0 0 0], while in one-hot it'd be [0 0 1], so the difference is pretty slight.
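A sketch of my reading of that difference (not the exact package internals): one-hot reserves a column per category and never uses the all-zeros row, while the 'base 1' view starts counting from all zeros, so the 0th category comes out as [0 0 0].

```python
def one_hot(i, n_cols):
    """Category i gets a single 1, counting columns from the right."""
    return [1 if c == n_cols - 1 - i else 0 for c in range(n_cols)]

def base_one(i, n_cols):
    """Like one-hot, but the 0th category is the all-zeros row."""
    return [0] * n_cols if i == 0 else one_hot(i - 1, n_cols)

for i in range(3):
    print(i, one_hot(i, 3), base_one(i, 3))
# the 0th row is [0, 0, 1] under one-hot but [0, 0, 0] under 'base 1'
```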