BaseN Encoding and Grid Search in category_encoders

In the past I've posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding.  In my OSS package, category_encoders, I've added a single scikit-learn-compatible encoder called BaseNEncoder, which lets the user pick a base (2 for binary, N for ordinal, 1 for one-hot, or anything in between) and get consistently encoded categorical variables out.  Note that base 1 and one-hot aren't really the same thing, but in this case it's convenient to treat them as such.
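To make the base parameter concrete, here's a minimal sketch (not the library's actual implementation) of how a category's ordinal index turns into base-N digit columns:

```python
def base_n_digits(index, base, width):
    """Represent a category's ordinal index as `width` base-`base` digits."""
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]  # most-significant digit first

# 8 categories fit in 3 base-2 columns (2**3 = 8), or just 2 base-3 columns
print(base_n_digits(5, base=2, width=3))  # -> [1, 0, 1]
print(base_n_digits(5, base=3, width=2))  # -> [1, 2]
```

Lower bases spend more columns per category but keep each column's values simple; higher bases compress the representation at the cost of packing more categories into each column.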

Practically, this adds very little new functionality; people rarely use base 3, base 8, or any base other than ordinal or binary in real problems.  Where it becomes useful, however, is when this encoder is coupled with a grid search.

from __future__ import print_function
from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data
from sklearn.linear_model import LogisticRegression

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use numpy array not dataframe here
n_samples = X.shape[0]

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])


# Set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)
    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    for params, mean_score, cv_scores in clf.grid_scores_:
        # each grid_scores_ entry is (parameters, mean CV score, per-fold scores)
        print("%s (+/-%s) for %s" % (params, mean_score * 2, cv_scores))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
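Note that the script above targets the pre-0.18 scikit-learn API: sklearn.grid_search, sklearn.cross_validation, and the grid_scores_ attribute were all later removed. On current scikit-learn versions the same reporting loop is written against model_selection and cv_results_; here's a sketch of the pattern using the built-in iris data as a stand-in for the mushroom set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# same grid-search idea, modern imports and results access
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)
for mean, std, params in zip(grid.cv_results_['mean_test_score'],
                             grid.cv_results_['std_test_score'],
                             grid.cv_results_['params']):
    print("%0.3f (+/-%0.3f) for %r" % (mean, std * 2, params))
```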

This code uses an ordinary scikit-learn grid search to find the optimal base for encoding the categorical variables.  The trade-off between how well pairwise distances between categories are preserved and the dimensionality of the final dataset is no longer a difficult parameter to tune.
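The dimensionality side of that trade-off is easy to quantify: encoding k categories in base N takes roughly ceil(log_N(k)) columns, with base 1 degenerating to about one column per category. A quick illustration, assuming that simplified column count rather than the library's exact bookkeeping:

```python
import math

def n_encoded_columns(n_categories, base):
    """Approximate column count for a base-N encoding of n_categories levels."""
    if base == 1:
        return n_categories  # one-hot-like: one column per category
    return max(1, math.ceil(math.log(n_categories, base)))

# a 20-level categorical at each base the grid search tries
for base in (1, 2, 3, 4):
    print(base, n_encoded_columns(20, base))
```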

By running the above script we get:

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99905151856) for [ 1.         1.         1.         1.         0.9976247]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492   0.99763033  0.99621212  0.9964455   0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765  0.98387419  0.9651717   0.96970966  0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636  0.96541172  0.98387419  0.99013767  0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263  0.97556628  0.98636545  0.97058734  0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716  0.95480882  0.97648608  0.97769848  0.96790524]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'enc__base': 1}

Grid scores on development set:

{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905  1.          1.          1.          0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035  0.98854962  0.99745547  1.          0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332  0.8547619   0.94664667  0.98862857  0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178  0.98005271  0.98436023  0.99618321  0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534  0.98657761  0.89642857  0.9800385   0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568  0.97385496  0.99507452  0.95912053  0.97123861]

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2110
          1       1.00      1.00      1.00      1952

avg / total       1.00      1.00      1.00      4062

This shows that for this relatively simple problem, with a small dataset, the dimension-inefficient one-hot encoding (base=1) is the best option available.
We've got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, which will be the first release since the package was included in scikit-learn-contrib.  If you're interested in this kind of work, head over to GitHub or reach out here to get involved.

 

Will

Will has a background in Mechanical Engineering from Auburn, but mostly just writes software now. He was the first employee at Predikto, and is currently building out the premier platform for predictive maintenance in heavy industry there as Chief Scientist. When not working on that, he is generally working on something related to python, data science or cycling.

4 Comments

  1. Will, Excellent work, as always. I support your approach to hyper-parameter optimization: I think the nature of the problem determines many things, including the preferred model, validation, and category encoding.

    I have two questions:

    * How do I obtain your package?
    * What else do I need to use it?

    Thanks.

    David Wilt

    P.S. Are you aware of any neural network that fully supports categorical predictors? That is, converting such predictors to numerical values causes them to be treated as continuous, which is problematic, as discussed in the paper below (which I can send you if interested). Brouwer, who recently died, had a very clever approach to addressing this issue, but I don't know if it was ever implemented.

    Thanks.

    ****************************************************
    A feed-forward network for input that is both categorical and quantitative
    Roelof K. Brouwer*
    Department of Computing Science, University College of the Cariboo, Kamloops, BC, Canada V2C 5N3
    Received 24 January 2001; accepted 29 April 2002
    Abstract
    The data on which a multi-layer perceptron (MLP) is to be trained to approximate a continuous function may have inputs that are
    categorical rather than numeric or quantitative such as color, gender, race, etc. A categorical variable causes a discontinuous relationship
    between an input variable and the output. A MLP, with connection matrices that multiply input values and sigmoid functions that further
    transform values, represents a continuous mapping in all input variables. A MLP therefore requires that all inputs correspond to numeric,
    continuously valued variables and represents a continuous function in all input variables. The way that this problem is usually dealt with is to
    replace the categorical values by numeric ones and treat them as if they were continuously valued. However, there is no meaningful
    correspondence between the continuous quantities generated this way and the original categorical values. Another approach is to encode the
    categorical portion of the input using 1-out-of-n encoding and include this code as input to the MLP.
    The approach in this paper is to segregate categorical variables from the continuous independent variables completely. The MLP is trained
    with multiple outputs; a separate output unit for each of the allowed combination of values of the categorical independent variables. During
    training the categorical value or combination of categorical values determines which of the output units should have the target value on it,
    with the remaining outputs being ‘do not care’. Three data sets were used for comparison of methods. Results show that this approach is much
    more effective than the conventional approach of assigning continuous variables to the categorical features. In case of the data set where there
    were several categorical variables the method proposed here is also more effective than the 1-out-of-n input method. © 2002 Elsevier Science

    • Hey David,

      Thanks for the kind words. To use the package, you can get the most recent version off of pip (pip install category_encoders). BaseNEncoder hasn't been released yet, though, so to get the bleeding-edge version just do pip install git+https://github.com/scikit-learn-contrib/categorical-encoding.

      It will install all of its dependencies along with it. If you run into trouble, let me know here or on GitHub.

      As for the neural network question, I don't work with NNs all that much, so can't speak to that issue in too much detail, but the simple thing would be to encode the categorical variables before passing them to your NN. If you have a scikit-learn style predictor for your NN, then you can use the scikit-learn pipeline to accomplish this very similarly to the code in this post.

    • From a pragmatic standpoint, they're the same thing, but the 0th category in 'base 1' would be [0 0 0], while in one-hot it'd be [0 0 1], so it's a pretty slight difference.
