String handling by OneHotEncoder

Data Science Asked by pragun on October 4, 2021

I keep reading in recent questions and blog posts that, since version 0.20, OneHotEncoder can handle string features.

However, the documentation itself looks ambiguous. Here are its first two sentences:

Encode categorical integer features as a one-hot numeric array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.

The first sentence says it

encodes categorical integer features

and the next sentence says

input should be an array-like of integers or strings.

When I tried it, I still got a ValueError:

print(X.columns)
encoder = OneHotEncoder(categorical_features=[1,4,5])
encoder.fit(X)

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region'], dtype='object')
ValueError: could not convert string to float: 'female'

I am aware of other ways to encode string features, such as LabelEncoder, ColumnTransformer and pd.get_dummies(), but I specifically want to understand this behavior.

One Answer

This seems to fail only when you're using categorical_features, which was deprecated in version 0.20, the same release in which the encoder was extended to strings. Using the now-recommended ColumnTransformer to specify which columns to encode works with strings (as does applying the encoder to the entire frame, though that's not what you want here, since numeric features like bmi would be one-hot encoded too).

E.g.,

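from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(...)  # encoder options elided
cat_cols = [1, 4, 5]  # positions of sex, smoker, region
preproc = ColumnTransformer(transformers=[('onehot', onehot, cat_cols)],
                            remainder='passthrough')  # leave the other columns as-is
preproc.fit_transform(X)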
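
For a version that runs end to end, here is a minimal sketch on a made-up frame. The four rows are hypothetical, invented only to mirror the column names in the question, and the encoder is left at its defaults as an assumption. Columns are selected by name here; the integer positions [1, 4, 5] should select the same ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data reusing the column names from the question.
X = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

cat_cols = ["sex", "smoker", "region"]  # string-valued columns to encode

preproc = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), cat_cols)],
    remainder="passthrough",  # keep age, bmi and children unchanged
)
encoded = preproc.fit_transform(X)

# Categories learned for each encoded column, in the order of cat_cols.
categories = preproc.named_transformers_["onehot"].categories_
print(encoded.shape)
print(categories)
```

The string columns go straight into the encoder with no LabelEncoder step beforehand, and the one-hot columns come first in the output, followed by the passed-through numeric columns.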

Answered by Ben Reiniger on October 4, 2021
