Encoding with OrdinalEncoder: how to give levels as user input?

Question

I am trying to do ordinal encoding using:

from sklearn.preprocessing import OrdinalEncoder

I will try to explain my problem with a simple dataset.

import pandas as pd

X = pd.DataFrame({'animals':['low','med','low','high','low','high']})
enc = OrdinalEncoder()
enc.fit_transform(X.loc[:,['animals']])

array([[1.],
       [2.],
       [1.],
       [0.],
       [1.],
       [0.]])

It is labelling alphabetically, but if I try:

enc = OrdinalEncoder(categories=['low','med','high'])
enc.fit_transform(X.loc[:,['animals']])

Shape mismatch: if n_values is an array, it has to be of shape (n_features,).

Which I do not understand. I would like to be able to decide how the labelling is done.

I considered doing this:

level_mapping={'low':0,'med':1,'high':2}
X['animals']=X['animals'].replace(level_mapping)

However, I have a large number of features in my dataset which have similar categories, so I would rather not map each one by hand.

Thanks.

fugumagu · Answer

I'm not sure if you ever figured this out, but I was looking for an answer to this exact question and couldn't find a good one. I finally figured it out though.
OrdinalEncoder can encode multiple columns of a dataframe at once, so the categories parameter expects a list of lists: one inner list per column, in the same order as the columns you pass to fit. A flat list like ['low','med','high'] is read as category specifications for three separate columns, which is why you get the shape mismatch error when you only pass one column.
enc = OrdinalEncoder(categories=[list_of_values_cat1, list_of_values_cat2, ...])

Specifically, in your example above, you would just put ['low', 'med', 'high'] inside another list:
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
enc.fit_transform(X.loc[:,['animals']])
>>array([[0.],
         [1.],
         [0.],
         [2.],
         [0.],
         [2.]])
# Now 'low' is correctly mapped to 0, 'med' to 1, and 'high' to 2
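
If you want to double-check the mapping rather than read it off the output, something like the sketch below should work. It reuses the question's dataframe; the variable names encoded, levels and decoded are just mine for illustration. As far as I know, the fitted encoder exposes categories_ (one array per column, in encoding order) and inverse_transform turns the codes back into labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'animals': ['low', 'med', 'low', 'high', 'low', 'high']})

# One inner list per column, in the order the codes should follow
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
encoded = enc.fit_transform(X[['animals']])

# categories_ holds one array per encoded column; the position in the
# array is the integer code assigned to that label
levels = list(enc.categories_[0])
print(levels)

# inverse_transform maps the numeric codes back to the original labels
decoded = enc.inverse_transform(encoded)
print(decoded.ravel())
```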

To see how you can encode multiple columns with their own individual ordinal values, try this:
# Sample dataframe with 2 ordinal categorical columns: 'temp' and 'place'
categorical_df = pd.DataFrame({'my_id': ['101', '102', '103', '104'],
                               'temp': ['hot', 'warm', 'cool', 'cold'], 
                               'place': ['third', 'second', 'first', 'second']})

# In the 'temp' column, I want 'cold' to be 0, 'cool' to be 1, 'warm' to be 2, and 'hot' to be 3
# In the 'place' column, I want 'first' to be 0, 'second' to be 1, and 'third' to be 2
temp_categories = ['cold', 'cool', 'warm', 'hot']
place_categories = ['first', 'second', 'third']

# Now, when you instantiate the encoder, both of these lists go in one big categories list:
encoder = OrdinalEncoder(categories=[temp_categories, place_categories])

encoder.fit_transform(categorical_df[['temp', 'place']])
>>array([[3., 2.],
         [2., 1.],
         [1., 0.],
         [0., 1.]])
