TransWikia.com

Difference between OrdinalEncoder and LabelEncoder

Data Science Asked by Saurabh Singh on October 12, 2020

I was going through the official scikit-learn documentation after reading a book on ML, and came across the following:

The documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas the book describes sklearn.preprocessing.LabelEncoder(). When I checked their functionality, they looked the same to me. Can someone please tell me the difference between the two?

3 Answers

As far as I know, both have the same functionality. The difference is in the intent: OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.

That's why OrdinalEncoder can fit data of shape (n_samples, n_features), while LabelEncoder can only fit data of shape (n_samples,). (In the past, people used LabelEncoder inside a loop over the columns to do what is now OrdinalEncoder's job.)
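To make the shape difference concrete, here is a minimal sketch. The toy arrays `X` and `y` are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature matrix with shape (n_samples, n_features) = (3, 2)
X = np.array([["cold", "red"], ["hot", "blue"], ["warm", "red"]], dtype=object)

# OrdinalEncoder encodes every feature column in a single call
X_ordinal = OrdinalEncoder().fit_transform(X)

# Older pattern: fit one LabelEncoder per column inside a loop
X_label = np.column_stack(
    [LabelEncoder().fit_transform(X[:, j]) for j in range(X.shape[1])]
)

# LabelEncoder is intended for a 1D target of shape (n_samples,)
y = ["spam", "ham", "spam"]
y_encoded = LabelEncoder().fit_transform(y)
```

Both approaches should produce the same integer codes per column; OrdinalEncoder returns floats by default, while LabelEncoder returns integers.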

Answered by ipramusinto on October 12, 2020

As for differences in the implementation of OrdinalEncoder and LabelEncoder, the accepted answer mentions the shape of the data: OrdinalEncoder is for 2D data of shape (n_samples, n_features), while LabelEncoder is for 1D data of shape (n_samples,).

That's why an OrdinalEncoder raises an error:

ValueError: Expected 2D array, got 1D array instead:

...if trying to fit on 1D data: OrdinalEncoder().fit(['a','b'])
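A small sketch of that error and one common workaround, assuming the 1D data really is a single feature: reshape it into one column before fitting. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

data_1d = np.array(["a", "b"])  # shape (2,), i.e. 1D

# Fitting on 1D input is rejected with a ValueError
try:
    OrdinalEncoder().fit(data_1d)
    raised = False
except ValueError:
    raised = True

# Reshape into a single feature column of shape (2, 1), then encode
data_2d = data_1d.reshape(-1, 1)
encoded = OrdinalEncoder().fit_transform(data_2d)
```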

However, another difference between the encoders is the name of their learned parameter:

  • LabelEncoder learns classes_
  • OrdinalEncoder learns categories_

Notice the differences in fitting LabelEncoder vs OrdinalEncoder, and the differences in the values of these learned parameters. LabelEncoder.classes_ is a single 1D array, while OrdinalEncoder.categories_ is a list of arrays, one per feature column.

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

LabelEncoder().fit(['a','b']).classes_
# >>> array(['a', 'b'], dtype='<U1')

OrdinalEncoder().fit([['a'], ['b']]).categories_
# >>> [array(['a', 'b'], dtype=object)]

Other encoders that work in 2D, including OneHotEncoder, also use the attribute categories_.
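For instance, a quick sketch with OneHotEncoder on a made-up two-column input shows the same attribute, again holding one array per feature column:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 2D input: one temperature column and one colour column
X = [["cold", "red"], ["hot", "blue"], ["warm", "red"]]

encoder = OneHotEncoder().fit(X)

# categories_ holds one array of sorted categories per feature column
categories = [c.tolist() for c in encoder.categories_]

# The default output is sparse, so convert it to a dense array to inspect it
one_hot = encoder.transform(X).toarray()
```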

A note on the dtype <U1: it means little-endian, Unicode, 1 character (i.e. a string of length 1).
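The number in a NumPy Unicode dtype counts characters, not bytes. A short sketch to check this (the byte-order prefix depends on the machine, so treat `<` as what you would typically see on little-endian hardware):

```python
import numpy as np

arr = np.array(["a", "b"])
dtype_code = arr.dtype.str  # typically '<U1' on little-endian machines

# The width follows the longest string in the array: 'cold' and 'warm' have 4 characters
longer = np.array(["cold", "warm", "hot"]).dtype

# NumPy stores each Unicode character in 4 bytes, so U1 occupies 4 bytes per item
bytes_per_item = np.dtype("<U1").itemsize
```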

EDIT

In the comments on my answer, Piotr disagrees and points out the more general difference between ordinal encoding and label encoding:

  • Ordinal encoding is good for ordinal variables (where order matters, like cold, warm, hot);
  • as opposed to non-ordinal (aka nominal) variables (where order doesn't matter, like blonde, brunette).

This is a great concept, but this question asks about the sklearn classes and their implementation. It's interesting to see how the implementation does not match the concept: if you want ordinal encoding as Piotr describes (order is preserved), you must supply the order yourself, because neither OrdinalEncoder nor LabelEncoder can infer it.

As for the implementation, LabelEncoder and OrdinalEncoder behave consistently in the integers they choose. By default, both assign integers based on sorted order, which for strings is alphabetical. For example:

OrdinalEncoder().fit_transform([['cold'],['warm'],['hot']]).reshape((1,3))
# >>> array([[0., 2., 1.]])

LabelEncoder().fit_transform(['cold','warm','hot'])
# >>> array([0, 2, 1], dtype=int64)

Notice how both encoders assigned integers in alphabetical order 'c'<'h'<'w'.

But this part is important: Notice how neither encoder got the "real" order correct (i.e. the real order should reflect the temperature, where order is 'cold'<'warm'<'hot'); based on "real" order, the value 'warm' would have been assigned the integer 1.

In the blog post referenced by Piotr, the author does not even use OrdinalEncoder(). To achieve ordinal encoding the author does it manually: maps each temperature to a "real" order integer, using a dictionary like {'cold':0, 'warm':1, 'hot':2}:

Refer to this code using Pandas, where first we need to assign the real order of the variable through a dictionary... Though its very straight forward but it requires coding to tell ordinal values and what is the actual mapping from text to integer as per the order.
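The blog's code is not reproduced here, but the manual approach it describes can be sketched roughly as follows. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with an ordinal temperature column
df = pd.DataFrame({"temperature": ["cold", "warm", "hot", "warm"]})

# Spell out the real order by hand as a text-to-integer mapping
temperature_order = {"cold": 0, "warm": 1, "hot": 2}

# Series.map replaces each value using the dictionary
df["temperature_encoded"] = df["temperature"].map(temperature_order)
```

Any value missing from the dictionary would become NaN, so the mapping needs to cover every category that occurs in the data.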

In other words, if you're wondering whether to use OrdinalEncoder, please note OrdinalEncoder may not actually provide "ordinal encoding" the way you expect!
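If you do know the real order, one option is to pass it explicitly through OrdinalEncoder's categories parameter, which takes one list of categories per feature column. A minimal sketch, reusing the temperature example:

```python
from sklearn.preprocessing import OrdinalEncoder

# One list per feature column, written in the intended order
encoder = OrdinalEncoder(categories=[["cold", "warm", "hot"]])

# Integers now follow the given order instead of sorted order
encoded = encoder.fit_transform([["cold"], ["warm"], ["hot"]])
```

With the order given this way, 'warm' should map to 1 and 'hot' to 2, and values outside the listed categories are rejected by default.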

Answered by The Red Pea on October 12, 2020

You use ordinal encoding to preserve the order of categorical data, e.g. cold, warm, hot; low, medium, high. You use label encoding or one-hot encoding for categorical data where there is no order, e.g. dog, cat, whale. Check this post on Medium; it explains these concepts well.

Answered by Piotr Rarus - Reinstate Monica on October 12, 2020
