Data Science Asked by Saurabh Singh on October 12, 2020
I was going through the official scikit-learn documentation after reading a book on ML and came across the following: the documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas the book describes sklearn.preprocessing.LabelEncoder(). When I checked their functionality, they looked the same to me. Can someone please tell me the difference between the two?
As far as I know, both have the same functionality; the difference is the idea behind them. OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.
That's why OrdinalEncoder can fit data of shape (n_samples, n_features), while LabelEncoder can only fit data of shape (n_samples,) (though in the past, people used LabelEncoder inside a loop to do what has now become the job of OrdinalEncoder).
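A minimal sketch of the shape difference described above. The toy feature matrix X and target y are made up for illustration; they are not from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature matrix: shape (n_samples, n_features) = (3, 2)
X = np.array([["cold", "red"], ["hot", "blue"], ["warm", "red"]])
# Hypothetical target vector: shape (n_samples,) = (3,)
y = np.array(["cat", "dog", "cat"])

# OrdinalEncoder encodes every column of X independently
X_encoded = OrdinalEncoder().fit_transform(X)
# LabelEncoder encodes a single 1D array of labels
y_encoded = LabelEncoder().fit_transform(y)

print(X_encoded.shape)  # (3, 2)
print(y_encoded.shape)  # (3,)
```

The output shapes mirror the input shapes: a 2D array of floats from OrdinalEncoder and a 1D array of integers from LabelEncoder.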
Answered by ipramusinto on October 12, 2020
As for differences between the OrdinalEncoder and LabelEncoder implementations, the accepted answer mentions the shape of the data: OrdinalEncoder is for 2D data of shape (n_samples, n_features), while LabelEncoder is for 1D data of shape (n_samples,).
That's why OrdinalEncoder raises an error if you try to fit it on 1D data, for example OrdinalEncoder().fit(['a','b']):
ValueError: Expected 2D array, got 1D array instead:
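As a rough illustration of that error and one common way around it, the sketch below catches the ValueError and then reshapes the same values into a single column. Reshaping with reshape(-1, 1) is a suggestion here, not something the answer itself prescribes.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

values = np.array(["a", "b"])  # 1D, shape (2,)

try:
    OrdinalEncoder().fit(values)
    raised = False
except ValueError:
    raised = True  # 1D input is rejected by OrdinalEncoder

# Turning the values into one column, shape (2, 1), makes the input 2D
encoded = OrdinalEncoder().fit_transform(values.reshape(-1, 1))
print(raised)
print(encoded.shape)  # (2, 1)
```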
However, another difference between the encoders is the name of their learned parameter: LabelEncoder learns classes_, while OrdinalEncoder learns categories_.
Notice the differences in fitting LabelEncoder vs OrdinalEncoder, and in the values of these learned parameters: LabelEncoder.classes_ is a 1D array, while OrdinalEncoder.categories_ is a list of arrays, one per feature (effectively 2D).
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
LabelEncoder().fit(['a','b']).classes_
# >>> array(['a', 'b'], dtype='<U1')
OrdinalEncoder().fit([['a'], ['b']]).categories_
# >>> [array(['a', 'b'], dtype=object)]
Other encoders that work on 2D data, including OneHotEncoder, also use the attribute categories_.
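To make the "one entry per feature" structure of categories_ concrete, here is a small sketch with a made-up two-column input; the column contents are hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data with two categorical columns
X = [["cold", "red"], ["hot", "blue"], ["warm", "red"]]

ordinal = OrdinalEncoder().fit(X)
onehot = OneHotEncoder().fit(X)

# categories_ is a list holding one array of sorted categories per input column
ordinal_categories = [list(cats) for cats in ordinal.categories_]
onehot_categories = [list(cats) for cats in onehot.categories_]

print(ordinal_categories)
print(onehot_categories)
```

Both encoders learn the same per-column categories; they differ only in how transform() represents them.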
More info here about the dtype <U1 (little-endian, Unicode, 1 character; i.e. a string of length 1).
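A quick way to inspect that dtype with NumPy, as a sketch. The digit after U counts characters, not bytes; NumPy stores each Unicode character in 4 bytes.

```python
import numpy as np

arr = np.array(["a", "b"])

print(arr.dtype)           # <U1 on little-endian machines
print(arr.dtype.kind)      # U (Unicode string)
print(arr.dtype.itemsize)  # 4 (bytes per element: 1 character x 4 bytes)

# A longer string widens the dtype to fit the longest element
longer = np.array(["a", "bcd"])
print(longer.dtype.itemsize)  # 12
```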
EDIT
In the comments to my answer, Piotr disagrees; Piotr points out the difference between ordinal encoding and label encoding more generally.
Ordinal encoding is meant for ordinal variables, where order matters (e.g. cold, warm, hot); label encoding is meant for nominal variables, where order does not matter (e.g. blonde, brunette).
This is a great concept, but this question asks about the sklearn classes/implementation. It's interesting to see how the implementation does not match the concepts: if you want ordinal encoding as Piotr describes it (order is preserved), you must specify the order yourself (neither OrdinalEncoder nor LabelEncoder can infer order).
As for implementation, LabelEncoder and OrdinalEncoder seem to behave consistently in how they choose integers: both assign integers based on alphabetical order. For example:
OrdinalEncoder().fit_transform([['cold'],['warm'],['hot']]).reshape((1,3))
# >>> array([[0., 2., 1.]])
LabelEncoder().fit_transform(['cold','warm','hot'])
# >>> array([0, 2, 1], dtype=int64)
Notice how both encoders assigned integers in alphabetical order 'c'<'h'<'w'.
But this part is important: neither encoder got the "real" order correct (the real order should reflect temperature, i.e. 'cold'<'warm'<'hot'); based on the "real" order, the value 'warm' would have been assigned the integer 1.
In the blog post referenced by Piotr, the author does not even use OrdinalEncoder(). To achieve ordinal encoding, the author does it manually, mapping each temperature to a "real" order integer using a dictionary like {'cold':0, 'warm':1, 'hot':2}:
Refer to this code using Pandas, where first we need to assign the real order of the variable through a dictionary... Though its very straight forward but it requires coding to tell ordinal values and what is the actual mapping from text to integer as per the order.
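The blog post's code is not reproduced here, but the manual approach it describes might look roughly like this. The DataFrame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data; the column name "temperature" is an assumption
df = pd.DataFrame({"temperature": ["cold", "warm", "hot", "warm"]})

# The "real" order is stated explicitly as a text-to-integer mapping
temperature_order = {"cold": 0, "warm": 1, "hot": 2}

# Series.map replaces each value with its integer from the dictionary
df["temperature_encoded"] = df["temperature"].map(temperature_order)

print(df["temperature_encoded"].tolist())  # [0, 1, 2, 1]
```

Any value missing from the dictionary would come out as NaN, so the mapping has to cover every category.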
In other words, if you're wondering whether to use OrdinalEncoder, please note that OrdinalEncoder may not actually provide "ordinal encoding" the way you expect!
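That said, OrdinalEncoder does accept an explicit order through its categories parameter, even though it cannot infer one. A minimal sketch, assuming the temperature order 'cold'<'warm'<'hot' used in this answer:

```python
from sklearn.preprocessing import OrdinalEncoder

# One list of categories per feature, given in the desired order
encoder = OrdinalEncoder(categories=[["cold", "warm", "hot"]])

encoded = encoder.fit_transform([["cold"], ["warm"], ["hot"]])

print(encoded.ravel())  # [0. 1. 2.]
```

With the order supplied this way, 'warm' is assigned 1 rather than the 2 it gets under the default sorted order.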
Answered by The Red Pea on October 12, 2020
You use ordinal encoding to preserve the order of categorical data, e.g. cold, warm, hot; low, medium, high. You use label encoding or one-hot encoding for categorical data with no inherent order, e.g. dog, cat, whale. Check this post on Medium. It explains these concepts well.
Answered by Piotr Rarus - Reinstate Monica on October 12, 2020