Asked by UchuuStranger on September 4, 2021
From what I read online, there seems to be some confusion regarding the taxonomy and the terms used, so to avoid misunderstanding I’m going to define them here:
Label Encoding – encoding a nominal variable with arbitrary numeric labels.
Ordinal Encoding – encoding an ordinal variable with numeric labels arranged in a specific order.
The course on Machine Learning I’m currently taking compares One-Hot Encoding with Ordinal Encoding. However, during my research online I came to realize that "Ordinal Encoding" is actually a misnomer, and what that course actually demonstrates is called "Label Encoding". Ordinal Encoding is supposed to pertain strictly to ordinal variables, and the dataset in question didn’t even have any ordinal variables.
Where does that misnomer come from? It turns out it comes from the scikit-learn library, which has LabelEncoder and OrdinalEncoder classes. The thing is, the OrdinalEncoder class does not actually perform Ordinal Encoding by default. To make it ordinal, you have to specify the order in the categories parameter (and its usage is quite user-unfriendly – a plain pandas dictionary mapping achieves the same thing far more easily). If you don't, OrdinalEncoder will assign labels alphabetically, just like LabelEncoder does. So the real difference between the two classes is that one encodes a single column (1D input), while the other encodes one or more columns at a time (2D input). Perhaps it would be better and much less confusing if these classes were called "LabelEncoder1D" and "LabelEncoder2D".
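For concreteness, here's a minimal sketch (my own toy example, not from the course) of the behaviour described above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = np.array(["small", "large", "medium", "small"])

# LabelEncoder: 1D input, categories assigned integers alphabetically.
le = LabelEncoder()
print(le.fit_transform(sizes))                          # [2 0 1 2]

# OrdinalEncoder: 2D input, also alphabetical by default...
oe = OrdinalEncoder()
print(oe.fit_transform(sizes.reshape(-1, 1)).ravel())   # [2. 0. 1. 2.]

# ...unless an explicit order is given via `categories`.
oe_ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(oe_ordered.fit_transform(sizes.reshape(-1, 1)).ravel())  # [0. 2. 1. 0.]

# The pandas alternative mentioned above: a plain dictionary mapping.
print(pd.Series(sizes).map({"small": 0, "medium": 1, "large": 2}).tolist())
# [0, 2, 1, 0]
```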
So that's where mistakenly calling Label Encoding "Ordinal Encoding" comes from. But back to the question: the course I'm taking advocates using (what I've learned is really) Label Encoding for tree-based algorithms, because One-Hot Encoding performs much worse for trees (which is true). However, from what I read online, other Machine Learning platforms, such as R or H2O, can process nominal variables for trees without any encoding at all, and the requirement to encode everything into numeric form appears to be exclusively scikit-learn's problem. There is also conflicting information on whether trees perform better with Label Encoding: my course, along with some responses online, advocates its usage, but my intuition, along with other responses online, suggests that scikit-learn trees will not be able to recognize these labels as categories, and will mistakenly treat them as continuous values on a meaningful scale. Those responses therefore recommend One-Hot Encoding even for trees as the only option, despite it being sub-optimal.
So my questions are: 1) is it true that Label Encoding will be misinterpreted as a numeric scale by scikit-learn trees? 2) If so, are there any situations at all where arbitrary Label Encoding can be useful? Or does this technique have no use at all unless the variable is ordinal and a specific labeling order is given?
P.S.: I'm asking because my course has a whole lesson dedicated to teaching students "Ordinal" Encoding. At first I wanted to suggest that they rename it to "Label Encoding", but now I suspect the whole lesson is best removed altogether to avoid teaching students bad practices.
First, I generally agree that encoding unordered categories as consecutive integers is not a great approach: you are adding a ton of additional relationships that aren't present in the data.
Also, let me point out (because I nearly forgot) that there are two main families of decision tree: CART and the Quinlan family (ID3/C4.5/C5.0). The Quinlan family deals with categorical variables by using higher-arity splits, so no encoding is needed and the question is mostly moot there.
On Q1: yes, an ordinal encoding will be treated by the model as numeric (unless some other parameter controls that, as with LightGBM's categorical feature support). But for (most) trees, only the order is actually relevant: splits are simple threshold comparisons, so the scale is irrelevant, and e.g. the relationship "10 is twice as much as 5" is completely invisible to the tree.
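To see this concretely, here's a small demo (the feature name is made up) showing that the tree only ever uses threshold comparisons on the codes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
colour_code = rng.integers(0, 4, size=200)    # arbitrary integer codes for 4 levels
y = (colour_code == 2).astype(int)            # target depends on a single level

tree = DecisionTreeClassifier(max_depth=2).fit(colour_code.reshape(-1, 1), y)
print(export_text(tree, feature_names=["colour_code"]))
# Prints threshold splits such as "colour_code <= 2.50" and "colour_code <= 1.50",
# which together isolate the level coded 2 -- only the ordering is used.
```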
As you point out, one-hot encoding for a CART model can be detrimental, especially when there are many levels in a categorical feature: will the tree ever actually decide to split on one of the dummy variables, if it is only 1 for a small subset of the data? (Q2) But when you encode ordinally, there will, just by chance, be some splits that are useful and send many levels in each direction. (You may even try more than one random ordering of the levels as different features!)
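A sketch of that last parenthetical idea (the DataFrame and the column name "city" are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"city": ["rome", "oslo", "lima", "kiev", "rome", "lima"]})

levels = df["city"].unique()
for i in range(3):                              # three different random orderings
    codes = rng.permutation(len(levels))        # a random code for each level
    df[f"city_ord_{i}"] = df["city"].map(dict(zip(levels, codes)))
print(df)
```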
And yes, presumably the best approach is to use an implementation that can take advantage of the raw categoricals, using the average-response trick (ordering the levels by their mean target value, which yields the optimal binary split for regression and binary classification). (There's even some debate on how much that helps: some studies have been done, but generally the datasets are synthetic or too small to be representative.)
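For reference, the ordering step of the average-response trick can be sketched like this (toy data of my own; note that implementations apply it per split, whereas this is the global version):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["r", "g", "b", "r", "g", "b", "r", "b"],
                   "y":      [1,   0,   1,   1,   0,   0,   1,   1]})

# Order the levels by their mean target value, then encode by that rank.
order = df.groupby("colour")["y"].mean().sort_values().index
rank = {level: i for i, level in enumerate(order)}
df["colour_enc"] = df["colour"].map(rank)
print(rank)   # {'g': 0, 'b': 1, 'r': 2} given these means
```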
In other models, one-hot encoding is very often just fine, and doesn't suffer from the problem it causes for trees. If there are too many levels, and especially if some of them are very rare, you may consider smoothing techniques to avoid overfitting. (Q2) I'd be surprised if ordinal encoding is ever worth it for most models, but one would need to consider each model type individually, and probably do some testing.
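One possible smoothing technique (my assumption here, since no specific one is named above) is m-estimate smoothing of per-level target means, which shrinks rare levels toward the global mean:

```python
import pandas as pd

def m_estimate_encode(series: pd.Series, target: pd.Series, m: float = 10.0) -> pd.Series:
    """Encode each level by its target mean, shrunk toward the global mean."""
    global_mean = target.mean()
    stats = target.groupby(series).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return series.map(smoothed)

df = pd.DataFrame({"level": ["a", "a", "b", "c"], "y": [1, 0, 1, 1]})
print(m_estimate_encode(df["level"], df["y"]))
```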
As for naming, things are a bit muddy, but I don't think this is sklearn's fault. The "label" in LabelEncoder means it is supposed to be used on the labels, a.k.a. the dependent variable. And for that usage there is no debate about whether it's appropriate: sklearn simply requires consecutive integer labels for its multiclass classification; it doesn't use the numeric values as though they were mathematically meaningful.
As for OrdinalEncoder, it is meant to be used with an explicit input ordering of the categories. But one could argue that you are encoding the categorical variable in an ordinal way, so even with unordered categories this isn't necessarily a misnomer. See sklearn Issue #13488 for some related discussion.
Correct answer by Ben Reiniger on September 4, 2021
- is it true that Label Encoding will be misinterpreted as a numeric scale by scikit-learn trees?
Yes, scikit-learn treats it as a numeric value, and hence it will affect the depth and the structure of the resulting tree.
On results: different encodings will certainly require different hyperparameter tuning, but I am not sure whether label encoding can never achieve the best result, or whether it can if tuned properly.
It is also true that if the encoding happens to be aligned with the target, the tree will achieve a good result quickly.
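A toy illustration of that point (my own, not from the original answer): the same 8-level feature encoded (a) with arbitrary codes and (b) with codes aligned to the target.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
shuffled = rng.permutation(8)            # an arbitrary code for each category
cat = rng.integers(0, 8, size=1000)      # raw category index per row
y = (cat < 4).astype(int)                # half the categories are positive

for name, codes in [("arbitrary", shuffled[cat]), ("aligned", cat)]:
    t = DecisionTreeClassifier().fit(codes.reshape(-1, 1), y)
    print(name, "tree depth:", t.get_depth())
# The aligned codes are separable with a single split (depth 1); the
# arbitrary codes usually force a deeper tree to carve out the same levels.
```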
You may like to read this answer.
- if so, are there any situations at all where arbitrary Label Encoding can be useful, or does this technique have no use at all unless the variable is ordinal and a specific labeling order is given (i.e. Ordinal Encoding is useful only when it's truly ordinal)?
I doubt that it will work with other models, e.g. a neural network or linear regression: 10 would become two times 5 without any such underlying relation existing between those two values of the feature. If it does happen to work, it will be a coincidence, or the result of subconscious knowledge about the target (in effect, target encoding) while assigning the values "randomly".
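A tiny demonstration of that (my own toy numbers): a linear model must fit a single slope across the arbitrary codes, so "10 is twice 5" becomes a real, and wrong, assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

codes = np.array([0, 1, 2, 3] * 50).reshape(-1, 1)
y = np.array([5.0, -2.0, 7.0, 0.0] * 50)   # per-level means with no linear trend

# One slope forced across the arbitrary codes: poor fit.
print("label-encoded R^2:",
      LinearRegression().fit(codes, y).score(codes, y))

# One coefficient per level: perfect fit on this data.
onehot = OneHotEncoder().fit_transform(codes)   # sparse input is accepted
print("one-hot R^2:",
      LinearRegression().fit(onehot, y).score(onehot, y))
```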
- but now I suspect that the whole lesson is best removed altogether to avoid teaching students bad practices
I think students should know how it behaves and fails under different conditions, so that they can grasp the underlying concept.
Answered by 10xAI on September 4, 2021