Data Science Asked by carlo_sguera on August 20, 2021
I am having a look at this material and I have found the following statement:
For this class of models [Gradient Boosting Machine algorithms] […] it is both safe and significantly
more computationally efficient use an arbitrary integer encoding [also known as Numeric Encoding] for
the categorical variable even if the ordering is arbitrary [instead of
One-Hot encoding].
Do you know of any references that support this statement? I get that Numeric Encoding is more computationally efficient than One-Hot Encoding, but I would like to know more about their supposed equivalence for encoding unordered categorical variables in Gradient Boosting Methods.
Thanks!
This is actually a feature of tree-based models in general, not just gradient-boosted trees.
Not exactly a reference, but this Medium article explains why ordinal encoding is often more efficient.
On the topic of safety, I think the author should have said that ordinal encoding is safer with tree-based models than with linear methods, but still not perfectly safe. Decision-tree methods can find spurious rules within an arbitrary ordinal encoding, but they don't make the strong assumptions about numeric semantics that linear methods do.
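A rough sketch of that difference, using only made-up toy data: a three-level category is coded 1, 2, 3 and only the middle level has a positive target. The helper names fit_line and tree_predict are hypothetical, and the tree is written by hand rather than learned. A straight line through the codes cannot single out the middle level, while two threshold splits can.

```python
# Hypothetical toy data: arbitrary ordinal codes 1, 2, 3 for three categories;
# only the category coded 2 has a positive target.
codes = [1, 2, 3]
targets = [0.0, 1.0, 0.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def tree_predict(code):
    """Hand-written tree with two threshold splits: code > 1, then code < 3."""
    if code > 1:
        return 1.0 if code < 3 else 0.0
    return 0.0

slope, intercept = fit_line(codes, targets)
linear_preds = [slope * x + intercept for x in codes]  # same value for every code
tree_preds = [tree_predict(x) for x in codes]          # matches the targets

print("linear:", linear_preds)
print("tree:  ", tree_preds)
```

The linear fit treats the codes as points on a number line, so it can only express "more" or "less" of the category; the tree only uses the codes to carve out intervals.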
. . . I would like to know more about their supposed equivalence to encode unordered categorical variables . . .
Any rule derived with one-hot encoding can also be represented with ordinal encoding; it just might take more splits.
To illustrate, suppose you have a categorical variable foo with possible values spam, ham, eggs. A one-hot encoding would create 3 dummy variables, is_spam, is_ham, is_eggs. Let's say an arbitrary ordinal encoding assigns spam = 1, ham = 2, and eggs = 3.
Suppose the OHE decision tree splits on is_eggs = 1. This can be represented in the ordinal decision tree by the split foo > 2. Suppose the OHE tree splits on is_ham = 1. The ordinal tree will require two splits: foo > 1 then foo < 3.
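The two cases above can be checked mechanically. This is a minimal sketch, not a fitted model: the rules are written by hand, and the names ORDINAL, one_hot, and the four rule functions are made up for illustration.

```python
# Arbitrary ordinal encoding from the example above.
ORDINAL = {"spam": 1, "ham": 2, "eggs": 3}

def one_hot(value):
    """Return the dummy variables is_spam, is_ham, is_eggs for one value of foo."""
    return {f"is_{name}": int(name == value) for name in ORDINAL}

def ohe_rule_is_eggs(value):
    # Single split on the one-hot feature: is_eggs = 1
    return one_hot(value)["is_eggs"] == 1

def ordinal_rule_is_eggs(value):
    # Single split on the ordinal feature: foo > 2
    return ORDINAL[value] > 2

def ohe_rule_is_ham(value):
    # Single split on the one-hot feature: is_ham = 1
    return one_hot(value)["is_ham"] == 1

def ordinal_rule_is_ham(value):
    # Two nested splits on the ordinal feature: foo > 1, then foo < 3
    foo = ORDINAL[value]
    if foo > 1:
        return foo < 3
    return False

for value in ORDINAL:
    print(
        value,
        ohe_rule_is_eggs(value) == ordinal_rule_is_eggs(value),
        ohe_rule_is_ham(value) == ordinal_rule_is_ham(value),
    )
```

Both rule pairs agree on every category; the ordinal version of the ham rule simply costs one extra split because ham sits in the middle of the arbitrary ordering.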
Answered by zachdj on August 20, 2021