Data Science Asked by user4446237 on May 9, 2021
I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables in machine learning classification algorithms.
One-hot encoding leads to very high dimensionality. The approach I've landed on is target encoding (mean encoding). I understand how to use this when the categorical feature is a single choice (e.g. current zip code). But when the feature can take on multiple values from a large list (e.g. favorite hobbies, illness symptoms, university coursework), I am not sure how to combine the values.
My intuition says that the wrong approach would be to treat each unique combination as its own level and encode that, as it would lead to overfitting. Other things that come to mind are simple aggregations of the per-value encodings, like sum/avg/product/variance.
How should target encoded values be combined?
There are several options:
Domain knowledge - Given what you know about the domain, combine the categories that make the most sense.
Empirical - Treat the way categories are combined as a hyperparameter. Search through the space of options and pick the best combinations based on cross-validation score.
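As a rough illustration of the empirical option, here is a minimal sketch in plain Python. It assumes a binary target, encodes each individual value with a smoothed target mean, and then compares a few aggregation functions by out-of-fold squared error. The function names, the `prior_weight` smoothing parameter, the toy data, and the choice of squared error as the score are all assumptions made for the example, not something prescribed above; in practice the aggregated feature would usually be fed to a real classifier and scored with whatever metric matters for the task.

```python
from collections import defaultdict
from statistics import mean


def fit_value_encodings(rows, targets, prior_weight=5.0):
    """Smoothed mean target for each individual category value (hypothetical helper)."""
    global_mean = mean(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for values, y in zip(rows, targets):
        for v in set(values):  # count each value once per row
            sums[v] += y
            counts[v] += 1
    # Shrink rare values toward the global mean to limit overfitting.
    enc = {
        v: (sums[v] + prior_weight * global_mean) / (counts[v] + prior_weight)
        for v in counts
    }
    return enc, global_mean


# Candidate ways to combine the per-value encodings of one row.
AGGREGATORS = {"mean": mean, "max": max, "min": min}


def encode_row(values, enc, global_mean, how):
    """Collapse a row's multiple values into a single number."""
    scores = [enc.get(v, global_mean) for v in set(values)]
    if not scores:  # row with no values: fall back to the global mean
        return global_mean
    return AGGREGATORS[how](scores)


def cv_score(rows, targets, how, n_folds=3):
    """Out-of-fold squared error, using the aggregated encoding as the prediction."""
    errors = []
    for fold in range(n_folds):
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        test = [i for i in range(len(rows)) if i % n_folds == fold]
        # Fit encodings on the training folds only, to avoid target leakage.
        enc, gm = fit_value_encodings(
            [rows[i] for i in train], [targets[i] for i in train]
        )
        for i in test:
            pred = encode_row(rows[i], enc, gm, how)
            errors.append((pred - targets[i]) ** 2)
    return mean(errors)


# Toy data: each row lists a person's hobbies; the target is a binary label.
rows = [["chess", "hiking"], ["hiking"], ["chess", "golf"],
        ["golf"], ["hiking", "golf"], ["chess"]]
targets = [1, 0, 1, 0, 0, 1]

scores = {how: cv_score(rows, targets, how) for how in AGGREGATORS}
best = min(scores, key=scores.get)  # lower error is better
print(scores, best)
```

Fitting the encodings inside each fold matters: computing them on the full data before splitting leaks the target into the validation rows and makes every aggregation look better than it is.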
Answered by Brian Spiering on May 9, 2021