Data Science Asked by MartykQ on March 19, 2021
There is a lot of information on how to handle categorical variables when preprocessing data for ML classification. However, I cannot find any guidance on how to handle categorical variables where each sample can belong to more than one label.
I’m working on a bug detection classifier. I have many features, such as who contributed to the source code. There are about 200 unique labels, and creating that many dummy variables makes my model overfit.
So, are there any alternatives to this method, such as target-based encoders (e.g. CatBoost)?
Having 200 unique labels and encoding them with MultiLabelBinarizer does not automatically mean overfitting. Whether a model overfits is an empirical question: the number of features and observations, relative to the complexity of the algorithm, affects the chance of overfitting.
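As a rough illustration of checking this empirically, here is a minimal sketch that multi-hot encodes a contributor feature with scikit-learn's MultiLabelBinarizer and compares training accuracy with cross-validated accuracy. The data is synthetic, and the contributor names, sample count, and choice of LogisticRegression are assumptions made for the example, not details from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each sample lists 1 to 4 contributors out of 200.
rng = np.random.default_rng(0)
names = [f"dev_{i}" for i in range(200)]
contributors = [
    rng.choice(names, size=rng.integers(1, 5), replace=False).tolist()
    for _ in range(1000)
]
# Random binary target (bug / no bug), purely illustrative.
y = rng.integers(0, 2, size=1000)

# One 0/1 column per contributor; a sample can have several columns set.
mlb = MultiLabelBinarizer(classes=names)
X = mlb.fit_transform(contributors)

# Compare training and held-out accuracy across 5 folds.
model = LogisticRegression(C=1.0, max_iter=1000)
scores = cross_validate(model, X, y, cv=5, return_train_score=True)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"train={scores['train_score'].mean():.3f} "
      f"test={scores['test_score'].mean():.3f} gap={gap:.3f}")
```

A large gap between the training and held-out scores suggests overfitting. In that case, stronger regularization (a smaller `C` here) or more observations may help before abandoning the encoding.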
Answered by Brian Spiering on March 19, 2021