TransWikia.com

Decision trees, categorization and oversampling

Data Science Asked by Elisa on May 21, 2021

I want to create a model to predict the propensity to buy a certain product. Since my proportion of 1’s is very low, I decided to apply oversampling (to get 10% 1’s and 90% 0’s).

Now, I want to discretize some of the variables. To do so, I fit a decision tree for each variable against the target.

Should I define the prior probabilities when I fit these trees, or does it not matter, so that I can use the oversampled dataset as is?

One Answer

Do you use Python? scikit-learn's DecisionTreeClassifier has a class_weight parameter for this purpose, so you do not need to adjust the priors manually. See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
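As a rough sketch of how this could look for a single variable: the code below fits a shallow tree with class_weight set so that the oversampled data is rescaled back to the original class proportions, then reads the split thresholds off the fitted tree to use as bin edges. The helper names (prior_correction_weights, tree_cut_points), the synthetic data, and the 1% original rate are assumptions made for illustration; only the 10% oversampled rate comes from the question. If you do not know the original rate, class_weight="balanced" is the built-in alternative, although it equalizes the classes rather than restoring the priors.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def prior_correction_weights(original_rate, sampled_rate):
    """Class weights that rescale an oversampled set back to the original priors."""
    return {
        0: (1 - original_rate) / (1 - sampled_rate),
        1: original_rate / sampled_rate,
    }


def tree_cut_points(x, y, class_weight=None, max_leaf_nodes=4, min_samples_leaf=50):
    """Fit a shallow tree on one variable and return its sorted split thresholds."""
    tree = DecisionTreeClassifier(
        max_leaf_nodes=max_leaf_nodes,
        min_samples_leaf=min_samples_leaf,
        class_weight=class_weight,
        random_state=0,
    )
    tree.fit(np.asarray(x).reshape(-1, 1), y)
    # Internal (split) nodes have a feature index >= 0; leaves are marked -2.
    is_split = tree.tree_.feature >= 0
    return np.sort(tree.tree_.threshold[is_split])


# Synthetic stand-in for one variable of the oversampled data (not real data).
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = (rng.random(5000) < np.where(x > 1.0, 0.40, 0.05)).astype(int)

# Assumed rates: 1% buyers originally (hypothetical), 10% after oversampling.
weights = prior_correction_weights(original_rate=0.01, sampled_rate=0.10)
cuts = tree_cut_points(x, y, class_weight=weights)

# right=True mirrors the tree's rule that x <= threshold goes to the left child.
bins = np.digitize(x, cuts, right=True)
print(weights, cuts, np.bincount(bins))
```

Running the same function with class_weight=None and comparing the two sets of cut points is a quick way to see whether the priors matter for your variables.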

Regarding discretizing versus encoding: it is hard to say which is better without knowing your data. Unless you are sure one is the best choice, build the model both ways (once with encoded variables, once with discretized ones) and compare the quality.
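One possible way to run that comparison is sketched below. It is only an illustration under several assumptions: the data is synthetic, "encoding" is taken to mean keeping the variables numeric and scaled, scikit-learn's KBinsDiscretizer with quantile bins stands in for the tree-based binning, and logistic regression with cross-validated ROC AUC stands in for whatever model and metric you actually use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic data with a non-linear effect of the first variable (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
logit = -2.0 + 1.5 * np.abs(X[:, 0]) + 0.5 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Variant 1: variables kept numeric, only scaled.
raw_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Variant 2: variables discretized into quantile bins, then one-hot encoded.
binned_model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile"),
    LogisticRegression(max_iter=1000),
)

# Same folds and the same metric for both variants.
auc_raw = cross_val_score(raw_model, X, y, cv=5, scoring="roc_auc").mean()
auc_binned = cross_val_score(binned_model, X, y, cv=5, scoring="roc_auc").mean()
print(f"raw: {auc_raw:.3f}  binned: {auc_binned:.3f}")
```

When doing this on your own data, score both variants on rows that were not oversampled, otherwise duplicated 1's can appear in both the training and the validation folds and inflate the result.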

Answered by Denis on May 21, 2021
