Decision trees, categorization and oversampling

Question

I want to build a model that predicts the propensity to buy a certain product. Because the proportion of 1's is very low, I oversampled the data to get 10% 1's and 90% 0's.
Now I want to discretize some of the variables. To do so, I fit a separate tree for each variable against the target.
Should I specify the prior probabilities when I fit these trees, or does it not matter, so that I can use the oversampled dataset as is?

Denis · Answer

Are you using Python?
scikit-learn's DecisionTreeClassifier has a class_weight parameter for this purpose,
so you do not need to adjust the priors manually. See the documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
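To make this concrete, here is a minimal sketch of how class_weight could be used to weight an oversampled sample back to its original class proportions before fitting a one-variable tree for binning. The data is synthetic, and the 1% original rate (true_p1), the tree size limits and all variable names are assumptions for illustration, not values from the question.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the oversampled data: about 10% ones, 90% zeros.
n = 5000
y = (rng.random(n) < 0.10).astype(int)
x = rng.normal(loc=y * 1.5, scale=1.0)  # one numeric variable, shifted for buyers

# Assumed share of 1's before oversampling; replace with your own figure.
true_p1 = 0.01
sample_p1 = y.mean()

# Weight each class by (original prior / oversampled prior).
class_weight = {
    0: (1 - true_p1) / (1 - sample_p1),
    1: true_p1 / sample_p1,
}

# A small tree on a single variable; its split points become the bin edges.
tree = DecisionTreeClassifier(
    max_leaf_nodes=4,
    min_samples_leaf=200,
    class_weight=class_weight,
    random_state=0,
)
tree.fit(x.reshape(-1, 1), y)

# Internal nodes have feature >= 0; leaves are marked with a negative value.
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])
bins = np.digitize(x, thresholds)  # bin index per row, 0 .. len(thresholds)
```

With these weights the weighted share of 1's equals true_p1, so the split criterion sees the original class balance rather than the oversampled one. Note that class_weight="balanced" does something different: it weights the classes to equal totals.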
As for discretizing versus encoding, it is hard to say which is better without knowing your data.
Unless you are sure one is the better choice, fit the model with encoding instead of discretizing and compare the quality of the two.
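One way such a comparison might look is sketched below: the same classifier is scored with cross-validation once on the raw variable and once on a binned version. The synthetic data, the choice of logistic regression, the five uniform bins and ROC AUC as the quality measure are all assumptions for illustration; substitute your own variables, encoding and metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: about 10% ones, one numeric variable shifted for buyers.
n = 4000
y = (rng.random(n) < 0.10).astype(int)
X = rng.normal(loc=y * 1.0, scale=1.0).reshape(-1, 1)

# Variant 1: keep the variable numeric (only scaled).
raw_model = make_pipeline(StandardScaler(), LogisticRegression())

# Variant 2: discretize into bins inside the pipeline, so the bin edges
# are learned on each training fold only.
binned_model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform"),
    LogisticRegression(),
)

# Mean cross-validated ROC AUC for each variant.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in [("raw", raw_model), ("binned", binned_model)]
}
print(scores)
```

Whichever variant scores higher on your own data is the one to prefer. A small difference between the two may be within the noise of the cross-validation folds.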
