Cross Validated Asked by Aflatoun on February 17, 2021
I want to discretize a continuous variable $X$ into a given number of classes $k$ (assume for simplicity that $k$ is even).
Decision trees (and related methods) are already used to discretize a variable in a supervised context, but I find nothing regarding decision trees in an unsupervised context (in the sense of Kohavi et al.). The idea is simply to build a regression tree model where the target and the input are the same, i.e. $y=X$. This should result in subgroups (from which I derive the classes) which maximize the between-group variance.
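To make the idea concrete, here is a minimal sketch of it, assuming scikit-learn's `DecisionTreeRegressor` with `max_leaf_nodes=k`. The helper name `tree_discretize` and the synthetic `shares` data are made up for illustration. Note that the tree is grown greedily, so the cut points are not guaranteed to be the globally optimal partition for the within-group variance criterion.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def tree_discretize(x, k, random_state=0):
    """Bin a 1-D array into at most k classes with a regression tree fit on y = x.

    Hypothetical helper, written only to illustrate the idea in the question.
    """
    x = np.asarray(x, dtype=float)
    tree = DecisionTreeRegressor(max_leaf_nodes=k, random_state=random_state)
    tree.fit(x.reshape(-1, 1), x)  # target and input are the same variable

    # Internal nodes have feature >= 0; leaves are marked with a negative value.
    is_split = tree.tree_.feature >= 0
    thresholds = np.sort(tree.tree_.threshold[is_split])

    # scikit-learn sends samples with x <= threshold to the left child,
    # so a value equal to a threshold belongs to the lower class.
    labels = np.searchsorted(thresholds, x, side="left")
    return labels, thresholds


# Synthetic, right-skewed stand-in for a continuous variable (assumption).
rng = np.random.default_rng(0)
shares = rng.lognormal(mean=3.0, sigma=1.0, size=500)

labels, thresholds = tree_discretize(shares, k=4)
print("cut points:", thresholds)
print("class sizes:", np.bincount(labels))
```

Since each split minimizes the squared error within the resulting groups, the criterion is the same within-group sum of squares that K-means minimizes in one dimension; only the search strategy differs.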
Yet I feel like I'm missing something obvious, since the method does not seem to be used anywhere. The module sklearn.preprocessing offers methods based on K-means, on the distribution of the variable (quantiles), and on the range of the variable (equal-width bins), but nothing based on decision trees.
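For reference, the three existing options can be compared side by side. This is a small sketch assuming `KBinsDiscretizer` from `sklearn.preprocessing` with its `uniform`, `quantile`, and `kmeans` strategies; the data are synthetic and only stand in for a skewed continuous variable.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed data (assumption, for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=500).reshape(-1, 1)

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    disc.fit(x)
    # bin_edges_ holds one array of edges per input feature.
    edges[strategy] = disc.bin_edges_[0]
    print(strategy, np.round(edges[strategy], 1))
```

On skewed data the three strategies usually give visibly different cut points, which is worth checking before choosing one.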
Edit: Here is an example. Suppose I'm trying to predict the popularity of a set of articles for a press company. Popularity is measured by the number of shares on social networks, so it's a continuous variable. The company wants me to discretize the number of shares into 4 classes (from very popular to not popular at all) before moving on to prediction, because that is more readable from a business point of view. Can decision trees be of any help in achieving this goal?
Based on the example added to the original question, it seems that you already have data on the "popularity" of the articles. In that case I agree with @ttnphns that the best approach for discretization, if you really have to do it, would be to break up the known continuous "popularity" measure into quartiles of its distribution: rank the cases in order of "popularity" and split them into four equal-sized groups based on the ranks. That "get[s] 4 classes (from very popular to not popular at all)."
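The rank-then-split step could look like the following sketch, assuming pandas and synthetic share counts. The two middle class names are placeholders, since the question only names the two extremes.

```python
import numpy as np
import pandas as pd

# Synthetic share counts (assumption, for illustration only).
rng = np.random.default_rng(0)
shares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=400), name="shares")

# Only the two end labels come from the question; the middle ones are placeholders.
class_names = ["not popular at all", "less popular", "popular", "very popular"]

# Ranking first (ties broken by order of appearance) keeps the groups equal-sized
# even when several articles have the same number of shares.
ranks = shares.rank(method="first")
popularity_class = pd.qcut(ranks, q=4, labels=class_names)

print(popularity_class.value_counts())
```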
I strongly recommend avoiding the discretization as much as possible, however. Even if there needs to be a breakdown into groups to make things "more readable from a business point of view," that shouldn't interfere with the modeling itself if you are trying to predict a continuous measure like "popularity." To make things "readable from a business point of view," you could provide examples of prediction performance within those groups. The predictions will be most reliable and the examples least misleading, however, if you set up your prediction model to work with the actual "popularity" values as the outcome variable.
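As a rough sketch of that workflow: model the continuous outcome directly, then use the groups only to report how well the predictions do. Everything here is assumed for illustration, including the synthetic features, the linear model, and the group labels; none of it comes from the question's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three hypothetical article features and a continuous outcome.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
popularity = 3.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, popularity, test_size=0.3, random_state=0
)

# The model is trained on the continuous values, not on the classes.
model = LinearRegression().fit(X_train, y_train)

report = pd.DataFrame({"actual": y_test, "predicted": model.predict(X_test)})
report["abs_error"] = (report["actual"] - report["predicted"]).abs()

# Quartile groups are used only for presentation.
report["group"] = pd.qcut(
    report["actual"].rank(method="first"),
    q=4,
    labels=["Q1 (least popular)", "Q2", "Q3", "Q4 (most popular)"],
)

summary = report.groupby("group", observed=True)["abs_error"].agg(["count", "mean"])
print(summary)
```

The summary table gives one row per group with the number of test articles and their mean absolute error, which is the kind of group-level readout a business audience can use without the model itself being trained on discretized values.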
Answered by EdM on February 17, 2021