Data Science Asked on July 14, 2021
Is there an efficient way to handle pruning in a decision tree with Python?
Currently I'm doing this:
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm

def do_best_tree(Xtrain, ytrain, Xtest, ytest):
    # Fit an unpruned tree and compute its cost-complexity pruning path
    clf = DecisionTreeClassifier()
    clf.fit(Xtrain, ytrain)
    path = clf.cost_complexity_pruning_path(Xtrain, ytrain)
    ccp_alphas = path.ccp_alphas
    # Refit one tree per candidate alpha
    clfs = []
    for ccp_alpha in tqdm(ccp_alphas):
        clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha)
        clf.fit(Xtrain, ytrain)
        clfs.append(clf)
    # Keep the tree that scores best on the held-out set
    return max(clfs, key=lambda x: x.score(Xtest, ytest))
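For reference, a minimal way to exercise the function above (the load_iris data and train_test_split split are placeholders, not part of the question):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hypothetical data purely to demonstrate the call
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

best = do_best_tree(Xtrain, ytrain, Xtest, ytest)
print(best.ccp_alpha, best.score(Xtest, ytest))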
But it's super slow (as it creates and fits a lot of trees).
Is there a more efficient way to do this with scikit-learn, or another library that handles this?
You might benefit from random forests instead, which aim at the same objective you are pursuing: better generalization by limiting overfitting, which is what pruning does for a single tree.
scikit-learn's random forest lets you specify how many features, or what proportion of them, each split is allowed to consider (the max_features parameter); the results of the many trees are then averaged for even better generalization performance. A minimal sketch of that suggestion follows.
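In this sketch, the n_estimators and max_features values are illustrative choices, and Xtrain/ytrain/Xtest/ytest are assumed to be the same arrays as in the question:

from sklearn.ensemble import RandomForestClassifier

# Each tree is grown on a bootstrap sample and, at every split,
# considers only a random subset of the features (here the square
# root of the feature count); averaging many such decorrelated
# trees curbs overfitting without an explicit pruning step.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
rf.fit(Xtrain, ytrain)
print(rf.score(Xtest, ytest))

RandomForestClassifier also accepts a ccp_alpha parameter if you still want cost-complexity pruning applied to each tree in the ensemble.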
Answered by Nitin on July 14, 2021