Data Science Asked by codeman340 on January 15, 2021
I have a data set with categorical variables. I have defined a decision tree algorithm and transformed these columns to their numerical equivalents using the one-hot encoding functionality in sklearn:
Create Decision Tree classifier object:

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

clf2 = DecisionTreeClassifier(criterion='entropy')
pipe = make_pipeline(column_trans, clf2)  # (1)
pipe.fit(X_train2, y_train2)
where:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

column_trans = make_column_transformer(
    (OneHotEncoder(), ['ShelveLoc', 'Urban', 'US']),
    remainder='passthrough')
Now, when I built the decision tree without the pipeline, using pandas directly for the categorical feature encoding, I was able to find suitable candidates for alpha to prune the decision tree via:
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
ccp_alphas = ccp_alphas[:-1]  # remove max value of alpha
whereas now, with my model baked into the pipe object in (1), when I try to find candidate alphas with
path = pipe.cost_complexity_pruning_path(X_train2, y_train2)
I get an error message saying that pipe has no attribute cost_complexity_pruning_path, and looking through the attributes available on pipe, I can't find it either.
Is it only possible to do cost complexity pruning if you build the model without the pipeline functionality in sklearn?
I have had a first crack at a workaround, although it's ugly and won't scale:
import numpy as np
from sklearn import metrics

alpha_candidates = np.arange(0.0, 0.5, 0.001).tolist()
alpha_accuracy_list = []

# Create a Decision Tree classifier for each candidate alpha and score it
for i in alpha_candidates:
    clf2_entropy_alpha = DecisionTreeClassifier(criterion='entropy', ccp_alpha=i, random_state=42)
    pipe = make_pipeline(column_trans, clf2_entropy_alpha)
    pipe.fit(X_train2, y_train2)
    y_pred2_entropy_alpha = pipe.predict(X_test2)
    alpha_accuracy = [i, metrics.accuracy_score(y_test2, y_pred2_entropy_alpha)]
    alpha_accuracy_list.append(alpha_accuracy)
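A less manual route, if I'm not missing something, would be to let GridSearchCV sweep ccp_alpha through the pipeline itself; make_pipeline names each step after its lowercased class name, which is where the decisiontreeclassifier__ccp_alpha key below comes from. A rough sketch, reusing column_trans, X_train2 and y_train2 from above:

import numpy as np
from sklearn.model_selection import GridSearchCV

# Address the tree's ccp_alpha via the auto-generated step name.
pipe = make_pipeline(column_trans, DecisionTreeClassifier(criterion='entropy', random_state=42))
param_grid = {'decisiontreeclassifier__ccp_alpha': np.arange(0.0, 0.5, 0.001)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train2, y_train2)
print(search.best_params_, search.best_score_)

This cross-validates each candidate alpha instead of scoring once on the held-out set, which would also avoid tuning against X_test2.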
Thoughts?
Answered by Codeman340 on January 15, 2021
Pipelines themselves don't generally carry the methods and attributes of the final estimator, aside from basics like predict, predict_proba, and transform. If you need to access a method of a step, you should access the step itself using one of:

pipe[-1]
pipe['decisiontreeclassifier']
pipe.named_steps['decisiontreeclassifier']
However, in this case it's a little trickier, because cost_complexity_pruning_path needs the dataset X, y, but you need your pipeline's transformer to apply to it first. It's a little cumbersome, but I think this should work and is relatively straightforward:

pipe[-1].cost_complexity_pruning_path(
    pipe[:-1].transform(X),
    y,
)

(Note that pipe[-1] is the final estimator in the pipeline, and pipe[:-1] is every step except the last.)
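Spelled out with the variable names from the question, and assuming the pipeline from (1) has already been fit so that the column transformer is fitted, a minimal sketch:

pipe.fit(X_train2, y_train2)

# Encode the training data with every step except the final estimator,
# then ask the tree itself for the pruning path.
X_train2_enc = pipe[:-1].transform(X_train2)
path = pipe[-1].cost_complexity_pruning_path(X_train2_enc, y_train2)
ccp_alphas = path.ccp_alphas[:-1]  # drop the largest alpha, as in the question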
Answered by Ben Reiniger on January 15, 2021