Data Science Asked by codeman340 on January 15, 2021
I have a data set with categorical variables. I have defined a decision tree algorithm and transformed these columns to their numerical equivalents using the one-hot encoding functionality in sklearn:
Create Decision Tree classifier object:

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

clf2 = DecisionTreeClassifier(criterion='entropy')
pipe = make_pipeline(column_trans, clf2)  # (1)
pipe.fit(X_train2, y_train2)
where:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

column_trans = make_column_transformer(
    (OneHotEncoder(), ['ShelveLoc', 'Urban', 'US']),
    remainder='passthrough')
Now, when I built the decision tree without the pipeline, using pandas directly for the categorical feature encoding, I was able to find suitable candidates for alpha to prune the decision tree via:
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
ccp_alphas = ccp_alphas[:-1]  # remove max value of alpha
whereas now, with my model baked into the pipe object in (1), when I try to find candidate alphas with
path = pipe.cost_complexity_pruning_path(X_train2, y_train2)
I get an error message saying that pipe has no attribute cost_complexity_pruning_path, and looking through the attributes available on pipe, I can't find it either.
Is it only possible to do cost complexity pruning if you build the model without the pipeline functionality in sklearn?
I have had a first crack at a workaround, although it's ugly and won't scale:
import numpy as np
from sklearn import metrics

alpha_candidates = np.arange(0.0, 0.5, 0.001).tolist()
alpha_accuracy_list = []

# Create a Decision Tree classifier for each candidate alpha and score it
for i in alpha_candidates:
    clf2_entropy_alpha = DecisionTreeClassifier(criterion='entropy', ccp_alpha=i, random_state=42)
    pipe = make_pipeline(column_trans, clf2_entropy_alpha)
    pipe.fit(X_train2, y_train2)
    y_pred2_entropy_alpha = pipe.predict(X_test2)
    alpha_accuracy = [i, metrics.accuracy_score(y_test2, y_pred2_entropy_alpha)]
    alpha_accuracy_list.append(alpha_accuracy)
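A less manual route, if I'm not missing something, would be to let GridSearchCV sweep ccp_alpha through the pipeline itself; make_pipeline names each step after its lowercased class name, which is where the decisiontreeclassifier__ccp_alpha key below comes from. A rough sketch, reusing column_trans, X_train2 and y_train2 from above:

import numpy as np
from sklearn.model_selection import GridSearchCV

# Address the tree's ccp_alpha via the auto-generated step name.
pipe = make_pipeline(column_trans, DecisionTreeClassifier(criterion='entropy', random_state=42))
param_grid = {'decisiontreeclassifier__ccp_alpha': np.arange(0.0, 0.5, 0.001)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train2, y_train2)
print(search.best_params_, search.best_score_)

This cross-validates each candidate alpha instead of scoring once on the held-out set, which would also avoid tuning against X_test2.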
Thoughts?
Answered by Codeman340 on January 15, 2021
Pipelines themselves don't generally carry the methods and attributes of the final estimator, aside from basics like predict, predict_proba, and transform. If you need to access a method of a step, you should access the step itself using one of:

pipe[-1]
pipe['decisiontreeclassifier']
pipe.named_steps['decisiontreeclassifier']
However, in this case it's a little trickier, because cost_complexity_pruning_path needs the dataset X, y, but you need your pipeline's transformer to apply to it first. It's a little cumbersome, but I think this should work and is relatively straightforward:

pipe[-1].cost_complexity_pruning_path(
    pipe[:-1].transform(X),
    y,
)

(Note that pipe[-1] is the final estimator in the pipeline, and pipe[:-1] is every step except the last.)
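Spelled out with the variable names from the question, and assuming the pipeline from (1) has already been fit so that the column transformer is fitted, a minimal sketch:

pipe.fit(X_train2, y_train2)

# Encode the training data with every step except the final estimator,
# then ask the tree itself for the pruning path.
X_train2_enc = pipe[:-1].transform(X_train2)
path = pipe[-1].cost_complexity_pruning_path(X_train2_enc, y_train2)
ccp_alphas = path.ccp_alphas[:-1]  # drop the largest alpha, as in the question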
Answered by Ben Reiniger on January 15, 2021