Cross Validated question, asked by dimitriy on November 16, 2021
I have a dataset that contains estimated posterior membership probabilities for an unordered, 5-level categorical variable. These probabilities sum to unity. The data look like this:
id  pred1      pred2      pred3      pred4      pred5
 1  .859171    .03846882  .03101809  .00917879  .06216329
 2  .36325043  .4150897   .07394462  .01415896  .13355629
 3  .15642941  .70247179  .08668463  .01269947  .0417147
 4  .95966166  .00182377  .00850792  .01247164  .01753501
 5  .93548764  .02308354  .00776801  .00950647  .02415434
 6  .98533042  .00024188  .0036747   .00361082  .00714217
 7  .73908375  .1286123   .13230396  0          0
 8  .64286514  .0447184   .26215937  .02530076  .02495633
 9  .90968365  .0175316   .05905972  .00243294  .01129209
10  .95854473  .01169418  .00776546  .01392006  .00807557
Here observation 1 probably belongs to class k = 1, while observation 2 is split more evenly between classes 1 and 2. Most of the observations come from class 1.
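In case a reproducible version helps, here is the sample above as a small R data frame (phats_example is just a name for this snippet; the real data have roughly 100K rows plus the true outcome):

phats_example <- data.frame(
  id    = 1:10,
  pred1 = c(.859171, .36325043, .15642941, .95966166, .93548764,
            .98533042, .73908375, .64286514, .90968365, .95854473),
  pred2 = c(.03846882, .4150897, .70247179, .00182377, .02308354,
            .00024188, .1286123, .0447184, .0175316, .01169418),
  pred3 = c(.03101809, .07394462, .08668463, .00850792, .00776801,
            .0036747, .13230396, .26215937, .05905972, .00776546),
  pred4 = c(.00917879, .01415896, .01269947, .01247164, .00950647,
            .00361082, 0, .02530076, .00243294, .01392006),
  pred5 = c(.06216329, .13355629, .0417147, .01753501, .02415434,
            .00714217, 0, .02495633, .01129209, .00807557)
)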
I also have true class labels, not used to fit the model, for about 100K observations.
I would like to develop a set of reliable and interpretable threshold rules that assign unique category membership given pred1-pred5 and that are less prone to misclassification than simply going with the highest posterior probability estimate. I care more about correctly distinguishing group 2 from group 1 (both type I and type II error rates) than about other types of misclassification, and of the two I suppose I care more about the type I error.
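For reference, the baseline I want to improve on is the max-posterior rule. A minimal sketch of it in R, assuming the posteriors and the held-out true labels sit together in a data frame phats with a factor column outcome (the same objects used in the rpart code further down), and that the level order of outcome matches pred1-pred5:

# baseline rule: assign each observation to the class with the largest posterior
pred_cols <- paste0("pred", 1:5)
phats$max_post <- factor(max.col(phats[, pred_cols], ties.method = "first"),
                         levels = 1:5,
                         labels = levels(phats$outcome))

# confusion matrix on the held-out labels (rows = actual, columns = predicted)
table(actual = phats$outcome, predicted = phats$max_post)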
I am struggling with how to translate these preferences into a misclassification cost matrix for a classification tree to use. I have tried something like the following (the code below does not match the fake data above in terms of the ordering of the groups):
library(rpart)

# rows are actual values, columns are predicted values
# type II penalties below the diagonal, type I penalties above
MCM <- matrix(c(0, 1, 1, 1, 1,
                2, 0, 2, 2, 3,
                1, 1, 0, 1, 1,
                1, 1, 1, 0, 1,
                1, 4, 1, 1, 0),
              byrow = TRUE,
              nrow = 5,
              dimnames = list(sort(unique(phats$outcome)),
                              sort(unique(phats$outcome))))

# grow the tree using the information splitting rule and the loss matrix
fit <- rpart(outcome ~ pred1 + pred2 + pred3 + pred4 + pred5,
             data = phats,
             method = "class",
             parms = list(split = "information", loss = MCM))

# prune by picking the CP value that minimizes the cross-validated error
pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
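To judge whether the cost-weighted tree actually helps, I compare it to the max-posterior baseline on the held-out labels. A rough sketch under the same assumptions as above (in particular, that the factor levels of outcome line up with the rows and columns of MCM):

# class predictions from the pruned, cost-weighted tree
tree_pred <- predict(pfit, newdata = phats, type = "class")

# confusion matrices: rows = actual, columns = predicted
cm_tree <- table(actual = phats$outcome, predicted = tree_pred)
cm_base <- table(actual = phats$outcome, predicted = phats$max_post)

# average misclassification cost per observation under the loss matrix MCM
mean_cost <- function(cm, loss) sum(cm * loss) / sum(cm)
mean_cost(cm_tree, MCM)
mean_cost(cm_base, MCM)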