Cross Validated question, asked by dimitriy on November 16, 2021
I have a dataset that contains estimated posterior membership probabilities for an unordered, 5-level categorical variable. These probabilities sum to unity. The data look like this:
id  pred1      pred2      pred3      pred4      pred5
 1  .859171    .03846882  .03101809  .00917879  .06216329
 2  .36325043  .4150897   .07394462  .01415896  .13355629
 3  .15642941  .70247179  .08668463  .01269947  .0417147
 4  .95966166  .00182377  .00850792  .01247164  .01753501
 5  .93548764  .02308354  .00776801  .00950647  .02415434
 6  .98533042  .00024188  .0036747   .00361082  .00714217
 7  .73908375  .1286123   .13230396  0          0
 8  .64286514  .0447184   .26215937  .02530076  .02495633
 9  .90968365  .0175316   .05905972  .00243294  .01129209
10  .95854473  .01169418  .00776546  .01392006  .00807557
Here observation 1 probably belongs to class k = 1, while observation 2 is split more evenly between classes 1 and 2. Most of the observations come from class 1.
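In case a reproducible version helps, here is the sample above as a small R data frame (phats_example is just a name for this snippet; the real data have roughly 100K rows plus the true outcome):

phats_example <- data.frame(
  id    = 1:10,
  pred1 = c(.859171, .36325043, .15642941, .95966166, .93548764,
            .98533042, .73908375, .64286514, .90968365, .95854473),
  pred2 = c(.03846882, .4150897, .70247179, .00182377, .02308354,
            .00024188, .1286123, .0447184, .0175316, .01169418),
  pred3 = c(.03101809, .07394462, .08668463, .00850792, .00776801,
            .0036747, .13230396, .26215937, .05905972, .00776546),
  pred4 = c(.00917879, .01415896, .01269947, .01247164, .00950647,
            .00361082, 0, .02530076, .00243294, .01392006),
  pred5 = c(.06216329, .13355629, .0417147, .01753501, .02415434,
            .00714217, 0, .02495633, .01129209, .00807557)
)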
I also have true class labels, not used to fit the model, for about 100K observations.
I would like to develop a set of reliable and interpretable threshold rules that assign unique category membership given pred1-pred5 and that are less prone to misclassification than simply going with the highest posterior probability estimate. I care more about correctly distinguishing group 2 from group 1 (both type I and type II error rates) than about other types of misclassification, and of the two I suppose I care more about the type I error.
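For reference, the baseline I want to improve on is the max-posterior rule. A minimal sketch of it in R, assuming the posteriors and the held-out true labels sit together in a data frame phats with a factor column outcome (the same objects used in the rpart code further down), and that the level order of outcome matches pred1-pred5:

# baseline rule: assign each observation to the class with the largest posterior
pred_cols <- paste0("pred", 1:5)
phats$max_post <- factor(max.col(phats[, pred_cols], ties.method = "first"),
                         levels = 1:5,
                         labels = levels(phats$outcome))

# confusion matrix on the held-out labels (rows = actual, columns = predicted)
table(actual = phats$outcome, predicted = phats$max_post)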
I am struggling with how to translate these preferences into a misclassification cost matrix for a classification tree to use. I have tried something like the following (the code below does not match the fake data above in terms of the ordering of the groups):
library(rpart)

# rows are actual values, columns are predicted values
# type II penalties below the diagonal, type I penalties above
MCM <- matrix(c(0, 1, 1, 1, 1,
                2, 0, 2, 2, 3,
                1, 1, 0, 1, 1,
                1, 1, 1, 0, 1,
                1, 4, 1, 1, 0),
              byrow = TRUE,
              nrow = 5,
              dimnames = list(sort(unique(phats$outcome)),
                              sort(unique(phats$outcome))))

# grow the tree using the information splitting rule and the loss matrix
fit <- rpart(outcome ~ pred1 + pred2 + pred3 + pred4 + pred5,
             data = phats,
             method = "class",
             parms = list(split = "information", loss = MCM))

# prune by picking the CP value that minimizes the cross-validated error
pfit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
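To judge whether the cost-weighted tree actually helps, I compare it to the max-posterior baseline on the held-out labels. A rough sketch under the same assumptions as above (in particular, that the factor levels of outcome line up with the rows and columns of MCM):

# class predictions from the pruned, cost-weighted tree
tree_pred <- predict(pfit, newdata = phats, type = "class")

# confusion matrices: rows = actual, columns = predicted
cm_tree <- table(actual = phats$outcome, predicted = tree_pred)
cm_base <- table(actual = phats$outcome, predicted = phats$max_post)

# average misclassification cost per observation under the loss matrix MCM
mean_cost <- function(cm, loss) sum(cm * loss) / sum(cm)
mean_cost(cm_tree, MCM)
mean_cost(cm_base, MCM)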