Cross Validated

Asked by Atakan on November 29, 2021
I’m analyzing biological data where I’d like to see the impact of scaled gene expression on the classification of the sample. I binarized the response variable as 0 and 1 and used lasso with cross-validation. My goal is inference rather than prediction at this point.
I compared two different type.measure parameters (auc and class) in the cv.glmnet function and plotted the coefficient estimates in a scatter plot.
Correct me if I’m wrong, but the documentation suggests both methods can be suitable for binary classification problems. I’m trying to understand the differences I see here. When using the misclassification error (class) loss function, more variables shrink to zero than with the area under the ROC curve (auc). What is the reason for this different behavior?
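For reference, a minimal sketch of the setup, with simulated x and y only as stand-ins for the actual scaled expression matrix and binary label:

    library(glmnet)

    ## Placeholder data standing in for scaled expression values and a 0/1 label
    set.seed(1)
    n <- 200; p <- 500
    x <- matrix(rnorm(n * p), n, p)
    y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

    ## Cross-validated lasso under the two type.measure criteria
    cv_auc   <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
    cv_class <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

    ## Coefficient estimates at the penalty chosen by each criterion
    beta_auc   <- as.numeric(coef(cv_auc,   s = "lambda.min"))
    beta_class <- as.numeric(coef(cv_class, s = "lambda.min"))

    ## Scatter plot comparing the two sets of estimates
    plot(beta_auc, beta_class,
         xlab = "coefficients (type.measure = 'auc')",
         ylab = "coefficients (type.measure = 'class')")
    abline(0, 1, lty = 2)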
The main point is that accuracy is not really "suitable for binary classification problems" despite its frequent use as a criterion in model evaluation.
In an important sense there is no single "accuracy" measure, as it depends on the selection of a particular probability cutoff for assigning class membership. For binary classification this selection is often hidden from view (as it seems to be in cv.glmnet() when class is selected as the criterion) and set at a value of p = 0.5; that is, class membership is assigned to whichever class has the highest probability. That's only appropriate if you assign the same cost to false-positive and false-negative errors. Other relative costs would lead to different choices of the probability cutoff. See this recent page for an introduction and links to further discussion about selecting cutoffs.
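As a toy illustration of that cutoff dependence (made-up numbers, not from any real fit), the same vector of predicted probabilities gives different misclassification rates depending on where the threshold is placed:

    ## Toy example: "accuracy" depends on the chosen probability cutoff
    set.seed(2)
    p_hat <- runif(100)                 # stand-in predicted probabilities
    y_obs <- rbinom(100, 1, p_hat)      # stand-in observed 0/1 outcomes

    error_at <- function(cutoff) mean((p_hat > cutoff) != y_obs)
    sapply(c(0.3, 0.5, 0.7), error_at)  # three different "misclassification errors"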
So your sense expressed in a comment is correct: the difference is that AUC examines the whole range of potential false-positive versus false-negative tradeoffs, rather than the single choice imposed by the p = 0.5 class-assignment threshold. As this page discusses, auc is thus preferable to class as a criterion for comparing models, as you are effectively doing with cross-validation.
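One way to see the contrast: AUC needs no cutoff at all, since it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A sketch, reusing the toy p_hat and y_obs from above:

    ## AUC via its Mann-Whitney interpretation: no probability cutoff involved
    ## (ties between predicted probabilities are ignored in this sketch)
    auc_hat <- mean(outer(p_hat[y_obs == 1], p_hat[y_obs == 0], ">"))
    auc_hat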
This answer describes how the best way to evaluate such models is with a proper scoring rule, which is optimized when you have identified the correct probability model. The deviance criterion in cv.glmnet() (the default for logistic regression) is equivalent to a strictly proper log-loss scoring rule. That may be a bit more sensitive than auc for distinguishing among models; see this page.
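A sketch of the corresponding call, assuming the same x and y as in the question's setup above (deviance is the default for family = "binomial", so type.measure could be omitted):

    ## Binomial deviance (log-loss), the default criterion for logistic lasso
    cv_dev <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")
    plot(cv_dev)                                     # CV deviance over the penalty path
    nz <- as.numeric(coef(cv_dev, s = "lambda.1se"))[-1]
    sum(nz != 0)                                     # genes retained, excluding intercept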
I can't say with certainty why the class criterion retains fewer genes in the final model than auc does. I suspect that's because the class criterion is less sensitive for distinguishing among models, which is what you're doing when you try to minimize over a range of penalty values, so it ends up with larger weights on fewer predictors. But that's an intuitive heuristic argument with no formal basis.
A final note: inference following LASSO is not straightforward. See this page for some discussion. With gene-expression data you typically have a large number of correlated potential predictors, among which LASSO will make choices that can be very data-dependent. So even if you calculate p-values and CIs properly, that doesn't mean you have identified "the most important" genes for the classification, just a particular set that is justifiable. Try repeating the modeling on multiple bootstrapped samples of the data to gauge how stable the gene-selection process is.
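One way to gauge that stability, sketched here with the same placeholder x and y as above, is to refit the cross-validated lasso on bootstrap resamples and record how often each gene is selected:

    ## Selection-frequency sketch: how often does each gene survive the lasso
    ## across bootstrap resamples? (illustration only; slow for large n_boot)
    n_boot <- 50
    selected <- matrix(0, nrow = n_boot, ncol = ncol(x))
    for (b in seq_len(n_boot)) {
      idx <- sample(nrow(x), replace = TRUE)
      fit <- cv.glmnet(x[idx, ], y[idx], family = "binomial")
      selected[b, ] <- as.numeric(coef(fit, s = "lambda.min"))[-1] != 0
    }
    selection_freq <- colMeans(selected)             # per-gene selection frequency
    head(sort(selection_freq, decreasing = TRUE), 10)

Genes selected in nearly every resample are relatively stable choices; genes that appear only occasionally are the data-dependent picks cautioned about above.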
Answered by EdM on November 29, 2021