Data Science Asked on August 13, 2020
I am facing a binary classification task that is both slightly imbalanced and cost-sensitive. I wonder what the best way is to take both considerations into account, and where in the modeling pipeline (say, in sklearn) to do so.
Class proportions: positive 25%, negative 75%. This could be addressed with sklearn.utils.class_weight.compute_class_weight:
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
OK, but this only rebalances the class proportions; I should take misclassification cost into account as well. Let's say a false negative is 10 times as costly as a false positive, so I guess I should still increase the weights in class_weights accordingly by upweighting positives by a further factor of 10, right?
But there is another point in the pipeline where I could take care of this, namely the evaluation metric, for example F-beta with recall upweighted (F2, for instance). Does it have the same effect? Should I pick one method (F-beta for evaluation OR upweighting classes) or use both simultaneously?
Additionally, in case I upweight my classes with compute_class_weight(), I assume that no further class distribution should be taken into consideration downstream (so when I use RandomForestClassifier(), the class_weight hyperparameter shouldn't be ='balanced' again), because this would further distort the weight proportions that were already set before. Is this correct?
Additionally, in case I upweight my classes with compute_class_weight(), I assume that no further class distribution should be taken into consideration downstream (so when I use RandomForestClassifier(), the class_weight hyperparameter shouldn't be ='balanced' again), because this would further distort the weight proportions that were already set before. Is this correct?
compute_class_weight is only a helper: it returns an array of weights, one per class, which you can turn into a {class: weight} dictionary.
The place to use that dictionary is the class_weight parameter of the model, e.g. RandomForestClassifier.
So you have to pass the helper's output to the model yourself.
Simpler still, just set the parameter to "balanced", or pass explicit values as a dictionary, e.g. {0: 1, 1: 5}. (The list-of-dicts form is meant for multi-output problems, with one dictionary per output column.)
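To make the hand-off concrete, here is a minimal sketch of passing the helper's output to the model. The toy labels (75 negatives, 25 positives) and the random placeholder features are assumptions made up for illustration, not data from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy data: 75 negatives (0) and 25 positives (1), random features.
y = np.array([0] * 75 + [1] * 25)
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))

# The helper returns one weight per class, in the order given by `classes`.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Convert the array into the {class: weight} dictionary the model expects.
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# The weights only take effect once the model receives them.
clf = RandomForestClassifier(n_estimators=50, class_weight=class_weight, random_state=0)
clf.fit(X, y)
```

With these labels the "balanced" heuristic, n_samples / (n_classes * class_count), gives roughly 0.67 for the negative class and 2.0 for the positive class, which is what class_weight="balanced" would compute internally from the same y.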
OK, but this only rebalances the class proportions; I should take misclassification cost into account as well. Let's say a false negative is 10 times as costly as a false positive, so I guess I should still increase the weights in class_weights accordingly by upweighting positives by a further factor of 10, right?
Calling the helper and getting the weights has no effect by itself. Nothing happens until the model receives them through its class_weight parameter.
The model then scales the misclassification cost of each class by its weight, so errors on the upweighted minority class are penalized more heavily. The low-level details differ from model to model, e.g. neural network vs. decision tree.
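One possible way to fold the 10x false-negative cost from the question into the weights is sketched below. The constant FN_COST_RATIO and the toy data are assumptions for illustration, and whether to stack the cost factor on top of the balancing weights is a modeling choice rather than a rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Assumed cost ratio: a false negative is 10 times as costly as a false positive.
FN_COST_RATIO = 10

# Assumed toy data: 75 negatives (0) and 25 positives (1), random features.
y = np.array([0] * 75 + [1] * 25)
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 3))

# Start from the balancing weights ...
classes = np.unique(y)
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
cost_weight = {int(c): float(w) for c, w in zip(classes, balanced)}

# ... then scale the positive class by the assumed cost ratio.
cost_weight[1] *= FN_COST_RATIO

# Pass the explicit dictionary; class_weight takes either this or "balanced", not both.
model = RandomForestClassifier(n_estimators=50, class_weight=cost_weight, random_state=0)
model.fit(X, y)
```

Since class_weight accepts a single value, passing the explicit dictionary means the balancing is not applied a second time by the model.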
But there is another point in the pipeline where I could take care of this, namely in the evaluation metrics with F-beta for example with upweighted recall (F2, for instance). Does it have the same effect? Should I pick one method (F-beta for evaluation OR upweighting classes) or both of them simultaneously?
A change of scoring metric is still needed because, whatever the class weights, the class counts remain unbalanced, e.g. 100:1. A model that gets 90 of the majority samples right and misses the minority sample still shows an overall accuracy of about 90%, which can be deceiving.
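A small sketch of why accuracy misleads here and how F2 reacts instead. The 100:1 labels and the all-negative predictions are assumed toy values, and f2_scorer is just an illustrative name.

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer

# Assumed toy labels with a 100:1 imbalance.
y_true = np.array([0] * 100 + [1])

# A useless "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy looks excellent even though the positive sample is missed.
acc = accuracy_score(y_true, y_pred)

# F2 weights recall more heavily than precision and exposes the failure.
f2 = fbeta_score(y_true, y_pred, beta=2, zero_division=0)

# The same metric wrapped as a scorer, usable as scoring=f2_scorer
# in cross_val_score or GridSearchCV.
f2_scorer = make_scorer(fbeta_score, beta=2)
```

Class weights change how the model is trained, while the metric changes how it is judged, so the two are complementary rather than interchangeable.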
Answered by 10xAI on August 13, 2020