Understanding the Gini/AUC metric as an out-of-development performance metric

Data Science Asked on December 24, 2020

Assume we develop a model for a binary classification task that reaches a certain Gini/AUROC estimate, among other metrics, on the validation (or training) sample. This is a good overall metric, often used to evaluate the model's ability to separate the sample into, say, goods vs. bads.
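(For reference, the Gini coefficient used in this context is normally just a linear rescaling of the AUC, Gini = 2 · AUC − 1, so, for example, an AUC of 0.75 corresponds to a Gini of 0.5.)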

Further, assume this model is adequate and will be used to collect new samples with a certain cutoff value. What Gini/AUC estimates should we expect on the newly collected sample?

From what I’m noticing, the training sample contained clear cases that the model could distinguish and separate with large probabilities. On the other hand, with an applied cutoff of, say, <50%, the new sample will collect only those cases where no such clear separation is possible (because if it were possible, the case might not get collected). With such an approach, it seems logical to me that the overall separation in the new sample will be lower, resulting in a lower out-of-development-period Gini/AUC.
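A rough simulation sketch of this effect (entirely hypothetical: synthetic data from scikit-learn, a logistic model and an illustrative 50% cutoff, none of which come from a real scoring system) compares the AUC/Gini on the full out-of-sample data with the AUC/Gini on only the cases that would be collected below the cutoff:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the development data
X, y = make_classification(n_samples=20000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_te = model.predict_proba(X_te)[:, 1]

# AUC/Gini on the full out-of-sample data
auc_full = roc_auc_score(y_te, p_te)

# Simulate the collection rule: only cases scored below the cutoff are kept
cutoff = 0.5
collected = p_te < cutoff
auc_collected = roc_auc_score(y_te[collected], p_te[collected])

print(f"Full sample:      AUC = {auc_full:.3f}, Gini = {2 * auc_full - 1:.3f}")
print(f"Collected sample: AUC = {auc_collected:.3f}, Gini = {2 * auc_collected - 1:.3f}")

On synthetic data of this kind the collected-sample AUC/Gini typically comes out noticeably lower than the full-sample one, because the cutoff removes exactly the cases the model separates most confidently.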

Is this the expected behaviour in normal production environments? Am I understanding things correctly?

Note: I understand that there are other simple metrics, such as sensitivity/specificity, the Hosmer–Lemeshow test (hoslem.test) and others, which allow measuring and visualising True/False Positives. However, I have found that Gini/AUC is often a key metric when discussing and comparing classification models.

One Answer

The advantage of a train/test/validation dataset split is that you separate your dataset into:

  • The individuals for which you know the exogenous variables and the output: Training
  • The individuals for which you know the exogenous variables and the output (but you pretend you don't know the output): Test
  • The individuals for which you know the exogenous variables but not the output: Validation

Every DS or ML model is built so that it is prepared to receive a validation dataset in the future and to achieve metrics on it that are almost as good as on the training dataset.

The test dataset serves to simulate the situation of having data but no output; since you do in fact have the output, you can then measure the model's behaviour by comparing the modelled output against the real one.

So, for a concrete answer: the behaviour you should expect on the validation data (or the newly collected sample) is the same as on the test dataset, given that the underlying phenomenon and the sampling technique remain the same.
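As a minimal sketch of that expectation (again with illustrative synthetic data and scikit-learn assumed; neither the dataset nor the model comes from the question), the Gini to quote as the anticipated out-of-development figure is the one measured on the held-out test split rather than on the training split:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a development dataset
X, y = make_classification(n_samples=10000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Gini = 2 * AUC - 1 on the training split (optimistic) and on the held-out
# test split (the realistic expectation for future data from the same population)
gini_train = 2 * roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]) - 1
gini_test = 2 * roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]) - 1
print(f"Train Gini: {gini_train:.3f}   Test Gini: {gini_test:.3f}")

If the newly collected sample is then truncated by the cutoff, as described in the question, the figure observed in production can still drop below the test-split Gini even when the underlying population has not changed.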

For more information: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7

Answered by Juan Esteban de la Calle on December 24, 2020
