
Why does my model produce too good to be true output?

Data Science Asked on December 21, 2020

I am trying to run a binary classification problem on people with diabetes and non-diabetes.

For labeling my dataset, I followed a simple rule: if a person has T2DM in his medical records, we label him as a positive case (diabetes), and if he doesn't have T2DM, we label him as Non-T2DM.

Since there are a lot of data points for each subject (a lot of lab measurements, drugs taken, diagnoses recorded, etc.), I end up with 1370 features for each patient.

My training set has 2475 patients and my testing set has 2475 patients. (I already tried a 70:30 split; now I am trying 50:50, with the same result as 70:30.)
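For reference, the split is a plain stratified split (a minimal sketch with scikit-learn; X and y stand for my feature matrix and labels):

    from sklearn.model_selection import train_test_split

    # X: (4950, 1370) feature matrix, y: binary T2DM labels (placeholders).
    # stratify=y keeps the class ratio the same in both halves.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=42
    )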

My results are too good to be true, as shown below.

[Screenshots of the classification results showing near-perfect scores]

Should I reduce the number of features? Is this overfitting? Should I retain only the top features, like the top 20 or top 10? Can you help me understand why this is happening?

Detailed Update

We check for the presence of T2DM by a set of diagnosis codes (ICD-9/ICD-10 codes for T2DM and its complications, etc.). For example, if a patient has an ICD-9 code of 250 in his diagnosis records, we know that he was diagnosed with Type 2 diabetes. Let's not worry about the accuracy of this labeling approach at this point. Similarly, we label all the other patients as T2DM or Non-T2DM.

But when we extract features, all of a patient's medical records are treated as features: the frequency of each drug/condition/lab test is used as a feature value. So the same diagnosis code (250) will be an input feature as well. Does that mean I should drop the diagnosis codes that were used to label the dataset from being used as features? But those are really very good features that can help me find out whether a patient is diagnosed with T2DM or not (when I apply my model on a totally different dataset). My objective is not to find out whether a patient will develop diabetes in the future; my objective is only to find out whether a patient is diabetic or not (from his records). So I label my dataset with an imperfect heuristic as stated above and build a model. Once I build this model, I would like to validate it at another site and find out how good a model built with this imperfect heuristic is at identifying whether a patient is diabetic or not. Hope this helps.
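To make the heuristic concrete, it boils down to something like the sketch below (the prefix list and the diagnosis_codes input are simplified placeholders; the real list also covers complication codes):

    # Minimal sketch of the labeling heuristic.
    # ICD-9 250.x and ICD-10 E11.x are Type 2 diabetes codes.
    T2DM_PREFIXES = ("250", "E11")

    def label_patient(diagnosis_codes):
        """diagnosis_codes: iterable of ICD code strings from one patient's record."""
        return int(any(code.startswith(T2DM_PREFIXES) for code in diagnosis_codes))

    label_patient(["401.9", "250.00"])  # -> 1 (T2DM)
    label_patient(["401.9", "272.4"])   # -> 0 (Non-T2DM)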

5 Answers

Assuming that these results are obtained on a valid test set with no data leakage, they don't show overfitting, because overfitting would cause great performance on the training set but significantly lower performance on the test set.
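A quick way to check is to compare the two scores directly (a minimal sketch, assuming a fitted scikit-learn classifier clf and the usual train/test arrays):

    from sklearn.metrics import f1_score

    # Overfitting shows up as a large gap between these two numbers;
    # two near-identical high scores point elsewhere (leakage, easy task).
    print("train F1:", f1_score(y_train, clf.predict(X_train)))
    print("test F1: ", f1_score(y_test, clf.predict(X_test)))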

Make sure that the instances in your training and test sets are truly distinct: there might be some data leakage, for example if there are duplicate records for the same patient.
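For instance, with the two splits as DataFrames (train_df/test_df and the patient_id column are assumed names), the overlap can be checked like this:

    import pandas as pd

    # Patients appearing in both splits are a direct source of leakage.
    overlap = set(train_df["patient_id"]) & set(test_df["patient_id"])
    print(f"{len(overlap)} patients appear in both train and test")

    # Fully duplicated rows across the two sets are suspicious too.
    n_dupes = pd.concat([train_df, test_df]).duplicated().sum()
    print(f"{n_dupes} duplicated records overall")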

Another potential flaw is the gold standard labeling: if the patient has T2DM in their medical record, it means that they are already diagnosed, right? And since the features are also based on the medical record, it's likely that the record contains direct indications of this diagnosis (for example in the drugs prescribed to the patient). There are two interpretations:

  • either the task is purposefully defined by this T2DM label, in which case you can just enjoy the great performance, but it's not technically about detecting diabetes in general;
  • or the goal is to detect patients with diabetes, including those who are not diagnosed yet, but then it's likely that your gold standard is incorrect for this task.

[edit following updated question]

Your update clarifies which exact task you're targeting, and it corresponds exactly to my first interpretation above: given that your goal is to predict which instances satisfy this T2DM criterion and that the features contain direct indications about it:

  • I think you're right to keep these features: if very useful information is available, it would be absurd not to exploit it (assuming it's also available in the same form in any future dataset you plan to use, of course).
  • The very high performance you obtain makes perfect sense for this task; it's not a bug. It simply happens that the task is easy, so the system is able to predict the label very well.

However, this also means that you could do even better without any ML: currently the ML method gives you around 99% F-score because it doesn't perfectly represent the criterion used for the gold standard. But since the gold standard label is based entirely on the features, the most direct way to "predict" the label is to apply the criterion itself. There's no point using a complex ML method to predict information that you can obtain more accurately from the same data with a deterministic method.
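In other words, applying the criterion itself already scores perfectly (a sketch, reusing the hypothetical rule from the question; patient_diagnosis_codes and y_true are placeholders):

    from sklearn.metrics import f1_score

    T2DM_PREFIXES = ("250", "E11")  # same hypothetical rule used for labeling

    def rule_predict(diagnosis_codes):
        return int(any(c.startswith(T2DM_PREFIXES) for c in diagnosis_codes))

    # patient_diagnosis_codes: list of per-patient ICD code lists.
    y_pred = [rule_predict(codes) for codes in patient_diagnosis_codes]

    # y_true was generated by this very rule, so the score is 1.0 by
    # construction -- no ML model can beat the rule at predicting itself.
    print(f1_score(y_true, y_pred))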

Correct answer by Erwan on December 21, 2020

It pretty much looks like overfitting. It would also be interesting to know which algorithm you used: some are really sensitive to a low number of instances relative to the number of features, and you have almost as many features as instances.

Try first checking the correlation between features and reducing their number with PCA or another method before fitting your model again.
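A sketch of both steps (assuming X is the raw feature matrix as a NumPy array):

    import numpy as np
    from sklearn.decomposition import PCA

    # Count strongly correlated feature pairs (upper triangle only).
    corr = np.corrcoef(X, rowvar=False)
    n_high = int(np.triu(np.abs(corr) > 0.95, k=1).sum())
    print(f"{n_high} feature pairs with |r| > 0.95")

    # Keep enough principal components to retain 95% of the variance.
    X_reduced = PCA(n_components=0.95).fit_transform(X)
    print("reduced shape:", X_reduced.shape)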

Answered by Mario Tormo on December 21, 2020

This might be a case of data leakage. With 1370 features, 2475 records is a very small dataset for such an extreme result.

Please try the following (a sketch follows this list):

  • Inspect the misclassified records.
  • Remove the T2DM feature and note the dip in performance.
  • Repeat the last step for all the features. You should observe a negligible dip for most features and a very large dip for any feature that is causing the leakage.
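A rough sketch of that ablation loop (clf is an unfitted scikit-learn estimator and X_train/X_test are DataFrames; all names assumed, and note that one refit per feature is slow with 1370 features):

    from sklearn.base import clone
    from sklearn.metrics import f1_score

    baseline = f1_score(y_test, clone(clf).fit(X_train, y_train).predict(X_test))

    # Drop each feature in turn, refit, and record the dip in test F1.
    dips = {}
    for col in X_train.columns:
        model = clone(clf).fit(X_train.drop(columns=col), y_train)
        dips[col] = baseline - f1_score(y_test, model.predict(X_test.drop(columns=col)))

    # A leaking feature shows a dramatically larger dip than the rest.
    for col, dip in sorted(dips.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{col}: {dip:+.3f}")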

Answered by 10xAI on December 21, 2020

It sounds like the system can simply learn your labeling algorithm. If that is intended, then you can just apply that rule and throw away all the ML. If you want to predict, for example, the diagnosis icd9=250, then of course there is no point in including icd9 as a feature. Alternatively, if there is a history, you can use the record just before the diagnosis of diabetes as the training example. You said you didn't want to predict whether a patient will be diabetic in the future, but you do want to predict whether someone is diabetic right now even if not diagnosed, right?
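One way to build such examples is to censor each patient's record at the first diabetes diagnosis (a sketch over a long-format events table; the patient_id, code, and date column names are assumptions):

    import pandas as pd

    # records: one row per medical event. Keep only events strictly before
    # each patient's first T2DM diagnosis, so features cannot encode the label.
    records["date"] = pd.to_datetime(records["date"])
    is_t2dm = records["code"].str.startswith("250") | records["code"].str.startswith("E11")
    first_dx = records[is_t2dm].groupby("patient_id")["date"].min().rename("first_dx")
    records = records.join(first_dx, on="patient_id")
    censored = records[records["first_dx"].isna() | (records["date"] < records["first_dx"])]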

Answered by kutschkem on December 21, 2020

The best approach is to use a model like a decision tree to see what is actually happening. Maybe there are a couple of features that correlate strongly with the label, and the rest of the 1000+ features don't matter at all. It is possible (as someone else also pointed out) that one of the features hiding in there (an ICD field with a certain response code) has a direct bearing on the output label.
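For example, a shallow tree's feature importances make any dominant feature obvious (a sketch; feature_names is assumed to hold the 1370 column names):

    from sklearn.tree import DecisionTreeClassifier

    # If one or two features (e.g. the count of diagnosis code 250) decide
    # the label, they will dominate the first splits of a shallow tree.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    top = sorted(zip(feature_names, tree.feature_importances_), key=lambda kv: -kv[1])
    for name, imp in top[:10]:
        print(f"{name}: {imp:.3f}")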

Answered by Allohvk on December 21, 2020
