
Dropping features after final evaluation on test data

Data Science Asked on August 8, 2021

Would you please let me know if I am committing a statistical or machine learning malpractice with this procedure?

I want to estimate a meteorological variable $y_1$ from variables $\{x_1, \dots, x_{10}\}$. I use data from different weather stations. I keep some weather stations aside as test sites/data.

I do feature selection and hyperparameter tuning with cross-validation on the training data. My models are Random Forest (RF) and two other tree-based models.

Before evaluating my models on the test sites, I was skeptical about keeping one of the features: the elevation of the weather station, $x_{10}$. This is a static feature that takes the same value in every row of data for a given station. Knowing a tiny bit about RF made me worried that the model would use it as a kind of "site_id" and possibly overfit to this feature. It wouldn't worry me if I were using linear/nonlinear regression models.

So I train my models once with and once without $x_{10}$ as a feature.

Then I evaluate my models on the test sites, and it turns out that the models without $x_{10}$ do significantly better there.
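
A minimal sketch of what I do (the file name, column names, and station IDs below are made up for illustration, not my actual data):

```python
# Compare RF trained with and without x10 (elevation), evaluating on
# held-out stations. "stations.csv", "station_id", "y1" and the station
# IDs are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("stations.csv")                    # hypothetical file
test_stations = ["S07", "S12"]                      # held-out test sites (assumed)
is_test = df["station_id"].isin(test_stations)
train, test = df[~is_test], df[is_test]

features_all = [f"x{i}" for i in range(1, 11)]      # x1 .. x10 (x10 = elevation)
for feats in (features_all, features_all[:-1]):     # with and without x10
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(train[feats], train["y1"])
    rmse = np.sqrt(mean_squared_error(test["y1"], rf.predict(test[feats])))
    print(f"{len(feats)} features -> test RMSE: {rmse:.3f}")
```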

Even before testing this hypothesis about the static feature, I wanted to run similar tests dropping other features as well, say $x_9$.

Now my question: now that I know $x_{10}$ hurts my model, I would like to retrain my models without $x_{10}$ and then test performance with and without $x_9$ in the reduced feature set.

To me it seems like I'm using my test data to filter my features, so it doesn't feel right.
But then, I have this information, and if $x_{10}$ is hurting my models in the end, why should I go on testing hypotheses and building my models with $x_{10}$ in them?

2 Answers

What you're doing is manual feature selection based on the test set. You're right that it's not correct to proceed this way: in theory, feature selection should be done using only the training set and a validation set, never the final test set. The risk is data leakage: you're modifying the model using information from the test set. Maybe the performance is better without these features simply because they happen, by chance, to be bad for this particular test set. As a result the model could be overfit, and you wouldn't be able to detect the problem on this test set, since it's the source of the overfitting.

So in principle it's always better to separate the data first, keep the test set aside until the final evaluation and use a validation set for intermediate evaluation until the final model (including the set of features) is determined.

In practice, it sometimes happens that we realize we should have done something differently after applying the model to the final test set. It's a mistake, but it's not the end of the world; usually the risk of bias is low. As you said, we obviously don't have to ignore information that is important for the performance of the model. However, if you know that you're going to repeat this procedure with several features, you should definitely do it using a separate validation set (taken from the training data), not the test set: the more you use the test set like this, the higher the risk of data leakage and bias.
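
For example, a minimal sketch of this workflow, assuming the data sit in one table with a station_id column (file name, columns, and parameters are illustrative assumptions): compare candidate feature sets on validation stations carved out of the training data, keeping whole stations together, and touch the test sites only once at the end.

```python
# Feature selection on a validation split of the *training* stations;
# the test sites stay untouched until the final evaluation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("stations.csv")                    # hypothetical file
test_stations = ["S07", "S12"]                      # final test sites, kept aside
train = df[~df["station_id"].isin(test_stations)]

features_all = [f"x{i}" for i in range(1, 11)]
# GroupShuffleSplit keeps all rows of a station on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr_idx, val_idx = next(splitter.split(train, groups=train["station_id"]))
tr, val = train.iloc[tr_idx], train.iloc[val_idx]

candidates = {
    "all": features_all,
    "no_x10": [f for f in features_all if f != "x10"],
    "no_x9_x10": [f for f in features_all if f not in ("x9", "x10")],
}
scores = {}
for name, feats in candidates.items():
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(tr[feats], tr["y1"])
    scores[name] = np.sqrt(mean_squared_error(val["y1"], rf.predict(val[feats])))

best = min(scores, key=scores.get)  # chosen on validation stations only
# retrain on all training stations with `best`, then evaluate once on the test sites
```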

Correct answer by Erwan on August 8, 2021

This is a good example of what is usually called "data leakage": you are bleeding information from your test set back into your model. A certain amount of this is inevitable, and it's why, especially for deep-learning problems, data scientists often split data sets into training, validation, and holdout sets. The validation set is used to do the kind of parsimonious model tuning you're talking about, before assessing accuracy on a final holdout set.

Incidentally, it sounds like you're right that elevation might be a feature the model overfits to. One way to address that, without entirely losing the information in the actual elevation, is to transform the feature into bins, so that each bin will likely contain observations from multiple stations; see the sketch below.
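
For instance, a rough sketch with pandas, assuming elevation is the column x10 in metres; the band edges are made up for illustration:

```python
# Replace raw elevation with coarse elevation bands so the forest
# can't memorize individual stations via their exact elevation.
import pandas as pd

df = pd.read_csv("stations.csv")                    # hypothetical file
edges = [0, 250, 500, 1000, 2000, 5000]             # metres (assumed range)
df["elev_band"] = pd.cut(df["x10"], bins=edges, labels=False)
# Train on "elev_band" instead of "x10": many stations now share a band,
# so elevation no longer works as a unique site identifier.
```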

Answered by BAustin on August 8, 2021
