How to control for covariate shift in the test dataset compared to the training data for a regression task?

Question

I am working on a regression project, but I am facing covariate shift in the features due to a time delay. The test data was collected a year later, so the feature distributions have changed. The research paper I am working from also mentions this shift in feature values but says nothing about how to correct it.
I found this, but it talks about changing the training set distribution.
What I need is a way to make the test data distribution closer to the training data, or some other way to control for the shift.
I am using sklearn with Elastic Net for regression.

Brian Spiering · Answer

There are several options:

Change the training dataset

Use some of the test data as training data. This is the best option, since it better models the problem you are trying to solve.

Since the shift happens over time, train on only the most recent data.

Manually engineer features. If you know how the feature values in the test data differ, explicitly build that knowledge into the model.

Increase regularization. For scikit-learn's ElasticNet, increase the alpha hyperparameter. Unless your validation dataset is similar to your test data, though, you'll have no evidence that this improves the model.

Switch to an algorithm that generalizes better. A Random Forest, for example, might do better than regularized linear regression.
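Several of these options can be compared with one time-ordered validation split. The following is only a minimal sketch, not a definitive recipe. It assumes the rows are sorted by collection time, uses synthetic data with an artificial drift in the feature means, and treats the most recent 20% of rows as a stand-in for the shifted test data. The helper `validation_mse`, the alpha grid, and all data here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: rows are in time order and the feature means drift upward.
n = 600
drift = np.linspace(0.0, 2.0, n)[:, None]
X = rng.normal(size=(n, 5)) + drift
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=n)

# Time-ordered split: older rows train, the most recent rows validate.
# The validation block plays the role of the later, shifted test data.
split = int(0.8 * n)
X_train, y_train = X[:split], y[:split]
X_valid, y_valid = X[split:], y[split:]


def validation_mse(model, X_fit, y_fit):
    """Fit on the given rows and score on the recent validation block."""
    model.fit(X_fit, y_fit)
    return mean_squared_error(y_valid, model.predict(X_valid))


def elastic_net(alpha):
    """Standardize features, then fit ElasticNet with the given alpha."""
    return make_pipeline(StandardScaler(), ElasticNet(alpha=alpha, l1_ratio=0.5))


# Option: increase regularization, judged on the recent validation block.
alphas = [0.001, 0.01, 0.1, 1.0]
alpha_scores = {a: validation_mse(elastic_net(a), X_train, y_train) for a in alphas}
best_alpha = min(alpha_scores, key=alpha_scores.get)

# Option: train on only the most recent half of the training period.
recent = slice(split // 2, split)
recent_score = validation_mse(elastic_net(best_alpha), X_train[recent], y_train[recent])

# Option: switch algorithms.
rf_score = validation_mse(
    RandomForestRegressor(n_estimators=200, random_state=0), X_train, y_train
)

print("ElasticNet MSE by alpha:", alpha_scores)
print("Best alpha:", best_alpha)
print("ElasticNet on recent rows only, MSE:", recent_score)
print("Random Forest MSE:", rf_score)
```

Which option scores best depends entirely on the data, so the numbers from this synthetic drift say nothing about a real project. The point of the split is the caveat above: the comparison is only meaningful if the validation block resembles the test data.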
