TransWikia.com

How to control for covariate shift in the test set relative to the training data in a regression task?

Data Science Asked on May 29, 2021

I am working on a regression project, but I am facing covariate shift in the features due to a time delay. The test data was collected a year later, so the feature distributions have changed. The research paper I am working from also mentions this shift in feature values but does not say how to correct for it.

I found this, but it talks about changing the training set distribution.
What I need is a way to bring the test data distribution closer to the training data, or some other way to control for the shift.

I am using scikit-learn's ElasticNet for regression.

One Answer

There are several options:

  1. Change the training dataset

    1. Use some of the test data as training data. This is the best option, since it better represents the problem you are trying to solve.

    2. Since the shift happens over time, train only on the most recent data.

  2. Manually engineer features. If you have knowledge of how the test data feature values are different, then explicitly put that knowledge in the model.

  3. Increase regularization. For scikit-learn's ElasticNet, increase the alpha hyperparameter. Note that unless your validation dataset resembles your test data, you will have no evidence that this improves the model.

  4. Switch to an algorithm that generalizes better. A random forest, for example, might do better than regularized linear regression.
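
As a rough illustration of options 1.2 and 3, here is a minimal sketch that holds out the most recent rows as a validation set (a stand-in for the shifted test data) and compares a few alpha values, with and without restricting training to recent rows. The synthetic data, the helper names `make_drifting_data` and `evaluate_alphas`, the alpha grid, and the split fractions are all assumptions made for the example, not part of the original question.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_drifting_data(n_samples=600, seed=0):
    """Synthetic, time-ordered data whose feature means drift (illustration only)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_samples)
    # Every feature mean grows with time, mimicking covariate shift.
    X = rng.normal(size=(n_samples, 5)) + 2.0 * t[:, None]
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)
    return X, y


def evaluate_alphas(X, y, alphas, holdout_fraction=0.2, recent_fraction=None):
    """Fit ElasticNet for each alpha and score it on the most recent rows.

    Rows are assumed to be sorted from oldest to newest. If recent_fraction
    is given, only that most recent share of the training rows is used.
    """
    n_holdout = int(len(X) * holdout_fraction)
    X_train, y_train = X[:-n_holdout], y[:-n_holdout]
    X_val, y_val = X[-n_holdout:], y[-n_holdout:]

    if recent_fraction is not None:
        n_recent = max(1, int(len(X_train) * recent_fraction))
        X_train, y_train = X_train[-n_recent:], y_train[-n_recent:]

    scores = {}
    for alpha in alphas:
        # Scaling is fitted on the training rows only, then applied to validation.
        model = make_pipeline(StandardScaler(), ElasticNet(alpha=alpha, l1_ratio=0.5))
        model.fit(X_train, y_train)
        scores[alpha] = mean_squared_error(y_val, model.predict(X_val))
    return scores


X, y = make_drifting_data()
alphas = [0.01, 0.1, 1.0]
scores_all = evaluate_alphas(X, y, alphas)                          # all older rows
scores_recent = evaluate_alphas(X, y, alphas, recent_fraction=0.5)  # recent half only

for alpha in alphas:
    print(f"alpha={alpha}: all={scores_all[alpha]:.4f}, recent={scores_recent[alpha]:.4f}")
```

Which setting wins depends entirely on the data, so treat the printed errors as a comparison tool rather than a conclusion. The point of the time-ordered holdout is that the validation rows resemble the later test data more closely than a random split would.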

Answered by Brian Spiering on May 29, 2021
