The distribution of dataset train and test are the differents, how to fix this?

Data Science Asked by Natalie Chaves on September 5, 2020

I am new in data science and like some help to understand my problem. For instance, I have two signals non-stationary for the same condition (figure 1).

enter image description here

I acquisition them at different times(in the morning and in the afternoon), when applying the Kolmogorov-Smirnov test, the null hypothesis was rejected, I don’t understand why distribution is different if I no change any parameters in my system of acquisition.

This is the main trouble in my analysis because of this no get model any algorithm of machine learning to classification (overfitting).

I read something about the Covariate Shift (*I saw this post), and Kullback_Leiber Importance Estimation Procedure, but I don’t know iff will really work out.

*’Different Test Set and Training Set Distribution

One Answer

  • how does it compare to longer timespans? Can you still observe noticeable drift?
  • to some extent this is usual behaviour
  • here is a guide for KL divergence, It's just a tool you can use to compare probability distributions.
  • exploration is equally important as hard stats. For me those plots look alright. There's no defined point, telling you when you can or cannot do things. That's whole point of ML, you don't have a priori knowledge. Best run some models and check validation metrics.

Answered by Piotr Rarus - Reinstate Monica on September 5, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP