Data Science Asked on August 15, 2021
I have an unbalanced panel data set stored in a pandas DataFrame.
I would like to split this data into a training set and a testing set. scikit-learn's train_test_split function will not work because it does a random split, so it will likely place observations from t + 1 into the training set and observations from t into the test set. That makes no sense, because the future should not be used to predict the past.
TimeSeriesSplit will also not work, because it does not take the panel dimension of my data set into account.
Is there an easy way to do a train/test split on unbalanced panel data sets? The split should (1) take the panel dimension of the data set into account, and (2) place earlier observations in the training set and later observations in the testing set.
I am not sure what you mean by panel dimension; it would help to clarify that. Nonetheless, regarding the imbalance, stratified sampling takes class imbalance into account (which is not the same thing as an unbalanced panel, where entities have different numbers of observations). And if you want to have one group, t, in the training set and t+1 in the test set, you can use group sampling. Sklearn has both implementations [1] [2].
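One way to read the suggestion above is to treat each time period as a group and hold out the latest periods. Below is a minimal sketch under that assumption, using plain pandas rather than a scikit-learn splitter. The function name `panel_time_split`, the column names `entity` and `time`, and the toy data are all hypothetical and only serve the illustration.

```python
import pandas as pd

def panel_time_split(df, time_col, test_ratio=0.2):
    """Hold out the latest periods; rows of every entity stay on their side of the cutoff."""
    # Distinct periods in chronological order (hypothetical helper, not a scikit-learn API).
    periods = df[time_col].drop_duplicates().sort_values().tolist()
    n_test = max(1, round(len(periods) * test_ratio))
    cutoff = periods[-n_test]  # first period that belongs to the test set
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

# Toy unbalanced panel: entities are observed over different spans.
panel = pd.DataFrame({
    "entity": ["A"] * 5 + ["B"] * 3 + ["C"] * 2,
    "time":   [1, 2, 3, 4, 5, 3, 4, 5, 1, 2],
})
train, test = panel_time_split(panel, "time", test_ratio=0.4)
```

With a single global cutoff, no training row is later than any test row, which matches requirement (2) of the question. The trade-off is that entities observed only before the cutoff (C in the toy data) end up entirely in the training set.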
Answered by 20roso on August 15, 2021
You can use the following function to keep the same class distribution in the training and test sets:
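import numpy as np
from sklearn.model_selection import StratifiedKFold

def split_train_test(data: np.ndarray, distribution: list, test_ratio: float) -> tuple:
    # One fold is held out as the test set, so n_splits is about 1 / test_ratio.
    # random_state is omitted: recent scikit-learn versions raise an error if it is set while shuffle=False.
    skf = StratifiedKFold(n_splits=max(2, round(1 / test_ratio)), shuffle=False)
    return next(skf.split(data, distribution))  # (train_indices, test_indices)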
According to the StratifiedKFold documentation, the folds are made by preserving the percentage of samples for each class, so we can pass the labels as the distribution argument to get a properly stratified sample, and then call it like:
split_train_test(data, labels, test_ratio)
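To see what "preserving the percentage of samples for each class" means in practice, here is a small self-contained sketch. The labels are made up (an 80/20 class split chosen purely for illustration), and the check looks at class shares only; it does not order rows by time.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with an 80/20 class imbalance (illustrative assumption).
labels = np.array([0] * 80 + [1] * 20)
data = np.arange(len(labels)).reshape(-1, 1)

# With 5 folds, each held-out fold contains about 1/5 of the rows.
skf = StratifiedKFold(n_splits=5, shuffle=False)
train_idx, test_idx = next(skf.split(data, labels))

# Share of the minority class on each side of the split.
train_share = labels[train_idx].mean()
test_share = labels[test_idx].mean()
print(len(train_idx), len(test_idx), train_share, test_share)
```

Both shares should come out equal to the overall minority share, which is the property the answer relies on.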
Answered by Marzi Heidari on August 15, 2021