How can I do a train test split for an unbalanced panel data set in Python?

Question

I have an unbalanced panel data set stored in a pandas data frame.
I would like to split this data into a training set and a testing set. scikit-learn's train_test_split function will not work because it does a random split, so it will likely place observations from t + 1 in the training set and observations from t in the test set.
That, of course, makes no sense, because the future cannot be used to predict the past.
TimeSeriesSplit will also not work because this function does not take into consideration the panel dimension of my data set.
Is there an easy way to do a train test split on unbalanced panel data sets? This split should (1) take into consideration the panel dimension of the data set, and (2) place earlier observations in the training set and later observations in the testing set.

20roso · Answer

I'm not sure what you mean by panel dimension; I think it would be best to clarify that. Nonetheless, for your second question, stratified sampling takes the imbalance into account. And if you want one group, t, in the training set and t + 1 in the test set, you can use group sampling. Sklearn has implementations of both [1] [2].
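
One way to read the requirement of keeping period t in training and t + 1 in testing is a single time cutoff shared by every entity in the panel. The following is only a minimal sketch under that assumption, written in plain pandas rather than with an sklearn splitter; the function name panel_time_split and the columns entity, t, and y are hypothetical and do not come from the question.

```python
import pandas as pd

def panel_time_split(df, time_col, test_ratio=0.2):
    """Split a panel on one time cutoff shared by all entities (illustrative)."""
    # Sorted unique periods observed anywhere in the panel
    periods = sorted(df[time_col].unique())
    # Number of most recent periods that go to the test set (at least one)
    n_test = max(1, int(round(len(periods) * test_ratio)))
    cutoff = periods[-n_test]  # first period that belongs to the test set
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

# Hypothetical unbalanced panel: "a" has 5 periods, "b" has 3, "c" has 2
df = pd.DataFrame({
    "entity": ["a"] * 5 + ["b"] * 3 + ["c"] * 2,
    "t": [1, 2, 3, 4, 5, 2, 3, 4, 4, 5],
    "y": range(10),
})

train, test = panel_time_split(df, "t", test_ratio=0.4)
```

Because the cutoff is defined on the time column and not on row counts, entities with few observations may end up entirely in one of the two sets (entity "c" appears only in the test set here). Whether that is acceptable depends on the model being fitted.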

Answered by 20roso on August 15, 2021

Marzi Heidari · Answer

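You can use the following function to keep the class distribution of the training set in the test set:
import numpy as np
from sklearn.model_selection import StratifiedKFold

def split_train_test(data: np.ndarray, distribution: list, test_ratio: float) -> tuple[np.ndarray, np.ndarray]:
    # Use round(1 / test_ratio) folds so that one held-out fold is about test_ratio of the data
    skf = StratifiedKFold(n_splits=round(1 / test_ratio), shuffle=True, random_state=1374)
    return next(skf.split(data, distribution))  # (train indices, test indices)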

According to the StratifiedKFold documentation, the folds are made by preserving the percentage of samples for each class,
so we can pass the labels as distribution to get a representative sample and then call it like:
split_train_test(data, labels, test_ratio)
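
As a rough check of the idea behind split_train_test, the sketch below draws one stratified fold from a toy array and compares the class shares on both sides of the split. The data, the labels, and the 25% test share are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 samples, 16 of class 0 and 4 of class 1
data = np.arange(40).reshape(20, 2)
labels = np.array([0] * 16 + [1] * 4)

# One stratified fold with a 25% test share (n_splits = 1 / 0.25 = 4)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1374)
train_idx, test_idx = next(skf.split(data, labels))

# Share of class 1 in each part; stratification keeps them close to each other
train_share = labels[train_idx].mean()
test_share = labels[test_idx].mean()
print(len(train_idx), len(test_idx), train_share, test_share)
```

Note that the returned values are index arrays, so the actual subsets are data[train_idx] and data[test_idx].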
