Time-series grouped cross-validation

Data Science Asked on July 29, 2020

I have data with the following structure:

created_at | customer_id | features | target
2019-01-01             2   xxxxxxxx       y  
2019-01-02             3   xxxxxxxx       y  
2019-01-03             3   xxxxxxxx       y  
...

That is, a session timestamp, a customer id, some features, and a target. I want to build an ML model to predict this target, and I'm having trouble doing cross-validation properly.

The idea is that this model is deployed and used to model new customers. For this reason, I need the cross-validation setting to satisfy the following properties:

  • It has to be done in a time-series way: that is, for every train-validation split in cross-validation, all created_at values in the validation set must be later than all created_at values in the training set.
  • It has to split customers: that is, for every train-validation split in cross-validation, we cannot have any customer both in train and validation.

Can you think of a way of doing this? Is there an implementation in python or in the scikit-learn ecosystem?

3 Answers

Sort on the customer id, then do the time series split. If there is any overlap, drop those rows if possible.

These two conditions can conflict: if a given customer id (say 2) appears both at the beginning of the time series and right at the end of it, you cannot avoid dropping the rows at the beginning, because keeping them would violate one of the two posed conditions.
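
A minimal sketch of this approach (my illustration, not part of the original answer, assuming the question's created_at and customer_id columns): choose a time cutoff, then drop the rows of any customer whose sessions fall on both sides of it.

import pandas as pd

def split_at_cutoff(df, cutoff):
    # Rows before the cutoff are candidates for training,
    # rows at/after it for validation (the time-series condition)
    before = df[df['created_at'] < cutoff]
    after = df[df['created_at'] >= cutoff]
    # Customers present on both sides would break the grouping
    # condition, so drop their rows as the answer suggests
    overlap = set(before['customer_id']) & set(after['customer_id'])
    train = before[~before['customer_id'].isin(overlap)]
    valid = after[~after['customer_id'].isin(overlap)]
    return train, valid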

Answered by Noah Weber on July 29, 2020

As @Noah Weber mentioned, one solution is to:

  • first split by customer id
  • then do the time series split

Below is a code sample I was writing at the same time he answered.

import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import TimeSeriesSplit

# Generate n random dates between start and end
def pp(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
fake_date = pp(start, end, 500)

# Fake dataframe
df = pd.DataFrame(data=np.random.random((500,5)), index=fake_date, columns=['feat'+str(i) for i in range(5)])
df['customer_id'] = np.random.randint(0, 5, 500)
df['label'] = np.random.randint(0, 3, 500)

# First split by customer: rkf.split yields positional indices into the
# array of unique customer ids, so map them back to the actual ids
customers = df['customer_id'].unique()
groups = df.groupby('customer_id')
rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=42)
for train_cust, test_cust in rkf.split(customers):
    print("training/testing with customers : " + str(customers[train_cust]) + "/" + str(customers[test_cust]))
    train_df = pd.concat([groups.get_group(i) for i in customers[train_cust]])
    test_df = pd.concat([groups.get_group(i) for i in customers[test_cust]])

    # Then sort all the training rows by date (if not already sorted)
    tscv = TimeSeriesSplit(max_train_size=None, n_splits=5)
    sorted_df = train_df.sort_index()
    X = sorted_df[['feat'+str(i) for i in range(5)]] # get the features
    y = sorted_df['label'] # get the corresponding labels
    
    # Then do the time series split
    for train_index, test_index in tscv.split(X.values):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # What do you want here?
        # I guess add all the other customers from the first split
        X_test_full = pd.concat([ X_test, test_df[['feat'+str(i) for i in range(5)]] ])
        y_test_full = pd.concat([y_test, test_df['label']])

Note: Generating random dates is based on this post

Answered by etiennedm on July 29, 2020

First, when you say "The idea is that this model is deployed and used to model new customers", I guess you mean used to infer on new customers, is that correct? I can think of two possible options:

  1. Following the properties you impose, you can first make use of the TimeSeriesSplit cross-validator from scikit-learn, with which you get the time-ordered indices of each train-validation split; you can then filter by client ID to fulfill the second condition, something like the sketch after this list.

  2. As a second option, you could try to apply clustering on your clients, based on certain features, and build as many models as client types you get (each cluster containing the history data of n clients). This would solve a possible problem I see in your approach, which is (due to the second restriction) never using a client's whole history data for both training and validating.
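
A minimal sketch of option 1 (my illustration, assuming the question's created_at and customer_id columns): take time-ordered folds from TimeSeriesSplit, then drop the validation rows whose customer also appears in the training fold.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

def time_then_customer_folds(df, n_splits=5):
    # Condition 1: time-ordered folds over the chronologically sorted data
    df = df.sort_values('created_at').reset_index(drop=True)
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, valid_idx in tscv.split(df):
        train, valid = df.iloc[train_idx], df.iloc[valid_idx]
        # Condition 2: keep only validation rows whose customer
        # does not appear in the training fold
        valid = valid[~valid['customer_id'].isin(train['customer_id'])]
        yield train, valid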

Answered by German C M on July 29, 2020
