Data Science Asked on July 29, 2020
I have data with the following structure:
created_at | customer_id | features | target
2019-01-01 2 xxxxxxxx y
2019-01-02 3 xxxxxxxx y
2019-01-03 3 xxxxxxxx y
...
That is, a session timestamp, a customer id, some features, and a target. I want to build an ML model to predict this target, and I’m having issues to do cross-validation properly.
The idea is that this model is deployed and used to model new customers. For this reason, I need the cross-validation setting to satisfy the following properties:
created_at
of the validation set to be higher than all created_at
of the training set.Can you think of a way of doing this? Is there an implementation in python or in the scikit-learn ecosystem?
Sort on the customer id. And than do the time series split. If there is any overlapping than drop these rows if possible.
These are mutually exclusive conditions, meaning that if you have class 2 for customer id in the beginning of the time series and Right and the end of it, you can not expect not to have to drop these rows in the beginning. Because not doing that would damage one of the two posed conditions.
Answered by Noah Weber on July 29, 2020
As @Noah Weber mentioned, one solution is to:
Below is a code sample I was writing at the same time he answered.
import pandas as pd
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import TimeSeriesSplit
# Generating dates
def pp(start, end, n):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
fake_date = pp(start, end, 500)
# Fake dataframe
df = pd.DataFrame(data=np.random.random((500,5)), index=fake_date, columns=['feat'+str(i) for i in range(5)])
df['customer_id'] = np.random.randint(0, 5, 500)
df['label'] = np.random.randint(0, 3, 500)
# First split by customer
rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=42)
for train_cust, test_cust in rkf.split(df['customer_id'].unique()):
print("training/testing with customers : " + str(train_cust)+"/"+str(test_cust))
train_df = pd.concat( [ df.groupby('customer_id').get_group(i) for i in train_cust ])
test_df = pd.concat( [ df.groupby('customer_id').get_group(i) for i in train_cust ])
# Then sort all the training customer (if not already sorted)
tscv = TimeSeriesSplit(max_train_size=None, n_splits=5)
sorted_df = train_df.sort_index()
X = sorted_df[['feat'+str(i) for i in range(5)]] # get the features
y = sorted_df['label'] # get the corresponding labels
# Then do the time series split
for train_index, test_index in tscv.split(X.values):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y[train_index], y[test_index]
# What do you want here ?
# I guess add all other customer from first split
X_test_full = pd.concat([ X_test, test_df[['feat'+str(i) for i in range(5)]] ])
y_test_full = pd.concat([y_test, test_df['label']])
Note: Generating random dates is based on this post
Answered by etiennedm on July 29, 2020
As a first porint, when you say "The idea is that this model is deployed and used to model new customers" I guess you mean and used to infere on new customers, is it correct? I can think of two possible options:
following the properties you impose, you can first make use of the TimeSeriesSplit cross-validator by scikit-learn, with wich you get the time-ordered indices of each train-validation split, so that you can use them later on the clients IDs you decide to fulfill the second condition, something like:
As a second option, you could try to apply clustering on your clients, based on certain features, and build as many models as clients types you get (each cluster having n clients history data). This would solve a possible problem I see in your approach, which is (due to the second restriction) not using a client whole history data both for training and validating
Answered by German C M on July 29, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP