Regression performance varies hugely on shuffling training and testing data

Question

I'm working on a regression problem to predict a variable y from an input vector X with about 10 columns. To split the data for training and testing, I use scikit-learn's train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

With shuffle=True, the rows of X and y are shuffled (together with their indices) before the split.
For the purpose of this example, I will consider only one algorithm, RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
regr1 = RandomForestRegressor(n_estimators=200)
regr1.fit(X_train, y_train)
y_pred = regr1.predict(X_test)

When I train the regressor with X_train and y_train from the above method, I get results that I consider good. In my application, however, the model needs to predict on sequential data (rows in their original, unshuffled order).
Therefore, I tried the following approach, which shuffles only the training data and keeps the test data as is.
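import pandas as pd

# Split without shuffling: the last 20% of the rows become the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
# Shuffle only the training rows, keeping each X row aligned with its y value
train_data = pd.concat([X_train, y_train], axis=1)
train_data = train_data.sample(frac=1)
X_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]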

In the above method, I first split the data into train and test sets, then shuffle only the training data and keep the test data with its original indices. When I train the regressor with the exact same parameters as before and test with X_test, I get significantly poorer results.
I have also tried leaving both the training and test data unshuffled, and got bad results as well. I want to be able to train the model so that it can predict unknown values arriving in sequential order (as they would in real time).
I'm not able to understand why the shuffling of the test data alone affects the performance, as the model should merely be predicting based on the trained parameters which entirely depend on the training data.

jared3412341 · Answer

I can't add comments yet, so that's why I'm making a post.
It is natural to get worse results without shuffling on time series data (I assume that's the kind of data you have, since you don't want to shuffle). Consider this toy example. You want to predict the sales of a grocery store, and your data is the store's sales for every day. If you shuffle, your training set can contain dates that are later than dates in your test set. For example, if the 2nd of March is in your training set, it is easy to predict the sales for the 1st of March when it is in the test set. However, this situation is not feasible in real life: you don't have data about the future when making a prediction. The score from the shuffled split is therefore optimistic, and the score from the unshuffled split is closer to what you would see in real time.
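To make the effect concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration, not your dataset: the target is a random walk, the only feature is the time index, and the helper name split_score is made up. It fits the same forest twice, once on a shuffled 80/20 split and once on a split where the test rows are the last 20%.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data (an assumption, not the asker's dataset):
# the target is a random walk and the only feature is the time index.
rng = np.random.default_rng(0)
n = 500
X = np.arange(n, dtype=float).reshape(-1, 1)
y = np.cumsum(rng.normal(size=n))


def split_score(shuffle):
    """Fit a forest on an 80/20 split and return R^2 on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=shuffle, random_state=0
    )
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Shuffled split: test rows sit between training rows in time.
r2_shuffled = split_score(shuffle=True)
# Unshuffled split: every test row comes after every training row.
r2_chronological = split_score(shuffle=False)

print(f"R^2 with shuffled split:      {r2_shuffled:.3f}")
print(f"R^2 with chronological split: {r2_chronological:.3f}")
```

In the shuffled case each test point has training neighbours just before and after it, so the forest only has to interpolate. In the chronological case every test point lies beyond the training range, where a tree-based model can only return the values it saw at the end of training. If you want cross-validation that respects time order, scikit-learn provides TimeSeriesSplit in sklearn.model_selection, which always trains on earlier rows and tests on later ones.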
