How to split dataset for time-series prediction?

Cross Validated Asked by tobip on December 4, 2020

I have historic sales data from a bakery (daily, over 3 years). Now I want to build a model to predict future sales (using features like weekday, weather variables, etc.).

How should I split the dataset for fitting and evaluating the models?

  1. Does it need to be a chronological train/validation/test split?
  2. Would I then do hyperparameter tuning with the train and validation set?
  3. Is (nested) cross validation a bad strategy for a time-series problem?

EDIT

Here are some links I came across after following the URL suggested by @ene100:

  • Rob Hyndman describing "rolling forecasting origin" in theory and in practice (with R code)
  • other terms for rolling forecasting origin are "walk forward optimization" (here or here), "rolling horizon" or "moving origin"
  • it seems that these techniques won’t be integrated into scikit-learn in the near future, because “the demand for and seminality of these techniques is unclear” (stated here).

And this is another suggestion for time-series cross validation.
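To make the rolling forecasting origin concrete, here is a minimal hand-rolled sketch in Python. The file name, the weather feature columns, the forecast horizon and the model choice are all placeholders for whatever the bakery data actually contains, not a prescription:

    # Hand-rolled rolling forecasting origin (expanding training window).
    # "daily_sales.csv" and the weather columns are placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("daily_sales.csv", parse_dates=["date"]).sort_values("date")
    df["weekday"] = df["date"].dt.dayofweek                  # 0 = Monday, ..., 6 = Sunday
    features = ["weekday", "temperature", "precipitation"]   # weather columns are hypothetical
    X, y = df[features].to_numpy(), df["sales"].to_numpy()

    initial_train = 2 * 365  # first two years for the first fit
    horizon = 28             # forecast four weeks past each origin
    step = 28                # then move the origin forward four weeks

    errors = []
    for origin in range(initial_train, len(df) - horizon + 1, step):
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X[:origin], y[:origin])                    # train on everything before the origin
        preds = model.predict(X[origin:origin + horizon])    # predict the next `horizon` days
        errors.append(mean_absolute_error(y[origin:origin + horizon], preds))

    print(f"rolling-origin MAE: {np.mean(errors):.2f} +/- {np.std(errors):.2f}")

The same loop gives a fixed-length rolling window instead of an expanding one if the fit uses X[origin - initial_train:origin] rather than X[:origin].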

5 Answers

Disclaimer: The method described here is original research, not based on a thorough reading of the literature. It is my best attempt at improvising a K-fold CV method for a multivariate time-series analysis with relatively short input window lengths (assuming no/low dependence over longer time spans), where there was a problem with non-homogeneous presence of data sources over the data collection period.

First, the series of observations is transformed into a series of observation-history windows of length h, with step 1 between windows. The principle is then to split the window dataset into S ordered slices (where S >> K, to approximate random splitting), each with length >> h (so as not to waste data), and to hand out the slices alternately (like dealing playing cards) to separate model instances. To keep the resulting subsets more cleanly separated, a quarantine window of length h at the beginning of each slice is held out of training.

The models are trained on all slices except their own, and their own slices are used for validation. Validation of the collection/ensemble of models is done by summing the validation error over all slices, where each slice is processed by the submodel which has not been trained on that slice. Testing on unseen data can be done using an average (or other suitable combination) of the outputs of all the trained model instances. Or one can first distill the ensemble into a single model, training on reproduction of the validation outputs.

This method is intended to reduce dependence on the stationarity of the data-generating process (including measurement reliability) over the collection period. It is also intended to give every part of the data roughly the same influence on the model.

Note that the slice length should not align too closely with periods that appear (or are expected to appear) in the data, such as the typical daily, weekly and yearly cycles; otherwise the subsets will be more biased. Imagine, as a deliberately silly example, a situation where one fold contains all night hours, another contains all day hours, and the task is to predict air temperature from radon gas concentration. I have no idea what to expect from the radon gas, but certainly the best temperature guess made with no sensible input is lower at night than during the day.

One way to test the performance of the resulting CV ensemble is to hold out every (K+1)-th slice and test the ensemble on the resulting subset. This can be extended to an outer cross-validation where a different subset is held out in each fold, at the cost of a factor of K+1 in the amount of computation needed.
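As a rough illustration only (my own paraphrase of the scheme above, with illustrative values for h, K and S), the slice-and-deal split over window start indices could be sketched like this:

    # Sketch of the slice-and-deal split over window start indices;
    # window i covers series[i : i + h]. Values of h, K and S are illustrative.
    import numpy as np

    def slice_and_deal_folds(n_obs, h=14, K=5, S=20):
        """Return K (train_starts, val_starts) pairs over window start indices."""
        n_windows = n_obs - h + 1                            # windows of length h, step 1
        slices = np.array_split(np.arange(n_windows), S)     # S ordered slices, S >> K
        folds = []
        for k in range(K):
            train, val = [], []
            for s, sl in enumerate(slices):
                if s % K == k:
                    val.append(sl)                           # dealt to this fold for validation
                else:
                    train.append(sl[h:])                     # quarantine: drop first h windows
            folds.append((np.concatenate(train), np.concatenate(val)))
        return folds

    # e.g. three years of daily data, two-week windows:
    folds = slice_and_deal_folds(n_obs=3 * 365, h=14, K=5, S=20)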

Answered by Elias Hasle on December 4, 2020

In your case you don't have a lot of options. You only have one bakery, it seems. So, to run an out-of-sample test your only option is separation in time, i.e. the training sample would run from the beginning to some recent point in time, and the holdout would run from that point to today.

If your model is not a time series model, then it's a different story. For instance, if your sales are $y_t = f(t) + \varepsilon_t$, where $f(t)$ is a function of different exogenous things like seasonal dummies, weather etc. but not of $y_{s<t}$, then this is not a dynamic time series model. In this case you can create the holdout sample in many different ways, such as a random subset of days, a month from any period in the past, etc.
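For example, a minimal sketch of such a non-dynamic, purely exogenous model evaluated on a plain random holdout might look like this (the file name and the "temperature" column are placeholders):

    # Sketch: a purely exogenous (non-dynamic) model with a random holdout.
    # "daily_sales.csv" and the "temperature" column are placeholders.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("daily_sales.csv", parse_dates=["date"])
    X = pd.get_dummies(pd.DataFrame({
        "weekday": df["date"].dt.day_name(),   # seasonal dummies from the calendar
        "month": df["date"].dt.month_name(),
    }))
    X["temperature"] = df["temperature"]       # hypothetical weather feature
    y = df["sales"]

    # Because f(t) does not depend on past sales, a random split of days is legitimate.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("holdout MAE:", mean_absolute_error(y_test, model.predict(X_test)))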

Answered by Aksakal on December 4, 2020

I often approach problems from a Bayesian perspective. In this case, I'd consider using overimputation as a strategy. This means setting up a likelihood for your data but omitting some of your outcomes. Treat those values as missing, and model those missing outcomes using their corresponding covariates. Then rotate through which data are omitted. You can do this inside of, e.g., a 10-fold CV procedure.

When implemented inside of a sampling program, this means that at each step you draw a candidate value of your omitted data value (alongside your parameters) and assess its likelihood against your proposed model. After achieving stationarity, you have counter-factual sampled values given your model which you can use to assess prediction error: these samples answer the question "what would my model have looked like in the absence of these values?" Note that these predictions will also inherit uncertainty from the uncertainty present in coefficient estimates, so when you collect all of your predicted values for, e.g. March 1, 2010 together, you'll have a distribution of predictions for that date.

The fact that these values are sampled means that you can still use error terms that depend on having a complete data series available (e.g. moving average), since you have a sampled outcome value available at every step.
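As a hedged sketch of the overimputation idea, the snippet below uses PyMC's automatic imputation of masked observations on simulated data; the feature matrix, priors and fold construction are illustrative assumptions and not part of the original answer:

    # Sketch of overimputation with PyMC on simulated data: held-out outcomes are
    # treated as missing and sampled alongside the parameters.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    n, p = 300, 4
    X = rng.normal(size=(n, p))                                   # stand-in covariates
    y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)  # stand-in outcomes

    held_out = rng.choice(n, size=n // 10, replace=False)         # one "fold" of omitted days
    y_masked = np.ma.masked_array(y, mask=np.isin(np.arange(n), held_out))

    with pm.Model():
        beta = pm.Normal("beta", 0.0, 5.0, shape=p)
        sigma = pm.HalfNormal("sigma", 2.0)
        mu = pm.math.dot(X, beta)
        # The masked entries become free variables: at each step the sampler draws a
        # candidate value for every omitted outcome, given the current parameters.
        pm.Normal("sales", mu=mu, sigma=sigma, observed=y_masked)
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

    # The posterior now holds a full distribution of imputed values for each held-out
    # day, which can be compared with y[held_out] to assess prediction error.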

Answered by Sycorax on December 4, 2020

This link from Rob Hyndman's blog has some info that may be useful: http://robjhyndman.com/hyndsight/crossvalidation/

In my experience, splitting the data into chronological sets (year 1, year 2, etc.) and checking for parameter stability over time is very useful for building something that's robust. Furthermore, if your data are seasonal, or can be split into groups in another obvious way (e.g. geographic regions), then checking for parameter stability in those sub-groups can also help determine how robust the model will be and whether it makes sense to fit separate models for separate categories of data.

I think that statistical tests can be useful but the end result should also pass the "smell test".
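As a rough illustration of the per-year parameter-stability check described above (file name, model specification and weather columns are hypothetical):

    # Fit the same specification separately per calendar year and compare coefficients;
    # wildly different coefficients across years hint at an unstable model.
    # "daily_sales.csv" and the weather columns are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("daily_sales.csv", parse_dates=["date"])
    df["weekday"] = df["date"].dt.day_name()

    for year, group in df.groupby(df["date"].dt.year):
        X = pd.get_dummies(group[["weekday", "temperature", "precipitation"]],
                           columns=["weekday"], drop_first=True)
        coefs = LinearRegression().fit(X, group["sales"]).coef_
        print(year, dict(zip(X.columns, coefs.round(2))))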

Answered by ene100 on December 4, 2020

1) Technically speaking, you don't need to test out of sample if you use AIC and similar criteria because they help avoid overfitting.

3) I don't see how you can do the standard CV because it implies training a time series model with some missing values. Instead, try using a rolling window for training and predict the response at one or more points that follow the window.

Answered by James on December 4, 2020
