Consolidating multivariate time-series information from many data sets

Question

I am having trouble setting up a time-series analysis problem. I have 30 data sets, each corresponding to one project. Each project has 7 features, and each feature is a time series sampled weekly from 2018 to today.
One of the features is how much the project is under/over budget, and I wish to use this as the label.
If I were learning on a single project, this would be a straightforward multivariate time-series problem. As an example, I could transform the data from this:
Week   X1   Y1
1      0.5  3
2      1    5
3      1.5  8

to this:
X1   X2   X3   Y
-    -    0.5  3
0.5  3    1    5
1    5    1.5  8

Then I would use the second table as my input data. However, with 30 different projects that all share the same time steps, I'm not sure how to combine this information so that a single model can learn from it. One solution I considered was a bagging approach: train 30 models and combine their predictions by voting or a weighted average. I feel like this isn't the best approach, though. If anyone has dealt with a problem like this before, please let me know. Thanks in advance!

aranglol · Accepted Answer

Let a single observation in your dataset be the target variable y for the jth project in the ith week. Then use a categorical variable to indicate the jth project, along with any other features you think are relevant (such as lagged values, which are likely to be important). Finally, for extrapolation and possible seasonality, include features that represent time in some way. A natural choice is the observed week, but month or even year may also help (if your data spans many years, for example). Other variables might be holiday indicators, etc.
Overall, your data could look like this. Assume you have 3 projects and observed values for 3 weeks.

| Week | Project | ... | y |
|------|---------|-----|---|
| 1    | 1       | ... | 3 |
| 1    | 2       | ... | 8 |
| 1    | 3       | ... | 9 |
| 2    | 1       | ... | 3 |
| 2    | 2       | ... | 7 |
| 2    | 3       | ... | 4 |
| 3    | 1       | ... | 7 |
| 3    | 2       | ... | 0 |
| 3    | 3       | ... | 1 |

And so on. "..." refers to the other features you can add to capture autocorrelation and/or seasonality, as I described above (and as you have demonstrated), along with the other project features you mentioned. The one-week lagged values you presented must of course be computed within the same jth project (i.e. group by project number when you calculate these features). In total, you will have 30 * N observations, where N is the number of weeks in a single series.
As an aside, this is how the data in the Rossmann and the recent M5 competitions was presented, although in the latter case the data was hierarchical as well.
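
To make the layout above concrete, here is a minimal sketch in pandas. The data is random toy data, and the column names (`week`, `project`, `x`, `y`, `x_lag1`, `y_lag1`) are placeholders I made up for illustration, with `x` standing in for one of your features. The important part is that the lags are computed with a group-by on the project, so a lag never leaks across projects.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 projects observed over 4 weeks,
# one feature "x" and the target "y" (random values, illustration only).
rng = np.random.default_rng(0)
weeks = [1, 2, 3, 4]
projects = [1, 2, 3]
long_df = pd.DataFrame(
    [(w, p) for w in weeks for p in projects],
    columns=["week", "project"],
)
long_df["x"] = rng.normal(size=len(long_df))
long_df["y"] = rng.normal(size=len(long_df))

# Sort so that a shift of 1 within a project means "previous week".
long_df = long_df.sort_values(["project", "week"]).reset_index(drop=True)

# One-week lags, computed separately for each project.
grouped = long_df.groupby("project")
long_df["x_lag1"] = grouped["x"].shift(1)
long_df["y_lag1"] = grouped["y"].shift(1)

# Mark the project identifier as categorical rather than numeric.
long_df["project"] = long_df["project"].astype("category")

# The first week of every project has no lag available, so drop it.
train_df = long_df.dropna(subset=["x_lag1", "y_lag1"])
```

With 30 projects and N weeks, `long_df` would have 30 * N rows, and `train_df` would lose one row per project for each lag step used. How the categorical `project` column is encoded (one-hot, native categorical support, etc.) depends on the model you pick.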

drops · Answer

You could put all the information into one table with 30*7 features, since the timeline is the same for all tables.
Then, to predict the under/over budget, you can build a model with 30 output nodes, where each node indicates whether a certain project is under or over budget.
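
As a rough sketch of what that single table could look like, assuming the per-project data is first stacked in long form and then pivoted with pandas (toy random data; the names `x`, `y`, `x_p1`, `y_p1`, ... are placeholders, not anything from your data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 projects, 4 weeks, one feature "x" and target "y".
rng = np.random.default_rng(0)
weeks = [1, 2, 3, 4]
projects = [1, 2, 3]
long_df = pd.DataFrame(
    [(w, p) for w in weeks for p in projects],
    columns=["week", "project"],
)
long_df["x"] = rng.normal(size=len(long_df))
long_df["y"] = rng.normal(size=len(long_df))

# One row per week, one column per (variable, project) pair.
wide = long_df.pivot(index="week", columns="project", values=["x", "y"])

# Flatten the two-level column index into names like "x_p1", "y_p3".
wide.columns = [f"{name}_p{proj}" for name, proj in wide.columns]

# Inputs: every project's feature columns. Targets: one column per project,
# which corresponds to one output node per project.
x_cols = [c for c in wide.columns if c.startswith("x_")]
y_cols = [c for c in wide.columns if c.startswith("y_")]
X = wide[x_cols]
Y = wide[y_cols]
```

Note that in this layout the table has only one row per week, so with N weeks the model sees N training rows, however many columns there are.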
