TransWikia.com

Can an entire data frame be used as a prediction variable?

Data Science Asked on December 28, 2021

I am attempting to use XGBoost in R to train a model that predicts a fixed number of target variables using all data from previous dates, as well as the two categorical variables (Cat1 and Cat2) for the current date as predictors. The original data is in this format:

╔═════════╦═════════╦══════════╦══════╦══════╦══════╦══════╗
║ Target1 ║ Target2 ║ Date     ║ Cat1 ║ Cat2 ║ Var1 ║ Var2 ║
╠═════════╬═════════╬══════════╬══════╬══════╬══════╬══════╣
║ 1       ║ 2       ║ 01/01/20 ║ A    ║ B    ║ 3    ║ 4    ║
║ 5       ║ 6       ║ 02/01/20 ║ C    ║ D    ║ 7    ║ 8    ║
║ 8       ║ 7       ║ 03/01/20 ║ A    ║ D    ║ 6    ║ 5    ║
║ 4       ║ 3       ║ 04/01/20 ║ C    ║ B    ║ 2    ║ 1    ║
║         ║         ║          ║      ║      ║      ║      ║
╚═════════╩═════════╩══════════╩══════╩══════╩══════╩══════╝

And I conceptualise the training data looking like this for each row of the data frame, where the Train_DataFrame column contains all data from previous dates:

╔═════════╦═════════╦══════╦══════╦═════════════════╦═════════╦══════╦══════╦══════╦══════╗
║ Target1 ║ Target2 ║ Cat1 ║ Cat2 ║ Train_DataFrame ║         ║      ║      ║      ║      ║
╠═════════╬═════════╬══════╬══════╬═════════════════╬═════════╬══════╬══════╬══════╬══════╣
║       4 ║       3 ║ C    ║ B    ║ Target1         ║ Target2 ║ Cat1 ║ Cat2 ║ Var1 ║ Var2 ║
║         ║         ║      ║      ║ 1               ║ 2       ║ A    ║ B    ║ 3    ║ 4    ║
║         ║         ║      ║      ║ 5               ║ 6       ║ C    ║ D    ║ 7    ║ 8    ║
║         ║         ║      ║      ║ 8               ║ 7       ║ A    ║ D    ║ 6    ║ 5    ║
║         ║         ║      ║      ║                 ║         ║      ║      ║      ║      ║
╚═════════╩═════════╩══════╩══════╩═════════════════╩═════════╩══════╩══════╩══════╩══════╝

First question; is it possible to pass an entire data frame as a variable to a model?

If so, then how can this be done in a memory efficient manner, so as to avoid data duplication? I know I could have the Train_DataFrame column contain lists of each "past" data, however, this would lead to data duplication and inefficient memory usage. Is there a way to have this column contain a sub-setting function to pass to the original data frame, for example?

Or is there a better approach to the problem that I am potentially missing?

Thanks in advance.

2 Answers

Lets break down your questions into sub-parts. Q1: do you mean the predictor needs to predict 2 target values? (Target 1 and target2)

Answer: the model can always predict only ONE target value. This target value can be either categorial or numerical.

Q2. One column ( train_dataframe) contains all data from previous dates. Why?

Ans: we don’t need to do this. We can simply use the entire data from previous dates in separate columns too.

Q3 : passing an entire dataframe as 1 variable? Ans: not required. Pass it as it is.

Q4: list of each past data is duplicate? Ans: that is ok. Your question is a bit unclear. Is the data imported as it is or you are using an external engine for it? Like mysql/ mongodb? Then you can put a query to get only unique values.

Q5: Is there a better approach? Ans: it looks like you have a big dataset. Try using mongodb. Load your data into mongodb and import it to R. R has a wonderful feature of showing the values of all variables in its right side window. It will be helpful to check it.

I hope this helps!

Answered by Nidhi Garg on December 28, 2021

First question; is it possible to pass an entire data frame as a variable to a model?

No, each feature must be a single value. In other words you could provide the data frame as a vector containing all the values, assuming the size is fixed: each column would correspond to a specific cell in the original data frame.

But I think an even better option in your case is to look into methods which take (chronological) sequences into account. Conditional Random Fields might be a good option, assuming you need to predict the target variables for the whole sequence?

Answered by Erwan on December 28, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP