Data Science Asked by Michael Pyle on August 31, 2021
I am attempting to use XGBoost in R to train a model that predicts a fixed number of target variables using all data from previous dates, as well as the two categorical variables (Cat1 and Cat2) for the current date, as predictors. The original data is in this format:
╔═════════╦═════════╦══════════╦══════╦══════╦══════╦══════╗
║ Target1 ║ Target2 ║ Date ║ Cat1 ║ Cat2 ║ Var1 ║ Var2 ║
╠═════════╬═════════╬══════════╬══════╬══════╬══════╬══════╣
║ 1       ║ 2       ║ 01/01/20 ║ A    ║ B    ║ 3    ║ 4    ║
║ 5       ║ 6       ║ 02/01/20 ║ C    ║ D    ║ 7    ║ 8    ║
║ 8       ║ 7       ║ 03/01/20 ║ A    ║ D    ║ 6    ║ 5    ║
║ 4       ║ 3       ║ 04/01/20 ║ C    ║ B    ║ 2    ║ 1    ║
╚═════════╩═════════╩══════════╩══════╩══════╩══════╩══════╝
I conceptualise the training data as looking like this for each row of the data frame, where the Train_DataFrame column contains all data from previous dates:
╔═════════╦═════════╦══════╦══════╦═════════════════╦═════════╦══════╦══════╦══════╦══════╗
║ Target1 ║ Target2 ║ Cat1 ║ Cat2 ║ Train_DataFrame ║         ║      ║      ║      ║      ║
╠═════════╬═════════╬══════╬══════╬═════════════════╬═════════╬══════╬══════╬══════╬══════╣
║ 4       ║ 3       ║ C    ║ B    ║ Target1         ║ Target2 ║ Cat1 ║ Cat2 ║ Var1 ║ Var2 ║
║         ║         ║      ║      ║ 1               ║ 2       ║ A    ║ B    ║ 3    ║ 4    ║
║         ║         ║      ║      ║ 5               ║ 6       ║ C    ║ D    ║ 7    ║ 8    ║
║         ║         ║      ║      ║ 8               ║ 7       ║ A    ║ D    ║ 6    ║ 5    ║
╚═════════╩═════════╩══════╩══════╩═════════════════╩═════════╩══════╩══════╩══════╩══════╝
First question: is it possible to pass an entire data frame as a variable to a model?
If so, how can this be done in a memory-efficient manner that avoids data duplication? I know I could have the Train_DataFrame column contain a list of the "past" data for each row, but this would duplicate data and use memory inefficiently. Is there a way to have this column contain a subsetting function that is applied to the original data frame, for example?
Or is there a better approach to the problem that I am missing?
Thanks in advance.
"First question: is it possible to pass an entire data frame as a variable to a model?"
No, each feature must be a single value. What you could do instead is flatten the data frame into one vector containing all of its values, assuming the size is fixed: each feature column would then correspond to a specific cell in the original data frame.
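As a rough illustration of the flattening idea, here is a minimal sketch. It is written in plain Python rather than R purely to keep it self-contained, and it assumes a fixed window of three past rows; `WINDOW`, `flatten_history`, `build_examples` and the `_lagN` naming are made up for the example and are not part of XGBoost or any library.

```python
# Rows in date order; column names follow the question's example table.
rows = [
    {"Target1": 1, "Target2": 2, "Cat1": "A", "Cat2": "B", "Var1": 3, "Var2": 4},
    {"Target1": 5, "Target2": 6, "Cat1": "C", "Cat2": "D", "Var1": 7, "Var2": 8},
    {"Target1": 8, "Target2": 7, "Cat1": "A", "Cat2": "D", "Var1": 6, "Var2": 5},
    {"Target1": 4, "Target2": 3, "Cat1": "C", "Cat2": "B", "Var1": 2, "Var2": 1},
]

WINDOW = 3  # assumption: a fixed number of past rows per training example


def flatten_history(rows, i, window):
    """Build one flat feature dict for row i: the current Cat1/Cat2 plus
    one named feature per cell of the `window` rows that precede it."""
    features = {"Cat1": rows[i]["Cat1"], "Cat2": rows[i]["Cat2"]}
    for lag in range(1, window + 1):
        past = rows[i - lag]
        for name, value in past.items():
            features[f"{name}_lag{lag}"] = value
    return features


def build_examples(rows, window):
    """Yield (features, targets) for every row that has a full window of history."""
    for i in range(window, len(rows)):
        targets = (rows[i]["Target1"], rows[i]["Target2"])
        yield flatten_history(rows, i, window), targets


examples = list(build_examples(rows, WINDOW))
```

With the four example rows this produces a single training example for the last date, with 20 features (2 current categories plus 3 past rows of 6 cells each). The categorical cells would still need a numeric encoding, for example one-hot, before being handed to XGBoost.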
But I think an even better option in your case is to look into methods which take (chronological) sequences into account. Conditional Random Fields might be a good option, assuming you need to predict the target variables for the whole sequence?
Answered by Erwan on August 31, 2021
Let's break your question down into sub-parts.
Q1: Do you mean the predictor needs to predict two target values (Target1 and Target2)?
Answer: A standard model predicts only ONE target value, which can be either categorical or numerical. For two targets, the usual approach is to train one model per target.
Q2: One column (Train_DataFrame) contains all data from previous dates. Why?
Answer: We don't need to do this. We can simply put the data from previous dates in separate columns instead.
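When the amount of history grows with every row, one possible reading of "separate columns" is to summarise everything seen so far into a fixed set of columns. The sketch below is only that: it assumes running means as the summary (my choice for illustration, not something stated above), uses plain Python rather than R to stay self-contained, and the names `expanding_features`, `n_prev` and `_prev_mean` are hypothetical. It keeps running sums, so no past rows are copied.

```python
# Rows in date order; column names follow the question's example table.
rows = [
    {"Target1": 1, "Target2": 2, "Cat1": "A", "Cat2": "B", "Var1": 3, "Var2": 4},
    {"Target1": 5, "Target2": 6, "Cat1": "C", "Cat2": "D", "Var1": 7, "Var2": 8},
    {"Target1": 8, "Target2": 7, "Cat1": "A", "Cat2": "D", "Var1": 6, "Var2": 5},
    {"Target1": 4, "Target2": 3, "Cat1": "C", "Cat2": "B", "Var1": 2, "Var2": 1},
]


def expanding_features(rows, numeric_cols):
    """For each row with at least one earlier row, emit the current Cat1/Cat2
    plus the mean of each numeric column over all earlier rows."""
    sums = {c: 0.0 for c in numeric_cols}
    out = []
    for n_prev, row in enumerate(rows):
        if n_prev > 0:  # the first row has no history to summarise
            feats = {"Cat1": row["Cat1"], "Cat2": row["Cat2"], "n_prev": n_prev}
            for c in numeric_cols:
                feats[f"{c}_prev_mean"] = sums[c] / n_prev
            out.append((feats, (row["Target1"], row["Target2"])))
        # Update the sums only after emitting, so features never see the current row.
        for c in numeric_cols:
            sums[c] += row[c]
    return out


table = expanding_features(rows, ["Target1", "Target2", "Var1", "Var2"])
```

Each row of `table` has the same fixed set of columns regardless of how many earlier dates exist, and memory use stays proportional to the number of columns rather than to the length of the history.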
Q3: Passing an entire data frame as one variable?
Answer: Not required. Pass it as it is.
Q4: Is a list of the past data for each row a duplicate?
Answer: That is OK. Your question is a bit unclear, though. Is the data imported as it is, or are you using an external engine such as MySQL or MongoDB? If so, you can write a query that returns only unique values.
Q5: Is there a better approach?
Answer: It looks like you have a big dataset. Try using MongoDB: load your data into MongoDB and import it into R. RStudio's Environment pane, on the right-hand side, shows the values of all variables, which is helpful for checking the data.
I hope this helps!
Answered by Nidhi Garg on August 31, 2021