Data Science Asked on August 8, 2021
I’m building a model that fills in the missing values of a DataFrame containing the number of visitors for different stores, each day:
| day | store_a | store_b | store_c |
|---|---|---|---|
| 2021-01-01 | 100 | 200 | 300 |
| 2021-01-02 | 110 | 220 | 290 |
| 2021-01-03 | 50 | 110 | 170 |
| 2021-01-04 | NaN | 220 | 290 |
| 2021-01-05 | 7 | 16 | NaN |
| 2021-01-06 | 90 | NaN | NaN |
I’m using the IterativeImputer class from scikit-learn. This imputation method sets aside one column at each step and trains an estimator on the other columns to predict the column that was set aside.
My question is: should I transpose my dataframe or not?
If I keep my dataframe as it is (1 line = 1 day, 1 column = 1 store), this assumes that we can predict the number of visitors in a store on a particular day just by looking at the other stores on that day.
But if I transpose my dataframe (1 line = 1 store, 1 column = 1 day), this assumes that we can predict the number of visitors on a particular day just by looking at that store's history on the other days.
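To make the two layouts concrete, here is a minimal sketch that runs IterativeImputer on the table above in both orientations. The helper name `impute` and the choice of `random_state=0` are illustrative assumptions, and the default estimator (BayesianRidge) is used only because it is the class default, not because it is the right choice for this data.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The example table from the question: 1 row = 1 day, 1 column = 1 store.
df = pd.DataFrame(
    {
        "store_a": [100, 110, 50, np.nan, 7, 90],
        "store_b": [200, 220, 110, 220, 16, np.nan],
        "store_c": [300, 290, 170, 290, np.nan, np.nan],
    },
    index=["2021-01-01", "2021-01-02", "2021-01-03",
           "2021-01-04", "2021-01-05", "2021-01-06"],
)


def impute(frame):
    """Hypothetical helper: impute a frame and keep its row/column labels."""
    imputer = IterativeImputer(random_state=0)
    filled = imputer.fit_transform(frame)
    return pd.DataFrame(filled, index=frame.index, columns=frame.columns)


# Columns are stores: each store is regressed on the other stores (same day).
by_store = impute(df)

# Columns are days: each day is regressed on the other days (same store).
# Transpose back afterwards so both results have the same layout.
by_day = impute(df.T).T

print(by_store.round(1))
print(by_day.round(1))
```

Note that in the transposed layout the estimator only sees three samples (one per store), so with this few stores it has very little to learn from.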
I guess one simple way to check whether I should transpose is to compare the RMSE of the two methods, but I wanted some explanation instead of "method A works better, move along."
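One way such an RMSE comparison could be set up is to hide known values one at a time, impute them, and measure the error. The sketch below is an assumption about how to do this, not an established recipe: the function names are hypothetical, and the leave-one-out scheme is chosen only because the example table is tiny.

```python
import warnings

import numpy as np
# IterativeImputer is experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Rows are days, columns are store_a, store_b, store_c (table from the question).
values = np.array(
    [
        [100, 200, 300],
        [110, 220, 290],
        [50, 110, 170],
        [np.nan, 220, 290],
        [7, 16, np.nan],
        [90, np.nan, np.nan],
    ],
    dtype=float,
)


def impute(matrix, transpose):
    """Impute with stores as columns (transpose=False) or days as columns."""
    X = matrix.T if transpose else matrix
    with warnings.catch_warnings():
        # Data this small can trigger convergence warnings; silence them here.
        warnings.simplefilter("ignore")
        filled = IterativeImputer(random_state=0).fit_transform(X)
    return filled.T if transpose else filled


def leave_one_out_rmse(matrix, transpose):
    """Hide each known cell in turn, impute it, and return the RMSE."""
    observed = ~np.isnan(matrix)
    errors = []
    for i, j in np.argwhere(observed):
        # Skip cells whose removal would leave a whole row or column empty,
        # because IterativeImputer drops columns with no observed values.
        if observed[i].sum() < 2 or observed[:, j].sum() < 2:
            continue
        masked = matrix.copy()
        masked[i, j] = np.nan
        errors.append(impute(masked, transpose)[i, j] - matrix[i, j])
    return float(np.sqrt(np.mean(np.square(errors))))


rmse_stores_as_columns = leave_one_out_rmse(values, transpose=False)
rmse_days_as_columns = leave_one_out_rmse(values, transpose=True)
print("stores as columns:", rmse_stores_as_columns)
print("days as columns:  ", rmse_days_as_columns)
```

With only six days and three stores the resulting numbers are very noisy, so on real data the same idea would be applied to a much larger sample of hidden cells.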
Just by eyeballing the data, I would think the number of visitors for store A on Jan 4th is around 110. On every day where both values are known, store A has approximately half as many visitors as store B, and store B had 220 visitors that day.
It seems that visitor counts are correlated across stores, so you could potentially use simple linear regression on the other stores to get a reasonable estimate for any given day.
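As a rough sketch of that idea, the snippet below fits store A against store B on the days where both are known and then predicts store A for Jan 4th. Using store B as the only predictor is an assumption made to mirror the eyeballed relationship; it is not the only reasonable choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Visitor counts from the question's table, one entry per day.
store_a = np.array([100, 110, 50, np.nan, 7, 90])
store_b = np.array([200, 220, 110, 220, 16, np.nan])

# Train only on days where both stores have a known value.
both_known = ~np.isnan(store_a) & ~np.isnan(store_b)
model = LinearRegression()
model.fit(store_b[both_known].reshape(-1, 1), store_a[both_known])

# Store B had 220 visitors on 2021-01-04, where store A is missing.
estimate = model.predict(np.array([[220.0]]))[0]
print("slope:", model.coef_[0])
print("estimated store_a on 2021-01-04:", estimate)
```

The fitted slope comes out close to 0.5, so the estimate lands near the eyeballed value of 110.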
Answered by Jayaram Iyer on August 8, 2021