Impute missing value: transpose or not?

Question

I'm building a model that fills in the missing values in a DataFrame containing the number of visitors to different stores on each day:

day         store_a  store_b  store_c
2021-01-01  100      200      300
2021-01-02  110      220      290
2021-01-03  50       110      170
2021-01-04  NaN      220      290
2021-01-05  7        16       NaN
2021-01-06  90       NaN      NaN

I'm using the IterativeImputer class from scikit-learn. At each step, this imputation method sets one column aside and trains an estimator on the other columns to predict it.
My question is: should I transpose my DataFrame or not?
If I keep my DataFrame as is (1 row = 1 day, 1 column = 1 store), I'm assuming that the number of visitors to a store on a particular day can be predicted just by looking at the other stores on that day.
But if I transpose it (1 row = 1 store, 1 column = 1 day), I'm assuming that the number of visitors on a given day can be predicted just by looking at that same store's history on the other days.
I guess one simple way to check whether I should transpose is to compare the RMSE of the two approaches, but I'd like an explanation rather than just "method A works better, move along."
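
For reference, the RMSE comparison mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name holdout_rmse, the leave-one-cell-out scheme, and the default IterativeImputer settings are choices made for the example, not part of the original setup. With only six rows the resulting numbers say little; the point is the mechanics of imputing in each orientation.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer

nan = np.nan
df = pd.DataFrame(
    {
        "store_a": [100, 110, 50, nan, 7, 90],
        "store_b": [200, 220, 110, 220, 16, nan],
        "store_c": [300, 290, 170, 290, nan, nan],
    },
    index=pd.date_range("2021-01-01", periods=6, name="day"),
)


def holdout_rmse(frame, transpose):
    """Hide each known cell in turn, impute it, and return the RMSE (hypothetical helper)."""
    values = frame.to_numpy(dtype=float)
    observed = ~np.isnan(values)
    squared_errors = []
    for i, j in zip(*np.nonzero(observed)):
        # Skip cells that are the only observation in their row or column:
        # hiding them would leave an empty feature in one of the orientations.
        if observed[i].sum() < 2 or observed[:, j].sum() < 2:
            continue
        masked = values.copy()
        masked[i, j] = np.nan
        data = masked.T if transpose else masked
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # tiny data triggers convergence warnings
            filled = IterativeImputer(random_state=0).fit_transform(data)
        if transpose:
            filled = filled.T  # back to (day, store) layout
        squared_errors.append((filled[i, j] - values[i, j]) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))


rmse_days_as_rows = holdout_rmse(df, transpose=False)   # columns = stores
rmse_stores_as_rows = holdout_rmse(df, transpose=True)  # columns = days
print(f"rows = days:   RMSE = {rmse_days_as_rows:.1f}")
print(f"rows = stores: RMSE = {rmse_stores_as_rows:.1f}")
```

On real data, the same held-out cells should be used for both orientations, as here, so that the two RMSE values are comparable.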

Jayaram Iyer · Answer

Just by eyeballing the data, I would think the number of visitors for store A on Jan 4th is around 110: store A's count is approximately half of store B's on every day shown.
It seems that the visitor counts across stores are correlated, so you could potentially use simple linear regression to get a reasonable estimate for any given day.

