TransWikia.com

Impute missing value: transpose or not?

Data Science Asked on August 8, 2021

I'm building a model that fills in the missing values of a DataFrame containing the number of visitors per day for several stores:

day         store_a  store_b  store_c
2021-01-01  100      200      300
2021-01-02  110      220      290
2021-01-03  50       110      170
2021-01-04  NaN      220      290
2021-01-05  7        16       NaN
2021-01-06  90       NaN      NaN

I'm using the IterativeImputer class from scikit-learn. At each step, this imputation method sets aside one column and trains an estimator on the other columns to predict the column that was set aside.
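To make the setup concrete, here is a minimal sketch of running IterativeImputer on the table above in its original orientation (1 line = 1 day). The variable names and random_state=0 are illustrative choices, and the estimator is left at the scikit-learn default.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still flagged experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The visitor table from the question: one row per day, one column per store.
df = pd.DataFrame(
    {
        "store_a": [100, 110, 50, np.nan, 7, 90],
        "store_b": [200, 220, 110, 220, 16, np.nan],
        "store_c": [300, 290, 170, 290, np.nan, np.nan],
    },
    index=pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03",
         "2021-01-04", "2021-01-05", "2021-01-06"]
    ),
)

# Each store column is predicted in turn from the other store columns.
imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(df), index=df.index, columns=df.columns
)
print(filled.round(1))
```

Only the NaN cells are replaced; the observed values come back unchanged.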

My question is: should I transpose my dataframe or not?

If I keep my dataframe as it is (1 line = 1 day, 1 column = 1 store), this assumes that the number of visitors in a store on a particular day can be predicted just by looking at the other stores on that day.

But if I transpose my dataframe (1 line = 1 store, 1 column = 1 day), this assumes that the number of visitors can be predicted just by looking at the history of that one store.

I guess one simple way to check whether I should transpose is to compare the RMSE of the two approaches, but I would like some explanation rather than just "method A works better, move along."
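A rough sketch of that RMSE check: hide some cells whose values are known, impute in each orientation, and score only the hidden cells. The helper holdout_rmse is hypothetical, and the data below is synthetic (a shared daily level times a per-store factor, plus noise), generated only so the example runs on its own; it is not the real visitor data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def holdout_rmse(values, hidden, transpose=False, seed=0):
    """RMSE on deliberately hidden cells for one orientation (hypothetical helper)."""
    masked = np.where(hidden, np.nan, values)      # hide the known cells
    work = masked.T if transpose else masked
    # Note: a column that ends up entirely NaN would be dropped by default,
    # so the hidden cells are chosen to avoid that case.
    imputed = IterativeImputer(random_state=seed).fit_transform(work)
    if transpose:
        imputed = imputed.T                        # back to days x stores
    err = imputed[hidden] - values[hidden]
    return float(np.sqrt(np.mean(err ** 2)))


# Synthetic stand-in data: 60 days, 3 stores (assumption, not the real table).
rng = np.random.default_rng(0)
n_days = 60
base = rng.uniform(50, 150, size=n_days)
values = base[:, None] * np.array([1.0, 2.0, 3.0]) + rng.normal(0, 5, (n_days, 3))

# Hide one store on each of 12 distinct days, so no day or store is fully hidden.
hidden = np.zeros(values.shape, dtype=bool)
days = rng.choice(n_days, size=12, replace=False)
hidden[days, rng.integers(0, 3, size=12)] = True

rmse_days_as_rows = holdout_rmse(values, hidden, transpose=False)
rmse_stores_as_rows = holdout_rmse(values, hidden, transpose=True)
print("1 line = 1 day  :", rmse_days_as_rows)
print("1 line = 1 store:", rmse_stores_as_rows)
```

With only a handful of stores, the transposed version trains each estimator on very few rows and many columns, which is worth keeping in mind when reading the two numbers.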

One Answer

Just by eyeballing the data, I would estimate the number of visitors for store A on Jan 4th at around 110. On every day where both are observed, store A has approximately half as many visitors as store B.

It seems that the visitor counts are correlated across stores, so you could potentially use simple linear regression on the other stores to get a reasonable estimate for any given day.
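As a quick sketch of that idea, the snippet below fits a straight line of store A against store B on the four days where both are observed, then uses it to estimate the missing Jan 4th value. np.polyfit is just one convenient way to do a simple linear regression here, and the variable names are illustrative.

```python
import numpy as np

# Days where both store A and store B are observed (Jan 1, 2, 3 and 5).
store_b = np.array([200.0, 220.0, 110.0, 16.0])
store_a = np.array([100.0, 110.0, 50.0, 7.0])

# Degree-1 fit: store_a is approximated by slope * store_b + intercept.
slope, intercept = np.polyfit(store_b, store_a, deg=1)

# Store B had 220 visitors on Jan 4th, the day store A is missing.
estimate = slope * 220 + intercept
print(f"slope={slope:.2f}, estimate for store A on Jan 4th={estimate:.0f}")
```

The fitted slope comes out close to 0.5, which matches the "half of store B" observation, and the estimate lands close to the eyeballed 110.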

Answered by Jayaram Iyer on August 8, 2021
