Code Review
Asked by lcrmorin on January 10, 2021
I have a large pandas DataFrame and I am trying to add a column to it containing the fraction of missing values in each row. I have inherited some code that works, but I'd like to reduce memory usage by removing intermediate variables.
Here is a toy example:
import numpy as np
import pandas as pd

students = [('jack',  np.nan, 'Sydney',    'Australia'),
            ('Riti',  np.nan, 'Delhi',     'India'),
            ('Vikas', 31,     np.nan,      'India'),
            ('Neelu', 32,     'Bangalore', 'India'),
            ('John',  16,     'New York',  'US'),
            ('John',  11,     np.nan,      np.nan),
            (np.nan,  np.nan, np.nan,      np.nan)]
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'])
And the code I inherited:
print('NanCounter -> transform')
nan_count = pd.DataFrame(data=np.mean(dfObj.isna().values, axis=1).astype('float32'),
                         columns=['nan_count']).set_index(dfObj.index)
X_ = pd.concat([dfObj, nan_count], axis=1)
X_.set_index(dfObj.index, inplace=True)
This seems like quite a convoluted way to just write:
print('NanCounter -> transform')
dfObj['nan_count'] = np.mean(dfObj.isna().values, axis=1).astype('float32')
Plus, it seems to consume more memory. I am concerned that I am missing something about the calculations. Are those expressions equivalent? Namely, what would be the benefit of working with an extra intermediate variable?
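One quick way to check the memory claim (a minimal sketch added for illustration, not part of the original question; it reuses the variables created by the inherited snippet) is to print each object's footprint:

# rough per-object footprint in bytes; deep=True also counts the Python
# string objects stored in the object-dtype columns
for name, obj in [('dfObj', dfObj), ('nan_count', nan_count), ('X_', X_)]:
    print(name, obj.memory_usage(deep=True).sum())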
The difference between the two code snippets is that the first creates a new DataFrame while the second adds the column to the original DataFrame in place; the sanity-check sketch after the simplified snippet below demonstrates this.
The first snippet can be simplified to
X_ = dfObj.assign(nan_count=dfObj.isna().mean(axis=1).astype('float32'))
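For completeness, here is a small sanity check (a sketch added for illustration, not part of the original answer) showing that all three variants produce identical values, and that only the direct column assignment mutates dfObj:

before = dfObj.copy()

# inherited version: builds an intermediate DataFrame, then concatenates
nan_count = pd.DataFrame(data=np.mean(dfObj.isna().values, axis=1).astype('float32'),
                         columns=['nan_count']).set_index(dfObj.index)
X_concat = pd.concat([dfObj, nan_count], axis=1)

# assign() also returns a new DataFrame and leaves dfObj untouched
X_assign = dfObj.assign(nan_count=dfObj.isna().mean(axis=1).astype('float32'))

assert X_concat['nan_count'].equals(X_assign['nan_count'])  # same values
assert dfObj.equals(before)                                 # dfObj unchanged so far

# the in-place version adds the column directly to dfObj
dfObj['nan_count'] = dfObj.isna().mean(axis=1).astype('float32')
assert not dfObj.equals(before)                             # dfObj is now mutated
assert dfObj['nan_count'].equals(X_concat['nan_count'])     # but with the same values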
Answered by GZ0 on January 10, 2021