Code Review
Asked by lcrmorin on January 10, 2021
I have a large pandas DataFrame and I am trying to add a column to it containing the fraction of missing values in each row. I have inherited some code that works, but I'd like to reduce memory usage by removing intermediate variables.
Here is a toy example:
import numpy as np
import pandas as pd

students = [('jack',  np.nan, 'Sydney',    'Australia'),
            ('Riti',  np.nan, 'Delhi',     'India'),
            ('Vikas', 31,     np.nan,      'India'),
            ('Neelu', 32,     'Bangalore', 'India'),
            ('John',  16,     'New York',  'US'),
            ('John',  11,     np.nan,      np.nan),
            (np.nan,  np.nan, np.nan,      np.nan)]
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Country'])
And the code I inherited:
print('NanCounter -> transform')
nan_count = pd.DataFrame(data=np.mean(dfObj.isna().values, axis=1).astype('float32'),
                         columns=['nan_count']).set_index(dfObj.index)
X_ = pd.concat([dfObj, nan_count], axis=1)
X_.set_index(dfObj.index, inplace=True)
This seems like quite a convoluted way to just write:
print('NanCounter -> transform')
dfObj['nan_count'] = np.mean(dfObj.isna().values, axis=1).astype('float32')
Plus, it seems to consume more memory. I am concerned that I am missing something about the calculations. Are those expressions equivalent? Namely, what would be the benefit of working with an extra intermediate variable?
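One quick way to check the memory claim (a minimal sketch added for illustration, not part of the original question; it reuses the variables created by the inherited snippet) is to print each object's footprint:

# rough per-object footprint in bytes; deep=True also counts the Python
# string objects stored in the object-dtype columns
for name, obj in [('dfObj', dfObj), ('nan_count', nan_count), ('X_', X_)]:
    print(name, obj.memory_usage(deep=True).sum())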
The difference between the two code snippets is that the first creates a new DataFrame while the second adds the column to the original DataFrame in place; the sanity-check sketch after the simplified snippet below demonstrates this.
The first snippet can be simplified to
X_ = dfObj.assign(nan_count=dfObj.isna().mean(axis=1).astype('float32'))
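For completeness, here is a small sanity check (a sketch added for illustration, not part of the original answer) showing that all three variants produce identical values, and that only the direct column assignment mutates dfObj:

before = dfObj.copy()

# inherited version: builds an intermediate DataFrame, then concatenates
nan_count = pd.DataFrame(data=np.mean(dfObj.isna().values, axis=1).astype('float32'),
                         columns=['nan_count']).set_index(dfObj.index)
X_concat = pd.concat([dfObj, nan_count], axis=1)

# assign() also returns a new DataFrame and leaves dfObj untouched
X_assign = dfObj.assign(nan_count=dfObj.isna().mean(axis=1).astype('float32'))

assert X_concat['nan_count'].equals(X_assign['nan_count'])  # same values
assert dfObj.equals(before)                                 # dfObj unchanged so far

# the in-place version adds the column directly to dfObj
dfObj['nan_count'] = dfObj.isna().mean(axis=1).astype('float32')
assert not dfObj.equals(before)                             # dfObj is now mutated
assert dfObj['nan_count'].equals(X_concat['nan_count'])     # but with the same values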
Answered by GZ0 on January 10, 2021