Geospatial way to optimise cluster entropy calculation per LSOA (Polygon)

Question

I've been trying to compute the entropy values below after spatially joining the LSOA codes to a pandas DataFrame (starting from GeoPandas). My naive for-loop approach is very slow for the 30000 LSOAs in England and Wales.
Maybe a geospatial operation (Python, QGIS, etc.) would be more appropriate than trying to find a groupby solution?
My dataset looks like this, and I'm trying to find the entropy per 'lsoa11cd' area (a UK-specific geospatial code representing an area/polygon).
The k_means_5 column holds the cluster labels, with k = 5.
    test01[['k_means_5','lsoa11cd']].head(10)
    k_means_5   lsoa11cd
0   1   E01019240
1   1   E01019240
2   1   E01019238
3   1   E01019240
4   1   E01019240
5   1   E01019240
6   1   E01019316
7   1   E01019316
8   1   E01019316
9   1   E01019316

I can get the entropy for a single LSOA with the clumsy (possibly incorrect?) lines below, but I would like to do it more efficiently, as it would take 10 days to iterate over the 'lsoa11cd' values with a for loop.
len(test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5'])
51
test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5'].value_counts()
1    40
2     6
0     5
Name: k_means_5, dtype: int64
test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5'].value_counts()/len(test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5'])
1    0.784314
2    0.117647
0    0.098039
Name: k_means_5, dtype: float64
    
from scipy.stats import entropy
test01.loc[test01['lsoa11cd'] == 'E01019238', 'entropy_k_means_5'] = entropy(test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5'].value_counts()/len(test01.loc[test01['lsoa11cd'] == 'E01019238']['k_means_5']), base=5)

I have tried a groupby approach, but of course the last step can't work given the shapes of the objects I'm passing in. Any expert advice?
g1 = test01.groupby('lsoa11cd')['k_means_5'].transform('count')
0          172
1          172
2           51
3          172
4          172
          ... 
1758295     70
1758296     59
1758297     87
1758298     87
1758299    122
Name: k_means_5, Length: 1758300, dtype: int64

g2 = test01.groupby('lsoa11cd')['k_means_5'].value_counts()
lsoa11cd   k_means_5
E01000001  4                             17
           3                              9
E01000002  3                             24
           4                             22
E01000003  4                             13
                                         ..
W01001956  0                             42
           4                              3
W01001957  3                             23
           4                              9
W01001958  4                              9
Name: k_means_5, Length: 64908, dtype: int64

entropy(g2/g1, base=5)  # Nope! g1 (one value per row) and g2 (one value per LSOA/cluster pair) have different lengths and indexes
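
For reference, here is a minimal sketch of how the per-LSOA calculation might be done without a loop. It is not a tested solution on the full dataset. It assumes the cluster counts can be pivoted into a wide table (one row per LSOA, one column per cluster label) with pd.crosstab, and that the installed SciPy is recent enough for scipy.stats.entropy to accept an axis argument. The small test01 frame and the lsoa_entropy name are hypothetical stand-ins for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

# Hypothetical toy stand-in for test01, using the same column names as above.
# E01019238 reuses the counts shown earlier (40, 6 and 5 rows for labels 1, 2, 0).
test01 = pd.DataFrame({
    "k_means_5": [1] * 40 + [2] * 6 + [0] * 5 + [3, 3, 4, 4],
    "lsoa11cd": ["E01019238"] * 51 + ["E01019240"] * 4,
})

# Wide table of counts: one row per LSOA, one column per cluster label.
# Labels that never occur in an LSOA get a count of 0.
counts = pd.crosstab(test01["lsoa11cd"], test01["k_means_5"])

# entropy() normalises each row to sum to 1, and zero counts contribute 0,
# so the raw counts can be passed directly. axis=1 gives one value per LSOA.
lsoa_entropy = pd.Series(
    entropy(counts.to_numpy(), base=5, axis=1),
    index=counts.index,
    name="entropy_k_means_5",
)

# Broadcast the per-LSOA value back onto every row of the original frame.
test01["entropy_k_means_5"] = test01["lsoa11cd"].map(lsoa_entropy)

print(lsoa_entropy)
```

The idea is that the division by the group size is no longer needed, because entropy() normalises the counts itself, and the shape mismatch between g1 and g2 goes away once the counts are arranged as one row per LSOA.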
