TransWikia.com

Multiprocess Spatial Join with Spatial Index

Geographic Information Systems Asked by rolo on July 22, 2021

I’m new to GIS in Python, but the lack of any GIS software at the office has forced me to learn it.

I have a huge grid (a multipolygon) that I’m filling with counts of points using a spatial index. On a subset of the grid the code works just fine, but on anything bigger it takes forever.
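Conceptually, the r-tree query below does a cheap bounding-box prefilter first, then the exact `intersects` test only on the surviving candidates. A rough stdlib-only analogue of that two-pass idea (toy points and a unit cell, not the real `points.sindex`):

```python
# Toy analogue of the r-tree workflow: cheap bounding-box prefilter,
# then an exact containment test on the surviving candidates.
points = [(0.5, 0.5), (1.5, 0.2), (2.5, 2.5), (0.9, 0.1)]

def bbox_prefilter(pts, bounds):
    """Approximate matches: keep points whose coords fall inside the bbox."""
    minx, miny, maxx, maxy = bounds
    return [p for p in pts if minx <= p[0] <= maxx and miny <= p[1] <= maxy]

def in_unit_cell(p, x0, y0):
    """Precise match: point inside the 1x1 cell anchored at (x0, y0)."""
    return x0 <= p[0] < x0 + 1 and y0 <= p[1] < y0 + 1

candidates = bbox_prefilter(points, (0, 0, 1, 1))           # cheap pass
matches = [p for p in candidates if in_unit_cell(p, 0, 0)]  # exact pass
print(len(matches))  # 2
```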

This is the code I’m currently using:

import numpy as np
import pandas as pd
from shapely.geometry import Polygon
from IPython.display import clear_output

# build the r-tree index on the points GeoDataFrame
sindex = points.sindex

# find the points that intersect with each subpolygon and add them to points_within_geometry
points_within_grid = pd.DataFrame()

# grid-level dataset
grid_data = pd.DataFrame()

i=1
for poly in grid:
    clear_output(wait=True)

    # buffer by the <1 micron dist to account for any space lost in the quadrat cutting
    # otherwise may miss point(s) that lay directly on quadrat line
    poly = poly.buffer(1e-14).buffer(0)

    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(poly.bounds))
    possible_matches = points.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(poly)]

    # count number of firms within polygon
    x=pd.DataFrame({'grid_index': i,
                    'geometry':[Polygon(poly.envelope)],
                    'n_firms': len(precise_matches)})

    for poi in range(1,9):
        x['sic_'+str(poi)]=len(precise_matches[precise_matches.sic1==poi])

    grid_data = pd.concat([grid_data, x])

    # points within the grid
    points_within_grid = pd.concat([points_within_grid, precise_matches])
    print("Current Progress: ", np.round(i/len(grid)*100,0),"%")

    i=i+1


# drop duplicate points, if buffered poly caused an overlap on point(s) that lay directly on a quadrat line
points_within_grid = points_within_grid.drop_duplicates(subset=['x', 'y','poi_id'])
points_outside_grid = points[~points.poi_id.isin(points_within_grid.poi_id)]

So, I’ve been thinking about using multiprocessing.Pool to make it faster. I’ve been reading online but don’t really know how to do it. The first step would be to define a function that can be passed to the Pool.

So far I have this function:

def grid_inter(poly):
    # buffer by the <1 micron dist to account for any space lost in the quadrat cutting
    # otherwise may miss point(s) that lay directly on quadrat line
    poly = poly.buffer(1e-14).buffer(0)

    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(poly.bounds))
    possible_matches = points.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(poly)]

    # count number of firms within polygon
    x=pd.DataFrame({'grid_index': i,  # note: i from the loop is not defined inside this function
                    'geometry':[Polygon(poly.envelope)],
                    'n_firms': len(precise_matches)})

    for poi in range(1,9):
        x['sic_'+str(poi)]=len(precise_matches[precise_matches.sic1==poi])
    return x

But I’m stuck on the next step, which I think should look something like this:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

results = pool.map(grid_inter, [poly for poly in grid])

pool.close()

How can I make this work?

One Answer

This is what I ended up doing. It creates a count of firms with certain characteristics within each grid cell, and an ID based on the centroid of the cell being filled in.

import time
from multiprocessing import Pool, cpu_count

import pandas as pd
from shapely.geometry import Polygon

def grid_inter(poly):
    # initialize one row of grid-level data
    x=pd.DataFrame({'grid_index_x': poly.centroid.x,
                    'grid_index_y': poly.centroid.y,
                    'geometry':[Polygon(poly.envelope)]})

    # buffer by the <1 micron dist to account for any space lost in the quadrat cutting
    # otherwise may miss point(s) that lay directly on quadrat line
    poly = poly.buffer(1e-9).buffer(0)

    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(poly.bounds))
    possible_matches = points.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(poly)]

    # count number of firms within polygon
    x['n_firms']=len(precise_matches)

    for amenity in points.amenity.unique():
        x['poi_'+str(amenity)]=len(precise_matches[precise_matches.amenity==amenity])

    for poi in points.naics2.unique():
        x['NAICS_'+str(int(poi))]=len(precise_matches[precise_matches.naics2==poi])

    return x

start = time.time()
sindex = points.sindex
pool = Pool(cpu_count()-1)
grid_poi_data = pool.map(grid_inter,grid)
pool.close()
pool.join()
grid_poi_data = pd.concat(grid_poi_data)

end = time.time()
print(end - start)

Thanks for the previous comments. They helped a lot.

Answered by rolo on July 22, 2021
