GeoPandas distance optimisation

Question

I have created the following function that measures the distance from one point in a GDF to all points in another GDF and returns a table back with the shortest distance for each point. It works well for one point however I neglected the fact I have a table of 4000 points and so it takes 10 mins. I have ran it in PostGIS and can get it down to less than a second. Is there a way to do this in Python that could match the PostGIS speed?
def get_distance_to(gdf_in, aoi_df, aoi):
   dist_df_list = list()
   for row in range(len(gdf_in)):
      single_row = gdf_in.iloc[row]
      distances = aoi_df.geometry.distance(single_row.geometry)
      dist_list = distances.to_list()
      closest_aoi = min(dist_list)
      single_row["dist_to_"+aoi] = closest_aoi
      df = single_row.to_frame().T
      dist_df_list.append(df)
completed_distances = pd.concat(dist_df_list, ignore_index=True, sort=False)
return completed_distances

my input tables looks something like this

and the output table looks like this

gene · Answer

Iterate over rows in a (Geo)DataFrame in (Geo)Pandas is very slow, see Optimum approach for iterating over a DataFrame for example

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. (How to iterate over rows in a DataFrame in Pandas)

You can try to use (Geo)DataFrame.apply() and shapely:nearest point as in  GeoPandas: Find nearest point in other dataframe without for iteration (see comment)

import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import nearest_points
gpd1 = gpd.read_file("point1.shp") # red points
gpd2 = gpd.read_file("point2.shp") # blue points
pts3 = gpd2.geometry.unary_union
 def near(point, pts=pts3):
     # find the nearest point and return the corresponding value
     nearest = gpd2.geometry == nearest_points(point, pts)[1]
     return gpd2[nearest].id.values[0],gpd2[nearest].geometry.values[0]
 gpd1['Nearest'] = gpd1.apply(lambda row: near(row.geometry)[0], axis=1)
 gpd1['geom2'] = gpd1.apply(lambda row: near(row.geometry)[1], axis=1)
 print(gpd1)
         id         geometry         Nearest        geom2
 0   1   POINT (-0.99013 0.48096)      3     POINT (-0.77574 0.64739)
 1   2   POINT (-1.00987 0.08039)      4     POINT (-0.73060 0.10860)
 2   3   POINT (-0.71932 -0.13117)     5     POINT (-0.57827 -0.08039)
 3   4   POINT (-0.90268 -0.28914)     5     POINT (-0.57827 -0.08039)

Compute the distance
gpd1['distance'] = gpd1.apply(lambda row: row.geometry.distance(row.geom2), axis=1)
gpd1.drop('geom2', axis=1, inplace=True)
print(gpd1)
    id         geometry             Nearest     distance
 0   1   POINT (-0.99013 0.48096)      3        0.271406
 1   2   POINT (-1.00987 0.08039)      4        0.280688
 2   3   POINT (-0.71932 -0.13117)     5        0.149905
 3   4   POINT (-0.90268 -0.28914)     5        0.385759

martinfleis · Answer

For any spatial operation of this kind, you should always try to use spatial index. If you are interested in the minimal distance only, the following should give you relatively performant option.
import geopandas as gpd
from shapely.geometry import Point
import pandas as pd
import random

gdf = gpd.GeoDataFrame(geometry=[Point(random.randint(0, 1000), random.randint(0, 1000)) for _ in range(1000)])
gdf2 = gpd.GeoDataFrame(geometry=[Point(random.randint(0, 1000), random.randint(0, 1000)) for _ in range(1000)])

def get_nearest_distance(left, right, initial_buffer):
    """get distance from left to right"""
    buffered = left.buffer(initial_buffer)

distances = []
    for i in range(len(buffered)):
        geom = buffered.geometry.iloc[i]
        query = right.sindex.query(geom)
        while query.size == 0:
            query = right.sindex.query(geom.buffer(b))
            b += initial_buffer
        distances.append(right.iloc[query].distance(left.geometry.iloc[i]).min())

return pd.Series(distances, index=left.index)

gdf['distance_to_x'] = get_nearest_distance(gdf, gdf2, 50)

For 1000 to 1000 points, it is less than a second, compared to about a minute @gene's code takes.
To make it efficient, you should guess the initial_buffer distance in which you think will be only a few points. If none is in, then it expands buffer until it hits some.
Generally, if you want the best performance from GeoPandas, you should use the latest version (this code requires 0.8) and optional dependency pygeos (https://geopandas.readthedocs.io/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency), which can speed up the code above by the order of magnitude.

GeoPandas distance optimisation

2 Answers

Add your own answers!

Ask a Question