Nearest Neighbor

Question

We've got a table of approximately 300,000 points and would like to find the nearest neighbor (within 10km) to each point in the same table.  The most efficient query we've found so far is below, but it's slow enough that it's basically unworkable ... hasn't run to completion after multiple hours.  Geometry is EPSG 4326, PostGIS 2.1.8.

SELECT DISTINCT ON(p1.id) p1.id AS p1_id,
p2.id AS p2_id, ST_Distance_Sphere(p1.geom, p2.geom)
FROM points p1, points p2   
WHERE p1.id <> p2.id AND ST_DWithin(p1.geom, p2.geom, 10000)   
ORDER BY p1.id, ST_Distance_Sphere(p1.geom, p2.geom);

We've also tried a couple of other options including using the PostGIS KNN operator and looping through points individually using a cursor function.

Is there a more efficient query we can use, or maybe impoved nearest neighbor support in newer versions of PostGIS?

Losbaltica · Answer

Try to use <-> function with 2D index on geometry.

CREATE INDEX points_geom ON points USING GIST(geom)

After that you can do it something like that if you want to find only the closest point to another point (one to one relation)

SELECT
  first_point.*,
  second_points.*,
  st_distance(first_point.geom, second_points.geom) distance
FROM (SELECT
        p1.id     as p1_id,
        (SELECT p2.id
         FROM points p2
         WHERE p1.id <> p2.id AND ST_DWithin(p1.geom, p2.geom, 10000)
         ORDER BY p1.geom <-> p2.geom
         LIMIT 1) AS p2_id
      FROM points p1) knn
  LEFT JOIN points first_point ON first_point.id = knn.p1_id
  LEFT JOIN points second_points ON second_points.id = knn.p1_id;

If you are strugling with performance add EXPLAIN at the beginning to check if your indexes are in use.

geozelot · Answer

This is essentially a duplicate question of multiple others, with the sole difference being a table self-join.

However, all queries currently present in this post have delicate CRS misunderstandings, at least when it comes to distances:

the main problem here is the threshold given to ST_DWithin; the units of that value are CRS dependent, thus, as the data is in EPSG:4236, you are searching in a radius of 10000 degrees!
follow-up problem is the actual distance a degree represents in suface distance; one degree of Longitude does not represent the same suface distance over different Latitudes!

Arguably the best way to realize true KNN searches uses the LATERAL JOIN in conjunction with the <-> KNN operator, and, optionally but likely not required, a limiting radius filter (e.g. ST_DWithin or ST_Expand + && BBox comparator).
Concerning the units, one could choose a on-the-fly cast to geography type to tackle the CRS/distance issues and get the most precise distances in one go, using speroidal (or, quicker, spherical) algebra.

Runing

SELECT p1.id AS p1_id,
       p2.id AS p2_id,
       ST_Distance(p1.geom::geography, p2.geom::geography) AS dist
FROM   points AS p1
CROSS JOIN LATERAL (
  SELECT id,
         geom
  FROM   points
  WHERE  p1.id <> id
 -- AND  ST_DWithin(p1.geom::geography, geom::geography, 10000)
  ORDER BY
         p1.geom::geography <-> geom::geography
  LIMIT  1
) AS p2
;

will return dist in meter to the nearest neighbor p2.id for each p1.id.
Note: due to the cast to geography, units for any accepting function will be in meter as well, thus the 10000 again.

As already mentioned, it is essential that you have a proper index in place! Checking the EXPLAIN ANALYZE is crucial to find out if it is actually used (although you can tell it is if you get results within your lifetime I guess...), and running VACUUM ANALYZE <table_name> in advance can help to enforce its use.

Now, the liberal use of the on-the-fly cast to geography will take a heavy toll on execution speed. I´d recommend to either project the data to a suitable projection for distance measurements of your area, or, possibly better, change the geometry type to geography; both can be achieved by adding a new column, if you want your original geoms to stay untouched, and add an own index to it.

Using test data on 70.000 points with porperly indexed geography column (having added a second one) took about 1 min. to complete the initial, uncached run on a mid tech setup.

Nearest Neighbor

2 Answers

Add your own answers!

Ask a Question