Geographic Information Systems Asked by cokrzys on June 15, 2021
We’ve got a table of approximately 300,000 points and would like to find the nearest neighbor (within 10km) to each point in the same table. The most efficient query we’ve found so far is below, but it’s slow enough that it’s basically unworkable … hasn’t run to completion after multiple hours. Geometry is EPSG 4326, PostGIS 2.1.8.
SELECT DISTINCT ON(p1.id) p1.id AS p1_id,
p2.id AS p2_id, ST_Distance_Sphere(p1.geom, p2.geom)
FROM points p1, points p2
WHERE p1.id <> p2.id AND ST_DWithin(p1.geom, p2.geom, 10000)
ORDER BY p1.id, ST_Distance_Sphere(p1.geom, p2.geom);
We’ve also tried a couple of other options including using the PostGIS KNN operator and looping through points individually using a cursor function.
Is there a more efficient query we can use, or maybe impoved nearest neighbor support in newer versions of PostGIS?
Try to use <-> function with 2D index on geometry.
CREATE INDEX points_geom ON points USING GIST(geom)
After that you can do it something like that if you want to find only the closest point to another point (one to one relation)
SELECT
first_point.*,
second_points.*,
st_distance(first_point.geom, second_points.geom) distance
FROM (SELECT
p1.id as p1_id,
(SELECT p2.id
FROM points p2
WHERE p1.id <> p2.id AND ST_DWithin(p1.geom, p2.geom, 10000)
ORDER BY p1.geom <-> p2.geom
LIMIT 1) AS p2_id
FROM points p1) knn
LEFT JOIN points first_point ON first_point.id = knn.p1_id
LEFT JOIN points second_points ON second_points.id = knn.p1_id;
If you are strugling with performance add EXPLAIN at the beginning to check if your indexes are in use.
Answered by Losbaltica on June 15, 2021
This is essentially a duplicate question of multiple others, with the sole difference being a table self-join.
However, all queries currently present in this post have delicate CRS misunderstandings, at least when it comes to distances:
ST_DWithin
; the units of that value are CRS dependent, thus, as the data is in EPSG:4236, you are searching in a radius of 10000 degrees!Arguably the best way to realize true KNN searches uses the LATERAL JOIN
in conjunction with the <->
KNN operator, and, optionally but likely not required, a limiting radius filter (e.g. ST_DWithin
or ST_Expand
+ &&
BBox comparator).
Concerning the units, one could choose a on-the-fly cast to geography type to tackle the CRS/distance issues and get the most precise distances in one go, using speroidal (or, quicker, spherical) algebra.
Runing
SELECT p1.id AS p1_id,
p2.id AS p2_id,
ST_Distance(p1.geom::geography, p2.geom::geography) AS dist
FROM points AS p1
CROSS JOIN LATERAL (
SELECT id,
geom
FROM points
WHERE p1.id <> id
-- AND ST_DWithin(p1.geom::geography, geom::geography, 10000)
ORDER BY
p1.geom::geography <-> geom::geography
LIMIT 1
) AS p2
;
will return dist
in meter to the nearest neighbor p2.id
for each p1.id
.
Note: due to the cast to geography, units for any accepting function will be in meter as well, thus the 10000
again.
As already mentioned, it is essential that you have a proper index in place! Checking the EXPLAIN ANALYZE
is crucial to find out if it is actually used (although you can tell it is if you get results within your lifetime I guess...), and running VACUUM ANALYZE <table_name>
in advance can help to enforce its use.
Now, the liberal use of the on-the-fly cast to geography will take a heavy toll on execution speed. I´d recommend to either project the data to a suitable projection for distance measurements of your area, or, possibly better, change the geometry type to geography; both can be achieved by adding a new column, if you want your original geom
s to stay untouched, and add an own index to it.
Using test data on 70.000 points with porperly indexed geography column (having added a second one) took about 1 min. to complete the initial, uncached run on a mid tech setup.
Answered by geozelot on June 15, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP