KNN with "Big Data"

Geographic Information Systems Asked by tjmgis on February 5, 2021

I have a point dataset of over 30 million records and I am trying to select the 5 closest lines to each point. From reading this post, there seem to be two approaches:

  1. Using the <-> distance operator. Because I want the 5 closest lines, I used the following:
select p.id, p.code, array_agg(b.line_id) AS line_ids
from points p
CROSS JOIN LATERAL (
    SELECT p.id AS point_id, l.id AS line_id
    FROM lines l
    ORDER BY p.geom <-> l.geom LIMIT 5
) AS b
where p.id = b.point_id
group by p.id, p.code

This ran for a few hours and then failed with a memory allocation error (memory usage was above 24 GB of RAM).

  2. Using ST_DWithin to limit the search to 500 m:
select p.id, p.code, array_agg(b.line_id) AS line_ids
from points p
CROSS JOIN LATERAL (
    SELECT p.id AS point_id, l.id AS line_id
    FROM lines l
    WHERE ST_DWithin(p.geom, l.geom, 500)
    ORDER BY p.geom <-> l.geom LIMIT 5
) AS b
where p.id = b.point_id
group by p.id, p.code

All columns have an index, and the geometry columns have a spatial index. Are there any better methods, or is there something I am missing, that would let this query run successfully?

Using PostgreSQL 12, PostGIS 3, 64-bit Windows, 32 GB of RAM

Explain for first method

Nested Loop  (cost=0.41..433780457.02 rows=5 width=24)
  ->  Seq Scan on points p  (cost=0.00..1643568.42 rows=30634442 width=48)
  ->  Subquery Scan on b  (cost=0.41..14.10 rows=1 width=16)
        Filter: (p.id = b.id)
        ->  Limit  (cost=0.41..14.03 rows=5 width=24)
              ->  Index Scan using geometry_geom_idx on open_usrn c  (cost=0.41..3997260.47 rows=1467289 width=24)
                    Order By: (geometry <-> a.geom)

Explain for second method

GroupAggregate  (cost=5244082420.90..5244082421.01 rows=5 width=48)
  Group Key: p.id, b.id
  ->  Sort  (cost=5244082420.90..5244082420.91 rows=5 width=24)
        Sort Key: p.id, p.code
        ->  Nested Loop  (cost=0.66..5244082420.84 rows=5 width=24)
              ->  Seq Scan on points a  (cost=0.00..1797087.20 rows=33052020 width=48)
              ->  Subquery Scan on b  (cost=0.66..158.60 rows=1 width=16)
                    Filter: (p.id = b.id)
                    ->  Limit  (cost=0.66..158.53 rows=5 width=24)
                          ->  Index Scan using geometry_geom_idx on open_usrn c  (cost=0.66..4642.10 rows=147 width=24)
                                Index Cond: (geometry && st_expand(a.geom, '500'::double precision))
                                Order By: (geometry <-> a.geom)
                                Filter: st_dwithin(a.geom, geometry, '500'::double precision)
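
Both plans run the inner index scan once per point across a sequential scan of the whole points table, so the full result is built in a single statement. One thing that might be worth trying, purely as a sketch and not a tested fix, is to run the same lateral query over one slice of point ids at a time and write each slice to a results table. The table name point_nearest_lines, the bigint column types, and the id range boundaries below are all assumptions for illustration; it also assumes that points.id is numeric.

```sql
-- Hypothetical results table: the name and the bigint types are assumptions.
CREATE TABLE IF NOT EXISTS point_nearest_lines (
    point_id bigint PRIMARY KEY,
    line_ids bigint[]
);

-- Process one slice of points per statement.
-- The id range is an illustrative assumption; repeat with the next range.
INSERT INTO point_nearest_lines (point_id, line_ids)
SELECT p.id, array_agg(b.line_id ORDER BY b.dist)
FROM points p
CROSS JOIN LATERAL (
    -- Up to 5 nearest lines within 500 units of this point, nearest first.
    SELECT l.id AS line_id, p.geom <-> l.geom AS dist
    FROM lines l
    WHERE ST_DWithin(p.geom, l.geom, 500)
    ORDER BY p.geom <-> l.geom
    LIMIT 5
) AS b
WHERE p.id >= 1 AND p.id < 1000001
GROUP BY p.id;
```

Points with no line within the 500 limit produce no row in this sketch, because CROSS JOIN LATERAL drops outer rows whose subquery returns nothing. Whether slicing actually avoids the memory allocation failure would need to be checked against the real data.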
