
ST_DWithin exponentially slow. Cannot find what I am doing wrong

Geographic Information Systems Asked by Anne-Sophie on March 17, 2021

  • PostGIS version: 3.1
  • PostgreSQL version: 12.3
  • The machine I am working with has: 126G RAM, 48 CPU cores

Info:

I am getting started with PostGIS.

My goal is to get all the rows whose two points fall within a given distance of each other.

vf.geopoint and sub.geog are both GEOGRAPHY points (SRID 4326) and have GIST indexes on them.

My sub-SELECT returns about 3k rows; my ‘valeurs_foncieres’ table, however, has 14,000,000 rows.

I do have BTREE indexes on valeurs_foncieres.id, caracteristiques_2018.id, caracteristiques_2018.num_acc, usagers_2018.id, usagers_2018.num_acc, vehicules_2018.id, vehicules_2018.num_acc.
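
The index DDL behind this setup looks roughly like the following (partly reconstructed: the two index names match the EXPLAIN further down, while the GIST index on caracteristiques_2018.geog is an assumption based on the query):

-- Spatial (GIST) indexes on the geography columns
CREATE INDEX valeurs_foncieres_geopoint_idx
    ON valeurs_foncieres USING GIST (geopoint);
CREATE INDEX caracteristiques_2018_geog_idx
    ON caracteristiques_2018 USING GIST (geog);
-- One of the BTREE indexes used by the joins
CREATE INDEX vehicules_2018_num_acc_idx
    ON vehicules_2018 (num_acc);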

The problem:

The query gets exponentially slower as I increase the distance used in ST_DWithin.

  • Distance 100: 2 seconds
  • Distance 1 000: 10 seconds
  • Distance 10 000: 6 minutes

Here is the query:

SELECT
    DISTINCT(sub.num_acc),
    sub.geopoint,
    sub.id
FROM
    (
    SELECT
        DISTINCT(u.num_acc) AS unumacc, c.*
    FROM
        usagers_2018 u
    INNER JOIN vehicules_2018 v ON
        u.num_acc = v.num_acc
    INNER JOIN caracteristiques_2018 c ON
        u.num_acc = c.num_acc
    WHERE
        u.grav = '2'
    ORDER BY
        c.id
) AS sub
INNER JOIN valeurs_foncieres vf ON
    ST_DWithin(vf.geopoint,
    sub.geog,
    1000,
    FALSE);

Here is the EXPLAIN:

HashAggregate  (cost=265577998.10..265578004.81 rows=671 width=49)
  Group Key: c.num_acc, c.geopoint, c.id
  ->  Nested Loop  (cost=9948.38..264845621.97 rows=97650150 width=49)
        ->  Unique  (cost=9947.84..10316.67 rows=6706 width=170)
              ->  Sort  (cost=9947.84..9964.60 rows=6706 width=170)
                    Sort Key: c.id, u.num_acc, c.an, c.mois, c.jour, c.hrmn, c.lum, c.agg, c."int", c.atm, c.col, c.com, c.adr, c.gps, c.lat, c.long, c.dep, c.lat_gps, c.long_gps, c.geopoint, c.geog
                    ->  Gather  (cost=3200.48..9521.63 rows=6706 width=170)
                          Workers Planned: 1
                          ->  Nested Loop  (cost=2200.48..7851.03 rows=3945 width=170)
                                Join Filter: ((u.num_acc)::text = (v.num_acc)::text)
                                ->  Parallel Hash Join  (cost=2200.06..6686.70 rows=2075 width=170)
                                      Hash Cond: ((c.num_acc)::text = (u.num_acc)::text)
                                      ->  Parallel Seq Scan on caracteristiques_2018 c  (cost=0.00..2859.90 rows=33990 width=157)
                                      ->  Parallel Hash  (cost=2174.12..2174.12 rows=2075 width=13)
                                            ->  Parallel Seq Scan on usagers_2018 u  (cost=0.00..2174.12 rows=2075 width=13)
                                                  Filter: ((grav)::text = '2'::text)
                                ->  Index Only Scan using vehicules_2018_num_acc_idx on vehicules_2018 v  (cost=0.42..0.54 rows=2 width=13)
                                      Index Cond: (num_acc = (c.num_acc)::text)
        ->  Index Scan using valeurs_foncieres_geopoint_idx on valeurs_foncieres vf  (cost=0.54..39477.72 rows=1456 width=32)
              Index Cond: (geopoint && _st_expand(c.geog, '1000'::double precision))
              Filter: st_dwithin(geopoint, c.geog, '1000'::double precision, false)
JIT:
  Functions: 30
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Questions:

Is this normal? How can I decrease the execution time?

One Answer

14,000,000 rows is not small. Also, if your geographies are uniformly distributed, the number of points involved grows by roughly x100 when you multiply the radius by x10 (the area of a circle grows with r²), so it is normal that your runtime appears to grow quadratically. Here it seems to be more than that, but the more data you manipulate, the more operations you potentially need because of cache management and disk access (less true for small data or a big cache).
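
You can see that scaling on your own data with a quick probe (a sketch: the point below is hypothetical, substitute coordinates from your data) by counting the candidate rows at each radius:

-- Expect roughly x100 more rows each time the radius grows x10.
SELECT count(*)
    FROM valeurs_foncieres vf
    WHERE ST_DWithin(vf.geopoint,
                     ST_GeogFromText('SRID=4326;POINT(2.35 48.85)'),
                     1000,    -- try 100, 1000, 10000
                     FALSE);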

Here the EXPLAIN seems OK: it uses the index, so that is not the problem. You should just make sure to VACUUM ANALYZE your tables, though it shouldn't change much; the commands are below.
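
For reference (table names taken from the query above):

VACUUM ANALYZE valeurs_foncieres;
VACUUM ANALYZE caracteristiques_2018;
VACUUM ANALYZE usagers_2018;
VACUUM ANALYZE vehicules_2018;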

The main thing you can do, if you haven't already, is to tune your PostgreSQL configuration. By default the parameters are really conservative; if you have a big server you need to change them to use it properly. On Linux they live in /etc/postgresql/12/main/postgresql.conf, and you need to restart PostgreSQL afterwards (you can easily find documentation online if you have questions about that). Typically, what I modify are the following (adapted for around 120 GB of RAM and 48 CPUs; see the ALTER SYSTEM sketch after this list):

  • shared_buffers = 30GB
  • effective_cache_size = 80GB
  • work_mem = 256MB
  • maintenance_work_mem = 5GB
  • autovacuum_work_mem = 5GB
  • effective_io_concurrency = 200 (for SSD, or 2 for disk)
  • max_worker_processes = 48
  • max_parallel_workers = 48
  • max_parallel_workers_per_gather = 12
  • wal_buffers = 16MB
  • min_wal_size = 1GB
  • max_wal_size = 2GB
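
If you prefer not to edit the file by hand, the same settings can be applied with ALTER SYSTEM (a sketch, assuming a superuser session; shared_buffers and the worker settings only take effect after a restart):

ALTER SYSTEM SET shared_buffers = '30GB';
ALTER SYSTEM SET effective_cache_size = '80GB';
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET maintenance_work_mem = '5GB';
ALTER SYSTEM SET autovacuum_work_mem = '5GB';
ALTER SYSTEM SET effective_io_concurrency = 200;  -- for SSD; use 2 for spinning disks
ALTER SYSTEM SET max_worker_processes = 48;
ALTER SYSTEM SET max_parallel_workers = 48;
ALTER SYSTEM SET max_parallel_workers_per_gather = 12;
ALTER SYSTEM SET wal_buffers = '16MB';
ALTER SYSTEM SET min_wal_size = '1GB';
ALTER SYSTEM SET max_wal_size = '2GB';
-- then restart the server, e.g. sudo systemctl restart postgresql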

Those values are probably not perfect; they come partly from documentation I found and partly from trial and error on big queries. But if you haven't configured your PostgreSQL at all (you said you are getting started), it should make a big difference in performance for big queries (yours is not that big, but it should still have an impact). Geometry data is usually large, so it needs more space than a typical PostgreSQL workload. Also, if you can, be sure to put your data on an SSD; that can have a big impact too.

EDIT

I just reread your query, and I don't really understand why you need all the points within X meters if you then keep only one line per num_acc. Either you didn't post the whole query, or you really only need one point, so I rewrote it in case what you actually want is the closest point. I used MATERIALIZED CTEs, which create a temporary table for each step; sometimes this can really improve performance. The ORDER BY vf.geopoint <-> sub.geog with LIMIT 1 is a nearest-neighbour (KNN) search that can walk the GiST index directly. If you do want all the points and not just the closest one, you can run it as is after removing the ORDER BY and LIMIT in the INNER JOIN LATERAL at the end. And of course I limit the search here with ST_DWithin, but if you want the true nearest neighbour regardless of distance, you can remove that WHERE clause:

WITH usg AS MATERIALIZED
(
    SELECT
            DISTINCT(u.num_acc) AS unumacc
            , c.*
        FROM
            usagers_2018 u
        INNER JOIN caracteristiques_2018 c ON
            u.num_acc = c.num_acc
        WHERE
            u.grav = '2'
        ORDER BY
            c.id
), sub AS MATERIALIZED
(
    SELECT
            DISTINCT(usg.unumacc)
            , usg.*
            , v.*
        FROM
            usg
        INNER JOIN vehicules_2018 v ON
            usg.num_acc = v.num_acc
)
SELECT
        sub.*
        , closest.*
    FROM sub
    INNER JOIN LATERAL
        (
            SELECT
                    vf.*
                FROM
                    valeurs_foncieres vf
                WHERE
                    ST_DWithin(
                        vf.geopoint
                        , sub.geog
                        , 1000
                        , FALSE
                    )
                ORDER BY vf.geopoint <-> sub.geog
                LIMIT 1
        ) AS closest  -- the LATERAL subquery needs an alias
        ON TRUE;
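
To check that the nearest-neighbour part really walks the GiST index, you can EXPLAIN a stripped-down probe (a sketch with a hypothetical fixed point; the plan should show an index scan on valeurs_foncieres_geopoint_idx with an Order By condition):

EXPLAIN
SELECT vf.id
    FROM valeurs_foncieres vf
    ORDER BY vf.geopoint <-> ST_GeogFromText('SRID=4326;POINT(2.35 48.85)')
    LIMIT 1;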

Correct answer by robin loche on March 17, 2021
