ST_SnapToGrid takes a long time on

Question

My ultimate goal is to fetch all points from a 4 million record table that fall within a given rectangle and group the results into clusters. Every query I have tried so far takes too long: upwards of 15 seconds. We are aiming for a few hundred milliseconds at most.

I obtained the fastest result from the following approach...

Created a table of 4 million records. Each record contains a geometry point called "location_point". The geometry points are based on SRID 4326.
Created a second column named "snapped_geometry" that holds the location_point snapped to a PostGIS grid.

update tablename set snapped_geometry = ST_SnapToGrid(location_point, 0.2);
Indexed this snapped_geometry column using GIST

create index on tablename using GIST (snapped_geometry)
Ran this query...

explain (analyze)
select count(snapped_geometry) as count, snapped_geometry
from contacts_80
where ST_Contains(ST_MakeEnvelope(-95, 30.5, -80, 45, 4326), snapped_geometry)
group by snapped_geometry;
This is the response...

Some things I learned from researching the terms in this explain response...
1. The sort information is related to the "group by" clause.
2. The heap scan is related to the "where" clause.
3. Limited work_mem is not why the query is slow. Initially there was not enough work_mem to run the sort in memory, so the sort spilled to disk. See here. I increased it with set work_mem = '800MB', which fixed the spill, as confirmed by the line "Sort Method: quicksort" in the explain response.
4. The bitmap heap scan was not lossy. We were initially concerned that it was, because a row in the explain response displays a "recheck condition". I later learned that this row appears in every bitmap heap scan plan, even when the scan does not need to recheck the index conditions (i.e. even when the scan is not lossy). The condition is only rechecked when the scan is lossy. The absence of the word "lossy" in line 9 of the explain response indicates the scan was not lossy. See here.
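
For reference, the checks in items 3 and 4 can be reproduced with something like the sketch below. It reuses the query above; the BUFFERS option is an addition that was not part of the original run, and the comments describe what to look for rather than actual output from this table.

```sql
-- Session-level setting, as described in item 3.
SET work_mem = '800MB';

-- BUFFERS is an assumption added here to show how many pages were read;
-- the original run used explain (analyze) only.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(snapped_geometry) AS count, snapped_geometry
FROM contacts_80
WHERE ST_Contains(ST_MakeEnvelope(-95, 30.5, -80, 45, 4326), snapped_geometry)
GROUP BY snapped_geometry;

-- In the plan, check:
--   "Sort Method: quicksort  Memory: ..."  -> the sort stayed in memory
--   "Sort Method: external merge  Disk: ..." -> the sort spilled to disk
--   "Heap Blocks: exact=N" with no "lossy=" entry -> the bitmap was not lossy
```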

This leaves me wondering how I can speed up this query.
Am I using ST_SnapToGrid incorrectly? 
Is there an error in how I created and used the GIST index? 
Is it impossible to speed up this query?

I have also experimented with PostGIS's ST_ClusterKMeans, ST_ClusterDBSCAN and ST_ClusterWithin, with no speed advantage.

Other links I used to learn about heap scans and sorting methods...

bitmap heap scans

Sorting methods

Link mentioning count() cannot be used with an index

robin loche · Answer

Did you run "ANALYZE contacts_80;" after creating your indexes so that the planner has accurate statistics? It seems to plan for 4000 rows, and ends up with 4000000...

Also, ST_Within is usually used instead of ST_Contains. It is the converse of ST_Contains, so the arguments are swapped.

What I would try is:

ANALYZE contacts_80;
explain analyze 
  select count(snapped_geometry) as count, snapped_geometry 
    from contacts_80 
    where ST_Within(snapped_geometry, st_MakeEnvelope(-95, 30.5, -80, 45, 4326))
    group by snapped_geometry;

Note the inversion of the parameters for ST_Within.

Also, the geometry comparison in your group by seems to take a long time, so maybe you could try grouping on ST_GeoHash of the geometry instead of on the geometry itself?
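
A rough sketch of the ST_GeoHash idea, untested against your data. It assumes snapped_geometry holds SRID 4326 points (ST_GeoHash needs lon/lat coordinates), and cell_hash is just an alias made up for illustration.

```sql
-- Group on a text key derived from the snapped point instead of on the geometry.
-- Identical snapped points produce identical geohash strings.
EXPLAIN ANALYZE
SELECT count(*) AS count,
       ST_GeoHash(snapped_geometry) AS cell_hash  -- hypothetical alias
FROM contacts_80
WHERE ST_Within(snapped_geometry, ST_MakeEnvelope(-95, 30.5, -80, 45, 4326))
GROUP BY cell_hash;
```

Whether this is actually faster would need to be confirmed with explain analyze, since the hash is computed for every matching row. If it helps, the hash could be stored in its own column alongside snapped_geometry, and a representative point for each group can be recovered with ST_PointFromGeoHash.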
