Delete duplicate geometry in postgis tables

Question

After - I don't know what happend - all my entries in my PostGIS tables are doubled! I tried this to delete them but it does not delete any/all duplicates:

DELETE FROM planet_osm_point
       WHERE osm_id NOT IN (SELECT min(osm_id)
                        FROM planet_osm_point
                        GROUP BY osm_id)

or this:

DELETE FROM planet_osm_point
WHERE osm_id NOT IN (
    select max(dup.osm_id)
    from planet_osm_point as dup
    group by way);

EDIT:

I finally found an easy way, which is working in my case:

DELETE FROM planet_osm_point WHERE ctid NOT IN
(SELECT max(ctid) FROM planet_osm_point GROUP BY osm_id);

found on this page: http://technobytz.com/most-useful-postgresql-commands.html

John Powell · Answer

One way of doing this, is to use a window function and partition by geometry, so that each repeated geometry gets an id: 1, 2, 3, etc (or 1, 2) in your case, and then you just select from the table where the id = 1, to get a unique set of values (attributes and geometry) back, eg,

WITH unique_geoms (id, geom) as 
 (SELECT row_number() OVER (PARTITION BY ST_AsBinary(geom)) AS id, geom FROM some_table)
SELECT geom 
FROM unique_geoms 
WHERE id=1;

Obviously, you would need to add the other osm columns in the select too, this is just for illustration, but this is basically like grouping by geometry and just selecting the first instance of each one. Note, you need to use ST_AsBinary in the Partition By as otherwise the comparison is done on the bounding box, not the actual geometry.

As all the other attributes are presumably the same for each geometry pair, you would so something like this for all the other fields, including osm_id, and to actually create a new, unique table:

CREATE TABLE osm_unique AS
 WITH unique_geoms (id, osm_id, attr1, attr2,... attrn, geom) AS 
  (SELECT row_number() OVER (PARTITION BY ST_AsBinary(geom)) AS id, osm_id, attr1, attr2,... attrn, geom 
    FROM osm_planet_point)
 SELECT osm_id, attr1, attr2,... attrn, geom 
 FROM unique_geoms 
 WHERE id=1;

This might be quicker than deleting from an existing table, especially if there are lots of indexes in place.

EDIT. Rewritten for readability, but, leaving the credit to dbaston for drawing my attention to ST_AsBinary(geom)

Nate Wanner · Answer

Here is another method I used to remove duplicates from a SSURGO soil data download. The downloaded shapefiles did not have a unique key, so a serial pk column was generated when I imported to PostGIS. There were a few overlaps in the data sets, and I inadvertently imported some records more than once while developing the import script.

The group by statement includes all columns in the table, excluding the primary key.

It will only delete one set of duplicate rows each time it's run, so if a row is repeated 4 times you will need to run this a minimum of 3 times. This is likely not as fast as John's solution, but works within an existing table. It also works when you don't have a unique id for each unique geometry (such as the osm_id in the original question).

I used a python script to repeat until duplicates were gone, and then ran a full vacuum. I think the script and vacuum each took about 30 minutes for a few hundred thousand duplicates from around 1.5 million records in 6 tables. Plenty good for a one-off. It went through the small tables very quickly.

DELETE FROM schema.table 
  WHERE primary_key IN
    (SELECT MAX(primary_key)
     FROM schema.table 
     GROUP BY ST_AsBinary(geom), col_1, col_2, col_etc
     HAVING COUNT(primary_key) > 1);

EDIT: modified SQL to avoid running multiple times based on @dbaston suggestion (below). I tried this query method on a large table (~1.5 million records, ~25,000 duplicate point rows), and after it ran for 45 minutes I canceled its execution. Running with the SQL above (using a smaller subquery from the HAVING COUNT) reduced each run to less than 30 seconds. After running 3 times, it was done with all duplicates. The SQL below should be OK for small tables.

DELETE FROM schema.table 
  WHERE primary_key NOT IN
    (SELECT MAX(primary_key)
     FROM schema.table 
     GROUP BY ST_AsBinary(geom), col_1, col_2, col_etc);

Miro · Answer

A more general answer to easily delete geometry duplicates in PostGIS table. The following command deletes all features with duplicate geometry in "table_name" based on primary key (column "gid") and equality of geometry (column "geom"). Be aware it really deletes all geometry duplicates, they will be gone, forever! Maybe back up first?

DELETE FROM schema_name.table_name a
    USING schema_name.table_name b 
WHERE a.gid > b.gid AND st_equals(a.geom, b.geom);

Answered by Miro on July 24, 2021

DELETE FROM schema_name.table_name a
    USING schema_name.table_name b 
WHERE a.gid > b.gid AND st_equals(a.geom, b.geom);

Timothy Dalton · Answer

To make Miro's generic response a quite a bit faster you can use ST_DWithin which will only filter geometries with 0 distance between them using your spatial index (if you have added one).
DELETE FROM schema_name.table_name a
USING schema_name.table_name b 
WHERE a.id > b.id 
AND ST_Equals(a.geom, b.geom)
AND ST_DWithin(a.geom, b.geom, 0);

Delete duplicate geometry in postgis tables

4 Answers

Add your own answers!

Ask a Question