Database Administrators Asked on January 2, 2022
On a mysql table with about 1.7M rows, I tried to delete duplicates posts:
delete a FROM comment a
INNER JOIN comment a2
WHERE a.id < a2.id
AND a.body = a2.body;
The result was:
Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)
The table schema:
CREATE TABLE `comment` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`author_id` int(11) NOT NULL,
`created` datetime NOT NULL,
`title` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`body` longtext COLLATE utf8_unicode_ci NOT NULL,
`post_id` int(11) NOT NULL,
`published` tinyint(1) NOT NULL,
`cid` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `id_postid_idx` (`id`,`post_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1774682 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
This happened on my almost idle workstation with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
.
I’m wondering if there are some tricks to make this delete operation faster?
Your algorithm is "Order(N^2)". If you have a million rows, it will take about a trillion comparisions! (Actually closer to N*N/2.)
I assume you want to run this again later? Or you want to avoid inserting duplicates? In either case, the following will speed things up greatly.
ALTER TABLE `comment` ADD COLUMN hash BINARY(20);
UPDATE `comment` SET hash = UNHEX(SHA1(body));
ALTER TABLE `comment` ADD UNIQUE(hash); -- (see discussion below)
When INSERTing
, include hash
with a value of UNHEX(SHA1(body))
. If the hash is a dup, then body
is extremely likely to be a dup, and the INSERT
will suitably fail.
INDEX
versus UNIQUE
:
INDEX
gives you a much faster way to do the dup-elimination: check hash
instead of body
.UNIQUE
prevents future insertion of dups.So, let's do it in two phases:
INDEX(hash)
(not UNIQUE(hash)
)INDEX(hash)
, the DELETE
is Order(N) -- a million tests, not a trillion.DROP INDEX
+ ADD UNIQUE
. -- Now future INSERTs
will balk; so add suitable code to deal with such. Consider INSERT IGNORE
.All of that will take minutes, not hours. And you won't have to do any of it again.
(With a newer version of MySQL, you could use GENERATED ALWAYS
to avoid some of the code.)
One shot
To do the delete once and do it without modifying the table:
CREATE TABLE t (
id INT NOT NULL,
hash BINARY(20),
PRIMARY KEY(id),
INDEX(hash) ) ENGINE=InnoDB;
INSERT INTO t (id, hash)
SELECT id, UNHEX(SHA1(body))
FROM `comment`;
DELETE FROM `comment`
JOIN t AS t1 ON t1.id = comment.id
JOIN t AS t2 ON t2.hash = t1.hash
AND t2.id < t1.id;
DROP TABLE t;
Caveat: I have not tested this code. Recommend you test it separately, perhaps with a subset of the data.
Answered by Rick James on January 2, 2022
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP