王新阳

wangxinyang

千万级数据库表去重复的方法

20240124更新,MYSQL(有多个重复时只保留第一行):
DELETE FROM my_table WHERE  id NOT IN (SELECT id FROM (SELECT MIN(id) AS id FROM my_table GROUP BY fieldname123) AS a)

20210910更新,MYSQL(只能删除最后一行重复):
DELETE FROM my_extend_weixin_fans WHERE wxid=123 AND id IN (SELECT id FROM (SELECT MAX(id) AS id FROM my_extend_weixin_fans WHERE wxid=123 GROUP BY openid HAVING COUNT(openid)>1)AS a)


CREATE TABLE mypwd
(id INT UNSIGNED NOT NULL AUTO_INCREMENT,
pwd VARCHAR(14) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY pwd(pwd)
)ENGINE=MYISAM, CHARSET='utf8'

INSERT INTO mypwd(pwd) SELECT DISTINCT pwd FROM oldtable

采用这种方式,从已简单处理过的1300多W的表中复制出DISTINCT pwd 1200多W条到mypwd,用时11分39秒,E5400的CPU占用率低于60%,内存占用也很少。
如果不给pwd字段建索引,仅用110秒就复制完,再给pwd字段创建UNIQUE KEY用时9分34秒。
所以用这种方式还是不错的


/**************下面的方法待验证********************/

如果要删除手机(mobilePhone),电话(officePhone),邮件(email)同时都相同的数据,以前一直使用这条语句

delete from 表 where id not in (
select max(id) from 表 group by mobilePhone,officePhone,email )

or

delete from 表 where id not in (
select min(id) from 表 group by mobilePhone,officePhone,email )

下面的方法效率更高

效率对比:用delete方法对500万数据去重(1/2重复)约4小时
用临时表插入对500万数据去重(1/2重复)不到10分钟


//查询出唯一数据的ID,并把他们导入临时表tmp中
select min(id) as mid into tmp from 表 group by mobilePhone,officePhone,email

//查询出去重后的数据并插入finally表中
insert into finally select (除ID以外的字段) from customers_1 where id in (select mid from tmp)

2016-01-10
2024-11-23 星期六 农历十月二十三