千万级数据库表去重复的方法
20240124更新,MYSQL(有多个重复时只保留第一行):
DELETE FROM my_table WHERE id NOT IN (SELECT id FROM (SELECT MIN(id) AS id FROM my_table GROUP BY fieldname123) AS a)
20210910更新,MYSQL(只能删除最后一行重复):
DELETE FROM my_extend_weixin_fans WHERE wxid=123 AND id IN (SELECT id FROM (SELECT MAX(id) AS id FROM my_extend_weixin_fans WHERE wxid=123 GROUP BY openid HAVING COUNT(openid)>1)AS a)
CREATE TABLE mypwd
(id INT UNSIGNED NOT NULL AUTO_INCREMENT,
pwd VARCHAR(14) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY pwd(pwd)
)ENGINE=MYISAM, CHARSET='utf8'
INSERT INTO mypwd(pwd) SELECT DISTINCT pwd FROM oldtable
采用这种方式,从已简单处理过的1300多W的表中复制出DISTINCT pwd 1200多W条到mypwd,用时11分39秒,E5400的CPU占用率低于60%,内存占用也很少。
如果不给pwd字段建索引,仅用110秒就复制完,再给pwd字段创建UNIQUE KEY用时9分34秒。
所以用这种方式还是不错的
/**************下面的方法待验证********************/
如果要删除手机(mobilePhone),电话(officePhone),邮件(email)同时都相同的数据,以前一直使用这条语句
delete from 表 where id not in (
select max(id) from 表 group by mobilePhone,officePhone,email )
or
delete from 表 where id not in (
select min(id) from 表 group by mobilePhone,officePhone,email )
下面的方法效率更高
效率对比:用delete方法对500万数据去重(1/2重复)约4小时
用临时表插入对500万数据去重(1/2重复)不到10分钟
//查询出唯一数据的ID,并把他们导入临时表tmp中
select min(id) as mid into tmp from 表 group by mobilePhone,officePhone,email
//查询出去重后的数据并插入finally表中
insert into finally select (除ID以外的字段) from customers_1 where id in (select mid from tmp)
两表组合更新示例
Update [INVOICE] SET INVOICE.BUREAU_NO=TP_USER.BUREAU_NO
FROM [INVOICE] JOIN [TP_USER] ON INVOICE.TAXPAYER_ID=TP_USER.TAXPAYER_ID
WHERE INVOICE.TAX_FLAG='正常' And INVOICE.TAX_DATE<>'1900-1-1'
以上不适用于access
MYSQL:
UPDATE customer SET customer.amount=(SELECT SUM(works_renewals.contractamount) FROM works_renewals WHERE customer.cid=works_renewals.cid)
MYSQL
UPDATE book SET price=100 WHERE id IN (SELECT id FROM book AS a WHERE a.price=99)
错误代码: 1093
You can't specify target table 'book' for update in FROM clause
mysql同表组合更新时,需要把SELECT出来的数据放到临时表中。SELECT的表和临时表都要有别名。正确语句:
UPDATE book SET price=100 WHERE id IN (SELECT b.id FROM (SELECT id FROM book AS a WHERE a.price=99) AS b)
MySQL自表更新示例
update
wxy_enum, (select tmp.id, tmp.addfield1 from wxy_enum as tmp) tmp2
set wxy_enum.parentid=tmp2.id
where wxy_enum.addfield2=tmp2.addfield1 and wxy_enum.rootid>0
SQL随机读取数据库记录
Sql server:
select top 10 * from 表 order by newid()
Access:
SELECT top 10 * FROM 表 ORDER BY Rnd(id-timer()) '每次刷新都不同
Rnd(id) 其中的id是自动编号字段,可以利用其他任何数值来完成
比如用姓名字段(UserName)
SELECT top 10 * FROM 表 ORDER BY Rnd(len(UserName))
MySql:
Select * From 表 Order By rand() Limit 10