王新阳

wangxinyang

千万级数据库表去重复的方法

20240124更新,MYSQL(有多个重复时只保留第一行):
DELETE FROM my_table WHERE  id NOT IN (SELECT id FROM (SELECT MIN(id) AS id FROM my_table GROUP BY fieldname123) AS a)

20210910更新,MYSQL(只能删除最后一行重复):
DELETE FROM my_extend_weixin_fans WHERE wxid=123 AND id IN (SELECT id FROM (SELECT MAX(id) AS id FROM my_extend_weixin_fans WHERE wxid=123 GROUP BY openid HAVING COUNT(openid)>1)AS a)


CREATE TABLE mypwd
(id INT UNSIGNED NOT NULL AUTO_INCREMENT,
pwd VARCHAR(14) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY pwd(pwd)
)ENGINE=MYISAM, CHARSET='utf8'

INSERT INTO mypwd(pwd) SELECT DISTINCT pwd FROM oldtable

采用这种方式,从已简单处理过的1300多W的表中复制出DISTINCT pwd 1200多W条到mypwd,用时11分39秒,E5400的CPU占用率低于60%,内存占用也很少。
如果不给pwd字段建索引,仅用110秒就复制完,再给pwd字段创建UNIQUE KEY用时9分34秒。
所以用这种方式还是不错的


/**************下面的方法待验证********************/

如果要删除手机(mobilePhone),电话(officePhone),邮件(email)同时都相同的数据,以前一直使用这条语句

delete from 表 where id not in (
select max(id) from 表 group by mobilePhone,officePhone,email )

or

delete from 表 where id not in (
select min(id) from 表 group by mobilePhone,officePhone,email )

下面的方法效率更高

效率对比:用delete方法对500万数据去重(1/2重复)约4小时
用临时表插入对500万数据去重(1/2重复)不到10分钟


//查询出唯一数据的ID,并把他们导入临时表tmp中
select min(id) as mid into tmp from 表 group by mobilePhone,officePhone,email

//查询出去重后的数据并插入finally表中
insert into finally select (除ID以外的字段) from customers_1 where id in (select mid from tmp)

两表组合更新示例

MSSQL:

Update [INVOICE] SET INVOICE.BUREAU_NO=TP_USER.BUREAU_NO
FROM [INVOICE] JOIN [TP_USER] ON INVOICE.TAXPAYER_ID=TP_USER.TAXPAYER_ID
WHERE INVOICE.TAX_FLAG='正常' And INVOICE.TAX_DATE<>'1900-1-1'

以上不适用于access


MYSQL:

UPDATE customer SET customer.amount=(SELECT SUM(works_renewals.contractamount) FROM works_renewals WHERE customer.cid=works_renewals.cid)


MYSQL

UPDATE book SET price=100 WHERE id IN (SELECT id FROM book AS a WHERE a.price=99)

错误代码: 1093
You can't specify target table 'book' for update in FROM clause

mysql同表组合更新时,需要把SELECT出来的数据放到临时表中。SELECT的表和临时表都要有别名。正确语句:

UPDATE book SET price=100 WHERE id IN (SELECT b.id FROM  (SELECT id FROM book AS a WHERE a.price=99) AS b)

MySQL自表更新示例

update
wxy_enum, (select tmp.id, tmp.addfield1 from wxy_enum as tmp) tmp2
set wxy_enum.parentid=tmp2.id
where wxy_enum.addfield2=tmp2.addfield1 and wxy_enum.rootid>0


SQL随机读取数据库记录

随机提取10条记录

Sql server:
select top 10 * from 表 order by newid()

Access:
SELECT top 10 * FROM 表 ORDER BY Rnd(id-timer()) '每次刷新都不同

Rnd(id) 其中的id是自动编号字段,可以利用其他任何数值来完成
比如用姓名字段(UserName)
SELECT top 10 * FROM 表 ORDER BY Rnd(len(UserName))

MySql:
Select * From 表 Order By rand() Limit 10
2024-11-24 星期日 农历十月二十四