2013-02-19 15:15
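This article explains how to find duplicate rows in a database table. It is a very common question among beginners, and the basic technique is simple. The problem also has variations, for example how to find rows that have duplicates in two columns (a question asked on the #mysql IRC channel).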


create table test(id int not null primary key, day date not null);

insert into test(id, day) values(1, '2006-10-08');
insert into test(id, day) values(2, '2006-10-08');
insert into test(id, day) values(3, '2006-10-09');

select * from test;
| id | day        |
|  1 | 2006-10-08 |
|  2 | 2006-10-08 |
|  3 | 2006-10-09 |

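The first two rows have the same value in the day column, so if I treat them as duplicates, here is a query that finds them. The query uses a GROUP BY clause to put rows with the same column value into one group, then counts the size of each group.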
select day, count(*) from test GROUP BY day;
| day        | count(*) |
| 2006-10-08 |        2 |
| 2006-10-09 |        1 |

Groups with a count greater than one contain the duplicates. To show only those groups, filter with a HAVING clause:
select day, count(*) from test group by day HAVING count(*) > 1;
| day        | count(*) |
| 2006-10-08 |        2 |
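
The query above can be tried without a MySQL server. The following is a minimal sketch, assuming Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL; the variable name `duplicates` is only illustrative.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("create table test(id int not null primary key, day date not null)")
conn.executemany(
    "insert into test(id, day) values(?, ?)",
    [(1, "2006-10-08"), (2, "2006-10-08"), (3, "2006-10-09")],
)

# Group rows by day and keep only the groups that hold more than one row.
duplicates = conn.execute(
    "select day, count(*) from test group by day having count(*) > 1"
).fetchall()
print(duplicates)  # [('2006-10-08', 2)]
```

Only the day that occurs twice is reported; the day that occurs once is filtered out by the HAVING clause.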





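Once the duplicates are found, a common next step is to delete them while keeping one row from each group; in the example below, the row with the smallest id is kept. Perhaps the simplest way is to use a temporary table. MySQL in particular has a restriction that prevents you from selecting from a table and updating it in the same query. My other article, How to select from an update target in MySQL, explains how to work around this restriction. For simplicity, only the temporary-table method is used here.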
create temporary table to_delete (day date not null, min_id int not null);

insert into to_delete(day, min_id)
   select day, MIN(id) from test group by day having count(*) > 1;

select * from to_delete;
| day        | min_id |
| 2006-10-08 |      1 |

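With this data, you can start deleting the "dirty" rows. There are several ways to do it, each with its own strengths and weaknesses (see my article many-to-one problems in SQL), but I won't compare them in detail here. I'll just show the standard method for relational databases that support subqueries.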
delete from test
   where exists(
      select * from to_delete
      where to_delete.day = test.day and to_delete.min_id <> test.id
   );
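
The whole delete procedure can be run end to end as a check. This is a sketch under the assumption that an in-memory SQLite database (via Python's sqlite3 module) behaves like MySQL for these statements; the name `remaining` is illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table test(id int not null primary key, day date not null);
insert into test(id, day) values(1, '2006-10-08');
insert into test(id, day) values(2, '2006-10-08');
insert into test(id, day) values(3, '2006-10-09');

-- Record, for each duplicated day, the smallest id (the row to keep).
create temporary table to_delete (day date not null, min_id int not null);
insert into to_delete(day, min_id)
   select day, min(id) from test group by day having count(*) > 1;

-- Delete every row of a duplicated day except the one with the smallest id.
delete from test
   where exists(
      select * from to_delete
      where to_delete.day = test.day and to_delete.min_id <> test.id
   );
""")

remaining = conn.execute("select id, day from test order by id").fetchall()
print(remaining)  # [(1, '2006-10-08'), (3, '2006-10-09')]
```

Row 2 is removed because it shares its day with row 1 and does not have the smallest id; row 3 is untouched because its day never appears in to_delete.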


Now for the variation mentioned at the start: finding rows that have duplicates in either of two columns. Here is some sample data:
create table a_b_c(
   a int not null primary key auto_increment,
   b int,
   c int
);

insert into a_b_c(b,c) values (1, 1);
insert into a_b_c(b,c) values (1, 2);
insert into a_b_c(b,c) values (1, 3);
insert into a_b_c(b,c) values (2, 1);
insert into a_b_c(b,c) values (2, 2);
insert into a_b_c(b,c) values (2, 3);
insert into a_b_c(b,c) values (3, 1);
insert into a_b_c(b,c) values (3, 2);
insert into a_b_c(b,c) values (3, 3);

Now you can easily see that the table contains some "duplicate" rows, yet no two rows share the same tuple {b, c}. That is what makes this problem harder.


A first attempt might be the following query, which looks plausible but does not work:
select b, c, count(*) from a_b_c
group by b, c
having count(distinct b > 1)
   or count(distinct c > 1);

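This returns every row, each with a COUNT(*) of 1. Why? Because the > 1 is written inside the COUNT(). The mistake is easy to overlook, and the query is in effect equivalent to: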
select b, c, count(*) from a_b_c
group by b, c
having count(1)
   or count(1);

Why? Because (b > 1) is a boolean expression, which is not what you want at all. What you want is:
select b, c, count(*) from a_b_c
group by b, c
having count(distinct b) > 1
   or count(distinct c) > 1;
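
The difference between the two placements of the parenthesis can be checked directly. The sketch below assumes an in-memory SQLite database (Python's sqlite3 module) in place of MySQL, with SQLite's `integer primary key autoincrement` substituted for MySQL's `auto_increment`; the names `misplaced` and `corrected` are illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table a_b_c(a integer primary key autoincrement, b int, c int)")
conn.executemany(
    "insert into a_b_c(b, c) values (?, ?)",
    [(b, c) for b in (1, 2, 3) for c in (1, 2, 3)],
)

# Misplaced parenthesis: counts distinct values of the boolean (b > 1),
# which is 1 for every single-row group, so every group passes.
misplaced = conn.execute(
    "select b, c, count(*) from a_b_c group by b, c "
    "having count(distinct b > 1) or count(distinct c > 1)"
).fetchall()

# Corrected placement: compares the distinct count itself with 1.
corrected = conn.execute(
    "select b, c, count(*) from a_b_c group by b, c "
    "having count(distinct b) > 1 or count(distinct c) > 1"
).fetchall()

print(len(misplaced), len(corrected))  # 9 0
```

On this data the corrected query returns no rows at all: grouping by both b and c leaves exactly one row per group, so neither distinct count can exceed 1. The syntax is fixed, but the query still does not find the duplicates.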

Another attempt groups by b alone:
select b, count(*) from a_b_c group by b having count(distinct c) > 1;
| b    | count(*) |
|    1 |        3 |
|    2 |        3 |
|    3 |        3 |

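In fact, GROUP BY alone cannot do the job. Why? Because when you group by one column, equal values of the other column are scattered across different groups. You can see the effect by ordering the rows by these columns, which is what grouping does. First, order by b and look at how the rows are grouped.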

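When you order (group) by b, rows with equal values of c end up in different groups, so COUNT(DISTINCT c) cannot measure how often a value of c repeats. Aggregate functions such as COUNT() operate only within a single group and can do nothing about rows in different groups. Likewise, if you order by c, equal values of b end up in different groups, so either way the goal cannot be reached.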
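
The scattering described above can be made visible with a short experiment. This is only a sketch, assuming an in-memory SQLite database (Python's sqlite3 module) in place of MySQL; the names `ordered_by_b` and `b_groups_with_c1` are illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table a_b_c(a integer primary key autoincrement, b int, c int)")
conn.executemany(
    "insert into a_b_c(b, c) values (?, ?)",
    [(b, c) for b in (1, 2, 3) for c in (1, 2, 3)],
)

# Ordering by b places rows next to each other the way GROUP BY b would.
ordered_by_b = conn.execute("select b, c from a_b_c order by b, a").fetchall()
for b, c in ordered_by_b:
    print(b, c)

# The three rows that share c = 1 sit in three different b groups,
# so no aggregate computed per b group can see that c = 1 repeats.
b_groups_with_c1 = conn.execute(
    "select b from a_b_c where c = 1 order by b"
).fetchall()
print(b_groups_with_c1)  # [(1,), (2,), (3,)]
```

Within each b group the values of c are all different; the repeats of any given c value exist only across groups.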


One approach that works is to find the duplicated values in each column separately and combine the two results with UNION:
select b as value, count(*) as cnt, 'b' as what_col
 from a_b_c group by b having count(*) > 1
 union
 select c as value, count(*) as cnt, 'c' as what_col
 from a_b_c group by c having count(*) > 1;
| value | cnt | what_col |
|     1 |   3 | b        |
|     2 |   3 | b        |
|     3 |   3 | b        |
|     1 |   3 | c        |
|     2 |   3 | c        |
|     3 |   3 | c        |

To get the rows themselves rather than only the duplicated values, use subqueries:
select a, b, c from a_b_c
 where b in (select b from a_b_c group by b having count(*) > 1)
    or c in (select c from a_b_c group by c having count(*) > 1);
| a  | b    | c    |
|  7 |    1 |    1 |
|  8 |    1 |    2 |
|  9 |    1 |    3 |
| 10 |    2 |    1 |
| 11 |    2 |    2 |
| 12 |    2 |    3 |
| 13 |    3 |    1 |
| 14 |    3 |    2 |
| 15 |    3 |    3 |

The same rows can also be selected by joining against the grouped subqueries:
select a, a_b_c.b, a_b_c.c
from a_b_c
   left outer join (
      select b from a_b_c group by b having count(*) > 1
   ) as b on a_b_c.b = b.b
   left outer join (
      select c from a_b_c group by c having count(*) > 1
   ) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null;
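
As a check that the subquery form and the join form select the same rows, the sketch below runs both. It assumes an in-memory SQLite database (Python's sqlite3 module) in place of MySQL, so the generated values of a start at 1 here. The extra row (4, 4) is a hypothetical addition, not part of the original data, included so that there is one row with no duplicate in either column.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table a_b_c(a integer primary key autoincrement, b int, c int)")
rows = [(b, c) for b in (1, 2, 3) for c in (1, 2, 3)]
rows.append((4, 4))  # hypothetical extra row: unique in both b and c
conn.executemany("insert into a_b_c(b, c) values (?, ?)", rows)

# Subquery form: keep rows whose b or c value occurs more than once.
in_rows = conn.execute("""
select a, b, c from a_b_c
 where b in (select b from a_b_c group by b having count(*) > 1)
    or c in (select c from a_b_c group by c having count(*) > 1)
 order by a
""").fetchall()

# Join form: a row qualifies if either outer join found a match.
join_rows = conn.execute("""
select a, a_b_c.b, a_b_c.c
from a_b_c
   left outer join (
      select b from a_b_c group by b having count(*) > 1
   ) as b on a_b_c.b = b.b
   left outer join (
      select c from a_b_c group by c having count(*) > 1
   ) as c on a_b_c.c = c.c
where b.b is not null or c.c is not null
order by a
""").fetchall()

print(in_rows == join_rows, len(in_rows))  # True 9
```

Both forms return the nine original rows and leave out the hypothetical (4, 4) row. Each grouped subquery yields at most one row per value, so the outer joins do not multiply rows.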
