您的位置：首页 > 运维架构

分组求topn

2018-01-05 02:23 113 查看

遇见一个难题：分组求组内topn，由于mysql没有row_number()查阅了许些资料，mark。

需求：取出每个班级的前两名同学(允许并列)

创建表

create table tmp(
id int not null auto_increment,
sname varchar(10),
cname varchar(10),
score int,
PRIMARY key(id)
)engine=InnoDB CHARSET=UTF8;

插入数据

insert into tmp values(null,'AAAA','C1',67);
insert into tmp values(null,'BBBB','C1',55);
insert into tmp values(null,'CCCC','C1',67);
insert into tmp values(null,'DDDD','C1',65);
insert into tmp values(null,'EEEE','C1',95);
insert into tmp values(null,'FFFF','C2',57);
insert into tmp values(null,'GGGG','C2',87);
insert into tmp values(null,'HHHH','C2',74);
insert into tmp values(null,'IIII','C2',52);
insert into tmp values(null,'JJJJ','C2',81);
insert into tmp values(null,'KKKK','C2',67);
insert into tmp values(null,'LLLL','C2',66);
insert into tmp values(null,'MMMM','C2',63);
insert into tmp values(null,'NNNN','C3',99);
insert into tmp values(null,'OOOO','C3',50);
insert into tmp values(null,'PPPP','C3',59);
insert into tmp values(null,'QQQQ','C3',66);
insert into tmp values(null,'RRRR','C3',76);
insert into tmp values(null,'SSSS','C3',50);
insert into tmp values(null,'TTTT','C3',50);
insert into tmp values(null,'UUUU','C3',64);
insert into tmp values(null,'VVVV','C3',74);

方法一：

select a.id,a.sname,a.cname,a.score
from tmp a left join tmp b on a.cname=b.cname and a.score<b.score
group by a.id,a.sname,a.cname,a.score
having count(b.id)<2
order by a.cname,a.score desc

方法二：

select *
from tmp a
where 2>(select count(*) from tmp where cname=a.cname and score>a.score)
order by a.cname,a.score desc

如果是求top1

select *
from tmp a
where not exists (select 1 from tmp where cname=a.cname and score>a.score);

select a.*
from tmp a inner join (select cname, max(score) as score from tmp group by cname) b
on a.cname=b.cname and a.score=b.score

select *
from (select * from tmp order by score desc) t
group by cname

但在hive中有row_number()函数，举个例子：

需求：按照时间维度，比如，统计一天内各小时产生最多pvs的来源topN

insert into table zs.dw_pvs_refhost_topn_h partition(datestr='2016-03-18')
select t.hour,t.od,t.ref_host,t.ref_host_cnts from
(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from zs.dw_ref_host_visit_cnts_h) t where od<=3;

看一下内层表实现：对每个小时内的来访host次数倒序排序标号

select ref_host,ref_host_cnts,concat(month,hour,day),
row_number() over (partition by concat(month,hour,day) order by ref_host_cnts desc) as od
from dw_ref_host_visit_cnts_h

结果

+-------------------------+----------------+----------+-----+--+
|        ref_host         | ref_host_cnts  |    c2    | od  |
+-------------------------+----------------+----------+-----+--+
| blog.fens.me            | 111            | 09 0019  | 1   |
| www.fens.me             | 13             | 09 0019  | 2   |
| h2w.iask.cn             | 6              | 09 0019  | 3   |
| angularjs.cn            | 3              | 09 0019  | 4   |
| www.google.com.hk       | 3              | 09 0019  | 5   |
| www.leonarding.com      | 1              | 09 0019  | 6   |
| cnodejs.org             | 1              | 09 0019  | 7   |
| www.itpub.net           | 1              | 09 0019  | 8   |
| blog.fens.me            | 89             | 09 0119  | 1   |
| cos.name                | 3              | 09 0119  | 2   |
| www.google.com.tw       | 2              | 09 0119  | 3   |
| www.angularjs.cn        | 2              | 09 0119  | 4   |
| mp.weixin.qq.com        | 1              | 09 0119  | 5   |
| h2w.iask.cn             | 1              | 09 0119  | 6   |
| cnodejs.org             | 1              | 09 0119  | 7   |
| weibo.com               | 1              | 09 0119  | 8   |
| www.google.com.hk       | 1              | 09 0119  | 9   |
| blog.fens.me            | 311            | 09 0219  | 1   |
| www.google.com          | 3              | 09 0219  | 2   |
| www.google.com.hk       | 3              | 09 0219  | 3   |

row_number()从1开始，为每一条分组记录返回一个数字

row_number() over (order by col desc) 是先把col列降序，再为降序以后的每条col记录返回一个序号。

col row_num
999   1
888   2
777   3
666   4

row_number() over (partition by col1 order by col2) 表示根据col1分组，在分组内部根据col22排序，

而此函数计算的值就表示每组内部排序后的顺序编号（组内连续的唯一的)

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航