SQL Optimization Tips in Hive
2015-12-22 17:12
The core idea behind optimizing SQL in Hive is to avoid data skew.
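Before rewriting a query, it can help to check whether the join or grouping key is actually skewed. The query below is a minimal sketch, not part of the original examples. It reuses the dms.tracklog_5min table and its userid and day columns from the examples in this post. The row_cnt alias and the limit of 20 are arbitrary choices.

```sql
-- List the userid values with the most rows in the month being analysed.
-- A handful of keys with far more rows than the rest suggests reduce-side skew.
SELECT userid, count(*) AS row_cnt
FROM dms.tracklog_5min
WHERE day >= '201505' AND day < '201506'
GROUP BY userid
ORDER BY row_cnt DESC
LIMIT 20;
```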
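1. Avoid combining count, distinct, and group by in the same query.
2. In a left join, put the smaller table first.
3. Prefer subqueries, so that each table is filtered and deduplicated before it is joined.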
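Tips 1 and 3 can be sketched together as follows. This is an illustrative rewrite, not the author's exact query. It uses only the dms.tracklog_5min table from the examples in this post. The aliases month, users and t are assumptions made for the sketch.

```sql
-- Shape to avoid: count(distinct ...) together with group by.
-- All rows of a group are sent to a single reducer, which must deduplicate them.
SELECT substr(day, 1, 6) AS month, count(DISTINCT userid) AS users
FROM dms.tracklog_5min
WHERE day >= '201505' AND day < '201506'
GROUP BY substr(day, 1, 6);

-- Rewritten shape: deduplicate in a subquery first, then count plain rows.
-- The inner group by spreads the deduplication across many reducers.
SELECT t.month, count(*) AS users
FROM (
  SELECT substr(day, 1, 6) AS month, userid
  FROM dms.tracklog_5min
  WHERE day >= '201505' AND day < '201506'
    AND userid IS NOT NULL  -- count(distinct) ignores NULLs, so drop them here too
  GROUP BY substr(day, 1, 6), userid
) t
GROUP BY t.month;
```

The rewrite costs an extra job stage, so it tends to pay off only when the number of distinct values per group is large.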
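Parameter configuration
Data skew mostly shows up in the reduce phase. It can be mitigated by setting the number of reduce tasks in Hive, the memory per reduce task (in MB), and the share of the reduce-side shuffle buffer that a single map output may occupy before it is written straight to disk.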
SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;
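The settings above size the reduce side. Hive also has switches aimed at skew itself. The fragment below is a sketch of those switches and is not taken from the original post. Whether they help depends on the Hive version and the query, and the threshold shown is Hive's documented default rather than a tuned value.

```sql
-- Pre-aggregate on the map side so fewer rows reach the reducers.
SET hive.map.aggr=true;
-- Run group by in two stages: spread keys randomly first, then merge partial results.
SET hive.groupby.skewindata=true;
-- Process join keys with very many rows in a separate follow-up job.
SET hive.optimize.skewjoin=true;
-- Row count above which a join key is treated as skewed.
SET hive.skewjoin.key=100000;
```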
Example 1
-- By month
select substr(a.day,1,6) month,count(distinct a.userid)
from dms.tracklog_5min a join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6);
-- After optimization
-- Note: the left join keeps every user in c, including users with no row in tmp;
-- an inner join would be needed to match the original query exactly.
select '201505',count(*)
from
(
select distinct c.userid from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(select userid from dms.tracklog_5min where day>='201505' and day<'201506' ) tmp
on tmp.userid=c.userid
) t;
Example 2
-- By department
select substr(a.day,1,6)month,count(distinct a.userid) ,b.dept_name
from dms.tracklog_5min a join default.d_channel b
on a.host=b.host
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6),b.dept_name;
-- After optimization
SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;
select "201505" month,count(t.userid),t.dept_name
from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(
select distinct a.userid userid,b.dept_name dept_name from default.d_channel b
left join
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
on a.host=b.host
)t
on t.userid=c.userid
group by t.dept_name ;
Example 3
-- By product
select substr(a.day,1,6) month,count(distinct a.userid) ,b.dept_name,b.prod_name
from dms.tracklog_5min a join default.d_channel b
on a.host=b.host
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6),b.dept_name,b.prod_name;
-- After optimization
select "201505" month,count(t.userid) cnt,t.dept_name dept_name,t.prod_name prod_name
from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(
select distinct a.userid userid,b.dept_name dept_name,b.prod_name prod_name from default.d_channel b
left join
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
on a.host=b.host
)t
on t.userid=c.userid
group by t.prod_name,t.dept_name ;