[Hive] Hive Tuning: Running Tasks in Parallel
2018-01-02 21:53
Business background
The Hive SQL for extract_trfc_page_kpi is as follows:

    set mapred.job.queue.name=pms;
    set hive.exec.reducers.max=8;
    set mapred.reduce.tasks=8;
    set mapred.job.name=extract_trfc_page_kpi;

    insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
    select distinct
        page_type_id,
        pv,
        uv,
        '$yesterday' update_time
    from (
        -- PC and H5
        select page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi
        where ds = '$yesterday' and stat_type = 1
        group by page_type_id
        union all
        -- special handling for the PC search page
        select 5 as page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi
        where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
        union all
        -- APP
        select a.page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi a
        left outer join (
            select distinct page_type_id, old_page_type_id
            from tandem.mobile_backend_page_url_rule
            where is_delete = 0
        ) b on (a.page_type_id = b.old_page_type_id)
        where a.ds = '$yesterday' and stat_type = 1
        group by a.page_type_id
    ) t;
The SQL above contains two UNION ALL operations. Executed sequentially, it takes 20 minutes.
Optimization strategy
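In the SQL above, the three queries joined by the UNION ALL operations have no direct dependency on one another, so there is no need to run them one after another. The optimization is therefore to run the three queries in parallel. Hive provides the following parameters for running jobs in parallel:

    -- enable parallel execution of jobs
    set hive.exec.parallel=true;
    -- maximum number of threads for parallel jobs within a single SQL statement
    set hive.exec.parallel.thread.number=8;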
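One way to see which stages Hive treats as independent, before enabling the setting, is to look at the query plan. The sketch below is an illustration rather than part of the original procedure. It reuses the table and the '$yesterday' placeholder from the script above and keeps only two of the three branches for brevity. Prefixing the query with EXPLAIN prints the plan without running it. The STAGE DEPENDENCIES section of that output shows which stages depend on which. Stages with no dependency between them are the candidates for parallel execution. The exact plan depends on the Hive version and execution engine.

```sql
-- Sketch only: print the plan instead of running the query.
explain
select page_type_id, pv, uv
from (
    -- branch 1: PC and H5
    select page_type_id, sum(pv) as pv, sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1
    group by page_type_id
    union all
    -- branch 2: PC search page
    select 5 as page_type_id, sum(pv) as pv, sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
) t;
```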
Option 1
Add the two Hive parameters above when running the SQL, for example:

    set mapred.job.queue.name=pms;
    set hive.exec.reducers.max=8;
    set mapred.reduce.tasks=8;
    set hive.exec.parallel=true;
    set hive.exec.parallel.thread.number=8;
    set mapred.job.name=extract_trfc_page_kpi;

    insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
    select distinct
        page_type_id,
        pv,
        uv,
        '$yesterday' update_time
    from (
        -- PC and H5
        select page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi
        where ds = '$yesterday' and stat_type = 1
        group by page_type_id
        union all
        -- special handling for the PC search page
        select 5 as page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi
        where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
        union all
        -- APP
        select a.page_type_id, sum(pv) as pv, sum(uv) as uv
        from dw.rpt_trfc_page_kpi a
        left outer join (
            select distinct page_type_id, old_page_type_id
            from tandem.mobile_backend_page_url_rule
            where is_delete = 0
        ) b on (a.page_type_id = b.old_page_type_id)
        where a.ds = '$yesterday' and stat_type = 1
        group by a.page_type_id
    ) t;
Option 2
Set the parameters in hive-site.xml. First, check the configuration parameters of the current Hive installation:

    hive> set -v;
    ...
    hive.exec.orc.zerocopy=false
    hive.exec.parallel=false
    hive.exec.parallel.thread.number=8
    hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
    hive.exec.rcfile.use.explicit.header=true
    hive.exec.rcfile.use.sync.cache=true
    hive.exec.reducers.bytes.per.reducer=1000000000
    hive.exec.reducers.max=999
    hive.exec.rowoffset=false
    hive.exec.scratchdir=/tmp/hive-pms
    hive.exec.script.allow.partial.consumption=false
    hive.exec.script.maxerrsize=100000
    hive.exec.script.trust=false
    hive.exec.show.job.failure.debug.info=true
    ...
These parameters are configured in $HIVE_HOME/conf/hive-site.xml. Now add the following to that configuration file:
    <property>
        <name>hive.exec.parallel</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.parallel.thread.number</name>
        <value>16</value>
    </property>
After restarting Hive, the newly configured parameters have taken effect:
    hive> set -v;
    ...
    hive.exec.orc.skip.corrupt.data=false
    hive.exec.orc.zerocopy=false
    hive.exec.parallel=true
    hive.exec.parallel.thread.number=16
    hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
    hive.exec.rcfile.use.explicit.header=true
    hive.exec.rcfile.use.sync.cache=true
    hive.exec.reducers.bytes.per.reducer=1000000000
    hive.exec.reducers.max=999
    hive.exec.rowoffset=false
    hive.exec.scratchdir=/tmp/hive-pms
    hive.exec.script.allow.partial.consumption=false
    ...
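Because Option 2 changes the default for every session, a script that should keep running its jobs one at a time can still override the value for itself. The following is a minimal sketch, assuming the hive-site.xml change above is in place. A `set` issued inside a session affects only that session, and `set` followed by a key alone prints its current value.

```sql
-- Override the hive-site.xml default for this session only.
set hive.exec.parallel=false;
-- Print the effective value to confirm the override.
set hive.exec.parallel;
```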
Conclusion
In testing, after adding these two parameters, the execution time of the extract_trfc_page_kpi script dropped from 20 minutes to 3 minutes.