Hive Optimization Techniques
2012-02-13 10:45
Hive optimization techniques:
Column Pruning
As the name suggests, discard columns which are not needed.
> select a,b from t where e<10;
> t contains 5 columns (a,b,c,d,e)
Columns c,d are discarded.
Select only the relevant columns.
Enabled by default:
> hive.optimize.cp=true
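The pruning behavior can be exercised directly from a Hive session; a minimal sketch (table and column names follow the example above):

```sql
-- column pruning is on by default; the SET is shown only for illustration
SET hive.optimize.cp=true;
-- only columns a, b and e are deserialized from t's storage;
-- columns c and d are never read
SELECT a, b FROM t WHERE e < 10;
```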
Predicate Pushdown
Move predicates closer to the table scans.
Enabled by default:
> hive.optimize.ppd=true
Predicates are pushed across joins, so these two queries are equivalent:
> select * from t1 join t2 on (t1.c1=t2.c2 and t1.c1<10)
> select * from t1 join t2 on (t1.c1=t2.c2) where t1.c1<10
Special rules for outer joins:
Left outer join: only predicates on the left side aliases are pushed
Right outer join: only predicates on the right side aliases are pushed
Full outer join: none of the predicates are pushed
Non-deterministic functions (e.g. rand()) are not pushed.
Mark such UDFs with the annotation:
> @UDFType(deterministic=false)
The entire expression containing the non-deterministic function is not pushed:
> c1>10 and c2<rand()
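The outer-join rules above determine where a filter should be written; a hedged sketch using the same t1/t2 tables:

```sql
-- left outer join: a predicate on the right (null-producing) side placed in
-- the WHERE clause cannot be pushed below the join without changing results,
-- since it would also filter out the NULL-padded rows
SELECT * FROM t1 LEFT OUTER JOIN t2 ON (t1.c1 = t2.c2) WHERE t2.c2 < 10;
-- putting it in the ON clause keeps outer-join semantics and lets it be
-- applied to t2 before the join
SELECT * FROM t1 LEFT OUTER JOIN t2 ON (t1.c1 = t2.c2 AND t2.c2 < 10);
```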
Partition Pruning
Reduce the list of partitions to be scanned.
Works on the parse tree currently - some known bugs, e.g. pruning can fail
when the partition predicate refers to a subquery alias:
> select * from
> (select c1, count(1) from t group by c1) subq
> where subq.prtn=100;
> select * from t1 join
> (select * from t2) subq on (t1.c1=subq.c2)
> where subq.prtn=100;
> hive.mapred.mode=nonstrict
In strict mode, a full scan of a partitioned table (no partition predicate) fails.
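For comparison, pruning works when the predicate names the table's own partition column; a sketch assuming t is partitioned by a column ds (hypothetical name):

```sql
-- strict mode rejects full scans of partitioned tables
SET hive.mapred.mode=strict;
-- ds is the (assumed) partition column; only the matching partition is scanned
SELECT c1, count(1) FROM t WHERE ds = '2012-02-13' GROUP BY c1;
```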
Hive QL - Join
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u
ON (pv.userid = u.userid);
The rightmost table is streamed, whereas the other tables' rows are
buffered in memory for a given join key. Use the largest table as the
rightmost table.
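If reordering the query is inconvenient, Hive also accepts a STREAMTABLE hint to mark which table to stream; a sketch on the same tables:

```sql
-- stream page_view even though it is not the rightmost table;
-- user's rows are the ones buffered in memory per join key
SELECT /*+ STREAMTABLE(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
```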
> hive.mapred.mode=nonstrict
In strict mode, a Cartesian product is not allowed.
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid)
JOIN newuser x ON (u.userid = x.userid);
Same join key - merged into 1 map-reduce job. This holds for any number of
tables sharing the join key: 1 map-reduce job instead of n.
The merging happens for OUTER joins also.
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid)
JOIN newuser x ON (u.age = x.age);
Different join keys - 2 map-reduce jobs.
Same as:
INSERT OVERWRITE TABLE tmptable SELECT *
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
INSERT OVERWRITE TABLE pv_users
SELECT x.pageid, x.age
FROM tmptable x JOIN newuser y ON (x.age = y.age);
Join Optimizations
Map Joins
> User-specified small tables are stored in hash tables on the mapper, backed by JDBM
> No reducer needed
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
Map Join
>> Optimization phase
>> n-way map-join if (n-1) tables are map-side readable
>> The mapper reads all (n-1) tables before processing the main table under consideration
>> Map-side readable tables are cached in memory, backed by JDBM persistent hash tables
Parameters
>> hive.join.emit.interval=1000
>> hive.mapjoin.size.key=10000
>> hive.mapjoin.cache.numrows=10000
Future
>> Sizes/statistics to determine join order
>> Sizes to enforce map-join
>> Better techniques for handling skew for a given key
>> Using sorted properties of the tables
>> Fragmented joins
>> n-way joins for different join keys by replicating data
Hive QL - Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Group By Optimizations
>> Map-side partial aggregations
> Hash-based aggregates
> Serialized key/values in hash tables
> 90% speed improvement on the query
> select count(1) from t;
>> Load balancing for data skew
Parameters
>> hive.map.aggr=true
>> hive.groupby.skewindata=false
>> hive.groupby.mapaggr.checkinterval=100000
>> hive.map.aggr.hash.percentmemory=0.5
>> hive.map.aggr.hash.min.reduction=0.5
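The skew parameter above changes the plan shape; a minimal sketch on pv_users:

```sql
SET hive.map.aggr=true;
-- with skewindata=true Hive generates two map-reduce jobs: the first
-- distributes rows randomly across reducers for partial aggregation,
-- the second produces the final aggregate per group key
SET hive.groupby.skewindata=true;
SELECT pageid, count(1) FROM pv_users GROUP BY pageid;
```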
Multi Group-By
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid), count(userid)
GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
SELECT age, count(DISTINCT userid)
GROUP BY age;
>> n+1 map-reduce jobs instead of 2n
>> Single scan of the input table
>> Same distinct key across all group-bys
>> Always use multi-group-by
Merging of small files
>> Lots of small files create problems for downstream jobs
> SELECT * FROM T WHERE x<10;
>> hive.merge.mapfiles=true
>> hive.merge.mapredfiles=false
>> hive.merge.size.per.task=256*1000*1000
>> Increases the time for the current query
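Putting the merge settings together; values here mirror the defaults listed above, with mapredfiles flipped on:

```sql
-- merge small files produced by map-only jobs (default on)
SET hive.merge.mapfiles=true;
-- also merge outputs of full map-reduce jobs (default off; enabling it
-- adds an extra merge step and so lengthens the current query)
SET hive.merge.mapredfiles=true;
-- target size of the merged files, in bytes
SET hive.merge.size.per.task=256000000;
SELECT * FROM T WHERE x < 10;
```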