Hive Optimization

2017-11-14 11:06

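1. Fetch task: read data directly

Single query

Setting: set hive.fetch.task.conversion to more

Under minimal (the default in older Hive releases), only SELECT *, filters on partition columns, and LIMIT-only queries skip MapReduce; more extends this to simple SELECT, filter, and LIMIT queries in general.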
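A minimal sketch of the setting in a Hive session. The emp table comes from the EXPLAIN examples later in these notes; the ename column is an assumption made for illustration.

```sql
-- Allow simple SELECT / filter / LIMIT queries to run as a fetch task.
set hive.fetch.task.conversion=more;

-- With "more", a column projection plus LIMIT should be served by a
-- fetch task, so no MapReduce job is expected to start.
-- (ename is a hypothetical column.)
select ename from emp limit 10;

-- Aggregations still need a MapReduce job regardless of the setting.
select count(*) from emp;
```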

2. Split large tables into sub-tables, combine external tables with partitioned tables, and set the storage format and compression.

Split a large table into a sub-table: create table tablename as select column1, column2 from bigtablename;

Combine external and partitioned tables: create external table tablename(......) partitioned by (col_name1 type, col_name2 type) row format delimited fields terminated by '\t';

Set the storage format and compression codec:

set parquet.compression=SNAPPY;

create table tablename stored as parquet as select * from othertable;

A CREATE TABLE ... AS SELECT statement cannot target an external table, so the Parquet copy is created as a managed table.
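The three ideas in this section can be sketched together as follows. Every table name, column, and HDFS path here is hypothetical and chosen only for illustration.

```sql
-- Hypothetical external table over tab-delimited files, partitioned by date.
create external table if not exists log_text (
  ip  string,
  url string
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
location '/data/logs/text';

-- Register one day's directory as a partition of the external table.
alter table log_text add if not exists partition (dt = '2017-11-14')
location '/data/logs/text/2017-11-14';

-- Write a Snappy-compressed Parquet sub-table holding only the needed columns.
set parquet.compression=SNAPPY;

create table log_parquet stored as parquet as
select ip, url
from log_text
where dt = '2017-11-14';
```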

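3. SQL statement optimization

Join: set the corresponding join optimization properties to true (hive.auto.convert.join, for example).

For example, running the subquery first and then joining is better than joining first and then running the subquery.

create table a (k1 string, v1 string);

create table b (k2 string, v2 string);

select k1, v1, k2, v2 from a join b on k1 = k2;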
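To make the "subquery before join" advice concrete, here is a sketch using tables a and b from above. The filter values 'x' and 'y' are made up. Note that Hive's predicate pushdown (hive.optimize.ppd) may already rewrite the first form into the second, so the difference shows mostly when pushdown is disabled or cannot apply.

```sql
-- Join first, filter afterwards: without predicate pushdown, every row of
-- a and b takes part in the join before the filter runs.
select k1, v1, k2, v2
from a join b on k1 = k2
where v1 = 'x' and v2 = 'y';

-- Filter in subqueries first, then join the reduced inputs.
select sa.k1, sa.v1, sb.k2, sb.v2
from (select k1, v1 from a where v1 = 'x') sa
join (select k2, v2 from b where v2 = 'y') sb
  on sa.k1 = sb.k2;
```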

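Map join

The join is performed in the map task.

Small table joined with a large table:

- The large table's data is read from files.

- The small table's data is read from memory, distributed through the DistributedCache.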
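A sketch of letting Hive choose a map join automatically. The tables big_fact and small_dim and their columns are hypothetical; the threshold shown is the commonly documented default, not a recommendation.

```sql
-- Let the optimizer convert a common join into a map join when one side is small.
set hive.auto.convert.join=true;

-- Size threshold (in bytes) below which a table counts as "small".
set hive.mapjoin.smalltable.filesize=25000000;

-- small_dim is loaded into memory on each mapper; big_fact is streamed from files.
select f.id, f.amount, d.name
from big_fact f
join small_dim d on f.dim_id = d.id;
```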

Reduce join

The join is performed in the reduce task.

Large table joined with a large table:

The data of every table is read from files.

SMB join

Sort-Merge-Bucket

Both tables have the same number of buckets.

For example, to query all columns of two tables: select * from customer join orders on customer.id = orders.cid;

Cluster and sort by the most common join key.

create table orders (cid int, price float, quantity int) clustered by (cid) sorted by (cid) into 32 buckets;

create table customer (id int, first string, last string) clustered by (id) sorted by (id) into 32 buckets;

(Conceptual illustration with 3 buckets: ids 0-20, 20-40, 40-60.)

The join then runs bucket by bucket.
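A sketch of the session settings usually involved in an SMB join, assuming two tables named customer and orders that are bucketed and sorted on the join key into the same number of buckets. The table and column names are assumptions; the two hive.enforce.* properties only matter on Hive releases before 2.0.

```sql
-- Before Hive 2.0: make inserts actually write bucketed, sorted files.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- Allow bucket map joins and their sort-merge variant.
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;

-- Each mapper merges one bucket of customer with the matching bucket of orders.
select c.id, c.first, c.last, o.price, o.quantity
from customer c
join orders o on c.id = o.cid;
```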

4. Data skew

GROUP BY easily causes data skew, for example when many rows have a NULL key.

count(distinct xx) also easily causes data skew.
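Some common mitigations, sketched under the assumption of a hypothetical table visits with a column uid. The two properties are standard Hive settings; whether they help depends on the data.

```sql
-- Partial aggregation on the map side.
set hive.map.aggr=true;

-- Two-stage GROUP BY: keys are first spread randomly over reducers,
-- then aggregated by the real key in a second job.
set hive.groupby.skewindata=true;

-- Instead of: select count(distinct uid) from visits;
-- deduplicate with GROUP BY so the work is shared by many reducers.
select count(*)
from (select uid from visits group by uid) t;

-- Keep NULL keys out of the shuffle when they are not needed.
select uid, count(*) as cnt
from visits
where uid is not null
group by uid;
```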

5. EXPLAIN execution plan (check how many MR jobs a query runs)

Stages (one per MR job) ---- root stage

---- dependent stage (depends on the root stage)

---- dependent stage (depends on the previous dependent stage)

explain select * from emp;    (Stage-0 only)

explain extended select deptname, avg(salary) from emp group by deptname;

6. Parallel execution

Job1: a join b, producing aa

Job2: c join d, producing bb

Job3: aa join bb, producing ee

To let Job1 and Job2 run in parallel, set hive.exec.parallel.thread.number to 10

and hive.exec.parallel to true.
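The three-job pattern above can be sketched as one query whose two subqueries are independent of each other. The columns id and val on tables a, b, c, and d are assumptions made for this example.

```sql
-- Allow independent stages of one query to run at the same time.
set hive.exec.parallel=true;
-- Upper bound on the number of stages running concurrently.
set hive.exec.parallel.thread.number=10;

-- aa and bb do not depend on each other, so their jobs can overlap;
-- the final join has to wait for both.
select aa.id, aa.val_b, bb.val_d
from (select a.id, b.val as val_b from a join b on a.id = b.id) aa
join (select c.id, d.val as val_d from c join d on c.id = d.id) bb
  on aa.id = bb.id;
```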

7. JVM reuse

mapreduce.job.jvm.numtasks: do not set it above 9

Saves JVM startup time.

One JVM runs multiple map tasks.
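As a sketch, the property is set per session like any other. The value 5 is only an example within the limit suggested above, and on YARN-based clusters the setting may have no effect, so treat this as an assumption to verify on your own cluster.

```sql
-- Let one JVM run up to 5 tasks of the same job before it exits
-- (example value; 1 means no reuse).
set mapreduce.job.jvm.numtasks=5;
```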

8. Number of reducers

mapreduce.job.reduces: the number of reducers (determine by testing)

Set it according to test results.
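Two ways to influence the reducer count, as a sketch. All numbers are placeholders to be replaced by whatever your own tests suggest.

```sql
-- Option 1: let Hive estimate the count from the input size.
set hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes per reducer
set hive.exec.reducers.max=100;                      -- upper bound on reducers

-- Option 2: fix the count explicitly (overrides the estimate).
set mapreduce.job.reduces=8;
```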

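9. Speculative execution (MapReduce tuning)

Reduce tasks run at different speeds. When the ApplicationMaster detects a slow one, it launches a duplicate task, and whichever of the two finishes first is used. Hive operations are time-consuming in themselves, so speculative execution is not recommended.

hive.mapred.reduce.tasks.speculative.execution: the default is true; change it to false

Also set the following two parameters to false:

mapreduce.map.speculative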

mapreduce.reduce.speculative
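Put together, turning speculative execution off for a Hive session looks roughly like this:

```sql
-- Hive-level switch for speculative reduce tasks.
set hive.mapred.reduce.tasks.speculative.execution=false;

-- MapReduce-level switches for map and reduce tasks.
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
```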

10. Number of map tasks: determined mainly by the size of the input files.
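Because the map count follows the input splits, it is usually adjusted through the split size. A sketch with placeholder byte values:

```sql
-- Combine small files into larger splits instead of one mapper per file.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Larger maximum split size means fewer mappers; smaller means more.
set mapreduce.input.fileinputformat.split.maxsize=256000000;
set mapreduce.input.fileinputformat.split.minsize=128000000;
```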

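11. Dynamic partition tuning

Log table log_src:

Table type: external partitioned table (yyyy-MM/dd), loaded by Flume or manually.

Sub-tables: all must be partitioned by yyyy-MM/dd, so enable dynamic partitioning.

IP: table log_ip

Page: table log_page

Refer: table log_refer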
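A sketch of loading one sub-table with dynamic partitions. The partition columns ym and dd and the data column ip are hypothetical names standing in for the yyyy-MM/dd layout; the limits are example values.

```sql
-- Enable dynamic partitioning and allow all partition columns to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Guard rails on how many partitions one statement may create.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- The partition values come from the last columns of the SELECT, in order.
insert overwrite table log_ip partition (ym, dd)
select ip, ym, dd
from log_src;
```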

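12. hive.mapred.mode=strict

Strict mode forbids three kinds of queries:

1. A query on a partitioned table cannot run without a filter on a partition column (in the WHERE clause).

2. An ORDER BY statement must include a LIMIT clause.

3. Cartesian product queries are restricted (a join written with WHERE instead of ON).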
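The three restrictions can be illustrated with pairs of queries. The partition column ym on log_src, the ip columns, and the column names on emp are assumptions; the statements marked "rejected" are expected to fail at compile time under strict mode.

```sql
set hive.mapred.mode=strict;

-- 1. Partitioned table: a partition filter is required.
-- rejected:  select * from log_src;
select * from log_src where ym = '2017-11';

-- 2. ORDER BY needs a LIMIT.
-- rejected:  select deptname, salary from emp order by salary;
select deptname, salary from emp order by salary limit 100;

-- 3. No Cartesian products: put the join condition in ON.
-- rejected:  select * from log_ip i join log_page p where i.ip = p.ip;
select * from log_ip i join log_page p on i.ip = p.ip
where i.ym = '2017-11' and p.ym = '2017-11';
```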
