Data Warehouse (8) --- Hive Performance Optimization --- Hive Dynamic Partitioning

2018-02-27 11:13

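In the previous article we learned how to partition a table by hand.

Data Warehouse (7) --- Hive Performance Optimization --- Hive Partitioning and Bucketing

However, once a table is partitioned, newly inserted data is not routed to partitions automatically; the target partition has to be specified by hand again for every insert.

In a relational database such as Oracle, when you insert into a partitioned table the database automatically places each row in the right partition based on the value of the partitioning column. Hive offers a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before it can be used.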

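Types of partitioning

There are two types of partitioning:

Static partitioning (static partition)

Dynamic partitioning (dynamic partition)

The difference lies in how the partition is chosen when data is loaded: with static partitioning you type the partition name by hand, while with dynamic partitioning Hive derives it from the data. For bulk loads of large data sets, dynamic partitioning is clearly simpler and more convenient.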

Configuring dynamic partitioning

Change Hive's default settings to enable dynamic partitioning:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
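
To confirm that the settings took effect in the current session, a quick check like the one below can help. In the Hive CLI and Beeline, SET followed by a property name and no value prints the current value. Settings made with SET only apply to the current session, so they have to be issued again in each new session or placed in hive-site.xml.

```sql
-- SET with a property name but no value only displays the current value
set hive.exec.dynamic.partition;
set hive.exec.dynamic.partition.mode;
```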

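Related parameters

hive.exec.dynamic.partition

Default: false (in Hive releases before 0.9.0; later releases default to true)

Whether dynamic partitioning is enabled.

It must be true whenever dynamic partitioning is used.

hive.exec.dynamic.partition.mode

Default: strict

The dynamic partitioning mode. In strict mode at least one partition column must be given a static value; in nonstrict mode all partition columns may be dynamic.

It usually needs to be set to nonstrict.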

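hive.exec.max.dynamic.partitions.pernode

Default: 100

The maximum number of dynamic partitions that each mapper or reducer node may create.

Set it according to the actual data.

For example, if the source data covers a full year, the day column has 365 distinct values, and a single mapper or reducer may see all of them. The parameter then needs to be at least 365; with the default of 100 the job fails with an error.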
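
One way to handle the one-year case above, sketched under assumptions: the tables t_log and t_log_src and their columns are hypothetical and used only for illustration. The SET statement raises the per-node limit, and the DISTRIBUTE BY clause is a commonly used technique that sends all rows with the same day to the same reducer, so each reducer writes only a subset of the partitions.

```sql
-- Hypothetical tables: t_log(id int, msg string) partitioned by (day string),
-- t_log_src(id int, msg string, day string) holding one year of data.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Allow a single mapper/reducer to create up to 400 partitions (365 are needed)
set hive.exec.max.dynamic.partitions.pernode=400;

-- DISTRIBUTE BY day routes rows with the same day to the same reducer,
-- which keeps down the number of partitions each reducer has to write
insert overwrite table t_log partition(day)
select id, msg, day
from t_log_src
distribute by day;
```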

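hive.exec.max.dynamic.partitions

Default: 1000

The maximum total number of dynamic partitions that may be created across all mapper and reducer nodes.

The same reasoning applies as for the previous parameter.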
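
Before choosing values for the two limits, it can help to count how many partitions the insert will actually create. A minimal sketch, using the t_student_info source table from the examples later in this article; the value 2000 is an arbitrary illustration, not a recommendation.

```sql
-- Number of distinct partition values the dynamic insert would produce
select count(distinct age) from t_student_info;

-- If the count exceeds the defaults, raise the limits for this session.
-- The total limit should be no smaller than the per-node limit.
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.exec.max.dynamic.partitions=2000;
```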

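hive.exec.max.created.files

Default: 100000

The maximum number of HDFS files that a whole MapReduce job may create.

The default is usually enough. Raise it only if the data volume is so large that more than 100000 files have to be created.

hive.error.on.empty.partition

Default: false

Whether to throw an exception when a dynamic partition insert produces an empty partition.

It usually does not need to be set.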

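Examples

Static partitioning

Create a partitioned table t_student and load the source table t_student_info_25 into the partition age=25. In other words, if the data covers 30-odd age groups, a similar statement has to be run 30-odd times.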

create table if not exists t_student(id int,name string,tel string) partitioned by(age int)
row format delimited fields terminated by ','
stored as textfile;

-- overwrite replaces the existing data in the partition; into appends to it
-- the partition value is fixed in the partition clause, so age is not selected
insert into table t_student
partition(age=25)
select id,name,tel from t_student_info_25;
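
To make the repetition concrete, here is what loading a second age group would look like. The source table t_student_info_26 is hypothetical and assumed to have the same columns as t_student_info_25. SHOW PARTITIONS then lists the partitions created so far.

```sql
-- One statement per partition value: the value is fixed in the PARTITION clause,
-- so the partition column is not part of the select list
insert into table t_student partition(age=26)
select id, name, tel from t_student_info_26;

-- List the partitions that exist on the table
show partitions t_student;
```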

Dynamic partitioning

-- true enables dynamic partitioning
set hive.exec.dynamic.partition=true;
-- nonstrict allows all partition columns to be dynamic (the default is strict)
set hive.exec.dynamic.partition.mode=nonstrict;

-- insert overwrite replaces the data in the partitions it writes; insert into appends
insert overwrite table t_student
partition(age)
select id,name,tel,age from t_student_info;
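
After the dynamic insert finishes, the result can be inspected as follows. This is a small sketch; the partition value 25 is only an example and assumes that this age occurs in the source data.

```sql
-- One partition per distinct age value found in t_student_info
show partitions t_student;

-- Row count per partition
select age, count(*) as cnt from t_student group by age;

-- A filter on the partition column reads only the matching partition directory
select id, name, tel from t_student where age = 25;
```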

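Notes

The columns selected from the source table must be in the same order as the columns of the partitioned table, because Hive matches them by position rather than by name, and a partitioned table keeps its partition columns last. Inserting with a plain select * can therefore put values into the wrong columns.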
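
The positional matching can be illustrated like this. Assume, purely for illustration, a source table that was created with age in the second position; the table name t_student_src is hypothetical.

```sql
-- Hypothetical source table whose column order differs from the target
create table if not exists t_student_src(id int, age int, name string, tel string);

-- Wrong: select * yields (id, age, name, tel), so by position age lands in
-- the name column, name lands in tel, and tel is used as the partition value.
-- insert overwrite table t_student partition(age)
-- select * from t_student_src;

-- Right: list the columns explicitly, with the partition column last
insert overwrite table t_student partition(age)
select id, name, tel, age from t_student_src;
```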

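If the type of a selected column differs from the type of the target column, Hive converts the value, but the conversion is not guaranteed to succeed. For example, if the query returns an int and the target column is a string, the int can be converted to a string. If the query returns a string and the target column is an int, the conversion may fail, because letters cannot be converted to an int; values that fail to convert become NULL.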
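
The conversion behavior can be checked directly with CAST. The sketch below also shows one way to keep unconvertible rows out of the target table; t_student_raw is a hypothetical staging table in which age is stored as a string.

```sql
-- A numeric string converts; a non-numeric string becomes NULL
select cast('25' as int), cast('abc' as int);

-- Hypothetical staging table with age stored as a string:
-- t_student_raw(id int, name string, tel string, age string)
insert overwrite table t_student partition(age)
select id, name, tel, cast(age as int)
from t_student_raw
where cast(age as int) is not null;   -- skip rows whose age is not a number
```

Without the filter, rows whose dynamic partition value is NULL end up in Hive's default partition (__HIVE_DEFAULT_PARTITION__).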

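Mixing static and dynamic partitions

In this section SP stands for static partition and DP for dynamic partition. The statements assume that t_student has the columns (id int, name string) and is partitioned by (time string, age int), as in the CTAS example further down.

All partitions dynamic (DP)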

INSERT OVERWRITE TABLE t_student PARTITION (time, age)
SELECT id, name, time, age FROM t_student_info WHERE time is not null and age>10;

Combining DP and SP

-- time is given statically, so only the dynamic column age is selected
INSERT OVERWRITE TABLE t_student PARTITION (time='2018-02-27', age)
SELECT id, name, age FROM t_student_info WHERE time is not null and age>10;

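Note

When an SP column is a sub-partition of a DP column, the following DML fails, because the order of the partition columns determines the directory hierarchy in HDFS, and that cannot be changed.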

-- Fails: the static column age comes after the dynamic column time
INSERT OVERWRITE TABLE t_student PARTITION (time, age = 11)
SELECT id, name, time FROM t_student_info WHERE time is not null and age=11;
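
A working alternative, as a sketch: leave both partition columns dynamic and restrict age in the WHERE clause instead. This assumes nonstrict mode, since no static partition value is given.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Both partition columns are dynamic; the filter limits age to 11
INSERT OVERWRITE TABLE t_student PARTITION (time, age)
SELECT id, name, time, age FROM t_student_info WHERE time is not null and age = 11;
```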

Multi-table insert

-- The source table is named once in the leading FROM; each SELECT then reads from it.
-- t_student_12 is assumed to have the same schema as t_student.
FROM t_student_info
INSERT OVERWRITE TABLE t_student PARTITION (time='2018-02-27', age)
SELECT id, name, age WHERE time is not null and age>10
INSERT OVERWRITE TABLE t_student_12 PARTITION (time='2018-02-27', age=12)
SELECT id, name WHERE time is not null and age = 12;

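CTAS

The CREATE TABLE AS SELECT (CTAS) syntax differs slightly when DP and SP are involved, because the schema of the target table cannot be derived completely from the select statement. The partition columns therefore have to be declared in the create statement. Note that many Hive releases reject CTAS into a partitioned table, so check the version in use before relying on this syntax.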

CREATE TABLE t_student (id int, name string) PARTITIONED BY (time string, age int) AS
SELECT id, name, time, age+1 age1 FROM t_student_info WHERE time is not null and age>10;

The statement above shows CTAS with DP. To put a constant of your own into a partition column, you can write:

CREATE TABLE t_student (id int, name string) PARTITIONED BY (time string, age int) AS
SELECT id, name, "2018-02-27", age+1 age1 FROM t_student_info WHERE time is not null and age>10;
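
On Hive versions where CTAS into a partitioned table is not accepted, the same result can be reached in two steps: create the partitioned table first, then fill it with a dynamic partition insert. A sketch, with t_student_next as a hypothetical target table name:

```sql
CREATE TABLE IF NOT EXISTS t_student_next (id int, name string)
PARTITIONED BY (time string, age int);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last two select columns feed the partition columns time and age, by position
INSERT OVERWRITE TABLE t_student_next PARTITION (time, age)
SELECT id, name, time, age + 1 FROM t_student_info WHERE time is not null and age > 10;
```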

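Summary

After this chapter we should know that for tables with a very large number of second-level partitions, dynamic partitioning loads the data with far less manual work, and that when static and dynamic partitions are combined, the static partition columns must come before the dynamic ones.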
