
Ways to pass external variables to Hive

2016-02-07 18:29

Two ways to use variables in Hive development

2013/09/13 by Crazyant

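When developing data-analysis code with Hive, you often need to change run-time parameters. For example, the value of a date field in a select statement may differ from run to run, so the date has to be settable dynamically. When the code base is large and has many parameters, replacing literals with variables becomes essential. This article summarizes two ways to pass parameters into Hive SQL to meet this kind of need.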

Preparing the test table and test data

First, prepare the test table and the test data used in the rest of the article:

    hive> create database test;
    OK
    Time taken: 2.606 seconds

Then run the SQL file that creates the table and loads the data:

    [czt@www.crazyant.net testHivePara]$ hive -f student.sql
    Hive history file=/tmp/crazyant.net/hive_job_log_czt_201309131615_1720869864.txt
    OK
    Time taken: 2.131 seconds
    OK
    Time taken: 0.878 seconds
    Copying data from file:/home/users/czt/testdata_student
    Copying file: file:/home/users/czt/testdata_student
    Loading data to table test.student
    OK
    Time taken: 1.76 seconds

The contents of student.sql are as follows:

    use test;

    -- Student information table
    create table IF NOT EXISTS student(
    sno bigint comment 'student number'
    , sname string comment 'name'
    , sage bigint comment 'age'
    , pdate string comment 'enrollment date'
    )
    COMMENT 'Student information table'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH
    '/home/users/czt/testdata_student'
    INTO TABLE student;

The testdata_student test data file contains the following tab-separated rows:

    1	name1	21	20130901
    2	name2	22	20130901
    3	name3	23	20130901
    4	name4	24	20130901
    5	name5	25	20130902
    6	name6	26	20130902
    7	name7	27	20130902
    8	name8	28	20130902
    9	name9	29	20130903
    10	name10	30	20130903
    11	name11	31	20130903
    12	name12	32	20130904
    13	name13	33	20130904
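
For readers who want to reproduce the setup, the test data can be regenerated with a short script. This is only a sketch: the helper name gen_student_rows and the output path ./testdata_student are assumptions and not part of the original post. The rows it prints mirror the sample data shown above.

```shell
#!/bin/bash
# Sketch: regenerate the 13 tab-separated rows of testdata_student.
# gen_student_rows is a hypothetical helper name, not from the original post.
gen_student_rows() {
    local i pdate
    for i in $(seq 1 13); do
        # Enrollment dates follow the grouping used in the sample data.
        if   [ "$i" -le 4 ];  then pdate=20130901
        elif [ "$i" -le 8 ];  then pdate=20130902
        elif [ "$i" -le 11 ]; then pdate=20130903
        else                       pdate=20130904
        fi
        # Columns: sno, sname, sage, pdate, separated by tabs as student.sql expects.
        printf '%s\t%s\t%s\t%s\n' "$i" "name$i" "$((20 + i))" "$pdate"
    done
}

# Assumed output location; student.sql loads /home/users/czt/testdata_student.
gen_student_rows > ./testdata_student
```

The path in the LOAD DATA statement would need to point at wherever the generated file ends up.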

Method 1: set variables in the shell and use them directly in hive -e

The test shell script, shellhive.sh:

    #!/bin/bash
    tablename="student"
    limitcount="8"

    hive -S -e "use test; select * from ${tablename} limit ${limitcount};"

Output:

    [czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
    + tablename=student
    + limitcount=8
    + hive -S -e 'use test; select * from student limit 8;'
    1 name1 21 20130901
    2 name2 22 20130901
    3 name3 23 20130901
    4 name4 24 20130901
    5 name5 25 20130902
    6 name6 26 20130902
    7 name7 27 20130902
    8 name8 28 20130902

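Because Hive itself is an SQL-like language, it lacks the flexibility and flow control of the shell, so the shell + Hive development pattern is very common: variables defined in the shell can be referenced directly in the hive -e statement.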
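
The same pattern extends to the date filter mentioned in the introduction. The following is a minimal sketch, assuming the test.student table from above; the helper name build_query and the use of the first script argument as the date are illustrative assumptions, not part of the original post. When hive is not on the PATH, the script only prints the statement it would run.

```shell
#!/bin/bash
# Sketch of method 1: every value is an ordinary shell variable that the
# shell expands inside the double-quoted string before hive sees it.
# build_query is a hypothetical helper name, not from the original post.
build_query() {
    local tablename="$1" pdate="$2" limitcount="$3"
    echo "use test; select * from ${tablename} where pdate='${pdate}' limit ${limitcount};"
}

# Take the date from the first argument, falling back to a sample-data date.
pdate="${1:-20130902}"
sql="$(build_query student "$pdate" 8)"

if command -v hive >/dev/null 2>&1; then
    hive -S -e "$sql"
else
    # No hive on PATH: just show the statement that would be executed.
    echo "$sql"
fi
```

In a scheduled job the date would typically come from the date command (for example date +%Y%m%d) instead of an argument.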

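Note: variables defined with -hiveconf cannot be used directly inside a double-quoted hive -e string, because the shell expands ${...} before Hive sees it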

Modify the shell script above so that the date parameter is defined with -hiveconf:

    #!/bin/bash
    tablename="student"
    limitcount="8"

    hive -S \
    -hiveconf enter_school_date="20130902" \
    -hiveconf min_age="26" \
    -e \
    " use test; \
    select * from ${tablename} \
    where \
    pdate='${hiveconf:enter_school_date}' \
    and \
    sage>'${hiveconf:min_age}' \
    limit ${limitcount};"

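This does not work as intended. The script runs in a shell, so the shell itself tries to expand ${hiveconf:enter_school_date} and ${hiveconf:min_age}; no such shell variables are defined, so empty strings are substituted in their place. At run time the SQL statement is expanded into the following:

    + hive -S -hiveconf enter_school_date=20130902 -hiveconf min_age=26 -e 'use test; explain select * from student where pdate='\'''\'' and sage>'\'''\'' limit 8;'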
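
One possible workaround, not covered in the original post, is to escape the dollar sign so that the shell passes ${hiveconf:...} through literally; Hive's own variable substitution should then fill in the values, assuming hive.variable.substitute is left at its default of true. The sketch below reuses the variable names from the script above and only prints the statement when hive is not on the PATH.

```shell
#!/bin/bash
# Sketch: keep the shell from expanding ${hiveconf:...} so that Hive can
# substitute it instead. Variable names follow the script above.
tablename="student"
limitcount="8"

# \${...} is passed through literally; ${tablename} and ${limitcount}
# are still expanded by the shell.
sql="use test; select * from ${tablename} where pdate='\${hiveconf:enter_school_date}' and sage>'\${hiveconf:min_age}' limit ${limitcount};"

if command -v hive >/dev/null 2>&1; then
    hive -S \
        -hiveconf enter_school_date="20130902" \
        -hiveconf min_age="26" \
        -e "$sql"
else
    # No hive on PATH: print the statement exactly as hive would receive it.
    echo "$sql"
fi
```

Putting the whole statement in single quotes has the same effect, but then ${tablename} and ${limitcount} are no longer expanded by the shell either.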

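Method 2: define variables with -hiveconf and use them in a SQL file

Because line breaks and the like are awkward to handle, hive -e only suits small amounts of SQL. In practice most code is written in HQL files and invoked with hive -f; variables can then be defined with -hiveconf and used directly in the SQL.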

First write the calling shell script:

    #!/bin/bash

    hive -hiveconf enter_school_date="20130902" -hiveconf min_ag="26" -f testvar.sql

Contents of the testvar.sql file it calls:

    use test;

    select * from student
    where
    pdate='${hiveconf:enter_school_date}'
    and
    sage > '${hiveconf:min_ag}'
    limit 8;

Execution:

    [czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh
    + hive -hiveconf enter_school_date=20130902 -hiveconf min_ag=26 -f testvar.sql
    Hive history file=/tmp/czt/hive_job_log_czt_201309131651_2035045625.txt
    OK
    Time taken: 2.143 seconds
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Kill Command = hadoop job -kill job_20130911213659_42303
    2013-09-13 16:52:00,300 Stage-1 map = 0%, reduce = 0%
    2013-09-13 16:52:14,609 Stage-1 map = 28%, reduce = 0%
    2013-09-13 16:52:24,642 Stage-1 map = 71%, reduce = 0%
    2013-09-13 16:52:34,639 Stage-1 map = 98%, reduce = 0%
    Ended Job = job_20130911213659_42303
    OK
    7 name7 27 20130902
    8 name8 28 20130902
    Time taken: 54.268 seconds
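
The hard-coded values in the calling script can also come from the command line. The following is a minimal sketch, assuming the same testvar.sql and the variable names enter_school_date and min_ag used above; the helper name build_hive_args and the positional-argument convention are illustrative assumptions. When hive is not on the PATH, the script only prints the command it would run.

```shell
#!/bin/bash
# Sketch of a parameterized wrapper for method 2.
# build_hive_args is a hypothetical helper; it prints one argument per line.
build_hive_args() {
    local enter_school_date="${1:-20130902}" min_ag="${2:-26}"
    printf '%s\n' \
        -hiveconf "enter_school_date=${enter_school_date}" \
        -hiveconf "min_ag=${min_ag}" \
        -f testvar.sql
}

# Read the printed arguments into an array, one element per line.
args=()
while IFS= read -r line; do
    args+=("$line")
done < <(build_hive_args "$@")

if command -v hive >/dev/null 2>&1; then
    hive "${args[@]}"
else
    # No hive on PATH: show the command that would be executed.
    echo "hive ${args[*]}"
fi
```

Called with, say, 20130903 and 30 as arguments, the wrapper passes those values to testvar.sql in place of the defaults.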

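Summary

This article described two ways to use variables in Hive. The first defines variables in the shell and references them as ${var_name} in the SQL statement passed to hive -e. The second uses the form hive -hiveconf key=value -f run.sql to set variables with -hiveconf and references them as ${hiveconf:varname} in the SQL file. Either approach covers the need to pass parameters to Hive during development, and can noticeably improve development efficiency and code quality.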
