
Using Avro in Hive

2015-12-04 15:10
http://www.iteblog.com/archives/1007

To read Avro-formatted data in Hive, we can create the table with the following statement:

hive> CREATE EXTERNAL TABLE tweets
    > COMMENT "A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement"
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION '/user/wyp/examples/input/'
    > TBLPROPERTIES ('avro.schema.literal'='{
    >     "type": "record",
    >     "name": "Tweet",
    >     "namespace": "com.miguno.avro",
    >     "fields": [
    >         {"name": "username",  "type": "string"},
    >         {"name": "tweet",     "type": "string"},
    >         {"name": "timestamp", "type": "long"}
    >     ]
    > }');
OK
Time taken: 0.076 seconds

hive> describe tweets;
OK
username                string                  from deserializer
tweet                   string                  from deserializer
timestamp               bigint                  from deserializer
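A common source of errors with `avro.schema.literal` is a schema string that is not valid JSON (a stray comma, an unbalanced brace), which only surfaces when the table is queried. A quick sanity check before pasting the schema into the DDL — a minimal sketch using only the Python standard library:

```python
import json

# The schema literal exactly as embedded in the CREATE TABLE statement.
schema_literal = """{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "tweet",     "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}"""

# json.loads raises ValueError on malformed JSON, catching typos early.
schema = json.loads(schema_literal)

# The columns Hive derives come straight from the record's fields.
assert schema["type"] == "record"
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['username', 'tweet', 'timestamp']
```

The field names printed here are exactly the columns `describe tweets` reports above.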
Next we compress the data with Snappy. Before compression, our data looks like this:

{
    "username": "miguno",
    "tweet": "Rock: Nerf paper,scissors is fine.",
    "timestamp": 1366150681
},
{
    "username": "BlizzardCS",
    "tweet": "Works as intended.  Terran is IMBA.",
    "timestamp": 1366154481
},
{
    "username": "DarkTemplar",
    "tweet": "From the shadows I come!",
    "timestamp": 1366154681
},
{
    "username": "VoidRay",
    "tweet": "Prismatic core online!",
    "timestamp": 1366160000
}
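Each JSON record must line up with the Avro schema embedded in the table definition: the same three field names, `string` values for `username` and `tweet`, and an integer for the `long` `timestamp`. A minimal consistency check, sketched with only the Python standard library (Avro `string` is mapped to `str` and `long` to `int` here):

```python
import json

# The four sample records from above, as a JSON array.
records = json.loads("""[
  {"username": "miguno",      "tweet": "Rock: Nerf paper,scissors is fine.", "timestamp": 1366150681},
  {"username": "BlizzardCS",  "tweet": "Works as intended.  Terran is IMBA.", "timestamp": 1366154481},
  {"username": "DarkTemplar", "tweet": "From the shadows I come!",            "timestamp": 1366154681},
  {"username": "VoidRay",     "tweet": "Prismatic core online!",              "timestamp": 1366160000}
]""")

# Expected Python types for each Avro field ("string" -> str, "long" -> int).
expected = {"username": str, "tweet": str, "timestamp": int}

for rec in records:
    assert set(rec) == set(expected), rec            # same field names
    for name, typ in expected.items():
        assert isinstance(rec[name], typ), (name, rec[name])  # same types

print(len(records), "records match the schema")  # 4 records match the schema
```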
Suppose the compressed Avro data is stored in the file /home/wyp/twitter.avro; we copy it to the /user/wyp/examples/input/ directory on HDFS:

hadoop fs -put /home/wyp/twitter.avro /user/wyp/examples/input/
Then we can query it in Hive:

hive> select * from tweets limit 5;
OK
miguno        Rock: Nerf paper,scissors is fine.      1366150681
BlizzardCS    Works as intended.  Terran is IMBA.     1366154481
DarkTemplar   From the shadows I come!                1366154681
VoidRay       Prismatic core online!                  1366160000
Time taken: 0.495 seconds, Fetched: 4 row(s)
Of course, instead of embedding it, we can also take the schema from avro.schema.literal:

{
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {
            "name": "username",
            "type": "string"
        },
        {
            "name": "tweet",
            "type": "string"
        },
        {
            "name": "timestamp",
            "type": "long"
        }
    ]
}
and store it in a file of its own, say twitter.avsc. The CREATE TABLE statement above then becomes:

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/wyp/examples/input/'
TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/wyp/examples/schema/twitter.avsc'
);
The effect is the same as before.
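For the `avro.schema.url` variant the schema file must actually exist at the HDFS path named in `TBLPROPERTIES`. A small sketch that writes `twitter.avsc` locally using only the Python standard library (the upload to HDFS itself is then done with `hadoop fs -put`, as with the data file earlier):

```python
import json

schema = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# Write the schema to a local file; it can then be uploaded with:
#   hadoop fs -put twitter.avsc /user/wyp/examples/schema/
with open("twitter.avsc", "w") as f:
    json.dump(schema, f, indent=2)

# Round-trip check: the file on disk parses back to the same schema.
with open("twitter.avsc") as f:
    assert json.load(f) == schema
print("twitter.avsc written")
```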
Unless otherwise stated, all articles on this blog are original.

Please respect the original work; when reposting, cite the source: 过往记忆 (http://www.iteblog.com/).

Permanent link to this article: "Using Avro in Hive" (http://www.iteblog.com/archives/1007)