
Using Avro in Hive

2015-12-04 15:10
http://www.iteblog.com/archives/1007

To read Avro-formatted data in Hive, we can create the table with the following statement:

hive> CREATE EXTERNAL TABLE tweets
    > COMMENT "A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement"
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION '/user/wyp/examples/input/'
    > TBLPROPERTIES ('avro.schema.literal'='{
    >     "type": "record",
    >     "name": "Tweet",
    >     "namespace": "com.miguno.avro",
    >     "fields": [
    >         {"name": "username",  "type": "string"},
    >         {"name": "tweet",     "type": "string"},
    >         {"name": "timestamp", "type": "long"}
    >     ]
    > }');
OK
Time taken: 0.076 seconds

hive> describe tweets;
OK
username                string                  from deserializer
tweet                   string                  from deserializer
timestamp               bigint                  from deserializer
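A common source of errors with `avro.schema.literal` is a schema string that is not valid JSON (a stray comma, an unbalanced brace), which only surfaces when the table is queried. A quick sanity check before pasting the schema into the DDL — a minimal sketch using only the Python standard library:

```python
import json

# The schema literal exactly as embedded in the CREATE TABLE statement.
schema_literal = """{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "tweet",     "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}"""

# json.loads raises ValueError on malformed JSON, catching typos early.
schema = json.loads(schema_literal)

# The columns Hive derives come straight from the record's fields.
assert schema["type"] == "record"
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['username', 'tweet', 'timestamp']
```

The field names printed here are exactly the columns `describe tweets` reports above.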
Next we compress the data with Snappy. Before compression, our data looks like this:

{
    "username": "miguno",
    "tweet": "Rock: Nerf paper,scissors is fine.",
    "timestamp": 1366150681
},
{
    "username": "BlizzardCS",
    "tweet": "Works as intended.  Terran is IMBA.",
    "timestamp": 1366154481
},
{
    "username": "DarkTemplar",
    "tweet": "From the shadows I come!",
    "timestamp": 1366154681
},
{
    "username": "VoidRay",
    "tweet": "Prismatic core online!",
    "timestamp": 1366160000
}
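Each JSON record must line up with the Avro schema embedded in the table definition: the same three field names, `string` values for `username` and `tweet`, and an integer for the `long` `timestamp`. A minimal consistency check, sketched with only the Python standard library (Avro `string` is mapped to `str` and `long` to `int` here):

```python
import json

# The four sample records from above, as a JSON array.
records = json.loads("""[
  {"username": "miguno",      "tweet": "Rock: Nerf paper,scissors is fine.", "timestamp": 1366150681},
  {"username": "BlizzardCS",  "tweet": "Works as intended.  Terran is IMBA.", "timestamp": 1366154481},
  {"username": "DarkTemplar", "tweet": "From the shadows I come!",            "timestamp": 1366154681},
  {"username": "VoidRay",     "tweet": "Prismatic core online!",              "timestamp": 1366160000}
]""")

# Expected Python types for each Avro field ("string" -> str, "long" -> int).
expected = {"username": str, "tweet": str, "timestamp": int}

for rec in records:
    assert set(rec) == set(expected), rec            # same field names
    for name, typ in expected.items():
        assert isinstance(rec[name], typ), (name, rec[name])  # same types

print(len(records), "records match the schema")  # 4 records match the schema
```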
Suppose the compressed Avro data is stored in the file /home/wyp/twitter.avro; we copy it to the /user/wyp/examples/input/ directory on HDFS:

hadoop fs -put /home/wyp/twitter.avro /user/wyp/examples/input/
Then we can query it in Hive:

hive> select * from tweets limit 5;
OK
miguno        Rock: Nerf paper,scissors is fine.      1366150681
BlizzardCS    Works as intended.  Terran is IMBA.     1366154481
DarkTemplar   From the shadows I come!                1366154681
VoidRay       Prismatic core online!                  1366160000
Time taken: 0.495 seconds, Fetched: 4 row(s)
Of course, instead of embedding it, we can also take the schema from avro.schema.literal:

{
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {
            "name": "username",
            "type": "string"
        },
        {
            "name": "tweet",
            "type": "string"
        },
        {
            "name": "timestamp",
            "type": "long"
        }
    ]
}
and store it in a file of its own, say twitter.avsc. The CREATE TABLE statement above then becomes:

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/wyp/examples/input/'
TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/wyp/examples/schema/twitter.avsc'
);
The effect is the same as before.
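For the `avro.schema.url` variant the schema file must actually exist at the HDFS path named in `TBLPROPERTIES`. A small sketch that writes `twitter.avsc` locally using only the Python standard library (the upload to HDFS itself is then done with `hadoop fs -put`, as with the data file earlier):

```python
import json

schema = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username",  "type": "string"},
        {"name": "tweet",     "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# Write the schema to a local file; it can then be uploaded with:
#   hadoop fs -put twitter.avsc /user/wyp/examples/schema/
with open("twitter.avsc", "w") as f:
    json.dump(schema, f, indent=2)

# Round-trip check: the file on disk parses back to the same schema.
with open("twitter.avsc") as f:
    assert json.load(f) == schema
print("twitter.avsc written")
```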
Unless otherwise stated, all articles on this blog are original.

Please respect the original work; when reposting, cite the source: 过往记忆 (http://www.iteblog.com/).

Permanent link to this article: "Using Avro in Hive" (http://www.iteblog.com/archives/1007)