
Hive Storage Formats and Compression, Part 1: Snappy + SequenceFile

2013-08-30 02:15
Why use SequenceFile:

a) It supports compression. b) The format is splittable, so multiple mappers can read it in parallel.

A passage from *Programming Hive*:

Compressing files results in space savings but one of the downsides of storing raw compressed files in Hadoop is that often these files are not splittable. Splittable files can be broken up and processed in parts by multiple mappers in parallel. Most compressed files are not splittable because you can only start reading from the beginning. The sequence file format supported by Hadoop breaks a file into blocks and then optionally compresses the blocks in a splittable way.
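That behavior is easy to sketch in Python. The snippet below is a simplified stand-in, using zlib instead of Snappy and made-up data rather than the actual SequenceFile container, to show why independently compressed blocks are splittable while a single compressed stream is not:

```python
import zlib

# Pretend this is a large delimited text file.
data = b"some,row\n" * 10000

# Whole-file compression: a single stream that can only be read from byte 0.
whole = zlib.compress(data)

# Block compression: compress fixed-size chunks independently, so a reader
# (e.g. a second mapper) can start at any block boundary.
BLOCK = 16 * 1024
blocks = [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

# Decompress a middle block without touching the blocks before it.
middle = zlib.decompress(blocks[3])
assert middle == data[3 * BLOCK:4 * BLOCK]
```

With the single stream, a mapper assigned the middle of the file has no way to start decoding there; with per-block compression it only needs the block boundaries.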
Now let's put it to use.

1. Set three parameters:

hive.exec.compress.output declares that the output of Hive queries should be compressed; the codec is then set to Snappy.
For SequenceFile there is also mapred.output.compression.type, which already defaults to BLOCK in CDH4.

SET hive.exec.compress.output=true;

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET mapred.output.compression.type=BLOCK;

> SET hive.exec.compress.output;
hive.exec.compress.output=false
hive (sequence)> SET hive.exec.compress.output=true;
hive (sequence)> SET mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
hive (sequence)> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive (sequence)> SET mapred.output.compression.type;
mapred.output.compression.type=BLOCK
hive (sequence)> set io.seqfile.compress.blocksize;
io.seqfile.compress.blocksize=1000000
2. Create a TEXTFILE table to supply the source data.

CREATE TABLE info_text(
......)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
3. Create the target SequenceFile table
CREATE TABLE info_sequ(
...... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;
4. Load data into the text table (a data file of 400+ MB)
LOAD DATA LOCAL INPATH 'file.txt' OVERWRITE INTO TABLE info_text;
5. Insert the data into the target SequenceFile table
INSERT INTO TABLE info_sequ SELECT * FROM info_text;
6. Verify the results match
select * from info_sequ limit 20;
select * from info_text limit 20;
7. Check the file sizes on HDFS
drwxrwxrwx   - hive hadoop          0 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ
-rw-r--r--   3 root hadoop  124330943 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ/000000_0
-rw-r--r--   3 root hadoop   77730350 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ/000001_0
drwxr-xr-x   - root hadoop          0 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_text
-rw-r--r--   3 root hadoop  438474539 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_text/file.txt
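A quick check of the numbers in the listing: the two Snappy-compressed SequenceFile parts come to about 202 MB against the 438 MB source, roughly 46% of the original size:

```python
# File sizes from the HDFS listing above (bytes).
sequ_parts = [124330943, 77730350]   # 000000_0, 000001_0
text_size = 438474539                # file.txt

ratio = sum(sequ_parts) / text_size
print(f"{sum(sequ_parts)} / {text_size} = {ratio:.1%}")
```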


Beginning of file 000000_0:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^A)org.apache.hadoop.io.compress.SnappyCodec^@^@^@^@<84>çé<9f>³ÅZ<97>/.¹*¿I²6ÿÿÿÿ<84>çé<9f>³ÅZ<97>/.¹*¿I²6<8e>2^M<8e>^Bg^@^@2


Beginning of file 000001_0:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^A)org.apache.hadoop.io.compress.SnappyCodec^@^@^@^@[Ü^^Ü<9c>rx1µå'^HçÕcöÿÿÿÿ[Ü^^Ü<9c>rx1µå'^HçÕcö<8e>2Y<8e>^Bj^@^@2Y^@^@^BbÙ
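The dumps above are SequenceFile headers: the magic bytes SEQ, a version byte (^F = 6), length-prefixed key and value class names (the " = 34 and ^Y = 25 bytes are the lengths), two boolean flags for compression and block compression (^A^A), the length-prefixed codec class name () = 41), then a metadata count and a 16-byte sync marker. Below is a small parser for the leading fields, a sketch that assumes single-byte length prefixes (true for class names this short), applied to a rebuilt copy of the header bytes seen in 000000_0:

```python
def read_java_string(buf: bytes, pos: int):
    """Read a length-prefixed class name (single-byte length is enough here)."""
    n = buf[pos]
    return buf[pos + 1:pos + 1 + n].decode(), pos + 1 + n

def parse_seqfile_header(buf: bytes) -> dict:
    assert buf[:3] == b"SEQ", "not a SequenceFile"
    version, pos = buf[3], 4
    key_class, pos = read_java_string(buf, pos)
    value_class, pos = read_java_string(buf, pos)
    compressed, block_compressed = bool(buf[pos]), bool(buf[pos + 1])
    pos += 2
    codec_class, pos = read_java_string(buf, pos)
    return {"version": version, "key_class": key_class,
            "value_class": value_class, "compressed": compressed,
            "block_compressed": block_compressed, "codec_class": codec_class}

# Rebuild the byte sequence shown at the start of 000000_0.
def s(name: str) -> bytes:
    return bytes([len(name)]) + name.encode()

header = (b"SEQ\x06"
          + s("org.apache.hadoop.io.BytesWritable")
          + s("org.apache.hadoop.io.Text")
          + b"\x01\x01"                    # compressed, block-compressed
          + s("org.apache.hadoop.io.compress.SnappyCodec"))
print(parse_seqfile_header(header))
```

The parsed output confirms what the SET commands asked for: block compression with the Snappy codec.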


Hive has two virtual columns:

INPUT__FILE__NAME: the input file name for the map task.

BLOCK__OFFSET__INSIDE__FILE: the current position within the file; as I understand it, much like the position of an NIO Buffer.
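For a plain text table, that position is simply the byte offset at which each row starts, which can be reproduced with an in-memory file and tell() (the rows here are made up for illustration, not Hive's actual data):

```python
import io

# Three fake delimited rows, each 10 bytes including the newline.
rows = [b"3246065;a\n", b"2037265;b\n", b"2287465;c\n"]
f = io.BytesIO(b"".join(rows))

offsets = []
while True:
    pos = f.tell()              # position before the row is read
    line = f.readline()
    if not line:
        break
    offsets.append(pos)

print(offsets)   # [0, 10, 20], each row's starting byte
```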

hive> describe formatted locallzo;
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: com.hadoop.mapred.DeprecatedLzoTextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

hive> describe formatted raw;
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compare the BLOCK__OFFSET__INSIDE__FILE values of the two tables:

With the data stored as a plain text file:
hive> select INPUT__FILE__NAME, unum, BLOCK__OFFSET__INSIDE__FILE from raw limit 10;
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 3246065 0
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2037265 73
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2287465 149
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 225
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 298
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 371
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 1629568 447
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2185765 523

With the data stored as an LZO-compressed file:
hive> select INPUT__FILE__NAME, office_no, BLOCK__OFFSET__INSIDE__FILE from locallzo limit 20;
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 3246065 0
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 2037265 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 2287465 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 1629568 28778
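The contrast can be mimicked with a toy model (the block offset and rows are invented to match the pattern above): in a text file each row reports its own starting byte, while rows decoded out of one compressed block all carry that block's starting offset, which is why the LZO rows repeat the same offset (28778) as the text rows advance row by row:

```python
rows = ["3246065,...", "2037265,...", "2287465,...", "6581865,..."]

# Text file: each row's offset is where its own bytes begin.
text_offsets, pos = [], 0
for r in rows:
    text_offsets.append(pos)
    pos += len(r) + 1            # +1 for the newline

# Compressed file: the reader can only seek to block boundaries, so every
# row unpacked from the same block carries the block's starting offset.
block_start = 28778
lzo_offsets = [block_start] * len(rows)

print(text_offsets)   # distinct and increasing
print(lzo_offsets)    # identical for every row in the block
```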