Decompressing Snappy, LZO, bzip2, gzip, and deflate files
2013-12-02 14:26
Snappy, LZO, bzip2, gzip, and deflate are all compression formats commonly used with Hive, each with its own strengths. Here we focus only on decompressing concrete files in these formats.
I. The code
package compress;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

    public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the codec classes; the LZO/LZOP codecs come from the
        // external hadoop-lzo package, the rest ship with hadoop-common.
        String name = "io.compression.codecs";
        String value = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec,"
                + "com.hadoop.compression.lzo.LzoCodec,"
                + "com.hadoop.compression.lzo.LzopCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec";
        conf.set(name, value);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (int i = 0; i < args.length; ++i) {
            // The codec is resolved from the file extension (.snappy, .gz, ...).
            CompressionCodec codec = factory.getCodec(new Path(args[i]));
            if (codec == null) {
                System.out.println("Codec for " + args[i] + " not found.");
                continue;
            }
            CompressionInputStream in = null;
            try {
                in = codec.createInputStream(new java.io.FileInputStream(args[i]));
                // Stream the decompressed bytes straight to stdout.
                byte[] buffer = new byte[100];
                int len = in.read(buffer);
                while (len > 0) {
                    System.out.write(buffer, 0, len);
                    len = in.read(buffer);
                }
            } finally {
                if (in != null) {
                    in.close();
                }
            }
        }
    }
}
II. Preparation
1. Prepare the dependencies
Briefly, the core codec classes for these compression formats are:
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec
First we need these dependencies; I put all the jars required for decompression under /home/apache/test/lib/.
We also need the native libraries that the compression codecs rely on: find a machine with Hadoop installed and copy its $HADOOP_HOME/lib/native directory over; I put it under /tmp/decompress.
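Before going further, it is worth confirming that the JVM can actually load libhadoop from that directory, since both Snappy and LZO depend on it. A minimal throwaway check (my own sketch, not part of the original post; the class name NativeCheck is made up):

package compress;

import org.apache.hadoop.util.NativeCodeLoader;

// Hypothetical helper: run with -Djava.library.path=/tmp/decompress and the
// jars from /home/apache/test/lib/ on the classpath; it simply reports
// whether Hadoop's native library could be loaded.
public class NativeCheck {
    public static void main(String[] args) {
        System.out.println("native hadoop loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
    }
}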
2. Prepare the compressed files
2.1 Snappy file
Since I don't have the Snappy library installed, I use Hive to create the Snappy-compressed file. This takes just two parameters:
hive.exec.compress.output, set to true, declares that the result files should be compressed;
mapred.output.compression.codec selects the concrete compression format for the result files.
Check both parameters in the hive shell; with the codec set to the Snappy format we want, run any SQL that writes its result to a local file:
hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
With that, we have the result file /tmp/snappy/000000_0.snappy.
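Incidentally, if you would rather not detour through Hive at all, the same codec API can write compressed files directly. A rough sketch under the same classpath and -Djava.library.path assumptions as above (the class name, output path, and sample payload are mine, not from the original post):

package compress;

import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper: writes a small .snappy file without going through Hive.
public class Compress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
        // getDefaultExtension() yields ".snappy" for SnappyCodec.
        CompressionOutputStream out = codec.createOutputStream(
                new FileOutputStream("/tmp/demo" + codec.getDefaultExtension()));
        out.write("hello snappy\n".getBytes("UTF-8"));
        out.close(); // finishes the compressed stream and flushes it to disk
    }
}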
2.2 LZO file
Same as above, except that we switch the codec to LZO:
hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This produces the result file /tmp/lzo/000000_0.lzo.
2.3 Create the bz2 and gz files
Create the bz2 file:
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
Create the gz file:
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf
(The command that produced resolv.conf.bz2 is missing from the original post; judging by the tar header visible in the decompressed output in section III, it was presumably created the same way, e.g. tar jcf resolv.conf.bz2 resolv.conf. Note also that tar zcf yields a gzip-compressed tar archive rather than a bare .gz file, which is why that tar header shows up at all.)
2.4 Create the deflate file
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%, reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006
Task with the most failures(4):
-----
Task ID: task_1385947742139_0006_r_000000
URL: http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
    ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
    at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
    at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
    at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
    at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
    at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
    ... 18 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 4.85 sec HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
(Note that the log complains about org.apache.hadoop.io.compress.DefaultCode, without the trailing "c", so if you hit this exact message it is worth ruling out a typo in the codec value as well.) Surprisingly, Hive did not pick up Hadoop's classpath here, so the workaround is to drop the dependency onto Hive's own classpath, restart Hive, and rerun the query:
cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib
Everything is now in place; once the class above is compiled, we can start.
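The compilation step itself is not shown in the post; assuming Decompress.java sits in the current directory and the jars are still under /home/apache/test/lib/, something along these lines should do (with package compress, the class file lands in ./compress/, matching the ".:..." classpath used below):

[apache@indigo decom]$ javac -classpath "/home/apache/test/lib/*" -d . Decompress.java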
III. Decompression
1. Snappy file
One thing to note: the argument is the name of the file to decompress, and the matching decompressor is chosen from the compressed file's extension, so the extension must not be changed arbitrarily. Now decompress the Snappy file we obtained earlier:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
..................file contents omitted................................
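To make the extension rule concrete, here is a tiny sketch of the lookup the factory performs (mine, not from the original post); a recognized suffix resolves to a codec and anything else to null:

package compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical demo: CompressionCodecFactory maps the path suffix to a codec,
// which is why a renamed compressed file can no longer be matched.
public class ExtensionDemo {
    public static void main(String[] args) {
        CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
        System.out.println(factory.getCodec(new Path("a.gz")));      // a GzipCodec instance
        System.out.println(factory.getCodec(new Path("a.renamed"))); // null
    }
}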
2. LZO file
Since I do have the lzo tools installed, the file can be decompressed directly:
[apache@indigo lzo]$ lzop -d 000000_0.lzo
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec 2 13:12 000000_0
-rwxr-xr-x. 1 apache apache 848 Dec 2 13:12 000000_0.lzo
Alternatively, point compress.Decompress at the lzo file:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
3. bzip2 file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
4. gzip file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1
5. deflate file
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
..................file contents omitted................................