Testing Hive file formats in combination with compression codecs
2013-11-06 19:25
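I have recently been running an overall test of the various compression methods across our entire cluster. Here is the report of the results.
Test environment: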
Linux master 2.6.18-348.12.1.el5 #1 SMP Wed Jul 10 05:28:41 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
hadoop-1.0.3
hive-0.9.0
Oracle JRockit(R) (build R28.1.5-20-146757-1.6.0_29-20111004-1750-linux-x86_64, compiled mode)
5 datanodes in total
File formats tested in Hive:
RCFile
SequenceFile
Compression codecs:
snappy
bz2
(Lzo and Gzip compression tests will be added later)
Metrics tested:
1. Compression ratio 2. Amount of data read 3. Hive execution speed
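Part 1: bz2 compression
Below is a comparison of bz2 compression across the two file formats.

Original data size | RCFile size after compression | SequenceFile size after compression |
---|---|---|
12.89GB | 2.29GB | 2.59GB |

Compression ratio (share of the original size saved):

RCFile | SequenceFile |
---|---|
82.23% | 79.91% |

These results show that the choice of file format has little effect on the compression ratio.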
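The percentages in the second table can be reproduced from the sizes in the first. Here is a minimal Python sketch, assuming the percentage is the share of the original size that compression removed. The function name `space_saved_pct` is mine and does not come from the test scripts.

```python
def space_saved_pct(original_gb, compressed_gb):
    """Percentage of the original size removed by compression."""
    return (1 - compressed_gb / original_gb) * 100


ORIGINAL_GB = 12.89  # size of the uncompressed source data, from the table

for fmt, compressed_gb in [("RCFile", 2.29), ("SequenceFile", 2.59)]:
    print(f"{fmt}: {space_saved_pct(ORIGINAL_GB, compressed_gb):.2f}% saved")
```

With the table values this gives 82.23% for RCFile and 79.91% for SequenceFile, matching the reported figures.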
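Next, let's see how Hive SQL performs in these two cases.
Running the aggregation query against the RCFile table (bz2 compression):
Time taken: 68.169 seconds
Running the aggregation query against the SequenceFile table (bz2 compression):
Time taken: 194.226 seconds
From this comparison:
1. The RCFile format reads far less data (373.94MB) than SequenceFile (2.59GB).
2. The RCFile query (68 seconds) runs much faster than the SequenceFile query (194 seconds).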
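To put numbers on "far less" and "much faster", the two runs can be compared directly. This is a small Python sketch using only the figures quoted above. It assumes 1GB = 1024MB, which the post does not state. Note that the SequenceFile job read 2.59GB, which is the entire compressed table, while RCFile is a columnar format and Hive can skip columns the query does not use. That is a likely explanation for the gap, not something the test verified.

```python
MB_PER_GB = 1024  # assumption: sizes are reported in binary units

rcfile = {"seconds": 68.169, "read_mb": 373.94}
seqfile = {"seconds": 194.226, "read_mb": 2.59 * MB_PER_GB}

# How many times slower SequenceFile was, and how many times more data it read.
speedup = seqfile["seconds"] / rcfile["seconds"]
read_ratio = seqfile["read_mb"] / rcfile["read_mb"]

print(f"RCFile is {speedup:.2f}x faster and reads {read_ratio:.1f}x less data")
```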
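Part 2: snappy compression
For snappy I only tested RCFile (SequenceFile is essentially outside the scope of my later optimization work).

Original data size | Size after bz2 compression | Size after snappy compression |
---|---|---|
12.89GB | 2.29GB | 4.87GB |

Compression ratio (share of the original size saved):

bz2 | snappy |
---|---|
82.23% | 62.22% |

In terms of saving disk space, bz2 has a large advantage. (Note: lzo was not tested here because our HBase tests showed that lzo offers no significant space savings.)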
Next, let's compare how bz2 and snappy compression affect SQL execution. The progress logs show that the snappy job advances far faster than the bz2 job.

bz2 job progress:
2013-11-06 18:18:56,840 Stage-1 map = 0%, reduce = 0%
2013-11-06 18:19:24,020 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:25,028 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:26,036 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:27,045 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:28,053 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:29,060 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:30,068 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:31,074 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:32,081 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:33,088 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:34,095 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:35,101 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:36,108 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:37,115 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:38,122 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:39,129 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:40,135 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:41,141 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:42,148 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:43,154 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:44,161 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:45,168 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:46,175 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:47,182 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:48,188 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:49,195 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:50,201 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:51,207 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:52,214 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:53,220 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:54,227 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:55,233 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:56,239 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:57,246 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:58,252 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:19:59,259 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:00,265 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:01,272 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:02,279 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:03,286 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:04,293 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 66.32 sec
2013-11-06 18:20:05,309 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 405.79 sec
2013-11-06 18:20:06,316 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:07,323 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:08,331 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:09,338 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:10,345 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:11,352 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:12,359 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:13,366 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:14,373 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:15,380 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:16,387 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:17,394 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:18,401 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:19,408 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec
2013-11-06 18:20:20,415 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec

snappy job progress:
2013-11-06 19:06:33,666 Stage-1 map = 0%, reduce = 0%
2013-11-06 19:06:43,699 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:44,704 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:45,709 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 4.65 sec
2013-11-06 19:06:46,714 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 20.37 sec
2013-11-06 19:06:47,719 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 20.37 sec
2013-11-06 19:06:48,724 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 42.85 sec
2013-11-06 19:06:49,729 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 52.4 sec
2013-11-06 19:06:50,734 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 52.4 sec
2013-11-06 19:06:51,739 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:52,744 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:53,749 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:54,754 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:55,759 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:56,764 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:57,769 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:58,774 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:06:59,779 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:00,784 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:01,789 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:02,794 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:03,799 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 85.82 sec
2013-11-06 19:07:04,804 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:05,809 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:06,814 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 116.32 sec
2013-11-06 19:07:07,820 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:08,825 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:09,831 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:10,836 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:11,841 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:12,846 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 209.05 sec
2013-11-06 19:07:13,851 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:14,857 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:15,863 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 210.72 sec
2013-11-06 19:07:16,868 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:17,873 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:18,878 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 219.67 sec
2013-11-06 19:07:19,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:20,889 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:21,894 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
2013-11-06 19:07:22,900 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:23,905 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:24,920 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:25,925 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:26,930 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:27,935 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:28,940 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:29,946 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:30,959 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 225.03 sec
2013-11-06 19:07:31,964 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:32,970 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:33,975 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:34,981 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:35,987 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:36,993 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
2013-11-06 19:07:37,999 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 227.32 sec
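Progress logs like these are easier to compare once reduced to "seconds elapsed versus map percentage". The following Python sketch does that with a regular expression. The helper name `map_progress` is hypothetical, and the two sample logs contain only the first and last map-phase entries copied from the logs above.

```python
import re
from datetime import datetime

# Matches the timestamp and map percentage of one Hive progress entry.
LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) Stage-1 map = (\d+)%"
)


def map_progress(log_text):
    """Return (seconds since the first entry, map %) for each progress entry."""
    points = [
        (datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"), int(pct))
        for stamp, pct in LINE_RE.findall(log_text)
    ]
    start = points[0][0]
    return [((t - start).total_seconds(), pct) for t, pct in points]


bz_log = """
2013-11-06 18:18:56,840 Stage-1 map = 0%, reduce = 0%
2013-11-06 18:20:20,415 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 1373.04 sec
"""
snappy_log = """
2013-11-06 19:06:33,666 Stage-1 map = 0%, reduce = 0%
2013-11-06 19:07:19,884 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 225.03 sec
"""

for name, log in [("bz2", bz_log), ("snappy", snappy_log)]:
    seconds, pct = map_progress(log)[-1]
    print(f"{name}: map reached {pct}% after {seconds:.1f}s")
```

On these entries the bz2 job is at 9% map after about 84 seconds, while the snappy job finishes its map phase in about 46 seconds. Because the pattern is not anchored to line starts, the same function also works on a log pasted as one long line.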
Next, let's look at the MR job status when the SQL runs with snappy compression.
The total amount of data read is 608.77MB, which is acceptable.
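Summary:
Compression in Hive should be applied flexibly. For source data, use RCFile+bz2 or RCFile+gz, which saves a great deal of disk space. During computation, where execution speed should not suffer, it is worth spending a little extra disk space: RCFile+snappy is recommended, and it improves Hive's overall execution speed.
lzo can also be used during computation, but weighing both speed and compression ratio, snappy is the better fit.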
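The post does not show the settings used to produce the compressed tables, so the following is only a sketch of how the two recommendations might be expressed. It is a small Python helper that prints HiveQL. The table names are hypothetical, and the property names are the Hadoop 1.x-era ones (`hive.exec.compress.output`, `mapred.output.compression.codec`), so check them against your own versions. The snappy codec also needs the native library installed on the cluster.

```python
# Codec classes that ship with Hadoop.
CODECS = {
    "bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    "gz": "org.apache.hadoop.io.compress.GzipCodec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def rcfile_ctas(target, source, codec):
    """Build HiveQL that copies `source` into a compressed RCFile table."""
    return "\n".join([
        "SET hive.exec.compress.output=true;",
        f"SET mapred.output.compression.codec={CODECS[codec]};",
        f"CREATE TABLE {target} STORED AS RCFILE AS SELECT * FROM {source};",
    ])


# Hypothetical tables: the archive copy favours disk space,
# the working copy favours query speed.
print(rcfile_ctas("logs_archive", "logs_raw", "bz2"))
print(rcfile_ctas("logs_working", "logs_raw", "snappy"))
```

Generating the statements from one function keeps the storage format fixed at RCFile and makes the codec the only thing that differs between the archive and working copies.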