Hadoop note (no map & reduce, SequenceFile)
2014-03-05 15:42
no map & reduce
If no InputFormat is set, the framework chooses TextInputFormat by default. If no Mapper class is set, the default (identity) Mapper simply calls context.write(offset, lineText), so each input line is passed through unchanged, keyed by its byte offset, and the (identity) reducer writes it to the output.
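A minimal driver sketching this default behavior (paths and the job name are placeholders): with no Mapper, Reducer, or InputFormat set, Hadoop falls back to TextInputFormat plus the identity Mapper/Reducer, so each record comes out as (byte offset, line text).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DefaultJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "default-identity");
        job.setJarByClass(DefaultJob.class);
        // No setMapperClass / setReducerClass: the identity Mapper and Reducer run.
        // No setInputFormatClass: TextInputFormat is the default, so the map input
        // key is the byte offset (LongWritable) and the value is the line (Text).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```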
SequenceFile
A SequenceFile can be written in three ways:
1) Uncompressed: key/value pairs stored as-is. // SequenceFile.CompressionType.NONE
2) Record-compressed: values compressed, keys left uncompressed. // SequenceFile.CompressionType.RECORD
3) Block-compressed: both keys and values compressed. // SequenceFile.CompressionType.BLOCK
SequenceFile is a binary file format provided by the Hadoop API. It serializes <key, value> pairs directly into a file. It is commonly used to merge many small files: each file name becomes a key and the file contents become the value, serialized into one large file. The format has these advantages:
1) Compression support, configurable as Record- or Block-based (Block-level compression generally performs better).
2) Data locality: because the file is splittable, MapReduce tasks get good data locality.
3) Low effort: since it is part of the Hadoop framework API, the changes required on the business-logic side are small.
The downside is that a merge step is required, and the merged file is not convenient to inspect directly.
Reference: http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720101121103928847/
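The small-file merge described above can be sketched as follows (a sketch, not the referenced post's code; args[0] is assumed to be a local directory of small files and args[1] the output path):

```java
// Merge small files into one SequenceFile: file name as key, contents as value.
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                new Path(args[1]), Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK); // block-level compression
        for (File f : new File(args[0]).listFiles()) {
            byte[] contents = Files.readAllBytes(f.toPath());
            // key = file name, value = raw file bytes
            writer.append(new Text(f.getName()), new BytesWritable(contents));
        }
        writer.close();
    }
}
```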
usage:
org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
1. job.setOutputFormatClass(SequenceFileOutputFormat.class);
After setting this output format class, the reducer writes its <key, value> pairs out as a SequenceFile.
2. In the next job, call job.setInputFormatClass(SequenceFileInputFormat.class);
That job's mapper then receives as its <key, value> input exactly the <key, value> output of the previous job's reducer.
Mahout provides SequenceFileDirValueIterable, SequenceFileDirValueIterator, SequenceFileDirIterator, and SequenceFileDirIterable, e.g.:
new SequenceFileDirValueIterable<VectorWritable>(input, PathType.LIST, PathFilters.logsCRCFilter(), conf)
With the PathFilters, we can use the output of the last job as the input of the next job while ignoring log files and other useless files.
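The two-step chain above can be sketched like this (class names, job names, and paths are placeholders; mapper/reducer classes are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1]);

        // Job 1: reducer output is written as a SequenceFile.
        Job job1 = Job.getInstance(conf, "job1");
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job1, intermediate);
        job1.waitForCompletion(true);

        // Job 2: reads the SequenceFile back; the mapper's <key, value>
        // input types match job 1's reducer output types.
        Job job2 = Job.getInstance(conf, "job2");
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));
        job2.waitForCompletion(true);
    }
}
```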
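Putting the snippet above into context, reading a job's SequenceFile output directory with Mahout's iterable might look like this (a sketch; VectorWritable values and an output directory in args[0] are assumed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable;
import org.apache.mahout.math.VectorWritable;

public class ReadVectors {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        // PathType.LIST lists the directory; logsCRCFilter() skips
        // _logs, _SUCCESS, hidden files, and .crc checksum files.
        for (VectorWritable vw :
                new SequenceFileDirValueIterable<VectorWritable>(
                        input, PathType.LIST, PathFilters.logsCRCFilter(), conf)) {
            System.out.println(vw.get());
        }
    }
}
```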
PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.logsCRCFilter()
Returns: a PathFilter that rejects paths whose file name starts with "_" (e.g. Cloudera _SUCCESS files or Hadoop _logs), or "." (e.g. local hidden files), or ends with ".crc"
PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.partFilter()
Returns: a PathFilter that accepts paths whose file name starts with "part-"; excludes ".crc" files