
Hadoop note - (no map&reduce, SequenceFile)

2014-03-05 15:42
no map&reduce

If no InputFormat is set, the framework chooses TextInputFormat as the default. If no Mapper class is set either, the default (identity) Mapper emits each input record unchanged, i.e. context.write(offset, lineText), and the default (identity) Reducer likewise passes its <key, value> pairs straight through to the output.
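A minimal driver sketch of this default behavior (the class name, job name, and paths are placeholders; this is an illustration of the defaults, not a tested job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity");
        job.setJarByClass(IdentityJob.class);
        // No setInputFormatClass: TextInputFormat is the default.
        // No setMapperClass / setReducerClass: the identity Mapper and
        // Reducer run, so each <offset, lineText> record passes through.
        job.setOutputKeyClass(LongWritable.class); // byte offset from TextInputFormat
        job.setOutputValueClass(Text.class);       // the line itself
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```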

SequenceFile

A SequenceFile can be written in one of three ways:

1) Store the data uncompressed. // CompressionType.NONE

2) Compress the values but not the keys. // CompressionType.RECORD

3) Compress both keys and values, in blocks. // CompressionType.BLOCK
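These three modes correspond to the SequenceFile.CompressionType enum. A sketch of creating a writer with one of them (the path and key/value types are placeholders; this needs a Hadoop installation to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionModes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pick one of the three modes described above: NONE, RECORD, or BLOCK.
        SequenceFile.CompressionType type = SequenceFile.CompressionType.BLOCK;
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[0])),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(type))) {
            writer.append(new Text("example-key"),
                          new BytesWritable("example".getBytes()));
        }
    }
}
```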

SequenceFile is a binary file format provided by the Hadoop API. It serializes <key, value> pairs directly into a file. It is commonly used to merge small files: the file name becomes the key and the file content becomes the value, and the pairs are serialized into one large file. The format has the following advantages:

1) Compression is supported, configurable per Record or per Block (Block-level compression generally performs better).

2) Good task locality: because the file is splittable, MapReduce tasks should get very good data locality.

3) Low effort: since the API is provided by the Hadoop framework, the changes needed on the business-logic side are simple.

The downside is that a merge step is required, and the merged file is no longer convenient to inspect directly.
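The "file name as key, file content as value" idea can be illustrated in plain Java. This is only a conceptual sketch of the layout, not the real SequenceFile format (which you would write via the Hadoop SequenceFile API with, say, a Text key and a BytesWritable value):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallFileMerge {
    // Serialize (fileName -> content) pairs into one byte stream,
    // mimicking the "file name as key, file content as value" layout.
    public static byte[] merge(Map<String, byte[]> files) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                out.writeUTF(e.getKey());          // key: file name
                out.writeInt(e.getValue().length); // value length
                out.write(e.getValue());           // value: file content
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the pairs back out, confirming the round trip.
    public static Map<String, byte[]> split(byte[] merged) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(merged));
            Map<String, byte[]> files = new LinkedHashMap<>();
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```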

Reference: http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720101121103928847/

Usage:

org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

1. job.setOutputFormatClass(SequenceFileOutputFormat.class);

After setting the output format class, the reducer writes its <key, value> pairs as a SequenceFile.

2. In another job, call job.setInputFormatClass(SequenceFileInputFormat.class);

The mapper's <key, value> input is then exactly the <key, value> output of the previous job's reducer.
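A sketch of chaining the two jobs this way (paths, job names, and the Text/IntWritable types are placeholders for whatever your reducer actually emits; the setMapperClass/setReducerClass calls are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Job 1: the reducer's <key, value> pairs are written as a SequenceFile.
        Job first = Job.getInstance(conf, "produce-seqfile");
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        first.setOutputKeyClass(Text.class);          // placeholder output types
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        first.waitForCompletion(true);

        // Job 2: the mapper's input <key, value> is exactly what
        // job 1's reducer wrote.
        Job second = Job.getInstance(conf, "consume-seqfile");
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```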

SequenceFileDirValueIterable, SequenceFileDirValueIterator, SequenceFileDirIterator, SequenceFileDirIterable

new SequenceFileDirValueIterable<VectorWritable>(input, PathType.LIST, PathFilters.logsCRCFilter(), conf)

With PathFilters, we can use the output of the previous job as the input of the next job while ignoring log files and other useless files.




PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.logsCRCFilter()

Returns:
PathFilter
that rejects paths whose file name starts with "_" (e.g. Cloudera _SUCCESS files or Hadoop _logs), or "." (e.g. local hidden files), or ends with ".crc"




PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.partFilter()

Returns:
PathFilter
that accepts paths whose file name starts with "part-". Excludes ".crc" files.
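The two filter rules documented above can be mirrored as plain-Java predicates. This is only an illustration of the documented behavior, not the Mahout classes themselves (the real filters operate on org.apache.hadoop.fs.Path, not on strings):

```java
public class PathFilterRules {
    // Mirrors the documented logsCRCFilter() behavior: reject names
    // starting with "_" (e.g. _SUCCESS files or Hadoop _logs) or "."
    // (hidden files), or ending with ".crc".
    public static boolean acceptLogsCRC(String fileName) {
        return !(fileName.startsWith("_")
              || fileName.startsWith(".")
              || fileName.endsWith(".crc"));
    }

    // Mirrors the documented partFilter() behavior: accept only
    // "part-" files, still excluding ".crc" files.
    public static boolean acceptPart(String fileName) {
        return fileName.startsWith("part-") && !fileName.endsWith(".crc");
    }
}
```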