
Hadoop note - (no map&reduce, SequenceFile)

2014-03-05 15:42
no map&reduce

If no InputFormat is set, the framework chooses TextInputFormat as the default. If no Mapper class is set either, the default (identity) Mapper emits each input record unchanged, i.e. context.write(offset, lineText), and the default (identity) Reducer likewise passes its <key, value> pairs straight through to the output.
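A minimal driver sketch of this default behavior (the class name, job name, and paths are placeholders; this is an illustration of the defaults, not a tested job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity");
        job.setJarByClass(IdentityJob.class);
        // No setInputFormatClass: TextInputFormat is the default.
        // No setMapperClass / setReducerClass: the identity Mapper and
        // Reducer run, so each <offset, lineText> record passes through.
        job.setOutputKeyClass(LongWritable.class); // byte offset from TextInputFormat
        job.setOutputValueClass(Text.class);       // the line itself
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```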

SequenceFile

A SequenceFile can be written in one of three ways:

1) Store the data uncompressed. // CompressionType.NONE

2) Compress the values but not the keys. // CompressionType.RECORD

3) Compress both keys and values, in blocks. // CompressionType.BLOCK
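These three modes correspond to the SequenceFile.CompressionType enum. A sketch of creating a writer with one of them (the path and key/value types are placeholders; this needs a Hadoop installation to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionModes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pick one of the three modes described above: NONE, RECORD, or BLOCK.
        SequenceFile.CompressionType type = SequenceFile.CompressionType.BLOCK;
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[0])),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(type))) {
            writer.append(new Text("example-key"),
                          new BytesWritable("example".getBytes()));
        }
    }
}
```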

SequenceFile is a binary file format provided by the Hadoop API. It serializes <key, value> pairs directly into a file. It is commonly used to merge small files: the file name becomes the key and the file content becomes the value, and the pairs are serialized into one large file. The format has the following advantages:

1) Compression is supported, configurable per Record or per Block (Block-level compression generally performs better).

2) Good task locality: because the file is splittable, MapReduce tasks should get very good data locality.

3) Low effort: since the API is provided by the Hadoop framework, the changes needed on the business-logic side are simple.

The downside is that a merge step is required, and the merged file is no longer convenient to inspect directly.
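The "file name as key, file content as value" idea can be illustrated in plain Java. This is only a conceptual sketch of the layout, not the real SequenceFile format (which you would write via the Hadoop SequenceFile API with, say, a Text key and a BytesWritable value):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallFileMerge {
    // Serialize (fileName -> content) pairs into one byte stream,
    // mimicking the "file name as key, file content as value" layout.
    public static byte[] merge(Map<String, byte[]> files) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                out.writeUTF(e.getKey());          // key: file name
                out.writeInt(e.getValue().length); // value length
                out.write(e.getValue());           // value: file content
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the pairs back out, confirming the round trip.
    public static Map<String, byte[]> split(byte[] merged) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(merged));
            Map<String, byte[]> files = new LinkedHashMap<>();
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```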

Reference: http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720101121103928847/

Usage:

org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

1. job.setOutputFormatClass(SequenceFileOutputFormat.class);

After setting the output format class, the reducer writes its <key, value> pairs as a SequenceFile.

2. In another job, call job.setInputFormatClass(SequenceFileInputFormat.class);

The mapper's <key, value> input is then exactly the <key, value> output of the previous job's reducer.
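A sketch of chaining the two jobs this way (paths, job names, and the Text/IntWritable types are placeholders for whatever your reducer actually emits; the setMapperClass/setReducerClass calls are omitted for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Job 1: the reducer's <key, value> pairs are written as a SequenceFile.
        Job first = Job.getInstance(conf, "produce-seqfile");
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        first.setOutputKeyClass(Text.class);          // placeholder output types
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        first.waitForCompletion(true);

        // Job 2: the mapper's input <key, value> is exactly what
        // job 1's reducer wrote.
        Job second = Job.getInstance(conf, "consume-seqfile");
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```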

SequenceFileDirValueIterable, SequenceFileDirValueIterator, SequenceFileDirIterator, SequenceFileDirIterable

new SequenceFileDirValueIterable<VectorWritable>(input, PathType.LIST, PathFilters.logsCRCFilter(), conf)

With PathFilters, we can use the output of the previous job as the input of the next job while ignoring log files and other useless files.




PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.logsCRCFilter()

Returns:
PathFilter
that rejects paths whose file name starts with "_" (e.g. Cloudera _SUCCESS files or Hadoop _logs), or "." (e.g. local hidden files), or ends with ".crc"




PathFilter org.apache.mahout.common.iterator.sequencefile.PathFilters.partFilter()

Returns:
PathFilter
that accepts paths whose file name starts with "part-". Excludes ".crc" files.
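The two filter rules documented above can be mirrored as plain-Java predicates. This is only an illustration of the documented behavior, not the Mahout classes themselves (the real filters operate on org.apache.hadoop.fs.Path, not on strings):

```java
public class PathFilterRules {
    // Mirrors the documented logsCRCFilter() behavior: reject names
    // starting with "_" (e.g. _SUCCESS files or Hadoop _logs) or "."
    // (hidden files), or ending with ".crc".
    public static boolean acceptLogsCRC(String fileName) {
        return !(fileName.startsWith("_")
              || fileName.startsWith(".")
              || fileName.endsWith(".crc"));
    }

    // Mirrors the documented partFilter() behavior: accept only
    // "part-" files, still excluding ".crc" files.
    public static boolean acceptPart(String fileName) {
        return fileName.startsWith("part-") && !fileName.endsWith(".crc");
    }
}
```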