Learning Crunch (Part 1)

2016-07-03 14:20

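Crunch provides a high-level abstraction over MapReduce that simplifies writing MapReduce jobs and lowers the barrier to entry. Its highlights: you can model a MapReduce pipeline in plain Java without working with MapReduce constructs directly, and, unlike Pig and Hive, where UDFs must be written against the tool's built-in data types, Crunch does not force programmers to use a type system of its own.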

A simple example

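// Assumption: the Writable type family; Avros offers the same factory methods.
import static org.apache.crunch.types.writable.Writables.ints;
import static org.apache.crunch.types.writable.Writables.strings;
import static org.apache.crunch.types.writable.Writables.tableOf;

import java.io.Serializable;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// NcdcRecordParser is not shown in this post; it is assumed to be a
// Serializable class in the same package.
public class MaxTemperatureCrunch extends Configured implements Tool, Serializable {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(new Configuration(), new MaxTemperatureCrunch(), args);
        System.exit(result);
    }

    // Turns each raw record into a (year, temperature) pair, skipping invalid readings.
    static DoFn<String, Pair<String, Integer>> toYearTempPairsFn() {
        return new DoFn<String, Pair<String, Integer>>() {
            NcdcRecordParser parser = new NcdcRecordParser();

            @Override
            public void process(String input, Emitter<Pair<String, Integer>> emitter) {
                parser.parse(input);
                if (parser.isValidTemperature()) {
                    emitter.emit(Pair.of(parser.getYear(), parser.getAirTemperature()));
                }
            }
        };
    }

    @Override
    public int run(String[] args) throws Exception {
        // Pass getConf() so that options parsed by ToolRunner reach the pipeline.
        Pipeline pipeline = new MRPipeline(MaxTemperatureCrunch.class, "max temperature", getConf());

        // Read the input as lines of text, then map each line to (year, temperature).
        PCollection<String> records = pipeline.readTextFile("hdfs://hadoop:9000/user/hadoop/temp/");
        PTable<String, Integer> yearTemperatures = records
            .parallelDo(toYearTempPairsFn(), tableOf(strings(), ints()));

        // Group by year and keep the maximum temperature of each group.
        PTable<String, Integer> maxTemps = yearTemperatures
            .groupByKey()
            .combineValues(Aggregators.MAX_INTS());

        maxTemps.write(To.textFile("hdfs://hadoop:9000/user/hadoop/temp-out"));
        // done() runs any outstanding work and cleans up the pipeline.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
}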
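To make the grouping step concrete, here is a minimal plain-Java sketch of what groupByKey() followed by combineValues(Aggregators.MAX_INTS()) computes, applied to a small in-memory list. It does not use Crunch or Hadoop. The class MaxPerKeySketch and the helper maxPerKey are hypothetical names introduced for illustration, the sample readings are made up, and Java 9 or later is assumed for List.of and Map.entry.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of groupByKey().combineValues(MAX_INTS()); this is not Crunch code.
final class MaxPerKeySketch {

    // Hypothetical helper: keeps the largest value seen for each key.
    static Map<String, Integer> maxPerKey(List<Map.Entry<String, Integer>> pairs) {
        // A TreeMap is used only so that the printed order is deterministic.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            // merge() stores the value for a new key, otherwise keeps the larger of the two.
            result.merge(pair.getKey(), pair.getValue(), Integer::max);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up (year, temperature) pairs standing in for the yearTemperatures table.
        List<Map.Entry<String, Integer>> yearTemperatures = List.of(
                Map.entry("1949", 111),
                Map.entry("1950", 0),
                Map.entry("1950", 22),
                Map.entry("1949", 78),
                Map.entry("1950", -11));
        System.out.println(maxPerKey(yearTemperatures)); // prints {1949=111, 1950=22}
    }
}
```

Taking a maximum is associative and commutative, which is why the reduction can be applied to partial groups in any order and still produce the same result.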

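In Crunch, every job starts with a Pipeline instance, which manages the life cycle of the data pipeline. There are three main implementations: MRPipeline runs MapReduce jobs locally or on a Hadoop cluster; MemPipeline runs a pipeline in memory and is mainly used for testing; SparkPipeline runs Spark jobs locally or on a Hadoop cluster.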

PCollection<String> lines = pipeline.readTextFile(inputPath);

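PCollection is the core data abstraction in the Crunch API: a distributed, unordered collection. The readTextFile() method on the Pipeline interface is a convenient way to turn a text file into a PCollection<String>; a PCollection can also be created from other Hadoop InputFormat types. Source is an interface that wraps the configuration of an InputFormat and reads data in that format into the pipeline.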

Once the data has been read, each record (line) in the PCollection needs to be processed.

PTable<String, Integer> yearTemperatures = records
    .parallelDo(toYearTempPairsFn(), tableOf(strings(), ints()));

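The toYearTempPairsFn() method returns a DoFn, whose process() method is called for each input record and emits a (year, temperature) pair when the parsed temperature is valid.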
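The contract between parallelDo() and a DoFn can be illustrated without Crunch. The sketch below is a plain-Java model, not the Crunch classes: MiniDoFn, parallelDo and toYearTemp are hypothetical names, and the "year,temperature" line format is an assumption made for the example, since NcdcRecordParser is not shown in this post. The point is that process() may emit zero or more outputs per input, so one function can filter and transform at the same time. Java 9 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Plain-Java model of the DoFn/Emitter contract; these are not the Crunch classes.
final class DoFnSketch {

    // Hypothetical stand-in for DoFn<S, T>: may emit zero or more outputs per input.
    interface MiniDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
    }

    // Hypothetical stand-in for parallelDo(): applies fn to every element in turn.
    static <S, T> List<T> parallelDo(List<S> inputs, MiniDoFn<S, T> fn) {
        List<T> outputs = new ArrayList<>();
        for (S input : inputs) {
            // Whatever the function emits is appended to the output collection.
            fn.process(input, outputs::add);
        }
        return outputs;
    }

    // Assumed line format "year,temperature"; malformed lines emit nothing.
    static MiniDoFn<String, Map.Entry<String, Integer>> toYearTemp() {
        return (line, emitter) -> {
            String[] fields = line.split(",");
            if (fields.length == 2 && fields[1].matches("-?\\d+")) {
                emitter.accept(Map.entry(fields[0], Integer.parseInt(fields[1])));
            }
        };
    }

    public static void main(String[] args) {
        // The second line has no temperature, so it produces no output.
        List<String> records = List.of("1950,22", "1950,", "1949,111");
        System.out.println(parallelDo(records, toYearTemp())); // prints [1950=22, 1949=111]
    }
}
```

In this model the work happens eagerly in a single loop. In Crunch the function is serialized and shipped to the tasks that process the data, which is one reason the original class implements Serializable.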


Tags: Crunch, MapReduce
