Hadoop Example 2: Data Deduplication
2015-09-04 20:36
1. Raw data
1)file1:
2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c
2)file2:
2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c
2. Mapper:
package cn.edu.bjut.del;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each non-empty input line as the map output key. Because duplicate
// lines produce identical keys, they are grouped together during the shuffle.
public class DelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (!"".equals(line)) {
            context.write(new Text(line), new IntWritable(1));
        }
    }
}
3. Reducer:
package cn.edu.bjut.del;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each distinct record arrives here exactly once (duplicates were merged
// into a single key by the shuffle), so writing only the key deduplicates.
public class DelReducer extends Reducer<Text, IntWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}
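The whole trick is that the record itself is the key, so the shuffle's grouping does the deduplication and the reducer just writes each key once. That map-shuffle-reduce flow can be sketched locally in plain Java (no Hadoop needed; the class name `DedupSketch` is made up for illustration, and the sample records are the file1/file2 data above):

```java
import java.util.List;
import java.util.TreeSet;

public class DedupSketch {
    // Simulates map -> shuffle -> reduce: each trimmed, non-empty record
    // becomes a key, and a sorted set collapses duplicate keys exactly
    // the way the shuffle does before DelReducer runs.
    static TreeSet<String> dedup(List<String> records) {
        TreeSet<String> keys = new TreeSet<>();
        for (String r : records) {
            String line = r.trim();        // same trim as DelMapper
            if (!line.isEmpty()) {
                keys.add(line);            // duplicate keys collapse here
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> all = List.of(
            "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c",  // file1
            "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c"); // file2
        dedup(all).forEach(System.out::println); // 12 unique records, sorted
    }
}
```

Running this on the sample data prints 12 distinct records: the repeated lines (for example "2012-3-3 c", which occurs three times across the two files) appear only once, which is what the real job's part file would contain.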
4. Main program:
package cn.edu.bjut.del;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MainJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "del");
        job.setJarByClass(MainJob.class);

        job.setMapperClass(DelMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(DelReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete the output path if it already exists, so reruns do not
        // fail with "output directory already exists".
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
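The pre-submit cleanup in the driver (`fs.delete(path, true)`) is a recursive delete that makes the job safe to rerun. The same idempotent-output pattern can be sketched with plain `java.nio.file` instead of Hadoop's `FileSystem` (the class name `CleanOutputDir` and the temp-directory setup are made up for this example):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanOutputDir {
    // Local analogue of fs.delete(path, true): recursively delete the
    // directory if it exists, children before parents.
    static void deleteIfExists(Path dir) throws IOException {
        if (Files.exists(dir)) {
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.sorted(Comparator.reverseOrder()) // deepest paths first
                    .forEach(p -> {
                        try {
                            Files.delete(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    });
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for args[1]: a directory left over from a previous run.
        Path out = Files.createTempDirectory("dedup-output");
        Files.writeString(out.resolve("part-r-00000"), "old results\n");
        deleteIfExists(out);                  // rerun-safe cleanup
        System.out.println(Files.exists(out)); // false
    }
}
```

Without this step, `FileOutputFormat` throws an exception at submit time if the output directory already exists, which is why the driver checks and deletes before calling `setOutputPath`.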