MapReduce: WordCount with Punctuation Removal
2013-07-20 18:01
This is a simple improvement on the standard WordCount example. The stock WordCount does not handle punctuation, so many of the emitted "words" still carry punctuation characters. This version replaces punctuation with spaces before tokenizing.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // Keep letters, digits, hyphens and apostrophes; everything else
        // (punctuation, whitespace, etc.) is treated as a separator.
        private String pattern = "[^a-zA-Z0-9-']";

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();                  // raw input line
            line = line.replaceAll(pattern, " ");            // replace punctuation with spaces
            StringTokenizer itr = new StringTokenizer(line); // split on whitespace
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        //conf.set("Hadoop.job.ugi", "sunguoli,cs402");
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        //FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        //FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
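The whole improvement boils down to the single replaceAll call in the mapper. A minimal standalone sketch of what that regex does to a line (the sample text is made up for illustration; note that inside the character class, the "-" after "0-9" is a literal hyphen, so hyphenated words and apostrophes survive):

```java
import java.util.StringTokenizer;

public class CleanDemo {
    public static void main(String[] args) {
        // Same pattern as the mapper: keep letters, digits, hyphens, apostrophes.
        String pattern = "[^a-zA-Z0-9-']";
        String line = "Hello, world! It's a well-known test.";
        String cleaned = line.replaceAll(pattern, " ");
        System.out.println(cleaned);
        // StringTokenizer splits on runs of whitespace, so the doubled
        // spaces left behind by punctuation are harmless.
        StringTokenizer itr = new StringTokenizer(cleaned);
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
    }
}
```

Without the replaceAll step, the tokens would include "Hello," and "test." as distinct "words"; with it, punctuation-free tokens like "It's" and "well-known" come out intact.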