
A First Look at MapReduce: WordCount

2017-06-22 14:50

Program code

package test;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /*
         * Object is the input key type (a LongWritable line offset at runtime).
         * Text is the input value type: one line of the file.
         * Text-IntWritable is the output key-value pair type: word and count.
         */
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the value produced by TextInputFormat to a string and tokenize it
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        /*
         * Text-IntWritable is the input key-value pair type, coming from the map output.
         * Text-IntWritable is the output key-value pair type: word and total frequency.
         */
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // job configuration
        Job job = Job.getInstance(conf, "word count");          // initialize the job
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // set the input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // set the output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
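As a local sanity check, the same tokenize-and-sum logic can be simulated in plain Java with no Hadoop dependency. `LocalWordCount` below is a hypothetical helper, not part of the job: it mimics the mapper (tokenize, emit 1 per word) and the reducer (sum per key) in one pass over the sample lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Mimics map (emit (word, 1)) and reduce (sum per word) in a single pass
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line); // same tokenizer as the mapper
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum); // the reducer's sum
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The test data from 2.txt: "hadoop" appears twice, every other word once
        String[] lines = {"hello hadoop", "bye hadoop", "good java", "great pytho"};
        System.out.println(count(lines));
    }
}
```

Running this on the 2.txt data gives 7 distinct words, matching the `Reduce output records=7` counter in the successful run below.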


Create the directories

[root@Vm90 wxl]# ls
output
[root@Vm90 wxl]# mkdir wordcount01
[root@Vm90 wxl]# cd wordcount01
[root@Vm90 wordcount01]# mkdir src
[root@Vm90 wordcount01]# mkdir classes
[root@Vm90 wordcount01]# ls
classes  src
[root@Vm90 wordcount01]#
[root@Vm90 wordcount01]# cd src/
[root@Vm90 src]# vim WordCount.java
# Paste the code above into WordCount.java,
# then compile
[root@Vm90 src]# cd ..
[root@Vm90 wordcount01]# ls
classes  src
# Compilation needs three jars on the classpath

hadoop-common-2.6.0.jar
hadoop-mapreduce-client-core-2.6.0.jar
hadoop-test-1.2.1.jar
# Pick the jars matching your own Hadoop version; the ones below were used in this experiment
hadoop-common-2.6.0-cdh5.5.0.jar
hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar
hadoop-test-2.6.0-mr1-cdh5.5.0.jar

[root@Vm90 wordcount01]# javac -Xlint:deprecation -classpath /opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-2.6.0-cdh5.5.0.jar:/opt/cloudera/parcels/CDH/jars/hadoop-test-2.6.0-mr1-cdh5.5.0.jar -d classes/ src/*.java
------- Output (the warning can be ignored):
/opt/cloudera/parcels/CDH/jars/hadoop-common-2.6.0-cdh5.5.0.jar(org/apache/hadoop/fs/Path.class): warning: Cannot find annotation method 'value()' in type 'LimitedPrivate': class file for org.apache.hadoop.classification.InterfaceAudience not found
1 warning

# Build the jar
[root@Vm90 wordcount01]# jar -cvf wordcount.jar classes/*
added manifest
adding: classes/test/(in = 0) (out= 0)(stored 0%)
adding: classes/test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: classes/test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: classes/test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar
[root@Vm90 wordcount01]#

# Upload the test data
[root@Vm90 input]# cat 2.txt
hello hadoop
bye hadoop
good java
great pytho
[root@Vm90 wxl]# ls
input  output  wordcount01
[root@Vm90 wxl]# hadoop fs -put input /hbase/


Running the program at this point fails

[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output11
17/06/22 14:29:48 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:29:49 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
17/06/22 14:29:49 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:29:49 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:29:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0008
17/06/22 14:29:50 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
17/06/22 14:29:50 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0008
17/06/22 14:29:50 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0008/
17/06/22 14:29:50 INFO mapreduce.Job: Running job: job_1497340925516_0008
17/06/22 14:29:58 INFO mapreduce.Job: Job job_1497340925516_0008 running in uber mode : false
17/06/22 14:29:58 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:30:03 INFO mapreduce.Job: Task Id : attempt_1497340925516_0008_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class test.WordCount$TokenizerMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
... 8 more


The cause:

It is a path problem inside the jar. The error reports test.WordCount$TokenizerMapper not found, but the jar was packed from classes/*, so its entries begin with classes/test/ instead of test/. The class loader resolves test.WordCount$TokenizerMapper to the entry test/WordCount$TokenizerMapper.class, which does not exist in that jar.
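A quick way to spot this before submitting is to list or probe the jar's entries (`jar -tf wordcount.jar` does the same from the shell). The `JarEntryCheck` class below is an illustrative sketch, not part of the tutorial, using the standard java.util.jar API:

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarEntryCheck {
    // Returns true if the jar contains an entry with exactly this name.
    // A class test.WordCount is only loadable if the entry is
    // "test/WordCount.class"; a "classes/test/WordCount.class" entry won't do.
    public static boolean hasEntry(String jarPath, String entryName) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java JarEntryCheck wordcount.jar test/WordCount.class
        System.out.println(hasEntry(args[0], args[1]) ? "found" : "missing");
    }
}
```

For the mispacked jar above, `hasEntry("wordcount.jar", "test/WordCount.class")` would return false, which is exactly why the task fails at runtime.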

Repackage

[root@Vm90 wordcount01]# cd classes/
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# jar -cvf ../wordcount.jar test
added manifest
adding: test/(in = 0) (out= 0)(stored 0%)
adding: test/WordCount.class(in = 1516) (out= 815)(deflated 46%)
adding: test/WordCount$TokenizerMapper.class(in = 1746) (out= 758)(deflated 56%)
adding: test/WordCount$IntSumReducer.class(in = 1749) (out= 742)(deflated 57%)
[root@Vm90 classes]# ls
test
[root@Vm90 classes]# cd ..
[root@Vm90 wordcount01]# ls
classes  src  wordcount.jar
[root@Vm90 wordcount01]#
# Run it again
[root@Vm90 wxl]# hadoop jar wordcount01/wordcount.jar test.WordCount /hbase/input/ /hbase/output12
17/06/22 14:38:15 INFO client.RMProxy: Connecting to ResourceManager at Vm90/172.16.2.90:8032
17/06/22 14:38:16 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/22 14:38:16 INFO input.FileInputFormat: Total input paths to process : 1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: number of splits:1
17/06/22 14:38:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1497340925516_0009
17/06/22 14:38:17 INFO impl.YarnClientImpl: Submitted application application_1497340925516_0009
17/06/22 14:38:17 INFO mapreduce.Job: The url to track the job: http://Vm90:8088/proxy/application_1497340925516_0009/
17/06/22 14:38:17 INFO mapreduce.Job: Running job: job_1497340925516_0009
17/06/22 14:38:24 INFO mapreduce.Job: Job job_1497340925516_0009 running in uber mode : false
17/06/22 14:38:24 INFO mapreduce.Job:  map 0% reduce 0%
17/06/22 14:38:31 INFO mapreduce.Job:  map 100% reduce 0%
17/06/22 14:38:38 INFO mapreduce.Job:  map 100% reduce 50%
17/06/22 14:38:39 INFO mapreduce.Job:  map 100% reduce 100%
17/06/22 14:38:40 INFO mapreduce.Job: Job job_1497340925516_0009 completed successfully
17/06/22 14:38:40 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=120
FILE: Number of bytes written=344705
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=146
HDFS: Number of bytes written=54
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4784
Total time spent by all reduces in occupied slots (ms)=11119
Total time spent by all map tasks (ms)=4784
Total time spent by all reduce tasks (ms)=11119
Total vcore-seconds taken by all map tasks=4784
Total vcore-seconds taken by all reduce tasks=11119
Total megabyte-seconds taken by all map tasks=4898816
Total megabyte-seconds taken by all reduce tasks=11385856
Map-Reduce Framework
Map input records=4
Map output records=8
Map output bytes=79
Map output materialized bytes=112
Input split bytes=99
Combine input records=8
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=112
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=134
CPU time spent (ms)=3220
Physical memory (bytes) snapshot=874729472
Virtual memory (bytes) snapshot=4710256640
Total committed heap usage (bytes)=860356608
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=47
File Output Format Counters
Bytes Written=54
[root@Vm90 wxl]#




Further reading:

The first MapReduce program: WordCount

MapReduce programs: the map and reduce phases