
MapReduce in Action: WordCount

2016-10-19 17:03

Open Eclipse and create a new Java project named WordCount. Write a WordMapper class that extends the Mapper abstract class and overrides the map function, and a WordReducer class that extends Reducer and overrides the reduce function. Finally, write a driver class that wires the WordMapper and WordReducer classes into a job.



The WordMapper class
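
The original code was shown in a screenshot; below is a minimal sketch of such a mapper, assuming the standard Hadoop 2.x API (the package name wordcount comes from the wordcount.WordMain used in the run command later in this post; the field names are my own):

package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated word in each input line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}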



The WordReducer class
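
Again a minimal sketch rather than the exact code from the screenshot: the reducer sums the counts for each word.

package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all the 1s emitted by the mapper for each distinct word.
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}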



The WordMain class
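
A sketch of the driver class. The class name wordcount.WordMain and the two path arguments match the hadoop jar command later in the post; setting the reducer as the combiner is an assumption on my part, but it is consistent with the Combine input/output counters in the job log below.

package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and wires up WordMapper and WordReducer.
public class WordMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordMain.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(WordReducer.class); // assumed; the job log shows combine counters
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/gznc/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/gznc/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}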





Next comes exporting the jar file.











The wordcount.jar file will then appear on the Linux desktop.
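
If you prefer the command line to the Eclipse export wizard, a rough equivalent (assuming the compiled .class files live under the project's bin directory) is: jar -cvf wordcount.jar -C bin .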



Prepare two test files, file1.txt and file2.txt, each containing some words.



Start the Hadoop cluster with start-all.sh and create the input directory: hadoop fs -mkdir /user/gznc/input. Then upload the local file1.txt and file2.txt into the cluster's input directory. There are two ways to do this. The first is the command hdfs dfs -put /home/gznc/file1.txt /user/gznc/input, where the first path is local and the second is on the cluster; file2.txt is uploaded the same way. The second is to upload from Eclipse with a short program, as sketched below. Note that the paths can vary; they do not have to match mine.
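
The post does not show the Eclipse upload code; a minimal sketch using the HDFS Java API could look like this (the class name UploadFile is hypothetical, and the NameNode address hdfs://master:9000 is taken from the error message near the end of the post):

package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies local files into HDFS, equivalent to "hdfs dfs -put".
public class UploadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9000"); // NameNode address, adjust to your cluster
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/home/gznc/file1.txt"), new Path("/user/gznc/input"));
        fs.copyFromLocalFile(new Path("/home/gznc/file2.txt"), new Path("/user/gznc/input"));
        fs.close();
    }
}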



Check that file1.txt and file2.txt have been uploaded to the cluster: hadoop fs -ls /user/gznc/input



The final step is to run the job: hadoop jar /home/gznc/Desktop/wordcount.jar wordcount.WordMain /user/gznc/input /user/gznc/output. The format is: hadoop jar + path to the jar file (mine is on the local desktop) + the package-qualified class containing the main function + the input path on the cluster + the output path for the results.

The output after a successful run:

16/10/19 16:30:11 INFO input.FileInputFormat: Total input paths to process : 2

16/10/19 16:30:11 INFO mapreduce.JobSubmitter: number of splits:2

16/10/19 16:30:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1476858901736_0002

16/10/19 16:30:13 INFO impl.YarnClientImpl: Submitted application application_1476858901736_0002

16/10/19 16:30:13 INFO mapreduce.Job: The url to track the job: http://master:18088/proxy/application_1476858901736_0002/

16/10/19 16:30:13 INFO mapreduce.Job: Running job: job_1476858901736_0002

16/10/19 16:30:30 INFO mapreduce.Job: Job job_1476858901736_0002 running in uber mode : false

16/10/19 16:30:30 INFO mapreduce.Job: map 0% reduce 0%

16/10/19 16:30:53 INFO mapreduce.Job: map 100% reduce 0%

16/10/19 16:31:15 INFO mapreduce.Job: map 100% reduce 100%

16/10/19 16:31:16 INFO mapreduce.Job: Job job_1476858901736_0002 completed successfully

16/10/19 16:31:16 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=126

FILE: Number of bytes written=290714

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=304

HDFS: Number of bytes written=40

HDFS: Number of read operations=9

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=43849

Total time spent by all reduces in occupied slots (ms)=18909

Total time spent by all map tasks (ms)=43849

Total time spent by all reduce tasks (ms)=18909

Total vcore-seconds taken by all map tasks=43849

Total vcore-seconds taken by all reduce tasks=18909

Total megabyte-seconds taken by all map tasks=44901376

Total megabyte-seconds taken by all reduce tasks=19362816

Map-Reduce Framework

Map input records=4

Map output records=16

Map output bytes=150

Map output materialized bytes=132

Input split bytes=218

Combine input records=16

Combine output records=10

Reduce input groups=5

Reduce shuffle bytes=132

Reduce input records=10

Reduce output records=5

Spilled Records=20

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=480

CPU time spent (ms)=3950

Physical memory (bytes) snapshot=510222336

Virtual memory (bytes) snapshot=2516795392

Total committed heap usage (bytes)=256647168

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=86

File Output Format Counters

Bytes Written=40

If the output path (here /user/gznc/output) already exists on the cluster, the job fails with the following error, so make sure to use an output path that does not already exist:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master:9000/user/gznc/output already exists

at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)

at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)

at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)

at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)

at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)

at wordcount.WordMain.main(WordMain.java:34)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
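
To rerun the job with the same output path, delete the existing directory first: hadoop fs -rm -r /user/gznc/output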

Now let's check the word-count results on the cluster: hadoop fs -ls /user/gznc/output
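
The listing shows a _SUCCESS marker and the reduce output. To print the actual word counts, cat the part file; with a single reducer it is normally named part-r-00000, though the exact name may differ: hadoop fs -cat /user/gznc/output/part-r-00000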
