
Verifying a Hadoop Pseudo-Distributed Setup

2015-07-08 09:02
Start Hadoop and run the jps command; you should see six processes in total. First, a brief introduction to each process's role:

1) ResourceManager: the master of YARN (Yet Another Resource Negotiator)

2) SecondaryNameNode: the NameNode's assistant

3) NameNode: the master of HDFS, the "warehouse manager"

4) DataNode: the worker of HDFS, the "warehouse" itself, where the data blocks are actually stored

5) Jps: the jps tool's own process (jps is itself a Java program, so it shows up in its own listing)

6) NodeManager: the worker of YARN (Yet Another Resource Negotiator)

Running all of these processes on a single machine is not ideal, because they compete with each other for resources; in a real deployment they are spread across different machines.

To verify that HDFS works, try uploading a file. The commands for operating Hadoop are in the bin directory, while the sbin directory holds the scripts for starting and stopping Hadoop.
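
For reference, starting and stopping the daemons from sbin looks roughly like this (a sketch, assuming the standard Hadoop 2.4.1 layout used in this walkthrough):

[root@itcast01 ~]# cd /itcast/hadoop-2.4.1/sbin/
[root@itcast01 sbin]# ./start-dfs.sh      # starts NameNode, DataNode, SecondaryNameNode
[root@itcast01 sbin]# ./start-yarn.sh     # starts ResourceManager, NodeManager
[root@itcast01 sbin]# ./stop-yarn.sh      # stops the YARN daemons
[root@itcast01 sbin]# ./stop-dfs.sh       # stops the HDFS daemons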

Change into Hadoop's bin directory with the following commands:

[root@itcast01 ~]# cd /itcast/hadoop-2.4.1/bin/
[root@itcast01 bin]# ls
container-executor  hdfs      mapred.cmd               yarn
hadoop              hdfs.cmd  rcc                      yarn.cmd
hadoop.cmd          mapred    test-container-executor
You can see many scripts here. Long-time users are used to the hadoop command; nowadays the hdfs and yarn commands are available as well.

If you are unsure how to use a command, print its help. For example, to see how to use hadoop:

[root@itcast01 bin]# hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs                   run a generic filesystem user client
version              print the version
jar <jar>            run a jar file
checknative [-a|-h]  check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath            prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog            get/set the log level for each daemon
or
CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
From here you can run hadoop version to print the version information, and hadoop fs to work with the filesystem.
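
A few commonly used hadoop fs subcommands, as a quick reference sketch (hdfs dfs works the same way for HDFS paths; /demo and local.txt below are illustrative names, not part of this walkthrough):

hadoop fs -ls /                     # list a directory
hadoop fs -mkdir /demo              # create a directory
hadoop fs -put local.txt /demo/     # upload a local file
hadoop fs -get /demo/local.txt .    # download back to the local filesystem
hadoop fs -cat /demo/local.txt      # print a file's contents
hadoop fs -rm /demo/local.txt       # delete a file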

List the files on this HDFS instance; before uploading anything, there is not a single file or directory:

[root@itcast01 bin]# hadoop fs -ls hdfs://itcast01:9000/
[root@itcast01 bin]#


Upload the local file /root/install.log to HDFS, renaming it to log.txt, then list hdfs://itcast01:9000/ again. The log.txt file appears, which means a file has been uploaded from the local filesystem to HDFS.

[root@itcast01 bin]# hadoop fs -put /root/install.log hdfs://itcast01:9000/log.txt
[root@itcast01 bin]# hadoop fs -ls hdfs://itcast01:9000/
Found 1 items
-rw-r--r--   1 root supergroup      49448 2015-07-08 13:43 hdfs://itcast01:9000/log.txt
[root@itcast01 bin]#


You can also check through the HDFS web UI. Open 192.168.8.118:50070 in a browser to reach the HDFS management page, then go to Utilities -> Browse the file system; the file log.txt is listed there. The main columns are:

Permission: the file's permissions

Owner: the user that owns the file

Group: the group it belongs to

Size: the file size

Replication: the number of replicas. The value here is 1, as set in hdfs-site.xml; since this pseudo-distributed setup has only one machine, only one replica is kept (see the snippet below).
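
For reference, the replication factor mentioned above is configured in etc/hadoop/hdfs-site.xml; in a single-machine pseudo-distributed setup the relevant property typically looks like this (a sketch of that one property only):

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>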

Now try downloading this log.txt. Clicking log.txt -> Download redirects the browser to http://itcast01:50075/webhdfs/v1/log.txt?op=OPEN&namenoderpcaddress=itcast01:9000&offset=0, and the page fails to open: the redirected URL uses the hostname itcast01 instead of the IP address. To fix this on Windows, edit C:\Windows\System32\drivers\etc\hosts and append the line "192.168.8.118 itcast01" at the end. Make sure the file keeps no extension when you save it. After this change, clicking Download downloads the file.

Download a file from HDFS to the local machine:
[root@itcast01 bin]# hadoop fs -get hdfs://itcast01:9000/log.txt /home/123.txt         download the HDFS file to the local path /home/123.txt
[root@itcast01 bin]# cd /home
[root@itcast01 home]# ls
123.txt  lost+found  wec                                                                123.txt is here, so the download finished; use more 123.txt to inspect its contents
[root@itcast01 home]#


At this point, uploading, listing, and downloading files on HDFS all work, so HDFS is usable. Next, verify that YARN works as well.

Verify YARN through its management UI at 192.168.8.118:8088.

On this page, Active Nodes shows 1 active node. That node is a worker node, i.e. a NodeManager: YARN's workers are NodeManagers and its master is the ResourceManager.
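
As a quick cross-check from the shell, the yarn CLI can list the registered NodeManagers (a sketch; yarn node -list is available in Hadoop 2.x, though the exact output columns may vary):

[root@itcast01 ~]# yarn node -list        # should report one RUNNING node, itcast01's NodeManager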

Verifying MapReduce's word-count function

Linux's built-in wc command can already do this count; the session is shown below.

First, create a word.txt file:
[root@itcast01 ~]# vim word.txt
hello tom
hello jerry
hello tom
"word.txt" 3L, 32C written


Count how many words word.txt contains:

[root@itcast01 ~]# wc word.txt
3  6 32 word.txt                                              3 lines, 6 words, 32 characters


Hadoop can also do word-frequency counting, that is, count how many times each word appears.

Next, count these words with MapReduce. MapReduce was designed for statistics over massive data sets, and such data is supposed to live on HDFS, so first upload word.txt to HDFS:
[root@itcast01 ~]# hadoop fs -put /root/word.txt hdfs://itcast01:9000/word.avi              Linux files have no intrinsic format, so .avi and .txt here are both just text files
[root@itcast01 ~]#
Using MapReduce just to count word frequencies here feels slow, mainly because of the job-startup and file-reading overhead. MapReduce is built for massive data, and its advantages only show when the data is actually massive.

Spark, by contrast, computes in memory and can also run on YARN, which shows how capable YARN is.

Count the word frequencies with MapReduce; the commands and results follow.

[root@itcast01 ~]# wc word.txt                    Linux's built-in word-count command
3  6 32 word.txt
[root@itcast01 ~]# hadoop fs -put /root/word.txt hdfs://itcast01:9000/word.avi        upload the file to HDFS
[root@itcast01 ~]# cd /itcast/hadoop-2.4.1/share/hadoop/                              change into Hadoop's share directory
[root@itcast01 hadoop]# ls
common  hdfs  httpfs  mapreduce  tools  yarn
[root@itcast01 hadoop]# cd mapreduce/                                                 change into the mapreduce directory
[root@itcast01 mapreduce]# ls                                                         there are many JAR packages here
hadoop-mapreduce-client-app-2.4.1.jar
hadoop-mapreduce-client-common-2.4.1.jar
hadoop-mapreduce-client-core-2.4.1.jar
hadoop-mapreduce-client-hs-2.4.1.jar
hadoop-mapreduce-client-hs-plugins-2.4.1.jar
hadoop-mapreduce-client-jobclient-2.4.1.jar
hadoop-mapreduce-client-jobclient-2.4.1-tests.jar
hadoop-mapreduce-client-shuffle-2.4.1.jar
hadoop-mapreduce-examples-2.4.1.jar
lib
lib-examples
sources
[root@itcast01 mapreduce]# hadoop                                                     type hadoop and press Enter to see its usage hints
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs                   run a generic filesystem user client
version              print the version
jar <jar>            run a jar file
checknative [-a|-h]  check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath            prints the class path needed to get the
Hadoop jar and the required libraries
daemonlog            get/set the log level for each daemon
or
CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
[root@itcast01 mapreduce]# hadoop jar                                                 check what arguments it needs
RunJar jarFile [mainClass] args...
[root@itcast01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.4.1.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[root@itcast01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount
Usage: wordcount <in> <out>
[root@itcast01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount hdfs://itcast01:9000/word.avi hdfs://itcast01:9000/out        at this point, immediately open a second session and run jps; more on this below
15/07/08 21:12:25 INFO client.RMProxy: Connecting to ResourceManager at itcast01/192.168.8.118:8032
15/07/08 21:12:48 INFO input.FileInputFormat: Total input paths to process : 1
15/07/08 21:12:56 INFO mapreduce.JobSubmitter: number of splits:1
15/07/08 21:13:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1436259707754_0001
15/07/08 21:13:23 INFO impl.YarnClientImpl: Submitted application application_1436259707754_0001
15/07/08 21:13:23 INFO mapreduce.Job: The url to track the job: http://itcast01:8088/proxy/application_1436259707754_0001/
15/07/08 21:13:23 INFO mapreduce.Job: Running job: job_1436259707754_0001
15/07/08 21:14:48 INFO mapreduce.Job: Job job_1436259707754_0001 running in uber mode : false
15/07/08 21:14:48 INFO mapreduce.Job:  map 0% reduce 0%
15/07/08 21:16:06 INFO mapreduce.Job:  map 100% reduce 0%
15/07/08 21:16:56 INFO mapreduce.Job:  map 100% reduce 100%
15/07/08 21:16:59 INFO mapreduce.Job: Job job_1436259707754_0001 completed successfully
15/07/08 21:17:00 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=40
FILE: Number of bytes written=185833
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=126
HDFS: Number of bytes written=22
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=76880
Total time spent by all reduces in occupied slots (ms)=41905
Total time spent by all map tasks (ms)=76880
Total time spent by all reduce tasks (ms)=41905
Total vcore-seconds taken by all map tasks=76880
Total vcore-seconds taken by all reduce tasks=41905
Total megabyte-seconds taken by all map tasks=78725120
Total megabyte-seconds taken by all reduce tasks=42910720
Map-Reduce Framework
Map input records=3
Map output records=6
Map output bytes=56
Map output materialized bytes=40
Input split bytes=94
Combine input records=6
Combine output records=3
Reduce input groups=3
Reduce shuffle bytes=40
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=539
CPU time spent (ms)=5880
Physical memory (bytes) snapshot=320163840
Virtual memory (bytes) snapshot=1685929984
Total committed heap usage (bytes)=136122368
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=32
File Output Format Counters
Bytes Written=22
[root@itcast01 mapreduce]#


At this point the count has completed.

Next, check the result files:

[root@itcast01 mapreduce]# hadoop fs -ls hdfs://itcast01:9000/                            list the HDFS root to see its contents
Found 4 items
-rw-r--r--   1 root supergroup      49448 2015-07-08 13:43 hdfs://itcast01:9000/log.txt
drwxr-xr-x   - root supergroup          0 2015-07-08 21:16 hdfs://itcast01:9000/out
drwx------   - root supergroup          0 2015-07-08 21:12 hdfs://itcast01:9000/tmp
-rw-r--r--   1 root supergroup         32 2015-07-08 19:57 hdfs://itcast01:9000/word.avi
[root@itcast01 mapreduce]# hadoop fs -ls hdfs://itcast01:9000/out                         the output directory /out we specified earlier
Found 2 items
-rw-r--r--   1 root supergroup          0 2015-07-08 21:16 hdfs://itcast01:9000/out/_SUCCESS
-rw-r--r--   1 root supergroup         22 2015-07-08 21:16 hdfs://itcast01:9000/out/part-r-00000
[root@itcast01 mapreduce]# hadoop fs -cat hdfs://itcast01:9000/out/part-r-00000           part-r-00000 under /out is the result file
hello   3
jerry   1
tom     2
[root@itcast01 mapreduce]#


The counts are correct: hello appears 3 times, jerry 1 time, tom 2 times.

The job felt slow because this is pseudo-distributed mode: from start to finish it is still a single machine doing all the work, so no extra resources can be brought in and it stays slow.

While the wordcount job above was running, we opened another session to look at its jps output.

jps shows the status of running Java processes.

5423 NameNode
5832 ResourceManager
5515 DataNode
14498 RunJar        the hadoop jar command launched this Java program
5927 NodeManager
14604 Jps
5683 SecondaryNameNode


Resource allocation is handled by the ResourceManager process.

Task monitoring is handled by the MRAppMaster process.

I did not manage to catch the YarnChild process; it is the process in which the map and reduce tasks actually run, and it only exists while a task is executing, which is why it is easy to miss.
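
MRAppMaster and YarnChild are short-lived: they appear only while an application and its tasks are running. A sketch of how to catch them, run in a second terminal while the wordcount job above is still in flight (there will be no output once the job finishes):

[root@itcast01 ~]# jps | grep -E 'MRAppMaster|YarnChild'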