
Simulating a NameNode crash and recovering with the SecondaryNameNode

2015-06-12 00:00

Method 1: Using the namespaceID
1. On the NameNode, empty the directory specified by dfs.name.dir (the name directory here) to simulate a failure.
[hadoop@node1 name]$ ls
current  image  in_use.lock
[hadoop@node1 name]$ rm -rf *
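
On a cluster you care about, a gentler way to run the same experiment is to move the metadata aside rather than delete it. The sketch below is only an illustration: the function name simulate_namenode_loss and both of its arguments are hypothetical and not part of Hadoop.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Hadoop command): simulate NameNode metadata loss
# by moving the contents of the name directory aside instead of deleting them.
#   $1 = directory that dfs.name.dir points to
#   $2 = backup directory to move the contents into
simulate_namenode_loss() {
    local name_dir="$1" backup_dir="$2"
    [ -d "$name_dir" ] || { echo "no such directory: $name_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    # Move every top-level entry (dotfiles included) out of the name directory.
    find "$name_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_dir"/ \;
}
```

Called as simulate_namenode_loss /app/user/hdfs/name /tmp/name.bak, it leaves the name directory empty, as rm -rf * does, while keeping a copy to fall back on.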

2. After the cluster is stopped and restarted, the NameNode daemon is no longer running.



[hadoop@node1 name]$ stop-all.sh
stopping jobtracker
192.168.1.152: stopping tasktracker
192.168.1.153: stopping tasktracker
stopping namenode
192.168.1.152: stopping datanode
192.168.1.153: stopping datanode
192.168.1.152: stopping secondarynamenode
[hadoop@node1 name]$ start-all.sh
starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
[hadoop@node1 name]$ jps
31942 Jps
31872 JobTracker




The NameNode log also reports an error:



2013-11-14 06:19:59,172 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = node1/192.168.1.151
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2013-11-14 06:19:59,395 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2013-11-14 06:19:59,400 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: node1.com/192.168.1.151:9000
2013-11-14 06:19:59,403 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2013-11-14 06:19:59,407 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2013-11-14 06:19:59,557 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2013-11-14 06:19:59,558 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2013-11-14 06:19:59,558 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2013-11-14 06:19:59,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2013-11-14 06:19:59,569 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2013-11-14 06:19:59,654 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:317)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
2013-11-14 06:19:59,658 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2013-11-14 06:19:59,663 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:317)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

2013-11-14 06:19:59,664 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
************************************************************/
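
Before starting the daemon, a rough pre-check can tell you whether the name directory still holds metadata at all. This is only a heuristic sketch, not the NameNode's own logic: it assumes a formatted directory contains current/VERSION and current/fsimage (the files seen in the listings in this post), and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Heuristic only (assumption): treat a name directory as formatted when both
# current/VERSION and current/fsimage exist. The real NameNode check may differ.
#   $1 = directory that dfs.name.dir points to
name_dir_looks_formatted() {
    [ -f "$1/current/VERSION" ] && [ -f "$1/current/fsimage" ]
}
```

For example: name_dir_looks_formatted /app/user/hdfs/name || echo "name directory has no metadata".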




3. Listing files in HDFS fails:
[hadoop@node1 name]$ hadoop dfs -ls /user/hive/warehouse
13/11/14 06:21:06 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 0 time(s).
13/11/14 06:21:07 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 1 time(s).
13/11/14 06:21:08 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 2 time(s).
13/11/14 06:21:09 INFO ipc.Client: Retrying connect to server: node1/192.168.1.151:9000. Already tried 3 time(s).

4. Stop the cluster and format the NameNode:



[hadoop@node1 name]$ stop-all.sh
stopping jobtracker
192.168.1.152: stopping tasktracker
192.168.1.153: stopping tasktracker
no namenode to stop
192.168.1.152: stopping datanode
192.168.1.153: stopping datanode
192.168.1.152: stopping secondarynamenode
[hadoop@node1 name]$ hadoop namenode -format
13/11/14 06:21:37 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = node1/192.168.1.151
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
Re-format filesystem in /app/user/hdfs/name ? (Y or N) Y
13/11/14 06:21:39 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
13/11/14 06:21:39 INFO namenode.FSNamesystem: supergroup=supergroup
13/11/14 06:21:39 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/11/14 06:21:39 INFO common.Storage: Image file of size 96 saved in 0 seconds.
13/11/14 06:21:39 INFO common.Storage: Storage directory /app/user/hdfs/name has been successfully formatted.
13/11/14 06:21:39 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
************************************************************/




5. Read the namespaceID that the NameNode had before it was formatted from the VERSION file of any DataNode, then change the NameNode's namespaceID to match the DataNode's:



#Thu Nov  :: CST  namespaceID storageIDDS. cTime storageType layoutVersion  apphdfsdata
---- Change the NameNode's namespaceID ----
#Thu Nov  :: CST  namespaceID cTime storageType layoutVersion
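
Editing the VERSION file by hand is error-prone, so this step can be scripted. The following is a minimal sketch that assumes the VERSION files use the usual key=value layout with a line of the form namespaceID=<number>. The function names are hypothetical, and the paths you pass in depend on your own dfs.data.dir and dfs.name.dir settings.

```shell
#!/usr/bin/env bash
# Print the namespaceID stored in a VERSION file (assumes a "namespaceID=..." line).
read_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Hypothetical helper: copy the DataNode's namespaceID into the NameNode's
# VERSION file, keeping the previous file as <file>.bak.
#   $1 = DataNode VERSION file, $2 = NameNode VERSION file
sync_namespace_id() {
    local datanode_version="$1" namenode_version="$2" id
    id=$(read_namespace_id "$datanode_version")
    [ -n "$id" ] || { echo "namespaceID not found in $datanode_version" >&2; return 1; }
    cp "$namenode_version" "$namenode_version.bak" || return 1
    # Rewrite only the namespaceID line; all other keys are left untouched.
    sed "s/^namespaceID=.*/namespaceID=$id/" "$namenode_version.bak" > "$namenode_version"
}
```

Run it while the cluster is stopped, for example sync_namespace_id <datanode-dir>/current/VERSION /app/user/hdfs/name/current/VERSION, where <datanode-dir> stands for your DataNode's storage directory.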




6. Delete the fsimage file of the newly formatted NameNode:



[hadoop@node1 current]$ ll
total 16
-rw-rw-r-- 1 hadoop hadoop   4 Nov 14 06:21 edits
-rw-rw-r-- 1 hadoop hadoop  96 Nov 14 06:21 fsimage
-rw-rw-r-- 1 hadoop hadoop   8 Nov 14 06:21 fstime
-rw-rw-r-- 1 hadoop hadoop 101 Nov 14 06:22 VERSION
[hadoop@node1 current]$ rm fsimage




7. Copy the fsimage from the SecondaryNameNode into the NameNode's current directory:



[hadoop@node2 current]$ ll
total 16
-rw-rw-r-- 1 hadoop hadoop    4 Nov 14 05:38 edits
-rw-rw-r-- 1 hadoop hadoop 2410 Nov 14 05:38 fsimage
-rw-rw-r-- 1 hadoop hadoop    8 Nov 14 05:38 fstime
-rw-rw-r-- 1 hadoop hadoop  101 Nov 14 05:38 VERSION
[hadoop@node2 current]$ scp fsimage node1:/app/user/hdfs/name/current
The authenticity of host 'node1 (192.168.1.151)' can't be established.
RSA key fingerprint is ca:9a:7e:19:ee:a1:35:44:7e:9d:d4:09:5c:fc:c5:0a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node1,192.168.1.151' (RSA) to the list of known hosts.
fsimage                                       100% 2410     2.4KB/s   00:00
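
Because the whole recovery depends on this one file, it may be worth confirming that the copy is byte-identical to the original. The sketch below compares two local files by MD5 and assumes md5sum is installed. The helper name files_match is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when two local files have the same MD5 checksum.
#   $1, $2 = paths of the files to compare
files_match() {
    local a b
    # Reading via stdin keeps the file name out of the md5sum output.
    a=$(md5sum < "$1") || return 1
    b=$(md5sum < "$2") || return 1
    [ "$a" = "$b" ]
}
```

When the two copies live on different hosts, as here, the same idea works by running md5sum on each host and comparing the two hashes by eye or in a script.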




8. Restart the cluster:



[hadoop@node1 current]$ start-all.sh
starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
[hadoop@node1 current]$ jps
32486 Jps
32419 JobTracker
32271 NameNode
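
Scanning the jps output by eye is fine for a one-off test, but the check is easy to script. This small sketch reads jps output on standard input and succeeds only if a process with exactly the given name is listed. The function name daemon_listed is an illustrative choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output on stdin and succeed if a process whose
# name is exactly $1 is listed (so "SecondaryNameNode" does not count as "NameNode").
daemon_listed() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit (found ? 0 : 1) }'
}
```

Usage: jps | daemon_listed NameNode && echo "NameNode is running".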




9. Verify the integrity of the data:



[hadoop@node1 current]$ hadoop dfs -ls /user/hive/warehouse
Found 8 items
drwxr-xr-x   - hadoop supergroup          0 2013-10-17 16:18 /user/hive/warehouse/echo
drwxr-xr-x   - hadoop supergroup          0 2013-10-28 13:48 /user/hive/warehouse/jack
drwxr-xr-x   - hadoop supergroup          0 2013-09-18 15:54 /user/hive/warehouse/table4
drwxr-xr-x   - hadoop supergroup          0 2013-09-18 15:53 /user/hive/warehouse/table5
drwxr-xr-x   - hadoop supergroup          0 2013-09-18 15:48 /user/hive/warehouse/test
drwxr-xr-x   - hadoop supergroup          0 2013-10-25 14:50 /user/hive/warehouse/test1
drwxr-xr-x   - hadoop supergroup          0 2013-10-25 14:52 /user/hive/warehouse/test2
drwxr-xr-x   - hadoop supergroup          0 2013-10-25 14:30 /user/hive/warehouse/test3

[hadoop@node3 conf]$ hive

Logging initialized using configuration in jar:file:/app/hive/lib/hive-common-0.11.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_7451@node3_201311111325_424288589.txt
hive> show tables;
OK
echo
jack
table4
table5
test
test1
test2
test3
Time taken: 27.589 seconds, Fetched: 8 row(s)
hive> select * from table4;
OK
NULL    NULL    NULL
1    1    5
2    4    5
3    4    5
4    5    6
5    6    7
6    1    5
7    5    6
8    3    6
NULL    NULL    NULL
Time taken: 2.124 seconds, Fetched: 10 row(s)




The data that was in HDFS before the failure has not been lost.
Method 2: Using hadoop namenode -importCheckpoint
1. Delete the name directory:
[hadoop@node1 hdfs]$ rm -rf name

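2. Stop the cluster, then copy the namesecondary directory from the SecondaryNameNode to the NameNode host, into the parent of dfs.name.dir (here /app/user/hdfs/), which is where the import in step 3 reads the checkpoint from: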



[hadoop@node2 hdfs]$ scp -r namesecondary node1:/app/user/hdfs/
fsimage                                       100%  157     0.2KB/s   00:00
fstime                                        100%    8     0.0KB/s   00:00
fsimage                                       100% 2410     2.4KB/s   00:00
VERSION                                       100%  101     0.1KB/s   00:00
edits                                         100%    4     0.0KB/s   00:00
fstime                                        100%    8     0.0KB/s   00:00
fsimage                                       100% 2410     2.4KB/s   00:00
VERSION                                       100%  101     0.1KB/s   00:00
edits                                         100%    4     0.0KB/s   00:00




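3. On the NameNode, run hadoop namenode -importCheckpoint. The command keeps running in the foreground, so once the image has been loaded, stop it and start the cluster with start-all.sh: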



[hadoop@node1 hdfs]$ hadoop namenode -importCheckpoint
13/11/14 07:24:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = node1/192.168.1.151
STARTUP_MSG:   args = [-importCheckpoint]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
13/11/14 07:24:20 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
13/11/14 07:24:20 INFO namenode.NameNode: Namenode up at: node1.com/192.168.1.151:9000
13/11/14 07:24:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
13/11/14 07:24:20 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
13/11/14 07:24:21 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
13/11/14 07:24:21 INFO namenode.FSNamesystem: supergroup=supergroup
13/11/14 07:24:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/11/14 07:24:21 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
13/11/14 07:24:21 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
13/11/14 07:24:21 INFO common.Storage: Storage directory /app/user/hdfs/name is not formatted.
13/11/14 07:24:21 INFO common.Storage: Formatting ...
13/11/14 07:24:21 INFO common.Storage: Number of files = 26
13/11/14 07:24:21 INFO common.Storage: Number of files under construction = 0
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 loaded in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Edits file /app/user/hdfs/namesecondary/current/edits of size 4 edits # 0 loaded in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
13/11/14 07:24:21 INFO namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
13/11/14 07:24:21 INFO namenode.FSNamesystem: Finished loading FSImage in 252 msecs
13/11/14 07:24:21 INFO hdfs.StateChange: STATE* Safe mode ON.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
13/11/14 07:24:21 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
13/11/14 07:24:21 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070
13/11/14 07:24:21 INFO http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 50070
13/11/14 07:24:21 INFO http.HttpServer: Jetty bound to port 50070
13/11/14 07:24:21 INFO mortbay.log: jetty-6.1.14
13/11/14 07:24:21 INFO mortbay.log: Started SelectChannelConnector@node1.com:50070
13/11/14 07:24:21 INFO namenode.NameNode: Web-server up at: node1.com:50070
13/11/14 07:24:21 INFO ipc.Server: IPC Server Responder: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server listener on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 0 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 1 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 2 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 3 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 4 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 5 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 6 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 9 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 7 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 8 on 9000: starting
13/11/14 07:37:05 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
************************************************************/
[hadoop@node1 current]$ start-all.sh
starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
[hadoop@node1 current]$ jps
1027 JobTracker
1121 Jps
879 NameNode




4. Verify the integrity of the data:



[hadoop@node3 conf]$ hive

Logging initialized using configuration in jar:file:/app/hive/lib/hive-common-0.11.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_8383@node3_201311111443_2018635710.txt
hive> select * from table4;
OK
NULL    NULL    NULL
1    1    5
2    4    5
3    4    5
4    5    6
5    6    7
6    1    5
7    5    6
8    3    6
NULL    NULL    NULL
Time taken: 3.081 seconds, Fetched: 10 row(s)




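Summary:
Note: a NameNode recovered this way loses every change made between the SecondaryNameNode's most recent checkpoint and the moment of failure, so in practice the value of fs.checkpoint.period has to balance checkpoint overhead against how much metadata you can afford to lose. Also back up the contents of the SecondaryNameNode regularly: the SecondaryNameNode is itself a single point of failure, and the backup protects you if it fails too.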
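
The regular backup mentioned above could be as simple as archiving the checkpoint directory with a timestamp. The sketch below is one possible approach, not a recommended procedure from Hadoop itself. The function name and the archive naming scheme are assumptions, and an archive taken while a checkpoint is being written may be inconsistent, so schedule it accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive a checkpoint directory (e.g. namesecondary) as
# <dest_dir>/namesecondary-<timestamp>.tar.gz and print the archive path.
#   $1 = checkpoint directory, $2 = directory to store archives in
backup_checkpoint_dir() {
    local src="$1" dest_dir="$2" archive
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/namesecondary-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

It could be run from cron on the SecondaryNameNode, for example backup_checkpoint_dir /app/user/hdfs/namesecondary /backup/hdfs, where /backup/hdfs is a placeholder for storage on a different machine or disk.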
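Additional notes: if you restore the NameNode onto a new node, keep the following in mind.
1. The new node's Linux environment, directory layout, environment variables and other settings must be identical to those of the original NameNode, including every configuration file under the conf directory.
2. The new NameNode should keep the hostname of the original NameNode. If it gets a different hostname, you have to update the hosts files on all DataNodes and on the SecondaryNameNode, and reconfigure the following settings:
fs.default.name in core-site.xml
dfs.http.address in hdfs-site.xml (on the SecondaryNameNode)
mapred.job.tracker in mapred-site.xml (if the JobTracker runs on the same machine as the NameNode, which is usually the case).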
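
If the hostname does change, the three files above can be updated in one pass. This is a rough sketch under several assumptions: the old hostname appears literally in the files, a plain text substitution is acceptable (it would also rewrite longer names that contain the old one), and the helper name rename_namenode_host is made up for illustration. Review the resulting files before restarting anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace old_host with new_host in core-site.xml,
# hdfs-site.xml and mapred-site.xml under conf_dir, keeping .bak copies.
#   $1 = old hostname, $2 = new hostname, $3 = Hadoop conf directory
rename_namenode_host() {
    local old_host="$1" new_host="$2" conf_dir="$3" old_re f
    # Escape dots so "node1.com" is matched literally, not as a regex wildcard.
    old_re=$(printf '%s' "$old_host" | sed 's/\./\\./g')
    for f in core-site.xml hdfs-site.xml mapred-site.xml; do
        [ -f "$conf_dir/$f" ] || continue
        cp "$conf_dir/$f" "$conf_dir/$f.bak" || return 1
        sed "s/$old_re/$new_host/g" "$conf_dir/$f.bak" > "$conf_dir/$f" || return 1
    done
    return 0
}
```

The hosts files on the DataNodes and the SecondaryNameNode still need to be updated separately, as described in point 2.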