
Deploying Spark 1.5.2 HA on VMware virtual machines (succeeded) and on OpenStack virtual machines (failed)

2016-05-25 14:50
On VMware I installed five CentOS 6.5 Linux machines: two as Spark master nodes and three as Spark worker nodes, with ZooKeeper providing Spark HA so that one master is alive and the other is standby. Installing Spark itself is straightforward (not covered here); the HA configuration followed the official documentation:

Configuration

In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env using this configuration:
System property | Meaning
spark.deploy.recoveryMode | Set to ZOOKEEPER to enable standby Master recovery mode (default: NONE).
spark.deploy.zookeeper.url | The ZooKeeper cluster url (e.g., 192.168.1.100:2181,192.168.1.101:2181).
spark.deploy.zookeeper.dir | The directory in ZooKeeper to store recovery state (default: /spark).

Possible gotcha: If you have multiple Masters in your cluster but fail to correctly configure the Masters to use ZooKeeper, the Masters will fail to discover each other and think they’re all leaders. This will not lead to a healthy cluster state (as all Masters will schedule independently).
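
For reference, a minimal spark-env.sh sketch of these settings, using the ZooKeeper ensemble that appears in the logs below (hadoopspark01-03); this is only an assumption about the exact file used here, so adjust the hostnames and znode directory for your own cluster:

# spark-env.sh on both Master nodes (sketch only)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=hadoopspark01:2181,hadoopspark02:2181,hadoopspark03:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"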

With this configuration in place, the active/standby switchover succeeds!
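
One way to verify the switchover, assuming a standard sbin-based install (paths depend on how Spark was deployed):

# On the currently alive Master node: stop the Master process
$SPARK_HOME/sbin/stop-master.sh   # or: kill <Master pid shown by jps>
# Then watch the standby's web UI (default port 8080): its status should
# change from STANDBY to ALIVE and the workers should stay registered as ALIVE.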

On OpenStack I installed five CentOS 6.6 Linux machines: two as Spark master nodes and three as Spark worker nodes, again with ZooKeeper providing Spark HA, one master alive and the other standby. Installing Spark is just as straightforward (not covered here), and the HA configuration was exactly the same official-documentation setup as above.

With the same configuration, however, the switchover fails. When the Master process on the alive node is killed, the Worker processes on the three worker nodes also die: the standby master does become alive, but all three workers show up as DEAD.
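
A quick way to confirm this state from the shell (a sketch; assumes default ports and that jps is available on each node):

# On each worker node: check whether the Worker JVM is still running
jps | grep -i worker
# On the standby (now alive) master: the web UI on port 8080 lists every
# registered worker together with its state (ALIVE or DEAD).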

The master log is as follows:

16/05/18 18:06:20 INFO ConnectionStateManager: State change: CONNECTED

16/05/18 18:06:20 WARN ConnectionStateManager: There are no ConnectionStateListeners registered.

16/05/18 18:06:21 INFO ZooKeeperLeaderElectionAgent: Starting ZooKeeper LeaderElection agent

16/05/18 18:06:21 INFO CuratorFrameworkImpl: Starting

16/05/18 18:06:21 INFO ZooKeeper: Initiating client connection, connectString=hadoopspark01:2181,hadoopspark02:2181,hadoopspark03:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@1f49f731

16/05/18 18:06:21 INFO ClientCnxn: Opening socket connection to server hadoopspark02/hadoopspark02:2181. Will not attempt to authenticate using SASL (unknown error)

16/05/18 18:06:21 INFO ClientCnxn: Socket connection established to hadoopspark02/hadoopspark02:2181, initiating session

16/05/18 18:06:21 INFO ClientCnxn: Session establishment complete on server hadoopspark02/hadoopspark02:2181, sessionid = 0x254c35056a10002, negotiated timeout = 40000

16/05/18 18:06:21 INFO ConnectionStateManager: State change: CONNECTED

16/05/18 18:07:34 INFO ZooKeeperLeaderElectionAgent: We have gained leadership

16/05/18 18:07:34 INFO Master: I have been elected leader! New state: RECOVERING

16/05/18 18:07:34 INFO Master: Trying to recover worker: worker-20160518180504-hadoopspark05-33284

16/05/18 18:07:34 INFO Master: Trying to recover worker: worker-20160518180503-hadoopspark04-40375

16/05/18 18:07:34 INFO Master: Trying to recover worker: worker-20160518180504-hadoopspark03-41784

16/05/18 18:08:34 INFO Master: Removing worker worker-20160518180504-hadoopspark03-41784 on hadoopspark03:41784

16/05/18 18:08:34 INFO Master: Removing worker worker-20160518180504-hadoopspark05-33284 on hadoopspark05:33284

16/05/18 18:08:34 INFO Master: Removing worker worker-20160518180503-hadoopspark04-40375 on hadoopspark04:40375

16/05/18 18:08:34 INFO Master: Recovery complete - resuming operations!

16/05/18 18:08:37 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@hadoopspark05:33284] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://sparkWorker@hadoopspark05:33284]] Caused by: [Connection timed out: /hadoopspark05:33284]

16/05/18 18:08:37 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@hadoopspark04:40375] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://sparkWorker@hadoopspark04:40375]] Caused by: [Connection timed out: /hadoopspark04:40375]

16/05/18 18:08:37 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@hadoopspark03:41784] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://sparkWorker@hadoopspark03:41784]] Caused by: [Connection timed out: /hadoopspark03:41784]

16/05/18 18:08:37 INFO Master: hadoopspark05:33284 got disassociated, removing it.

16/05/18 18:08:37 INFO Master: hadoopspark04:40375 got disassociated, removing it.

16/05/18 18:08:37 INFO Master: hadoopspark03:41784 got disassociated, removing it.

The worker log is as follows:

16/05/18 18:07:01 ERROR Worker: Connection to master failed! Waiting for master to reconnect...

16/05/18 18:07:01 INFO Worker: Connecting to master hadoopspark01:7077...

16/05/18 18:07:01 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@hadoopspark01:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

16/05/18 18:07:01 INFO Worker: hadoopspark01:7077 Disassociated !

16/05/18 18:07:01 ERROR Worker: Connection to master failed! Waiting for master to reconnect...

16/05/18 18:07:01 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.

16/05/18 18:07:01 WARN Worker: Failed to connect to master hadoopspark01:7077

akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkMaster@hadoopspark01:7077/), Path(/user/Master)]

at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)

at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)

at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)

at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)

at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)

at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)

at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)

at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)

at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)

at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)

at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)

at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:533)

at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:569)

at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:559)

at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)

at akka.actor.ActorRef.tell(ActorRef.scala:123)

at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)

at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:439)

at akka.dispatch.UnboundedMailbox$MessageQueue.cleanUp(Mailbox.scala:559)

at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:310)

at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:202)

at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:138)

at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:212)

at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)

at akka.actor.ActorCell.terminate(ActorCell.scala:369)

at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)

at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)

at akka.dispatch.Mailbox.run(Mailbox.scala:219)

at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

16/05/18 18:07:13 INFO Worker: Retrying connection to master (attempt # 1)

16/05/18 18:07:13 INFO Worker: Connecting to master hadoopspark01:7077...

16/05/18 18:07:25 INFO Worker: Retrying connection to master (attempt # 2)

16/05/18 18:07:25 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@338b180b rejected from java.util.concurrent.ThreadPoolExecutor@70d7949c[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 2]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)

at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)

at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)

at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:269)

at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)

at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)

at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)

at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)

at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)

at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)

at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)

at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)

at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)

at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)

at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

at akka.actor.ActorCell.invoke(ActorCell.scala:487)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

at akka.dispatch.Mailbox.run(Mailbox.scala:220)

at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

16/05/18 18:07:25 INFO ShutdownHookManager: Shutdown hook called

My guess: is this a compatibility problem between Spark and OpenStack, or is there some fundamental difference between OpenStack and VMware?
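
Before blaming OpenStack itself, the "Connection timed out" warnings in the master log suggest checking plain network reachability between the new master and the workers; a sketch of such checks, with a hostname and port taken from the log above (on OpenStack, security group rules and iptables are common culprits):

# From the standby (now alive) master, test reachability to a worker's
# Akka port as reported in the log (example: hadoopspark05, port 33284)
ping -c 3 hadoopspark05
nc -zv hadoopspark05 33284
# Also check name resolution and the local firewall on every node,
# plus the OpenStack security group rules for these instances
getent hosts hadoopspark05
service iptables status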

Could anyone who has run into this please take a look and help?