
Incident case: a pitfall with write concern majority in a MongoDB replica set

2016-08-23 17:23

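Symptom:

A replica set has four members: one primary, two secondaries, and one arbiter. After one of the secondaries is shut down, changing a user's password on the primary hangs until it times out and fails.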

udb-aqmp5a:PRIMARY> db.changeUserPassword("root","123123")

2016-08-23T17:05:30.879+0800 E QUERY    Error: Updating user failed: timeout

    at Error (<anonymous>)

    at DB.updateUser (src/mongo/shell/db.js:1152:11)

    at DB.changeUserPassword (src/mongo/shell/db.js:1156:10)

    at (shell):1:4 at src/mongo/shell/db.js:1152

Cause:

Check the MongoDB log:

2016-08-19T12:37:08.897+0800 W NETWORK  [ReplExecNetThread-12] Failed to connect to 10.19.66.62:27017, reason: errno:115 Operation now in progress

2016-08-19T12:37:08.897+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 10.19.66.62:27017; Location18915 Failed attempt to connect to 10.19.66.62:27017; couldn't connect to server 10.19.66.62:27017 (10.19.66.62), connection attempt failed

2016-08-19T12:37:15.524+0800 I COMMAND  [conn601] command admin.$cmd command: getLastError { getLastError: 1, w: "majority", wtimeout: 30000.0 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:270 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 1, W: 2 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } 30001ms

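The log shows that the password change waits on getLastError with w: "majority" and wtimeout: 30000, and gives up after 30001ms. With one secondary down, the "majority" requirement cannot be met. The majority appears to be computed over all members of the set, arbiter included, so 4 members require 3 acknowledgements. The arbiter holds no data and can never acknowledge a write, so whenever an arbiter is present it always counts against the write.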
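The arithmetic can be sketched in plain JavaScript. This is a minimal model, not MongoDB's implementation: the `members` array and the helpers `majorityOf` and `ackCapable` are hypothetical names made up for illustration, and the sketch assumes every member votes while only reachable data-bearing members can acknowledge a write.

```javascript
// Hypothetical model of the replica set in this incident (not a MongoDB API).
const members = [
  { host: "10.9.46.198:27017", role: "PRIMARY", up: true },
  { host: "10.9.56.132:27017", role: "SECONDARY", up: false }, // the stopped secondary
  { host: "10.9.48.72:27017", role: "SECONDARY", up: true },
  { host: "10.9.51.198:27017", role: "ARBITER", up: true },
];

// Assumption: every member votes, so the majority is floor(n / 2) + 1.
function majorityOf(memberList) {
  return Math.floor(memberList.length / 2) + 1;
}

// Only data-bearing members that are reachable can acknowledge a write.
function ackCapable(memberList) {
  return memberList.filter((m) => m.up && m.role !== "ARBITER").length;
}

const needed = majorityOf(members);
const available = ackCapable(members);
console.log(`majority needs ${needed} acks, ${available} members can ack`);
console.log(needed <= available ? "satisfiable" : "will wait until wtimeout");
```

Under these assumptions the set needs 3 acknowledgements but only 2 members can provide them, which matches the timeout seen in the log.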

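Reproduction:

Setup: one primary, two secondaries, and one arbiter, with one of the secondaries shut down.

Method 1: use ordinary writes, for example inserting a document into a collection, while varying the w value of the write concern.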

udb-aqmp5a:PRIMARY> rs.status()

{
"set" : "udb-aqmp5a",
"date" : ISODate("2016-08-23T09:04:03.018Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "10.9.46.198:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 308,
"optime" : Timestamp(1471942975, 1),
"optimeDate" : ISODate("2016-08-23T09:02:55Z"),
"electionTime" : Timestamp(1471942751, 2),
"electionDate" : ISODate("2016-08-23T08:59:11Z"),
"configVersion" : 4,
"self" : true
},
{
"_id" : 1,
"name" : "10.9.56.132:27017",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : Timestamp(0, 0),
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2016-08-23T09:03:51.769Z"),
"lastHeartbeatRecv" : ISODate("2016-08-23T09:03:39.757Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "DBClientBase::findN: transport error: 10.9.56.132:27017 ns: admin.$cmd query: { replSetHeartbeat: \"udb-aqmp5a\", pv: 1, v: 4, from: \"10.9.46.198:27017\", fromId: 0, checkEmpty: false }",
"configVersion" : -1
},
{
"_id" : 2,
"name" : "10.9.48.72:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 84,
"optime" : Timestamp(1471942975, 1),
"optimeDate" : ISODate("2016-08-23T09:02:55Z"),
"lastHeartbeat" : ISODate("2016-08-23T09:04:01.766Z"),
"lastHeartbeatRecv" : ISODate("2016-08-23T09:04:02.802Z"),
"pingMs" : 0,
"syncingTo" : "10.9.56.132:27017",
"configVersion" : 4
},
{
"_id" : 3,
"name" : "10.9.51.198:27017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 67,
"lastHeartbeat" : ISODate("2016-08-23T09:04:01.794Z"),
"lastHeartbeatRecv" : ISODate("2016-08-23T09:04:01.363Z"),
"pingMs" : 0,
"configVersion" : 4
}
],
"ok" : 1

}

udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang"},{writeConcern:{w:2,wtimeout:5000}})

WriteResult({ "nInserted" : 1 })

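This write succeeds: w: 2 only requires 2 members of the replica set to acknowledge the write, and the primary plus the remaining healthy secondary satisfy that.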

udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang1"},{writeConcern:{w:3,wtimeout:5000}})

WriteResult({
"nInserted" : 1,
"writeConcernError" : {
"code" : 64,
"errInfo" : {
"wtimeout" : true
},
"errmsg" : "waiting for replication timed out"
}

})

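This time the write concern cannot be satisfied. The document is inserted on the primary (nInserted is 1), but w: 3 requires acknowledgement from 3 members. The arbiter cannot acknowledge writes and one secondary is down, so only 2 members can respond and the call times out.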

udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang2"},{writeConcern:{w:"majority",wtimeout:5000}})

WriteResult({
"nInserted" : 1,
"writeConcernError" : {
"code" : 64,
"errInfo" : {
"wtimeout" : true
},
"errmsg" : "waiting for replication timed out"
}

})

This write concern also fails: with 4 members, majority works out to w: 3, which cannot be satisfied.
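Both timed-out results above still report nInserted: 1 next to the writeConcernError, so it helps to read the two fields separately. The sketch below works on plain objects copied from the shell output; `describeWrite` is a hypothetical helper written for this post, not part of the mongo shell.

```javascript
// Plain objects shaped like the WriteResult output printed above.
const okResult = { nInserted: 1 };
const timedOutResult = {
  nInserted: 1,
  writeConcernError: {
    code: 64,
    errInfo: { wtimeout: true },
    errmsg: "waiting for replication timed out",
  },
};

// Hypothetical helper: say whether the write was applied and whether
// the requested write concern was met.
function describeWrite(result) {
  const applied = result.nInserted > 0;
  const wcError = result.writeConcernError;
  if (applied && !wcError) return "applied and acknowledged";
  if (applied && wcError) {
    return "applied on primary, write concern not met: " + wcError.errmsg;
  }
  return "not applied";
}

console.log(describeWrite(okResult));
console.log(describeWrite(timedOutResult));
```

A write concern timeout therefore means the requested number of acknowledgements did not arrive in time, not that the insert was skipped on the primary.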

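Summary:

Personally, I think this is a part of MongoDB's design that is not quite reasonable and is easy to misunderstand. The arbiter plays a positive role when the primary goes down and a new primary has to be elected. Once the write concern is set to majority, however, the arbiter only ever plays a negative role, because it cannot write data itself. In a layout with 1 primary, 2 secondaries, and 1 arbiter, a new primary can be elected if the primary goes down. But if one secondary goes down, writes with majority can no longer be acknowledged, even though 2 of the 3 data nodes are healthy. (In a production environment this arbiter is entirely redundant, and the layout would not normally be used.)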
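The contrast between elections and majority writes can be sketched by comparing the same set with and without the arbiter. This is only a rough model under stated assumptions: `evaluate` is a hypothetical helper, every member is assumed to vote, and only data-bearing members that are up are assumed to acknowledge writes.

```javascript
// Hypothetical helper, assuming all members vote and only data-bearing
// members that are up can acknowledge a write.
function evaluate(dataNodes, arbiters, dataNodesDown) {
  const voters = dataNodes + arbiters;
  const majority = Math.floor(voters / 2) + 1;
  const votersUp = voters - dataNodesDown; // arbiters are assumed to stay up
  const ackersUp = dataNodes - dataNodesDown;
  return {
    majority,
    canElectPrimary: votersUp >= majority,
    canAckMajorityWrite: ackersUp >= majority,
  };
}

// The layout from this incident: 3 data nodes + 1 arbiter, one data node down.
const withArbiter = evaluate(3, 1, 1);
// The same layout without the arbiter: 3 data nodes, one down.
const withoutArbiter = evaluate(3, 0, 1);

console.log("with arbiter:", withArbiter);
console.log("without arbiter:", withoutArbiter);
```

In this model the arbiter raises the majority from 2 to 3 without adding a member that can acknowledge writes, which is why dropping it lets majority writes survive the loss of one data node.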