Reroute Unassigned Shards: When Primary Shards Are Unassigned, the Fix Is to Reroute
2017-08-31 17:26
Red Cluster!
Source: http://blog.kiyanpro.com/2016/03/06/elasticsearch/reroute-unassigned-shards/

There are 3 cluster states:
green: All primary and replica shards are active
yellow: All primary shards are active, but not all replica shards are active
red: Not all primary shards are active
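
To make the three definitions concrete, here is a minimal sketch that derives the status from a list of shards. The function name `cluster_status` and the `(is_primary, is_active)` tuple format are assumptions made for illustration; Elasticsearch computes this itself and reports it through its cluster health API.

```python
def cluster_status(shards):
    """Derive green/yellow/red from (is_primary, is_active) pairs."""
    # Any inactive primary makes the cluster red.
    if not all(active for primary, active in shards if primary):
        return "red"
    # All primaries are active, but some replica is not: yellow.
    if not all(active for primary, active in shards if not primary):
        return "yellow"
    return "green"


print(cluster_status([(True, True), (False, True)]))    # green
print(cluster_status([(True, True), (False, False)]))   # yellow
print(cluster_status([(True, False), (False, False)]))  # red
```
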
When cluster health is red, the cluster is effectively dead: at least one primary shard is not active, and the data in it is unavailable until the cluster recovers, which is very bad indeed. I will share with you how to deal with one common situation: a cluster that is red due to unassigned shards.
Steps
The general idea is pretty simple: find the shards that are unassigned, then manually assign them to a node with the reroute API. Let’s see how to do that step by step. Then we can combine the steps into a simple configurable script.

Step 1: Check Unassigned Shards
To get cluster information, we usually use the cat APIs. There is a GET /_cat/shards endpoint that shows a detailed view of which nodes contain which shards[1].
Cat shards

    # cat shards verbose
    curl "http://your.elasticsearch.host.com:9200/_cat/shards?v"

    # cat shards index
    curl "http://your.elasticsearch.host.com:9200/_cat/shards/wiki2"

    # example return
    # wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
    # wiki2 1 p STARTED 205 5.9mb 192.168.56.30 Frankie Raye
    # wiki2 2 p STARTED 275 7.8mb 192.168.56.20 Commander Kraken

Get unassigned shards

    # cat shards with fgrep
    curl "http://your.elasticsearch.host.com:9200/_cat/shards" | fgrep UNASSIGNED

    # example return
    # wiki1 0 r UNASSIGNED ALLOCATION_FAILED
    # wiki1 1 r UNASSIGNED ALLOCATION_FAILED
    # wiki1 2 r UNASSIGNED ALLOCATION_FAILED

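
The same filtering can be done in a script, which makes the result easier to pass on to the reroute step. This is a minimal sketch, assuming the column order shown in the example output (index, shard, primary/replica flag, state, then an optional reason). `parse_unassigned` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def parse_unassigned(cat_output):
    """Return (index, shard, prirep, reason) tuples for UNASSIGNED rows."""
    shards = []
    for line in cat_output.splitlines():
        cols = line.split()
        # Assumed columns: index shard prirep state [reason ...]
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            reason = cols[4] if len(cols) > 4 else None
            shards.append((cols[0], int(cols[1]), cols[2], reason))
    return shards


# Sample text in the shape of the _cat/shards output shown in this post.
sample = """\
wiki1 0 r UNASSIGNED ALLOCATION_FAILED
wiki1 1 r UNASSIGNED ALLOCATION_FAILED
wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
"""
for shard in parse_unassigned(sample):
    print(shard)
```

In a real run, `cat_output` would be the response body of GET /_cat/shards rather than a hard-coded string.
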
Another way to find problem shards is POST /_flush/synced[2]. This endpoint does more than report information: it lets an administrator initiate a synced flush manually. This can be particularly useful for a planned (rolling) cluster restart, where you can stop indexing and don’t want to wait the default 5 minutes for idle indices to be sync-flushed automatically. It returns a JSON response that lists any shards that failed.
_flush/synced

    curl -XPOST "http://your.elasticsearch.host.com:9200/twitter/_flush/synced"

Example response with failed shards

    {
      "_shards": {
        "total": 4,
        "successful": 1,
        "failed": 1
      },
      "twitter": {
        "total": 4,
        "successful": 3,
        "failed": 1,
        "failures": [
          {
            "shard": 1,
            "reason": "unexpected error",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "SZNr2J_ORxKTLUCydGX4zA",
              "relocating_node": null,
              "shard": 1,
              "index": "twitter"
            }
          }
        ]
      }
    }

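
A response of this shape can be scanned for failed shards with a few lines of code. This is a minimal sketch, assuming the response layout shown in the example (a `_shards` summary plus one entry per index with an optional `failures` list). `failed_shards` is a hypothetical helper name.

```python
import json


def failed_shards(flush_response):
    """Return (index, shard, reason) for every failure in a synced-flush response."""
    failures = []
    for index, stats in flush_response.items():
        if index == "_shards":  # overall summary, not an index
            continue
        for failure in stats.get("failures", []):
            failures.append((index, failure["shard"], failure["reason"]))
    return failures


# The example response from this post, trimmed of whitespace only.
response = json.loads("""
{
  "_shards": {"total": 4, "successful": 1, "failed": 1},
  "twitter": {
    "total": 4, "successful": 3, "failed": 1,
    "failures": [
      {"shard": 1, "reason": "unexpected error",
       "routing": {"state": "STARTED", "primary": false,
                   "node": "SZNr2J_ORxKTLUCydGX4zA",
                   "relocating_node": null,
                   "shard": 1, "index": "twitter"}}
    ]
  }
}
""")
print(failed_shards(response))
```
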
Step 2: Reroute
The reroute command lets you explicitly execute cluster reroute allocation commands[3]. For example, an unassigned shard can be explicitly allocated to a specific node.

Reroute example

    curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
        "commands" : [ {
            "move" : {
                "index" : "test", "shard" : 0,
                "from_node" : "node1", "to_node" : "node2"
            }
        },
        {
            "allocate" : {
                "index" : "test", "shard" : 1, "node" : "node3"
            }
        } ]
    }'

move: Move a started shard from one node to another node. Accepts index and shard for index name and shard number, from_node for the node to move the shard from, and to_node for the node to move the shard to.
cancel: Cancel allocation of a shard (or recovery). Accepts index and shard for index name and shard number, and node for the node to cancel the shard allocation on. It also accepts the allow_primary flag to explicitly allow cancelling the allocation of a primary shard. This can be used to force resynchronization of existing replicas from the primary shard by cancelling them and letting them be reinitialized through the standard reallocation process.
allocate: Allocate an unassigned shard to a node. Accepts index and shard for index name and shard number, and node for the node to allocate the shard to. It also accepts the allow_primary flag to explicitly allow allocating a primary shard (which might result in data loss).
Combining Step 2 with the unassigned shards from Step 1, we can reroute all unassigned shards one by one, and so recover the cluster from the red state faster.
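
As an illustration of that combination, the sketch below builds one reroute request body per unassigned shard. It assumes the `allocate` command format described above. Later Elasticsearch releases renamed this command, so check the reroute documentation for your version. The helper names, the sample shard list, and the node name `node3` are hypothetical.

```python
import json


def allocate_command(index, shard, node, allow_primary=False):
    """Build one `allocate` command in the format described above."""
    return {
        "allocate": {
            "index": index,
            "shard": shard,
            "node": node,
            # Leave False unless losing data on a primary is acceptable.
            "allow_primary": allow_primary,
        }
    }


def reroute_bodies(unassigned, node):
    """Yield one JSON request body per unassigned (index, shard) pair."""
    for index, shard in unassigned:
        yield json.dumps({"commands": [allocate_command(index, shard, node)]})


# Hypothetical input: the (index, shard) pairs found in Step 1.
unassigned = [("wiki1", 0), ("wiki1", 1), ("wiki1", 2)]
for body in reroute_bodies(unassigned, node="node3"):
    # Each body would be sent with POST /_cluster/reroute.
    print(body)
```

Sending one command per request, as here, keeps a single failing shard from rejecting the whole batch, at the cost of more round trips.
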
Example Solutions
Python
Below is a Python script I wrote using POST /_flush/synced and POST /_cluster/reroute.
Shell Script
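Below is a shell script I found elsewhere in a blog post[4]. The index name (t37) and node name (datanode15) are hard-coded, so adjust them for your cluster.

    # List only the shards of index t37, since the index name is fixed below
    for shard in $(curl -s -XGET "http://localhost:9200/_cat/shards/t37" | grep UNASSIGNED | awk '{print $2}'); do
        # The single quotes are closed around $shard so the shell expands it.
        # allow_primary may result in data loss for primary shards.
        curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
            "commands" : [ {
                "allocate" : {
                    "index" : "t37",
                    "shard" : '"$shard"',
                    "node" : "datanode15",
                    "allow_primary" : true
                }
            } ]
        }'
        sleep 5
    done
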
Possible Unassigned Shard Reasons
FYI, these are the possible reasons for a shard to be in an unassigned state[1]:

Name | Comment |
---|---|
INDEX_CREATED | Unassigned as a result of an API creation of an index |
CLUSTER_RECOVERED | Unassigned as a result of a full cluster recovery |
INDEX_REOPENED | Unassigned as a result of opening a closed index |
DANGLING_INDEX_IMPORTED | Unassigned as a result of importing a dangling index |
NEW_INDEX_RESTORED | Unassigned as a result of restoring into a new index |
EXISTING_INDEX_RESTORED | Unassigned as a result of restoring into a closed index |
REPLICA_ADDED | Unassigned as a result of explicit addition of a replica |
ALLOCATION_FAILED | Unassigned as a result of a failed allocation of the shard |
NODE_LEFT | Unassigned as a result of the node hosting it leaving the cluster |
REROUTE_CANCELLED | Unassigned as a result of explicit cancel reroute command |
REINITIALIZED | When a shard moves from started back to initializing, for example, with shadow replicas |
REALLOCATED_REPLICA | A better replica location is identified and causes the existing replica allocation to be cancelled |
References
[1] ElasticSearch Document Cat Shards
[2] ElasticSearch Document Synced Flush
[3] ElasticSearch Document Cluster Reroute
[4] How to fix your elasticsearch cluster stuck in initializing shards mode?