
HBase bulk writes: tracking down and fixing a continuously rising ipc queue size

2015-06-24 14:43
We needed to import data from a relational database into an HBase table with the VDataHub tool. Even though VDataHub applies flow control, other jobs were running on the cluster at the same time, so disk IO across the whole cluster was nearly saturated and many put operations could not get written. On the HBase server side there were lots of exception logs like the following:

2015-06-23 13:45:18,844 WARN  [RpcServer.handler=71,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":18758,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":":54083","starttimems":1435038300085,"queuetimems":0,"class":"HRegionServer","responsesize":503,"method":"Multi"}
2015-06-23 13:45:18,844 WARN  [RpcServer.handler=71,port=60020] ipc.RpcServer: RpcServer.respondercallId: 9805 service: ClientService methodName: Multi size: 119.3 K connection: :54083: output error
2015-06-23 13:45:18,845 WARN  [RpcServer.handler=71,port=60020] ipc.RpcServer: RpcServer.handler=71,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
2015-06-23 13:45:19,080 WARN  [RpcServer.handler=77,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":18725,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"55791","starttimems":1435038300354,"queuetimems":0,"class":"HRegionServer","responsesize":675,"method":"Multi"}
2015-06-23 13:45:19,080 WARN  [RpcServer.handler=77,port=60020] ipc.RpcServer: RpcServer.respondercallId: 9354 service: ClientService methodName: Multi size: 160.1 K connection: :55791: output error
2015-06-23 13:45:19,081 WARN  [RpcServer.handler=77,port=60020] ipc.RpcServer: RpcServer.handler=77,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null


The import job (really just Sqoop's MapReduce job) then simply hung and never finished. The puzzling part was that once the other jobs had finished and the cluster was running nothing but our import job, the HBase web UI showed hardly any requests, when we should have been seeing tens of thousands of requests per second, and the cluster was sitting idle.



We also noticed on the metis monitoring dashboards we already run in production that the ipc queue size did not look right. It should go up and down, but what we saw was an almost uninterrupted climb with only the occasional small dip; overall it kept rising, whereas normally it should drop back to 0 when there are no requests.
With the cluster idle yet refusing writes, we had no choice but to restart it and re-run the import job. Writes then went through normally and very fast. But after two or three days the monitoring chart showed the queue size climbing again from the post-restart 0, and writes gradually started failing once more. This was maddening. Surely we are not going to restart the cluster on a schedule from now on; that would be a joke.

So we kept digging. Since writes were failing even though no resource was anywhere near its limit, HBase itself had to be doing something wrong. Following the abnormal queue size, we opened the HBase source and did indeed find a problem: the only place this metric is ever decremented is in the CallRunner class, and if an IO or other exception occurs, that decrement never executes. So once the cluster hits exceptions under heavy load, the queue size keeps climbing and never comes back down. And once it grows past a certain threshold, HBase applies a simple flow control based on the queue size, so later on, even though the real queue size is nowhere near that large, requests are still throttled based on the stale, inflated value.
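To make the failure mode concrete, the throttle essentially behaves like the sketch below. This is a simplified illustration of the mechanism described above, not the actual HBase code; the class and member names are invented for the sketch, with maxQueueSize playing the role of HBase's maximum call-queue size setting.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of size-based RPC flow control; all names here are
// invented for the sketch and are not the real HBase identifiers.
public class CallQueueThrottleSketch {
    // Sum of the payload sizes of calls believed to be queued or in flight.
    private final AtomicLong callQueueSize = new AtomicLong(0);
    // Upper bound on that sum, analogous to HBase's max call-queue size setting.
    private final long maxQueueSize;

    public CallQueueThrottleSketch(long maxQueueSize) {
        this.maxQueueSize = maxQueueSize;
    }

    // Called when a request is read off the wire: reject if the counter says full.
    public void admit(long callSize) throws IOException {
        if (callQueueSize.get() + callSize > maxQueueSize) {
            throw new IOException("Call queue is full, rejecting call of size " + callSize);
        }
        callQueueSize.addAndGet(callSize);
    }

    // Must run exactly once for every admitted call, no matter how it finishes.
    public void release(long callSize) {
        callQueueSize.addAndGet(-callSize);
    }
}

If release() is skipped on any path, say because the handler bailed out after the client disconnected, the counter drifts upward forever, and eventually admit() rejects every call even though almost nothing is actually queued. That is exactly what the chart and the "no requests while idle" symptom were showing.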



CallRunner code snippet
public void run() {
    try {
        // This early return skips the addCallSize decrement further below,
        // leaving callQueueSize permanently inflated for the skipped call.
        if (!call.connection.channel.isOpen()) {
            if (RpcServer.LOG.isDebugEnabled()) {
                RpcServer.LOG.debug(Thread.currentThread().getName() + ": skipped " + call);
            }
            return;
        }
        this.status.setStatus("Setting up call");
        this.status.setConnection(call.connection.getHostAddress(), call.connection.getRemotePort());
        if (RpcServer.LOG.isDebugEnabled()) {
            UserGroupInformation remoteUser = call.connection.user;
            RpcServer.LOG.debug(call.toShortString() + " executing as " +
                ((remoteUser == null) ? "NULL principal" : remoteUser.getUserName()));
        }
        Throwable errorThrowable = null;
        String error = null;
        Pair<Message, CellScanner> resultPair = null;
        RpcServer.CurCall.set(call);
        TraceScope traceScope = null;
        try {
            if (!this.rpcServer.isStarted()) {
                throw new ServerNotRunningYetException("Server is not running yet");
            }
            if (call.tinfo != null) {
                traceScope = Trace.startSpan(call.toTraceString(), call.tinfo);
            }
            RequestContext.set(userProvider.create(call.connection.user), RpcServer.getRemoteIp(),
                call.connection.service);
            // make the call
            resultPair = this.rpcServer.call(call.service, call.md, call.param, call.cellScanner,
                call.timestamp, this.status);
        } catch (Throwable e) {
            RpcServer.LOG.debug(Thread.currentThread().getName() + ": " + call.toShortString(), e);
            errorThrowable = e;
            error = StringUtils.stringifyException(e);
        } finally {
            if (traceScope != null) {
                traceScope.close();
            }
            // Must always clear the request context to avoid leaking
            // credentials between requests.
            RequestContext.clear();
        }
        RpcServer.CurCall.set(null);
        // The only place the queue size is decremented; it is never reached
        // when run() returns early above or an exception escapes before this line.
        this.rpcServer.addCallSize(call.getSize() * -1);
        // Set the response for undelayed calls and delayed calls with
        // undelayed responses.
        if (!call.isDelayed() || !call.isReturnValueDelayed()) {
            Message param = resultPair != null ? resultPair.getFirst() : null;
            CellScanner cells = resultPair != null ? resultPair.getSecond() : null;
            call.setResponse(param, cells, errorThrowable, error);
        }
        call.sendResponseIfReady();
        this.status.markComplete("Sent response");
        this.status.pause("Waiting for a call");
    } catch (OutOfMemoryError e) {
        if (this.rpcServer.getErrorHandler() != null) {
            if (this.rpcServer.getErrorHandler().checkOOME(e)) {
                RpcServer.LOG.info(Thread.currentThread().getName() + ": exiting on OutOfMemoryError");
                return;
            }
        } else {
            // rethrow if no handler
            throw e;
        }
    } catch (ClosedChannelException cce) {
        RpcServer.LOG.warn(Thread.currentThread().getName() + ": caught a ClosedChannelException, " +
            "this means that the server was processing a " +
            "request but the client went away. The error message was: " +
            cce.getMessage());
    } catch (Exception e) {
        RpcServer.LOG.warn(Thread.currentThread().getName()
            + ": caught: " + StringUtils.stringifyException(e));
    }
}


Given that, the fix is obvious: perform the decrement at the end no matter whether an exception occurred, by putting it straight into a finally block. A quick search showed that others had hit the same problem, and the community had already fixed it in August of last year; the fix landed in 0.98.6. Our version, however, is CDH 0.98.1, so we had to patch and rebuild from source ourselves. Painful.
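As a rough illustration of the shape of the change, run() ends up looking like the sketch below. This is based on the description above, not the exact HBASE-11705 patch; see the links below for the real diff.

public void run() {
    try {
        // ... the original body: the channel check, the actual rpcServer.call(...),
        // building the response and sending it, exactly as shown above, except that
        // the mid-method addCallSize(...) line is removed from here ...
    } catch (ClosedChannelException cce) {
        // unchanged handling: the client went away while we were still working
        RpcServer.LOG.warn(Thread.currentThread().getName()
            + ": caught a ClosedChannelException: " + cce.getMessage());
    } catch (Exception e) {
        RpcServer.LOG.warn(Thread.currentThread().getName()
            + ": caught: " + StringUtils.stringifyException(e));
    } finally {
        // Moved here from the middle of the method: every path, including the
        // early return and every exception above, now gives the call size back.
        this.rpcServer.addCallSize(call.getSize() * -1);
    }
}

With the decrement in the finally block, a flood of ClosedChannelExceptions under heavy IO load can no longer leave callQueueSize inflated, so the size-based throttle sees the true backlog again.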
http://qnalist.com/questions/5065474/ipc-queue-size
https://issues.apache.org/jira/browse/HBASE-11705
https://issues.apache.org/jira/secure/attachment/12660609/HBASE-11705.v2.patch
For the details of patching and rebuilding from source, see another post on this blog; I won't repeat them here.

After the fix, the monitoring chart shows the queue size back to normal:
