
Troubleshooting ZooKeeper Operating Environment

This page details specific problems people have seen, solutions (if solved) to those issues, and the types of steps taken to troubleshoot the issue. Feel free to update with your experiences. The HBase troubleshooting page also has insight for identifying/resolving ZK issues.

Monitoring

It is important to monitor the ZK environment (hardware, network, processes, etc.) in order to more easily troubleshoot problems. Otherwise you miss out on important information for determining the cause of the problem. What type of monitoring are you doing on your cluster? You can monitor at the host level -- that will give you some insight on where to look: cpu, memory, disk, network, etc. You can also monitor at the process level -- the ZooKeeper server JMX interface will give you information about latencies and such (you can also use the four letter words for that if you want to hack up some scripts instead of using JMX). JMX will also give you insight into the JVM workings -- so for example you could confirm or rule out GC pauses causing the JVM Java threads to hang for long periods of time (see below). Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be used through jconsole, or you can access the stats through the four letter words; the log4j log also contains much important/useful information.
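For example, a quick way to poll a server without setting up JMX is the four letter words over the client port. A minimal sketch, assuming nc (netcat) is available and a server is listening on the default client port 2181 on localhost (host and port are placeholders):

echo ruok | nc localhost 2181
echo stat | nc localhost 2181

ruok should return "imok" if the server is running in a non-error state, and stat reports min/avg/max request latencies, connection counts, and the node's mode (leader, follower, or standalone).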

Troubleshooting Checklist

The following can be a useful checklist when you are having issues with your ZK cluster, in particular if you are seeing large numbers of timeouts, session expirations, poor performance, or high operation latencies. Use the following on all servers and potentially on clients as well (see the command sketch after this checklist):

hdparm with the -t and -T options to test your disk IO

time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050
time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest

See the ECC memory section below for more on this

ethtool to check the configuration of your network

ifconfig also to check network and examine error counts
ZK uses TCP for network connectivity; errors on the NICs can cause poor performance

scp/ftp/etc. can be used to verify connectivity; try copying large files between nodes

these smoke and latency tests can be useful to verify a cluster

Compare your results to some baselines

See the Latency Overview page for some latency baselines. You can also compare the performance of the cpu/disk/mem/etc. that you have available to what is used in this test.
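As a rough sketch of the network items on the checklist above (the interface name eth0 and the host name otherhost are assumptions; substitute your own):

ethtool eth0
ifconfig eth0
scp /tmp/memtest otherhost:/tmp/

ethtool shows the negotiated speed and duplex settings, ifconfig shows error/dropped packet counts for the interface, and copying a large file (such as the /tmp/memtest file created above) between nodes gives a crude check of sustained network throughput.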

A word or two about heartbeats

Keep in mind that the session timeout period is used by both the client and the server. If the ZK leader doesn't hear from the client within the timeout (say it's 5 sec) it will expire the session. The client sends a ping after 1/3 of the timeout period has elapsed. It expects to hear a response before another 1/3 of the timeout elapses, after which it will attempt to re-sync to another server in the cluster. In the 5 sec timeout case you are allowing about 1.7 seconds for the request to go to the server, the server to respond back to the client, and the client to process the response. Check the latencies in ZK's JMX in order to get insight into this; i.e. if the server latency is high, say because of IO issues, JVM swapping, VM latency, etc., that will cause the client sessions to time out.
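For example, with a 5 second session timeout the client pings after roughly 5/3 ≈ 1.7 seconds of inactivity and expects a response within roughly another 1.7 seconds. You can compare that budget against the latencies the server itself reports, e.g. via the stat four letter word (host and port are placeholders):

echo stat | nc zkserver 2181 | grep -i latency

If the max latency reported there approaches your per-leg budget, timeouts and session expirations are likely.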

Frequent client disconnects & session expirations

ZooKeeper is a canary in a coal mine of sorts. Because of the heart-beating performed by the clients and servers, ZooKeeper-based applications are very sensitive to things like network and system latencies. We often see client disconnects and session expirations associated with these types of problems. Take a look at this section to start.

Client disconnects due to client side swapping

This link specifically discusses the negative impact of swapping in the context of the server. However, swapping can be an issue for clients as well. It will delay, or potentially even stop for a significant period, the heartbeats from client to server, resulting in session expirations.

As told by a user: "This issue is clearly linked to heavy utilization or swapping on the clients. I find that if I keep the clients from swapping, this error materializes relatively infrequently, and when it does materialize it is linked to a sudden increase in load. For example, the concurrent start of 100 clients on 14 machines will sometimes trigger this issue. <...> All in all, it is my sense that Java processes must avoid swapping if they want to have not just timely but also reliable behavior."

As told by an HBase user: "After looking at the ganglia history, it's clear that the nodes in question were starved of memory, swapping like crazy. The expired scanner lease, the region shutting down, and as you noted, the ZooKeeper session expiry, were not a causal chain, but all the result of the machine grinding to a halt from swapping. The MapReduce tasks were allocated too much memory, and an apparent memory leak in the job we were running was causing the tasks to eat into the RegionServer's share of the machine's memory. I've reduced the memory allocated to tasks in hadoop's "mapred.child.java.opts" to ensure that the HADOOP_HEAPSIZE + total maximum memory allocated to tasks + the HBASE_HEAPSIZE is not greater than the memory available on the machine."
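A quick way to check whether a client host is swapping, using standard Linux tools (no ZooKeeper-specific assumptions):

free -m
vmstat 5 5

Non-trivial values in the si/so (swap-in/swap-out) columns of vmstat while the client is running are a strong hint that heartbeats are being delayed by swapping.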

Hardware misconfiguration - NIC

In one case there was a cluster of 5k ZK clients attaching to a ZK cluster; ~20% of the clients had mis-configured NICs. This was causing high TCP packet loss (and therefore high network latency), which caused disconnects (timeout exceeded), but only under fairly high network load (which made it hard to track down!). In the end, special processes were set up to continuously monitor client-server network latency. Any spikes in the observed latencies were then correlated with the ZK logs (timeouts). In the end, all of the NICs on these hosts were reconfigured.

Hardware - network switch

Another issue with the same user as the NIC issue -- a cluster of 5k ZK clients attaching to a ZK cluster. It turned out that the network switches had bad firmware which caused high packet latencies under heavy load. At certain times of day we would see high numbers of ZK client disconnects. It turned out that these were periods of heavy network activity, exacerbated by the ZK client session expirations (they caused even more network traffic). In the end the operations team spent a number of days testing/loading the network infrastructure until they were able to pin down the issue as being switch related. The switch firmware was upgraded and the issue was eventually resolved.

Hardware - ifconfig is your friend

In a recent issue we saw extremely poor performance from a 3 server ZK ensemble (cluster). Average and max latencies on operations, as reported by the "stat" command on the servers, were very high (multiple seconds). It turned out that one of the servers had a NIC that was dropping large numbers of packets due to framing problems. Switching out that server with another (with no NIC issue) resolved the issue. The weird thing was that ssh/scp/ping etc. reported no problems. Moral of the story: use ifconfig to verify the network interface if you are seeing issues on the cluster.
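A minimal check, assuming the interface is eth0 (substitute your actual interface):

ifconfig eth0 | grep -E 'errors|dropped|frame'

Non-zero and growing error, dropped, or frame counts on any server or client NIC are worth investigating before blaming ZooKeeper itself.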

Hardware - hdparm is your friend

Poor disk IO will also result in increased operation latencies. Use hdparm with the -t and -T options to verify the performance of persistent storage.
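For example (the device name /dev/sda is an assumption; run as root against the device holding the ZK data and transaction log directories):

hdparm -t -T /dev/sda

-T measures cached reads and -t measures buffered device reads; compare the numbers across all servers in the ensemble and against a known-good machine.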

Hardware - ECC memory problems can be hard to track down

I've seen a particularly nasty problem where bad ECC memory was causing a single server to run an order of magnitude slower than the rest of the servers in the cluster. This caused some particularly nasty/random problems that were nearly impossible to track down (since the machine kept running, just slowly). Ops replaced the ECC memory and all was fine. See the troubleshooting checklist at the top of this page -- the dd/md5sum commands listed there can help to sniff this out (hint: compare the results on all of your servers and verify they are at least "close").

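A sketch of running the memory/checksum test from the checklist across an ensemble and comparing timings (the host names zk1/zk2/zk3 are placeholders; assumes ssh access and a bash login shell on the remote hosts):

for h in zk1 zk2 zk3; do echo "== $h =="; ssh $h 'time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050; time md5sum /tmp/memtest'; done

A server whose dd or md5sum timings are far from the others deserves a closer hardware look.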
Virtual environments

We've seen situations where users run the entire ZK cluster on a set of VMware VMs, all on the same host system. Latency in this configuration was >>> 10 sec in some cases due to resource issues (in particular IO -- see the link provided above; dedicated log devices are critical to low latency operation of the ZK cluster). Obviously no one should be running this configuration in production -- in particular there will be no reliability in cases where the host storage fails!
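If you must run in a virtualized environment, at minimum put the transaction log on its own dedicated device. A hypothetical zoo.cfg excerpt (the paths are placeholders):

dataDir=/var/lib/zookeeper/data
dataLogDir=/zk-txn-log

dataLogDir directs the write-ahead transaction log to a separate device so that its fsyncs are not queued behind snapshot writes and other IO.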

Virtual environments - "Cloud Computing"

In one scenario involving EC2, ZK was seeing frequent client disconnects. The user had configured a timeout of 5 seconds, which is too low, probably much too low. Why? You are running in a virtualized environment on non-dedicated hardware outside your control/inspection. There is typically no way to tell (unless you are running on the 8 core ec2 systems) if the ec2 host you are running on is over/under subscribed (other VMs). There is no way to control disk latency either. You could be seeing large latencies due to resource contention on the ec2 host alone. In addition to that, I've heard that network latencies in ec2 are high relative to what you would see if you were running in your own dedicated environment. It's hard to tell the latency between the servers, and from client to server, within the ec2 environment you are seeing without measuring it.
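A crude way to get actual numbers rather than guessing (the host name zk1.internal is a placeholder):

ping -c 100 zk1.internal
scp /tmp/memtest zk1.internal:/tmp/

ping gives round-trip latency and packet loss between instances, and copying a large file gives a rough throughput figure. Run these from the client hosts and between ensemble members, and compare the results against the budget implied by your session timeout.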

GC pressure

The Java GC can cause starvation of the Java threads in the VM. This manifests itself as client disconnects and session expirations due to starvation of the heartbeat thread. When the GC runs it locks out all Java threads from running. You can get an idea of what's happening with respect to GC in your client/server VMs by using the following options when starting the JVM:

-Xloggc:gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

gchisto is a useful tool for analyzing GC logs: https://gchisto.dev.java.net/

Additionally you can use 'jstat' on a running JVM to gain more insight into realtime GC activity, see: http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html

This issue can be resolved in a few ways. First, look at using one of the alternative GCs, in particular a low latency GC, e.g. with the following JVM options: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC. Second, you might try the solution used by HBase: spawn a non-Java (JNI) thread to manage your ephemeral znodes. This is a pretty advanced option however; try the alternative GC first and see if that helps.
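For example, to watch GC activity on a running server once per second (assumes a single server process on the host, started by the standard scripts so that its main class is org.apache.zookeeper.server.quorum.QuorumPeerMain; pgrep is used to find the pid):

jstat -gcutil $(pgrep -f QuorumPeerMain) 1000

-gcutil prints heap occupancy percentages along with young/full GC counts and accumulated GC time; long or frequent full GCs tend to line up with the session expirations seen in the ZK logs.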

Performance tuning

Some things to keep in mind while tuning ZooKeeper performance:

Verify that logging isn't at DEBUG. Check your log4j.properties file and change the line log4j.rootLogger=DEBUG, ROLLINGFILE to log4j.rootLogger=WARN, ROLLINGFILE. Logging to disk on every action can greatly affect performance.

Verify that you are using fast local disk for the journal.

Test with http://github.com/phunt/zk-smoketest. This should identify real performance and latency issues. It is built against 32-bit Python.