
Using MySQL as a NoSQL - A story for exceeding 750,000 qps on a commodity server


Wednesday, October 20, 2010


Most high-scale web applications use MySQL + memcached. Many of
them also use NoSQL products such as TokyoCabinet/Tyrant. In some cases
people have dropped MySQL and shifted to NoSQL. One of the biggest
reasons for this move is the claim that NoSQL performs better than
MySQL for simple access patterns such as primary key lookups. Most
queries from web applications are simple, so this seems like a
reasonable decision.
Like many other high-scale web sites, we at DeNA(*)
had similar issues for years. But we reached a different conclusion: we
use "only MySQL". We still use memcached for front-end caching
(e.g. preprocessed HTML, count/summary info), but we do not use
memcached for caching rows. We do not use NoSQL, either. Why? Because
we could get much better performance from MySQL than from other NoSQL
products. In our benchmarks, we got 750,000+ qps on a commodity
MySQL/InnoDB 5.1 server from remote web clients. We have also seen
excellent performance in production environments.
You may not believe the numbers, but this is a real story. In this long blog post, I'd like to share our experiences.
(*) For those who do not know: I left Oracle in August 2010. Now I work at
DeNA, one of the largest social game platform providers in Japan.

Is SQL really good for fast PK lookups?

How many times do you need to run PK lookups per second? Our
applications at DeNA need to execute lots of PK lookups, such as
fetching user info by user id or fetching diary info by diary id.
memcached and NoSQL certainly fit such requirements very well. When
you run simple multi-threaded "memcached get" benchmarks, you can
probably execute 400,000+ get operations per second, even when
memcached clients are located on remote servers. When I tested with the
latest libmemcached and memcached, I got 420,000 gets per second on a
2.5GHz x 8 core Nehalem box with a quad-port Broadcom Gigabit Ethernet
card.

How frequently can MySQL execute PK lookups?
Benchmarking is easy. Just run concurrent queries from sysbench,
super-smack, mysqlslap, etc.

[matsunobu@host ~]$ mysqlslap --query="select user_name,..
from test.user where user_id=1" \
--number-of-queries=10000000 --concurrency=30 --host=xxx -uroot

You can easily check how many SELECT statements are executed per second.

[matsunobu@host ~]$ mysqladmin extended-status -i 1 -r -uroot \
| grep -e "Com_select"
...
| Com_select                            | 107069     |
| Com_select                            | 108873     |
| Com_select                            | 108921     |
| Com_select                            | 109511     |
| Com_select                            | 108084     |
| Com_select                            | 108483     |
| Com_select                            | 108115     |
...

100,000+ queries per second is not bad, but it is much slower than
memcached. What is MySQL actually doing? In the vmstat output, both %user
and %system were high.

[matsunobu@host ~]$ vmstat 1
r  b  swpd   free   buff  cache      in     cs us sy id wa st
23  0     0 963004 224216 29937708 58242 163470 59 28 12  0  0
24  0     0 963312 224216 29937708 57725 164855 59 28 13  0  0
19  0     0 963232 224216 29937708 58127 164196 60 28 12  0  0
16  0     0 963260 224216 29937708 58021 165275 60 28 12  0  0
20  0     0 963308 224216 29937708 57865 165041 60 28 12  0  0

The oprofile output told us more about where CPU resources were spent.

samples  %        app name                 symbol name
259130    4.5199  mysqld                   MYSQLparse(void*)
196841    3.4334  mysqld                   my_pthread_fastmutex_lock
106439    1.8566  libc-2.5.so              _int_malloc
94583     1.6498  bnx2                     /bnx2
84550     1.4748  ha_innodb_plugin.so.0.0.0 ut_delay
67945     1.1851  mysqld                   _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array
63435     1.1065  mysqld                   JOIN::optimize()
55825     0.9737  vmlinux                  wakeup_stack_begin
55054     0.9603  mysqld                   MYSQLlex(void*, void*)
50833     0.8867  libpthread-2.5.so        pthread_mutex_trylock
49602     0.8652  ha_innodb_plugin.so.0.0.0 row_search_for_mysql
47518     0.8288  libc-2.5.so              memcpy
46957     0.8190  vmlinux                  .text.elf_core_dump
46499     0.8111  libc-2.5.so              malloc

MYSQLparse() and MYSQLlex() are called during the SQL parsing phase.
make_join_statistics() and JOIN::optimize() are called during the query
optimization phase. These are "SQL" overhead. It's obvious that the
performance drop was caused mostly by the SQL layer, not by the
"InnoDB (storage)" layer. MySQL has to do a lot of things, such as the
following, that memcached/NoSQL do not need to do.

* Parsing SQL statements
* Opening, locking tables
* Making SQL execution plans
* Unlocking, closing tables

MySQL also has to do a lot of concurrency control. For example, fcntl()
is called many times while sending/receiving network packets.
Global mutexes such as LOCK_open and LOCK_thread_count are taken/released
very frequently. That's why my_pthread_fastmutex_lock() was ranked #2
in the oprofile output and %system was not small.

Both the MySQL development team and the external community are aware of
these concurrency issues. Some of them have already been solved in 5.5.
I'm glad to see that lots of fixes have been done so far.

But it is also important that %user reached 60%. Mutex contention
results in a %system increase, not a %user increase. Even if all mutex
issues inside MySQL were fixed, we could not expect 300,000 queries per
second.

You may have heard about the HANDLER statement. Unfortunately, the
HANDLER statement did not help much to improve throughput, because query
parsing and opening/closing tables are still needed.

CPU efficiency is important for in-memory workloads

If only a small part of the active data fits in memory, SQL overhead
becomes relatively negligible. This is simply because disk i/o costs are
much, much higher. We do not need to care so much about SQL costs in
this case.

But on some of our hot MySQL servers, almost all data fit in memory
and they became completely CPU bound. Profiling results were similar to
what I described above: the SQL layer consumed most of the resources. We
needed to execute lots of primary key lookups (e.g. SELECT x FROM t
WHERE id=?) or limited range scans. Even though 70-80% of queries were
simple PK lookups from the same table (the only difference was the
values in WHERE), MySQL had to parse/open/lock/unlock/close every time,
which did not seem efficient to us.

Have you heard about NDBAPI?

Is there any good solution to reduce CPU usage and contention around the
SQL layer in MySQL? If you are using MySQL Cluster, NDBAPI would
be the best solution. When I worked at MySQL/Sun/Oracle as a
consultant, I saw lots of customers who were disappointed with SQL
Node + NDB performance, then became happy after they got N times
better performance by writing NDBAPI clients. You can use both NDBAPI
and SQL in MySQL Cluster. It's recommended to use NDBAPI for frequent
access patterns, and SQL + MySQL + NDB for ad-hoc or infrequent
query patterns.

This was what we wanted. We wanted faster access
APIs, but we also wanted to use SQL for ad-hoc or complex queries. But
DeNA is using InnoDB, like many other web services. Switching to NDB is
not trivial. Embedded InnoDB supports neither SQL nor a network interface, so it's not an option for us.

Developing "HandlerSocket Plugin" - a MySQL plugin that speaks NoSQL network protocols

We thought that the best approach was to implement a NoSQL network
server inside MySQL. That is, writing a network server as a MySQL
plugin (daemon plugin) that listens on specific ports, accepts NoSQL
protocols/APIs, and then accesses InnoDB directly by using MySQL internal storage engine APIs. This approach is similar to NDBAPI, but it can talk with InnoDB.
This concept was initially invented and prototyped by Kazuho Oku at Cybozu Labs last year. He wrote MyCached UDF, which speaks the memcached protocol. My colleague Akira Higuchi implemented another plugin: HandlerSocket. The picture below shows what HandlerSocket can do.

< Fig 1 > What is HandlerSocket?

HandlerSocket is a MySQL daemon plugin that lets applications use
MySQL like a NoSQL store. The main purpose of HandlerSocket is to
talk with storage engines like InnoDB without SQL-related overhead.
To access MySQL tables, HandlerSocket of course needs to open/close
tables. But HandlerSocket does not open/close tables every time. It
keeps tables open for reuse. Opening/closing tables is very costly
and causes serious mutex contention, so reusing them improves
performance a lot. Of course, HandlerSocket closes tables when traffic
becomes light, etc., so that it won't block administrative commands (DDL) forever.
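
The idea of keeping tables open for reuse can be illustrated with a toy model. This is only a sketch of the general technique, not HandlerSocket's actual code: the class name TableCache, its methods, and the open counter are all hypothetical and exist just to show that many lookups can share one "open".

```python
class TableCache:
    """Toy stand-in for keeping table handles open between requests."""

    def __init__(self):
        self.handles = {}      # (db, table) -> handle
        self.open_calls = 0    # how many times the costly "open" ran

    def _open(self, db, table):
        # Placeholder for the expensive open/lock work described in the text.
        self.open_calls += 1
        return (db, table)

    def get(self, db, table):
        # Reuse an existing handle if one is already open.
        key = (db, table)
        if key not in self.handles:
            self.handles[key] = self._open(db, table)
        return self.handles[key]

    def close_all(self):
        # Called when traffic is light so that DDL is not blocked forever.
        self.handles.clear()


cache = TableCache()
for user_id in range(1000):
    handle = cache.get("test", "user")   # 1000 lookups on the same table

print(cache.open_calls)  # prints 1
```

In this model, 1,000 lookups trigger a single open; a per-query open/close would have paid that cost 1,000 times.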

What is different from using MySQL + memcached? By comparing Fig 1 with
Fig 2, I think you'll notice lots of differences. Fig 2 shows typical
memcached and MySQL usage. memcached is aggressively used for caching
database records. This is mainly because a memcached get operation is
much faster than an in-memory / on-disk PK lookup in MySQL. If
HandlerSocket can fetch records as fast as memcached, we don't need
memcached for caching records.

< Fig 2 > Common architecture pattern for MySQL + memcached

Using HandlerSocket

As an example, here is a "user" table. Suppose we need to fetch user information by user_id.

CREATE TABLE user (
user_id INT UNSIGNED PRIMARY KEY,
user_name VARCHAR(50),
user_email VARCHAR(255),
created DATETIME
) ENGINE=InnoDB;

In MySQL, fetching user info can of course be done with a SELECT statement.

mysql> SELECT user_name, user_email, created FROM user WHERE user_id=101;
+---------------+-----------------------+---------------------+
| user_name     | user_email            | created             |
+---------------+-----------------------+---------------------+
| Yukari Takeba | yukari.takeba@dena.jp | 2010-02-03 11:22:33 |
+---------------+-----------------------+---------------------+
1 row in set (0.00 sec)

Let me show how we can do the same thing with HandlerSocket.

* Installing HandlerSocket
Installation steps are described here. Basic steps are as below:

1. Download HandlerSocket here

2. Build HandlerSocket (both client and server code)
./configure --with-mysql-source=... --with-mysql-bindir=... ; make; make install

3. Install HandlerSocket into MySQL
mysql> INSTALL PLUGIN handlersocket soname 'handlersocket.so';

Since HandlerSocket is a MySQL plugin, you can use it like other
plugins such as InnoDB Plugin, Q4M, Spider, etc. That is, you do not
need to modify MySQL source code itself. MySQL version has to be 5.1 or
later. You need both MySQL source code and MySQL binary to build
HandlerSocket.

* Writing HandlerSocket client code
We provide C++ and Perl client libraries. Here is sample Perl code that fetches a row by PK lookup.

#!/usr/bin/perl

use strict;
use warnings;
use Net::HandlerSocket;

# 1. Establish a connection
my $args = { host => 'ip_to_remote_host', port => 9998 };
my $hs = Net::HandlerSocket->new($args);

# 2. Initialize an index so that we can use it in the main logic.
#    MySQL tables will be opened here (if not already opened).
my $res = $hs->open_index(0, 'test', 'user', 'PRIMARY',
    'user_name,user_email,created');
die $hs->get_error() if $res != 0;

# 3. Main logic: fetch rows by id
#    execute_single(index id, cond, cond value, max rows, offset)
$res = $hs->execute_single(0, '=', [ '101' ], 1, 0);
die $hs->get_error() if $res->[0] != 0;
shift(@$res);    # drop the leading error code; column values remain
for (my $row = 0; $row < 1; ++$row) {
    # Each row occupies three consecutive slots (one per requested column).
    my $user_name  = $res->[$row * 3 + 0];
    my $user_email = $res->[$row * 3 + 1];
    my $created    = $res->[$row * 3 + 2];
    print "$user_name\t$user_email\t$created\n";
}

# 4. Close the connection
$hs->close();

The above code fetches the user_name, user_email and created columns from
the user table, looking up by user_id=101. So you'll get the same result as
the above SELECT statement.

[matsunobu@host ~]$ perl sample.pl
Yukari Takeba   yukari.takeba@dena.jp   2010-02-03 11:22:33

For most web applications, it's good practice to keep lightweight
HandlerSocket connections established (persistent connections), so that
most requests only need to run the main logic (#3 in the above code).

The HandlerSocket protocol is a small text-based protocol. As with the
memcached text protocol, you can use telnet to get rows through
HandlerSocket.

[matsunobu@host ~]$ telnet 192.168.1.2 9998
Trying 192.168.1.2...
Connected to xxx.dena.jp (192.168.1.2).
Escape character is '^]'.
P       0       test    user    PRIMARY user_name,user_email,created
0       1
0       =       1       101
0       3       Yukari Takeba   yukari.takeba@dena.jp   2010-02-03 11:22:33

(The lines starting with "P" and "0 = 1 101" are request packets; the line after each one is the response. Fields must be separated by TAB.)
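
To make the wire format in the telnet session concrete, here is a minimal sketch that builds the two request lines and splits a response line. It covers only what the session above shows. The function names are hypothetical, reading the second response field as the column count is my interpretation of the "0 3 ..." line, and escaping of special characters inside values is not handled.

```python
def encode_request(fields):
    """Join fields with TAB and end the line with a newline."""
    return "\t".join(str(f) for f in fields) + "\n"


def open_index_request(index_id, db, table, index, columns):
    # Mirrors: P  0  test  user  PRIMARY  user_name,user_email,created
    return encode_request(["P", index_id, db, table, index, ",".join(columns)])


def find_request(index_id, op, keys):
    # Mirrors: 0  =  1  101   (the key count comes before the key values)
    return encode_request([index_id, op, len(keys), *keys])


def parse_response(line):
    """Split a response into (error_code, rows)."""
    fields = line.rstrip("\n").split("\t")
    error_code = int(fields[0])
    num_columns = int(fields[1])
    values = fields[2:]
    rows = []
    if num_columns > 0:
        # Values arrive flat; regroup them num_columns at a time.
        rows = [values[i:i + num_columns]
                for i in range(0, len(values), num_columns)]
    return error_code, rows


request = find_request(0, "=", ["101"])
code, rows = parse_response(
    "0\t3\tYukari Takeba\tyukari.takeba@dena.jp\t2010-02-03 11:22:33\n")
print(repr(request))  # prints '0\t=\t1\t101\n'
print(code, rows[0][0])  # prints 0 Yukari Takeba
```

A real client would write these lines to a TCP socket on the HandlerSocket port (9998 in the examples above) and read the responses back line by line.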

Benchmarking

Now it's a good time to show our benchmarking results. I used the above
user table, and tested how many PK lookup operations can be done from
multi-threaded remote clients. All user data fit in memory (I tested
1,000,000 rows). I also tested memcached with similar data (I used
libmemcached and memcached_get() to fetch user data). In the MySQL via
SQL tests, I used a traditional SELECT statement: "SELECT user_name,
user_email, created FROM user WHERE user_id=? ". Both the memcached and
HandlerSocket client programs were written in C/C++. All client programs
were located on remote hosts, connecting to MySQL/memcached via TCP/IP.

The highest throughput was as follows:

                             approx qps   server CPU util
MySQL via SQL                105,000      %us 60%  %sy 28%
memcached                    420,000      %us  8%  %sy 88%
MySQL via HandlerSocket      750,000      %us 45%  %sy 53%

MySQL via HandlerSocket got about 7 times the throughput of
traditional MySQL via SQL statements (750,000 vs. 105,000 qps), even
though %us was only 3/4 as high. This shows that the SQL layer in MySQL
is very costly and that skipping it improves performance dramatically.
It is also interesting that MySQL via HandlerSocket delivered about 1.8
times the throughput of memcached (nearly 80% faster), and that
memcached spent so much %system time. Though memcached is an
excellent product, there is still room for optimization.
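
As a quick check, the ratios can be computed directly from the table. The figures below are copied from the results above; the dictionary layout and variable names are just for this sketch.

```python
# Figures copied from the benchmark table above.
results = {
    "MySQL via SQL":           {"qps": 105_000, "us": 60, "sy": 28},
    "memcached":               {"qps": 420_000, "us": 8,  "sy": 88},
    "MySQL via HandlerSocket": {"qps": 750_000, "us": 45, "sy": 53},
}


def qps_ratio(a, b):
    """Throughput of configuration a relative to configuration b."""
    return results[a]["qps"] / results[b]["qps"]


hs_vs_sql = qps_ratio("MySQL via HandlerSocket", "MySQL via SQL")
hs_vs_memcached = qps_ratio("MySQL via HandlerSocket", "memcached")
us_ratio = results["MySQL via HandlerSocket"]["us"] / results["MySQL via SQL"]["us"]

print(f"HandlerSocket vs SQL:       {hs_vs_sql:.2f}x")        # 7.14x
print(f"HandlerSocket vs memcached: {hs_vs_memcached:.2f}x")  # 1.79x
print(f"%us, HandlerSocket vs SQL:  {us_ratio:.2f}")          # 0.75
```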

Below is the oprofile output gathered during the MySQL via HandlerSocket
tests. CPU resources were spent on core operations such as handling
network packets, fetching rows, etc. (bnx2 is a network device driver
program).

samples  %        app name                 symbol name
984785    5.9118  bnx2                     /bnx2
847486    5.0876  ha_innodb_plugin.so.0.0.0 ut_delay
545303    3.2735  ha_innodb_plugin.so.0.0.0 btr_search_guess_on_hash
317570    1.9064  ha_innodb_plugin.so.0.0.0 row_search_for_mysql
298271    1.7906  vmlinux                  tcp_ack
291739    1.7513  libc-2.5.so              vfprintf
264704    1.5891  vmlinux                  .text.super_90_sync
248546    1.4921  vmlinux                  blk_recount_segments
244474    1.4676  libc-2.5.so              _int_malloc
226738    1.3611  ha_innodb_plugin.so.0.0.0 _ZL14build_templateP19row_prebuilt_structP3THDP8st_tablej
206057    1.2370  HandlerSocket.so         dena::hstcpsvr_worker::run_one_ep()
183330    1.1006  ha_innodb_plugin.so.0.0.0 mutex_spin_wait
175738    1.0550  HandlerSocket.so         dena::dbcontext::cmd_find_internal(dena::dbcallback_i&, dena::prep_stmt const&, ha_rkey_function, dena::cmd_exec_args const&)
169967    1.0203  ha_innodb_plugin.so.0.0.0 buf_page_get_known_nowait
165337    0.9925  libc-2.5.so              memcpy
149611    0.8981  ha_innodb_plugin.so.0.0.0 row_sel_store_mysql_rec
148967    0.8943  vmlinux                  generic_make_request

Since HandlerSocket runs inside MySQL and goes to InnoDB, you
can get statistics from regular MySQL commands such as SHOW GLOBAL
STATUS. It's worth seeing Innodb_rows_read exceed 750,000 per second.

$ mysqladmin extended-status -uroot -i 1 -r | grep "Innodb_rows_read"
...
| Innodb_rows_read                      | 750192     |
| Innodb_rows_read                      | 751510     |
| Innodb_rows_read                      | 757558     |
| Innodb_rows_read                      | 747060     |
| Innodb_rows_read                      | 748474     |
| Innodb_rows_read                      | 759344     |
| Innodb_rows_read                      | 753081     |
| Innodb_rows_read                      | 754375     |
...

Detailed specs were as follows.
Model: Dell PowerEdge R710
CPU: Nehalem 8 cores, E5540 @ 2.53GHz
RAM: 32GB (all data fit in the buffer pool)
MySQL Version: 5.1.50 with InnoDB Plugin
memcached/libmemcached version: 1.4.5(memcached), 0.44(libmemcached)
Network: Broadcom NetXtreme II BCM5709 1000Base-T (Onboard, quad-port, using three ports)

* Both memcached and HandlerSocket were network i/o bound. When I
tested with a single port, I got around 260,000 qps on MySQL via
HandlerSocket, 220,000 qps on memcached.

Features and Advantages of HandlerSocket

HandlerSocket has lots of features and advantages, listed below. Some of them are really beneficial for us.

* Supporting lots of query patterns

HandlerSocket supports PK/unique lookups, non-unique index lookups,
range scan, LIMIT, and INSERT/UPDATE/DELETE. Operations that do not use
any index are not supported. multi_get operations (similar to
IN(1,2,3..), fetching multiple rows via single network round-trip) are
also supported.
See documentation for details.

* Can handle lots of concurrent connections

HandlerSocket connections are lightweight. Since HandlerSocket employs
epoll() and a worker-thread/thread-pooling architecture, the number of
MySQL internal threads is limited (it can be controlled by the
handlersocket_threads parameter in my.cnf). So you can establish
thousands or tens of thousands of network connections to HandlerSocket
without losing stability (consuming too much memory, causing massive
mutex contention, etc.: see bug#26590, bug#33948, bug#49169).
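
As a rough illustration of this design, and not HandlerSocket's actual code, Python's standard selectors module (which uses epoll on Linux) lets a single thread wait on many connections at once. The function name serve_once, the socketpair-based fake clients, and the ping/pong messages are all made up for this sketch.

```python
import selectors
import socket


def serve_once(num_clients):
    """Answer one request from each of num_clients connections on one thread."""
    sel = selectors.DefaultSelector()
    client_ends = []
    for i in range(num_clients):
        # socketpair() stands in for an accepted TCP connection.
        server_end, client_end = socket.socketpair()
        server_end.setblocking(False)
        sel.register(server_end, selectors.EVENT_READ)
        client_end.sendall(b"ping %d\n" % i)
        client_ends.append(client_end)

    handled = 0
    while handled < num_clients:
        # One call reports every connection that has data ready.
        for key, _events in sel.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(4096)
            conn.sendall(b"pong " + data[5:])   # echo the client's number
            sel.unregister(conn)
            conn.close()
            handled += 1

    replies = [c.recv(4096) for c in client_ends]
    for c in client_ends:
        c.close()
    sel.close()
    return replies


replies = serve_once(50)
print(len(replies), replies[7])  # prints 50 b'pong 7\n'
```

The point is only that the number of threads does not grow with the number of connections; here 50 connections are served by one thread.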

* Extremely high performance

HandlerSocket can deliver performance that is competitive with
other NoSQL products, as already described. Actually, I have not
seen any NoSQL product that can execute 750,000+ queries per second on a
commodity server from remote clients via TCP/IP.
HandlerSocket not only eliminates SQL-related function calls, but also optimizes around network/concurrency issues.

** Smaller network packets

HandlerSocket protocol is much simpler and smaller than normal MySQL
protocols. So overall network transfer size can be much smaller.

** Running limited number of MySQL internal threads
See above.

** Grouping client requests

When lots of concurrent requests come to HandlerSocket, each worker
thread gathers as many requests as possible, then executes the gathered
requests at one time and sends back the results. This can improve
performance greatly, at the cost of slightly longer response times. For
example, you can gain the following benefits. I'll explain them in depth
in later posts, if anybody is interested.
*** Can reduce the number of fsync() calls
*** Can reduce replication delay
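
The effect of grouping on the number of sync calls can be shown with a toy count. This is a simplified model, not HandlerSocket's implementation: each inner list stands for the requests a worker finds waiting when it wakes up, and the function name count_syncs and the sample data are hypothetical.

```python
def count_syncs(bursts, grouped):
    """Count sync (fsync-like) calls needed to process the given bursts.

    bursts  -- list of lists; each inner list is what a worker finds queued
    grouped -- True: one sync per gathered batch; False: one sync per request
    """
    syncs = 0
    for burst in bursts:
        if not burst:
            continue            # nothing queued, nothing to sync
        if grouped:
            syncs += 1          # execute the whole batch, then sync once
        else:
            syncs += len(burst) # sync after every single request
    return syncs


bursts = [["w1", "w2", "w3"], ["w4"], ["w5", "w6"]]
print(count_syncs(bursts, grouped=False))  # prints 6
print(count_syncs(bursts, grouped=True))   # prints 3
```

The busier the server, the larger each gathered batch tends to be, so the saving grows with load. The trade-off is that a request may wait for the rest of its batch.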

* No duplicate cache

When you use memcached to cache MySQL/InnoDB records, records are
cached in both memcached and the InnoDB buffer pool. This duplication is
inefficient (memory is still expensive!). Since the HandlerSocket plugin
accesses the InnoDB storage engine, records are cached inside the InnoDB
buffer pool, where they can be reused by other SQL statements.

* No data inconsistency
Since data is stored at one place (inside InnoDB), data consistency check between memcached and MySQL is not needed.

* Crash-safe

The backend storage is InnoDB. It's transactional and crash safe. Even
if you set innodb-flush-log-at-trx-commit to a value other than 1, you
lose less than one second of data on a server crash.

* SQL can be used from mysql clients

In many cases people still want to use SQL (e.g. to generate summary
reports). This is why we can't use Embedded InnoDB. Most NoSQL products
don't support a SQL interface, either.
HandlerSocket is just a plugin for MySQL. You can send SQL statements
from MySQL clients as usual, and use the HandlerSocket protocol when you
need high throughput.

* All operational benefits from MySQL

Again, HandlerSocket runs inside MySQL, so all MySQL operations such as
SQL, online backups, replication, monitoring by Nagios /
EnterpriseMonitor, etc. are supported. HandlerSocket activity can be
monitored by regular MySQL commands such as SHOW GLOBAL STATUS, SHOW
ENGINE INNODB STATUS, SHOW PROCESSLIST, etc.

* No need to modify/rebuild MySQL
Since it's a plugin, it runs on both MySQL Community and MySQL Enterprise Servers.

* Independent from storage engines

HandlerSocket is designed so that it can talk with any storage engine,
though we have tested and used it only with the 5.1 and 5.5 InnoDB Plugin.

Notes and Limitations

* Need to learn HandlerSocket APIs
You need to write a program to talk with HandlerSocket, though it's pretty easy to use. We provide C++ API and Perl bindings.

* No security

Like other NoSQL databases, HandlerSocket does not provide any security
features. HandlerSocket's worker threads run with system user
privileges, so applications can access all tables through
the HandlerSocket protocol. Of course, you can use firewalls to filter
packets, as with other NoSQL products.

* No benefit for HDD bound workloads

For HDD i/o bound workloads, a database instance can not execute
thousands of queries per second, which normally results in only 1-10%
CPU usage. In such cases, the SQL execution layer does not become a
bottleneck, so there is no benefit in using HandlerSocket. We use
HandlerSocket on servers where almost all data fit in memory.

DeNA is using HandlerSocket in production

We already use the HandlerSocket plugin in our production environments.
The results are great. We were able to remove many memcached and MySQL
slave servers. Overall network traffic has been reduced, too. We
haven't seen any performance problems (slow response time, stalls, etc.)
so far. We've been very satisfied with the results.

I think MySQL has been underrated by the NoSQL/database communities.
MySQL actually has a much longer history than most other products, and
lots of unique and great enhancements have been made so far by my
excellent ex-colleagues. I know from NDBAPI that MySQL has very strong
potential as a NoSQL. The storage engine API and daemon plugin interface
are completely unique, and they made it possible for Akira and DeNA to
develop HandlerSocket. As an ex-employee of MySQL and a long-time fan of
MySQL, I'd like to see MySQL become better and more popular, not only as
an RDBMS, but also as Yet Another NoSQL lineup.

Since the HandlerSocket plugin is Open Source, feel free to try it. We'd appreciate any feedback.

Posted by
Yoshinori Matsunobu
