如何快速更新数据库中的百万条数据
2017-01-22 10:22
435 查看
如何快速更新数据库中的百万条数据
环境:ubuntu 14.04,MySQL 5.6,Python 2.7。本文使用python的MySQLdb这个库,MySQL Python tutorial这个教程很赞。
MySQLDBdb处理INSERT、UPDATE、DELETE时需要显式得commit()才能使更新有效,这也意味着commit之前的更新语句是放在一个事务中的。
For databases that support transactions, the Python interface silently starts a transaction when the cursor is created. The commit() method commits the updates made using that cursor, and the rollback() method discards them. Each method starts a new transaction.
对比如下:
更新方式 | 总时间(ms) | rows | 平均(ms) |
---|---|---|---|
1 | 57831 | 1000 | 57.8 |
2 | 89695 | 10 | 896 |
3 | 92941 | 40×10000 | 0.23 |
4 | 47956 | 40×10000 | 0.119 |
5 | 28407 | 40×10000 | 0.071 |
6 | 3272 | 48×10000 | 0.008 |
方式2是单线程,where没有使用索引,所有更新对应一个commit。
方式3是单线程,where使用了索引,所有更新对应一个commit。
方式4使用了2个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式5使用了4个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式6是在一个update语句里面更新所有的数据。
建立表
mysql> CREATE TABLE `test`.`number` ( `id` INT NOT NULL COMMENT '', `num` INT NULL COMMENT '', PRIMARY KEY (`id`) COMMENT ''); mysql> show create table number \G *************************** 1. row *************************** Table: number Create Table: CREATE TABLE `number` ( `id` int(11) NOT NULL, `num` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)
使用mysqldb插入数据(100*000条)
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in xrange(100*10000): # 0 到 999999 print x sql = "INSERT INTO number VALUES(%s,%s)" % (x, x) cur.execute(sql) conn.commit() end_time = time.time() print '用时:', end_time - start_time
运行结果:
... 999997 999998 999999 用时:219.332565069
挺快。
看下数据:
mysql> select count(*) from number; +----------+ | count(*) | +----------+ | 1000000 | +----------+ 1 row in set (0.60 sec) mysql> select * from number limit 3; +----+------+ | id | num | +----+------+ | 0 | 0 | | 1 | 1 | | 2 | 2 | +----+------+ 3 rows in set (0.00 sec)
更新方式1:单线程,使用索引,多个commmit,更新1000条
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in xrange(1000): print x sql = "UPDATE number SET num=%s where id=%s" % (x+1, x) cur.execute(sql) conn.commit() cur.close() conn.close() end_time = time.time() print '用时:', end_time - start_time
输出:
... 997 998 999 用时:57.8318688869
1000条用时约60秒。
为什么慢?
先explain一下:
mysql> explain update number set num=3 where id=2 \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: number type: range possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 1 Extra: Using where 1 row in set (0.06 sec)
sql语句没什么问题。
再看一下磁盘IO:
letian $ iostat -d -k 1 100 Linux 3.13.0-37-generic (myhost) 06/22/2015 _x86_64_ (4 CPU) ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 57.00 4.00 468.00 4 468 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 71.00 28.00 404.00 28 404 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 71.00 28.00 608.00 28 608 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 61.00 4.00 492.00 4 492 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 73.00 0.00 536.00 0 536 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 69.00 4.00 516.00 4 516 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 61.00 4.00 472.00 4 472 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 63.00 4.00 452.00 4 452 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 16.00 0.00 1804.00 0 1804 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 3.00 0.00 16.00 0 16 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 0.00 0.00 0.00 0 0 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 2.00 0.00 12.00 0 12
tps代表每秒IO传输的次数,与下面的
更新方式4对比一下,可以看出当前的更新方式IO较多。(为什么呢?移步Why does TRANSACTION / COMMIT improve performance so much with PHP/MySQL (InnoDB)?)
更新方式2,单线程,不使用索引,一个commit,更新100条
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in xrange(100): print x sql = "UPDATE number SET num=%s where num=%s" % (x*2, x) cur.execute(sql) conn.commit() cur.close() conn.close() end_time = time.time() print '用时:', end_time - start_time
输出:
... 97 98 99 用时: 89.695207119
为什么这么慢:
mysql> explain update number set num=4 where num=2 \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: number type: index possible_keys: NULL key: PRIMARY key_len: 4 ref: NULL rows: 979800 Extra: Using where 1 row in set (0.05 sec)
可以看到,每次update几乎要扫全表(rows: 979800)。
更新方式3:单线程,使用索引,一个commit,更新40×10000条
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in xrange(40*10000): # 0 到 999999 print x sql = "UPDATE number SET num=%s where id=%s" % (x+1, x) cur.execute(sql) conn.commit() end_time = time.time() print '用时:', end_time - start_time
输出:
... 399997 399998 399999 用时:92.9413559437
查看磁盘IO:
letian $ iostat -d -k 1 100 Linux 3.13.0-37-generic (myhost) 06/22/2015 _x86_64_ (4 CPU) Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 20.88 288.50 283.18 2458129 2412772 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 18.00 116.00 328.00 116 328 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 17.00 8.00 548.00 8 548 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 11.00 12.00 396.00 12 396 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 488.00 4 488 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 460.00 4 460 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 448.00 4 448 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 464.00 4 464 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 456.00 4 456 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 9.00 4.00 452.00 4 452 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 9.00 0.00 140.00 0 140 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 15.00 4.00 2448.00 4 2448 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 43.00 144.00 2544.00 144 2544 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 9.00 0.00 516.00 0 516 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 512.00 4 512 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 12.00 4.00 540.00 4 540 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 14.00 4.00 2288.00 4 2288 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 10.00 4.00 528.00 4 528 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 6.00 0.00 184.00 0 184 ............ Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 14.00 0.00 3868.00 0 3868 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 15.00 0.00 2604.00 0 2604 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 15.00 0.00 1516.00 0 1516 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 3.00 0.00 16.00 0 16 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 12.00 0.00 68.00 0 68 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 4.00 0.00 28.00 0 28 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 0.00 0.00 0.00 0 0 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 0.00 0.00 0.00 0 0 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 0.00 0.00 0.00 0 0
更新方式4:2个进程,一个进程里一个commit,共更新40*10000条
两个进程:# coding: UTF-8 __author__ = 'letian' import MySQLdb import time import multiprocessing host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() def update(int_iter): conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in int_iter: print x sql = "UPDATE number SET num=%s where id=%s" % (x+5, x) cur.execute(sql) conn.commit() cur.close() conn.close() pool = multiprocessing.Pool(processes = 2) pool.apply_async(update, (xrange(20*10000), )) pool.apply_async(update, (xrange(20*10000, 40*10000), )) pool.close() pool.join() end_time = time.time() print '用时:', end_time - start_time
输出:
... 199998 199999 用时: 47.9561839104
看下数据:
mysql> select * from number where id=1; +----+------+ | id | num | +----+------+ | 1 | 6 | +----+------+ 1 row in set (0.00 sec) mysql> select * from number where id=399999; +--------+--------+ | id | num | +--------+--------+ | 399999 | 400004 | +--------+--------+ 1 row in set (0.00 sec)
更新方式5:4个进程,一个进程里一个commit,共更新40*10000条
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time import multiprocessing host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() def update(int_iter): conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() for x in int_iter: print x sql = "UPDATE number SET num=%s where id=%s" % (x+10, x) cur.execute(sql) conn.commit() cur.close() conn.close() pool = multiprocessing.Pool(processes = 4) pool.apply_async(update, (xrange(10*10000), )) pool.apply_async(update, (xrange(10*10000, 20*10000), )) pool.apply_async(update, (xrange(20*10000, 30*10000), )) pool.apply_async(update, (xrange(30*10000, 40*10000), )) pool.close() pool.join() end_time = time.time() print '用时:', end_time - start_time
运行结果:
299996 299997 299998 299999 用时: 28.4070010185
查看更新效果:
mysql> select * from number where id=1; +----+------+ | id | num | +----+------+ | 1 | 11 | +----+------+ 1 row in set (0.00 sec) mysql> select * from number where id=399999; +--------+--------+ | id | num | +--------+--------+ | 399999 | 400009 | +--------+--------+ 1 row in set (0.00 sec)
更新方式6:一次更新多行,更新40*10000条
# coding: UTF-8 __author__ = 'letian' import MySQLdb import time host='127.0.0.1' user='root' passwd='123456' db='test' port = 3306 start_time = time.time() conn = MySQLdb.connect(host, user, passwd, db, port) cur = conn.cursor() sql = "UPDATE number SET num=id+100 where id<%s" % (40*10000, ) cur.execute(sql) conn.commit() cur.close() conn.close() end_time = time.time() print '用时:', end_time - start_time
输出:
用时: 3.27281808853
查看更新效果:
mysql> select * from number where id=1; +----+------+ | id | num | +----+------+ | 1 | 101 | +----+------+ 1 row in set (0.00 sec) mysql> select * from number where id=399999; +--------+--------+ | id | num | +--------+--------+ | 399999 | 400099 | +--------+--------+ 1 row in set (0.00 sec)
分析使用的sql语句:
mysql> explain UPDATE number SET num=id+100 where id<400000 \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: number type: range possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 489900 Extra: Using where 1 row in set (0.00 sec)
这种方式非常快,灵活性不够。
如果遇到这样一个场景:将id为0、1、2的num置为2,其他全部置为4,可以先全部置为4,然后在更新id为0、1、2的记录,这样速度会很快。
相关文章推荐
- 如何快速更新数据库中的百万条数据
- 如何快速更新数据库中的百万条数据
- 不影响数据库运行如何快速恢复刚删除的表及其数据(三)如何恢复刚更新的数百万条数据
- 如何实现快速地从超大数据库中提取数据并进行相关的操作
- 如何快速找回数据库表中误删除的数据
- 如何快速高效的更新memcached缓存数据
- 使用jdbc更新数据时间时如何屏蔽各种数据库的差异性
- 如何使用 t-sql 更新数据库中日期字段的部分数据(年?月?日?时?分?秒?)
- 不影响数据库运行如何快速恢复刚删除的表及其数据(二)如何恢复刚才删除的一条数据
- 如何用一个数据库的数据去更新另一个数据库的内容
- Android数据库更新并保留原来的数据如何实现
- 如何用PhpMyAdmin快速创建全新数据库并导入数据
- 如何快速建立数据字典,我如此设计小型数据库
- 如何用一个数据库的数据去更新另一个数据库的内容 (转)
- 如何用一个数据库的数据去更新另一个数据库的内容(续)
- 如何实现表单一次上传多表数据并更新到数据库
- 如何对不同数据库的表相互更新数据?
- DataTable 快速导入数据库——百万条数据只需几秒!!(最新,添加了委托)
- 不影响数据库运行如何快速恢复刚删除的表及其数据(四)如何恢复删除的表
- 不影响数据库运行如何快速恢复刚删除的表及其数据(一)flashback是什么