
Berkeley DB Source Code Analysis (5) --- Transactional Lock Module

2012-03-25 15:55
Locking Subsystem Learning Notes

0. locking API

__db_lput/__db_lget are the transactional lock put/get functions. Usually the __TLPUT macro is called instead, and __TLPUT calls __db_lput internally. If the db supports dirty reads and the lock is a write lock, __db_lput downgrades the lock rather than simply releasing it; in other circumstances it releases the lock.

__LPUT and __ENV_LPUT simply call __lock_put, so the lock is simply released.

1. So far we only use intention locks in CDS mode. This is because we do not use row-level locking in TDS mode; we only lock pages there. (What about the queue access method?)

2. We need to provide a row-level locking API, like the existing __db_lget/__db_lput for page-level locking. qam already uses record locking: it calls __db_lget with the DB_LOCK_RECORD flag to lock a record. However, __db_lget cannot be used directly for the new access method. In qam it locks a db-wide recno, so (dbfileuid, recno) is sufficient as the lock object; the new am has no global recno, so we will lock (dbfileuid, pgno, in-page-indx) instead.

qam does not lock pages, it always locks (dbfileuid, recno), so it does not need intention locking either.


3. __db_ilock (DB_LOCK_ILOCK) is the lock object structure used inside DB to describe what is being locked. All locks taken by DB internally use this type, e.g. the entries in the lock list. We need to add an in-page index to it in order to lock a row.

__db_lock is the struct in the lock region that stores each lock's info inside the lock subsystem, to manage database locks.

DB_LOCK is the handle used in lock requests by outside callers to get/put a lock. It contains an offset for finding the corresponding internal __db_lock.
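The row-level lock object proposed in item 3 could look roughly like the sketch below. This is not Berkeley DB code: `toy_row_lockobj`, its field names, and the 20-byte file id length are assumptions made for illustration only. It shows why two rows on the same page become distinct lock objects once the in-page index is part of the key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FILE_ID_LEN 20	/* Assumed length of the file unique id. */

/* Hypothetical row-level lock object: (file uid, page number, in-page index). */
struct toy_row_lockobj {
	uint32_t pgno;				/* Page holding the row. */
	uint8_t  fileid[TOY_FILE_ID_LEN];	/* Unique id of the database file. */
	uint32_t indx;				/* Index of the row within the page. */
};

/* Fill in a lock object; zero it first so any padding bytes compare equal. */
static void
toy_row_lockobj_init(struct toy_row_lockobj *obj,
    const uint8_t *fileid, uint32_t pgno, uint32_t indx)
{
	assert(obj != NULL && fileid != NULL);
	memset(obj, 0, sizeof(*obj));
	memcpy(obj->fileid, fileid, TOY_FILE_ID_LEN);
	obj->pgno = pgno;
	obj->indx = indx;
}

/* A lock manager treats lock objects as opaque bytes, so compare bytes. */
static int
toy_row_lockobj_equal(const struct toy_row_lockobj *a,
    const struct toy_row_lockobj *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

With this layout, (file, page 7, index 0) and (file, page 7, index 1) compare as different objects, so two lockers can write-lock neighbouring rows without conflicting.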

4. The $db/lock/Design document is obsolete.

First, there are multiple lock table partitions. Each has a __db_lockpart structure containing partition-specific info plus some of the free lock objects and free locks, so free lock objects/locks can be obtained in parallel from different partitions.

The main lock region has a __db_lockregion structure containing the global info of the lock region. It holds the locker free list (lockers are not spread across partitions), the locker and lock object hash tables, as expected, and the lock table partition array. Note that all the partitions live in the same region file as the main lock region.

There is no dedicated mutex for each bucket in the lockobj hash table; rather, a lock partition's mutex protects the buckets of lockobjs that map to it. So if there are as many partitions as lockobj buckets, the concurrency of lockobj management is not affected. However, there can be far fewer partitions than lockobjs and locks: the default is 1 on single-cpu systems and 10*cpu-number on multi-cpu systems. I think manually setting a large number of lock partitions can improve concurrency, though the doc says otherwise.

We use only one mutex to protect all the lockers, so concurrency suffers, especially when there are a lot of short but frequent txns, e.g. when mvcc is turned on. However, this must be something we have to do to avoid other issues.
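To make the bucket/partition relationship concrete, here is a small sketch. It assumes a plain modulo mapping from bucket index to partition; the function names are hypothetical and the mapping is an assumption of the sketch, not something taken from the notes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a lockobj hash bucket to the partition whose mutex guards it.
 * Assumption: a plain modulo, so buckets are spread round-robin.
 */
static uint32_t
toy_bucket_to_partition(uint32_t bucket, uint32_t nparts)
{
	assert(nparts > 0);
	return bucket % nparts;
}

/* How many buckets end up sharing one partition mutex (rounded up). */
static uint32_t
toy_buckets_per_partition(uint32_t nbuckets, uint32_t nparts)
{
	assert(nparts > 0);
	return (nbuckets + nparts - 1) / nparts;
}
```

With 1000 buckets and 10 partitions, 100 buckets contend for each mutex; with as many partitions as buckets, every bucket effectively has a mutex of its own.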

The stat info for the lock region is divided into 3 parts. First, DB_LOCK_STAT contains general stat info and lives in the main lock region (__db_lockregion). Second, there is an array of stats for the lock object hash buckets (DB_LOCK_HSTAT), which also lives in the main lock region; there are as many stat structures as lockobj hash buckets. Last, there is a DB_LOCK_PSTAT for each lock table partition (in __db_lockpart).

The reason for splitting the stats into multiple parts is to improve concurrency. When reporting stat info, we merge all the pieces into one DB_LOCK_STAT in the __lock_stat function.

Both locks and lockobjs have generation numbers, so that they can be reused while a stale reference left over from a previous use cannot be mistaken for the current use. Locker structs can be reused too, but they don't need generation numbers because each locker has a unique locker id.
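The generation-number idea can be illustrated with a toy slot/handle pair. All names here (`toy_slot`, `toy_handle`, `toy_acquire`, `toy_release`) are hypothetical; the sketch only shows how a handle remembered from an earlier use is rejected once the slot has been recycled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot in a shared region that is recycled across uses. */
struct toy_slot {
	uint32_t gen;		/* Bumped every time the slot is released. */
	bool     in_use;
};

/* Hypothetical caller-side handle: slot index plus the generation it saw. */
struct toy_handle {
	uint32_t ndx;
	uint32_t gen;
};

/* Start a new use of the slot and hand back a handle tagged with its gen. */
static struct toy_handle
toy_acquire(struct toy_slot *slots, uint32_t ndx)
{
	struct toy_handle h;

	assert(!slots[ndx].in_use);
	slots[ndx].in_use = true;
	h.ndx = ndx;
	h.gen = slots[ndx].gen;
	return h;
}

/* Release only if the handle still refers to the current use of the slot. */
static bool
toy_release(struct toy_slot *slots, struct toy_handle h)
{
	struct toy_slot *s = &slots[h.ndx];

	if (!s->in_use || s->gen != h.gen)
		return false;	/* Stale handle from a previous use. */
	s->in_use = false;
	s->gen++;		/* Invalidate every outstanding handle. */
	return true;
}
```

A second release through an old handle fails because its generation no longer matches, even though the slot index is the same.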

5. __lock_region_mutex_count

Total mutexes needed in the lock subsystem: max-number-of-locks + number-of-partitions + 3. Each lock has a mutex for the thread of control to block on; each partition has its own mutex; and the lock region, locker management, and the deadlock detector need 1 each.
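As a worked instance of that formula, the helper below (a hypothetical name, not the real function) just adds up the three terms.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One mutex per lock, one per partition, plus three singletons:
 * the lock region, locker management, and the deadlock detector.
 */
static uint32_t
toy_lock_mutex_count(uint32_t max_locks, uint32_t nparts)
{
	assert(nparts > 0);	/* There is always at least one partition. */
	return max_locks + nparts + 3;
}
```

For example, 1000 locks and 10 partitions need 1000 + 10 + 3 = 1013 mutexes.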

6. __lock_region_size

So far we allocate more hash buckets than there are things to store in the hash table. This is true for both the locker and the lockobj hash tables.

If the hash is good enough (a simple modulo operator % will be good enough), each locker and lockobj can get a hash bucket of its own, i.e. we never need to traverse a bucket list to find a locker/lockobj. Consequently there can potentially be a huge number of hash buckets, and I think this is one of the reasons we don't use one mutex per hash bucket.

7. __lock_env_refresh

It frees all allocated memory if the env is private. From here we can see which structures were allocated. The only issue I find is that only free lockers are freed; in-use lockers are not. In-use lockers are normally freed when the locker is closed (txn end; db/dbc handle close), but may not be freed if that close operation never executes.

8. __lock_region_init

Init the shared lock region structure (__db_lockregion), allocate mutexes, and init all members.

Allocate memory for the locker/lockobj hash tables, the lock partition array, the lockobj hash bucket stat array, and all lockers, locks and lockobjs. Link the locks and lockobjs evenly into each partition's free list, and link the lockers into a single free list.

This function is only called when we are creating, rather than joining, a lock region.

9. __lock_open

Open and init the __db_locktab structure, which is the per-process struct for the lock region. Call __lock_region_init to create the lock region if needed.

10. __lock_fix_list -- sort a lock request's objs

This is the only function in use in the file locklist.c. It arranges the lockobjs in the lockreq parameter of lock_vec so that they are grouped/sorted by file.

11. __lock_promote

Looks in a lockobj's waiter list WL for locks L that don't conflict with any lock in the same lockobj's holder list HL. Each such L is moved from WL to HL and marked DB_LSTAT_PENDING, and its L->mtx_lock mutex is unlocked so that the corresponding thread of control can continue executing. If the lockobj's WL then becomes empty and the lockobj is on the deadlock detector's dd_objs list, it is removed from dd_objs since it no longer blocks anyone.

It should be called whenever a lock becomes available so that some waiters may get the lock and continue (i.e. it acts as a scheduler), e.g. when a lock in the lockobj's holder list is released, downgraded, or given back to the parent txn.
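The promotion pass can be sketched with fixed-size arrays standing in for the shared-memory lists. Everything here is a toy model: the names are hypothetical, only READ/WRITE modes exist, and waking the blocked thread is reduced to setting a status. The sketch also stops at the first waiter that still conflicts, which keeps waiters in FIFO order; that stopping rule is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { TOY_READ, TOY_WRITE };
enum toy_status { TOY_WAITING, TOY_PENDING, TOY_HELD };

struct toy_lock {
	int locker;			/* Id of the locker owning this lock. */
	enum toy_mode mode;
	enum toy_status status;
};

#define TOY_MAX 8

struct toy_obj {
	struct toy_lock holders[TOY_MAX];
	size_t nholders;
	struct toy_lock waiters[TOY_MAX];
	size_t nwaiters;
};

/* Minimal conflict matrix: only READ/READ is compatible. */
static bool
toy_conflicts(enum toy_mode held, enum toy_mode wanted)
{
	return held == TOY_WRITE || wanted == TOY_WRITE;
}

/* Move leading compatible waiters to the holder list; return how many. */
static size_t
toy_promote(struct toy_obj *obj)
{
	size_t promoted = 0, h, i;

	while (promoted < obj->nwaiters) {
		struct toy_lock *w = &obj->waiters[promoted];

		/* Look for a holder from another locker that conflicts. */
		for (h = 0; h < obj->nholders; h++)
			if (obj->holders[h].locker != w->locker &&
			    toy_conflicts(obj->holders[h].mode, w->mode))
				break;
		if (h < obj->nholders)
			break;		/* Still blocked; stop here. */

		assert(obj->nholders < TOY_MAX);
		w->status = TOY_PENDING;	/* Granted, not yet running. */
		obj->holders[obj->nholders++] = *w;
		promoted++;
	}

	/* Close the gap left at the front of the waiter array. */
	for (i = promoted; i < obj->nwaiters; i++)
		obj->waiters[i - promoted] = obj->waiters[i];
	obj->nwaiters -= promoted;
	return promoted;
}
```

With an empty holder list and waiters READ, READ, WRITE, READ, one pass promotes the two leading readers and leaves the writer at the head of the waiter list.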

12. __lock_inherit_locks

Return a child txn's locks to its parent txn. For each lock L held by the child txn CT: if L is also held by the parent txn T, then L->refcount++; otherwise L is put into the lock list of T's locker. At the end, do a scheduling pass by calling __lock_promote so that any sibling txns waiting for the returned locks can go on.
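A toy version of the merge step is sketched below. The names are hypothetical, "also held by the parent" is interpreted as same object and same mode (an assumption), and the sketch adds the child's whole reference count, which equals the increment described above when the child holds the lock once. The final __lock_promote pass is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 8

struct toy_lock {
	int obj;		/* Id of the locked object. */
	int mode;		/* Lock mode, opaque to this sketch. */
	unsigned refcount;
};

struct toy_locker {
	struct toy_lock locks[TOY_MAX];
	size_t nlocks;
};

/* Hand every lock of the child over to the parent. */
static void
toy_inherit(struct toy_locker *parent, struct toy_locker *child)
{
	size_t c, p;

	for (c = 0; c < child->nlocks; c++) {
		const struct toy_lock *cl = &child->locks[c];

		/* Does the parent already hold an equivalent lock? */
		for (p = 0; p < parent->nlocks; p++)
			if (parent->locks[p].obj == cl->obj &&
			    parent->locks[p].mode == cl->mode)
				break;

		if (p < parent->nlocks) {
			/* Yes: fold the child's references into it. */
			parent->locks[p].refcount += cl->refcount;
		} else {
			/* No: the lock moves to the parent's list. */
			assert(parent->nlocks < TOY_MAX);
			parent->locks[parent->nlocks++] = *cl;
		}
	}
	child->nlocks = 0;
}
```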

13. __lock_remove_waiter

Remove a lock from a lockobj's waiter list. If the lockobj then has no locks in its waiter list, remove it from the deadlock detector's dd_objs list because it no longer blocks anyone. Finally, unlock the lock's mutex so that any thread of control waiting on it can continue.

A lock's mutex is used to suspend the thread of control T while it waits for the lock. lock->mtx_lock is created already locked, so when T tries to acquire it, T suspends. Whenever we want T to proceed because it got the lock, we unlock lock->mtx_lock on its behalf. T itself never unlocks the mutex after waking, so the mutex stays locked and the next time we can suspend T again by having it acquire lock->mtx_lock.

14. __lock_trade

Transfer a lock from locker A to another locker B, so that the lockobj A had locked is now locked by B.

15. __lock_allocobj and __lock_alloclock

__lock_allocobj allocates a lockobj from a partition other than the specified one, P. We do so because P has no free lockobj space, so we look for free lockobj space in other partitions and move it to P's free list. Such an action shows up as a lockobj 'steal' in the stat results.

Similarly we have __lock_alloclock. Free locks also live in partitions, so we iterate over the partitions to find free space for a lock, and steal it into the current partition.
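The steal logic can be sketched with per-partition free counters. This is a simplified model with hypothetical names: it takes one free slot at a time and scans the other partitions in round-robin order, both of which are assumptions of the sketch.

```c
#include <assert.h>

/* Hypothetical per-partition bookkeeping. */
struct toy_part {
	unsigned nfree;		/* Free slots currently in this partition. */
	unsigned nsteals;	/* Times this partition had to steal. */
};

/*
 * Take one free slot on behalf of partition `want`.  If `want` is empty,
 * scan the other partitions and steal from the first one with spare slots.
 * Returns the index of the partition the slot came from, or -1 if every
 * partition is exhausted.
 */
static int
toy_alloc(struct toy_part *parts, unsigned nparts, unsigned want)
{
	unsigned i, src;

	assert(want < nparts);
	if (parts[want].nfree > 0) {
		parts[want].nfree--;
		return (int)want;
	}
	for (i = 1; i < nparts; i++) {
		src = (want + i) % nparts;
		if (parts[src].nfree > 0) {
			parts[src].nfree--;
			parts[want].nsteals++;	/* Shows up in the stats. */
			return (int)src;
		}
	}
	return -1;
}
```

The `nsteals` counter plays the role of the 'steal' figure mentioned above: a high value suggests that free slots are unevenly used across partitions.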

16. __lock_getobj

Find the specified lockobj in the lockobj hash table. If the lockobj does not exist yet and we are asked to create one, first allocate lockobj space from a partition P. If P has no free lockobj space, steal from other partitions. Then init the lockobj fields.

Note that the lock subsystem allows external use, so the size of a lockobj may not be the standard sizeof(__db_ilock). If the external lock obj is larger than sizeof(__db_ilock), we allocate memory from the lock region and store its (off, size) in lockobj->lockobj; otherwise the built-in lockobj->objdata buffer is used. We then copy the bytes into whichever of the two locations applies.

When releasing a lockobj, we need to free that region space if lockobj->lockobj.size > sizeof(lockobj->objdata).
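The inline-versus-external storage choice can be sketched as follows. This is a process-local toy: `malloc` stands in for allocation from the lock region, a pointer stands in for the region offset, and the struct, function names, and 28-byte inline buffer size are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INLINE_SIZE 28	/* Assumed size of the built-in buffer. */

struct toy_lockobj {
	size_t size;		/* Length of the object's key bytes. */
	void  *data;		/* Points at objdata or at heap memory. */
	unsigned char objdata[TOY_INLINE_SIZE];
};

/* Store the key bytes inline when they fit, otherwise allocate. */
static int
toy_lockobj_set(struct toy_lockobj *obj, const void *bytes, size_t size)
{
	assert(bytes != NULL);
	if (size <= sizeof(obj->objdata))
		obj->data = obj->objdata;
	else if ((obj->data = malloc(size)) == NULL)
		return -1;
	memcpy(obj->data, bytes, size);
	obj->size = size;
	return 0;
}

/* Release external storage only if the key did not fit inline. */
static void
toy_lockobj_clear(struct toy_lockobj *obj)
{
	if (obj->size > sizeof(obj->objdata))
		free(obj->data);
	obj->data = NULL;
	obj->size = 0;
}
```

The size check in `toy_lockobj_clear` mirrors the release rule above: the stored size alone tells us whether there is external space to give back.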

17. __lock_freelock

Optionally remove a lock L from its locker's lock list, and optionally put it on its partition's free lock list. The lock's mutex is destroyed.

Question: what does lock promotion mean?

A locker's info is stored in shared memory, and txn/db/dbc handles only store a locker ID. When the locker is needed inside the locking subsystem, we look it up via __lock_getlocker and use it only inside the locking subsystem.

18. __lock_put_pp, __lock_put, __lock_put_nolock, __lock_put_internal

__lock_put_internal is the ultimate function called; all the rest are wrappers. __lock_put_internal releases a lock, either honoring or ignoring its reference count. Note that releasing any lock/lockobj increments its generation, so that it can be reused without stale references causing trouble.

It first unlinks the lock from its lockobj's waiter/holder list. If the lock was in the holder list, we call __lock_promote to schedule so that waiters can proceed. Then, if the lockobj has no locks in either its waiter or holder list, the lockobj is freed. Finally it calls __lock_freelock to free the lock, which removes it from its locker's lock list.

19. __lock_get_pp, __lock_get, __lock_get_api, __lock_get_internal

__lock_get_internal is the ultimate function that does the work. It decides whether a lock should be granted to the locker. If not, the thread of control executing this function blocks on the requested lock's mutex. The locker and lockobj are both assumed to exist already in the lock region.

Figure out if we can grant this lock or if it should wait.

By default, we can grant the new lock if it does not conflict

with anyone on the holders list OR anyone on the waiters list.

The reason that we don't grant if there's a conflict is that

this can lead to starvation (a writer waiting on a popularly

read item will never be granted). The downside of this is

that a waiting reader can prevent an upgrade from reader to writer,

which is not uncommon.

There are two exceptions to the no-conflict rule. First, if

a lock is held by the requesting locker AND the new lock does

not conflict with any other holders, then we grant the lock.

The most common place this happens is when the holder has a

WRITE lock and a READ lock request comes in for the same

locker.

If we do not grant the read lock, then we guarantee deadlock.

Second, dirty readers are granted if at all possible while

avoiding starvation, see below.

In case of conflict, we put the new lock on the end of the

waiters list, unless we are upgrading or this is a dirty reader in

which case the locker goes at or near the front of the list.

Lock holders on a lockobj have compatible locks, so if one of the locks is DB_LOCK_READ, the other locks are read locks too. That's why the grant_dirty variable assignment only tests one lock.

Locks can be downgraded from DB_LOCK_WRITE to DB_LOCK_WWRITE. A page can't be read by a dirty reader while it is locked DB_LOCK_WRITE. When we are done writing the page but the txn has not ended yet, and dirty reads are enabled, the lock is downgraded to DB_LOCK_WWRITE so that dirty readers can read the page. This prevents reading an invalid page: while the page's memory is being manipulated, it is completely broken/invalid to other txns. For the same reason, a page that is being dirty-read should not be written, otherwise there may be memory errors; thus a dirty read lock conflicts with WRITE locks. Such a page should not be normally-read either, because it is dirty, but the conflict matrix does not prevent that; it is just that no locker does this.

When a locker gets its requested lock, the lock's status is set to DB_LSTAT_PENDING, and when the thread of control actually runs, the status is set to DB_LSTAT_HELD. While it is waiting for a lock, the status is DB_LSTAT_WAITING.

DB_LOCK_UPGRADE is a flag modifying DB_LOCK_WRITE, meaning the locker wants to upgrade from a READ to a WRITE lock; it is not a type of lock.

op flow:

1. Decide whether to grant the lock or wait, and where to wait in the lockobj's waiter list.

2. Create the lock struct in the lock region and wire it up with the locker/lockobj.

3. If the lock is granted, continue and return. Otherwise, run deadlock detection before waiting for the lock (acquiring the lock's mutex, which suspends the thread of control since that mutex was already acquired when the lock was created). Once we get the lock and continue, handle timeout expiry and other details.
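Step 1 of the flow, the grant-or-wait decision, can be sketched as a pure function. This is a toy model under stated assumptions: only READ and WRITE modes, no dirty readers, no parent/child lockers, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { TOY_READ, TOY_WRITE };
enum toy_action { TOY_GRANT, TOY_WAIT_HEAD, TOY_WAIT_TAIL };

struct toy_lock {
	int locker;
	enum toy_mode mode;
};

/* Minimal conflict matrix: only READ/READ is compatible. */
static bool
toy_conflicts(enum toy_mode a, enum toy_mode b)
{
	return a == TOY_WRITE || b == TOY_WRITE;
}

/* Decide what to do with a request from `locker` for `mode`. */
static enum toy_action
toy_decide(const struct toy_lock *holders, size_t nholders,
    const struct toy_lock *waiters, size_t nwaiters,
    int locker, enum toy_mode mode)
{
	bool ihold = false, holder_conflict = false;
	size_t i;

	assert(holders != NULL || nholders == 0);
	assert(waiters != NULL || nwaiters == 0);

	for (i = 0; i < nholders; i++) {
		if (holders[i].locker == locker)
			ihold = true;		/* We already hold a lock here. */
		else if (toy_conflicts(holders[i].mode, mode))
			holder_conflict = true;
	}

	/* Conflict with another holder: wait, upgrades go to the front. */
	if (holder_conflict)
		return ihold ? TOY_WAIT_HEAD : TOY_WAIT_TAIL;

	/* Exception: an existing holder need not respect the waiters. */
	if (ihold)
		return TOY_GRANT;

	/* Default rule: do not jump ahead of a conflicting waiter. */
	for (i = 0; i < nwaiters; i++)
		if (toy_conflicts(waiters[i].mode, mode))
			return TOY_WAIT_TAIL;
	return TOY_GRANT;
}
```

For example, with one reader holding the object and a writer waiting, a new reader from a third locker is sent to the tail of the waiter list so that the writer does not starve, while a write request from the existing holder is granted immediately.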

Guidelines:

1. Reduce deadlocks

If a locker already holds a lock on a lockobj and wants more locks on it, it should get them ASAP, otherwise a cyclic wait has a better chance to form. So if the new lockreq doesn't conflict with any other holder of the lockobj, it gets the lock; otherwise it waits at the head of the waiter list and will get the lock earliest.

Also, if the lockreq wants to upgrade to a write lock, it should have higher priority to proceed: if it conflicts with a holder, it goes first in the waiter list; otherwise it gets the lock, with no need to be compatible with all waiters. This way a lock upgrade can complete asap; otherwise it is more likely to starve than read locks, since read locks can starve lockers that want to upgrade.

2. Avoid starvation

A lockreq should not conflict with any holder, and it should not conflict with any waiter (except in the 2 situations above), so that waiters don't starve.

3. Promote throughput

Dirty readers (lockers that have DB_READ_UNCOMMITTED) should get locks asap, which is why they want to read dirty data in the first place. Thus they are put 1st or 2nd in the waiter list, after the write lock.

20. __lock_vec_pp, __lock_vec_api, __lock_vec

__lock_vec is the ultimate function that does the work; it calls __lock_put_internal and __lock_get_internal to do so.

DB_LOCK_WAIT: an internal type of lock that locks nothing but waits for an event to happen. Such a lock is used as a condition variable.