Berkeley DB Source Code Analysis (3) --- Btree Implementation (2)

2012-03-25 15:54
__bam_ditem

In a btree we store on-page duplicate key/data pairs this way:

1. We put the key onto the page only once; since the keys are identical, there is no point in storing them multiple times. We do store each of the duplicate key's data items on the page.

2. In the index array, multiple index elements for the duplicate key point at the same key offset. Since the index array is sorted by the keys its elements point at, the elements for the same duplicate key are contiguous; for example, indx[i], indx[i+1] and indx[i+2] all point at the same key value on a page holding 3 duplicate key/data pairs.

So when deleting the key at indx[i+1], we don't remove the key from the page, since indx[i] and indx[i+2] still point at it. We simply move the elements after indx[i+1] one slot forward, after which indx[i] and indx[i+1] point at that key, leaving two duplicate key/data pairs. When deleting a key/data pair from a btree leaf page, we do it in two steps: first delete the key, then delete the data item -- the order can't be reversed. A minimal sketch of this index-array shuffle follows.

Deleting key/data pairs

1. In DBC->del, we only mark the key/data pair deleted (B_DELETE) and mark the cursor as pointing to a deleted k/d pair (C_DELETED); we don't actually remove the k/d from the page, unless the cursor is closed while pointing to a deleted k/d. In that special case we remove the single k/d pair it points to. After a data item is marked deleted, it can still be found/located internally by the search functions, but it is never returned to the user. The space it takes can be overwritten when inserting a k/d that belongs at exactly the same page and location.

Thus, if we use DB->del to delete a k/d, it is immediately deleted from the db; if we use DBC->del to iterate over the db and delete each k/d, none except the last one is physically removed from the db. This avoids frequent tree structure changes (split/reverse split), which are expensive operations, but it also potentially wastes a lot of space. The sketch below contrasts the two calls.
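
A usage sketch of the contrast (assumes `dbp` is an open DB_BTREE handle; error handling abbreviated):

#include <db.h>
#include <string.h>

static int
delete_examples(DB *dbp)
{
    DBC *dbc;
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = "somekey";
    key.size = 7;

    /* DB->del: the pair is removed from the page immediately. */
    if ((ret = dbp->del(dbp, NULL, &key, 0)) != 0 && ret != DB_NOTFOUND)
        return (ret);

    /* DBC->del: each pair is only marked B_DELETE on the page. */
    if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0)
        return (ret);
    while ((ret = dbc->get(dbc, &key, &data, DB_NEXT)) == 0)
        if ((ret = dbc->del(dbc, 0)) != 0)
            break;
    if (ret == DB_NOTFOUND)
        ret = 0;
    /* Closing while positioned on a deleted pair removes that one pair. */
    (void)dbc->close(dbc);
    return (ret);
}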

I think we should add a DB_FORCE flag for DBC->close: when it is specified we know no other cursor is pointing at the k/d, so when our cursor is about to move away from the current page to another page, we can delete all k/d pairs marked B_DELETE. We don't remove on each DBC->del call because that would make the cursor movement operations harder to implement.

__db_ret

Return a specified key/data pair, given the page pointer (already locked and fetched from mpool), the pgno and the index.

__bam_getboth_finddatum

Works for the DB_GET_BOTH and DB_GET_BOTH_RANGE flags of DB/DBC->get.

If DB_DUP is set but DB_DUPSORT is not, dbp->dup_compare is null, so we do a linear search and look only for an exact match even when RANGE is specified; i.e. DB_GET_BOTH_RANGE is identical to DB_GET_BOTH without DB_DUPSORT, which is undocumented.

Otherwise both DB_DUP and DB_DUPSORT are specified, and we do a binary search on the leaf page. __bamc_search does the btree search in the opd tree, not this function. A usage sketch of the two flags follows.
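
A usage sketch (assumes `dbp` was opened with DB_DUP, plus DB_DUPSORT for the RANGE variant to be meaningful):

#include <db.h>
#include <string.h>

static int
find_exact_pair(DB *dbp, void *k, u_int32_t klen, void *d, u_int32_t dlen)
{
    DBC *dbc;
    DBT key, data;
    int ret;

    if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0)
        return (ret);
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = k;
    key.size = klen;
    data.data = d;
    data.size = dlen;

    /*
     * DB_GET_BOTH: succeed only on the exact key/data pair.
     * With DB_GET_BOTH_RANGE on a DB_DUPSORT db, the cursor would land
     * on the smallest duplicate >= the supplied data item instead.
     */
    ret = dbc->get(dbc, &key, &data, DB_GET_BOTH); /* 0 or DB_NOTFOUND */
    (void)dbc->close(dbc);
    return (ret);
}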

__bamc_put

In a btree we can't specify where to store a k/d pair, because it is stored according to the key's value and the data's value. The only exception to this rule is when the btree allows duplicates (DB_DUP set) but not sorted duplicates (!DB_DUPSORT); in that case we can ask to insert a data item before or after (DB_BEFORE/DB_AFTER) the cursor's current key/data pair, as a duplicate data item for the same key.

Other flags like DB_OVERWRITE_DUP, DB_NODUPDATA and DB_NOOVERWRITE all control how to deal with duplicate data items rather than the movement or position of the cursor. The first two are effective when the db is a btree with both DB_DUP and DB_DUPSORT set: DB_OVERWRITE_DUP asks DB/DBC->put to overwrite an identical key/data pair in such a db, while DB_NODUPDATA asks DB/DBC->put to return DB_KEYEXIST instead. DB_NOOVERWRITE asks DB/DBC->put to always return DB_KEYEXIST if the key being put already exists in the db, whether or not DB_DUP is set. A usage sketch follows.
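
A usage sketch of the duplicate-handling flags (assumes `dbp` is a btree with DB_DUP|DB_DUPSORT set; note DB_OVERWRITE_DUP only exists in newer releases):

#include <db.h>

static int
put_dup_example(DB *dbp, DBT *key, DBT *data)
{
    int ret;

    /* Fails with DB_KEYEXIST if this exact key/data pair exists. */
    if ((ret = dbp->put(dbp, NULL, key, data, DB_NODUPDATA))
        == DB_KEYEXIST)
        /* Overwrite the identical pair instead of failing. */
        ret = dbp->put(dbp, NULL, key, data, DB_OVERWRITE_DUP);

    /* DB_NOOVERWRITE would instead fail if the key exists at all. */
    return (ret);
}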

In this function we need to deal with four kinds of btree db: a. a no-dup db; b. a dup but unsorted db; c. sorted dup on-page key/data pairs; d. a sorted dup opd tree.

We decide the flags parameter for __bam_iitem according to the type of db and the flags parameter given to __bamc_put. We may need to split if the page is full, and if we do so, we retry after the split. We also move the cursor to the correct position for the put operation, for __bam_iitem to operate on.

If our cursor is currently in the main btree and we need to go into an opd tree to search/put, we have to return to the caller so it can create an opd cursor for this key item, whose opd root pgno we have already found, and retry the put operation.

* __bam_iitem (dbc, key, data, op, flags)

Insert a key/data pair at a specified position. The flags parameter is not used. When this function is called, the cursor has already been positioned where we want to insert/update the key/data pair. If op is DB_CURRENT, we update the current k/d pair; if op is DB_BEFORE/DB_AFTER, we insert a new data item as a duplicate data item of the current k/d pair's key, immediately before or after the current position; if op is DB_KEYFIRST, we insert the key/data pair at its sorted position in the btree. A usage sketch of the positioned insert follows.
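
A usage sketch of the positioned-insert case (assumes `dbc` is a cursor on a btree opened with DB_DUP but not DB_DUPSORT, already positioned by a prior DBC->get; only then may the caller choose where a duplicate lands):

#include <db.h>
#include <string.h>

static int
insert_dup_after(DBC *dbc, void *d, u_int32_t dlen)
{
    DBT key, data;

    /* The key DBT is ignored for DB_AFTER/DB_BEFORE: the new data item
     * becomes a duplicate of the key under the cursor. */
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    data.data = d;
    data.size = dlen;
    return (dbc->put(dbc, &key, &data, DB_AFTER));
}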

The first part handles partial puts and streaming puts. If it's a streaming put -- identifiable from the dbt's doff/dlen and the existing data item's length -- we need to read the entire existing data item into the "tdbt" variable by calling __db_goff, and set DB_DBT_STREAMING. If it's a normal partial put, we read it into "tdbt" by calling __bam_build instead.

The second part computes the bytes needed to store this data item, based on the type of operation and the byte length computed at the beginning of this function. If the free space on the page is not enough for this item, we split. We don't try to reuse items marked deleted; I think we should do so, by making use of the space taken by deleted items that are not referenced by any cursor. (TODO)

QUESTION:

Consider this situation: we insert a thousand k/d pairs whose keys are always "1" and whose data items are arbitrary byte strings. How will the split handle it? Dozens of pages are needed, but only one key is available. I think the only solution is to store all the dups in an opd, but not form an opd tree: just put these data items into the chain of opd pages.

If duplicates are allowed and less than half of the page is free after this k/d is put, we call __bam_dup_check to check whether we need to move all the duplicate data items for the key into an opd btree/recno tree. If the duplicate data items consume over a quarter of the page space, we call __bam_dup_convert to do so; thus we need only one such page to store all the on-page dups. If DB_DUPSORT is set on the db handle, we use a btree opd to store the sorted duplicate data items; otherwise we use a recno opd tree to store the unsorted ones.

Then, we put the key and then the data item onto the page.

DETAILS:

Putting a key:

For DB_AFTER and DB_BEFORE, we only store another element in the index array, whose value is the key's offset; then we adjust the cursor indexes. For DB_CURRENT, we need to modify only the current data item, not the key, so it is covered below. DB_KEYFIRST is the only case in which we need to store the key onto the page: we call __bam_ovput if the key should go into an overflow page, or __db_pitem otherwise.

Putting a data item:

Calls __bam_ovput, __db_pitem and __bam_ritem to put the data item onto the page or onto overflow pages. Finally, __bam_dup_convert is called to move too-large dup items into an opd tree.

* Partial put: __db_buildpartial, __db_partsize and __bam_partsize

For a DBT dbt passed to DB/DBC->put as a partial put: if dbt.doff > the existing data item's total length L0 (L0 can be 0 when adding a new data item for an existing key using DB_AFTER or DB_BEFORE), then after the partial put there will be a hole of zeros between byte L0 and byte dbt.doff of the new data item's byte array.

__db_partsize computes how many bytes the new data item will need, based on the existing data item's total length and the dbt's doff, dlen and size members; the new data item may shrink or grow. __db_buildpartial builds the new data item, given the new total size of the item, the existing bytes, and dbt.data's new bytes. __bam_partsize gets the btree-specific new data item's total length: it first gets the data item's existing length, then calls __db_partsize, providing the old data item length and the new data item represented by a dbt, to obtain the total size in bytes of the new data item we will insert. A sketch of the size arithmetic follows.
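
A sketch of the size arithmetic, written from the description above (a simplified reimplementation, not the actual __db_partsize source):

#include <stdint.h>
#include <stdio.h>

static uint32_t
partial_new_size(uint32_t oldlen, uint32_t doff, uint32_t dlen,
    uint32_t size)
{
    /*
     * The replaced region runs past the end of the existing record
     * (this includes the doff > oldlen "hole of zeros" case): the new
     * record ends exactly where the newly written bytes end.
     */
    if (oldlen < doff + dlen)
        return (doff + size);
    /* Otherwise dlen bytes are cut out and size bytes are spliced in. */
    return (oldlen - dlen + size);
}

int
main(void)
{
    /* 100-byte record: replace 10 bytes at offset 20 with 30 bytes. */
    printf("%u\n", partial_new_size(100, 20, 10, 30)); /* 120, grows */
    /* 100-byte record: write 5 bytes at offset 200, hole of zeros. */
    printf("%u\n", partial_new_size(100, 200, 0, 5));  /* 205 */
    return (0);
}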

* __db_goff (dbt, bpp, bpsz, pgno, tlen)

Get an overflow data item, partially or entirely, whose first page number is pgno and whose total length is tlen. The partial-get settings are in dbt. It starts from the first page, fetching each page of the chain from mpool, until it arrives at the page where the partial get should start (dbt.doff), then gets the specified length. It is optimized for streaming gets: when doing a streaming get, i.e. reading a huge chunk of data continuously, portion by portion, with no overlap or holes, the next pgno and the stream offset are stored in the cursor, so we can start directly from that pgno's stream offset and read as much as we want, without walking the page chain from the beginning.

* __bam_build

Build a dbt object representing the new data item we want to insert. The data item may or may not be an overflow item, and partial puts are allowed. Streaming puts don't need this function.

* __bam_ovput(dbc, type, pgno, h, indx, item)

Store an overflow or duplicate data item into the overflow pages or the opd tree, and put the B_OVERFLOW/B_DUPLICATE on-page item onto the leaf page.

type: overflow or dups;
pgno: the opd root page;
h: page pointer of the leaf page;
indx: the index in the leaf page's index array where we will put the offset of the leaf-page item; this item can be an overflow item (BOVERFLOW) or a B_DUPLICATE;
item: the data item to store.

__db_pitem (pagep, indx, nbytes, hdr, data)

Put a data item (all 3 types are supported: BKEYDATA, BOVERFLOW and BDUPLICATE) onto the page.

pagep: the page where we will store the data item;
indx: the index of the element that will store the offset of this data item;
nbytes: the total bytes of the data item, including the header (BKEYDATA or one of the other 2 header types), whether or not hdr is NULL;
hdr: the BKEYDATA (or other) header; if NULL, it is constructed internally;
data: the data item to store.

The remaining work is to shuffle the index array to store the offset of the item, and to append the item to the page at the tail of the existing items (items grow from the end of the page toward the beginning). A toy slotted-page sketch follows.
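
A toy slotted-page sketch of that layout (hypothetical simplified structures, not the real PAGE format): items are claimed from the high end of the page growing toward the beginning, the index array grows from the low end, and an insert at `indx` opens a hole in the index array:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct toy_page {
    uint8_t  buf[4096];  /* item bytes, filled from the end backward */
    uint16_t inp[128];   /* offsets into buf[], kept in key order */
    uint16_t nent;       /* number of live index slots */
    uint16_t hoffset;    /* lowest used byte in buf[] */
};

static void
toy_pitem(struct toy_page *p, uint16_t indx, const void *item,
    uint16_t nbytes)
{
    /* Claim nbytes from the free space, just below the previous items. */
    p->hoffset -= nbytes;
    memcpy(&p->buf[p->hoffset], item, nbytes);

    /* Open a hole at inp[indx] and record the new item's offset. */
    memmove(&p->inp[indx + 1], &p->inp[indx],
        (p->nent - indx) * sizeof(p->inp[0]));
    p->inp[indx] = p->hoffset;
    p->nent++;
}

int
main(void)
{
    struct toy_page p = { .nent = 0, .hoffset = sizeof(p.buf) };
    uint16_t i;

    toy_pitem(&p, 0, "bbb", 3);
    toy_pitem(&p, 0, "aaa", 3);  /* sorts before "bbb" */
    for (i = 0; i < p.nent; i++)
        printf("inp[%u] = %u -> %.3s\n", (unsigned)i,
            (unsigned)p.inp[i], (char *)&p.buf[p.inp[i]]);
    return (0);
}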

__db_ditem (pagep, indx, nbytes)

Remove the data item referred to by element 'indx' of the index array of page 'pagep': shuffle the tail part of the item area toward the end of the page (i.e. toward the beginning of the item area), and shuffle the index array's elements after 'indx' one element forward. Consequently we also need to adjust the index elements that point at items with smaller offsets than the original inp[indx].

We do not trust the page contents, because this function can also be used by recovery code, when the page may be inconsistent. Thus we need 'nbytes' to indicate the number of bytes of this data item; it can be computed during normal operation and read from the log during recovery.

For each page-altering function such as __db_ditem, there is a *_nolog variant like __db_ditem_nolog: __db_ditem calls it and then logs the operation, so that the *_nolog variant can also be called during recovery.

__bam_dup_convert (h, indx, cnt)

Move duplicate data items from a leaf page onto an opd tree leaf page. It won't need to set up a tree of more than one page, because when it is called the duplicate items take up only a quarter of the leaf page.

h: the leaf page containing the duplicate data items;
cnt: the number of key/data pairs, e.g. (1, 2) and (1, 3) give cnt 2;
indx: the index of the first key item of the duplicate key/data pairs.

First we move the dbc cursor to the first key/data pair of the duplicate k/d pairs. Then we allocate a new page via __db_new, whose type is either LDUP (btree opd for sorted dup items) or LRECNO (recno opd for unsorted dup items), and initialize the page.

Then we iterate over all the k/d pairs; for each data item we move it onto the opd leaf page. The data items will be used as keys in the opd tree, and no data items will be available on those leaf pages, which is why we need another page type (LDUP). Then we delete the key index from h's index array by calling __bam_adjindx, then call __db_ditem to remove the data item (it's already on the opd leaf page) from h. Any of these items can be a BOVERFLOW data item, in which case we delete the page chain via __db_doff before calling __db_ditem.

Finally we call __bam_ovput to store the dup leaf page into the db and its pgno onto the data item for the key, as a B_DUPLICATE data item. This DUP data item's index in h's index array is of course 1 greater than the first key's index.

__bam_get_root (dbc, root_pgno, slevel, flags, stack)

Fetch the root of a tree and see if we want to keep it in the stack.

Determine the lock mode and buffer-get mode: by default, read-lock it; but under some conditions we want to write-lock it, which means we will dirty the page, so we use DB_MPOOL_DIRTY in memp_fget. Sometimes we want to try-latch the buffer instead.

Get the root page using the chosen lock and get modes; we need to consider sub-databases' root pages, which may change position. If we fail to try-latch the buffer, lock the page and retry the get-root-page action.

TO BE CONTINUED....

__bam_split

Context knowledge: btree leaf pages have level 1, leaf pages' parent pages have level 2, and so on. Search path: when searching for a key in a tree we obtain a stack of pages, which we also call the search path for the key.

This function splits a root page or any other page of a btree. We are given the key whose insert failed because the leaf page was full. When we split a page we need to promote the mid value to its parent page, so over time the parent page can become full too; thus a leaf page split may end up splitting the entire search path from the root to the leaf, and we would need to lock all pages in the path to do such a split.

However, to enhance concurrency we don't want to lock the entire path at once. Rather, we try to lock as few pages, and as low-level pages, as possible. We search down the tree for key K to a certain level L (L starts at the leaf level), locking page L and its parent page PL, and see whether splitting L requires splitting PL (by calling the split functions __bam_root and __bam_page, which return DB_NEEDSPLIT if the parent needs a split, and succeed if it does not). If it does, we go UP one level and repeat the search-and-check, until we reach L0, the page in our search path that does not require its parent to split; L0 is the starting point for the entire split process. We then start at L0 and go downward along the search path, locking a page and its parent, then splitting the page. Such a split can succeed because the parent already has enough space: either it was just split, or (for L0's parent) it had enough space to begin with. But such a split can fail too, because between two of our splits of internal pages, the parent (PS) of the page we are splitting may have filled up so that we can't promote the mid value into it; then we have to go UP again starting from PS to find the new L0, split pages and go down again as described above, until we reach the leaf level.

Whenever we go UP or DOWN, we MUST start a new search from the root page; we should always lock pages going downward from the root, never upward. If our insert operation fails because the leaf page needs a split, we must unlock all pages and then do the split. A control-flow sketch of this retry loop follows.
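
A pseudocode-style C sketch of the retry loop (search_to_level, on_root_page, split_root and split_page are hypothetical stand-ins for the real search and split internals; DB_NEEDSPLIT is the library's internal return code, not part of the public db.h):

#include <db.h>

#define LEAFLEVEL 1  /* btree leaf pages are at level 1 */

/* Hypothetical stand-ins for the real internals. */
static int search_to_level(DBC *dbc, DBT *key, int level);
static int on_root_page(DBC *dbc);
static int split_root(DBC *dbc);   /* ~ __bam_root */
static int split_page(DBC *dbc);   /* ~ __bam_page */

static int
split_sketch(DBC *dbc, DBT *key)
{
    int level, ret;

    for (level = LEAFLEVEL;;) {
        /* Always re-search from the root; lock only the target page
         * and its parent. */
        if ((ret = search_to_level(dbc, key, level)) != 0)
            return (ret);

        ret = on_root_page(dbc) ? split_root(dbc) : split_page(dbc);
        if (ret == 0) {
            if (level == LEAFLEVEL)
                return (0);     /* the leaf itself was split */
            level = LEAFLEVEL;  /* go back down the search path */
        } else if (ret == DB_NEEDSPLIT)
            level++;            /* the parent is full: go up one level */
        else
            return (ret);
    }
}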

__bam_root -- split a btree root page.

The special thing about the root page is that we don't have to worry that its parent page may need to split too, since the root page has no parent. Mark the root page dirty first, in case the page we hold is read-only; then allocate new pages via __db_new for the left and right pages and initialize them; then call __bam_psplit to do the page split, and log the split action. Since we are splitting the root, there is no risk of needing a further upper-level split.

In this function we don't log a snapshot of the root page before the split; we only log the new root page after the split, as well as the new left and right leaf pages and the two pointers to them in the new root page.

In this function the lock and mempool operations are done in the wrong order: we unlock the pages before the mempool puts, but we should have done it in the reverse order.

__bam_page (pg2split)

Split a btree page other than the root page.

Call __bam_psplit to move pg2split's left-half and right-half data into 2 pages allocated from malloc'ed memory, rather than from the cache, and initialize them. We must first test whether pg2split's parent page needs a split; if so, __bam_pinsert returns DB_NEEDSPLIT but does not actually modify the parent page.

We malloc space for both the left and right pages, so we don't get a new page from the underlying buffer pool until we know the split is going to succeed. The reason is that we can't release locks acquired during the get-a-new-page process: metadata page locks can't be discarded on failure, since we may have modified the free list.

If the __bam_pinsert test succeeds, we call __db_new to allocate the pages, call __bam_pinsert again to actually modify the parent page, and log the split.

__bam_pinsert

Promote the mid value to the parent page; may return DB_NEEDSPLIT if the parent page is full.

__bam_psplit: move half of the items from one page to another. Both __bam_root and __bam_page call it to do the split work. It does not do any logging.

When we want to read/write a page, we should always first lock the page in the correct mode, then call mempool methods to get the page, and modify it. After the modification we should first put it back to mempool, then unlock the page if needed (transactional locks are not released here). This way there will always be only one access method executing in the mpool, although there may be background threads doing trickling, checkpointing, dirty-page syncing, etc. An order-of-operations sketch follows.
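
An order-of-operations sketch of this rule (the mpool calls mirror the public DB_MPOOLFILE->get/put interface; lock_page_write and unlock_page are hypothetical stand-ins for the internal lock calls):

#include <db.h>

/* Hypothetical stand-ins for the internal page-lock calls. */
static int lock_page_write(db_pgno_t pgno);
static int unlock_page(db_pgno_t pgno);

static int
modify_page_sketch(DB_MPOOLFILE *mpf, DB_TXN *txn, db_pgno_t pgno)
{
    void *h;
    int ret, t_ret;

    /* 1. Lock the page in the correct mode. */
    if ((ret = lock_page_write(pgno)) != 0)
        return (ret);
    /* 2. Fetch from mpool; DB_MPOOL_DIRTY because we will modify it. */
    if ((ret = mpf->get(mpf, &pgno, txn, DB_MPOOL_DIRTY, &h)) != 0)
        goto err;

    /* 3. ... modify the page image through h ... */

    /* 4. Put it back to mpool first ... */
    ret = mpf->put(mpf, h, DB_PRIORITY_UNCHANGED, 0);
err:
    /* 5. ... then unlock (transactional locks are held past here). */
    if ((t_ret = unlock_page(pgno)) != 0 && ret == 0)
        ret = t_ret;
    return (ret);
}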