Berkeley DB Source Code Analysis (2) --- Btree Implementation (1)

2012-03-25 15:52
II. Type Dictionary

1. BTREE

The structure behind the DB handle's DB->bt_internal pointer. It stores per-process, per-DB-handle btree information and function pointers.

2. BTMETA

The btree meta page structure, shared by all processes. It holds what is stored in the btree's meta page: all btree-specific database-wide information, plus the database-wide information common to all access methods (AM).

3. BTREE_CURSOR

It contains the common cursor fields defined in the macro __DBC_INTERNAL, which are shared by all cursor types, plus btree-specific fields, chiefly a page stack used for btree searches.

The __dbc type holds a pointer to the __dbc_internal type, which is defined as the "base class" for all cursor types. What we actually allocate, however, is a BTREE_CURSOR (or another access-method-specific cursor), and we cast the pointer to the specific cursor type before using it. The __dbc_internal type is never used directly.
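
The "base class" arrangement described above can be sketched in plain C. This is a minimal illustration, not the Berkeley DB definitions: every name starting with sketch_ or SKETCH_ is hypothetical, and the field set is reduced to two common fields plus a small page stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for __DBC_INTERNAL: fields shared by all cursor types. */
#define SKETCH_CURSOR_COMMON                                      \
    unsigned int pgno; /* page the cursor currently points at */ \
    unsigned int indx; /* index of the item within that page */

/* The "base class": nothing but the common fields. */
struct sketch_cursor_base { SKETCH_CURSOR_COMMON };

/* The btree cursor: the same common fields first, then a page stack. */
struct sketch_btree_cursor {
    SKETCH_CURSOR_COMMON
    unsigned int stack[8]; /* pages visited during a search */
    int depth;             /* number of valid stack entries */
};

/* The generic cursor handle only knows the base type. */
struct sketch_dbc { struct sketch_cursor_base *internal; };

/* Allocate the specific cursor type, store it behind the base pointer. */
int sketch_dbc_init(struct sketch_dbc *dbc) {
    struct sketch_btree_cursor *cp = calloc(1, sizeof(*cp));
    if (cp == NULL)
        return -1;
    dbc->internal = (struct sketch_cursor_base *)cp;
    return 0;
}

/* Btree code casts back to its own cursor type before use. */
int sketch_push(struct sketch_dbc *dbc, unsigned int pgno) {
    struct sketch_btree_cursor *cp = (struct sketch_btree_cursor *)dbc->internal;
    if (cp->depth >= 8)
        return -1;
    cp->stack[cp->depth++] = pgno;
    cp->pgno = pgno;
    return 0;
}

int sketch_depth(const struct sketch_dbc *dbc) {
    return ((const struct sketch_btree_cursor *)dbc->internal)->depth;
}

unsigned int sketch_top(const struct sketch_dbc *dbc) {
    const struct sketch_btree_cursor *cp =
        (const struct sketch_btree_cursor *)dbc->internal;
    assert(cp->depth > 0);
    return cp->stack[cp->depth - 1];
}

void sketch_dbc_close(struct sketch_dbc *dbc) {
    free(dbc->internal);
    dbc->internal = NULL;
}
```

The cast only works because the common fields sit at the very start of every specific cursor type, which is what generating them from one macro guarantees.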

III. Macro Dictionary

P_INIT: initialize a non-meta page.

DBC_LOGGING: check, via a cursor, whether logging is in use.

LCK_COUPLE: a parameter to __db_lget. If specified, __db_lget first releases the lock and then acquires a lock on the same lock object with the specified lock mode.

DBC_DOWNREV: In replication, the master is allowed to run an older version of the DB library than the replicas. If the master runs a DB version older than the first version with latching support, the replicas notice this and set this flag on all of their cursors; the replicas then use traditional mutex locking rather than shared latches.

STD_LOCKING: whether to use standard locking, that is: the locking subsystem is started, we are not using CDS, and the cursor is not sitting on an off-page duplicate page/tree.

DB_MPOOL_TRY: a flag for __memp_fget that tells it to try to get a latch on the target page's buffer. If the latch is not granted, __memp_fget returns DB_LOCK_NOTGRANTED immediately without waiting.
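
The try-then-give-up behavior of DB_MPOOL_TRY can be illustrated with a tiny exclusive latch built on a C11 atomic flag. This is only a sketch of the idea: the sketch_ names and the SKETCH_NOTGRANTED code are hypothetical, and the latch here is far simpler than a real buffer latch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NOTGRANTED (-1) /* hypothetical stand-in for DB_LOCK_NOTGRANTED */

/* Hypothetical page buffer protected by a simple exclusive latch. */
struct sketch_buf { atomic_flag latch; };

/*
 * Get the buffer. With try_only set, fail at once if the latch is held;
 * otherwise spin until it becomes free.
 */
int sketch_fget(struct sketch_buf *b, int try_only) {
    assert(b != NULL);
    while (atomic_flag_test_and_set(&b->latch)) {
        if (try_only)
            return SKETCH_NOTGRANTED;
        /* busy-wait; a real latch would block or back off */
    }
    return 0;
}

/* Return the buffer and release its latch. */
void sketch_fput(struct sketch_buf *b) {
    assert(b != NULL);
    atomic_flag_clear(&b->latch);
}
```

A caller that receives the not-granted result is free to release what it already holds and retry, instead of waiting while holding other resources.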

IV. Function Dictionary

0. __bam_open

Open a btree database. It simply calls __bam_read_root after some configuration checks.

1. __bam_read_root

Get the btree database's metadata page and use the information in it to initialize the BTREE structure at DB->bt_internal. The metadata was filled in by the general DB->open call before __bam_open was called.

2. __bam_metachk

Check the validity of a btree metadata page.

3. __bam_init_meta

Initialize the fields of a metadata page, i.e. the fields of the BTMETA structure. Called whenever a metadata page is created while opening a btree database. Pages other than meta pages are initialized with P_INIT.

4. __bam_new_file

Create a btree database file by initializing its meta page and root page. It is called during the database open process, routed from __db_new_file when the database is a btree.

The database may be in memory or on disk. For an in-memory database, we create the pages from the cache and mark them dirty (this is requested in the __memp_fget call rather than after writing to the page, otherwise the page could be evicted before we had a chance to mark it). For an on-disk database file, we don't use the cache at this point; instead, we initialize the pages in private memory and write them directly into the database file using __fop_write.

When reading and writing pages directly via __fop_read/__fop_write, we must call the internal common page-in function after getting a page via __fop_read, and the page-out function before writing a page via __fop_write. The __memp_fget/__memp_fput functions call them too, as registered callbacks, via __memp_pg. There are internal page in/out callbacks for the three types of databases (btree, hash, queue). The internal page in/out functions mainly do checksumming and byte-swap the page header, so that database files created on big-endian machines can be opened on little-endian machines. User data is never swapped, so users must make sure the bytes they get back are interpreted correctly.

There is also AM-specific work to do in the internal page in/out functions, so we have a __db_pgin/__db_pgout pair (in db/db_conv.c) that calls AM-specific page in/out functions such as __bam_pgin/__bam_pgout (in btree/bt_conv.c; note the file name convention).
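
To make the header byte swap concrete, here is a minimal sketch of a page conversion that swaps header fields and leaves user data alone. The page layout and all sketch_ names are hypothetical and much simpler than a real page header, and checksumming is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified page: a small header plus user bytes. */
struct sketch_page {
    uint32_t pgno;          /* header: page number */
    uint16_t entries;       /* header: number of entries on the page */
    unsigned char data[8];  /* user data, never byte-swapped */
};

static uint32_t sketch_swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint16_t sketch_swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

/*
 * Page-in/page-out conversion: when the file's byte order differs from the
 * host's, swap every header field. Applying it twice restores the original,
 * so the same routine serves both directions.
 */
void sketch_page_convert(struct sketch_page *p, int needs_swap) {
    assert(p != NULL);
    if (!needs_swap)
        return;
    p->pgno = sketch_swap32(p->pgno);
    p->entries = sketch_swap16(p->entries);
    /* p->data is deliberately left untouched */
}
```

Because the routine never looks inside data, an application that stores multi-byte integers in its keys or values has to pick and apply its own byte order.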

The reason we use __fop_write here is that, at this point, the database is not fully opened: it is not yet registered in the mpool region.

The __memp_fget/__memp_fput functions do not do logging, so changes must be logged before a dirty page is put back into the cache. __fop_write logs the action itself, so there is no need to do that in __bam_new_file.

QUESTION:

1. In this function we didn't lock the meta/root pages but used latches. Why? Don't we want transactional semantics?

2. In general, how do we guarantee transactional semantics when we release the non-transactional meta page locks immediately after use? (This is good for performance, but how do we ensure consistency?) Examples are __bam_read_root, __bam_new_subdb, and __db_new. These locks are not transactional, so why can't they be replaced by latches?

In the general DBMETA meta information, only "last_pgno", "free", "key_count" and "record_count" can be updated; the others are static fields. The AM-specific parts have several more; for btree they are "root", "iv" and "chksum". So if these fields don't require transactional locks, it's OK to release the locks before the transaction ends.

5. __bam_new_subdb

Routed from __db_init_subdb. It initializes the subdatabase's meta and root pages, and keeps the subdatabase's meta page locked for the entire function.

By the time this function is called, the database file is registered in the mpool, so we always use __memp_fget/__memp_fput to read and write pages.

It calls __db_new to get a page.

Other than that, it is quite similar to __bam_new_file.

6. __db_new

__db_new prefers free pages in the db file, and falls back to allocating a new page by extending the db file. __db_new is seldom called because it writes the db file's metadata page, which is expensive and becomes a bottleneck; thus there can be many free pages while we are still extending the db file.

7. __db_free

Free a page and put it into the free list.
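
The allocation policy of __db_new and __db_free (reuse a free page if there is one, otherwise extend the file) can be sketched with an in-memory model. The structure below is hypothetical: it borrows the meta page field ideas "last_pgno" and "free" from these notes, and treats page number 0 as both the meta page and the "no page" marker for the purposes of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PGNO_INVALID 0u  /* page 0 is the meta page, so 0 means "none" here */
#define SKETCH_MAX_PAGES 16u

/* Hypothetical in-memory model of a db file's allocation state. */
struct sketch_file {
    uint32_t last_pgno;                   /* meta: highest page number in the file */
    uint32_t free_head;                   /* meta: first page of the free list */
    uint32_t next_free[SKETCH_MAX_PAGES]; /* per page: next page in the free list */
};

/* Allocate a page: take the head of the free list, else extend the file. */
uint32_t sketch_page_new(struct sketch_file *f) {
    uint32_t pgno;

    if (f->free_head != SKETCH_PGNO_INVALID) {
        pgno = f->free_head;
        f->free_head = f->next_free[pgno];
        f->next_free[pgno] = SKETCH_PGNO_INVALID;
        return pgno;
    }
    assert(f->last_pgno + 1 < SKETCH_MAX_PAGES);
    return ++f->last_pgno; /* both paths modify meta page fields */
}

/* Free a page: push it onto the front of the free list. */
void sketch_page_free(struct sketch_file *f, uint32_t pgno) {
    assert(pgno != SKETCH_PGNO_INVALID && pgno <= f->last_pgno);
    f->next_free[pgno] = f->free_head;
    f->free_head = pgno;
}
```

Both operations change meta page fields, which is why every allocation and free has to go through the single meta page.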

8. __bamc_close

See I.3.

9. __memp_dirty

Mark a page dirty.

10. __bamc_del

Mark the key/data pair with B_DELETE on the page containing it, then mark all cursors sitting on the key/data pair with C_DELETED via __bam_ca_delete. The pair is not deleted yet, nor is the page's entry count decremented; the k/d pair will be deleted by the last cursor sitting on it, if that cursor is closed at this position. I think we should delete it right away if __bam_ca_delete finds that no other cursors are sitting on it.

Whenever we modify a page, we first lock the page via __db_lget, then get the page from the cache via __memp_fget, then optionally mark the page dirty via __memp_dirty, then log the action using the various logging functions, and only then actually modify the page. After that we call __memp_fput to return the page to the mpool, and finally we release the lock on the page.

QUESTION: How are key/data pairs deleted? This function only marks a k/d pair "deleted", but doesn't remove it from the db page. __bamc_close only deletes the k/d pair the cursor sits on when it is closed; other k/d pairs marked deleted by the cursor are not deleted even when the cursor is closed. So when are all of them deleted from the page?
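
The two-phase delete described in this section (mark first, physically remove when the last cursor on the item is closed) can be modeled as below. It is a deliberately tiny sketch with a single k/d pair and a fixed cursor table; all names are hypothetical, with "deleted" standing in for B_DELETE and "c_deleted" for C_DELETED.

```c
#include <assert.h>

#define SKETCH_MAX_CURSORS 4

/* One k/d pair: "deleted" is the mark, "present" means still on the page. */
struct sketch_item { int deleted; int present; };

/* A cursor: whether it is open, sits on the item, and saw it being deleted. */
struct sketch_cursor { int open; int on_item; int c_deleted; };

struct sketch_db {
    struct sketch_item item;
    struct sketch_cursor cursors[SKETCH_MAX_CURSORS];
};

/* Logical delete: mark the item and every open cursor positioned on it. */
void sketch_del(struct sketch_db *db) {
    int i;

    db->item.deleted = 1;
    for (i = 0; i < SKETCH_MAX_CURSORS; i++)
        if (db->cursors[i].open && db->cursors[i].on_item)
            db->cursors[i].c_deleted = 1;
}

/* Close cursor c: the last cursor leaving a deleted item removes it for real. */
void sketch_close(struct sketch_db *db, int c) {
    int i, others = 0;

    assert(c >= 0 && c < SKETCH_MAX_CURSORS);
    db->cursors[c].open = 0;
    for (i = 0; i < SKETCH_MAX_CURSORS; i++)
        if (db->cursors[i].open && db->cursors[i].on_item)
            others++;
    if (db->cursors[c].c_deleted && others == 0)
        db->item.present = 0; /* the physical delete happens only here */
}
```

In this model, an item that is marked deleted while no cursor remains on it is never physically removed, which is the gap the question above is asking about.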

11. __bamc_count

When counting, take B_DELETE items into account: they are not counted.

12. __bamc_physdel & __bam_ditem

Physically delete a key/data pair. Called when the last cursor sitting on the deleted key/data pair is closed.

We call __bam_ditem twice to delete a key/data pair, and the operation is logged in __bam_ditem. After each __bam_ditem call, we call __bam_ca_di to adjust the other cursors of this database. There is no function that deletes a k/d pair from a btree leaf page in one step; I think there should be one.

Internal btree pages have only a single structure that stores the key and page number; these do not come in pairs. In fact, except on btree leaf pages (P_LBTREE), all data items are stored singly.

__bam_ditem alters the btree page's index array according to the type of the btree page and decrements the number of entries on the page. It then calls __db_ditem to remove the item from the page and log the action, or calls __db_doff to delete an overflow item from its overflow page and free the overflow page.

When the last key/data pair is deleted from a btree leaf page, the page itself, and potentially the stack of pages leading from the root node to this leaf node, need to be deleted as well. So we record the last key K by calling __db_ret to get the k/d pair, then delete this last k/d pair by calling __bam_ditem twice, each call followed by __bam_ca_di to adjust cursors. Then we search for K from the root. When the search completes, dbc->dbc_internal holds the stack (path) of nodes leading to this leaf, and we may need to delete several nodes in the stack: imagine that the parent page P1 of leaf page P2 also has only one item; when we delete P2 we also delete P1's last item, and thus P1, and so on.
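
The cascade in the P1/P2 example amounts to walking up the search stack while each parent has exactly one item. A minimal sketch follows; the names are hypothetical, and it assumes for simplicity that the root page itself is never removed.

```c
#include <assert.h>
#include <stddef.h>

/* One frame of a hypothetical search stack, from the root down to the leaf. */
struct sketch_frame {
    unsigned int pgno;    /* page number */
    unsigned int entries; /* items currently on that page */
};

/*
 * stack[0] is the root and stack[depth - 1] is the leaf that just lost its
 * last item. Returns the index of the topmost page to delete: every page
 * from that index down to the leaf goes away, and the page just above it
 * loses one item. Returns depth when nothing is deleted (the leaf is the
 * root).
 */
size_t sketch_first_to_delete(const struct sketch_frame *stack, size_t depth) {
    size_t i;

    assert(depth >= 1);
    if (depth == 1)
        return depth;
    i = depth - 1; /* the emptied leaf always goes */
    /* A parent holding a single item becomes empty too, so it goes as well. */
    while (i > 1 && stack[i - 1].entries == 1)
        i--;
    return i;
}
```

For a stack of root (3 items), P1 (1 item) and the emptied leaf P2, the function returns 1: P1 and P2 are deleted and the root loses one item.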

13. __bam_stkrel

Release the pages in the cursor's search stack: put each page back into the mpool and optionally unlock it.

14. __bamc_get

The effective part of DBC->get.

Depending on the flags, it dispatches to __bamc_prev, __bamc_next or __bamc_search, or simply gets the page. The implementation of the _DUP flags is quite straightforward: it simply compares adjacent keys. Similarly, for the _NODUP flags it simply iterates over the k/d pairs with identical keys until it reaches a different key.

15. __bamc_prev, __bamc_next

Get from the next/previous page, or from the current page, and update pgno and indx in DBC->dbc_internal. Note that in these two functions we may be on an off-page duplicate (opd) page or on an ordinary btree leaf page.

These two functions, plus __bamc_search, only read data; they don't actually modify the page. So by default, when we need to get another page we read-lock it; if DB_RMW is set, we write-lock it instead.

The two functions skip empty pages, as well as deleted k/d pairs or key items in btree internal pages.
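
The skipping behavior of the forward direction can be sketched over a chain of leaf pages. This is a simplified model with hypothetical names: each slot stands for a whole k/d pair, pages are linked by array index, and locking is left out entirely.

```c
#include <assert.h>

#define SKETCH_PAGE_SLOTS 4
#define SKETCH_NOTFOUND (-1)

/* Hypothetical leaf page: a few keys, per-item delete marks, a next link. */
struct sketch_leaf {
    int nentries;
    int key[SKETCH_PAGE_SLOTS];
    int deleted[SKETCH_PAGE_SLOTS];
    int next_pg; /* index of the next leaf, -1 at the end of the chain */
};

/* Cursor position: which leaf, and which slot on it. */
struct sketch_pos { int pg; int indx; };

/*
 * Advance to the next live item, skipping items marked deleted and pages
 * with no entries. Returns 0 and updates *pos on success; returns
 * SKETCH_NOTFOUND and leaves *pos unchanged at the end of the chain.
 */
int sketch_next(const struct sketch_leaf *leaves, struct sketch_pos *pos) {
    int pg = pos->pg;
    int indx = pos->indx + 1;

    while (pg != -1) {
        const struct sketch_leaf *lp = &leaves[pg];
        for (; indx < lp->nentries; indx++) {
            if (!lp->deleted[indx]) {
                pos->pg = pg;
                pos->indx = indx;
                return 0;
            }
        }
        pg = lp->next_pg; /* current page exhausted, move to the next one */
        indx = 0;
    }
    return SKETCH_NOTFOUND;
}
```

The backward direction would mirror this, starting from the last slot of the previous page.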

QUESTION: Strangely enough, a k/d pair marked deleted is not physically deleted even when the cursor moves away from it. So when is it deleted?