Cache Buffers Chains and Latch Spelunking
2009-10-09 22:59
441 查看
from:http://www.ardentperf.com/2007/09/13/cache-buffers-chains-and-latch-spelunking/
Last night I posted a case study where I used the AWR (a blessed new feature) to investigate “gc buffer busy” wait events in a RAC environment. I concluded the write-up by theorizing that the single freelist was pointing all nodes of the cluster to the same small group of blocks for inserts and thereby causing the blocks on the freelist to always be subject to unearthly contention across the cluster.
One common piece of advice for gc buffer busy waits is to treat them like regular buffer busy waits. Because essentially that’s what they are – a buffer busy wait on a remote instance. So another avenue of investigation is to look at what might be causing buffer busy waits across the cluster.
Some people may remember that back in the days before YAPP and the wait interface, latches were usually where the purported “experts” looked when you had performance problems. Particularly those two infamous latches cache buffers chains and library cache. And of course today these are still an important part of any in-depth investigation and V$LATCH even includes wait time so you can take a time-based approach to analysis. I spent some time yesterday having a look at the latching in this RAC system and it yielded some results that I thought might be interesting to post. So here goes…
![](http://www.ardentperf.com/wp-content/uploads/2007/09/latches.GIF)
Now it was apparent to me pretty quickly that the cache buffers chains latch was the busiest on the system. And of course it’s normal for there to be some contention on this latch. You’ll notice from the screenshot the enormous difference in gets, spin_gets and sleeps. So that gave me a bit of a head start – but you could certainly get the same information from the AWR (that blessed new feature) as well. In fact you can do all of this from the AWR – although look out since latch information is not gathered by default!
So these guys weren’t gathering latch info – and so I couldn’t use the AWR. Instead I just setup my own temporary table to hold the statistics.
At this point I worked on something else for awhile. After an hour or so I took a second snapshot.
I guess I should briefly explain why I grabbed that information. The cache buffers chains latch is actually made up of a large number of child latches. When Oracle needs to access the buffer cache it hashes some of the block’s information to discover which child it needs to use. That way each child latch only has a short list of blocks that it is managing. What I aim to do is find out which child latches are the busiest and then see which segments have blocks protected by those latches. Not only that but we’ll be able to see which blocks are the busiest (in real time).
So the next step is to find the child latches suffering from the most contention. In order to do this we’ll look for latches that frequently cause processes to sleep (relinquish the processor) while waiting.
903892 214 110918 3 SCUR
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 2 CR 279622172
JSCHDER.BIGTABLE_ORDERS_MF_AND_PARTS [INDEX] 3136 994 121460 1 SCUR
JSCHDER.BIGTABLE.P_PARTS_APPROVED [TABLE PARTITION] 3064309 1156 33703 1 SCUR
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 CR 279619002
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 CR 279619202
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 CR 279621063
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 CR 279618998
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 PI
JSCHDER.BIGTABLE_LOG.P_2007_09 [TABLE PARTITION] 3208618 1209 16393 1 PI
JSCHDER.BIGTABL_LG_X_CHANGE_DATE.P_2007_09 [INDEX PARTI 3208620 1542 4537 1 SCUR
TION]
JSCHDER.PERSISTENT_QUERY 1455110 10 99668 0 SCUR
JSCHDER.GENERATED_REPORT 1455107 13 22427 0 SCUR
JSCHDER.SPOT_ACTIVITY_URL_DETAIL 2684382 287 51469 0 SCUR
JSCHDER.GENERATED_REPORT 1455107 476 103892 0 SCUR
JSCHDER.BIGTABLE_TOTALS 1746552 524 187199 0 SCUR
JSCHDER.PERSISTENT_QUERY 1455110 1476 21915 0 SCUR
17 rows selected.
Last night I posted a case study where I used the AWR (a blessed new feature) to investigate “gc buffer busy” wait events in a RAC environment. I concluded the write-up by theorizing that the single freelist was pointing all nodes of the cluster to the same small group of blocks for inserts and thereby causing the blocks on the freelist to always be subject to unearthly contention across the cluster.
One common piece of advice for gc buffer busy waits is to treat them like regular buffer busy waits. Because essentially that’s what they are – a buffer busy wait on a remote instance. So another avenue of investigation is to look at what might be causing buffer busy waits across the cluster.
Some people may remember that back in the days before YAPP and the wait interface, latches were usually where the purported “experts” looked when you had performance problems. Particularly those two infamous latches cache buffers chains and library cache. And of course today these are still an important part of any in-depth investigation and V$LATCH even includes wait time so you can take a time-based approach to analysis. I spent some time yesterday having a look at the latching in this RAC system and it yielded some results that I thought might be interesting to post. So here goes…
Cache Buffers Chains Latch
Now it was apparent to me pretty quickly that the cache buffers chains latch was the busiest on the system. And of course it’s normal for there to be some contention on this latch. You’ll notice from the screenshot the enormous difference in gets, spin_gets and sleeps. So that gave me a bit of a head start – but you could certainly get the same information from the AWR (that blessed new feature) as well. In fact you can do all of this from the AWR – although look out since latch information is not gathered by default!
So these guys weren’t gathering latch info – and so I couldn’t use the AWR. Instead I just setup my own temporary table to hold the statistics.
create table jschneider.waits as select 1 snap, systimestamp timestamp, inst_id , CHILD# , ADDR , GETS , MISSES , SLEEPS from gv$latch_children where name = 'cache buffers chains'; Table created.
At this point I worked on something else for awhile. After an hour or so I took a second snapshot.
insert into jschneider.waits select 2, systimestamp, inst_id , CHILD# , ADDR , GETS , MISSES , SLEEPS from gv$latch_children where name = 'cache buffers chains'; 393216 rows created. commit; Commit complete.
I guess I should briefly explain why I grabbed that information. The cache buffers chains latch is actually made up of a large number of child latches. When Oracle needs to access the buffer cache it hashes some of the block’s information to discover which child it needs to use. That way each child latch only has a short list of blocks that it is managing. What I aim to do is find out which child latches are the busiest and then see which segments have blocks protected by those latches. Not only that but we’ll be able to see which blocks are the busiest (in real time).
So the next step is to find the child latches suffering from the most contention. In order to do this we’ll look for latches that frequently cause processes to sleep (relinquish the processor) while waiting.
with subq as ( select t2.inst_id i, t2.child#, t2.addr, t2.gets-t1.gets gets, t2.misses-t1.misses misses, t2.sleeps-t1.sleeps sleeps from jschneider.waits6 t1, jschneider.waits6 t2 where t1.child#=t2.child# and t1.inst_id=t2.inst_id and t1.snap=1 and t2.snap=2 order by sleeps desc ) select * from subq where rownum < 40; I CHILD# ADDR GETS MISSES SLEEPS ---------- ---------- ---------------- ---------- ---------- ---------- 3 32664 00000005F9EA6780 349526 8642 168 2 32664 00000005F9EA6780 352351 9850 167 3 21135 00000005F7D9FC80 109896 1284 146 3 39340 00000005FBA3AD10 58127 1100 142 5 13536 00000005A8F41F40 2305516 63418 126 3 57464 0000000608DD4200 45361 909 112 3 59297 000000060799DF68 31056 690 94 3 14549 00000005F9C30B08 46444 624 90 2 53157 00000006078C8808 4727 1129 83 3 27008 00000005F6E27820 64685 1148 82 3 62193 00000005FAD17808 28596 550 78 3 24183 00000005F6DC54B8 32420 505 77 3 35290 00000005F7F8BE38 65142 1115 75 3 46376 0000000606C9CA30 29474 653 74 3 8428 00000005F6BA1900 67146 303 73 3 15740 00000005F7CE4388 94209 338 72 3 30375 00000005F7EE1040 4962 663 69 2 57464 0000000608DD4200 59033 887 68 3 21180 00000005F7DA1588 65043 743 68 3 38233 00000005F6FADC08 31475 516 66 3 25742 00000005F8DE1520 57828 928 65 3 46198 00000005FAAEB6D0 61915 751 65 3 6648 00000005F6B63AE0 32022 462 62 5 18908 00000005A2E51A30 1480518 39246 62 3 59340 00000005FBCF2210 77557 307 61 6 63812 00000005A0CD70E8 1071392 63118 61 2 2526 00000005F5B16B90 339617 303 60 1 32664 00000005F9EA6780 234869 5319 57 2 14849 00000005F9C3B1E8 30541 248 57 3 11539 00000005F9BC80B8 141915 220 57 2 16944 00000005F9C83F40 3931 961 56 4 18475 00000005F7D434E0 87239 477 56 3 33009 00000005F9EB2768 66883 793 56 2 59438 00000005FACB7B90 27733 742 52 2 63600 00000005FBD863B0 10316 188 51 3 44988 00000005FAAC15C0 28087 208 51 2 52350 00000005FBBFF1E0 570855 3252 50 3 1388 00000005F6AACD00 203545 163 50 3 8972 00000005F8B9A4D0 28815 205 49
Exploring the Buffer Cache
Notice that I’ve also queried all latches across the entire cluster. So this is giving me a system-wide picture of cache buffer chains latches. Interestingly, the top child is the same one on two different instances (2 and 3). So next let’s login locally to instance 3 and see what that latch child is protecting! Also, I’m going to grab the current SCN – but only the base – from dbms_flashback.select mod(dbms_flashback.get_system_change_number,power(2,32)) cur_scn_base from dual; CUR_SCN_BASE ------------ 279813816 col object format a55 col state format a5 select /*+rule*/ * from ( select o.owner||'.'||o.object_name||decode(o.subobject_name,NULL,'','.')|| o.subobject_name||' ['||o.object_type||']' object, sq.* from ( select x.obj, x.file#,x.dbablk, x.tch, decode(x.state,0,'FREE',1,'XCUR',2,'SCUR',3,'CR',4,'READ', 5,'MREC',6,'IREC',7,'WRITE',8,'PI',9,'MEMORY',10,'MWRITE', 11,'DONATED',x.state) state, decode(x.state,3,cr_scn_bas,NULL) scn_bas from sys.v$latch_children l, sys.x$bh x where x.hladdr = l.addr and x.obj < power(2,22) and x.hladdr = '00000005F9EA6780' ) sq, dba_objects o where o.data_object_id=sq.obj order by sq.tch desc, file#, dbablk, scn_bas ) where rownum<40; OBJECT OBJ FILE# DBABLK TCH STATE SCN_BAS ------------------------------------------------------- ---------- ---------- ---------- ---------- ----- ---------- JSCHDER.SPOT_ACTIVITY