ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [
2012-02-21 16:19
author:skate
time:2012/02/22
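At midnight last night I received a monitoring SMS (dataguard is down), so I logged in to the system to find out why.
I first looked at the standby database's alert log. The last half hour of entries all looked like this: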
........
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:03 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:05 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Feb 21 00:02:06 2012
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
.......
Scrolling further back in the alert log:
.....
Mon Feb 20 09:35:59 2012
Archived Log entry 10127 added for thread 1 sequence 11099 ID 0x263e89b dest 1:
Mon Feb 20 10:01:06 2012
RFS[6]: Selected log 13 for thread 1 sequence 11101 dbid 40093083 branch 760555291
Mon Feb 20 10:01:06 2012
Media Recovery Waiting for thread 1 sequence 11101 (in transit)
Recovery of Online Redo Log: Thread 1 Group 13 Seq 11101 Reading mem 0
Mem# 0: /oracle/oradata/skate01/standbyredo13.log
Mon Feb 20 10:01:14 2012
Archived Log entry 10128 added for thread 1 sequence 11100 ID 0x263e89b dest 1:
Mon Feb 20 10:03:58 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc (incident=264961):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:27 2012
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_1590.trc (incident=264121):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264121/skate01_mmon_1590_i264121.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:29 2012
Restarting dead background process MMON
Mon Feb 20 10:04:29 2012
MMON started with pid=15, OS id=17808
Mon Feb 20 10:04:29 2012
Dumping diagnostic data in directory=[cdmp_20120220100429], requested by (instance=1, osid=1590 (MMON)), summary=[incident=264121].
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc (incident=264122):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264122/skate01_mmon_17808_i264122.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_mmon_17808.trc (incident=264123):
ora-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264123/skate01_mmon_17808_i264123.trc
Dumping diagnostic data in directory=[cdmp_20120220100432], requested by (instance=1, osid=17808 (MMON)), summary=[incident=264122].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Feb 20 10:04:52 2012
.......
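When the alert log is long, the first ORA-00600 can be located without paging backwards by hand. The following is only a sketch: it assumes the 11g text alert log layout shown in the excerpt above, where each entry is preceded by a timestamp line, and the function name first_ora600 is made up for illustration.

```shell
#!/bin/sh
# Print the timestamp line that precedes the first ORA-00600 entry.
# Assumption: entries are introduced by lines such as "Mon Feb 20 10:03:58 2012".
first_ora600() {
    awk '
        # Remember the most recent timestamp line.
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] +[0-9]+ [0-9:]+ [0-9]+$/ { ts = $0; next }
        # Compare in upper case because this alert log prints "ora-00600" in lower case.
        toupper($0) ~ /ORA-00600/ { print ts; exit }
    ' "$@"
}

# Usage (the path is a placeholder, not taken from this system):
#   first_ora600 /path/to/alert_SID.log
```

The function reads standard input when no file is given, so it can also sit at the end of a pipeline.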
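The ORA-600 errors had already started at "Mon Feb 20 10:03:58 2012". First, check what state the database is in now and whether it is healthy.
1. $ ps -ef | grep $ORACLE_SID // check that the Oracle processes are running
2. $ netstat -an | grep 1588 | wc -l // check whether Oracle has any connections
3. Check the OS state: vmstat, top, iostat
None of these checks showed anything abnormal. I remembered that a project had been migrated onto this active standby on the 20th, which might be related, so I tried to log in to the database to investigate further. The login failed with the following error: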
[root@skate01 ~]# su - oracle
[oracle@skate01 ~]$ sqlplus "/as sysdba"
SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:26:12 2012
Copyright (c) 1982, 2010, Oracle. All rights reserved.
ERROR:
ORA-01075: you are currently logged on
Enter user-name:
ERROR:
ORA-01017: invalid username/password; logon denied
SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus
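Two attempts gave the same error, so logging in was impossible. The database service had effectively gone down and the only option was to restart it. ORA-01075 is usually caused by a full disk or by auditing, but neither applied in my environment, so I killed the processes at the OS level with the following two commands:
1. $ ps -ef |grep $ORACLE_SID|grep -v grep|awk '{print $2}' | xargs kill -9 // kill the processes
2. $ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm // remove Oracle's shared memory segments
First, list the processes to be killed: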
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi
Kill the processes:
[oracle@skate01 ~]$ ps -ef |grep $ORACLE_SID|grep -v grep |grep -v avahi |awk '{print $2}' | xargs kill -9
Killing only the Oracle processes is not enough; you still cannot log in to Oracle.
List the shared memory segments to be removed:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'
Remove the shared memory segments:
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}' | xargs ipcrm shm
resource(s) deleted
[oracle@skate01 ~]$
[oracle@skate01 ~]$ ipcs -m | grep oracle | awk '{print $2}'
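A note on the selection step: grep oracle matches any ipcs line that contains the string anywhere. A slightly stricter variant compares the owner column instead. This is a sketch, assuming the usual Linux ipcs -m column order (key, shmid, owner, perms, bytes, nattch, status); oracle_shm_ids is a hypothetical helper name.

```shell
#!/bin/sh
# Print the shmid of every shared memory segment owned by the given user.
# $1: owner to match (defaults to "oracle"); reads `ipcs -m` output on stdin.
oracle_shm_ids() {
    awk -v owner="${1:-oracle}" '
        # Column 3 is the owner, column 2 the numeric shmid; header lines are skipped
        # because their second field is not a number.
        $3 == owner && $2 ~ /^[0-9]+$/ { print $2 }
    '
}

# Usage, review the list first and remove only afterwards:
#   ipcs -m | oracle_shm_ids
#   ipcs -m | oracle_shm_ids | xargs -n 1 ipcrm -m
# With GNU xargs, adding -r skips the ipcrm call when the list is empty.
```

It is worth checking the nattch column before removing anything: after the kill it should be 0 for the instance's segments.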
Try logging in to Oracle:
[oracle@skate01 ~]$ sqlplus "/as sysdba"
SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 21 00:47:36 2012
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup nomount;
ORACLE instance started.
Total System Global Area 3.5275E+10 bytes
Fixed Size 2233656 bytes
Variable Size 3623881416 bytes
Database Buffers 3.1541E+10 bytes
Redo Buffers 108003328 bytes
SQL> alter database mount standby database;
Database altered.
SQL> alter database open read only;
Database altered.
SQL> alter database recover managed standby database disconnect using current logfile;
Database altered.
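I then checked the alert log for anything abnormal and everything looked fine. After confirming that the OS layer was healthy, I logged in to the database again to check that Data Guard was healthy.
1. Time lag between the standby and the primary (run on the standby):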
select 'Last applied : ' Logs,
to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
from v$archived_log
where sequence# =
(select max(sequence#) from v$archived_log where applied = 'YES')
union
select 'Last received : ' Logs,
to_char(next_time, 'DD-MON-YY:HH24:MI:SS') Time
from v$archived_log
where sequence# = (select max(sequence#) from v$archived_log);
2. Check the activity of the processes (run on the standby):
select process, status, thread#, sequence#, block#, blocks
from v$managed_standby;
3. Check the log recovery speed:
select * from v$dataguard_status;
select * from v$recovery_progress;
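The second check above can be scripted so that monitoring notices a stopped apply process. The sketch below assumes the rows of the v$managed_standby query have been spooled as whitespace-separated text with the process name in the first column and the status in the second (for example from SQL*Plus with headings off). mrp_state and the NOT_RUNNING label are my own names, not Oracle's.

```shell
#!/bin/sh
# Print the status of the managed recovery process (MRP), or NOT_RUNNING
# if no MRP row is present. Reads the spooled query rows on stdin.
mrp_state() {
    awk '
        $1 ~ /^MRP/ { print $2; found = 1; exit }
        END { if (!found) print "NOT_RUNNING" }
    '
}

# Example with hand-written sample rows:
#   printf '%s\n' 'RFS IDLE 1 11101 2048 1' 'MRP0 APPLYING_LOG 1 11101 2047 204800' | mrp_state
```

A monitoring job could alert when the result is NOT_RUNNING, which is the situation the "dataguard is down" message pointed to here.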
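Having confirmed that the database was healthy again, I went back to find out why it had gone down and why it reported ORA-600.
Looking at the trace file: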
[root@skate01 ~]# more /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Dump file /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264961/skate01_ora_17783_i264961.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/product/11.2.0/db_1
System name: Linux
Node name: skate01
Release: 2.6.18-194.el5
Version: #1 SMP Fri Apr 2 14:58:14 EDT 2010
Machine: x86_64
Instance name: skate01
Redo thread mounted by this instance: 1
Oracle process number: 120
Unix process pid: 17783, image: oracle@skate01
*** 2012-02-20 10:03:58.215
*** SESSION ID:(17.5) 2012-02-20 10:03:58.215
*** CLIENT ID:() 2012-02-20 10:03:58.215
*** SERVICE NAME:(SYS$USERS) 2012-02-20 10:03:58.215
*** MODULE NAME:(JDBC Thin Client) 2012-02-20 10:03:58.215
*** ACTION NAME:() 2012-02-20 10:03:58.215
Dump continued from file: /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_17783.trc
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x838000020], [], [], [], [], [], [], [], [], [], []
========= Dump for incident 264961 (ORA 600 [KGHLKREM1]) ========
----- Beginning of Customized Incident Dump(s) -----
***** Internal heap ERROR KGHLKREM1 addr=0x838000020 ds=0x60001188 *****
***** Dump of memory around addr 0x838000020:
837FFF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times
Recovery state: ds=0x60001188 rtn=(nil) *rtn=(nil) szo=0 u4o=0 hdo=0 off=0
Szo:
UB4o:
Hdo:
Off:
Hla: 0
******************************************************
HEAP DUMP heap name="sga heap" desc=0x60001188
extent sz=0x9800 alt=248 het=32767 rec=9 flg=-126 opc=4
parent=(nil) owner=(nil) nex=(nil) xsz=0x0 heap=(nil)
fl2=0x60, nex=(nil)
ds for latch 1: 0x600551d8 0x60056a30 0x60058288 0x60059ae0
ds for latch 2: 0x6005eaa0 0x600602f8 0x60061b50 0x600633a8
reserved granule count 12 (granule size 134217728)
----- End of Customized Incident Dump(s) -----
*** 2012-02-20 10:03:58.341
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=40p7rprfbt1as) -----
select 'a' from dual
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36 call kgdsdst() 000000000 ? 000000000 ?
7FFF0B5CCD88 ? 000000001 ?
000000001 ? 000000002 ?
ksedst1()+98 call skdstdst() 000000000 ? 000000000 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksedst()+34 call ksedst1() 000000000 ? 000000001 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbkedDefDump()+2741 call ksedst() 000000000 ? 000000001 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksedmp()+36 call dbkedDefDump() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
ksfdmp()+64 call ksedmp() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbgexPhaseII()+1764 call ksfdmp() 000000003 ? 000000002 ?
7FFF0B5CCD88 ? 000000001 ?
000000000 ? 000000002 ?
dbgexExplicitEndInc call dbgexPhaseII() 2B1892AF1710 ? 2B1892EA06A8 ?
()+750 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
dbgeEndDDEInvocatio call dbgexExplicitEndInc 2B1892AF1710 ? 2B1892EA06A8 ?
nImpl()+767 () 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
dbgeEndDDEInvocatio call dbgeEndDDEInvocatio 2B1892AF1710 ? 2B1892EA06A8 ?
n()+47 nImpl() 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
kghnerror()+394 call dbgeEndDDEInvocatio 2B1892AF1710 ? 2B1892EA06A8 ?
n() 7FFF0B5D88C0 ? 000000001 ?
000000000 ? 000000002 ?
kghadd_reserved_ext call kghnerror() 00B7CCEA0 ? 060001188 ?
ent()+945 00A0EF0C0 ? 838000020 ?
100000000 ? 000000002 ?
kghget_reserved_ext call kghadd_reserved_ext 00B7CCEA0 ? 060001188 ?
ent()+526 ent() 060059AE0 ? 060059B28 ?
000000000 ? 000000000 ?
kghgex()+1455 call kghget_reserved_ext 00B7CCEA0 ? 060004CD8 ?
ent() 060059AE0 ? 060059B28 ?
000000000 ? 000000000 ?
kghfnd()+734 call kghgex() 00B7CCEA0 ? 060004CD8 ?
060059AE0 ? 000001058 ?
000000000 ? 000000000 ?
kghalo()+536 call kghfnd() 00B7CCEA0 ? 000000000 ?
060004CD8 ? 000000000 ?
060059AE0 ? 7FFF0B5C95C0 ?
kghgex()+437 call kghalo() 00B7CCEA0 ? 060059AE0 ?
000001000 ? 000001000 ?
060059AE0 ? 060004CD8 ?
kghalf()+395 call kghgex() 00B7CCEA0 ? 000000000 ?
85DD58D08 ? 000000FD0 ?
060059AE0 ? 060004CD8 ?
kksLoadChild()+2785 call kghalf() 00B7CCEA0 ? 85DD58D08 ?
000000001 ? 060004CD8 ?
000000000 ? 009A98F10 ?
kxsGetRuntimeLock() call kksLoadChild() 00B7CCEA0 ? 88FD354E0 ?
+2061 7FFF0B5DB5B0 ? 2B1892F59070 ?
85DD586F8 ? 000000000 ?
kksfbc()+14522 call kxsGetRuntimeLock() 00B7CCEA0 ? 2B1892F59070 ?
7FFF0B5DB5B0 ? 2B1892F59070 ?
85DD586F8 ? 88FD354E0 ?
kkspsc0()+2020 call kksfbc() 2B1892F59070 ? 000000003 ?
000000108 ? 7FFF0B5DD6F8 ?
000000015 ? 000000000 ?
kksParseCursor()+13 call kkspsc0() 2B1892F41BB8 ? 7FFF0B5DD6F8 ?
9 000000015 ? 000000003 ?
000000006 ? 0000000A4 ?
opiosq0()+2022 call kksParseCursor() 7FFF0B5DC0D0 ? 7FFF0B5DD6F8 ?
000000015 ? 000000003 ?
000000006 ? 0000000A4 ?
kpooprx()+269 call opiosq0() 000000003 ? 00000000E ?
7FFF0B5DC2A0 ? 0000000A4 ?
000000000 ? 7FFF0B5DBFB0 ?
kpoal8()+795 call kpooprx() 7FFF0B5DF694 ? 7FFF0B5DD6F8 ?
000000014 ? 000000001 ?
000000000 ? 7FFF0B5DBFB0 ?
opiodr()+910 call kpoal8() 00000005E ? 00000001C ?
7FFF0B5DF690 ? 000000001 ?
000000000 ? 000000001 ?
ttcpip()+2289 call opiodr() 00000005E ? 00000001C ?
7FFF0B5DF690 ? 000000000 ?
0098A1530 ? 000000001 ?
opitsk()+1665 call ttcpip() 00B7E2B10 ? 00923BB90 ?
7FFF0B5DF690 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiino()+961 call opitsk() 00B7E2B10 ? 000000001 ?
7FFF0B5DF690 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opiodr()+910 call opiino() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
7FFF0B5DF0F0 ? 7FFF0B5DF888 ?
opidrv()+565 call opiodr() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
sou2o()+98 call opidrv() 00000003C ? 000000004 ?
7FFF0B5E0E18 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
opimai_real()+128 call sou2o() 7FFF0B5E0DF0 ? 00000003C ?
000000004 ? 7FFF0B5E0E18 ?
0098A0FE0 ? 7FFF0B5DF888 ?
ssthrdmain()+252 call opimai_real() 000000002 ? 7FFF0B5E0FE0 ?
000000004 ? 7FFF0B5E0E18 ?
0098A0FE0 ? 7FFF0B5DF888 ?
main()+196 call ssthrdmain() 000000002 ? 7FFF0B5E0FE0 ?
000000001 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
__libc_start_main() call main() 000000002 ? 7FFF0B5E1188 ?
+244 000000001 ? 000000000 ?
0098A0FE0 ? 7FFF0B5DF888 ?
_start()+36 call __libc_start_main() 000A07368 ? 000000002 ?
7FFF0B5E1178 ? 000000000 ?
0098A0FE0 ? 000000002 ?
--------------------- Binary Stack Dump ---------------------
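The numbers in the "Internal heap ERROR" section of the trace can be checked with a little arithmetic. The values below are copied from the trace above; the only assumption is that each dump line covers 16 bytes (four 32-bit words), which is how the printed line reads.

```shell
#!/bin/sh
addr=$((0x838000020))        # failing address from the ORA-00600 arguments
dump_start=$((0x837FFF020))  # first address printed in the memory dump
granule=134217728            # granule size reported in the heap dump (128 MB)
lines=512                    # one printed line plus "Repeat 511 times"
bytes_per_line=16            # assumption: four 32-bit words per dump line

dump_len=$((lines * bytes_per_line))
printf 'dump covers 0x%X .. 0x%X (%d bytes)\n' \
    "$dump_start" "$((dump_start + dump_len))" "$dump_len"
printf 'bytes dumped before addr: %d\n' "$((addr - dump_start))"
printf 'offset of addr from a 128 MB boundary: 0x%X\n' "$((addr % granule))"
```

The dump therefore spans 8192 bytes centred on the failing address, all of them zero, and the address lies 0x20 bytes past a 128 MB boundary. Reading that boundary as the start of an SGA granule is my inference from the kghadd_reserved_extent frame in the call stack, not something the trace states.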
Going even further back in the alert log, I found that ORA-07445 had been reported as well:
Tue Jan 17 08:42:12 2012
Archived Log entry 7472 added for thread 1 sequence 8444 ID 0x263e89b dest 1:
Tue Jan 17 09:00:14 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x8] [PC:0xB0997A, ksmdscan_internal()+82] [flags: 0x0, count: 1]
Errors in file /oracle/app/diag/rdbms/skate01/skate01/trace/skate01_ora_25574.trc (incident=264155):
ora-07445: exception encountered: core dump [ksmdscan_internal()+82] [SIGSEGV] [ADDR:0x8] [PC:0xB0997A] [Address not mapped to object] []
Incident details in: /oracle/app/diag/rdbms/skate01/skate01/incident/incdir_264155/skate01_ora_25574_i264155.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Jan 17 09:00:21 2012
Dumping diagnostic data in directory=[cdmp_20120117090021], requested by (instance=1, osid=25574), summary=[incident=264155].
Tue Jan 17 09:00:22 2012
Sweep [inc][264155]: completed
Sweep [inc2][264155]: completed
Tue Jan 17 09:06:08 2012
Media Recovery Waiting for thread 1 sequence 8446
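I then looked up Oracle document "ID 1070812.1" and found that the problem is related to my having enabled hugepages.
When vm.drop_caches is set to a value greater than 0 on a system with hugepages enabled, the two conflict: drop_caches asks the kernel to free memory, while hugepages keep their memory pinned. The note attributes the resulting SGA corruption to a Linux kernel bug.
Reference: /article/2349643.html
Solution
1. If hugepages are enabled, set vm.drop_caches=0: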
[root@localhost ~]# more /proc/sys/vm/drop_caches
3
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 3
[root@localhost ~]# vi /etc/sysctl.conf
##skate add
vm.drop_caches=0
Apply it immediately:
[root@localhost ~]# sysctl -p
Verify that it took effect:
[root@localhost ~]# sysctl -a | grep drop_caches
vm.drop_caches = 0
or
2. Upgrade the Linux kernel to version 2.6.18-194.0.0.0.4.EL5
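The condition behind workaround 1 (hugepages configured and drop_caches non-zero) can be checked on other hosts with a few lines of shell. This is a sketch with made-up function names; it only tests the two conditions named above and says nothing about whether the kernel already contains the fix.

```shell
#!/bin/sh
# Print the HugePages_Total value from /proc/meminfo-style text on stdin (0 if absent).
hugepages_total() {
    awk '$1 == "HugePages_Total:" { print $2; found = 1 } END { if (!found) print 0 }'
}

# $1: HugePages_Total, $2: current vm.drop_caches value.
drop_caches_risk() {
    if [ "$1" -gt 0 ] && [ "$2" -gt 0 ]; then
        echo "at risk"
    else
        echo "ok"
    fi
}

# On a live host (standard procfs paths):
#   drop_caches_risk "$(hugepages_total < /proc/meminfo)" "$(cat /proc/sys/vm/drop_caches)"
```

It is also worth searching cron jobs and maintenance scripts for writes to /proc/sys/vm/drop_caches, since the sysctl value only shows the last value written.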
The official document is attached below:
ORA-600 [KGHLKREM1] On Linux Using Parameter drop_cache On hugepages Configuration [ID 1070812.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later [Release: 10.2 and later]
Generic Linux
Symptoms
You are running an Oracle Database, single-instance or RAC. You have the SGA backed by hugepages.
You are getting the error
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x06BC00020]
with stack trace similar to:
kghnerror kghadd_reserved_ext kghgex
or also
ORA-07445: exception encountered: core dump [kglhdal()+1105][SIGSEGV] [Address not mapped to object] [0x000000008] [] []
ORA-07445: exception encountered: core dump [kghfnd()+2328] [SIGSEGV] [Address not mapped to object] [0xFFFFFFFFFFFFFFF0] [] []
and the SGA heap Dump of memory around the offending addr (in this particular example: 0x6bc00020)
it's showing zeroed out :
asm1_lmd0_8600.trc
~~~~~~~~~~~~~~~~~~
*** 2010-02-08 15:57:38.274
***** Internal heap ERROR KGHLKREM1 addr=0x6c400020 ds=0x60000058 *****
***** Dump of memory around addr 0x6c400020:
06C3FF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times
Changes
1. On your system you are running with vm.drop_caches=1 (or 3), drop_cache have been set to a value greater than zero, or you are executing
echo 3 > /proc/sys/vm/drop_caches
/proc/sys/vm/drop_caches (since Linux 2.6.16)
Writing to this file causes the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
* echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
* echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
* echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed.
2. You have setup the Hugepages
Cause
This is a Linux Kernel issue.
Using the linux kernel "drop_cache" parameter and having the hugepages a memory corruption can occurs.
Per internal Bug 9461825, executing vm.drop_caches corrupts Oracle Database SGA hugepages;
it is fixed in Linux Kernel version 2.6.18-194.0.0.0.4.EL5
Solution
1. As a workaround when hugepages are set avoid any vm.drop_cache settings.
OR
2. Upgrade to Linux Kernel version 2.6.18-194.0.0.0.4.EL5
----------end-----------