
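RMAN backup fails with ORA-19501: locating the problem

2013-08-05 20:56
One of our databases reported ORA-19501 during a backup. Below is a brief walk-through of my analysis.

Environment: Linux + Oracle 10.1.0.4.2

The error output was as follows: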

RMAN> run {
2> backup database format '/XXX/flash_recovery_area/prod/backupset/%U.dbf';
3> }
Starting backup at 01-AUG-13
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=296 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00046 name=/XXX/oradata/prod/datafile/o1_mf_esbigtbl_1q6k0sp9_.dbf
input datafile fno=00002 name=/XXX/oradata/prod/datafile/o1_mf_undotbs1_1q6jqcko_.dbf
input datafile fno=00063 name=/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf
......
......
channel ORA_DISK_1: starting piece 1 at 01-AUG-13
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 08/01/2013 15:48:46
ORA-19501: read error on file "/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf", blockno 15233 (blocksize=8192)
ORA-27072: File I/O error
Linux Error: 2: No such file or directory
Additional information: 15232
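
The three values that drive the rest of the analysis are the file name, the block number and the block size in the ORA-19501 line. A minimal shell sketch for pulling them out of a saved RMAN log follows. The function name ora19501_fields is hypothetical (it is not an Oracle tool), and the pattern assumes the message has exactly the format shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Oracle: reads RMAN output on stdin and
# prints one "file=... blockno=... blocksize=..." line per ORA-19501 message.
# Assumes the message format shown in the error stack above.
ora19501_fields() {
    sed -n 's/^ORA-19501: read error on file "\([^"]*\)", blockno \([0-9][0-9]*\) (blocksize=\([0-9][0-9]*\)).*$/file=\1 blockno=\2 blocksize=\3/p'
}

# Usage (rman.log is an assumed file name for the saved output):
#   ora19501_fields < rman.log
```

The block number feeds the DBA_EXTENTS lookup, and the block size is the value to pass to dbv later.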


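Starting from the error messages above:

1. I checked the datafile and found that it physically exists.

2. I looked up the Oracle error codes to get more detail.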

[oracle@infra bin]$ oerr ora 19501
19501, 00000, "read error on file \"%s\", blockno %s (blocksize=%s)"
// *Cause:  read error on input file
// *Action: check the file

[oracle@infra bin]$ oerr ora 27072
27072, 00000, "File I/O error"
// *Cause:  read/write/readv/writev system call returned error, additional
//          information indicates starting block number of I/O
// *Action: check errno


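Analysis: this is a file read error, so I suspected a corrupt block. But is the corruption physical or logical?

3. I checked the alert log. It contained no errors and nothing else of value.

So I started from the suspected corrupt block.

4. Find the tablespace and object that own the block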

SQL> r
1  SELECT OWNER, SEGMENT_NAME, SEGMENT_TYPE, TABLESPACE_NAME, A.PARTITION_NAME
2    FROM DBA_EXTENTS A
3    WHERE FILE_ID = &FILE_ID
4*   AND &BLOCK_ID BETWEEN BLOCK_ID AND BLOCK_ID + BLOCKS - 1
Enter value for file_id: 63
old   3:   WHERE FILE_ID = &FILE_ID
new   3:   WHERE FILE_ID = 63
Enter value for block_id: 15233
old   4:   AND &BLOCK_ID BETWEEN BLOCK_ID AND BLOCK_ID + BLOCKS - 1
new   4:   AND 15233 BETWEEN BLOCK_ID AND BLOCK_ID + BLOCKS - 1

OWNER                SEGMENT_NAME         SEGMENT_TYPE       TABLESPACE_NAME      PARTITION_NAME
-------------------- -------------------- ------------------ -------------------- ------------------------------
CONTENT              DR$IFS_TEXT$I        TABLE              CONTENT_IFS_CTX_K

SQL> select count(*) from CONTENT.DR$IFS_TEXT$I;

COUNT(*)
----------
3257212


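Analysis: the table that owns the suspect block can still be queried. There are two possible cases:

(1) All of the table's data is in memory, so the query is served entirely by logical reads. In that case the query cannot tell us whether the table has a logically or a physically corrupt block.

(2) The table's data is partly in memory and partly on disk, so part of the query is served by physical reads.

    Working by assumption: since it is not clear whether the corruption is physical or logical, assume it is logical.

    To test that assumption, I did the following.

5. Use the dbv tool to check for logical corruption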

[oracle@infra bin]$ dbv file=/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf  blocksize=8192

DBVERIFY: Release 10.1.0.4.2 - Production on Thu Aug 1 16:26:07 2013

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

DBVERIFY - Verification starting : FILE = /XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf

DBVERIFY - Verification complete

Total Pages Examined         : 15258
Total Pages Processed (Data) : 13427
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 20
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1707
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 104
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 1978186924 (0.1978186924)
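
The dbv summary above can also be checked mechanically instead of by eye. The sketch below adds up the "Failing" and "Marked Corrupt" counters; the function name dbv_bad_pages is hypothetical, and it assumes the summary lines have the "Total Pages ... : N" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads dbv output on stdin and prints the sum of all
# "Total Pages Failing" and "Total Pages Marked Corrupt" counters.
# A result of 0 means dbv reported no bad pages.
dbv_bad_pages() {
    awk -F: '/^Total Pages (Failing|Marked Corrupt)/ {
                 gsub(/[^0-9]/, "", $2)   # keep only the digits of the count
                 sum += $2
             }
             END { print sum + 0 }'
}

# Usage (stderr is merged in case dbv writes its report there):
#   dbv file=<datafile> blocksize=8192 2>&1 | dbv_bad_pages
```

For the run above the sum is 0, which matches the reading that dbv found nothing wrong with the block contents.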

I then validated the same datafile with RMAN:

RMAN> run {
2> backup validate datafile 63 format '/XXX/flash_recovery_area/prod/backupset/%U.dbf';
3> }

Starting backup at 01-AUG-13
using target database controlfile instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=367 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00063 name=/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 08/01/2013 16:32:35
ORA-19501: read error on file "/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf", blockno 15233 (blocksize=8192)
ORA-27072: File I/O error
Additional information: 15232

RMAN> backup check logical validate datafile 63;

Starting backup at 01-AUG-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00063 name=/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 08/01/2013 17:42:10
ORA-19501: read error on file "/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf", blockno 15233 (blocksize=8192)
ORA-27072: File I/O error
Additional information: 15232


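Analysis: dbv found no logical corruption, so why does validation under RMAN still fail?

      The only explanation I could see is that the corruption is not logical but physical.

      To verify that it is physical, I did the following.

6. Use cp to check for physical corruption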

[oracle@infra datafile]$ cp  /XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf   /tmp/1.dbf
cp: reading `/XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf': Input/output error
[oracle@infra datafile]$ cp /XXX/oradata/prod/datafile/rman01.dbf /tmp/2.dbf  --no error
[oracle@infra datafile]$ cp /XXX/oradata/prod/datafile/o1_mf_ovfmetri_1q6jw7hm_.dbf  /tmp/3.dbf  --no error
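
cp only shows that some part of the file cannot be read. To narrow the failure down, the file can be read one block at a time with dd. The sketch below is a minimal version of that idea; scan_blocks is a hypothetical name, and the block index is simply counted in units of the block size from the start of the file. The error stack reports blockno 15233 in ORA-19501 but 15232 under ORA-27072, and nothing above settles how either maps to a file offset, so it is safer to scan a small window around both.

```shell
#!/bin/sh
# Hypothetical helper: scan_blocks FILE BLOCKSIZE FIRST LAST
# Reads each block index from FIRST to LAST with dd and reports every index
# whose read fails, followed by a one-line total.
scan_blocks() {
    file=$1; bs=$2; first=$3; last=$4
    i=$first
    bad=0
    while [ "$i" -le "$last" ]; do
        # skip=N makes dd start N blocks of size bs into the file
        if ! dd if="$file" of=/dev/null bs="$bs" skip="$i" count=1 2>/dev/null; then
            echo "read failed at block index $i"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    echo "unreadable blocks: $bad"
}

# Usage, with the datafile from the error stack and an assumed window:
#   scan_blocks /XXX/oradata/prod/datafile/o1_mf_content__1q6k05ym_.dbf 8192 15225 15240
```

A read that is served from the operating system's page cache can succeed even when the underlying sector is bad. Where GNU dd is available, adding iflag=direct to the dd call bypasses the cache.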


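Analysis: the test results above made me suspect that the disk is failing. To check that suspicion, I did the following.

7. Check whether the disk is healthy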

[oracle@infra oracle]$ dmesg
0    0   0   0    0    0    00
17 00F 0F  1    1    0   1   0    1    1    A9

IO APIC #9......
.... register #00: 09000000
.......    : physical APIC id: 09
.......    : Delivery Type: 0
.......    : LTS          : 0
.... register #01: 00178020
.......     : max redirection entries: 0017
.......     : PRQ implemented: 1
.......     : IO APIC version: 0020
.... register #03: 00000001
.......     : Boot DT    : 1
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 000 00  1    0    0   0   0    0    0    00
01 000 00  1    0    0   0   0    0    0    00
02 00F 0F  1    1    0   1   0    1    1    B1
03 000 00  1    0    0   0   0    0    0    00
04 000 00  1    0    0   0   0    0    0    00
05 000 00  1    0    0   0   0    0    0    00
06 000 00  1    0    0   0   0    0    0    00
07 000 00  1    0    0   0   0    0    0    00
08 000 00  1    0    0   0   0    0    0    00
09 000 00  1    0    0   0   0    0    0    00
0a 000 00  1    0    0   0   0    0    0    00
0b 000 00  1    0    0   0   0    0    0    00
0c 000 00  1    0    0   0   0    0    0    00
0d 000 00  1    0    0   0   0    0    0    00
0e 000 00  1    0    0   0   0    0    0    00
0f 000 00  1    0    0   0   0    0    0    00
10 000 00  1    0    0   0   0    0    0    00
11 000 00  1    0    0   0   0    0    0    00
12 000 00  1    0    0   0   0    0    0    00
13 000 00  1    0    0   0   0    0    0    00
14 000 00  1    0    0   0   0    0    0    00
15 000 00  1    0    0   0   0    0    0    00
16 000 00  1    0    0   0   0    0    0    00
17 000 00  1    0    0   0   0    0    0    00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ8 -> 0:8
IRQ10 -> 0:10
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 0:16
IRQ17 -> 0:17
IRQ18 -> 0:18
IRQ19 -> 0:19
IRQ23 -> 0:23
IRQ26 -> 1:2
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 2792.9879 MHz.
..... host bus clock speed is 199.4990 MHz.
cpu: 0, clocks: 1994990, slice: 398998
CPU0<T0:1994976,T1:1595968,D:10,S:398998,C:1994990>
...

audit subsystem ver 0.1 initialized
mtrr: type mismatch for fd000000,800000 old: uncachable new: write-combining
mtrr: type mismatch for fd000000,800000 old: uncachable new: write-combining
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18790976
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18790984
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18790992
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791000
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791008
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791016
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791024
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791032
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791040
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791048
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791056
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791064
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791072
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791080
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791088
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
application bug: sqlplus(2014) has SIGCHLD set to SIG_IGN but calls wait().
(see the NOTES section of 'man 2 wait'). Workaround activated.
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791048
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791056
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791064
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791072
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791080
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791088
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 25040001
I/O error: dev 08:05, sector 18791096
application bug: sqlplus(3841) has SIGCHLD set to SIG_IGN but calls wait().
(see the NOTES section of 'man 2 wait'). Workaround activated.
application bug: sqlplus(3841) has SIGCHLD set to SIG_IGN but calls wait().
(see the NOTES section of 'man 2 wait'). Workaround activated.
application bug: sqlplus(5580) has SIGCHLD set to SIG_IGN but calls wait().
(see the NOTES section of 'man 2 wait'). Workaround activated.
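
Most of the dmesg output is boot noise, and the relevant lines are the repeated "I/O error" entries. The sketch below condenses them into one line per device; io_error_summary is a hypothetical name, and it assumes the kernel message format shown above. In this log every error is on dev 08:05 (major 8, minor 5, which on a typical Linux system is /dev/sda5) and the sectors run from 18790976 to 18791096.

```shell
#!/bin/sh
# Hypothetical helper: reads dmesg output on stdin and prints, per device,
# the number of distinct failing sectors and the lowest and highest sector.
io_error_summary() {
    sed -n 's|^I/O error: dev \([0-9a-f:]*\), sector \([0-9][0-9]*\).*$|\1 \2|p' |
    sort -u |
    awk '{
             n[$1]++
             if (!($1 in lo) || $2 + 0 < lo[$1]) lo[$1] = $2 + 0
             if (!($1 in hi) || $2 + 0 > hi[$1]) hi[$1] = $2 + 0
         }
         END {
             for (d in n)
                 printf "dev=%s distinct_sectors=%d first=%d last=%d\n", d, n[d], lo[d], hi[d]
         }'
}

# Usage:
#   dmesg | io_error_summary
```

A short, contiguous sector range on a single device, as seen here, points to a localized bad area on the disk and not to a problem with the whole device.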


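Conclusion: the results above confirm my inference. The disk is failing and has developed bad sectors, so the physical read of this datafile fails during the backup.

      This raises a new question, though. If the disk is bad, the logical structure of the datafile on the bad sectors should be damaged as well, that is, there should also be logical corruption. In fact there is none. Moreover, after one database restart the database ran normally and reported no related errors.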