
Backup failure caused by a misconfigured LAD (Log Archive Dest)

2015-06-26 18:41

I. How the problem surfaced

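On 2014/10/14 a customer reported that the crontab job that backs up the database had failed. Remote analysis showed that the Data Guard parameters had not been correctly restored after the disaster recovery drill on 2014/09/13, so archived logs were no longer being cleaned up. The accumulated archived logs were too large for the backup destination, and the archivelog backup failed for lack of space. The details follow.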

II. Log analysis

1. After logging in, the backup log shows that the datafile backup succeeded but the archived log backup failed:

including current SPFILE in backup set
channel c1: starting piece 1 at 13-OCT-14
channel c1: finished piece 1 at 13-OCT-14
piece handle=/backup/addrrman/full_ADDRPROD_20141013_14004_1 tag=TAG20141013T220005 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:01
channel c2: finished piece 1 at 13-OCT-14
piece handle=/backup/addrrman/full_ADDRPROD_20141013_14001_1 tag=TAG20141013T220005 comment=NONE
channel c2: backup set complete, elapsed time: 01:45:12
channel c3: finished piece 1 at 13-OCT-14
piece handle=/backup/addrrman/full_ADDRPROD_20141013_14002_1 tag=TAG20141013T220005 comment=NONE
channel c3: backup set complete, elapsed time: 01:46:01
Finished backup at 13-OCT-14

sql statement: alter system archive log current
..... skip .....

released channel: c1
released channel: c2
released channel: c3
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on c3 channel at 10/14/2014 00:30:34
ORA-19502: write error on file "/backup/addrrman/arch_ADDRPROD_20141014_14093_1", block number 442369 (block size=512)
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 28: No space left on device
Additional information: -1
Additional information: 1048576

2. The sizes of the datafile backup pieces show that the data volume had not grown sharply:

oracle@p740a:/backup/addrrman[addr11g1]$ls -ltr
total 143197088
-rw------- 1 oracle oinstall 98 Aug 21 18:53 nohup.out
-rw-r--r-- 1 oracle oinstall 7702 Oct 13 22:00 analyze.lst
-rw-r----- 1 oracle asmadmin 23931797504 Oct 13 23:44 full_ADDRPROD_20141013_14000_1
-rw-r----- 1 oracle asmadmin 7847936 Oct 13 23:44 full_ADDRPROD_20141013_14003_1
-rw-r----- 1 oracle asmadmin 98304 Oct 13 23:44 full_ADDRPROD_20141013_14004_1
-rw-r----- 1 oracle asmadmin 23550468096 Oct 13 23:45 full_ADDRPROD_20141013_14001_1
-rw-r----- 1 oracle asmadmin 25820962816 Oct 13 23:46 full_ADDRPROD_20141013_14002_1
-rw-r--r-- 1 oracle oinstall 2659758 Oct 14 00:34 rman_delete.log
-rw-r--r-- 1 oracle oinstall 803655 Oct 14 00:37 delete_local_std_arch.log
-rw-r--r-- 1 oracle oinstall 1210456 Oct 14 00:38 rman_bk.log
-rw-r--r-- 1 oracle oinstall 527 Oct 14 00:38 delete_cd_std_arch.log

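3. The archivelog deletion log shows that the archived logs from 9/13 were not deleted because they had not yet been applied on all standbys: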

RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
archived log file name=+ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13079.1905.858179699 thread=1 sequence=13079
RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
archived log file name=+ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13080.1618.858181499 thread=1 sequence=13080
RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
archived log file name=+ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13081.1619.858182367 thread=1 sequence=13081
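
To see how many archived logs are being held back, and from which days, the warnings in rman_delete.log can be tallied. The following is a minimal sketch in POSIX shell. It assumes the log keeps the two-line shape shown above (an RMAN-08120 line followed by an "archived log file name=" line), and count_unapplied_by_day is a hypothetical helper name, not an RMAN feature.

```shell
#!/bin/sh
# count_unapplied_by_day (hypothetical helper): count RMAN-08120 warnings in
# an RMAN delete log, grouped by the YYYY_MM_DD directory that appears in the
# archived log path on the line following each warning.
count_unapplied_by_day() {
    log_file=$1
    awk '
        /RMAN-08120/ { pending = 1; next }
        pending && /archived log file name=/ {
            # Pull the date directory out of .../archivelog/2014_09_13/...
            if (match($0, /archivelog\/[0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]/)) {
                day = substr($0, RSTART + 11, 10)   # skip the 11 chars of "archivelog/"
                count[day]++
            }
            pending = 0
        }
        END { for (d in count) print d, count[d] }
    ' "$log_file" | sort
}

# Example call (path taken from the deletion script in this article):
# count_unapplied_by_day /backup/addrrman/rman_delete.log
```

Each output line is a date directory followed by the number of archived logs RMAN refused to delete for that day. A count that starts on the day of a drill and never drops is a strong hint that a standby destination stopped receiving redo at that point.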

4. This matches the archivelog deletion policy configured in the archive cleanup script:

rman target / nocatalog log /backup/addrrman/rman_delete.log<<EOF
allocate channel for maintenance type disk connect 'sys/xxxx@addr11g1';
allocate channel for maintenance type disk connect 'sys/xxxx@addr11g2';
CONFIGURE RETENTION POLICY TO REDUNDANCY 1;
CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY; # logs may be deleted only after they have been applied on all standbys
crosscheck backup;
crosscheck archivelog all;
delete noprompt archivelog until time 'sysdate-7';
delete noprompt obsolete;
delete noprompt expired backup;

exit
EOF

5. Checking log_archive_dest and log_archive_dest_state reveals a LAD in the defer state:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest                     string
log_archive_dest_1                   string      LOCATION=+ARCHDG VALID_FOR=(AL
                                                 L_LOGFILES,ALL_ROLES) DB_UNIQU
                                                 E_NAME=addrprod

log_archive_dest_3                   string      service=ADDRCD arch async vali
                                                 d_for=(ONLINE_LOGFILES,PRIMARY
                                                 _ROLE) reopen=60 db_unique_nam
                                                 e=ADDRCD

log_archive_dest_4                   string      service=ADDRPROD_STD arch asyn
                                                 c valid_for=(ONLINE_LOGFILES,P
                                                 RIMARY_ROLE) reopen=60 db_uniq
                                                 ue_name=ADDRPROD_STD
log_archive_dest_state_1             string      ENABLE
log_archive_dest_state_3             string      defer
log_archive_dest_state_4             string      enable
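
Spotting a deferred destination by eye is easy to miss in a long parameter listing. The sketch below filters SQL*Plus "show parameter" output for destination states set to defer. It assumes the three-column NAME / TYPE / VALUE layout shown above; list_deferred_dests is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# list_deferred_dests (hypothetical helper): read "show parameter" output on
# stdin and print the name of every log_archive_dest_state_n parameter whose
# value is "defer". Matching is case-insensitive because the value is stored
# exactly as it was typed (ENABLE, enable, defer, ...).
list_deferred_dests() {
    awk 'tolower($1) ~ /^log_archive_dest_state_[0-9]+$/ && tolower($NF) == "defer" { print $1 }'
}

# Example call, assuming OS authentication as SYSDBA is available on the host:
# echo "show parameter log_archive_dest_state" | sqlplus -s "/ as sysdba" | list_deferred_dests
```

Fed the listing above, the helper prints log_archive_dest_state_3. Any name it prints should then be compared with the matching log_archive_dest_n value to decide whether the destination is still needed.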

III. Resolution

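log_archive_dest_3 still pointed at the standby ADDRCD while its state was defer, so no redo was shipped there and the APPLIED ON ALL STANDBY policy kept blocking deletion. After clearing log_archive_dest_3, manually deleting the archived logs succeeded: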

SQL> show parameter log_archive_dest_3;

NAME                                 TYPE       VALUE
------------------------------------ ---------- ------------------------------
log_archive_dest_3                   string     service=ADDRCD arch async vali
                                                d_for=(ONLINE_LOGFILES,PRIMARY
                                                _ROLE) reopen=60 db_unique_nam
                                                e=ADDRCD
log_archive_dest_30                  string
log_archive_dest_31                  string
SQL> alter system set log_archive_dest_3='' scope=both sid='*';

System altered.

SQL> show parameter log_archive_dest_3;

NAME                                 TYPE       VALUE
------------------------------------ ---------- ------------------------------
log_archive_dest_3                   string
log_archive_dest_30                  string
log_archive_dest_31                  string

Deleting the archived logs no longer reports the warning:

RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;

using target database control file instead of recovery catalog
old RMAN configuration parameters:
CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;
new RMAN configuration parameters:
CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;
new RMAN configuration parameters are successfully stored

RMAN> delete noprompt archivelog until time 'sysdate-7';

allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=963 instance=addr11g1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=1717 instance=addr11g1 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=1908 instance=addr11g1 device type=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: SID=2189 instance=addr11g1 device type=DISK
List of Archived Log Copies for database with db_unique_name ADDRPROD
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - ---------
168624  1    13079   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13079.1905.858179699

168643  1    13080   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13080.1618.858181499

168646  1    13081   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13081.1619.858182367

168648  1    13082   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13082.1620.858182411

168656  1    13083   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13083.1625.858182901

168658  1    13084   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13084.1624.858182903

168662  1    13085   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13085.1627.858182967

168666  1    13086   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13086.1629.858184767

168670  1    13087   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13087.1631.858186569

168674  1    13088   A 13-SEP-14
Name: +ARCHDG/addrprod/archivelog/2014_09_13/thread_1_seq_13088.1633.858188367


IV. Summary

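Problems caused by incomplete cleanup after a temporary operation are probably not rare. This one did not lead to a major outage, which of course does not mean the next one won't. In day-to-day work we therefore need to protect the system from several angles, for example: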

1). Know the system environment well enough to understand exactly how to revert each temporary operation;

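2). Of course, relying on the previous point alone is not dependable. As the saying goes, the palest ink beats the best memory, so it is better to have a standardized OM;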

3). After the temporary operation is finished, run a complete check of the system.
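
One way to make the final check less dependent on memory is to capture the relevant settings before the temporary operation and compare them afterwards. The sketch below is only an illustration under that assumption: compare_snapshots is a hypothetical helper, and the two files are assumed to hold the output of the same query (for example "show parameter log_archive_dest") saved before and after the drill.

```shell
#!/bin/sh
# compare_snapshots (hypothetical helper): compare two text snapshots of the
# same settings, taken before and after a temporary operation.
# Prints "OK: no drift" and returns 0 when the files are identical; otherwise
# prints the differences reported by diff and returns 1.
compare_snapshots() {
    before=$1
    after=$2
    if diff "$before" "$after" >/dev/null 2>&1; then
        echo "OK: no drift"
        return 0
    fi
    echo "DRIFT: settings differ from the pre-change snapshot"
    diff "$before" "$after"
    return 1
}

# Example call with hypothetical file names:
# compare_snapshots lad_before_drill.txt lad_after_drill.txt
```

In this incident such a comparison would have flagged log_archive_dest_state_3 right after the drill, if it was enabled beforehand, rather than a month later when the backup ran out of space.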