版权声明:本文为Buddy Yuan原创文章,未经允许不得转载。原文地址:http://www.dboracle.com/archivers/ORA-04021: TIMEOUT OCCURRED WHILE WAITING TO LOCK OBJECT的解决办法.html
国庆都放假了,咱们还处理了一起小故障。先说说这个故障吧,Acticve DataGuard突然实例宕机。通过查看Alert的Trace发现是LGWR进程把实例宕掉了。具体alert trace如下:
Mon Oct 01 00:15:49 2018 Media Recovery Waiting for thread 2 sequence 309643 (in transit) Recovery of Online Redo Log: Thread 2 Group 64 Seq 309643 Reading mem 0 Mem# 0: +DG_DATA/dgskgj/onlinelog/group_64.807.942196417 Mem# 1: +DG_DATA/dgskgj/onlinelog/group_64.717.936131839 Mon Oct 01 00:15:54 2018 Archived Log entry 186436 added for thread 2 sequence 309642 ID 0x2f689337 dest 1: Mon Oct 01 00:16:17 2018 Errors in file /oracle/app/product/diag/rdbms/dgskgj/skgj1/trace/skgj_lgwr_14418510.trc: ORA-04021: timeout occurred while waiting to lock object LGWR (ospid: 14418510): terminating the instance due to error 4021 Mon Oct 01 00:16:18 2018 System state dump requested by (instance=1, osid=14418510 (LGWR)), summary=[abnormal instance termination].
可以看到这里首先出现了ORA-04021: timeout occurred while waiting to lock object,紧接着LGWR就terminating实例。所以我们在这里要先看一下LGWR的Trace情况。
Trace file /oracle/app/product/diag/rdbms/dgskgj/skgj1/trace/skgj1_lgwr_14418510.trc Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /oracle/app/product/11.2/db System name: AIX Node name: drskgj1 Release: 1 Version: 6 Machine: 00F96EC64C00 Instance name: skgj1 Redo thread mounted by this instance: 1 Oracle process number: 23 Unix process pid: 14418510, image: oracle@drskgj1 (LGWR) *** 2018-10-01 00:16:17.890 *** SESSION ID:(4348.1) 2018-10-01 00:16:17.890 *** CLIENT ID:() 2018-10-01 00:16:17.890 *** SERVICE NAME:(SYS$BACKGROUND) 2018-10-01 00:16:17.890 *** MODULE NAME:() 2018-10-01 00:16:17.890 *** ACTION NAME:() 2018-10-01 00:16:17.890 *** TRACE FILE RECREATED AFTER BEING REMOVED *** error 4021 detected in background process ORA-04021: timeout occurred while waiting to lock object kjzduptcctx: Notifying DIAG for crash event ----- Abridged Call Stack Trace ----- ksedsts()+240<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+120<-ksuitm()+5136<-ksbrdp()+4696<-opirip()+1620<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+276<-main()+204<-__start()+112 ----- End of Abridged Call Stack Trace ----- *** 2018-10-01 00:16:17.896 LGWR (ospid: 14418510): terminating the instance due to error 4021 ksuitm: waiting up to [5] seconds before killing DIAG(15991690)
这里可以看到LGWR给的信息也很有限,就是给了一串堆栈的信息,也没有什么特别的其他信息。所以这种情况下,我们就只能先借助MOS查一下ORA-04021: timeout occurred while waiting to lock object是什么问题。通过搜索我们发现文档ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)和我们遇到的问题是一致的。首先怎么判断问题是一致的呢?第一,alter日志报错的方式相同,第二,lgwr的trace堆栈是一样的。那么这篇告诉我们,我们命中了Bug 16717701 – ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT或者是Bug 11712267 – ACTIVE DATA GUARD DATABASE HUNG ON ‘LIBRARY CACHE: MUTEX X’ WAIT EVENT。
该问题的原因是当通过ADG中的恢复,LGWR将DB INSTANCE状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析,而SQL的解析也能阻止LGWR。这是非常糟糕的行为。我们可以通过查看下列视图查询这个行为。
SQL> select a.*,b.name from v$sesstat a , v$statname b 2 where a.statistic#=b.statistic# 3 and a.sid=(select distinct sid from v$mystat) 4 and b.name like '%parse%'; SID STATISTIC# VALUE NAME ---------- ---------- ---------- ------------------------------ 117 264 0 ADG parselock X get attempts 117 265 0 ADG parselock X get successes
所以防止这个问题的办法就是
1.先尝试使用选项步骤1,将cursor_sharing更改为force。减少SQL解析的时间。
2.如果再次发生该问题,将隐含参数”_adg_parselock_timeout”设置成500。这个隐含参数是可以动态修改的。这个参数是防止超时的。