An interesting case of ORA-600 :)
I was paged for standby lag and started looking into the issue. The db version was 11.1.0.7.0 version. Standby was lagging behind by 36 archives
SQL> @mrp INST_ID PROCESS STATUS THREAD# SEQUENCE# BLOCK# DELAY_MINS ---------- --------- ------------ ---------- ---------- ---------- ---------- 1 MRP0 WAIT_FOR_LOG 1 568330 0 0 INST_ID PROCESS STATUS THREAD# SEQUENCE# BLOCK# DELAY_MINS ---------- --------- ------------ ---------- ---------- ---------- ---------- 1 RFS IDLE 0 0 0 0 1 RFS IDLE 0 0 0 0 1 RFS WRITING 1 568330 1 0 1 RFS IDLE 0 0 0 0 1 RFS IDLE 1 568366 309529 0
After sometime checked back again and still on the same sequence#. Status of MRP and RFS
SQL> @mrp INST_ID PROCESS STATUS THREAD# SEQUENCE# BLOCK# DELAY_MINS ---------- --------- ------------ ---------- ---------- ---------- ---------- 1 MRP0 WAIT_FOR_LOG 1 568330 0 0 INST_ID PROCESS STATUS THREAD# SEQUENCE# BLOCK# DELAY_MINS ---------- --------- ------------ ---------- ---------- ---------- ---------- 1 RFS IDLE 0 0 0 0 1 RFS IDLE 0 0 0 0 1 RFS WRITING 1 568330 1 0 1 RFS IDLE 0 0 0 0 1 RFS IDLE 1 568366 795006 0
No error was reported in alert log or was in v$archive_dest_status.Archive log genertaion on Primary was usual like any other day, nothing changed.Seq# 568330 was present on Primary and size was 497Mb whereas the same seq# size on standby was 513Mb. I am not sure why!!!
So, i thought to copy the archive from the primary, stop mrp , perform manual recovery using ‘recover standby standby’ as other 35 archives were present on disk and then start real time apply.
Once the copy completed, i stopped the MRP and started manual recovery
SQL> alter database recover managed standby database cancel; Database altered. SQL> select name,open_mode,database_role from v$database; NAME OPEN_MODE DATABASE_ROLE --------- ---------- ---------------- TEST MOUNTED PHYSICAL STANDBY SQL> recover standby database; ORA-00279: change 69956910076 generated at 10/26/2014 21:46:36 needed for thread 1 ORA-00289: suggestion : /oraclebackup/fra/TEST/archivelog/2014_10_26/o1_mf_1_568330_b4vdl546_.arc ORA-00280: change 69956910076 for thread 1 is in sequence #568330 Specify log: {=suggested | filename | AUTO | CANCEL} AUTO ORA-16145: archival for thread# 1 sequence# 568330 in progress SQL> @mrp no rows selected INST_ID PROCESS STATUS THREAD# SEQUENCE# BLOCK# DELAY_MINS ---------- --------- ------------ ---------- ---------- ---------- ---------- 1 RFS IDLE 0 0 0 0 1 RFS IDLE 0 0 0 0 1 RFS WRITING 1 568330 1 0 1 RFS IDLE 0 0 0 0 1 RFS IDLE 1 568372 563688 0
At this point i thought to bounce the standby and then perform recovery. I issued shutdown immediate and it was hung. Almost after 10mins alert log showed
Active call for process 45660 user 'oracle' program 'oracle@test.com' SHUTDOWN: waiting for active calls to complete.
I checked the PID 45660 and as it showed LOCAL=NO, did kill -9
[oracle@test:~/app/oracle (test)]$ ps -ef | grep 45660 oracle 41463 16050 0 00:43 pts/0 00:00:00 grep 45660 oracle 45660 1 1 Oct23 ? 01:09:22 oracletest (LOCAL=NO) [oracle@test:~/app/oracle (test)]$ kill -9 45660 [oracle@test:~/app/oracle (test)] ps -ef | grep 45660 oracle 42181 16050 0 00:44 pts/0 00:00:00 grep 45660 oracle 45660 1 1 Oct23 ? 01:09:22 oracletest (LOCAL=NO)
Kill -9 didn’t kill the process. As it was standby, i thought to go ahead and Abort the instance and this is what happend when i tried startup mount
SQL> startup mount ORACLE instance started. Total System Global Area 6.1466E+10 bytes Fixed Size 2174600 bytes Variable Size 2.0535E+10 bytes Database Buffers 4.0802E+10 bytes Redo Buffers 126640128 bytes ORA-03113: end-of-file on communication channel Process ID: 47331 Session ID: 995 Serial number: 3
Alert log showed
Mon Oct 27 00:48:09 2014
ALTER DATABASE MOUNT
Mon Oct 27 00:53:10 2014
System State dumped to trace file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_ckpt_47252.trc
Mon Oct 27 00:53:11 2014
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_asmb_47260.trc (incident=491160):
ORA-00600: internal error code, arguments: [600], [ORA_NPI_ERROR], [ORA-00600: internal error code, arguments: [kfnsInstanceReg00], [test:test_TEST_DB], [30000], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Incident details in: /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/incident/incdir_491160/test_asmb_47260_i491160.trc
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/test/trace/test_asmb_47260.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [600], [ORA_NPI_ERROR], [ORA-00600: internal error code, arguments: [kfnsInstanceReg00], [test:test_TEST_DB], [30000], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Mon Oct 27 00:53:12 2014
Mon Oct 27 00:53:12 2014
Sweep Incident[491160]: completed
Mon Oct 27 00:53:12 2014
Trace dumping is performing id=[cdmp_20141027005312]
Errors in file /home/oracle/app/oracle/diag/rdbms/test_TEST_db/prod/trace/test_ckpt_47252.trc:
ORA-15083: failed to communicate with ASMB background process
CKPT (ospid: 47252): terminating the instance due to error 15083
Mon Oct 27 00:53:20 2014
Instance terminated by CKPT, pid = 47252
The CKPT trace showed
15083: Timeout waiting for ASMB to be ready (state=1)
ASMB is ALIVE
—– Abridged Call Stack Trace —–
SYSTEM STATE (level=10, with short stacks)
————
System global information:
processes: base 0xe90ea0088, size 900, cleanup 0xea0ec6598
allocation: free sessions 0xeb8f53d68, free calls (nil)
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON's last scan for dead processes: 1
system statistics:
22 logons cumulative
19 logons current
5 opened cursors cumulative
1 opened cursors current
0 user commits
0 user rollbacks
51 user calls
19 recursive calls
0 recursive cpu usage
0 session logical reads
0 session stored procedure space
0 CPU used when call started
0 CPU used by this session
1 DB time
Looking at the ASM instance alert log
System State dumped to trace file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc
Errors in file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc:more /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_33024.trc
*** ACTION NAME:() 2014-10-27 02:18:39.197kfnmsg_wait: KSR failure reaping message (0xffffff83)
*** 2014-10-27 02:18:39.197
Communication failure sending message to UFGs (0xffffff83)
kfnmsg_wait: opcode=35 msg=0x6b3f0928
2014-10-27 02:18:39.197762 :kfnmsg_Dump(): kfnmsg_SendtoUFGs: (kfnmsg) op=35 (KFNMS_GROUP_RLS) (gnum) [1.0]
CommunicationFailure 0xffffff83 after 300s
The server was hosting 2 standby db instances and 1 ASM instance. We thought to bring down the other standby db instance and then the ASM instance and then finally restart all. The shutdown immediate for 2nd standby instance was hung for long time so we aborted it. After ASM was brought down, still we could see below 5 pids
$ ps -ef | grep oracle
oracle 12032 16050 0 02:44 pts/0 00:00:00 grep arc
oracle 23135 1 0 Oct18 ? 00:02:15 ora_arc5_test
oracle 25390 1 0 Oct18 ? 00:01:19 ora_arc1_test2
oracle 25392 1 0 Oct18 ? 00:00:50 ora_arc2_test2oracle 45660 1 1 Oct23 ? 01:09:22 oracleprod (LOCAL=NO)
Even after all the instances were down on server we could see few archiver Pids and a Pid with LOCAL=NO. Kill -9 also didn’t kill the pid.
Started the ASM instance and then when i started the standby instance TEST , ASM crashed
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_57380.trc
Errors in file /home/oracle/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_ora_57380.trc (incident=280159):
ORA-00600: internal error code, arguments: [kfupsRelease01], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /home/oracle/app/oracle/diag/asm/+asm/+ASM/incident/incdir_280159/+ASM_ora_57380_i280159.trc
At this point escalated to client for server reboot. After the server reboot was done, all the instances came up properly. Started MRP and standby came insync.
Also you can reference to the below Metalink documents
Database startup Reports ORA-00600:[kfnsInstanceReg00] (Doc ID 1552422.1)
Bug 6934636 – Hang possible for foreground processes in ASM instance (Doc ID 6934636.8)
ORA-600 [kfnsInstanceReg00] and long waits on “ASM FILE METADATA OPERATION” (Doc ID 1480320.1)
Filed under: 11gR1, Oracle Tagged: ASMB is ALIVE, kfnsInstanceReg00, kfupsRelease01, ORA_NPI_ERROR, UFGs
