| Title: | VAX DBMS |
| Notice: | THIS NOTESFILE IS NOT A FORMAL SUPPORT CHANNEL |
| Moderator: | SCARY::CHARLAND |
| Created: | Thu Feb 20 1986 |
| Last Modified: | Tue Jun 03 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 2642 |
| Total number of notes: | 11044 |
[6.1A ECO 1, VMS 6.2]
Customer has two DBMS processes and the DBMS_MONITOR get into a
deadlock situation which is not resolved and the whole database
winds up hung on a locking logjam with these processes as the
blocker(s). Out of desparation they crashed the system and a
romp through SDA follows. I suggested DBO/BUGCHECK/PID as an
unsupported way out of it rather than taking out the whole
production machine, but I still need to track down what is
wrong. DEADLOCK_WAIT is set to 3. Any ideas ?
Regards,
rcs
>The following is the output from SDA showing the DBMS_MONITOR process
>waiting for a lock owned by process index 02C8.
SDA> show proc dbms_monitor
Process index: 00A2 Name: DBMS_MONITOR Extended PID: 35E004A2
-----------------------------------------------------------------
Status : 00040011 res,pswapm,phdres
Status2: 00000001 quantum_resched
PCB address B53E29C0 JIB address B46C2B00
PHD address C33C7C00 Swapfile disk address 00000000
Master internal PID 000100A2 Subprocess count 0
Internal PID 000100A2 Creator internal PID 00000000
Extended PID 35E004A2 Creator extended PID 00000000
State LEF Termination mailbox 00AA
Current priority 15 AST's enabled KESU
Base priority 15 AST's active NONE
UIC [00001,000004] AST's remaining 509
Mutex count 0 Buffered I/O count/limit 1022/1024
Waiting EF cluster 0 Direct I/O count/limit 256/256
Starting wait time 10001010 BUFIO byte count/limit 496288/496288
Event flag wait mask F7FFFFFF # open files allowed left 1014
Local EF cluster 0 E0000001 Timer entries allowed left 127
Local EF cluster 1 00000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 552
Global cluster 3 pointer 00000000 Global WS page count 23
SDA> show proc/lock
Process index: 00A2 Name: DBMS_MONITOR Extended PID: 35E004A2
-----------------------------------------------------------------
Lock data:
Lock id: 5700E9B7 PID: 000100A2 Flags: VALBLK SYNCSTS SYSTEM
Par. id: 00000000 SUBLCKs: 0 PROTECT NODLCKW
LKB: B3179200 BLKAST: 00000000
PRIORTY: 0000 RQSEQNM: 0000
Waiting for PW 00000000-FFFFFFFF
Resource: 00000054 00000028 (...T... Status: ASYNC PROTCT
Length 16 00000001 35E07AC8 Hz`5....
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
SDA> show res/lock=5700E9B7
Resource database
-----------------
Address of RSB: B44D6080 GGMODE: PW Status: DIRENTR VALID WTFULRG
Parent RSB: 00000000 CGMODE: PW
Sub-RSB count: 0 FGMODE: PW
Lock Count: 2 CSID: 00000000
BLKAST count: 0 RQSEQNM: 0001
Resource: 00000054 00000028 (...T... Valblk: 00000000 00000000
Length 16 00000001 35E07AC8 Hz`5.... 00000000 00000000
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........ Seqnum: 00000000
Granted queue (Lock ID / Gr mode / Range):
3200316D PW 00000000-FFFFFFFF <- ***********
Conversion queue (Lock ID / Gr mode / Range -> Rq mode / Range):
*** EMPTY QUEUE ***
Waiting queue (Lock ID / Rq mode / Range):
5700E9B7 PW 00000000-FFFFFFFF
SDA> show lock 3200316D
Lock database
-------------
Lock id: 3200316D PID: 001E02C8 Flags: NOQUEUE SYNCSTS SYSTEM
********
Par. id: 00000000 SUBLCKs: 0
LKB: B5952000 BLKAST: 00000000
PRIORTY: 0000
Granted at PW 00000000-FFFFFFFF
Resource: 00000054 00000028 (...T... Status:
Length 16 00000001 35E07AC8 Hz`5....
Exec. mode 00000000 0
0000000 ........
System 00000000 00000000 ........
Local copy
>Next, process 02C8 (VA9401:BFAB) is waiting on a lock owned by process index
>038A.
SDA> show proc/in=2c8
Process index: 02C8 Name: VA9401:BFAB Extended PID: 35E07AC8
----------------------------------------------------------------
Status : 02040001 res,phdres,inter
Status2: 00000001 quantum_resched
PCB address B586E980 JIB address B2F0DF40
PHD address E6F76400 Swapfile disk address 00000000
Master internal PID 001E02C8 Subprocess count 0
Internal PID 001E02C8 Creator internal PID 00000000
Extended PID 35E07AC8 Creator extended PID 00000000
State LEF Termination mailbox 0000
Current priority 5 AST's enabled KESU
Base priority 5 AST's active NONE
UIC [00030,000036] AST's remaining 509
Mutex count 0 Buffered I/O count/limit 512/512
Waiting EF cluster 0 Direct I/O count/limit 128/128
Starting wait time 1A001A19 BUFIO byte count/limit 58976/58976
Event flag wait mask FDFFFFFF # open files allowed left 274
Local EF cluster 0 E0000001 Timer entries allowed left 9
Local EF cluster 1 80000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 1967
Global cluster 3 pointer 00000000 Global WS page count 256
SDA> show proc/lock
Process index: 02C8 Name: VA9401:BFAB Extended PID: 35E07AC8
----------------------------------------------------------------
Lock data:
Lock id: 61E2001B PID: 001E02C8 Flags: SYSTEM
Par. id: 5800E768 SUBLCKs: 0
LKB: B3B71D80 BLKAST: 023CD2B8
PRIORTY: 0000 RQSEQNM: 2D08
Waiting for CW 00000000-FFFFFFFF
Resource: 00000000 00000046 F....... Status: ASYNC TIMOUTQ
Length 04 00000000 00000000 ........
- Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
SDA> show res/lock=61E2001B
Resource database
-----------------
Address of RSB: B4954340 GGMODE: CW Status: VALID WTFULRG
Parent RSB: B44753C0 CGMODE: CW
Sub-RSB count: 0 FGMODE: CW
Lock Count: 163 CSID: 00000000
BLKAST count: 1 RQSEQNM: 2D7C
Resource: 00000000 00000046 F....... Valblk: 00000000 00000000
Length 4 00000000 00000000 ........ 00000000 00000000
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........ Seqnum: 00000000
Granted queue (Lock ID / Gr mode / Range):
0200B0D0 CW 00000000-FFFFFFFF <- **********
Conversion queue (Lock ID / Gr mode / Range -> Rq mode / Range):
*** EMPTY QUEUE ***
Waiting queue (Lock ID / Rq mode / Range):
4F00FFD4 PR 00000000-FFFFFFFF 36F000A4 PR 00000000-FFFFFFFF
SDA> show lock 0200B0D0
Lock database
-------------
Lock id: 0200B0D0 PID: 001F038A Flags: SYSTEM
********
Par. id: 66020125 SUBLCKs: 0
LKB: B4397180 BLKAST: 0028E6B8
PRIORTY: 0000
Granted at CW 00000000-FFFFFFFF
Resource: 00000000 00000046 F....... Status: BLASTQD
Length 04 00000000 00000000 ........
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
>Finally, process 038A (BFAB_35E0870E) is waiting on a lock owned by the
>DBMS_MONITOR (process index 00A2).
SDA> show proc/in=38a
Process index: 038A Name: BFAB_35E0870E Extended PID: 35E07F8A
------------------------------------------------------------------
Status : 00040023 res,delpen,respen,phdres
Status2: 00000001 quantum_resched
PCB address B5221B40 JIB address B45E5880
PHD address E33A2800 Swapfile disk address 00000000
Master internal PID 0021030E Subprocess count 1
Internal PID 001F038A Creator internal PID 0021030E
Extended PID 35E07F8A Creator extended PID 35E0870E
State LEF Termination mailbox 1650
Current priority 10 AST's enabled KESU
Base priority 5 AST's active NONE
UIC [00030,000036] AST's remaining 510
Mutex count 0 Buffered I/O count/limit 512/512
Waiting EF cluster 0 Direct I/O count/limit 128/128
Starting wait time 1A001A1A BUFIO byte count/limit 39904/40032
Event flag wait mask FEFFFFFF # open files allowed left 181
Local EF cluster 0 E2000001 Timer entries allowed left 10
Local EF cluster 1 98000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 865
Global cluster 3 pointer 00000000 Global WS page count 439
SDA> show proc/lock
Process index: 038A Name: BFAB_35E0870E Extended PID: 35E07F8A
------------------------------------------------------------------
Lock data:
Lock id: 7900ABB9 PID: 001F038A Flags: CONVERT SYNCSTS NODLCKW
Par. id: 00000000 SUBLCKs: 0
LKB: B30F7280 BLKAST: 00000000
PRIORTY: 0000 RQSEQNM: 0001
Converting to EX 00000000-FFFFFFFF
Granted at NL 00000000-FFFFFFFF
Resource: 00000054 00000028 (...T... Status: ASYNC
Length 16 00000001 35E07F8A ..`5....
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
SDA> show res/lock=7900ABB9
Resource database
-----------------
Address of RSB: B5066B40 GGMODE: PW Status: VALID CVTFULR
Parent RSB: 00000000 CGMODE: PW
Sub-RSB count: 0 FGMODE: PW
Lock Count: 2 CSID: 00000000
BLKAST count: 0 RQSEQNM: 0002
Resource: 00000054 00000028 (...T... Valblk: 00000000 00000000
Length 16 00000001 35E07F8A ..`5.... 00000000 00000000
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........ Seqnum: 00000000
Granted queue (Lock ID / Gr mode / Range):
668A013F PW 00000000-FFFFFFFF <- ************
Conversion queue (Lock ID / Gr mode / Range -> Rq mode / Range):
7900ABB9 NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
Waiting queue (Lock ID / Rq mode / Range):
*** EMPTY QUEUE ***
SDA> show lock 668A013F
Lock database
-------------
Lock id: 668A013F PID: 000100A2 Flags: VALBLK SYNCSTS SYSTEM
********
Par. id: 00000000 SUBLCKs: 0 PROTECT NODLCKW
LKB: B3AA4F00 BLKAST: 00000000
PRIORTY: 0000
Granted at PW 00000000-FFFFFFFF
Resource: 00000054 00000028 (...T... Status: ASYNC PROTCT
Length 16 00000001 35E07F8A ..`5....
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 2594.1 | HOTRDB::PMEAD | Paul, [email protected], 719-577-8032 | Wed Feb 12 1997 19:18 | 29 | |
Richard,
We don't recognize some of those locks as being Rdb locks. Could you
have the customer ask VMS support if they recognize the lock shown
below? I think this lock is killing deadlock detection since it has
the NODLCKW flag lit.
Paul
SDA> show lock 668A013F
Lock database
-------------
Lock id: 668A013F PID: 000100A2 Flags: VALBLK SYNCSTS SYSTEM
********
Par. id: 00000000 SUBLCKs: 0 PROTECT NODLCKW
LKB: B3AA4F00 BLKAST: 00000000
PRIORTY: 0000
Granted at PW 00000000-FFFFFFFF
Resource: 00000054 00000028 (...T... Status: ASYNC PROTCT
Length 16 00000001 35E07F8A ..`5....
Exec. mode 00000000 00000000 ........
System 00000000 00000000 ........
Local copy
| |||||
| 2594.2 | ORAREP::HERON::GODFRIND | Oracle Rdb Engineering | Thu Feb 13 1997 06:50 | 28 | |
The T locks are the termination locks. >Resource: 00000054 00000028 (...T... Status: ASYNC PROTCT > Length 16 00000001 35E07AC8 Hz`5.... > Exec. mode 00000000 00000000 ........ > System 00000000 00000000 ........ Each process connected to the database holds one such lock (the resource name includes the pid of that process). The monitor has a permanent request for all of those locks queued (it will notice a process failure when being granted that lock). Those locks are taken with the NODLCKWT option, and are therefore not seen as causing deadlocks. The actual problem seems to be caused by process 35e07f8a. That process is holding on to the F lock (the freeze lock). It is also waiting to be deleted: >Process index: 038A Name: BFAB_35E0870E Extended PID: 35E07F8A >------------------------------------------------------------------ >Status : 00040023 res,delpen,respen,phdres >Status2: 00000001 quantum_resched The freeze lock is used during database recovery. It should be given up by all processes so that recovery can proceed. If one process does not release it, then the database will seem to hang. Have you got a way to find out what process 35E07F8A was doing ? /albert | |||||
| 2594.3 | HOTRDB::PMEAD | Paul, [email protected], 719-577-8032 | Thu Feb 13 1997 10:27 | 12 | |
Hmmm. Indeed Albert is right as always. It appears this bit of trivia
leaked from my brain.
OK. Sorry for leading you astray. Indeed, someone needs to determine
why the process doesn't give up the freeze lock. There have been tons
of notes on this in the Rdb notesfiles. In practically every case
turning off VMS dynamic lock remastering has prevented the problem from
occurring.
fwiw,
Paul
| |||||
| 2594.4 | PE1 *was* 0, so... | NOMAHS::SECRIST | Rdb WWS; [email protected] | Thu Feb 13 1997 13:53 | 46 |
The process was running a homegrown application that performs
a minor amount of DBMS reads in addition to other things. This is
the second time this problem has occurred. Before, the suspect
process was running a different executable (though accessing the same
DB).
PE1 across the cluster was set to 0. The dumpfile from the crash
shows this:
SDA> exam SGN$GL_PE1
SGN$GL_PE1: 00000000 "...."
Again, I have asked the customer to try PE1 of 1 to see if the
problem goes away.
Here's the process summary from the dumpfile:
Process index: 038A Name: BFAB_35E0870E Extended PID: 35E07F8A
------------------------------------------------------------------
Status : 00040023 res,delpen,respen,phdres
Status2: 00000001 quantum_resched
PCB address B5221B40 JIB address B45E5880
PHD address E33A2800 Swapfile disk address 00000000
Master internal PID 0021030E Subprocess count 1
Internal PID 001F038A Creator internal PID 0021030E
Extended PID 35E07F8A Creator extended PID 35E0870E
State LEF Termination mailbox 1650
Current priority 10 AST's enabled KESU
Base priority 5 AST's active NONE
UIC [00030,000036] AST's remaining 510
Mutex count 0 Buffered I/O count/limit 512/512
Waiting EF cluster 0 Direct I/O count/limit 128/128
Starting wait time 1A001A1A BUFIO byte count/limit 39904/40032
Event flag wait mask FEFFFFFF # open files allowed left 181
Local EF cluster 0 E2000001 Timer entries allowed left 10
Local EF cluster 1 98000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 865
Global cluster 3 pointer 00000000 Global WS page count 439
Do you see anything of interest ?
Regards,
rcs
| |||||
| 2594.5 | PE1 of 100 ? | NOMAHS::SECRIST | Rdb WWS; [email protected] | Fri Feb 14 1997 14:26 | 12 |
The customer finally set PE1 to 100. I recall a Digital recommendation
some moons ago that if you felt compelled to set PE1 to something, to
set it small so at least small trees should migrate. Since it only
takes 1 tree to potentially hose a database if that is the root of a
locking logjam, my inclination is to set it to -1 and shut it off
entirely or 1 which effectively does the same thing. Does anyone
think 100 would improve matters at all since a lock tree could be
fairly small (but deadly) ?
Regards,
rcs
| |||||
| 2594.6 | DUCATI::LASTOVICA | Is it possible to be totally partial? | Fri Feb 14 1997 15:16 | 1 | |
set it to 1. | |||||