| > SDA> EXA 80AB51C0;100
> 80AB51E0 00001040 02040000 00010000 ........@...�Q�. 80AB51C0
> 00000000 00000000 00000000 00000000 ................ 80AB51D0
> 80AB51E8 80AB51E8 80AB51E0 80AB51E0 �Q�.�Q�.�Q�.�Q�. 80AB51E0
> 80AB51F8 80AB51F8 80AB51F0 80AB51F0 Q�.Q�.�Q�.�Q�. 80AB51F0
> 80AB5208 80AB5208 80AB5200 80AB5200 .R�..R�..R�..R�. 80AB5200
> 80AB5218 80AB5218 80AB5210 80AB5210 .R�..R�..R�..R�. 80AB5210
> 80AB5228 80AB5228 80AB5220 80AB5220 R�. R�.(R�.(R�. 80AB5220
> .
> .
> 80AB53D8 80AB53D8 80969080 80969080 ........ 80AB53D0
> .
> .
This looks like the resource hash table. For any resource, for example a
message queue, there is a name which can be looked up in the hash table.
The first longword is used as mutex. Then there are other data which describe
this table (length, etc.). Then there is a pointer to the first resource
header: 80AB51E0. The other resource headers follow. Each resource header
has a flink and a blink. Hence all the resource headers with identical
entries pointing to itself are unused slots in the hash table. From your output
it seems there is only 1 resource used, header 80969080. But the length of the
table is 0x1040.
> SDA> exa 80969080;c0
> 0004EC7F 0F8E3039 80AB53D0 80AB53D0 S�.S�.90...�.. 80969080
> 00019000 00000200 000000C8 00000001 ....�........... 80969090
> 000000C0 00000003 00C00000 00000000 ......�.....�... 809690A0
> 80AF12C0 80C7EA40 00000000 00000000 ........@��.�.. 809690B0
> 809690C8 809690C8 809690C0 809690C0 �...�...�...�... 809690C0
> 00000026 00000000 0000002D 00000000 ....-.......&... 809690D0
> 00010A01 000001C4 FFFFFFFF 36305953 SY06....�....... 809690E0
> 000141B6 0001000A 00010A01 0001000A ............�A.. 809690F0
> 36D48B78 32EAECCC 000001C4 000001DE ...�...���2x.�6 80969100
> 12F81828 32FB3595 2BB8312C 32FB35CC �5�2,1+.5�2(.�. 80969110
> 00000020 20205359 53534D56 50584102 .AXPVMSSYS ... 80969120
> 0004EC7F 00000000 00000000 00030F8E .............�.. 80969130
This is the resource. This record again has a flink and a blink, pointing back:
only one resource in this hash slot. It looks like a resource describing a
message queue. Do they use message queues? There are 3 messages in the message
queue, a total of 0xc0 bytes. The messages are in a linked list, header is at
809690B8. Nothing pending, no send, no receive. Pids and UIDs follow, and time
stamps (in posix format). I can't see anything unusual in these structures.
Back to the problem. It seems that these processes are waiting for the hash
table mutex. There should be one process, which currently holds that mutex.
The mutex is aquired for searching/updating the hash table and/or updating
resource structures.
This process may have identified itself as last sender or receiver in this or
another message queue (or another resource structure). At offset 0x80 and 0x84
there are the pids of the last sender and last receiver. In this structure the
last operation should be a send. The sender's id is 000001DE, the send time is
32FB35CC 2BB8312C (posix sec and nanosec format). This is obviously bigger than
the creation time and the last receive time (32EAECCC 36D48B78 and 32FB3595
12F81828). The last send was done 0xcc-0x95 seconds after the last receive.
If this is the only resource, then 1de was the last one successfully sending a
message to this queue. If it successfully did that it should have released the
mutex. What happened to this process, is it still around?
On the other hand, if the culprit locked the mutex and failed to identify
itself in this structure, then you have to look at all other posix processes,
not waiting for this mutex. Are there many around? What's their status?
And going some more steps back. Did it work before? Is there anything new onthe
system, new version of VMS, the application? How often does this happen, is
there anything unusual in the application/system going on at the same time?
I'm not sure if this really helps, but I hope it's a starter,
Hartmut
|
|
Thank you for quick response.
The problem occurred just one time.The customer has been using
this system since last October.The application uses the message queue.
I have a forced crash dump and analizing it but I don't have a
knowledge of posix internal, so it is difficult to analize it.
According to previous reply, owner of MUTEX is 0001DE. It can be
confirmed because the process's prioprity is 16.
(VMS raises the priority of process who holding MUTEX to 16.)
It seems that this process is looping in POSIX$KERNEL without
releaseing MUTEX.So other posix processes hang.And also it seems
that R3 points to the resource.This resource has C7 messages.
But,I don't understand the relation between these data structure and
process loop.
Any comments are appreciated.
Regards,
Makito Kuniyoshi/CSC Tokyo
Process index: 00DE Name: posix$000001DE Extended PID: 000001DE
-------------------------------------------------------------------
Process status: 000C8001 RES,NOACNT,PHDRES,HIBER
Required capabilities: 0000000C QUORUM,RUN
PCB address 80A98A80 JIB address 80CBE480
PHD address 823E4000 Swapfile disk address 00000000
Master internal PID 000100DE Subprocess count 0
Internal PID 000100DE Creator internal PID 00000000
Extended PID 000001DE Creator extended PID 00000000
State CUR Termination mailbox 0000
Previous CPU Id 00000001 Current CPU Id 00000000
Previous ASNSEQ 00000000004414C2 Previous ASN 000000000000003D
Current priority 16 # of threads 0000000000000001
Initial process priority 4 Delete pending count 0
Base priority 16 AST's active NONE
UIC [00012,000001] AST's remaining 511
Mutex count 1 Buffered I/O count/limit 500/500
Waiting EF cluster 0 Direct I/O count/limit 500/500
Abs time of last event 0662AE25 BUFIO byte count/limit 595328/595328
Event flag wait mask F7FFFFFF # open files allowed left 465
Process index: 00DE Name: posix$000001DE Extended PID: 000001DE
-------------------------------------------------------------------
Active registers for current process at bugcheck time
-----------------------------------------------------
R0 = FFFFFFFF 80B69908 R1 = 00000000 0011BA48 R2 = FFFFFFFF 81565CD8
R3 = FFFFFFFF 80B698C0 R4 = FFFFFFFF 80B698F8 R5 = FFFFFFFF 80AB51C0
R6 = FFFFFFFF 8155C090 R7 = 00010031 00000000 R8 = 00000000 00000000
R9 = 00050031 09295064 R10 = 00018C28 00000000 R11 = 80B1C300 7F9C0EF8
R12 = 00000000 7F9C3A88 R13 = FFFFFFFF 81579248 R14 = 00000000 00000000
R15 = 00000000 00000130 R16 = 00000000 0000064C R17 = 00000000 32EAED0F
R18 = FFFFFFFF 80A72340 R19 = 00000000 00000006 R20 = 00000000 00000020
R21 = 80B98E00 80AF3780 R22 = 80B98E00 80AF3780 R23 = 00000000 00050031
R24 = 00000000 00000000 R25 = FFFFFFFF 80B69908 R26 = FFFFFFFF 815352FC
R27 = FFFFFFFF 815352C0 R28 = FFFFFFFF 84AC318C R29 = 00000000 7FF91DB0
PC = FFFFFFFF 8006EF34 PS = 38000000 00001F00 KSP = FFFFFFFF 848F1FF8
ESP = 00000000 7FF96000 SSP = 00000000 7FF9E000 USP = 00000000 7F9C0EE0
PTBR = 00000000 000000F4 AST{SR/EN} = 0000002E ASN = 00000000 0000002F
SDA> exa @r3;c0
000534C9 0F8E3039 80AB56A0 80AB56A0 .V�.V�90..�.. 80B698C0
00019000 00000200 000000C8 00000001 ....�.......... 80B698D0
00018C28 000000C7 00C00000 00000000 ......��..(... 80B698E0
80C1A480 809357C0 00000000 0000000A ........��...�. 80B698F0
80AA8F40 80AF01C0 80B69900 80B69900 ..�..��@.� 80B69900
00000000 00000000 00000000 00000000 ................ 80B69910
00010A01 000001DE FFFFFFFF 32355953 SY52.......... 80B69920
000141B6 0001000A 00010A01 0001000A ............��.. 80B69930
06419000 32EAED0F 000001DE 00005A84 .Z.....��2..A. 80B69940
2BF3CAFC 32FB35CC 12746744 32FB357F .5��Dgt.̵����� 80B69950
00000020 20205359 53534D56 50584102 .AXPVMSSYS ... 80B69960
000534C9 00000000 00000000 00030F8E ............�.. 80B69970
Call Frame Information
----------------------
Stack Frame Procedure Descriptor
Flags: Base Register = FP, No Jacket, Native
Procedure Entry: FFFFFFFF 84AE5A90 POSIX$KERNEL_PRO+4BA90
Return address on stack = FFFFFFFF 84AE65BC POSIX$KERNEL_PRO+4C5BC
Registers saved on stack
------------------------
7FF91DC8 FFFFFFFF 81565CF8 Saved R2 POSIX$KERNEL_NPRW+0A8F8
7FF91DD0 00000000 7FF76964 Saved R3
7FF91DD8 FFFFFFFF 80B1D3C0 Saved R4
7FF91DE0 FFFFFFFF 80AB51C0 Saved R5
7FF91DE8 00000000 7FF91DF0 Saved R29
SDA>
Call Frame Information
----------------------
Stack Frame Procedure Descriptor
Flags: Base Register = FP, No Jacket, Native
Procedure Entry: FFFFFFFF 84AE5DE0 POSIX$KERNEL_PRO+4BDE0
Return address on stack = FFFFFFFF 84AD2894 POSIX$KERNEL_PRO+38894
Registers saved on stack
------------------------
7FF91E10 FFFFFFFF 81562990 Saved R2 POSIX$KERNEL_NPRW+07590
7FF91E18 FFFFFFFF 8155BFD0 Saved R3 POSIX$KERNEL_NPRW+00BD0
7FF91E20 7FF76964 00000000 Saved R4
7FF91E28 FFFFFFFF 80B1D3C0 Saved R5
7FF91E30 00000000 7FF76970 Saved R6
7FF91E38 00000000 00000000 Saved R7
7FF91E40 00000000 00050031 Saved R8
7FF91E48 FFFFFFFF 8155C090 Saved R9 POSIX$KERNEL_NPRW+00C90
7FF91E50 7FF76964 00000000 Saved R10
7FF91E58 00000000 7FF91E60 Saved R29
|
| I'm not sure if I found the pc's and the calls in the maps and listings.
Are you sure about V2.0 and ECO 3? It looks like the last call was from a
fifo routine. Do they use fifo's? The called routine should (build a
resource name and) aquire a VMS lock, but I want to double check that. Does
it make sense with what you can see in the dump?
Anyway, I'm not with posix engineering, if you want this problem solved you
should QAR, SPR or IPMT this case (whatever you and your customer think is
apropriate). It may be a simple resource problem on the customer's site it
may also be a problem within posix.
Hartmut
|