
Conference iosg::all-in-1_v30

Title:	OLD ALL-IN-1 (tm) Support Conference
Notice:	Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:	IOSG::PYE

Created:	Thu Jan 30 1992
Last Modified:	Tue Jan 23 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4343
Total number of notes:	18308

3311.0. "System hung due to FCS/USER having exclusive lock" by GIDDAY::SETHI (Ahhhh (-: an upside down smile from OZ) Wed Sep 22 1993 02:44

    Hi All,
    
    A customer has ALL-IN-1 IOS 3.0-1 installed on OpenVMS 5.5-2.  We have
    had a very odd problem on their system which required us to crash a
    node in the cluster.  All processes on the cluster were hung because a
    user acquired an exclusive lock on a file and then tried to get
    another exclusive lock on the same file.  Because the file was locked
    for exclusive access the 2nd request couldn't be serviced, so the chain
    reaction started.
    
    At the time this problem occurred the user had been using the FCS and
    had been editing documents and working on shared drawers.  The user had
    logged out successfully but for some reason the process was still
    hanging around.  ANALYZE/SYSTEM showed the following:
    
      0430  832EA5E0             Busy       NET9299:
    
    Process index: 019F   Name: COHEN LORRAE   Extended PID: 2141399F
    -----------------------------------------------------------------
    Process status:  0204002B   RES,DELPEN,INQUAN,RESPEN,PHDRES
    
    Note the DELPEN (delete pending) state, and that the process was
    accessing the net driver NET9299:.  This may be due to some FCS
    activity, but that's just a guess.  The customer does not have the
    Distributed Shared Filecab option active on their systems.
    
    What we noticed on the system was that people couldn't read or send
    EMs.  This was due to the SDAF having been locked by a process.  Doing a
    directory/full on SDAF_D hung the process.  The dump was incomplete so
    we are unable to determine what channels were open and what was on the
    stack (instruction being executed).
    
    It may be difficult to pinpoint why this problem occurred, but perhaps
    someone knows where exclusive locks are granted within ALL-IN-1 IOS or
    the FCS.  If so, can it be determined whether they could cause this
    type of problem?
    
    I will reply to the base note and add comments by the OpenVMS group; I
    got some assistance to have the dump analysed.  The dump file is
    300,000 blocks; the size should have been 1.2+ million.  Yes, they do
    have 512 Mbytes of memory !!!
    
    I must add that as soon as we crashed this one node, all the processes
    hung on the other nodes started working.
    
    Regards,
    
    Sunil

T.R	Title	User	Personal Name	Date	Lines
3311.1	OpenVMS engineers comments and observations	GIDDAY::SETHI	Ahhhh (-: an upside down smile from OZ	`Wed Sep 22 1993 02:52`	87
    Here are the notes from the OpenVMS group that may give an insight into
    the problem.  I would like to mention that the base note is based upon
    my observations and notes taken when I talked to the customer and when
    I was logged onto their system.  If you require further information
    please let me know and I will do what I can.

    A process (in=19f) was in RES,DELPEN,INQUAN,RESPEN,PHDRES state.
    It was waiting on this lock:

    Lock data:

    Lock id:  20E20028   PID:     004E019F   Flags:   VALBLK  SYNCSTS SYSTEM
    Par. id:  01000072   SUBLCKs:        0            NOQUOTA
    LKB:      87737D00   BLKAST:  00000000
    PRIORTY:      0000   RQSEQNM:     19CE

    Waiting for  EX   00000000-FFFFFFFF

    Resource:      0001002B 24534D52  RMS$+...    Status:  ASYNC   NOQUOTA
     Length   26   5F464144 53020000  ...SDAF_
     Exec. mode    00202020 20445F41  A_D    .
     System        00000000 00000000  ........

    Local copy

    for which the resource is:

    Resource database
    -----------------
    Address of RSB:  85747410   GGMODE:  EX   Status:  VALID   WTFULRG CVTFULR
    Parent RSB:      85B10A10   CGMODE:  EX
    Sub-RSB count:          0   FGMODE:  EX
    Lock Count:           137   CSID:    00000000
    BLKAST count:           1   RQSEQNM: 1C69

    Resource:      0001002B 24534D52  RMS$+...    Valblk:  000B3400 0000015E
     Length   26   5F464144 53020000  ...SDAF_             00000000 00000000
     Exec. mode    00202020 20445F41  A_D    .
     System        00000000 00000000  ........    Seqnum:  000021BE

    Granted queue (Lock ID / Gr mode / Range):
     4CB70034  EX 00000000-FFFFFFFF

    Conversion queue (Lock ID / Gr mode / Range -> Rq mode / Range):
     2B007632  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     5ADF0001  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     0800A127  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     4200D2A0  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     09004F69  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF

    Waiting queue (Lock ID / Rq mode / Range):
     20E20028  EX 00000000-FFFFFFFF
     6600B36A  EX 00000000-FFFFFFFF
     7B0016DD  EX 00000000-FFFFFFFF
     5900FB24  EX 00000000-FFFFFFFF
      .
      .  (hundreds)
      .

    but it has been granted to lock 4CB70034 (in EX mode). This lock
    (unfortunately or by mistake) belongs to the same process. So, the same
    process requests a lock in EX mode for the resource for which it already
    has granted lock in EX mode !!!

    One thing is not clear (it cannot be seen in the dump), and that is a
    busy channel to a NET device (seen while looking at the live system):

    NET9299                        Unknown          UCB address:  831F5870

    Device status:   00010010 online,deleteucb
    Characteristics: 0C1C2000 net,avl,mnt,mbx,idv,odv
                     00000000

    Owner UIC [016002,000025]  Operation count     12  ORB address  831F5920
          PID        004E019F  Error count          0  DDB address  878FA700
    Class/Type          00/00  Reference count      1  DDT address  82F4A878
    Def. buf. size        256  BOFF              009F  VCB address  82F50450
    DEVDEPEND        00000001  Byte count        0005  CRB address  878FA680
    DEVDEPND2        00000000  SVAPTE        8557BA90  I/O wait queue  empty
    FLCK index             34  DEVSTS            0002
    DLCK address     8126D2F0
    Charge PID       004E019F
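
    The condition described above (a process waiting for EX behind an EX
    lock that it holds itself) can be checked mechanically once the queues
    have been copied out of the dump.  The following is only a rough Python
    sketch, not an SDA feature.  The Lock record and the self_blocked()
    helper are hypothetical names, and the owner PID of lock 4CB70034 is
    taken from the statement above that it belongs to the same process, as
    the dump extract itself does not show it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    lock_id: str  # lock id as printed in the dump
    pid: str      # owning process (internal PID)
    mode: str     # lock mode, e.g. "EX" or "NL"
    queue: str    # "granted", "conversion" or "waiting"


def self_blocked(locks):
    """Return (waiter, holder) lock-id pairs where a process is waiting
    for EX behind an EX lock granted to that same process."""
    holders = [l for l in locks if l.queue == "granted" and l.mode == "EX"]
    waiters = [l for l in locks if l.queue == "waiting" and l.mode == "EX"]
    return [(w.lock_id, h.lock_id)
            for w in waiters
            for h in holders
            if w.pid == h.pid]


# Only the two locks whose owner is known are listed here.
locks = [
    # Assumption: owner PID as stated in the note, not shown in the dump.
    Lock("4CB70034", "004E019F", "EX", "granted"),
    Lock("20E20028", "004E019F", "EX", "waiting"),
]

print(self_blocked(locks))
```

    The other entries in the conversion and waiting queues are left out
    because their owning PIDs are not given in the extract.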
3311.2	The lock manager does not understand threading	IOSG::MARCHANT	I'd sink therefore I swam	`Wed Sep 22 1993 14:18`	29
    Sunil,

    > but it has been granted to lock 4CB70034 (in EX mode). This lock
    > (unfortunately or by mistake) belongs to the same process. So, the same
    > process requests a lock in EX mode for the resource for which it already
    > has granted lock in EX mode !!!

    It may belong to the same process (the FCS) but as far as the FCS is
    concerned the locks belong to different threads.  Unfortunately the VMS
    lock manager does not understand threading, and so the lock manager has
    tried to resolve what it thinks is a deadlock situation, and this has
    caused your customer's problem.

    Paul Chinnick came across this problem whilst he was here, and one of
    his suggested workarounds was:

    ``
    The simplest form of workaround available is to increase the timeout
    period for deadlock search initiation which is specified by the SYSGEN
    parameter DEADLOCK_WAIT.  Increasing this value to 30 seconds or more
    would allow extra processing time to complete resource operations and
    hence prevent premature and false detection of deadlocks.

    Unfortunately, such an increase would delay the detection of any
    genuine deadlocks which may adversely impact other software and
    applications including such sensitive components as database systems.
    ''

    This last part is an important consideration to take account of before
    using this workaround.

    Cheers,

    Paul.
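
    To make the thread-versus-process point concrete, here is a toy Python
    model.  It is a sketch under stated assumptions and not the VMS lock
    manager: ToyLockManager, request_ex(), the pid strings and the thread
    names are all hypothetical, and the only behaviour modelled is that
    exclusive ownership is recorded per process, so a second thread of the
    same process ends up queued behind its own process.

```python
class ToyLockManager:
    """Toy model only: exclusive (EX) ownership is tracked per process."""

    def __init__(self):
        self.holder = {}   # resource name -> pid currently holding EX
        self.waiters = {}  # resource name -> list of pids queued for EX

    def request_ex(self, pid, thread, resource):
        # 'thread' is accepted but deliberately ignored: this model,
        # like the behaviour described in the note, sees only the process.
        current = self.holder.get(resource)
        if current is None:
            self.holder[resource] = pid
            return "granted"
        self.waiters.setdefault(resource, []).append(pid)
        if current == pid:
            return "queued behind own process"
        return "queued"


mgr = ToyLockManager()
# Hypothetical names: two threads of one server process, then another user.
first = mgr.request_ex(pid="FCS", thread="thread-1", resource="SDAF_D")
second = mgr.request_ex(pid="FCS", thread="thread-2", resource="SDAF_D")
third = mgr.request_ex(pid="USER", thread="main", resource="SDAF_D")
print(first, "|", second, "|", third)
```

    In this model the second request can only be satisfied if the first
    thread releases its lock, and every later requester queues up behind
    it, which is the chain reaction described in the base note.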
3311.3	787.8 ?	AIMTEC::VOLLER_I	Gordon (T) Gopher for President	`Thu Sep 23 1993 18:47`	10
    Sunil,

    How did the user exit ALL-IN-1 ?

    Have you checked Note 787.8 ? It could be a variation on the
    process run down problem that we analyzed.

    Cheers,

    Iain.
3311.4	It appears to be a similar problem	BUSHIE::SETHI	My name is Sunil without the H	`Fri Sep 24 1993 00:35`	17
    Hi Iain,

    >How did the user exit ALL-IN-1 ?

    EX.  When the user tried to log in some time later she was given an
    error message.  The customer is not too sure what it said.

    >Have you checked Note 787.8 ? It could be a variation on the
    >process run down problem that we analyzed.

    I can confirm that it appears to be so.  We will try to get as much
    information from the customer as possible, but don't hold your breath.

    Regards,

    Sunil
3311.5	Dump file still available ??	AIMTEC::VOLLER_I	Gordon (T) Gopher for President	`Fri Sep 24 1993 15:33`	8
    Sunil,

    If the dump is still available somewhere it should be easy to
    confirm/disprove the process rundown theory.

    Let me know if I can help at all.

    Iain.