Conference iosg::all-in-1_v30

Title:*OLD* ALL-IN-1 (tm) Support Conference
Notice:Closed - See Note 4331.l to move to IOSG::ALL-IN-1
Moderator:IOSG::PYE
Created:Thu Jan 30 1992
Last Modified:Tue Jan 23 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:4343
Total number of notes:18308

3311.0. "System hung due to FCS/USER having exclusive lock" by GIDDAY::SETHI (Ahhhh (-: an upside down smile from OZ) Wed Sep 22 1993 03:44

    Hi All,
    
    A customer has ALL-IN-1 IOS 3.0-1 installed on OpenVMS 5.5-2.  We have
    had a very odd problem on their system which required us to crash a
    node in the cluster.  All processes on the cluster were hung because a
    user's process acquired an exclusive lock on a file and then tried to
    get another exclusive lock on the same file.  Because the file was
    already locked for exclusive access the second request couldn't be
    serviced, so the chain reaction started.
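
    To make the chain reaction concrete, here is a minimal Python sketch of
    a single lock resource.  It is only a toy model, not the VMS lock
    manager: the class and method names are made up for illustration, the
    conversion queue is left out, and the third PID is hypothetical.  The
    one thing it relies on is the standard lock-mode compatibility table,
    in which EX is compatible only with NL and the owner of a lock plays no
    part in the compatibility check.

```python
from dataclasses import dataclass, field

# Standard lock-mode compatibility: COMPATIBLE[held] is the set of modes
# that can be granted while a lock in mode `held` is already granted.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


@dataclass
class Resource:
    """Toy model of one resource with a granted queue and a waiting queue."""
    granted: list = field(default_factory=list)   # (lock_id, pid, mode)
    waiting: list = field(default_factory=list)   # (lock_id, pid, mode)

    def enqueue(self, lock_id, pid, mode):
        """Grant the request if it is compatible with every granted lock
        and nobody is already waiting; otherwise queue it.  The owning
        PID is deliberately NOT consulted."""
        ok = not self.waiting and all(
            mode in COMPATIBLE[held] for _, _, held in self.granted
        )
        (self.granted if ok else self.waiting).append((lock_id, pid, mode))
        return ok

    def self_blocked(self):
        """Lock IDs that wait behind a lock granted to the same PID."""
        holders = {pid for _, pid, _ in self.granted}
        return [lid for lid, pid, _ in self.waiting if pid in holders]


sdaf = Resource()
sdaf.enqueue("4CB70034", "004E019F", "EX")      # granted
sdaf.enqueue("20E20028", "004E019F", "EX")      # same PID, has to wait
sdaf.enqueue("6600B36A", "HYPOTHETICAL", "EX")  # everyone else queues up
print(sdaf.self_blocked())
```

    In this model the second request can never be granted, because the
    only lock in its way belongs to the requester itself, and every later
    EX request simply piles up behind it.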
    
    At the time this problem occurred the user had been using the FCS,
    editing documents and working on shared drawers.  The user had logged
    out successfully but for some reason the process was still hanging
    around.  ANALYZE/SYSTEM showed the following:
    
      0430  832EA5E0             Busy       NET9299:
    
    Process index: 019F   Name: COHEN LORRAE   Extended PID: 2141399F
    -----------------------------------------------------------------
    Process status:  0204002B   RES,DELPEN,INQUAN,RESPEN,PHDRES
    
    Note the DELPEN (delete pending) state, and that the process was
    accessing the network device NET9299:.  This may be due to some FCS
    activity, but that's just a guess.  The customer does not have the
    Distributed Shared Filecab option active on their systems.
    
    What we noticed on the system was that people couldn't read or send
    EMs; this was because the SDAF had been locked by a process.  Doing a
    directory/full on SDAF_D hung the process.  The dump was incomplete so
    we are unable to determine what channels were open and what was on the
    stack (instruction being executed).
    
    It may be difficult to pinpoint why this problem occurred, but perhaps
    someone knows where exclusive locks are taken out within ALL-IN-1 IOS
    or the FCS.  If so, can it be determined whether they could cause this
    type of problem?
    
    I will reply to the base note and add comments from the OpenVMS group;
    I got some assistance to have the dump analysed.  The dump file is
    300,000 blocks, whereas the size should have been 1.2+ million.  Yes,
    they do have 512 Mbytes of memory !!!
    
    I must add that as soon as we crashed this one node, all the processes
    hung on the other nodes started working again.
    
    Regards,
    
    Sunil

3311.1. "OpenVMS engineers comments and observations" by GIDDAY::SETHI (Ahhhh (-: an upside down smile from OZ) Wed Sep 22 1993 03:52 (87 lines)

    Here are the notes from the OpenVMS group that may give an insight into
    the problem.
    
    I would like to mention that the base note is based upon my
    observations and notes taken when I talked to the customer and when I
    was logged onto their system.
    
    If you require further information please let me know and I will do
    what I can.
    
    A process (index 019F) was in the RES,DELPEN,INQUAN,RESPEN,PHDRES
    state.  It was waiting on this lock:
    
    Lock data:
    
    Lock id:  20E20028   PID:     004E019F   Flags:   VALBLK  SYNCSTS SYSTEM
    Par. id:  01000072   SUBLCKs:        0            NOQUOTA
    LKB:      87737D00   BLKAST:  00000000
    PRIORTY:      0000   RQSEQNM:     19CE
    
    Waiting for     EX   00000000-FFFFFFFF
    
    Resource:      0001002B 24534D52    RMS$+...  Status:  ASYNC   NOQUOTA
     Length   26   5F464144 53020000    ...SDAF_
     Exec. mode    00202020 20445F41    A_D    .
     System        00000000 00000000    ........
    
    Local copy
    
    for which the resource is:
    
    Resource database
    -----------------
    Address of RSB:  85747410  GGMODE:       EX  Status: VALID   WTFULRG CVTFULR
    Parent RSB:      85B10A10  CGMODE:       EX
    Sub-RSB count:          0  FGMODE:       EX
    Lock Count:           137  CSID:   00000000
    BLKAST count:           1  RQSEQNM:    1C69
    
    Resource:      0001002B 24534D52   RMS$+...  Valblk: 000B3400 0000015E
     Length   26   5F464144 53020000   ...SDAF_          00000000 00000000
     Exec. mode    00202020 20445F41   A_D    .
     System        00000000 00000000   ........  Seqnum: 000021BE
    
    Granted queue (Lock ID / Gr mode / Range):
     4CB70034  EX 00000000-FFFFFFFF
    
    Conversion queue (Lock ID / Gr mode / Range -> Rq mode / Range):
     2B007632  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     5ADF0001  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     0800A127  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     4200D2A0  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
     09004F69  NL 00000000-FFFFFFFF / EX 00000000-FFFFFFFF
    
    Waiting queue (Lock ID / Rq mode / Range):
     20E20028  EX 00000000-FFFFFFFF         6600B36A  EX 00000000-FFFFFFFF
     7B0016DD  EX 00000000-FFFFFFFF         5900FB24  EX 00000000-FFFFFFFF
     .
     . (hundreds)
     .
    
    but it has been granted to lock 4CB70034 (in EX mode). This lock
    (unfortunately or by mistake) belongs to the same process. So, the same 
    process requests a lock in EX mode for the resource for which it already 
    has a granted lock in EX mode !!!
    One thing is not clear (it cannot be seen in the dump), and that is a
    busy channel to a NET device (seen while looking at the live system):
    
    NET9299                                 Unknown           UCB address: 831F5870
    
    Device status:   00010010 online,deleteucb
    Characteristics: 0C1C2000 net,avl,mnt,mbx,idv,odv
                     00000000
    
    Owner UIC [016002,000025]   Operation count         12   ORB address   831F5920
    PID        004E019F   Error count              0   DDB address   878FA700
    Class/Type          00/00   Reference count    1   DDT address   82F4A878
    Def. buf. size        256   BOFF            009F   VCB address   82F50450
    DEVDEPEND        00000001   Byte count      0005   CRB address   878FA680
    DEVDEPND2        00000000   SVAPTE        8557BA90   I/O wait queue empty
    FLCK index             34   DEVSTS          0002 DLCK address    8126D2F0
    Charge PID       004E019F
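
    A side note on reading the resource name in the lock display above:
    each row of the hex dump is printed with the highest-addressed longword
    on the left, while the ASCII column runs left to right.  The short
    Python sketch below (function names are illustrative only, not part of
    SDA or any other tool) converts the four rows back into bytes and
    reproduces the ASCII column, assuming the usual little-endian VAX byte
    order and that "Length 26" is a decimal byte count.

```python
def sda_rows_to_bytes(rows):
    """Convert rows of an SDA-style hex dump into bytes in memory order.

    Each row holds longwords printed right to left, so the rightmost
    longword is the lowest address; each longword is little-endian.
    """
    out = bytearray()
    for row in rows:
        for word in reversed(row.split()):
            out += int(word, 16).to_bytes(4, "little")
    return bytes(out)


def printable(data):
    """Render bytes the way the dump does: '.' for non-printing values."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


# The four resource-name rows exactly as shown in the lock display.
rows = [
    "0001002B 24534D52",
    "5F464144 53020000",
    "00202020 20445F41",
    "00000000 00000000",
]

name = sda_rows_to_bytes(rows)[:26]   # assumed: Length 26 means 26 bytes
print(printable(name))
```

    The name starts with the RMS$ prefix and contains the text SDAF_A_D,
    which is consistent with the lock being an RMS file lock on the SDAF.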
    
3311.2. "The lock manager does not understand threading" by IOSG::MARCHANT (I'd sink therefore I swam) Wed Sep 22 1993 15:18 (29 lines)

Sunil,

>    but it has been granted to lock 4CB70034 (in EX mode). This lock
>    (unfortunately or by mistake) belongs to the same process. So, the same 
>    process requests a lock in EX mode for the resource for which it already 
>    has granted lock in EX mode !!!

It may belong to the same process (the FCS) but as far as the FCS is concerned
the locks belong to *different* threads.  Unfortunately the VMS lock manager
does not understand threading, and so the lock manager has tried to resolve
what it thinks is a deadlock situation, and this has caused your customer's
problem.

Paul Chinnick came across this problem whilst he was here, and one of his
suggested workarounds was:
    `` The simplest form of workaround available is to increase the timeout
    period for deadlock search initiation which is specified by the SYSGEN
    parameter DEADLOCK_WAIT. Increasing this value to 30 seconds or more would
    allow extra processing time to complete resource operations and hence
    prevent premature and false detection of deadlocks. Unfortunately, such an
    increase would delay the detection of any genuine deadlocks which may
    adversely impact other software and applications including such sensitive
    components as database systems. ''

This last part is an important consideration to take account of before using
this workaround.
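
The trade-off in the quoted workaround can be sketched in a few lines of
Python.  This is a deliberately simplified model, not how the lock manager
is implemented: it assumes a deadlock search is started for any request
that has waited longer than DEADLOCK_WAIT seconds, and the request names,
wait times and the value 10 are all made up for illustration.

```python
import math


def searches_started(requests, deadlock_wait):
    """Return {request name: time in seconds at which a search starts}.

    `requests` maps a name to the number of seconds the request would
    wait before being granted (math.inf if it would never be granted).
    A request granted within `deadlock_wait` seconds triggers no search.
    """
    return {
        name: deadlock_wait
        for name, wait in requests.items()
        if wait > deadlock_wait
    }


requests = {
    "slow_fcs_operation": 15.0,      # hypothetical: slow, but would finish
    "genuine_deadlock": math.inf,    # hypothetical: would never be granted
}

print(searches_started(requests, 10))   # illustrative lower setting
print(searches_started(requests, 30))   # the value suggested in the quote
```

With the lower value both requests are searched, including the one that
was merely slow.  With 30 seconds the slow operation is left alone, but the
genuine deadlock is not looked for until 30 seconds have passed, which is
exactly the cost the quote warns about.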

Cheers,
    Paul.

3311.3. "787.8 ?" by AIMTEC::VOLLER_I (Gordon (T) Gopher for President) Thu Sep 23 1993 19:47 (10 lines)

    Sunil,
    
    	How did the user exit ALL-IN-1 ?
    
    	Have you checked Note 787.8 ? It could be a variation on the
        process run down problem that we analyzed. 
    
    Cheers,
    
    Iain.

3311.4. "It appears to be a similar problem" by BUSHIE::SETHI (My name is Sunil without the H) Fri Sep 24 1993 01:35 (17 lines)

    Hi Iain,                                                  
    
    >How did the user exit ALL-IN-1 ?
    
    EX.  When the user tried to log in some time later she was given an
    error message.  The customer is not too sure what it said.
    
    >Have you checked Note 787.8 ? It could be a variation on the
    >process run down problem that we analyzed.
    
    I can confirm that it appears to be so.  We will try to get as much
    information from the customer as possible, but don't hold your breath.
        
    Regards,
    
    Sunil
    
3311.5. "Dump file still available ??" by AIMTEC::VOLLER_I (Gordon (T) Gopher for President) Fri Sep 24 1993 16:33 (8 lines)

    Sunil,
    
      If the dump is still available somewhere it should be easy to
      confirm/disprove the process rundown theory.
    
      Let me know if I can help at all.
    
    Iain.