Conference csc32::consolemanager

Title: POLYCENTER Console Manager
Notice: Kits, Scans, Docs on CSC32:: as PCM$KITS:, PCM$DOCS:, PCM$SCANS:
Moderator: CSC32::BUTTERWORTH
Created: Thu Aug 06 1992
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1541
Total number of notes: 6564

1474.0. "1.6-310 hangs - reboot needed" by BACHUS::BANKEN () Tue Jan 28 1997 07:33

Hello,

Approximately once a week the system must be rebooted to release
PCM. The system is an AXP2000 5/250 running OpenVMS 6.2-1H2.
This problem started one week after we upgraded PCM from 1.6-213
to 1.6-300 (ECO 3 alone).
 
The hang, as far as I know, can be described as follows:
	- Multiline windows are frozen
	- No more recording in the nodes' logfiles
	- Console shutdown does not remove all PCM processes
	- No special process status

PCM monitors about 40 machines, and most of the multiline windows
are sent to VXT/Xterminals. There are plenty of multiline
processes, since they are subdivided into types such as Hardware,
Security, Batch and so on.
There is also a feeder for watchdog events to PCM.

Until now the problem has occurred only at the weekend, so the customer
didn't call us to investigate; he simply tried to stop it from
the system account, without success, and rebooted the node.

In the meantime I installed version 310, but the problem is still
there.

I asked him to force a crash. I hope we will see something there.

Needless to say, this is a hot topic, since this is a large
production site and this node is the central point for X.25
communication.

Any suggestion will be welcome.

Alain.
T.R  Title  User  Personal Name  Date  Lines
1474.1  CSC32::BUTTERWORTH  Gun Control is a steady hand.  Tue Jan 28 1997 11:00  15
    Alain,
      The forced crash is what we really need. This could be nothing more
    than exhaustion of a system resource. Some things to keep an eye on:
    
    Non-paged pool - it could be that non-paged pool needs to expand but
                     cannot because there are insufficient pages on the
                     free list.
    
    Page-file space - if you have insufficient page-file space you have a
                      problem. Again, check the free memory list. If you
                      have insufficient memory for your environment then
                      the load is placed on the page files.
    
    Regs,
       dan
1474.2  BACHUS::BANKEN  Wed Jan 29 1997 02:22  28
Dan,

Thanks for your reply.

I checked the pool and page file at the very beginning, and
there is neither a pool expansion nor a page-file problem on that
machine.

I monitored this system for about one day and there is absolutely
nothing to worry about. Anyway, I started PSDC so that we can
analyze the load at the time the problem occurs.
Other products behave normally at this moment.

The fact that processes are not removed by a CONSOLE SHUTDOWN
is, in my view, significant.
They may have some outstanding I/O and/or a lack of
quotas. In that case I would expect those processes to go into RWAST
(certainly after a STOP/ID).
As far as I know, the customer has never tried STOP/ID nor
observed an RWxxx state.
 
Are there any quotas you recommend for a large site?
Is there anything special with VXT? Do you recommend a 'preferred' transport?
Is there an ongoing 'case' on hangs at this time?

Many thanks,

Alain.  
1474.3  CSC32::BUTTERWORTH  Gun Control is a steady hand.  Wed Jan 29 1997 12:38  28
    >Are there any quotas you recommend for a large site?
    
    Quotas on all the daemons are *quite* large and specified on the
    call to SYS$CREPRC that creates them. The point here is that
    UAF changes for the system account do not affect the PCM daemons.
    Note also that action routine processes are detached processes, so they
    aren't sharing the ENS daemon's quotas.
    
    Of course the C3 is affected by the UAF quotas of whatever account
    is running it.
    
    >Is there anything special with VXT? Do you recommend a 'preferred' transport?
    
    Nope. Use whatever you want.
    
    >Is there an ongoing 'case' on hangs at this time?
    
    Nope. We had some ENS hangs a long time ago but they've all been fixed
    since ECO 3. If your site is running V1.6-311 (312 may have been
    announced, as I haven't yet read all the notes from today) you're in good
    shape.
    
    If you can get the customer to force a crash with DUMPSTYLE = 0 then
    we can figure out what's going on from the dump. That's our best action
    plan.
    
    Regs,
      Dan
1474.4  BACHUS::BANKEN  Tue Feb 04 1997 03:23  79
Dan,

I got the forced crash and the Console Notify process is in RWAST.
(It should be analyzed under OpenVMS Alpha 6.2.)

Quick overview :

SDA> read/exec
SDA> read sys$loadable_images:sysdef

SDA> sh sum  => Console Notify in RWAST

SDA> set proc/id=3c
SDA> sh proc

Process index: 003C   Name: Console Notify    Extended PID: 0000023C
--------------------------------------------------------------------
Process status:        00140001  RES,PHDRES,LOGIN
Required capabilities: 0000000C  QUORUM,RUN

PCB address              81930080    JIB address              817BE700
PHD address              8D4E4000    Swapfile disk address    00000000
Master internal PID      0001003C    Subprocess count                0
Internal PID             0001003C    Creator internal PID     00000000
Extended PID             0000023C    Creator extended PID     00000000
State                       RWAST    Termination mailbox          0037
Previous CPU Id          00000000    Current CPU Id           00000000
Previous ASNSEQ  0000000000006935    Previous ASN     000000000000001B
Current priority                6    # of threads     0000000000000000
Initial process priority        4    Delete pending count         0
Base priority                   4    AST's active                 U
UIC                [00001,000004]    AST's remaining               970
Mutex count                     0    Buffered I/O count/limit    36821/36864
Waiting EF cluster              0    Direct I/O count/limit       1023/1024
Abs time of last event   01C0306E    BUFIO byte count/limit     370806/370806
Event flag wait mask     00000001    # open files allowed left    1022

Process index: 003C   Name: Console Notify    Extended PID: 0000023C
--------------------------------------------------------------------
Swapped copy of LEFC0    00000000    Timer entries allowed left   1024
Swapped copy of LEFC1    00000000    Active page table count         0
Global cluster 2 pointer 00000000    Process WS page count         194
Global cluster 3 pointer 00000000    Global WS page count           96

SDA> sh call indicates EXE$DASSGN_C+0018C
		Waiting for I/O request to complete

SDA> sh proc/channel => a lot of busy MBA devices

SDA> sh device/address=@@r6 => mba causing the RWAST

I/O data structures
-------------------
MBA2898                                 MBX               UCB address:  817293C0

Device status:   08000010 online,exfunc_supp
Characteristics: 0C150001 rec,shr,avl,mbx,idv,odv
                 00000200 nnm

Owner UIC [000001,000004]   Operation count          2   ORB address    81674440
      PID        00000000   Error count              0   DDB address    8C1A4D80
Class/Type          A0/01   Reference count          1   DDT address    8C1BC940
Def. buf. size       1049   BOFF              00000000   CRB address    8C1A4DC0
DEVDEPEND        00000001   Byte count        00000000   LNM address    95CA2B60
DEVDEPND2        00000000   SVAPTE            00000000   I/O wait queue 8172942C
DEVDEPND3        00000000   DEVSTS            00000002
FLCK index             2C
DLCK address     8C1AC280
Charge PID       0001003C

        *** I/O request queue is empty ***

and now ????

If you are interested I can make the dump file available (500,000 blocks).

Best regards,

Alain 
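
The quota figures in the SHOW PROCESS display above can be read as headroom. The following is a minimal Python sketch, not part of the original analysis. It assumes that in each count/limit pair the first number is the quota still available and the second is the limit; that reading of the display is an assumption, and the helper name in_use is made up for the illustration.

```python
# Assumption: in each "count/limit" pair the first number is what is still
# available and the second is the limit, so limit - count is what is in use.
# The values are copied from the SHOW PROCESS display; the helper name is
# hypothetical.
QUOTAS = {
    "Buffered I/O": (36821, 36864),
    "Direct I/O": (1023, 1024),
    "BUFIO byte count": (370806, 370806),
}


def in_use(remaining, limit):
    """Return how much of a quota is consumed under the assumption above."""
    return limit - remaining


usage = {name: in_use(*pair) for name, pair in QUOTAS.items()}
for name, used in usage.items():
    print(f"{name}: {used} in use")
```

Under that reading none of the three limits has been reached, which would point away from a simple quota shortage as the cause of the RWAST.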
1474.5  CSC32::BUTTERWORTH  Gun Control is a steady hand.  Tue Feb 04 1997 12:28  11
    Alain,
      Format the UCB of the MBA device and post it. You can't trust the
    "I/O request queue is empty" business on mailboxes. I'll bet we'll find an
    IRP at UCB+UCB$L_MB_READQFL. If so, then SYS$CANCEL was never called,
    which really bugs me. Also, do a SHOW CALL followed by four SHOW CALL/NEXT
    commands. 
    
    Are you still running V1.6-310 or have you upgraded it to V1.6-311?
    
    Regs,
      Dan
1474.6  BACHUS::BANKEN  Wed Feb 05 1997 02:57  214
Dan,

Same problem with 1.6-311.


SDA> format 817293C0
817293C0   UCB$L_FQFL                      816222C0
           UCB$L_MB_MSGQFL
           UCB$L_RQFL
           UCB$W_MB_SEED
           UCB$W_UNIT_SEED
817293C4   UCB$L_FQBL                      816222C0
           UCB$L_MB_MSGQBL
           UCB$L_RQBL
817293C8   UCB$W_SIZE                          0118
817293CA   UCB$B_TYPE                        10
817293CB   UCB$B_FLCK                      2C
817293CC   UCB$L_ASTQFL                    00000000
           UCB$L_FPC
           UCB$L_MB_W_AST
           UCB$T_PARTNER
817293D0   UCB$L_ASTQBL                    00000000
           UCB$L_MB_R_AST
           UCB$Q_FR3
817293D4                                   00000000
817293D8   UCB$L_FIRST                     00010000
           UCB$Q_FR4
           UCB$W_MSGMAX
           UCB$W_MSGCNT
817293DC                                   00000000
817293E0   UCB$W_BUFQUO                        147C
           UCB$W_DSTADDR
817293E2   UCB$W_INIQUO                    147D
           UCB$W_SRCADDR
817293E4   UCB$L_ORB                       81674440     ORB
817293E8   UCB$L_CPID                      0001003C
           UCB$L_LOCKID
817293EC   UCB$PS_CRAM                     00000000
817293F0   UCB$L_CRB                       8C1A4DC0     CRB
817293F4   UCB$L_DLCK                      8C1AC280     SMP$GL_MAILBOX
817293F8   UCB$L_DDB                       8C1A4D80     DDB
817293FC   UCB$L_PID                       00000000
81729400   UCB$L_LINK                      8189AD00
81729404   UCB$L_VCB                       00000000
81729408   UCB$L_DEVCHAR                   0C150001
           UCB$Q_DEVCHAR
8172940C   UCB$L_DEVCHAR2                  00000200
81729410   UCB$L_AFFINITY                  FFFFFFFF
81729414   UCB$L_ALTIOWQ                   00000000
           UCB$L_XTRA
81729418   UCB$B_DEVCLASS                        A0
81729419   UCB$B_DEVTYPE                       01
8172941A   UCB$W_DEVBUFSIZ                 0419
8172941C   UCB$B_LOCSRV                          01
           UCB$B_SECTORS
           UCB$L_DEVDEPEND
           UCB$Q_DEVDEPEND
           UCB$R_DEVDEPEND_Q_BLOCK
           UCB$R_DISK_DEVDEPEND
           UCB$R_NET_DEVDEPEND
           UCB$R_TERM_DEVDEPEND
8172941D   UCB$B_REMSRV                        00
           UCB$B_TRACKS
8172941E   UCB$W_BYTESTOGO                 0000
           UCB$W_CYLINDERS
           UCB$B_VERTSZ
81729420   UCB$L_DEVDEPND2                 00000000
           UCB$L_TT_DEVDP1
           UCB$W_TU_FORMENU
81729424   UCB$L_DEVDEPND3                 00000000
           UCB$Q_DEVDEPEND2
           UCB$R_DEVDEPEND2_Q_BLOCK
           UCB$R_TMV_BCNT
           UCB$W_TMV_BCNT1
           UCB$W_TMV_BCNT2
81729428   UCB$L_DEVDEPND4                 00000000
           UCB$W_TMV_BCNT3
           UCB$W_TMV_BCNT4
8172942C   UCB$L_IOQFL                     8172942C     UCB+0006C
81729430   UCB$L_IOQBL                     8172942C     UCB+0006C
81729434   UCB$W_UNIT                          0B52
81729436   UCB$B_CM1                         95
           UCB$W_CHARGE
           UCB$W_RWAITCNT
81729437   UCB$B_CM2                       15
81729438   UCB$L_IRP                       00000000
8172943C   UCB$L_REFC                      00000001
81729440   UCB$B_DIPL                            0B
           UCB$B_STATE
81729441   UCB$B_AMOD                          00
81729442   UCB$W_FILL_0                    0000
81729444   UCB$L_AMB                       00000000
81729448   UCB$L_STS                       08000010
8172944C   UCB$L_DEVSTS                    00000002
81729450   UCB$L_QLEN                      00000000
81729454   UCB$L_DUETIM                    00000000
81729458   UCB$L_OPCNT                     00000002
8172945C   UCB$L_SVPN                      00000000
81729460   UCB$L_SVAPTE                    00000000
81729464   UCB$L_BCNT                      00000000
81729468   UCB$L_BOFF                      00000000
8172946C   UCB$L_SOFTERRCNT                00000000
81729470   UCB$L_ERTCNT                    00000000
81729474   UCB$L_ERTMAX                    00000000
81729478   UCB$L_ERRCNT                    00000000
8172947C   UCB$L_PDT                       00000000
81729480   UCB$L_DDT                       8C1BC940     MB$DDT
81729484   UCB$PS_ADP                      00000000
81729488   UCB$PS_CRCTX                    00000000
8172948C   UCB$L_MEDIA_ID                  00000000
81729490   UCB$PS_DTN                      00000000
81729494   UCB$PS_DTN_LINK                 00000000
81729498   UCB$PS_TOUTROUT                 00000000
8172949C                                   00000000
817294A0   UCB$L_MB_READERREFC             00000000
817294A4   UCB$L_MB_WRITERREFC             00000000
817294A8   UCB$L_MB_READQFL                817294A8     UCB+000E8
817294AC   UCB$L_MB_READQBL                817294A8     UCB+000E8
817294B0   UCB$L_MB_WRITERWAITQFL          817294B0     UCB+000F0
817294B4   UCB$L_MB_WRITERWAITQBL          817294B0     UCB+000F0
817294B8   UCB$L_MB_READERWAITQFL          817294B8     UCB+000F8
817294BC   UCB$L_MB_READERWAITQBL          817294B8     UCB+000F8
817294C0   UCB$L_MB_NOWRITERWAITQFL        817294C0     UCB+00100
817294C4   UCB$L_MB_NOWRITERWAITQBL        817294C0     UCB+00100
817294C8   UCB$L_MB_NOREADERWAITQFL        817294C8     UCB+00108
817294CC   UCB$L_MB_NOREADERWAITQBL        817294C8     UCB+00108
817294D0   UCB$L_MB_ROOM_NOTIFY            00000000
817294D4   UCB$L_MB_LOGADR                 95CA2B60     LNM

SDA> sh call

Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 8001C7C0              SCH$RESOURCE_WAIT_PS_C
        Return address on stack = FFFFFFFF 8006A64C     EXE$DASSGN_C+0018C

Registers saved on stack
------------------------
7FF91F28  00000000 0000000C  Saved R0
7FF91F30  00000000 00000003  Saved R1
7FF91F38  FFFFFFFF 81930080  Saved R4     PCB
7FF91F40  FFFFFFFF 8C1BA310  Saved R13    EXE$DASSGN
7FF91F48  00000000 7FF91F50  Saved R29

SDA> sh call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 8006A4C0              EXE$DASSGN_C
        Return address on stack = FFFFFFFF 8004F184     EXE$CMODKRNL_C+000C4

Registers saved on stack
------------------------
7FF91F70  00000000 00000003  Saved R2
7FF91F78  FFFFFFFF 8C1B60E0  Saved R3     EXE$GR_CMOD_LINKAGE_SECT
7FF91F80  FFFFFFFF 81930080  Saved R4     PCB
7FF91F88  00000000 00000000  Saved R5
7FF91F90  00000000 00066444  Saved R6
7FF91F98  00000000 7FF91FC0  Saved R7
7FF91FA0  00000000 7FF9C1F8  Saved R8
7FF91FA8  FFFFFFFF 8C1C15D8  Saved R13    KERNEL_ASTDEL_C
7FF91FB0  00000000 7EEF5DA0  Saved R15
7FF91FB8  00000000 7EE497F0  Saved R29

SDA> sh call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: 00000000 000662D0
        Return address on stack = FFFFFFFF 8008E5A4     EXE$AST_RETURN

Registers saved on stack
------------------------
7EE49800  00000000 00000003  Saved R2
7EE49808  00000000 00020260  Saved R3     UCB$M_LCL_VALID+00260
7EE49810  00000000 00020280  Saved R4     UCB$M_LCL_VALID+00280
7EE49818  00000000 00020190  Saved R5     UCB$M_LCL_VALID+00190
7EE49820  00000000 7EE49840  Saved R29

SDA> sh call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: FFFFFFFF 8008CDC0              SCH$ASTDEL_C
        Return address on stack = FFFFFFFF 8008E81C     SCH$KERNEL_ASTDEL+000DC

Registers saved on stack
------------------------
7EE49858  FFFFFFFF 8C1C16D8  Saved R3     USER_ASTDEL
7EE49860  00000000 7EEF6900  Saved R13
7EE49868  00000000 7EE49A00  Saved R29

SDA> sh call/next
Call Frame Information
----------------------
        Stack Frame Procedure Descriptor
Flags:  Base Register = FP, No Jacket, Native
        Procedure Entry: 00000000 0004C690
        Return address on stack = 00000000 0003096C

Registers saved on stack
------------------------
7EE49A10  00000000 00010208  Saved R2     SYS$K_VERSION_16+001C8
7EE49A18  00000000 00020040  Saved R3     UCB$M_LCL_VALID+00040
7EE49A20  00000000 7F3EC01C  Saved R4
7EE49A28  00000000 00000000  Saved R5
7EE49A30  00000000 7EE49A70  Saved R29
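
One way to read the queue fields in the FORMAT output above: a listhead whose forward link contains its own address has no entries. The following is a minimal Python sketch of that check, using addresses and contents copied from the FORMAT display. The names LISTHEADS and is_empty are made up for the illustration, and the field at 817293C0 is listed under its mailbox overlay name.

```python
# Listhead lines copied from the FORMAT output: (address, field, contents).
# LISTHEADS and is_empty are hypothetical names used only for this sketch.
LISTHEADS = [
    (0x817293C0, "UCB$L_MB_MSGQFL", 0x816222C0),
    (0x8172942C, "UCB$L_IOQFL", 0x8172942C),
    (0x817294A8, "UCB$L_MB_READQFL", 0x817294A8),
    (0x817294B0, "UCB$L_MB_WRITERWAITQFL", 0x817294B0),
    (0x817294B8, "UCB$L_MB_READERWAITQFL", 0x817294B8),
    (0x817294C0, "UCB$L_MB_NOWRITERWAITQFL", 0x817294C0),
    (0x817294C8, "UCB$L_MB_NOREADERWAITQFL", 0x817294C8),
]


def is_empty(address, flink):
    """A listhead whose forward link holds its own address has no entries."""
    return flink == address


status = {name: is_empty(addr, flink) for addr, name, flink in LISTHEADS}
for name, empty in status.items():
    print(f"{name:26} {'empty' if empty else 'has entries'}")
```

With these values only the message queue at the start of the UCB has entries; the read queue at UCB$L_MB_READQFL is empty.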

                                               
1474.7  CSC32::BUTTERWORTH  Gun Control is a steady hand.  Wed Feb 05 1997 14:48  19
    Now this is really weird. Notice you have a refcount of 1 yet the
    READERREFC and WRITERREFC fields are 0!!! That's *never* supposed to
    happen!!  This tells me that the $DASSGN service must have run
    up to some point and then hung.
    
    Also there are messages queued to the UCB. The first two
    longwords of the UCB are a listhead. Do a VAL QUE @UCB to see how many
    elements there are and then do:
    
    SDA> EXAMINE @UCB;100
    
    followed by the necessary iterations of the following command to get
    to the end of the message list.
    
    SDA> EXAMINE @.;100
    
    
    Regs,
      Dan
1474.8  BACHUS::BANKEN  Fri Feb 07 1997 06:12  46
Dan,

We had the problem again yesterday; I checked the dump and it is
exactly the same.
EXAMINE @UCB;100 displays xterm3; in the new dump I found xterm4.
Both are VXT 2000s.

SDA> val que @ucb
Queue is complete, total of 1 element in the queue

SDA> examine @ucb;100
00000000 28130040 817293C0 817293C0  À.r.À.r.@..(....     816222C0
816222E4 00000000 00000000 00180145  E...........ä"b.     816222D0
00000000 00000000 009AF457 00010000  ....Wô..........     816222E0
00000000 0001003C 0000003D 00000043  C...=...<.......     816222F0
00010008 A3750040 8162230C 81835F80  ._...#b.@.u£....     81622300
00000000 336D7265 74780006 00670040  @.g...xterm3....     81622310
00000000 00000000 00000000 00000000  ................     81622320
00000000 00000000 00000000 00000000  ................     81622330
00000000 00170040 81622358 81621280  ..b.X#b.@.......     81622340
00000000 336D7265 74780006 00670040  @.g...xterm3....     81622350
00000000 00000000 00000000 00000000  ................     81622360
00000000 00000000 00000000 00000000  ................     81622370
001000B0 313500C0 81868530 81868530  0...0...À.51°...     81622380
FFFFFFE0 FFFFFFE0 7EFA00D0 8C1F5460  `T..Ð.ú~à...à...     81622390
7EFA0114 8C1F5460 00000000 8C1C7F38  8.......`T....ú~     816223A0
330012A9 00000001 06000000 0000061B  ............©..3     816223B0

SDA> examine @.;100
00000000 2C100118 816222C0 816222C0  À"b.À"b....,....     817293C0
00000000 00010000 00000000 00000000  ................     817293D0
00000000 0001003C 81674440 147D147C  |.}.@Dg.<.......     817293E0
00000000 8C1A4D80 8C1AC280 8C1A4DC0  ÀM...Â...M......     817293F0
00000200 0C150001 00000000 8189AD00  ................     81729400
00000001 041901A0 00000000 FFFFFFFF  ................     81729410
8172942C 00000000 00000000 00000000  ............,.r.     81729420
00000001 00000000 15950B52 8172942C  ,.r.R...........     81729430
00000002 08000010 00000000 0000000B  ................     81729440
00000000 00000002 00000000 00000000  ................     81729450
00000000 00000000 00000000 00000000  ................     81729460
00000000 00000000 00000000 00000000  ................     81729470
00000000 00000000 00000000 8C1BC940  @É..............     81729480
00000000 00000000 00000000 00000000  ................     81729490
817294A8 817294A8 00000000 00000000  ........¨.r.¨.r.     817294A0
817294B8 817294B8 817294B0 817294B0  °.r.°.r.¸.r.¸.r.     817294B0
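
The two EXAMINE displays above contain enough to repeat the VAL QUE check by hand. The following is a minimal Python sketch, not SDA's implementation. It assumes each row lists four longwords right to left with the row address in the last column, a layout the FORMAT output in .6 bears out (UCB$W_SIZE 0118 at 817293C8). The names load_rows and walk_queue are made up for the illustration, and only the first row of each display is used, without its ASCII column.

```python
# First row of each EXAMINE display, hex columns and address only.
# Assumption: the longword nearest the address column sits at the row address.
ROWS = [
    "00000000 28130040 817293C0 817293C0  816222C0",
    "00000000 2C100118 816222C0 816222C0  817293C0",
]


def load_rows(rows):
    """Build an address -> longword map from rows of hex text."""
    memory = {}
    for row in rows:
        fields = row.split()
        base = int(fields[-1], 16)
        # fields[3] is the longword at the row address, fields[0] at base + 12.
        for i, word in enumerate(reversed(fields[:4])):
            memory[base + 4 * i] = int(word, 16)
    return memory


def walk_queue(memory, head, limit=1000):
    """Follow forward links from a listhead and return the element addresses.

    Raises ValueError if a back link does not point at the previous entry.
    """
    elements = []
    prev, addr = head, memory[head]
    while addr != head:
        if memory[addr + 4] != prev:
            raise ValueError(f"bad back link at {addr:08X}")
        elements.append(addr)
        if len(elements) > limit:
            raise ValueError("queue does not return to its listhead")
        prev, addr = addr, memory[addr]
    if memory[head + 4] != prev:
        raise ValueError("listhead back link does not match last element")
    return elements


memory = load_rows(ROWS)
elements = walk_queue(memory, 0x817293C0)
print(len(elements), [f"{a:08X}" for a in elements])
```

Run on these two rows, the walk finds a single element at 816222C0 whose links point back at the UCB, in line with the VAL QUE result above.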