
Conference csc32::consolemanager

Title: POLYCENTER Console Manager
Notice: Kits, Scans, Docs on CSC32:: as PCM$KITS:, PCM$DOCS:, PCM$SCANS:
Moderator: CSC32::BUTTERWORTH
Created: Thu Aug 06 1992
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1541
Total number of notes: 6564

522.0. "RWMBX Problem with V1.5-003" by CGOOA::VCOOKE (Vern Cooke @CTU (Western Canada CNS)) Sat Dec 17 1994 04:53

Hello:

We had a VAX/VMS 6.1 Console Manager 1.5-003 controller process enter RWMBX
state tonight. I was fortunate enough to see it happen and can describe the
sequence of events that I believe brought it about. Phil had asked for
feedback on 1.5-003; I hope this is enough to nail this problem once and for
all!

One of the systems the Console Manager node monitors, CTUTAU, acts as a
watchdog consolidator for the site. One of the production systems CTUTAU
consolidates had a cluster state transition causing its hundred-odd disks to
enter and leave mount verification. Watchdog picked up on this and passed that
info on to PCM. For each of the 100+ disks, Watchdog generated a reply message:

%%%%%%%%%%%  OPCOM  16-DEC-1994 19:02:28.44  %%%%%%%%%%%
Message from user SYSTEM on CTUTAU
SNS$AR_94, System Watchdog Detected Event
_MAIL   Disk DSA3114: status is mount verification (SNS_C_DSS:SNS_C_NEW:TAU)

The replies were generated continuously, one after another. PCM's ENS display
showed quite a few before stopping. At that point the CONSOLE MONITOR
command could not be invoked; it produced a blank screen. When I did a SHOW
SYSTEM, one of the controller processes was in an RWMBX state.
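
For readers unfamiliar with the state: RWMBX is the VMS resource wait a
process enters when it writes to a mailbox that has no free buffer space.
The Python sketch below is only an analogy, not PCM code. The queue, its
capacity, and the function name are assumptions made for illustration, and
nothing in this note establishes that this is the actual mechanism behind
the hang. It shows a burst of writes outpacing a reader that has stopped
draining:

```python
import queue

# Hypothetical stand-in for a fixed-size mailbox; the capacity is illustrative.
MAILBOX_CAPACITY = 8


def deliver_burst(mailbox, messages):
    """Write messages without blocking; return how many were accepted."""
    accepted = 0
    for msg in messages:
        try:
            mailbox.put_nowait(msg)
        except queue.Full:
            # A blocking writer would wait here until a reader drains the
            # queue. With no reader running, that wait never ends.
            break
        accepted += 1
    return accepted


mailbox = queue.Queue(maxsize=MAILBOX_CAPACITY)
burst = [f"Disk DSA{n}: status is mount verification" for n in range(100)]
accepted = deliver_burst(mailbox, burst)
print(f"accepted {accepted} of {len(burst)} messages before the mailbox filled")
```

The non-blocking write is used here only so the example terminates; a writer
that blocks instead would sit in the wait indefinitely once the reader stops.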

I ended up shutting down PCM, manually stopping the PCM processes that refused
to shut down, and saving the log files. I had to manually delete the leftover
LTA ports before restarting PCM. It came up fine.

Below are the last few lines from the CONTROLLER_01.LOG and CONTROLLER_02.LOG
files. I can also provide you with copies of all the files that were in the
CONSOLE$TMP directory if you require them. Beware: they are 15+K and 57+K
blocks in size!

We do not have OSCint installed as described in note 422, but our TTY_ALTYPAHD
(referred to in note 428) was set to 200. I have increased it to 8192 and will
implement this change as soon as possible via a reboot.
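
As a rough sanity check on those two buffer sizes, the Python sketch below
compares them with the size of the Watchdog reply quoted above. The CR/LF
line endings and the round figure of 100 disks are assumptions, and whether
the typeahead buffer is actually involved in this hang is not established
here. The arithmetic suggests that even the larger buffer cannot hold the
whole burst unless the reader keeps draining it:

```python
# One Watchdog reply as quoted in this note. The CR/LF line endings are an
# assumption about what travels over the console line.
REPLY_LINES = [
    "%%%%%%%%%%%  OPCOM  16-DEC-1994 19:02:28.44  %%%%%%%%%%%",
    "Message from user SYSTEM on CTUTAU",
    "SNS$AR_94, System Watchdog Detected Event",
    "_MAIL   Disk DSA3114: status is mount verification (SNS_C_DSS:SNS_C_NEW:TAU)",
]
REPLY_BYTES = sum(len(line) + 2 for line in REPLY_LINES)  # +2 for CR/LF
DISK_COUNT = 100  # the note says "100+ disks"; the exact figure is not given


def whole_replies_that_fit(buffer_bytes, reply_bytes=REPLY_BYTES):
    """Number of complete replies a buffer of the given size can hold."""
    return buffer_bytes // reply_bytes


burst_bytes = DISK_COUNT * REPLY_BYTES
for size in (200, 8192):
    print(
        f"TTY_ALTYPAHD={size}: holds {whole_replies_that_fit(size)} whole "
        f"replies; the full burst is {burst_bytes} bytes"
    )
```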

Please let me know your recommendations.
						Thank you,
							....... Vern.

-----------------------------------------------------------------------------
Contents of the CONTROLLER_01.LOG file:
		:
	(a lot of these messages)
		:
Flushing log and Time files for dun ...done.
Flushing log and Time files for duo ...done.
Flushing log and Time files for dup ...done.
Flushing log and Time files for duq ...done.
Flushing log and Time files for dur ...done.
Flushing log and Time files for dus ...done.
Flushing log and Time files for dut ...done.
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
User connecting to system ctutau, flushing ring buffer.
    0 bytes will be flushed
Flushing log and Time files for ctutau ...done.
    User is : OPER
CMHostControlReadCallback: End
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
CMHostControlReadCallback: End
CMConsoleControlCloseCB: Start
CMConsoleControlCloseCB: End
		:
		:
(two more repetitions of the 14 lines above)
		:
		:
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
User connecting to system ctutau, flushing ring buffer.
    0 bytes will be flushed
Flushing log and Time files for ctutau ...done.
    User is : OPER
CMHostControlReadCallback: End
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
CMHostControlReadCallback: End
CMChildControlSocketReadCB: Start
Event port ENS closed by partner
Unable to connect to event port
Failed to connect to Local listener CONSOLE_EVT_ENS on node 
no such file or directory

The file ends with the "no such file or directory" line.
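
A check for this ending can be scripted. The Python sketch below looks for
the event-port failure lines near the end of a controller log. The marker
strings are copied from the excerpt above; the function name and the tail
length are assumptions for illustration, and the sample list stands in for a
real log file:

```python
# Marker strings taken from the CONTROLLER_01.LOG excerpt in this note.
FAILURE_MARKERS = (
    "Event port ENS closed by partner",
    "Unable to connect to event port",
    "Failed to connect to Local listener",
)


def find_failure_markers(lines, tail=20):
    """Return (line_number, text) pairs for marker lines in the last `tail` lines."""
    start = max(0, len(lines) - tail)
    hits = []
    for index in range(start, len(lines)):
        text = lines[index].strip()
        if any(text.startswith(marker) for marker in FAILURE_MARKERS):
            hits.append((index + 1, text))
    return hits


sample = [
    "CMHostControlReadCallback: End",
    "CMChildControlSocketReadCB: Start",
    "Event port ENS closed by partner",
    "Unable to connect to event port",
    "Failed to connect to Local listener CONSOLE_EVT_ENS on node ",
    "no such file or directory",
]
hits = find_failure_markers(sample)
for number, text in hits:
    print(number, text)
```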
-----------------------------------------------------------------------------
The CONTROLLER_02.LOG file seemed to end normally at the time I did the shut
down:
		:
		:
Flushing log and Time files for r2d2 ...done.
Flushing log and Time files for thor ...done.
Flushing log and Time files for vxd ...done.
Flushing log and Time files for vxo ...done.
CMConsoleControlCloseCB: Start
CMConsoleControlCloseCB: End
CMChildControlSocketReadCB: Start
$ EXIT $STATUS
  SYSTEM       job terminated at 16-DEC-1994 19:08:21.58
  Accounting information:
  Buffered I/O count:          649839         Peak working set size:   10584
  Direct I/O count:            237865         Peak page file size:     18920
  Page faults:                  11352         Mounted volumes:             0
  Charged CPU time:           0 01:02:57.11   Elapsed time:     2 03:39:14.69

522.1. "" by OPG::PHILIP (And through the square window...) Sat Dec 17 1994 16:07 (17 lines)

Vern,

  Thanks for the info. I believe we now have a good idea about what is
  causing the problems.

  Unfortunately, it's not a 5-minute fix. We need to spend some time working
  the problem, which we will be doing over the next few weeks.

  This fix is not something we are going to hold up the MUP kit for. We will
  issue a new ECO for this fix when it is ready.

  Thanks for your help and patience, I would also like to solicit your help 
  when we have a patch available for testing.

Cheers,
Phil

522.2. "Glad to Help Test" by CGOOA::VCOOKE (Vern Cooke @CTU (Western Canada CNS)) Sun Dec 18 1994 14:05 (8 lines)

    Hi Phil!
    
    I'll be glad to help test out the patch when it is ready. Please VAXmail
    me at CGOOA::VCOOKE to let me know where to pull the patch from since I
    won't be checking this particular note for additional replies.
    
    				Have a Merry Christmas and a Happy New Year!
    							......... Vern.

522.3. "PCM Hung Again" by 58392::COOKE (Vern Cooke @CTU (Western Canada CNS)) Wed Jan 11 1995 15:56 (21 lines)

    Happy New Year Phil!
    
    Well, we upgraded to 1.5-006 last week and over the weekend PCM hung
    twice. Unfortunately, I didn't save the log files before restarting PCM
    so I can't be sure of what PCM was up to when it hung.
    
    The "hang" manifested itself as CONSOLE MONITOR commands stopping at
    the blank screen before the prompt appears. Also, only one of the
    controller processes was clocking CPU, and a small amount at that. None
    of the PCM processes were left in RWMBX state.
    
    I documented a procedure for our operations staff to follow in order to
    save the log files the next time PCM hangs. So far, it has run for two
    days without hanging.
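
A save procedure of this kind could be scripted along the following lines.
This is a minimal Python sketch under stated assumptions, not the procedure
referred to above: the function name, the archive naming scheme, and the
throwaway directories in the demonstration are all hypothetical, and real
PCM log locations would have to be substituted.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def save_logs(log_dir, archive_root, stamp=None):
    """Copy every *.LOG file in log_dir into a new timestamped directory."""
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(archive_root) / f"pcm-hang-{stamp}"
    # Refuse to reuse an existing directory so earlier evidence is never overwritten.
    target.mkdir(parents=True, exist_ok=False)
    copied = []
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file() and path.suffix.upper() == ".LOG":
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return target, copied


# Demonstration against throwaway directories (not real PCM paths).
with tempfile.TemporaryDirectory() as scratch:
    logs = Path(scratch) / "logs"
    logs.mkdir()
    (logs / "CONTROLLER_01.LOG").write_text("Flushing log and Time files for dun ...done.\n")
    (logs / "CONTROLLER_02.LOG").write_text("CMConsoleControlCloseCB: End\n")
    (logs / "NOTES.TXT").write_text("not a log\n")
    target, copied = save_logs(logs, Path(scratch) / "saved", stamp="19950111-155600")
    saved_names = sorted(p.name for p in target.iterdir())
    print(target.name, copied)
```

Copying before any restart matters because the logs from the earlier hangs
were lost when PCM was restarted first.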
    
    Phil, did you manage to include the patch you spoke of in the MUP
    (-006), or is it still under development? Any suggestions for a fix or
    workaround would be greatly appreciated.
    
    					Thank you,
    							..... Vern.

522.4. "" by OPG::PHILIP (And through the square window...) Wed Jan 11 1995 17:07 (8 lines)

Vern,

  No, the patch I spoke of is actually vapour-ware at this time. We just
  have not had any time to investigate this further, but don't worry: as
  soon as we do and we have something to test, we will be in touch.

cheers,
Phil

522.5. "Offer to Contact Upon Failure" by CGOOA::VCOOKE (Vern Cooke @CTU (Western Canada CNS)) Wed Jan 11 1995 17:51 (12 lines)

    Hi Phil!
    
    Glad to hear this problem is still being worked. I would like to offer
    to contact you the next time the problem occurs so that you may look at
    a live system with the problem. If this is acceptable to you, please
    let me know how to go about contacting you (or your team). Since the
    systems PCM manages are production, we can't tolerate an extended
    period with PCM down, but should be able to manage 15-30 minutes
    (depending on the activity at the time).
    
    				Please let me know what you think,
    						....... Vern.