Title: | POLYCENTER Console Manager |
Notice: | Kits, Scans, Docs on CSC32:: as PCM$KITS:,PCM$DOCS:, PCM$SCANS: |
Moderator: | CSC32::BUTTERWORTH |
Created: | Thu Aug 06 1992 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1541 |
Total number of notes: | 6564 |
Hello:

We had a VAX/VMS 6.1 Console Manager 1.5-003 controller process enter RWMBX state tonight. I was fortunate enough to see it happen and can describe the sequence of events that I believe brought it about. Phil had asked for feedback on 1.5-003; I hope this is enough to nail this problem once and for all!

One of the systems the Console Manager node monitors, CTUTAU, acts as a watchdog consolidator for the site. One of the production systems CTUTAU consolidates had a cluster state transition, causing its hundred-odd disks to enter and leave mount verification. Watchdog picked up on this and passed that info on to PCM. For each of the 100+ disks, Watchdog generated a reply message:

    %%%%%%%%%%% OPCOM 16-DEC-1994 19:02:28.44 %%%%%%%%%%%
    Message from user SYSTEM on CTUTAU
    SNS$AR_94, System Watchdog Detected Event _MAIL
    Disk DSA3114: status is mount verification (SNS_C_DSS:SNS_C_NEW:TAU)

The replies were generated continuously, one after another. PCM's ENS display showed quite a few before stopping. At that point the CONSOLE MONITOR command could not be invoked - it produced a blank screen. When I did a SHOW SYSTEM, one of the controller processes was in RWMBX state.

I ended up shutting down PCM, manually stopping the PCM processes that refused to shut down, and saving the log files. I had to manually delete the leftover LTA ports before restarting PCM. It came up fine.

Below are the last few lines from the CONTROLLER_01.LOG and CONTROLLER_02.LOG files. I can also provide you with copies of all the files that were in the CONSOLE$TMP directory if you require them. Beware: they are 15+K and 57+K blocks in size!

We do not have OSCint installed as described in note 422, but our TTY_ALTYPAHD (referred to in note 428) was set to 200. I have increased it to 8192 and will implement this change as soon as possible via a reboot.

Please let me know your recommendations.

Thank you,
....... Vern.
-----------------------------------------------------------------------------
Contents of the CONTROLLER_01.LOG file:

        :
    (a lot of these messages)
        :
Flushing log and Time files for dun ...done.
Flushing log and Time files for duo ...done.
Flushing log and Time files for dup ...done.
Flushing log and Time files for duq ...done.
Flushing log and Time files for dur ...done.
Flushing log and Time files for dus ...done.
Flushing log and Time files for dut ...done.
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
User connecting to system ctutau, flushing ring buffer.
0 bytes will be flushed
Flushing log and Time files for ctutau ...done.
User is : OPER
CMHostControlReadCallback: End
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
CMHostControlReadCallback: End
CMConsoleControlCloseCB: Start
CMConsoleControlCloseCB: End
        :
        :
    (two more repetitions of the 14 lines above)
        :
        :
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
User connecting to system ctutau, flushing ring buffer.
0 bytes will be flushed
Flushing log and Time files for ctutau ...done.
User is : OPER
CMHostControlReadCallback: End
CMHostControlAcceptCallback: Start
CMHostControlAcceptCallback: End
CMHostControlReadCallback: Start
CMHostControlReadCallback: End
CMChildControlSocketReadCB: Start
Event port ENS closed by partner
Unable to connect to event port
Failed to connect to Local listener CONSOLE_EVT_ENS on node
no such file or directory

The file ends with the "no such file or directory" line.

-----------------------------------------------------------------------------
The CONTROLLER_02.LOG file seemed to end normally at the time I did the shut down:

        :
        :
Flushing log and Time files for r2d2 ...done.
Flushing log and Time files for thor ...done.
Flushing log and Time files for vxd ...done.
Flushing log and Time files for vxo ...done.
CMConsoleControlCloseCB: Start
CMConsoleControlCloseCB: End
CMChildControlSocketReadCB: Start
$ EXIT $STATUS
  SYSTEM       job terminated at 16-DEC-1994 19:08:21.58

  Accounting information:
  Buffered I/O count:     649839      Peak working set size:      10584
  Direct I/O count:       237865      Peak page file size:        18920
  Page faults:             11352      Mounted volumes:                0
  Charged CPU time:  0 01:02:57.11    Elapsed time:        2 03:39:14.69
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
522.1 | | OPG::PHILIP | And through the square window... | Sat Dec 17 1994 16:07 | 17 |
Vern,

Thanks for the info. I believe we now have a good idea about what is causing the problems. Unfortunately, however, it's not a 5-minute fix; we need to spend some time working the problem, which we will be doing over the next few weeks. This fix is not something we are going to hold up the MUP kit for; we will issue a new ECO for this fix when it is ready.

Thanks for your help and patience. I would also like to solicit your help when we have a patch available for testing.

Cheers,
Phil
522.2 | Glad to Help Test | CGOOA::VCOOKE | Vern Cooke @CTU (Western Canada CNS) | Sun Dec 18 1994 14:05 | 8 |
Hi Phil!

I'll be glad to help test out the patch when it is ready. Please VAXmail me at CGOOA::VCOOKE to let me know where to pull the patch from, since I won't be checking this particular note for additional replies.

Have a Merry Christmas and a Happy New Year!

......... Vern.
522.3 | PCM Hung Again | 58392::COOKE | Vern Cooke @CTU (Western Canada CNS) | Wed Jan 11 1995 15:56 | 21 |
Happy New Year, Phil!

Well, we upgraded to 1.5-006 last week, and over the weekend PCM hung twice. Unfortunately, I didn't save the log files before restarting PCM, so I can't be sure what PCM was up to when it hung.

The "hang" manifested itself as CONSOLE MONITOR commands stopping at the blank screen before the prompt appears. Also, only one of the controller processes was clocking CPU, and a small amount at that. None of the PCM processes were left in RWMBX state.

I documented a procedure for our operations staff to follow in order to save the log files the next time PCM hangs. So far, it has run for two days without hanging.

Phil, did you manage to include the patch you spoke of in the MUP (-006), or is it still under development? Any suggestions for a fix or workaround would be greatly appreciated.

Thank you,
..... Vern.
522.4 | | OPG::PHILIP | And through the square window... | Wed Jan 11 1995 17:07 | 8 |
Vern,

No, the patch I spoke of is actually vapour-ware at this time. We just have not had any time to investigate this further, but don't worry: as soon as we do and we have something to test, we will be in touch.

Cheers,
Phil
522.5 | Offer to Contact Upon Failure | CGOOA::VCOOKE | Vern Cooke @CTU (Western Canada CNS) | Wed Jan 11 1995 17:51 | 12 |
Hi Phil!

Glad to hear this problem is still being worked. I would like to offer to contact you the next time the problem occurs so that you may look at a live system with the problem. If this is acceptable to you, please let me know how to go about contacting you (or your team).

Since the systems PCM manages are production, we can't tolerate an extended period with PCM down, but should be able to manage 15-30 minutes (depending on the activity at the time).

Please let me know what you think,
....... Vern.