Conference noted::sns

Title:POLYCENTER System Watchdog for VMS OSF/1 ULTRIX HP-UX AIX SunOS
Notice:Wishes:406,FAQ:845,Kits-VMS:1000,UNIX:694 VMS ECO01 FT kit: 521
Moderator:AZUR::HUREZZ
Created:Fri May 15 1992
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1033
Total number of notes:4584

985.0. "SNS_C_DSK message removed/added problems" by CSC32::R_RIDGWAY () Mon Jan 20 1997 13:21

T.R  Title  User  Personal Name  Date  Lines
985.1. "???" by AZUR::HUREZ (Connectivity & Computing Services @VBE. DTN 828-5159) Tue Jan 21 1997 13:26, 18 lines
985.2. "Any luck on the dial in..." by CSC32::R_RIDGWAY () Tue Jan 28 1997 14:32, 11 lines
    
    Hi Olivier;
    
    I'm sure you're a busy fellow, but I had sent you dial-in
    information for the problem system.  I was wondering if you had
    the time to look at the problem with the extraneous
    "message removed/added" SNS_C_DSK events.
    
    
    Thanks;
    Rodger Ridgway, csc
985.3. "Consolidation problem indeed." by AZUR::HUREZ (Connectivity & Computing Services @VBE. DTN 828-5159) Wed Jan 29 1997 12:02, 22 lines
    Busy, you say... Indeed.  I'm having serious difficulty
    serializing things :-(  I haven't logged in yet.  Sorry.
    
    However, I could see the extraneous messages on the local cluster I'm
    using, given the recent hardware problems I've had on the Alpha node
    that is connected there...
    
    28-JAN 04:48 LAVA   Disk _CCOMCA$DKA300: status is mount verify timeout
    28-JAN 04:48 LAVA   Disk _CCOMCA$DKA200: status is mount verify timeout
    28-JAN 04:47 YIPPEE Disk _CCOMCA$DKA300: status is mount verify timeout
    28-JAN 04:47 YIPPEE Disk _CCOMCA$DKA200: status is mount verify timeout
    
    even though LAVA, YIPPEE and CCOMCA are all members of the AZUR cluster.
    SYS$CLUSTER_NAME is correctly defined and the profile is OK, so there
    must be a bug somewhere.
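    
    To illustrate what I would expect consolidation to do here, a minimal
    sketch in Python.  This is not the SNS code; the function name, the
    tuple layout and the idea of keying on the cluster name are all
    assumptions made for the example.  Events that differ only by the
    reporting cluster member should collapse into a single entry.

```python
from collections import OrderedDict

def consolidate(events, cluster_name, cluster_members):
    """Collapse events that differ only by the reporting cluster member.

    Hypothetical sketch: 'events' is a list of (timestamp, node, text)
    tuples. Events reported by a cluster member are keyed on the cluster
    name instead of the node, so duplicates merge; the first one seen wins.
    """
    merged = OrderedDict()
    for timestamp, node, text in events:
        scope = cluster_name if node in cluster_members else node
        key = (scope, text)
        if key not in merged:
            merged[key] = (timestamp, scope, text)
    return list(merged.values())

# The four events from the log above, reported by two member nodes.
events = [
    ("28-JAN 04:48", "LAVA", "Disk _CCOMCA$DKA300: status is mount verify timeout"),
    ("28-JAN 04:48", "LAVA", "Disk _CCOMCA$DKA200: status is mount verify timeout"),
    ("28-JAN 04:47", "YIPPEE", "Disk _CCOMCA$DKA300: status is mount verify timeout"),
    ("28-JAN 04:47", "YIPPEE", "Disk _CCOMCA$DKA200: status is mount verify timeout"),
]

result = consolidate(events, "AZUR", {"LAVA", "YIPPEE", "CCOMCA"})
for entry in result:
    print(*entry)
```

    With correct consolidation, the four lines above would reduce to two,
    one per disk, attributed to the cluster rather than to each node.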
    
    This is in ECO03 as well.  It seems that I've got one more thing
    to debug, and one more ECO to issue :-(
    
    Regards,
    
    	-- Olivier.
985.4. "" by CSC32::BUTTERWORTH (Gun Control is a steady hand.) Tue Feb 04 1997 20:08, 22 lines
    Ollie,
      An update on this problem, as I spoke with the customer today. It
    seems this is happening during periods of very high CPU utilization
    on the agents (as in 100% or close to it). It may be that the agent is
    not getting quite enough CPU time to perform its work. The customer
    has set the following logicals on the consolidator node:
    
     SNS$DECNET_CONNECTION_TIMEOUT = "60"
     SNS$UNR_RETRY_NUMBER = "2"
    
    The customer has speculated that the consolidator has timed out the
    first attempt, and that the timer for the second attempt is running
    when we finally get a response from the Agent node.
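    
    To make the customer's speculation concrete, here is a rough sketch
    in Python.  It is only a model of the hypothesis, not of the actual
    SNS code: the function name is made up, the timeout is assumed to be
    in seconds, and the retry number is assumed to mean the total number
    of attempts.

```python
def attempt_in_flight(response_delay, timeout=60, attempts=2):
    """Return the 1-based attempt that is running when the agent's reply
    arrives, or None if every attempt has already timed out.

    Hypothetical model: each attempt starts as soon as the previous one
    times out, and the agent answers 'response_delay' seconds after the
    first request was sent.
    """
    attempt = int(response_delay // timeout) + 1
    return attempt if attempt <= attempts else None

# A reply that takes 75 seconds lands while the second timer is running.
print(attempt_in_flight(75))
```

    Under these assumptions, any reply delayed by between 60 and 120
    seconds would arrive during the second attempt, which is the window
    the customer is describing.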
    
    We are going to try increasing the base priority of the Agent process
    to 6 and see if it has any effect. If there is a hole in the code, I
    would suspect the agent, in that it thinks it has lost its connection to
    the consolidator while the consolidator has not yet exhausted its retry
    limit. It's very weird that this only happens with "disk errors".
    
    Regards,
       Dan
985.5. "There's a bug in the event consolidation part." by AZUR::HUREZ (Connectivity & Computing Services @VBE. DTN 828-5159) Wed Feb 05 1997 08:05, 6 lines
    It isn't a timer problem.  I located the bug in the code.  It is
    strange that it was not reported before, as it must have been there for a
    while...  The fix is not easy;  I'm busy working on it.  It will be
    available in ECO04.
                   
    	-- Olivier.