
Conference noted::sns

Title:POLYCENTER System Watchdog for VMS OSF/1 ULTRIX HP-UX AIX SunOS
Notice:Wishes:406,FAQ:845,Kits-VMS:1000,UNIX:694 VMS ECO01 FT kit: 521
Moderator:AZUR::HUREZZ
Created:Fri May 15 1992
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1033
Total number of notes:4584

995.0. "Watchdog V2.2 ECO 2 - SCSI cluster disk errors" by CSC32::BUTTERWORTH (Gun Control is a steady hand.) Thu Feb 06 1997 16:47

    Hello all,
       I have the following interesting scenario:
    
    I have a customer with an OpenVMS "SCSI-cluster" of two Alpha 8400s.
    There is at least one common SCSI bus with several HSZ controllers. 
    The customer reported that if only one of the 8400's is running then
    Watchdog will detect increments in the disk-error counts. If both
    systems are running then increments in the disk error count are *not*
    reported. The customer is running V2.2 with ECO 2 applied. Currently
    one system is down so I cannot check some things I would like such as
    the value of SNS$CLUSTER_NAME etc.
    
     A couple of questions:
    
    Does Watchdog track increments in disk error counts by using SYS$GETDVI
    to return the error count from the UCB? If so then I would suspect it
    builds an internal table of devices so that a comparison can be made?
    If all of the above is true this means that SNS would be able to
    interrogate any disk device regardless of the type of controller. I was
    going to install the Agent on a local SCSI cluster and then use Delta
    to bump the error count in one of the UCBs, but this only works if
    the above assumptions are correct. 
    

    Regards,
       Dan
    
T.R  Title  User  Personal Name  Date  Lines
995.1. "It works this way..." by AZUR::HUREZ (Connectivity & Computing Services @VBE. DTN 828-5159) Fri Feb 07 1997 05:50 (21 lines)
    Hi Dan,
    
    The list of disks available on the system is initially fetched (and
    regularly refreshed) into Agent memory using the $DEVICE_SCAN system
    service, with the DVS$_DEVCLASS item code set to the DC$_DISK constant
    (from the $DCDEF macro).
    
    Then, at each Agent poll, we do indeed call $GETDVI on each individual
    disk to fetch the error count (DVI$_ERRCNT) and the disk status (DVI$_STS,
    DVI$_DEVCHAR and DVI$_DEVCHAR2).
    
    Old and new error counts are stored, along with the disk specification,
    in a linked list within the Agent, and values are compared at each poll
    time in order to determine whether events should be generated or not.
    
    I hope this helps...
    
    Regards,
    
    	-- Olivier.
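
A minimal sketch of the poll-and-compare cycle described in 995.1, in Python rather than the Agent's own language. Everything here is hypothetical illustration: `DiskRecord`, `poll` and `read_errcnt` are invented names, `read_errcnt` merely stands in for a $GETDVI call with DVI$_ERRCNT, a dictionary replaces the Agent's linked list, and treating a newly seen disk as a silent baseline is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class DiskRecord:
    """Hypothetical per-disk record holding the previous and latest counts."""
    name: str
    old_errcnt: int
    new_errcnt: int


def poll(table: Dict[str, DiskRecord],
         disks: Iterable[str],
         read_errcnt: Callable[[str], int]) -> List[str]:
    """Run one poll cycle and return the disks whose error count rose."""
    events = []
    for name in disks:
        count = read_errcnt(name)  # stand-in for $GETDVI with DVI$_ERRCNT
        rec = table.get(name)
        if rec is None:
            # Assumption: a disk seen for the first time only sets a baseline.
            table[name] = DiskRecord(name, count, count)
            continue
        # Shift the latest count into "old", then store the fresh reading.
        rec.old_errcnt, rec.new_errcnt = rec.new_errcnt, count
        if rec.new_errcnt > rec.old_errcnt:
            events.append(name)
    return events
```

Under these assumptions, the first call to `poll` only records baselines, and a later call reports a disk once its count has gone up since the previous poll.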
    
995.2 by CSC32::BUTTERWORTH (Gun Control is a steady hand.) Fri Feb 07 1997 19:28 (6 lines)
    Thanks Olivier. Since you use $DEVICE_SCAN, it should find all
    disks regardless of the controller or storage architecture. Looks like
    I'll have to try and set up a local SCSI cluster. 
    
    Regs,
      Dan
995.3 by COMICS::JOLLEYD Wed Feb 19 1997 09:42 (37 lines)
Hello all,


	Sorry to piggyback on this entry, but it fits the bill for an error I
	have seen.

	Customer is running OpenVMS V6.1 with Watchdog V2.2 ECO 2 and is getting
	the following error on starting an Agent.

	%ADA-F-EXCCOPX, Exception was copied at a raise or accept statement
	-SYSTEM-F-RANGEERR, range error, PC=00044C7C , PS=0000001B

	After setting the logical sns$watchdog to full, the following output is
	seen, I guess from the device_scan routine:
	WD_HW:      Disk :_$1$DKC300:
	WD_HW:       -> New. 0
	WD_HW:      Disk :_$1$DKC400:
	WD_HW:       -> New. 0
	WD_HW:      Disk :_$1$DKC500:
	WD_HW:       -> New. 0
	
	The program then gives the Ada error above. The next device is $1$DKC600:

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
$1$DKC600:    (HUMV12)  Online           20779

	You can see the large error count.

	Is this breaking the code?

	My customer cannot reload for a week, so I cannot confirm that this is
	the case.

	Regards

	Darren (OpenVMS Support. UK CSC)	
995.4. "There's something wrong indeed..." by AZUR::HUREZ (Connectivity & Computing Services @VBE. DTN 828-5159) Wed Feb 19 1997 12:17 (22 lines)
    Well, this is rather strange, since:
    
    	. We're using $GETDVI with the DVI$_ERRCNT item code,
    	  which returns the error count as a 32-bit number.
    
    	. We cast this down to a 16-bit unsigned number, which looks
    	  weird and is probably the source of the problem you're experiencing...
    
    	. ... although 16 bits unsigned gives a range from 0 to 65535,
    	  which is enough to hold the 20779 error count you got (unless
    	  VMS doesn't show it correctly either, or another disk has an
    	  even greater count...).
    
    Anyway, I'll address that in the source code for ECO04 (currently in
    Field Test) and let you know when a new kit is ready, so that you can
    check on the customer site whether it fixes the problem, if you wish...
    
    Best Regards,
    
    	-- Olivier Hurez.
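
To make the range arithmetic in 995.4 concrete, here is a small sketch in Python, not the Agent's actual code. `narrow_checked` and `narrow_saturating` are hypothetical helpers: the first mimics a range-checked narrowing of a 32-bit count that fails when the value does not fit, and the second shows clamping as one possible alternative. Neither is claimed to be what ECO04 does.

```python
UINT16_MAX = 0xFFFF  # 65535, top of the 16-bit unsigned range
INT16_MAX = 0x7FFF   # 32767, top of the 16-bit signed range


def narrow_checked(value: int, upper: int = UINT16_MAX) -> int:
    """Return value unchanged if it lies in 0..upper, else raise.

    Hypothetical stand-in for a range-checked narrowing conversion.
    """
    if not 0 <= value <= upper:
        raise OverflowError(f"{value} is outside 0..{upper}")
    return value


def narrow_saturating(value: int, upper: int = UINT16_MAX) -> int:
    """Clamp value into 0..upper instead of failing (illustrative only)."""
    return max(0, min(value, upper))
```

Note that the reported count of 20779 passes the check against both 65535 and 32767, so on its own it would not overflow a 16-bit field of either signedness. That matches the doubt expressed in 995.4 that this particular value is the whole story.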
    
    
995.5 by CSC32::BUTTERWORTH (Gun Control is a steady hand.) Tue Mar 25 1997 21:31 (24 lines)
    The plot thickens. I have reproduced this problem on a small SCSI
    cluster that consists of an AlphaServer 1000 and an AlphaStation 400.
    There is at least one RZ28B sitting right on the common SCSI
    bus.
    
    I can bump the error counts with DELTA and no event is reported.
    SNS$DSK_FILTER_OFF is set to TRUE, so all increments should
    be reported. As a control, I tried the DELTA trick on a CI-based
    VAXcluster and all errors were reported faithfully.
    
    I set SNS$WATCHDOG_TRACE to FULL and indeed the agent sees the
    increment, *but* it never adds an entry to the message list, so the
    consolidator never sees it. The agents in question are T2.2-08, so
    I want to install at least ECO 3 and retest. If it fails we will
    go ahead and IPMT it. The consolidator has ECO 3 installed and is
    a VAXstation 4000-60, by the way.
    
    Any thoughts Ollie?
    
    Regs,
      Dan.