
Conference noted::sns

Title:	POLYCENTER System Watchdog for VMS OSF/1 ULTRIX HP-UX AIX SunOS
Notice:	Wishes:406,FAQ:845,Kits-VMS:1000,UNIX:694 VMS ECO01 FT kit: 521
Moderator:	AZUR::HUREZZ

Created:	Fri May 15 1992
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1033
Total number of notes:	4584

995.0. "Watchdog V2.2 ECO 2 - SCSI cluster disk errors" by CSC32::BUTTERWORTH (Gun Control is a steady hand.) Thu Feb 06 1997 16:47

    Hello all,
       I have the following interesting scenario:
    
    I have a customer with an OpenVMS "SCSI-cluster" of two Alpha 8400s.
    There is at least one common SCSI bus with several HSZ controllers. 
    The customer reported that if only one of the 8400's is running then
    Watchdog will detect increments in the disk-error counts. If both
    systems are running then increments in the disk error count are *not*
    reported. The customer is running V2.2 with ECO 2 applied. Currently
    one system is down so I cannot check some things I would like such as
    the value of SNS$CLUSTER_NAME etc.
    
     A couple of questions:
    
    Does Watchdog track increments in disk error counts by using SYS$GETDVI
    to return the error count from the UCB? If so then I would suspect it
    builds an internal table of devices so that a comparison can be made?
    If all of the above is true this means that SNS would be able to
    interrogate any disk device regardless of the type of controller. I was
    going to install the Agent on a local SCSI cluster and then use Delta
    to bump the error count in one of the UCBs but this only works if
    the above assumptions are correct. 
    

    Regards,
       Dan

T.R	Title	User	Personal Name	Date	Lines
995.1	It works this way...	AZUR::HUREZ	Connectivity & Computing Services @VBE. DTN 828-5159	`Fri Feb 07 1997 05:50`	21
    Hi Dan,

    The list of disks available on the system is originally fetched (and
    regularly updated) into the Agent memory using the $DEVICE_SCAN system
    service with a DVS$_DEVCLASS item code set to the DC$_DISK constant
    (from the $DCDEF macro).

    Then, at each Agent poll time, we do indeed use $GETDVI on each
    individual disk to fetch error counters (DVI$_ERRCNT) and disk status
    (DVI$_STS, DVI$_DEVCHAR and DVI$_DEVCHAR2).

    Old and new error counts are stored, along with the disk
    specification, in a linked list within the Agent, and the values are
    compared at each poll time to determine whether events should be
    generated or not.

    I hope this helps...

    Regards,
       -- Olivier.
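    The compare-at-each-poll logic described in 995.1 can be sketched as
    follows. This is only an illustration in Python, not the Agent's actual
    code: `scan_disks` and `read_error_count` are hypothetical stand-ins for
    the $DEVICE_SCAN and $GETDVI (DVI$_ERRCNT) calls, and a dictionary is
    used here in place of the linked list mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class DiskRecord:
    """Old and new error counts kept for one disk (hypothetical layout)."""
    old_errcnt: int
    new_errcnt: int


def poll(
    table: Dict[str, DiskRecord],
    scan_disks: Callable[[], Iterable[str]],
    read_error_count: Callable[[str], int],
) -> List[Tuple[str, int, int]]:
    """Run one poll cycle and return (disk, old, new) for each increment."""
    events = []
    for name in scan_disks():
        count = read_error_count(name)
        rec = table.get(name)
        if rec is None:
            # First sighting of this disk: record a baseline, raise no event.
            table[name] = DiskRecord(old_errcnt=count, new_errcnt=count)
            continue
        # Shift the previous reading into "old" and store the fresh one.
        rec.old_errcnt, rec.new_errcnt = rec.new_errcnt, count
        if rec.new_errcnt > rec.old_errcnt:
            events.append((name, rec.old_errcnt, rec.new_errcnt))
    return events


if __name__ == "__main__":
    # Simulated error counters standing in for the values held in the UCBs.
    counts = {"$1$DKC300:": 0, "$1$DKC600:": 20779}
    table: Dict[str, DiskRecord] = {}
    poll(table, lambda: list(counts), counts.__getitem__)  # baseline pass
    counts["$1$DKC600:"] += 2                              # simulate a bump
    print(poll(table, lambda: list(counts), counts.__getitem__))
```

    Under these assumptions the first poll only records baselines, so an
    event can be produced from the second poll onwards, and only for a disk
    whose count has grown since the previous poll.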
995.2		CSC32::BUTTERWORTH	Gun Control is a steady hand.	`Fri Feb 07 1997 19:28`	6
    Thanks Olivier. Since you use $DEVICE_SCAN, it should find all
    disks regardless of the controller or storage architecture. Looks like
    I'll have to try and set up a local SCSI cluster.

    Regs,
       Dan
995.3		COMICS::JOLLEYD		`Wed Feb 19 1997 09:42`	37
    Hello all,

    Sorry to piggyback on this entry but it fits the bill for an error I
    have seen.

    Customer is running VMS 6.1 with Watchdog 2.2 ECO 2 and is getting the
    following error on starting an agent.

    %ADA-F-EXCCOPX, Exception was copied at a raise or accept statement
    -SYSTEM-F-RANGEERR, range error, PC=00044C7C , PS=0000001B

    Upon setting the logical sns$watchdog to full, the following output is
    seen, from the device_scan routine I guess.

    WD_HW: Disk :_$1$DKC300:
    WD_HW: -> New. 0
    WD_HW: Disk :_$1$DKC400:
    WD_HW: -> New. 0
    WD_HW: Disk :_$1$DKC500:
    WD_HW: -> New. 0

    The program then gives the Ada error above. The next device is
    $1$dkc600:

    Device                  Device     Error    Volume     Free  Trans Mnt
     Name                   Status     Count     Label    Blocks Count Cnt
    $1$DKC600:    (HUMV12)  Online     20779

    You can see the large error count. Is this breaking the code?

    My customer cannot reload for a week so I cannot confirm that this is
    the case.

    Regards
       Darren (OpenVMS Support. UK CSC)
995.4	There's something wrong indeed...	AZUR::HUREZ	Connectivity & Computing Services @VBE. DTN 828-5159	`Wed Feb 19 1997 12:17`	22
    Well, this is strange enough, since

    . We're using GETDVI with the DVI$_ERRCNT item code, which returns the
      error count as a 32-bit decimal number.

    . We cast this down to a 16-bit unsigned decimal number, which looks
      weird and probably is the source of the problem you're
      experiencing...

    . ... although 16-bit unsigned provides a range from 0 up to 65535,
      which is enough to hold the 20779 error count you got (unless VMS
      doesn't show it correctly either, or another disk has an even
      greater count...).

    Anyway, I'll address that in the source code for ECO04 (currently in
    Field Test) and let you know when a new kit is ready for you to
    cross-check its efficiency on the customer site if you wish...

    Best Regards,
       -- Olivier Hurez.
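    The range arithmetic in 995.4 can be checked with a small sketch. This
    is Python written for illustration, not the Agent's Ada source; the
    function names are hypothetical. It contrasts a checked narrowing,
    which fails on an out-of-range value, with a wrapping one, which
    silently discards the upper bits.

```python
UINT16_MAX = 0xFFFF  # 65535, the upper bound quoted in 995.4


def to_uint16_checked(value: int) -> int:
    """Narrow to 16-bit unsigned, raising if the value does not fit."""
    if not 0 <= value <= UINT16_MAX:
        raise OverflowError(f"{value} is outside 0..{UINT16_MAX}")
    return value


def to_uint16_wrapping(value: int) -> int:
    """Narrow to 16-bit unsigned by keeping only the low 16 bits."""
    return value & UINT16_MAX


if __name__ == "__main__":
    # The count reported for $1$DKC600: fits, so it passes the check.
    print(to_uint16_checked(20779))
    # A count above 65535 would lose information if it simply wrapped.
    print(to_uint16_wrapping(70000))
```

    Under this reading, 20779 alone would not trip a 0..65535 check, which
    matches the doubt expressed in 995.4 that this disk's count is the
    whole story.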
995.5		CSC32::BUTTERWORTH	Gun Control is a steady hand.	`Tue Mar 25 1997 21:31`	24
    The plot thickens. I have reproduced this problem on a small SCSI
    cluster that consists of an AlphaServer 1000 and an AlphaStation 400.
    There is at least one RZ28B sitting right on the common SCSI bus. I can
    bump the error counts with DELTA and no event is reported.
    SNS$DSK_FILTER_OFF is set to TRUE, so all increments should be
    reported. As a control, I tried the DELTA trick on a CI-based
    VAXcluster and all errors were reported faithfully.

    I set SNS$WATCHDOG_TRACE to FULL and indeed the agent sees the
    increment, but it never adds an entry to the message list, so the
    consolidator never sees it. The agents in question are T2.2-08, so I
    want to install at least ECO 3 and retest. If it fails we will go ahead
    and IPMT it. The consolidator has ECO 3 installed and is a VAXstation
    4000-60, by the way.

    Any thoughts Ollie?

    Regs,
       Dan.