
Conference ssdevo::hsz40_product

Title:HSZ40 Product Conference
Moderator:SSDEVO::EDMONDS
Created:Mon Apr 11 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:902
Total number of notes:3319

728.0. "HSZ50 Cache version 0 errors" by SWAM1::WOLFE_LE () Sun Jan 05 1997 06:09

T.R  Title  User  Personal Name  Date  Lines
728.1  SSDEVO::T_GONZALES  Mon Jan 06 1997 10:49  2
728.2  Could be, but......  SWAM1::WOLFE_LE  Tue Jan 07 1997 17:14  7
728.3  Cache Errors on HSZ50  SWAM1::WOLFE_LE  Mon Feb 24 1997 10:45  67
      Well, the system has failed with cache version mismatch errors again,
    and this time it is down hard.  The BA350-MA box has been replaced to
    eliminate it as a possible cause. The BA350-MA has dual BA35X-HF
    power supplies, and each of the HSZ50's is running V5.0. Each controller
    has 64meg cache. There are external batteries.  Here is a snap of
    the error sequence. Disregard any shelf fan or power supply bad errors,
    as only one power controller is plugged in.
    
    
    %EVL--Left_HSZ50> --13-JAN-1946 04:32:54 (time not set)-- Instance Code: 0102030A (not yet reported to host)
     Template: 1.(01)
     Occurred on 23-FEB-1997 at 10:47:50
     Power On Time: 0. Years, 51. Days, 19. Hours, 16. Minutes, 33. Seconds
     Controller Model: HSZ50-AX
     Serial Number: ZG63300556 Hardware Version:  A01(01)
     Firmware Version: V50Z(50)
     Informational Report
     Instance Code: 0102030A
     Last Failure Code: 20080000 (No Last Failure Parameters)
    
    %EVL--Left_HSZ50> --13-JAN-1946 04:32:54 (time not set)-- Instance Code: 02072201 (not yet reported to host)
     Template: 20.(14)
     Power On Time: 0. Years, 51. Days, 19. Hours, 16. Minutes, 33. Seconds
     Controller Model: HSZ50-AX
     Serial Number: ZG63300556 Hardware Version:  A01(01)
     Firmware Version: V50Z(50)
     Reported via non-maskable interrupt
     Memory Address: 00000000
     Byte Count: 0.(00000000)
     DRAB Registers:
      DSR:  20136830  CSR:  201385C0 DCSR:  20138C40  DER:  20138B60  EAR:  20136780
      EDR:  20138B60  ERR:  20138B60  RSR:  20083E30  CHC:  A4FCFCFD  CMC:  20B8FEF0
     Diagnostic Registers:
      RDR0: A4FCFCFD  RDR1: 20B8FEF0  WDR0: 7F0397B0  WDR1: FF0E020D
     Instance Code: 02072201
    Left_HSZ50> SHO THIS
    Controller:
            HSZ50-AX ZG63300556 Firmware V50Z-1, Hardware  A01
            Not configured for dual-redundancy
                Controller misconfigured -- other controller present
            SCSI address 7
            Time: NOT SET
    Host port:
            SCSI target(s) (0, 1, 2), Preferred target(s) (0)
            TRANSFER_RATE_REQUESTED = 10MHZ
    Cache:
            Unknown size read cache, version unknown
            Cache is FAILED
            Host Functionality Mode = D
    Cache module failed diagnostic testing of memory controllers
    Controllers misconfigured.  Type SHOW THIS_CONTROLLER
    Left_HSZ50> set failover copy=this
    Error 6130: Both caches must be at version 3.  This cache is at version 4294967295,
                the other cache is at version 0
    Right_HSZ50> 
    Right_HSZ50> set failover copy=this
    Error 6130: Both caches must be at version 3.  This cache is at version 4294967295,
                the other cache is at version 0
    Right_HSZ50>
    Right_HSZ50> 
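
    For anyone comparing dumps like this across several failures, a small
    script can pull the register values out of the event log text. This is
    only a sketch: the register names and values are copied from the log
    above, while the function name and the regular expression are my own
    assumptions about the layout, not anything taken from the firmware or
    its documentation. It also lists registers that report the same value,
    without trying to say what that means.

```python
import re
from collections import defaultdict

# Register lines copied from the event log above, wrapped lines rejoined.
DUMP = """
DSR:  20136830  CSR:  201385C0 DCSR:  20138C40  DER:  20138B60  EAR:  20136780
EDR:  20138B60  ERR:  20138B60  RSR:  20083E30  CHC:  A4FCFCFD  CMC:  20B8FEF0
RDR0: A4FCFCFD  RDR1: 20B8FEF0  WDR0: 7F0397B0  WDR1: FF0E020D
"""


def parse_registers(text):
    """Return {register name: integer value} for every 'NAME: HEXVALUE' pair.

    Hypothetical helper: assumes names are capital letters with an optional
    trailing digit and values are exactly eight hex digits.
    """
    pairs = re.findall(r"([A-Z]+\d?):\s+([0-9A-F]{8})", text)
    return {name: int(value, 16) for name, value in pairs}


def repeated_values(regs):
    """Group register names that report an identical value."""
    by_value = defaultdict(list)
    for name, value in regs.items():
        by_value[value].append(name)
    return {f"{v:08X}": names for v, names in by_value.items() if len(names) > 1}


regs = parse_registers(DUMP)
repeats = repeated_values(regs)

for name, value in regs.items():
    print(f"{name:5s} {value:08X}")
for value, names in repeats.items():
    print(f"{value} reported by {', '.join(names)}")
```

    Saving one such table per failure makes it easy to see whether the
    same registers change when modules are swapped between slots.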
    
728.4  Board broken?  SSDEVO::RMCLEAN  Mon Feb 24 1997 10:52  1
Get a new cache board.  All the error messages point to a failed board.
728.5  Modules have been replaced  SWAM1::WOLFE_LE  Tue Feb 25 1997 07:22  7
    I have replaced the cache module with a new module once, with the same
    results. Currently the cache board in the right side (the failing side)
    is from the left side, which runs error-free. The cache module which
    always fails in the right side is running in the left side error-free.
    The BA350-MA box has been replaced, along with the two 32mb simms.
    
    -Lee 
728.6  Try replacing the controller  SSDEVO::RMCLEAN  Tue Feb 25 1997 10:29  5
  After looking at your error log and discussing this with a hardware 
engineer, I would suggest that you try replacing the controller itself.
The error log points to controller memory and not cache memory.  The
problem has to be related to the interface between the cache and the
controller.  It is very confused because it says it is only a read cache.
728.7  The Controller was blasted  SWAM1::WOLFE_LE  Tue Feb 25 1997 21:41  11
    That's a good idea, so I tried it, with the same results. When I moved
    the failing SET (both cache and controller) to the left side, it ran
    error-free.  I then moved the error-free left SET to the right hand
    side, and it got cache version mismatches.
    
    Could it be that it's ID 7 (right side) that has responsibility for some 
    minor testing of both sides if populated, and the fact there is a bad 
    module in ID 6 (left side) would cause an error??
    
    -Thanks for your efforts
     Lee
728.8  Try this to isolate the problem  SSDEVO::FAVA  4 Yrs of Eng Sch & Never Saw a Train  Wed Feb 26 1997 16:33  42
	Your problem is certainly perplexing but it must be some kind of 
	module problem.  I would suggest the following procedure to try to
	isolate the bad module.  Note that some of these you have already 
	tried, but I couldn't determine exactly which ones from the earlier
	replies.

	Use this notation to try different module combinations and note the
	results of each:

		KA = Controller A
		CA = Cache A
		KB = Controller B
		CB = Cache B
		-- = Empty slot

	The following combinations assume your "left" and "right" notation
	from earlier replies.

	1)  CA  KA  --  --
	2)  --  --  CA  KA
	3)  CB  KB  --  --
	4)  --  --  CB  KB

	5)  CB  KA  --  --
	6)  --  --  CB  KA
	7)  CA  KB  --  --
	8)  --  --  CA  KB

	If these 8 combinations all work (which from your earlier description
	I suspect they will), then the problem is most likely a cache module.
	To isolate which one, try the following combinations:

	 9)  CA  KA  CB  KB
	10)  CB  KB  CA  KA
	11)  CB  KA  CA  KB
	12)  CA  KB  CB  KA

	I am guessing at this point that one controller will complain
	about its partner's cache, but not about its own.  Let us know
	the results and we'll go from there.

						Tom Fava
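
	If it helps to keep track of the runs, the twelve combinations above
	can be generated as a worksheet. This is only a sketch: the module
	labels CA, CB, KA, KB and "--" come from the procedure above, while
	the function names are made up for illustration. The combinations
	come out in a different order than the numbered list, which should
	not matter for the test.

```python
from itertools import product

CACHES = ["CA", "CB"]
CONTROLLERS = ["KA", "KB"]
EMPTY = ("--", "--")


def single_set_layouts():
    """Combinations 1-8: one cache/controller pair, tried in each side."""
    layouts = []
    for cache, ctrl in product(CACHES, CONTROLLERS):
        layouts.append((cache, ctrl) + EMPTY)  # pair in the left slots
        layouts.append(EMPTY + (cache, ctrl))  # pair in the right slots
    return layouts


def dual_set_layouts():
    """Combinations 9-12: both pairs installed, every pairing and side."""
    layouts = []
    for left_cache, left_ctrl in product(CACHES, CONTROLLERS):
        # Whatever is not in the left slots goes in the right slots.
        right_cache = "CB" if left_cache == "CA" else "CA"
        right_ctrl = "KB" if left_ctrl == "KA" else "KA"
        layouts.append((left_cache, left_ctrl, right_cache, right_ctrl))
    return layouts


singles = single_set_layouts()
duals = dual_set_layouts()

# Print a worksheet with a blank column for the observed result.
for number, layout in enumerate(singles + duals, start=1):
    print(f"{number:2d})  {'  '.join(layout)}    result: ______")
```

	Fill in pass or fail for each line as it is tried, so the pattern
	(module versus slot) is visible at a glance.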
728.9  Software looks like a hardware error  SSDEVO::RMCLEAN  Wed Feb 26 1997 16:44  10
Thanks Tom for the hardware view.

From the software view you are getting a DRAB error.  The diagnostics are 
failing the cache tests.  This is why you are being told that the version of
the cache is 0.  I have no idea what is failing in the diagnostics from
looking at this, but it is not a fatal error since it thinks you can run
without it (which you can as long as you don't want dual redundancy).

When the software gets a diagnostic error on a DRAB it doesn't try to 
determine which version of the cache is installed.
728.10  Error message is wrong  SSDEVO::RMCLEAN  Wed Feb 26 1997 16:53  3
I also noted that the error message reported is not correct.  Cache version
0 is correct but the large number is wrong.  I have reported this problem and
hopefully it will be fixed in a future release of the software.
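
A note on the large number itself: 4294967295 is the largest value a 32-bit
unsigned integer can hold, that is, all 32 bits set. The thread does not say
why the firmware prints it, so the following is only a guess at one possible
cause: a "version unknown" marker of -1 (or an all-ones field) being printed
as an unsigned number. The arithmetic below is certain; the explanation is an
assumption.

```python
import struct

reported = 4294967295  # value shown in the Error 6130 message

# The reported number is all ones in 32 bits.
is_all_ones = reported == 0xFFFFFFFF == 2**32 - 1

# Reinterpret the same 32 bits as a signed integer.
(as_signed,) = struct.unpack("<i", struct.pack("<I", reported))

print(f"{reported} = 0x{reported:08X}")
print(f"same bits read as signed 32-bit: {as_signed}")
```

If that guess is right, the message would be showing an uninitialized or
"unknown" version field rather than a real cache version.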
728.11  Will take a stab on 9th  SWAM1::WOLFE_LE  Sat Mar 01 1997 01:08  4
    The customer wishes to schedule the downtime to fix this problem on the
    9th of March.  I will keep everyone updated.
    
    -Lee
728.12  Waiting...........  SWAM1::WOLFE_LE  Tue Mar 11 1997 02:32  4
    Still waiting for the customer to schedule downtime.  I will report
    back.
    
    -Lee
728.13  <Show me the CACHE>  SWAM1::WOLFE_LE  Sun Mar 23 1997 23:37  19
     The customer scheduled down time, and I brought two controllers, two
    32meg simms and 1 cache module.  After replacing both the cache module
    and two 32 meg simms, the board would pass intermittently for the first
    5 restarts. Then the right controller (ID 7) would generate cache
    failure errors.  
    
     All modules were removed and placed in the left slots (ID 6 ) and
    tested with selftest this.  Both sets passed.  Both sets were placed in
    the right hand side (single controller mode) and tested.  With no
    module mixing including a new controller and cache, ALL module sets
    would fail in the right hand side (ID 7). The cache batteries and
    PCMCIA program cards were rotated.  
    
     Since the BA350-MA has been replaced, could this be a rev or part number 
    issue?  Is the BA350-MA the correct controller box to use? TIMA QRL
    shows this as the replacement part number. This was another Avnet erector 
    set.  All other HSZ50's installed in the area have been operational.
    
    -Lee
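
    The reasoning behind suspecting the shelf can be written down as a
    small check: if every test in one slot fails while every test in the
    other slot passes, and no module set fails everywhere, the fault
    follows the slot and not the modules. This is a sketch only. The four
    observations mirror the results above (both sets passed on the left,
    both failed on the right); the labels "set 1" and "set 2", the
    dictionary keys, and the function name are made up for illustration.

```python
# Results from the single-controller tests described above.
OBSERVATIONS = [
    {"set": "set 1", "slot": "left (ID 6)", "passed": True},
    {"set": "set 2", "slot": "left (ID 6)", "passed": True},
    {"set": "set 1", "slot": "right (ID 7)", "passed": False},
    {"set": "set 2", "slot": "right (ID 7)", "passed": False},
]


def fault_follows(observations, key):
    """Return values of `key` whose tests all failed while all others passed."""
    culprits = []
    for value in {obs[key] for obs in observations}:
        with_value = [o["passed"] for o in observations if o[key] == value]
        without_value = [o["passed"] for o in observations if o[key] != value]
        if not any(with_value) and all(without_value):
            culprits.append(value)
    return sorted(culprits)


by_slot = fault_follows(OBSERVATIONS, "slot")
by_set = fault_follows(OBSERVATIONS, "set")

print("fault follows slot:", by_slot)
print("fault follows module set:", by_set)
```

    With these results the check points at the right hand slot and at no
    module set, which is consistent with suspecting the shelf or backplane
    side of things.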
728.14  Need 16 Bits  SWAM1::WOLFE_LE  Tue Mar 25 1997 17:27  10
      Just to update anyone monitoring this notes file, the wrong
    controller shelf has been in use since the install. Avnet sent an
    unassembled SW800 erector set, and included a BA350-MA rev A02
    controller shelf. The part number is 70-29760-03.  This is an 8-bit
    shelf.  The correct part to use is 70-31489-01, a BA356 16-bit
    controller shelf. Any newer BA350-MB shelf with a power supply would be
    a BA356 rev D01 shelf, and would function normally.
    
    -Lee
    
728.15  <The need for double speed>  SWAM1::WOLFE_LE  Sat Mar 29 1997 23:28  12
      The customer brought down the HSZ50 today, and the BA350-MA was
    replaced. After visiting several other HSZ50 sites, it was found all
    these sites had BA350-MA rev B01 with double speed fans. The BA350-MA
    delivered with the Avnet erector set SW800 was rev A02. This appears to
    have fixed the problem.  After numerous warm and cold restarts, there
    are no cache version 0 errors on the slot 7 cache. The backplane part
    numbers were identical between the A02 and B01. The only difference I
    could detect was the double speed fans. It is possible the controller
    is meant to monitor these fans, and on the rev A02 was instead
    presented with a different or floating signal????
    
    -Lee