T.R | Title | User | Personal Name | Date | Lines |
---|
728.1 | | SSDEVO::T_GONZALES | | Mon Jan 06 1997 10:49 | 2 |
728.2 | Could be, but...... | SWAM1::WOLFE_LE | | Tue Jan 07 1997 17:14 | 7 |
728.3 | Cache Errors on HSZ50 | SWAM1::WOLFE_LE | | Mon Feb 24 1997 10:45 | 67 |
| Well, the system has failed with cache version mismatch errors again,
and this time it is down hard. The BA350-MA box has been replaced to
eliminate it as a possible cause. The BA350-MA has dual BA35X-HF
power supplies, and each of the HSZ50's is running V5.0. Each controller
has 64 MB of cache. There are external batteries. Here is a snapshot of
the error sequence. Disregard any shelf fan or power supply errors,
as only one power controller is plugged in.
%EVL--Left_HSZ50> --13-JAN-1946 04:32:54 (time not set)-- Instance Code: 0102030A (not yet reported to host)
Template: 1.(01)
Occurred on 23-FEB-1997 at 10:47:50
Power On Time: 0. Years, 51. Days, 19. Hours, 16. Minutes, 33. Seconds
Controller Model: HSZ50-AX
Serial Number: ZG63300556 Hardware Version: A01(01)
Firmware Version: V50Z(50)
Informational Report
Instance Code: 0102030A
Last Failure Code: 20080000 (No Last Failure Parameters)
%EVL--Left_HSZ50> --13-JAN-1946 04:32:54 (time not set)-- Instance Code: 02072201 (not yet reported to host)
Template: 20.(14)
Power On Time: 0. Years, 51. Days, 19. Hours, 16. Minutes, 33. Seconds
Controller Model: HSZ50-AX
Serial Number: ZG63300556 Hardware Version: A01(01)
Firmware Version: V50Z(50)
Reported via non-maskable interrupt
Memory Address: 00000000
Byte Count: 0.(00000000)
DRAB Registers:
DSR: 20136830 CSR: 201385C0 DCSR: 20138C40 DER: 20138B60 EAR: 20136780
EDR: 20138B60 ERR: 20138B60 RSR: 20083E30 CHC: A4FCFCFD CMC: 20B8FEF0
Diagnostic Registers:
RDR0: A4FCFCFD RDR1: 20B8FEF0 WDR0: 7F0397B0 WDR1: FF0E020D
Instance Code: 02072201
Left_HSZ50> SHO THIS
Controller:
HSZ50-AX ZG63300556 Firmware V50Z-1, Hardware A01
Not configured for dual-redundancy
Controller misconfigured -- other controller present
SCSI address 7
Time: NOT SET
Host port:
SCSI target(s) (0, 1, 2), Preferred target(s) (0)
TRANSFER_RATE_REQUESTED = 10MHZ
Cache:
Unknown size read cache, version unknown
Cache is FAILED
Host Functionality Mode = D
Cache module failed diagnostic testing of memory controllers
Controllers misconfigured. Type SHOW THIS_CONTROLLER
Left_HSZ50> set failover copy=this
Error 6130: Both caches must be at version 3. This cache is at version 4294967295,
the other cache is at version 0
Right_HSZ50>
Right_HSZ50> set failover copy=this
Error 6130: Both caches must be at version 3. This cache is at version 4294967295,
the other cache is at version 0
Right_HSZ50>
Right_HSZ50>
|
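For anyone comparing dumps like the one in 728.3, here is a minimal Python sketch that pulls the register values out of the 02072201 event and lists the registers that hold identical values. The register names and values are copied from the log above; the parsing pattern and the helper names (`parse_registers`, `duplicates`) are illustrative assumptions, and nothing here says what a shared value means on the HSZ50.

```python
import re

# Register dump copied from the 02072201 event above (wrapped lines rejoined).
DUMP = """\
DSR: 20136830 CSR: 201385C0 DCSR: 20138C40 DER: 20138B60 EAR: 20136780
EDR: 20138B60 ERR: 20138B60 RSR: 20083E30 CHC: A4FCFCFD CMC: 20B8FEF0
RDR0: A4FCFCFD RDR1: 20B8FEF0 WDR0: 7F0397B0 WDR1: FF0E020D
"""

def parse_registers(text):
    """Return {name: int} for every 'NAME: 8-hex-digits' pair in text."""
    pairs = re.findall(r"\b([A-Z]+\d?):\s*([0-9A-Fa-f]{8})\b", text)
    return {name: int(value, 16) for name, value in pairs}

def duplicates(regs):
    """Group register names that hold the same value, in order of first appearance."""
    by_value = {}
    for name, value in regs.items():
        by_value.setdefault(value, []).append(name)
    return [names for names in by_value.values() if len(names) > 1]

regs = parse_registers(DUMP)
for group in duplicates(regs):
    print(" = ".join(group), f"= {regs[group[0]]:08X}")
```

In this dump DER, EDR and ERR share one value, and CHC/CMC repeat RDR0/RDR1. Whether that is significant is not something this thread establishes.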
728.4 | Board broken? | SSDEVO::RMCLEAN | | Mon Feb 24 1997 10:52 | 1 |
| Get a new cache board. All the error messages point to a failed board.
|
728.5 | Modules have been replaced | SWAM1::WOLFE_LE | | Tue Feb 25 1997 07:22 | 7 |
| I have replaced the cache module with a new module once, with the same results.
Currently the cache board in the right side (the failing side) is from the
left side, where it ran error-free. The cache module that always failed in
the right side is running in the left side error-free. The BA350-MA box has
been replaced, along with the two 32 MB SIMMs.
-Lee
|
728.6 | Try replacing the controller | SSDEVO::RMCLEAN | | Tue Feb 25 1997 10:29 | 5 |
| After looking at your error log and discussing this with a hardware
engineer, I would suggest that you try replacing the controller itself.
The error log points to controller memory and not cache memory. The
problem has to be related to the interface between the cache and the
controller. It is very confused, because it says it is only a read cache.
|
728.7 | The Controller was blasted | SWAM1::WOLFE_LE | | Tue Feb 25 1997 21:41 | 11 |
| That's a good idea, so I tried it, with the same results. If I move the
failing SET (both cache and controller) to the left side, it runs
error-free. I then moved the error-free left SET to the right-hand
side, and it reported cache version mismatches.
Could it be that it's ID 7 (right side) that has responsibility for some
minor testing of both sides if populated, and the fact there is a bad
module in ID 6 (left side) would cause an error??
-Thanks for your efforts
Lee
|
728.8 | Try this to isolate the problem | SSDEVO::FAVA | 4 Yrs of Eng Sch & Never Saw a Train | Wed Feb 26 1997 16:33 | 42 |
| Your problem is certainly perplexing but it must be some kind of
module problem. I would suggest the following procedure to try to
isolate the bad module. Note that some of these you have already
tried, but I couldn't determine exactly which ones from the earlier
replies.
Use this notation to try different module combinations and note the
results of each:
KA = Controller A
CA = Cache A
KB = Controller B
CB = Cache B
-- = Empty slot
The following combinations assume your "left" and "right" notation
from earlier replies.
1) CA KA -- --
2) -- -- CA KA
3) CB KB -- --
4) -- -- CB KB
5) CB KA -- --
6) -- -- CB KA
7) CA KB -- --
8) -- -- CA KB
If these 8 combinations all work (which from your earlier description
I suspect they will), then the problem is most likely a cache module.
To isolate which one, try the following combinations:
9) CA KA CB KB
10) CB KB CA KA
11) CB KA CA KB
12) CA KB CB KA
I am guessing at this point that one controller will complain
about its partner's cache, but not about its own. Let us know
the results and we'll go from there.
Tom Fava
|
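The twelve layouts in 728.8 can also be generated rather than typed out, which helps when keeping a checklist of results. This is a minimal Python sketch using the CA/CB/KA/KB notation from that reply; the function names are hypothetical, and the layouts come out in a different order than Tom's numbered list, although the set of layouts is the same.

```python
from itertools import product

CACHES = ("CA", "CB")
CONTROLLERS = ("KA", "KB")
EMPTY = ("--", "--")

def single_set_layouts():
    """Every cache/controller pairing, alone in the left slots, then alone in the right."""
    layouts = []
    for cache, ctrl in product(CACHES, CONTROLLERS):
        layouts.append((cache, ctrl) + EMPTY)   # set in the left slots
        layouts.append(EMPTY + (cache, ctrl))   # same set in the right slots
    return layouts

def dual_set_layouts():
    """Every way to populate both sides using all four modules."""
    layouts = []
    for cache, ctrl in product(CACHES, CONTROLLERS):
        other_cache = CACHES[1 - CACHES.index(cache)]
        other_ctrl = CONTROLLERS[1 - CONTROLLERS.index(ctrl)]
        layouts.append((cache, ctrl, other_cache, other_ctrl))
    return layouts

for n, layout in enumerate(single_set_layouts() + dual_set_layouts(), start=1):
    print(f"{n:2d}) " + " ".join(layout))
```

Recording pass/fail next to each printed line gives the table of results the procedure asks for.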
728.9 | Software looks like a hardware error | SSDEVO::RMCLEAN | | Wed Feb 26 1997 16:44 | 10 |
| Thanks Tom for the hardware view.
From the software view you are getting a DRAB error. The diagnostics are
failing the cache tests, which is why you are being told that the version of
the cache is 0. I have no idea from looking at this what is failing in the
diagnostics, but it is not a fatal error, since the controller thinks it can run
without the cache (which it can, as long as you don't want dual redundancy).
When the software gets a diagnostic error on a DRAB, it doesn't try to
determine which version of the cache is installed.
|
728.10 | Error message is wrong | SSDEVO::RMCLEAN | | Wed Feb 26 1997 16:53 | 3 |
| I also noted that the error message reported is not correct. Cache version
0 is correct but the large number is wrong. I have reported this problem and
hopefully it will be fixed in a future release of the software.
|
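A note on the large number in Error 6130: 4294967295 is the largest unsigned 32-bit value, which is what -1 looks like when printed as unsigned. One plausible reading (an assumption, not confirmed anywhere in this thread) is that the firmware holds an all-ones or -1 "unknown" marker and formats it as an unsigned integer. The arithmetic can be checked with a short Python sketch:

```python
import struct

reported = 4294967295  # version number printed by Error 6130

# The value is the largest unsigned 32-bit integer, i.e. all 32 bits set.
assert reported == 2**32 - 1 == 0xFFFFFFFF

# Reinterpreting the same four bytes as a signed 32-bit integer gives -1.
(as_signed,) = struct.unpack("<i", struct.pack("<I", reported))
print(hex(reported), as_signed)  # prints: 0xffffffff -1
```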
728.11 | Will take a stab on 9th | SWAM1::WOLFE_LE | | Sat Mar 01 1997 01:08 | 4 |
| The customer wishes to schedule the downtime to fix this problem on the
9th of March. I will keep everyone updated.
-Lee
|
728.12 | Waiting........... | SWAM1::WOLFE_LE | | Tue Mar 11 1997 02:32 | 4 |
| Still waiting for the customer to schedule downtime. I will report
back.
-Lee
|
728.13 | <Show me the CACHE> | SWAM1::WOLFE_LE | | Sun Mar 23 1997 23:37 | 19 |
| The customer scheduled down time, and I brought two controllers, two
32meg simms and 1 cache module. After replacing both the cache module
and two 32 meg simms, the board would pass intermittently for the first
5 restarts. Then the right controller (ID 7) would generate cache
failure errors.
All modules were removed and placed in the left slots (ID 6) and
tested with selftest this. Both sets passed. Both sets were then placed in
the right-hand side (single controller mode) and tested. With no
module mixing, and including a new controller and cache, ALL module sets
failed in the right-hand side (ID 7). The cache batteries and
PCMCIA program cards were also rotated.
Since the BA350-MA has been replaced, could this be a rev or part number
issue? Is the BA350-MA the correct controller box to use? TIMA QRL
shows this as the replacement part number. This was another Avnet erector
set. All other HSZ50's installed in the area have been operational.
-Lee
|
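The reasoning in 728.13 (every module set passes on the left and fails on the right, so the modules are not the common factor) can be written down as a small elimination step. This is a simplified Python sketch: the labels `set1`, `set2`, `left` and `right` are hypothetical stand-ins for the module sets and slot positions, and it assumes a single, repeatable fault, which glosses over the intermittent passes seen in the first few restarts.

```python
def suspects(observations):
    """Return factors present in every failing run and absent from every passing run.

    observations is a list of (factors, passed) pairs, where factors is a set of
    labels for the modules and slot position used in that run.
    """
    failing = [set(factors) for factors, passed in observations if not passed]
    passing = [set(factors) for factors, passed in observations if passed]
    if not failing:
        return set()
    common = set.intersection(*failing)
    for factors in passing:
        common -= factors
    return common

# Simplified version of the results reported in 728.13.
observations = [
    ({"set1", "left"}, True),
    ({"set2", "left"}, True),
    ({"set1", "right"}, False),
    ({"set2", "right"}, False),
]
print(suspects(observations))  # prints: {'right'}
```

With these inputs only the right-hand position survives the elimination, which points at the shelf or backplane rather than at any controller or cache module.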
728.14 | Need 16 Bits | SWAM1::WOLFE_LE | | Tue Mar 25 1997 17:27 | 10 |
| Just to update anyone monitoring this notes file, the wrong
controller shelf has been in use since the install. Avnet sent an
unassembled SW800 erector set, and included a BA350-MA rev A02
controller shelf. The part number is 70-29760-03. This is an 8-bit
shelf. The correct part number to use is 70-31489-01, a BA356 16-bit
controller shelf. Any newer BA350-MB shelf with a power supply would be
a BA356 rev D01 shelf, and would function normally.
-Lee
|
728.13 | <The need for double speed> | SWAM1::WOLFE_LE | | Sat Mar 29 1997 23:28 | 12 |
| The customer brought down the HSZ50 today, and the BA350-MA was
replaced. After visiting several other HSZ50 sites, I found that all
of them had a BA350-MA rev B01 with double-speed fans. The BA350-MA
delivered with the Avnet erector set SW800 was rev A02. The replacement
appears to have fixed the problem. After numerous warm and cold restarts,
there are no cache version 0 errors on the slot 7 cache. The backplane part
numbers were identical between the A02 and B01. The only difference I
could detect was the double-speed fans. It is possible that the
controller is meant to monitor these fans, and on the rev A02 was
instead presented with a different or floating signal????
-Lee
|