T.R | Title | User | Personal Name | Date | Lines |
---|
1215.1 | No functional difference between B01 and B02 54-24721-01 | PROXY::JEAN | MAUREEN JEAN | Mon Jun 02 1997 17:08 | 20 |
|
The difference between B01 and B02 is a part number change
to the map RAMs that eliminated a specific RAM vendor
from the QVL. The Toshiba SRAM is not to be used
on the MB.
As for the failures: this test is a DMA loopback test
from the PCI to system memory. The errors in the HPC0
error register indicate that a CSR overrun occurred,
as well as a non-existent PCI address error.
Are both B01's failing with the same exact error?
If so, is there any way I can get a hold of one of these
modules? I can be reached at DTN 223-6348.
Thanks,
Maureen Jean
RSE Tlaser I/O support
|
1215.2 | will try to supply the faulty boards | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Tue Jun 03 1997 01:57 | 13 |
| Many thanks, Maureen. Yes, the engineer told me that the two B01
boards fail with exactly the same error. These two boards are in
transit from Melbourne to me (in Sydney) so I can test them in the
Sydney CSC's 8400. If I confirm the faults I can arrange to send the
modules to you, although that will take a little while.
In the meantime, we have another module on order and the Australia Post
8400 is running with the original and intermittently faulty board
installed. That board has caused another couple of crashes since
yesterday, but fortunately the 8400 hadn't been formally accepted at
the time the problems started. Even so, if it wasn't for the fact that
the CSC's 8400 has a DWLPA instead of a DWLPB we'd have given Australia
Post the board from that to try to improve our relations with them.
|
1215.3 | | PROXY::JEAN | MAUREEN JEAN | Tue Jun 03 1997 11:43 | 5 |
|
What was the reason for the crash on the first
DWLPB motherboard?
Maureen
|
1215.4 | original motherboard's error | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Wed Jun 04 1997 02:34 | 29 |
| I don't have soft-copy of the original DWLPB errors, only a fax.
However, here's the most significant part of a typical error entry:
MRETRY1                    x00400000
ERR 1                      x00000041  ERROR SUMMARY
                                      DMA READ RETURN DATA PARITY/LENGTH ERR
FADR 1                     x02033440  DMA Read from Memory
IMask PCI Interrupt Mask   x01031001  Slot 0 - Interrupt A Enable
                                      Slot 3 - Interrupt A Enable
DIAG 1                     x00000008  Generate Correct parity
                                      HPC Gate Array Revision = 0
RM Down Hose Translate Ad  x00000000
IPEND 1                    x00000000
IPROG 1                    x0000000C  Interrupt Source Slot 3 INTA
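For anyone comparing entries like this by hand, here is a minimal Python sketch that lists which bit positions are set in each register value quoted above. It assumes only that the values are 32-bit hex numbers written with a leading "x", as in the fax. The helper names (parse_register, set_bits) are made up for illustration, and the sketch deliberately assigns no meaning to any bit, since the bit definitions are not given in this note.

```python
# Register values copied from the error entry above.
# The helper names below are hypothetical, not part of any DEC tool.
REGISTERS = {
    "MRETRY1": "x00400000",
    "ERR 1": "x00000041",
    "FADR 1": "x02033440",
    "IMask": "x01031001",
    "DIAG 1": "x00000008",
    "IPROG 1": "x0000000C",
}


def parse_register(text):
    """Convert a value written like 'x00000041' into an integer."""
    return int(text.lstrip("xX"), 16)


def set_bits(value, width=32):
    """Return the positions (bit 0 = least significant) of all set bits."""
    return [bit for bit in range(width) if value & (1 << bit)]


if __name__ == "__main__":
    for name, text in REGISTERS.items():
        bits = set_bits(parse_register(text))
        print(f"{name:<8} {text}  bits set: {bits}")
```

Running this on two faxed entries and comparing the bit lists is a quick way to check whether two boards really fail with exactly the same error.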
These errors only occur under heavy I/O load. This motherboard never
fails self-test.
A third replacement motherboard failed exactly the same way as the
first two, so we're ordering a complete replacement DWLPB while we try
to figure out what's going on.
I've just received the first two replacement motherboards, which I'm
going to test in the Sydney CSC's 8400.
-Bret
PS:
Maureen, have you been getting the mail I sent you at PROXY::JEAN?
|
1215.5 | Seen it before (unfortunately). | IJSAPL::RIETKERK | Bart Rietkerk-Hoogeveen-Holland | Wed Jun 04 1997 04:49 | 50 |
|
Goodday, downunder.....
We recently had a horror story on a 12 CPU 440 MHz TLASER with
the same errorlog entry as you entered in .4. I don't have any
revisions at hand, so FWIW.
April 25  Middle 48V regulator has got its amber LED on. Both
          the other regulators appear to be OK. System running
          fine. After a complete power down the middle regulator
          comes back normal. Replaced it anyway as a precaution.
May 22    System crashes with a DMA READ RETURN DATA PARITY/
          LENGTH ERROR on PCI-box #3. DWLPB motherboard replaced.
May 26    System crashes 3 times, among other funnies: PCIA MAP
          RAM PARITY ERROR on PCI box #3 (!) After replacing the
          hose cable (and a power cycle of course) 2 out of 4
          PCI boxes (0 and 3) fail their selftest. Solidly blown.
          Had to replace 2 DWLPB motherboards, and also replaced
          the TIOP module (1 of the common factors). System running
          fine again.
June 3    Replaced middle 48V power regulator again just as a
          precaution, because it has been swapped into the system
          recently, and before the troubles started.
The above gives the bare facts. Now for the gutfeelings: DWLPA/DWLPB
is JUNK (!) Sorry to be so blunt, but 8400's are fine and problem free
machines, except for the PCI boxes. 1) Construction s*cks. Apart from
the generally known hints, kinks and blitzes: all those y-cables
hanging off badly bent PCI (KZPSA) modules will give trouble sooner
or later. 2) Electronically I don't trust them. Apart from the story
above I've seen other intermittent problems, and even 2 motherboards
with components gone up in smoke (1 of them just during switch-on after
installing a brand new machine). 3) Power is no good. Apart from
the funny problems you get because of the mounting of the piggy-back
power board in the PCI box I suspect you will end up having problems
on any 8400 after switching off and on often enough.
I am sorry to say all this, but it is my honest opinion. I had a chance
to configure a Compaq ProLiant a month or so ago (everybody is going
Bill G.'s way these days), and I think our PCI box designers can learn
a lot from at least the PCI construction of this machine. We (DIGITAL)
should be ashamed about the TLASER PCI implementation!
(gutfeelings back to normal)
I hope the Aussies can use my info to solve the case, and they stay
ahead of further problems. My guess would be marginal power in the
troubled PCI-box.
Cheers, Bart Rietkerk (looking after 7 Tlasers among other things).
|
1215.6 | the plot thickens... | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Wed Jun 04 1997 10:46 | 46 |
| G'day, Bart. I'm actually reasonably happy with the quality and
reliability of the electronics in _factory integrated_ Turbolaser
systems (although the stories I hear about failure rates during FA&T
are disturbing). But add-in option quality isn't what I'd like and the
reliability of Turbolaser modules in MCS spares inventory is
unacceptable. This is because the options and spares weren't (still
aren't?) getting proper burn-in testing.
However, 48V power supply problems on 8400s aren't really a Turbolaser
issue. The 48V supplies are carry-overs from the VAX/DEC7000s, and the
ones I've seen (415V 50Hz 3-phase) have been buy-ins. Anyway, I've
found them to be quite reliable.
I mostly agree with you about the Turbolaser mechanical/packaging
aspect - it's pretty vile, particularly on the 8200, and a big
reliability and maintainability issue. The CSS rack-mount Turbolasers
can be _real_ horrors when it comes to maintainability.
But I'm digressing...
We ordered and installed a whole new DWLPB shelf; it passed self-test
and DUNIX booted OK. It got through a complete LSM copy operation with
no errors, which is much further than the original DWLPB ever got.
NOTE: The new DWLPB has the old style -02 metalwork, a -01 variant power
board and a rev B01 motherboard. The failing DWLPB has the
following parts:
PCI metalwork: 70-31092-03 Rev B01
PCI motherboard : 54-24721-01 Rev B02
Power board: 54-23470-02 Rev B01
The two rev B01 motherboard spares from Melbourne reached me today and
both worked OK when I installed them in the CSC 8400's DWLPA shelf.
However, that shelf has different metalwork and power board (i.e. a -01
instead of an -02) to the original DWLPB shelf in the Australia Post
8400.
The complete Aust. Post shelf will be sent to me for further testing
and analysis. It seems likely that there is something about this shelf
causing a hose cable mating problem. The problem manifests when the
rev B01 motherboards from our spares stock are installed, although the
rev level may not have anything to do with it. FWIW, we've been
following the procedure in Blitz TD-2153 when installing motherboards.
I have also asked the site engineer to get Aust. Post's hose cable part
numbers and rev levels for me tomorrow, just in case they have some
relevance to the problem.
|
1215.7 | agree...how about the piggy-back p/s? | IJSAPL::RIETKERK | Bart Rietkerk-Hoogeveen-Holland | Wed Jun 04 1997 11:30 | 24 |
|
Hi Bret,
First, I know about the 48 V regulators. I don't think that is where
the problem is either. I've got 2 XMI based 8400 over here, and about
20 7000's (AXP and VAX)- hardly any problems with the 48V regulators.
What I suspect more is the quality (mostly during switching) of the
PCI-box piggy-back power boards. The 48V regulator swaps I've done
on the troubled system over here were "just in case" swaps. I can't
explain solid failure of 2 pci boxes at the same moment over here.
There has to be a common factor (or was it just bad luck???)
Has the engineer-on-site that looks after the system where the 3 spare
DWLPB motherboards have failed swapped the PCI-box piggy-back P/S?
If that one is marginal it could explain 1) failure because of
rev-level difference (marginal change in power consumption?) 2) the
fact that 2 of those modules run fine in your system. 3) the fact that
the original motherboard jumps out only during heavy load.
Just guessing....
Good luck!
Bart Rietkerk.
|
1215.8 | Interesting Failure!! :-) | MASS10::geraldo.reo.dec.com::ConnollyG | [email protected] | Wed Jun 04 1997 15:11 | 2 |
| > motherboard in the DWLPB. The replacement board failed with a hard-on
|
1215.9 | I think it's hose connection related | GIDDAY::HIRSHMAN | Hugged your Webmeister today? | Thu Jun 05 1997 01:44 | 25 |
| Bart,
Yes, I think the engineer did swap the power board with the one from
the 8400's other (working) DWLPB shelf. He also swapped over the hose
cable from the other shelf.
Actually, almost from the beginning the errors looked to me like they
were probably due to hose connection problems. The errors on the
original motherboard were DMA READ RETURN DATA PARITY/LENGTH errors,
which can be caused by a poor hose connection. The self-test errors
were CSR overrun errors which also can be caused by a poor hose
connection, although I've never seen a self-test failure for this
reason before.
I just didn't know whether it was a board-to-box revision related
mechanical incompatibility problem or simply faulty DWLPB metalwork.
I'm now fairly sure that it's the latter, but I still don't know
whether it's a one-off or an instance of a wider problem.
I'll know more when I receive the complete faulty DWLPB. If I find a
hose connection problem and it looks like it might be manufacturing
process related, I'll arrange to have the whole DWLPB shipped to
Maureen Jean of RSE for further analysis.
-Bret
|