
Conference wonder::turbolaser

Title:	TurboLaser Notesfile - AlphaServer 8200 and 8400 systems
Notice:	Welcome to WONDER::TURBOLASER in its new home shortly
Moderator:	LANDO::DROBNER

Created:	Tue Dec 20 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1218
Total number of notes:	4645

1215.0. "DWLPB revision compatibility problem?" by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Mon Jun 02 1997 06:32

    I am assisting an engineer doing an 8400 installation at Australia
    Post.  An intermittent fault was found on one of the two DWLPB boxes,
    and the engineer attempted to replace the 54-24721-01 rev B02
    motherboard in the DWLPB.  The replacement board failed with a hard-on
    power-up self test error, so another replacement was obtained.  The 2nd
    54-24721-01 gave exactly the same test failure as the first!  Both of
    the replacement boards were rev B01.

    Australia Post already have an IPMT outage open for low MTBF on another
    8400 and for high infant mortality of replacements, so even though this
    is "just" an install this problem is making us look VERY bad!
    
    The engineer cycled through the original and two replacement boards
    again, confirming that the original motherboard works but the two
    replacements fail with the same error.  (FWIW, using a different KFTHA
    hose port made no difference to the symptoms.)

    Note 1208 implies that there may be a minimum rev of 54-24721-01 (B02??)
    that must be used with rev B02 DWLPB-Bx boxes (also see Blitz TD 2284),
    but this is by no means certain - can anyone clarify this?

    The console "el" log error info for the two failing replacement boards
    follows.  Can someone tell me what this means, please?
    
04:29.18 Executing hpc_diag on device pci0
04:29.18 Executing hpc_diag on device pci1
04:29.23
04:29.23 *** Hard Error - Error #4 - Data compare error
04:29.23
04:29.23 Diagnostic Name        ID             Device  Pass  Test  Hard/Soft 1-JUN-2045
04:29.23 hpc_diag         00000012               pci1     1    12     1    0
04:29:23
04:29.23 Expected value:                       10
04:29.23 Received value:                 ffffffef
04:29.23 Failing addr:             820020
04:29.23
04:29.23 hpc error register 0 :  4009
04:29.23
04:29.23 hpc Failing address register 0 :   800100
04:29.23
04:29.23 wmask_a0 :   7f0000    wmask_b0 :   7f0000     wmask_c0 :   7f0000
04:29.23 wbase_a0 :   800002    wbase_b0 :        0     wbase_c0 :        0
04:29.23 tbase_a0 :        2    tbase_b0 :        2     tbase_c0 :        2
04:29.23
04:29.23 hpc error register 1 :        9
04:29.23
04:29.23 hpc Failing address register 1 :   800200
04:29.23
04:29.23 wmask_a1 :   7f0000    wmask_b1 :   7f0000     wmask_c1 :   7f0000
04:29.23 wbase_a1 :        0    wbase_b1 :        0     wbase_c1 :        0
04:29.23 tbase_a1 :        2    tbase_b1 :        2     tbase_c1 :        2
04:29.23
04:29.23 hpc error register 2 :        9
04:29.23
04:29.23 hpc Failing address register 2 :   800300
04:29.23
04:29.23 wmask_a2 :   7f0000    wmask_b2 :   7f0000     wmask_c2 :   7f0000
04:29.23 wbase_a2 :        0    wbase_b2 :        0     wbase_c2 :        0
04:29.23 tbase_a2 :        2    tbase_b2 :        2     tbase_c2 :        2
04:29.23
04:29.23 *** End of Error ***
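
As a first pass over a log like the one above, it can help to pull out the hex fields and compare them mechanically. The following is a minimal sketch, not a DIGITAL tool: the helper names (`parse_hex_fields`, `bits_set`) are made up for illustration, and the 32-bit width is an assumption based on the eight-digit received value. It does not attempt to name any register bits, since their definitions are not given in this note.

```python
import re

# Lines copied from the console log above (only the fields used here).
LOG = """\
04:29.23 Expected value:                       10
04:29.23 Received value:                 ffffffef
04:29.23 hpc error register 0 :  4009
04:29.23 hpc error register 1 :        9
04:29.23 hpc error register 2 :        9
"""

# Timestamp, then "label : hexvalue". The label text is whatever the log prints.
LINE = re.compile(r"^\d\d:\d\d[.:]\d\d\s+(.*?)\s*:\s*([0-9a-fA-F]+)\s*$")

WIDTH_MASK = 0xFFFFFFFF  # assumption: the compared data word is 32 bits wide


def parse_hex_fields(text):
    """Return {label: int} for every 'label : hexvalue' line in the text."""
    fields = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields


def bits_set(value):
    """List the positions (0 = least significant) of the bits set in value."""
    return [n for n in range(value.bit_length()) if (value >> n) & 1]


fields = parse_hex_fields(LOG)
expected = fields["Expected value"]
received = fields["Received value"]

print(f"differing bits: {expected ^ received:08x}")
print("received is bitwise complement of expected:",
      received == (~expected & WIDTH_MASK))
for n in range(3):
    reg = fields[f"hpc error register {n}"]
    print(f"hpc error register {n}: {reg:#x} -> bits {bits_set(reg)}")
```

Run against these values, the received word turns out to be the exact 32-bit complement of the expected word, and error register 0 differs from registers 1 and 2 only by one extra bit (bit 14). What those bits mean has to come from the HPC register documentation.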

T.R	Title	User	Personal Name	Date	Lines
1215.1	No functional difference between B01 and B02 54-24721-01	PROXY::JEAN	MAUREEN JEAN	`Mon Jun 02 1997 16:08`	20
	The difference between B01 and B02 is that there was a part number change to the map RAMs that eliminated a specific RAM vendor from the QVL. The Toshiba SRAM is not to be used on the MB.

	As for the failures: this test is a DMA loopback test from the PCI to the system memory. The errors in the HPC0 error register indicate that a CSR overrun occurred as well as a non-existent PCI address error. Are both B01s failing with the same exact error? If so, is there any way I can get hold of one of these modules? I can be reached at DTN 223-6348.

	Thanks, Maureen Jean, RSE Tlaser I/O support
1215.2	will try to supply the faulty boards	GIDDAY::HIRSHMAN	Hugged your Webmeister today?	`Tue Jun 03 1997 00:57`	13
	Many thanks, Maureen. Yes, the engineer told me that the two B01 boards fail with exactly the same error. These two boards are in transit from Melbourne to me (in Sydney) so I can test them in the Sydney CSC's 8400. If I confirm the faults I can arrange to send the modules to you, although that will take a little while.

	In the meantime, we have another module on order and the Australia Post 8400 is running with the original and intermittently faulty board installed. That board has caused another couple of crashes since yesterday, but fortunately the 8400 hadn't been formally accepted at the time the problems started. Even so, if it wasn't for the fact that the CSC's 8400 has a DWLPA instead of a DWLPB we'd have given Australia Post the board from that to try to improve our relations with them.
1215.3		PROXY::JEAN	MAUREEN JEAN	`Tue Jun 03 1997 10:43`	5
	What was the reason for the crash on the first DWLPB motherboard? Maureen
1215.4	original motherboard's error	GIDDAY::HIRSHMAN	Hugged your Webmeister today?	`Wed Jun 04 1997 01:34`	29
	I don't have soft-copy of the original DWLPB errors, only a fax. However, here's the most significant part of a typical error entry:

	MRETRY1   x00400000
	ERR 1     x00000041   ERROR SUMMARY
	                      DMA READ RETURN DATA PARITY/LENGTH ERR
	FADR 1    x02033440   DMA Read from Memory
	IMask     PCI Interrupt Mask   x01031001
	                      Slot 0 - Interrupt A Enable
	                      Slot 3 - Interrupt A Enable
	DIAG 1    x00000008   Generate Correct parity
	                      HPC Gate Array Revision = 0
	RM        Down Hose Translate Ad   x00000000
	IPEND 1   x00000000
	IPROG 1   x0000000C   Interrupt Source Slot 3 INTA

	These errors only occur under heavy I/O load. This motherboard never fails self-test.

	A third replacement motherboard failed exactly the same way as the first two, so we're ordering a complete replacement DWLPB while we try to figure out what's going on. I've just received the first two replacement motherboards, which I'm going to test in the Sydney CSC's 8400.

	-Bret

	PS: Maureen, have you been getting the mail I sent you at PROXY::JEAN?
1215.5	Seen it before (unfortunately).	IJSAPL::RIETKERK	Bart Rietkerk-Hoogeveen-Holland	`Wed Jun 04 1997 03:49`	50
	Goodday, downunder.....

	We recently had a horror story on a 12 CPU 440 MHz TLASER with the same errorlog entry as you entered in .4. I don't have any revisions at hand, so FWIW.

	April 25: Middle 48V regulator has got its amber LED on. Both the other regulators appear to be ok. System running fine. After a complete power down the middle regulator comes back normal. Replaced it anyway as a precaution.
	May 22: System crashes with a DMA READ RETURN DATA PARITY/LENGTH ERROR on PCI box #3. DWLPB motherboard replaced.
	May 26: System crashes 3 times, among other funnies: PCIA MAP RAM PARITY ERROR on PCI box #3 (!). After replacing the hose cable (and a power cycle of course) 2 out of 4 PCI boxes (0 and 3) fail their selftest. Solidly blown. Had to replace 2 DWLPB motherboards, and also replaced the TIOP module (1 of the common factors). System running fine again.
	June 3: Replaced middle 48V power regulator again just as a precaution, because it had been swapped into the system recently, and before the troubles started.

	The above gives the bare facts. Now for the gut feelings:

	DWLPA/DWLPB is JUNK (!) Sorry to be so blunt, but 8400s are fine and problem-free machines, except for the PCI boxes.

	1) Construction s*cks. Apart from the generally known hints, kinks and blitzes: all those y-cables hanging off badly bent PCI (KZPSA) modules will give trouble sooner or later.
	2) Electronically I don't trust them. Apart from the story above I've seen other intermittent problems, and even 2 motherboards with components gone up in smoke (1 of them just during switch-on after installing a brand new machine).
	3) Power is no good. Apart from the funny problems you get because of the mounting of the piggy-back power board in the PCI box, I suspect you will end up having problems on any 8400 after switching off and on often enough.

	I am sorry to say all this, but it is my honest opinion. I had a chance to configure a Compaq ProLiant a month ago or so (everybody is going Bill G.'s way these days), and I think our PCI box designers can learn a lot from at least the PCI construction of this machine. We (DIGITAL) should be ashamed about the TLASER PCI implementation!

	(gut feelings back to normal)

	I hope the Aussies can use my info to solve the case, and they stay ahead of further problems. My guess would be marginal power in the troubled PCI box.

	Cheers, Bart Rietkerk (looking after 7 Tlasers among other things).
1215.6	the plot thickens...	GIDDAY::HIRSHMAN	Hugged your Webmeister today?	`Wed Jun 04 1997 09:46`	46
	G'day, Bart.

	I'm actually reasonably happy with the quality and reliability of the electronics in _factory integrated_ Turbolaser systems (although the stories I hear about failure rates during FA&T are disturbing). But add-in option quality isn't what I'd like and the reliability of Turbolaser modules in MCS spares inventory is unacceptable. This is because the options and spares weren't (still aren't?) getting proper burn-in testing.

	However, 48V power supply problems on 8400s aren't really a Turbolaser issue. The 48V supplies are carry-overs from the VAX/DEC7000s, and the ones I've seen (415V 50Hz 3-phase) have been buy-ins. Anyway, I've found them to be quite reliable.

	I mostly agree with you about the Turbolaser mechanical/packaging aspect - it's pretty vile, particularly on the 8200, and a big reliability and maintainability issue. The CSS rack-mount Turbolasers can be _real_ horrors when it comes to maintainability. But I'm digressing...

	We ordered and installed a whole new DWLPB shelf; it passed self-test and DUNIX booted OK. It got through a complete LSM copy operation with no errors, which is much further than the original DWLPB ever got.

	NOTE: The new DWLPB has the old style -02 metalwork, a -01 variant power board and a rev B01 motherboard. The failing DWLPB has the following parts:

	PCI metalwork:    70-31092-03 Rev B01
	PCI motherboard:  54-24721-01 Rev B02
	Power board:      54-23470-02 Rev B01

	The two rev B01 motherboard spares from Melbourne reached me today and both worked OK when I installed them in the CSC 8400's DWLPA shelf. However, that shelf has different metalwork and power board (i.e. a -01 instead of an -02) to the original DWLPB shelf in the Australia Post 8400.

	The complete Aust. Post shelf will be sent to me for further testing and analysis. It seems likely that there is something about this shelf causing a hose cable mating problem. The problem manifests when the rev B01 motherboards from our spares stock are installed, although the rev level may not have anything to do with it. FWIW, we've been following the procedure in Blitz TD-2153 when installing motherboards.

	I have also asked the site engineer to get Aust. Post's hose cable part numbers and rev levels for me tomorrow, just in case they have some relevance to the problem.
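
The swap results reported in .0, .4 and .6 can be tabulated to see which factor tracks the failures. The sketch below is only an illustration of that reasoning: the labels ("spare 1", "load errors" and so on) are shorthand invented here, the third spare's revision is not stated in the thread, and the data is limited to what the notes report.

```python
from collections import defaultdict

# (shelf, motherboard, outcome) as reported in notes .0, .4 and .6.
# Labels are shorthand for this sketch, not official part names.
TRIALS = [
    ("Aust. Post DWLPB", "original B02", "load errors"),
    ("Aust. Post DWLPB", "spare 1 (B01)", "self-test fail"),
    ("Aust. Post DWLPB", "spare 2 (B01)", "self-test fail"),
    ("Aust. Post DWLPB", "spare 3", "self-test fail"),  # revision not stated
    ("CSC DWLPA", "spare 1 (B01)", "ok"),
    ("CSC DWLPA", "spare 2 (B01)", "ok"),
    ("new DWLPB", "new shelf's B01", "ok"),
]


def outcomes_by(index):
    """Group trials by shelf (index 0) or board (index 1).

    Each group maps to the set of booleans seen for 'trial was ok'.
    """
    groups = defaultdict(set)
    for trial in TRIALS:
        groups[trial[index]].add(trial[2] == "ok")
    return dict(groups)


def mixed(groups):
    """Return the keys that have both good and bad trials."""
    return sorted(key for key, seen in groups.items() if seen == {True, False})


by_shelf = outcomes_by(0)
by_board = outcomes_by(1)

print("shelves with mixed results:", mixed(by_shelf))
print("boards with mixed results: ", mixed(by_board))
```

In this small data set no shelf has mixed results, while spares 1 and 2 work in one shelf and fail in another, which is consistent with the suspicion that the Aust. Post shelf rather than the board revision is the common factor. Seven observations cannot rule out an interaction between board and shelf.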
1215.7	agree...how about the piggy-back p/s?	IJSAPL::RIETKERK	Bart Rietkerk-Hoogeveen-Holland	`Wed Jun 04 1997 10:30`	24
	Hi Bret,

	First, I know about the 48V regulators. I don't think that is where the problem is either. I've got 2 XMI-based 8400s over here, and about 20 7000s (AXP and VAX) - hardly any problems with the 48V regulators. What I suspect more is the quality (mostly during switching) of the PCI-box piggy-back power boards. The 48V regulator swaps I've done on the troubled system over here were "just in case" swaps.

	I can't explain solid failure of 2 PCI boxes at the same moment over here. There has to be a common factor (or was it just bad luck???)

	Has the engineer on site that looks after the system where the 3 spare DWLPB motherboards have failed swapped the PCI-box piggy-back P/S? If that one is marginal it could explain:

	1) failure because of rev-level difference (marginal change in power consumption?)
	2) the fact that 2 of those modules run fine in your system.
	3) the fact that the original motherboard jumps out only during heavy load.

	Just guessing.... Good luck!

	Bart Rietkerk.
1215.8	Interesting Failure!! :-)	MASS10::geraldo.reo.dec.com::ConnollyG	[email protected]	`Wed Jun 04 1997 14:11`	2
	> motherboard in the DWLPB. The replacement board failed with a hard-on
1215.9	I think it's hose connection related	GIDDAY::HIRSHMAN	Hugged your Webmeister today?	`Thu Jun 05 1997 00:44`	25
	Bart,

	Yes, I think the engineer did swap the power board with the one from the 8400's other (working) DWLPB shelf. He also swapped over the hose cable from the other shelf.

	Actually, almost from the beginning the errors looked to me like they were probably due to hose connection problems. The errors on the original motherboard were DMA READ RETURN DATA PARITY/LENGTH errors, which can be caused by a poor hose connection. The self-test errors were CSR overrun errors which also can be caused by a poor hose connection, although I've never seen a self-test failure for this reason before.

	I just didn't know whether it was a board-to-box revision related mechanical incompatibility problem or simply faulty DWLPB metalwork. I'm now fairly sure that it's the latter, but I still don't know whether it's a one-off or an instance of a wider problem. I'll know more when I receive the complete faulty DWLPB.

	If I find a hose connection problem and it looks like it might be manufacturing process related, I'll arrange to have the whole DWLPB shipped to Maureen Jean of RSE for further analysis.

	-Bret