
Conference wonder::turbolaser

Title:TurboLaser Notesfile - AlphaServer 8200 and 8400 systems
Notice:Welcome to WONDER::TURBOLASER in its new home shortly
Moderator:LANDO::DROBNER
Created:Tue Dec 20 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1218
Total number of notes:4645

1215.0. "DWLPB revision compatibility problem?" by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Mon Jun 02 1997 07:32

    I am assisting an engineer doing an 8400 installation at Australia
    Post.  An intermittent fault was found on one of the two DWLPB boxes,
    and the engineer attempted to replace the 54-24721-01 rev B02
    motherboard in the DWLPB.  The replacement board failed with a hard-on
    power-up self test error, so another replacement was obtained.  The 2nd
    54-24721-01 gave exactly the same test failure as the first!  Both of
    the replacement boards were rev B01.

    Australia Post already have an IPMT outage open for low MTBF on another
    8400 and for high infant mortality of replacements, so even though this
    is "just" an install this problem is making us look VERY bad!
    
    The engineer cycled through the original and two replacement boards
    again, confirming that the original motherboard works but the two
    replacements fail with the same error.  (FWIW, using a different KFTHA
    hose port made no difference to the symptoms.)

    Note 1208 implies that there may be a minimum rev of 54-24721-01 (B02??)
    that must be used with rev B02 DWLPB-Bx boxes (also see Blitz TD 2284),
    but this is by no means certain - can anyone clarify this?

    The console "el" log error info for the two failing replacement boards
    follows.  Can someone tell me what this means, please?
    
04:29.18 Executing hpc_diag on device pci0
04:29.18 Executing hpc_diag on device pci1
04:29.23
04:29.23 *** Hard Error - Error #4 - Data compare error
04:29.23
04:29.23 Diagnostic Name        ID             Device  Pass  Test  Hard/Soft 1-JUN-2045
04:29.23 hpc_diag         00000012               pci1     1    12     1    0
04:29:23
04:29.23 Expected value:                       10
04:29.23 Received value:                 ffffffef
04:29.23 Failing addr:             820020
04:29.23
04:29.23 hpc error register 0 :  4009
04:29.23
04:29.23 hpc Failing address register 0 :   800100
04:29.23
04:29.23 wmask_a0 :   7f0000    wmask_b0 :   7f0000     wmask_c0 :   7f0000
04:29.23 wbase_a0 :   800002    wbase_b0 :        0     wbase_c0 :        0
04:29.23 tbase_a0 :        2    tbase_b0 :        2     tbase_c0 :        2
04:29.23
04:29.23 hpc error register 1 :        9
04:29.23
04:29.23 hpc Failing address register 1 :   800200
04:29.23
04:29.23 wmask_a1 :   7f0000    wmask_b1 :   7f0000     wmask_c1 :   7f0000
04:29.23 wbase_a1 :        0    wbase_b1 :        0     wbase_c1 :        0
04:29.23 tbase_a1 :        2    tbase_b1 :        2     tbase_c1 :        2
04:29.23
04:29.23 hpc error register 2 :        9
04:29.23
04:29.23 hpc Failing address register 2 :   800300
04:29.23
04:29.23 wmask_a2 :   7f0000    wmask_b2 :   7f0000     wmask_c2 :   7f0000
04:29.23 wbase_a2 :        0    wbase_b2 :        0     wbase_c2 :        0
04:29.23 tbase_a2 :        2    tbase_b2 :        2     tbase_c2 :        2
04:29.23
04:29.23 *** End of Error ***
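
A quick way to look at the data compare error above is to XOR the expected and received values. The sketch below assumes the console prints both fields in hexadecimal and that the compared quantity is 32 bits wide; neither assumption is confirmed by the log itself, and the helper name `parse_fields` is made up for illustration.

```python
import re

# Three lines copied from the console log above.
LOG = """\
04:29.23 Expected value:                       10
04:29.23 Received value:                 ffffffef
04:29.23 Failing addr:             820020
"""

def parse_fields(log):
    """Return {field name: integer value}, reading each value as hex (assumed)."""
    fields = {}
    for line in log.splitlines():
        # timestamp, field name, colon, hex value
        m = re.match(r"\S+\s+(.+?)\s*:\s*([0-9a-fA-F]+)\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields

fields = parse_fields(LOG)
expected = fields["Expected value"]
received = fields["Received value"]

# Bits that differ between the two values, limited to an assumed 32-bit width.
diff = (expected ^ received) & 0xFFFFFFFF
print(f"differing bits: {diff:08x} ({bin(diff).count('1')} of 32)")
print("received is the complement of expected:",
      received == (~expected & 0xFFFFFFFF))
```

Under those assumptions every one of the 32 bits differs, i.e. the received value is the bitwise complement of the expected one rather than a single-bit error. That is only an observation about the two numbers; it says nothing by itself about the cause.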

1215.1. "No functional difference between B01 and B02 54-24721-01" by PROXY::JEAN (MAUREEN JEAN) Mon Jun 02 1997 17:08

The difference between B01 and B02 is that there was
a part number change to the map rams that eliminated
a specific ram vendor from the QVL.   The Toshiba
sram is not to be used on the MB.   

As for the failures.  This test is a DMA loopback test
from the PCI to the System memory.  The errors in the HPC0
error register indicate that a CSR overrun occurred
as well as a non-existent PCI address error.

Are both B01's failing with the same exact error?
If so, is there any way I can get a hold of one of these
modules?  I can be reached at DTN 223-6348.

Thanks,

Maureen Jean
RSE Tlaser I/O support

1215.2. "will try to supply the faulty boards" by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Tue Jun 03 1997 01:57

    Many thanks, Maureen.  Yes, the engineer told me that the two B01
    boards fail with exactly the same error.  These two boards are in
    transit from Melbourne to me (in Sydney) so I can test them in the
    Sydney CSC's 8400.  If I confirm the faults I can arrange to send the
    modules to you, although that will take a little while.
    
    In the meantime, we have another module on order and the Australia Post
    8400 is running with the original and intermittently faulty board
    installed.  That board has caused another couple of crashes since
    yesterday, but fortunately the 8400 hadn't been formally accepted at
    the time the problems started.  Even so, if it wasn't for the fact that
    the CSC's 8400 has a DWLPA instead of a DWLPB we'd have given Australia
    Post the board from that to try to improve our relations with them.

1215.3. "" by PROXY::JEAN (MAUREEN JEAN) Tue Jun 03 1997 11:43

What was the reason for the crash on the first
DWLPB motherboard?

Maureen

1215.4. "original motherboard's error" by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Wed Jun 04 1997 02:34

    I don't have soft-copy of the original DWLPB errors, only a fax. 
    However, here's the most significant part of a typical error entry:

MRETRY1                   x00400000
ERR 1                     x00000041  ERROR SUMMARY
                                     DMA READ RETURN DATA PARITY/LENGTH ERR
FADR 1                    x02033440  DMA Read from Memory
IMask PCI Interrupt Mask  x01031001  Slot 0 - Interrupt A Enable
                                     Slot 3 - Interrupt A Enable
DIAG 1                    x00000008  Generate Correct parity
                                     HPC Gate Array Revision = 0
                                     RM Down Hose Translate Ad x00000000
IPEND 1                   x00000000
IPROG 1                   x0000000C  Interrupt Source  Slot 3 INTA
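
The two annotated interrupt lines above are consistent with a layout of four interrupt lines (INTA to INTD) per PCI slot, with slot n's INTA at bit 4n. The Python sketch below decodes the values on that assumption only. The layout is inferred from this excerpt, not taken from DWLPB documentation; the function name is hypothetical; the upper bits of the mask (x0103....) are not annotated in the excerpt and are left uninterpreted; and treating the IPROG value as a bit index is a further guess.

```python
INT_NAMES = "ABCD"  # assumed order of the four interrupt lines per slot

def decode_low_mask(mask, slots=4):
    """List (slot, line) pairs for bits set in the low 4*slots bits.

    Assumes 4 bits per slot, slot 0 in the lowest nibble (not confirmed).
    """
    enabled = []
    for bit in range(4 * slots):
        if mask & (1 << bit):
            enabled.append((bit // 4, "INT" + INT_NAMES[bit % 4]))
    return enabled

imask = 0x01031001  # IMask value from the error entry
iprog = 0x0000000C  # IPROG value from the error entry

print(decode_low_mask(imask))
# Treating IPROG as a bit index under the same assumed layout:
print((iprog // 4, "INT" + INT_NAMES[iprog % 4]))
```

With these assumptions the low 16 bits of the mask decode to slot 0 INTA and slot 3 INTA, and the IPROG value to slot 3 INTA, which agrees with the annotations in the entry.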
    
    These errors only occur under heavy I/O load.  This motherboard never
    fails self-test.
    
    A third replacement motherboard failed exactly the same way as the
    first two, so we're ordering a complete replacement DWLPB while we try
    to figure out what's going on.
    
    I've just received the first two replacement motherboards, which I'm
    going to test in the Sydney CSC's 8400.
    
    -Bret
    
    PS:
    Maureen, have you been getting the mail I sent you at PROXY::JEAN?

1215.5. "Seen it before (unfortunately)." by IJSAPL::RIETKERK (Bart Rietkerk-Hoogeveen-Holland) Wed Jun 04 1997 04:49
    
    Goodday, downunder.....
    
    We recently had a horror story on a 12 CPU 440 MHz TLASER with
    the same errorlog entry as you entered in .4. I don't have any
    revisions at hand, so FWIW.
    
    April 25	Middle 48V regulator has got its amber LED on. Both
    		the other regulators appear to be ok. System running
    		fine. After a complete power down the middle regulator
    		comes back normal. Replaced it anyway as a precaution.
    May 22	System crashes with a DMA READ RETURN DATA PARITY/
    		LENGTH ERROR on PCI-box #3. DWLPB Motherboard replaced.
    May 26	System crashes 3 times, among other funnies: PCIA MAP
    		RAM PARITY ERROR on PCI Box #3 (!) After replacing the
    		hose cable (and a power cycle of course) 2 out of 4
    		PCI boxes (0 and 3) fail their selftest. Solidly blown.
    		Had to replace 2 DWLPB motherboards, and also replaced
    		TIOP module (1 of the common factors). System running
    		fine again.
    June 3	Replaced middle 48V power regulator again just as a
    		precaution, because it has been swapped into the system
    		recently, and before the troubles started.
    
    The above gives the bare facts. Now for the gut feelings: DWLPA/DWLPB
    is JUNK (!) Sorry to be so blunt, but 8400's are fine and problem free
    machines, except for the PCI boxes. 1) Construction s*cks. Apart from
    the generally known hints, kinks and blitzes: all those y-cables
    hanging off badly bent PCI (KZPSA) modules will give trouble sooner
    or later. 2) Electronically I don't trust them. Apart from the story
    above I've seen other intermittent problems, and even 2 motherboards
    with components gone up in smoke (1 of them just during switch-on after
    installing a brand new machine). 3) Power is no good. Apart from
    the funny problems you get because of the mounting of the piggy-back
    power board in the PCI box I suspect you will end up having problems
    on any 8400 after switching off and on often enough.
    
    I am sorry to say all this, but it is my honest opinion. I had a chance
    to configure a Compaq ProLiant a month or so ago (everybody is going
    Bill G. 's way these days), and I think our PCI box designers can learn
    a lot from at least the PCI construction of this machine. We (DIGITAL)
    should be ashamed about the TLASER PCI implementation!
    
    (gut feelings back to normal)
    
    	I hope the Aussies can use my info to solve the case, and they stay
    	ahead of further problems. My guess would be marginal power in the
    	troubled PCI-box.
    
    	Cheers, Bart Rietkerk (looking after 7 Tlasers among other things).

1215.6. "the plot thickens..." by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Wed Jun 04 1997 10:46

    G'day, Bart.  I'm actually reasonably happy with the quality and
    reliability of the electronics in _factory integrated_ Turbolaser
    systems (although the stories I hear about failure rates during FA&T
    are disturbing).  But add-in option quality isn't what I'd like and the
    reliability of Turbolaser modules in MCS spares inventory is
    unacceptable.  This is because the options and spares weren't (still
    aren't?) getting proper burn-in testing.
    
    However, 48V power supply problems on 8400s aren't really a Turbolaser
    issue.  The 48V supplies are carry-overs from the VAX/DEC7000s, and the
    ones I've seen (415V 50Hz 3-phase) have been buy-ins.  Anyway, I've
    found them to be quite reliable.
    
    I mostly agree with you about the Turbolaser mechanical/packaging
    aspect - it's pretty vile, particularly on the 8200, and a big
    reliability and maintainability issue.  The CSS rack-mount Turbolasers
    can be _real_ horrors when it comes to maintainability.
    
    But I'm digressing...
    
    We ordered and installed a whole new DWLPB shelf; it passed self-test
    and DUNIX booted OK.  It got through a complete LSM copy operation with
    no errors, which is much further than the original DWLPB ever got.
    NOTE: The new DWLPB has the old style -02 metalwork, a -01 variant power
          board and a rev B01 motherboard.  The failing DWLPB has the
          following parts:
            PCI metalwork: 70-31092-03 Rev B01
            PCI motherboard : 54-24721-01 Rev B02
            Power board: 54-23470-02 Rev B01
    
    The two rev B01 motherboard spares from Melbourne reached me today and
    both worked OK when I installed them in the CSC 8400's DWLPA shelf. 
    However, that shelf has different metalwork and power board (i.e. a -01
    instead of an -02) to the original DWLPB shelf in the Australia Post
    8400.
    
    The complete Aust. Post shelf will be sent to me for further testing
    and analysis.  It seems likely that there is something about this shelf
    causing a hose cable mating problem.  The problem manifests when the
    rev B01 motherboards from our spares stock are installed, although the
    rev level may not have anything to do with it.  FWIW, we've been
    following the procedure in Blitz TD-2153 when installing motherboards.
    
    I have also asked the site engineer to get Aust. Post's hose cable part
    numbers and rev levels for me tomorrow, just in case they have some
    relevance to the problem.

1215.7. "agree...how about the piggy-back p/s?" by IJSAPL::RIETKERK (Bart Rietkerk-Hoogeveen-Holland) Wed Jun 04 1997 11:30
    
    Hi Bret,
    
    First, I know about the 48 V regulators. I don't think that is where
    the problem is either. I've got 2 XMI based 8400 over here, and about
    20 7000's (AXP and VAX)- hardly any problems with the 48V regulators.
    What I suspect more is the quality (mostly during switching) of the
    PCI-box piggy-back power boards. The 48V regulator swaps I've done
    on the troubled system over here were "just in case" swaps. I can't
    explain solid failure of 2 PCI boxes at the same moment over here.
    There has to be a common factor (or was it just bad luck???)
    
    Has the on-site engineer who looks after the system where the 3 spare
    DWLPB motherboards failed swapped the PCI-box piggy-back P/S?
    If that one is marginal it could explain 1) failure because of
    rev-level difference (marginal change in power consumption?), 2) the
    fact that 2 of those modules run fine in your system, and 3) the fact
    that the original motherboard fails only under heavy load.
    
    Just guessing....
    
    	Good luck!
    
    	Bart Rietkerk.

1215.8. "Interesting Failure!! :-)" by MASS10::geraldo.reo.dec.com::ConnollyG ([email protected]) Wed Jun 04 1997 15:11

>    motherboard in the DWLPB.  The replacement board failed with a hard-on

1215.9. "I think it's hose connection related" by GIDDAY::HIRSHMAN (Hugged your Webmeister today?) Thu Jun 05 1997 01:44

    Bart,
    
    Yes, I think the engineer did swap the power board with the one from
    the 8400's other (working) DWLPB shelf.  He also swapped over the hose
    cable from the other shelf.
    
    Actually, almost from the beginning the errors looked to me like they
    were probably due to hose connection problems.  The errors on the
    original motherboard were DMA READ RETURN DATA PARITY/LENGTH errors,
    which can be caused by a poor hose connection.  The self-test errors
    were CSR overrun errors which also can be caused by a poor hose
    connection, although I've never seen a self-test failure for this
    reason before.
    
    I just didn't know whether it was a board-to-box revision related
    mechanical incompatibility problem or simply faulty DWLPB metalwork. 
    I'm now fairly sure that it's the latter, but I still don't know
    whether it's a one-off or an instance of a wider problem.
    
    I'll know more when I receive the complete faulty DWLPB.  If I find a
    hose connection problem and it looks like it might be manufacturing
    process related, I'll arrange to have the whole DWLPB shipped to
    Maureen Jean of RSE for further analysis.
    
    -Bret