
Conference mvblab::alphaserver_4100

Title:AlphaServer 4100
Moderator:MOVMON::DAVISS
Created:Tue Apr 16 1996
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:648
Total number of notes:3158

540.0. "Diagnosability dialogue and questions" by HARMNY::CUMMINS () Thu Mar 20 1997 16:58

What follows are some comments about diagnosability/ease-of-use on the
4100/4000 platform and some questions for persons involved in servicing
and otherwise supporting 4100/4000 machines. I'd appreciate feedback.

I've noticed what seems to me a large number of module swaps on the
4100/4000 platform (before arriving at the real fix). My guess is this
is due to a combination of five (and maybe more) reasons:

  1. We could have done a better job in a) hardware design/test,
     b) firmware design/implementation, and c) software design/rollout
     (e.g. DECevent, the NT HAL, and RCU, to name a few). Some examples
     and discussion follow:

     a. We've found several design issues after HW FRS. This means we
        could have done a better job during DVT and PVT. We also have a
        design which doesn't allow for communication of problems to the
        user unless the path from the CPU, across the motherboard, through
        the PCI bridge card, and out to the PCI motherboard's Xbus is good
        enough to report OCP or COM1 failure/status messages. Another
        example is the MDP bug which precludes capture of IOD-detected CRD
        syndrome data by PALcode (without risking user data corruption). 
        This has presumably not helped w.r.t. isolation to a memory pair
        member (though typically two IOD-detected CRD errors are followed
        by a CPU-detected CRD with valid syndrome data..)

     b. We have two consoles versus one (SRM and AlphaBIOS) - this overly
        complicates the user interface. We've also had many bugs/features 
        which have caused headaches in the Field. Examples:

        --> the SRM console MEMORY_TEST EV and the associated SRM and
            Digital UNIX bugs in this area.
        --> the AlphaBIOS NVRAM write/update bug (fast/slow NVRAMs).

        And most importantly, we need to improve our ease-of-use, and in
        both consoles (assuming we'll never get down to one console). These
        machines are very complicated, so it follows that the FW is bound
        to be more complex than what, say, a VCR or TV provides.
        
        Complexity examples: boot device support for seemingly every
        device under the sun; various networking protocols; complex
        configurations such as clusters, multi-path devices, etc.; BIOS
        emulation of Intel bits on VGA option ROMs; VGAs and serial lines;
        PCI + EISA + Xbus junk I/O. There's a lot of ugliness/complexity
        underneath it all..

     c. Examples:

        --> DECevent was not all there at FRS. It's still not (no fault
            analysis support).
        --> 4100/4000 NT HAL isn't as robust as UNIX/VMS PALcode and the 
            operating system proper as far as error reporting and handling
            are concerned. E.g. I believe some errors were not even enabled
            (at least initially, anyway.. I've lost track here..)
        --> Anyone who has used RCU on Rawhide has likely experienced some
            level of frustration.

  2. FW release notes are inadequate. On top of this, I find that they
     frequently get read only after a problem arises. We need to do a much
     better job making the release notes and readme files comprehensive,
     yet concise/clear; especially in the area of warning users about
     potential problems. Even better, we need to build features into the
     console and FW update utilities to warn users and/or fix problems
     automatically - versus depending upon the release notes and readmes
     alone to head off problems. (A rough sketch of what such a check
     might look like follows this list.)

  3. Repair/support personnel are not sufficiently trained/experienced on
     the 4100/4000 platform yet. My limited insight tells me this is due
     primarily to two problems:

     --> Reorgs/layoffs in Customer Services during the past couple years.
     --> There are so many products supported by our Service organization,
         even by single individuals, that there's not enough time to learn
         sufficient detail about each one.

     Since the second is unlikely to change, it means we in Engineering
     have to make our machines and software as maintenance-free and as
     easy-to-use and diagnose as possible.

  4. The average server customer will not permit a repair person to spend
     very much time at all fixing a down machine, so the repair process
     typically involves swapping the most suspect part; then the next;
     then the next..

  5. We need better tools for proactive maintenance. Example: DECevent
     fault analysis capability is still not available (it's a crime that
     it wasn't available at FRS - my understanding is that six of seven
     DECevent developers were laid off many months ago! Not sure what the
     re-hire/manpower status is now..)
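
To make the suggestion in item 2 a little more concrete, here is a rough
sketch of the kind of check a FW update utility could carry around with it.
This is not actual console or update-utility code - the function names,
table entries, and revision strings below are all invented - it is only
meant to illustrate warning the user automatically instead of hoping the
release notes get read:

    /*
     * Sketch only - hypothetical names and revision strings throughout.
     * The idea: the update utility knows which installed revisions have
     * known gotchas and says so before flashing, rather than leaving the
     * warning buried in the release notes.
     */
    #include <stdio.h>
    #include <string.h>

    struct known_issue {
        const char *installed_rev;  /* revision currently in the flash ROM */
        const char *advice;         /* what the release notes would say    */
    };

    static const struct known_issue issues[] = {
        { "V3.0", "hypothetical: update AlphaBIOS before updating SRM" },
        { "V4.1", "hypothetical: re-run EISA configuration after this update" },
    };

    #define NISSUES (sizeof(issues) / sizeof(issues[0]))

    /*
     * Print any warnings that apply to the revision about to be replaced;
     * return the count so the caller can ask the user for confirmation.
     */
    int warn_known_issues(const char *installed_rev)
    {
        int i, warnings = 0;

        for (i = 0; i < (int)NISSUES; i++) {
            if (strcmp(installed_rev, issues[i].installed_rev) == 0) {
                printf("WARNING: %s\n", issues[i].advice);
                warnings++;
            }
        }
        return warnings;
    }

The hard part, of course, is keeping such a table current - but that is
exactly the information we already put into the release notes, so it is a
packaging problem rather than a new one.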

Questions for Service Engineers and anyone else with an opinion:

  Are the above five on the mark? Did I miss something?

  What are the top few areas of the HW, SW, and FW design that you'd like
  to see changed/improved on Rawhide and/or on next-generation designs?

  When modules/parts are pulled and returned, are return tags used to
  explain the failure symptoms / the problem being troubleshot? If not, is
  some other mechanism used for conveying valuable info to the repair
  centers, and if so, what mechanism is it?

Thanks for your time.
BC

540.1. by NEMAIL::SOBECKY (Facts,tho interesting,are irrelevant) Thu Mar 20 1997 21:04 (35 lines)
    
    Bill
    
    First of all, thank you for providing the opportunity and a mechanism
    for feedback on maintaining this product. Hopefully, you'll get some
    good feedback to make future products easier to maintain.
    
    IMHO, your reason number 3 is the biggest culprit. Lack of training,
    coupled with the large number of products that MCS is required to
    support, causes long repair times with many parts replacements. Too
    often, the typical FS person finds it difficult to remember simple
    facts about the particular platform; things like where the firmware is
    located in the machine. Let's see, I worked on a Laser yesterday, the
    fw was on the cpu module. A Sable is on the SIO. A Flamingo is in two
    places. A Rawhide is...umm, I forget.
    
    Info overload. Too many products to support, too little time to handle
    a call (pressure from both the customer and management).
    
    How to solve this? Beef up the support organizations so that you have a
    core group of people that can keep up with, and specialize in, certain
    platforms. Then make it *easy* for the average field service tech to
    reach them; no long hold times, no wait for a callback. Note that I
    make no statement about the CSC's; I realize they are overloaded and
    have been cut deeply, too.
    
    One more thing that has made this platform particularly difficult to
    maintain has been the large number of BLITZes generated against this
    product. Not pointing fingers, but the plain truth is that no-one can
    remember all the details of all the BLITZes. I can't remember ever
    seeing a DEC product with so many BLITZes in such a short time.
    
    Just my .02,
    
    -john
540.2. "GSO Engineering's opinion" by CARECL::CASTIEN (Hans Castien 889-9225) Fri Mar 21 1997 04:20 (132 lines)
      1. We could have done a better job in the a) hardware design/test,
         b) firmware design/implementation, and c) software design/rollout
         (e.g. DECevent, NT HAL, RCU to name a few) areas. Some examples
         and discussion follow:
    
    HC>> The problem with the long path from CPU to XBUS is almost a
    HC>> generic communication problem; it exists on more platforms.
    HC>> Hopefully this can change.
    HC>> Another design problem may be the location of the XSROM and console.
    HC>> In order for a system to diagnose itself properly, the majority of
    HC>> the system bus has to be working: from the XBUS through the PCI/EISA
    HC>> bridge (Horse/Saddle) onto the CPU.
    HC>> If something in this path is broken, the XSROM can't unload (?) and
    HC>> it may be a significant problem to isolate the failing FRU.
    HC>> My belief is the XSROM (and maybe the console) should be closer to
    HC>> the CPU, and the different engineering groups have to agree on a
    HC>> standard location for this, preferably as close to the CPU as possible.
    
    HC>> A wrong setting may also be a problem. If the system is set for
    HC>> AlphaBIOS but the console is serial, with only a VT-type terminal
    HC>> attached, the console is hard to work with.
    HC>> Likewise, when the console EV is set to graphics, messages can't be
    HC>> read properly until the VGA has been initialized. Maybe a solution
    HC>> can be found.
    
    
            --> DECevent was not all there at FRS. It's still not (no fault
                analysis support).
    HC>> This is very dramatic.
    
            --> 4100/4000 NT HAL isn't as robust as UNIX/VMS PALcode and the
                operating system proper as far as error reporting and handling
                are concerned. E.g. I believe some errors were not even enabled
                (at least initially, anyway.. I've lost track here..)
    HC>> The HAL must include some sort of diagnosability.
    
      2. FW release notes are inadequate. On top of this, I find that they
         frequently only get read after a problem arises. We need to do a
         much better job making the release notes and readme files
         comprehensive, yet concise/clear; especially in the area of warning
         users about potential problems. Even better, we need to build
         features into the console and FW update utilities to warn users
         and/or fix problems automatically - versus depending upon the
         release notes and readmes alone to head off problems.
    
    HC>> My impression is that people do not have the time to read the
    HC>> release notes and README first, so if the changes can be reported
    HC>> automatically when upgrading, that will be fine.
    
      3. Repair/support personnel are not sufficiently trained/experienced
         on the 4100/4000 platform yet. My limited insight tells me this is
         due primarily to two problems:
    
    HC>> This is caused by the fact that every system uses a different HW
    HC>> implementation of the console. Some have it on the CPU module,
    HC>> others far away on the XBUS, while yet other products have the
    HC>> console split across two modules (this should never happen again;
    HC>> Flamingo caused enough problems with incompatibility of versions).
    HC>> Also, differences in console commands and the lack of a proper
    HC>> document describing the console commands and ALL the parameters are
    HC>> a contributing factor.
    HC>> My opinion is that engineering must do their utmost to achieve ONE
    HC>> common console standard (SRM commands) and issue a proper document
    HC>> describing it. A standard for the console hardware and diagnostic
    HC>> port (SPORT or IIC or the Flamingo implementation) should also be
    HC>> chosen.
    
    HC>> Also, people should be allowed to attend training (management
    HC>> should take care of this). The decision makers should realize that
    HC>> AlphaServers and most AlphaStations are NOT PC boxes and require
    HC>> more skill.
    
    HC>> Lack of (bootable) diagnostics (like we had in the VAX days) may
    HC>> lead to insufficient service and a longer time to fix the system.
    
         --> Reorgs/layoffs in Customer Services during the past couple years.
         --> There are so many products supported by our Service
             organization, even by single individuals, that there's not
             enough time to learn sufficient detail about each one.
    
    HC>> I had a conversation by mail recently with Ted Gent about this.
    HC>> We in GSO Engineering are seeing problems similar to what SBU
    HC>> Engineering sees on this.
    HC>> People simply assume that if they know one system, they know them
    HC>> all. If this is the perception (of management), maybe it is wise to
    HC>> develop them all with the same console and with firmware in the
    HC>> same location.
    
         Since the second is unlikely to change, it means we in Engineering
         have to make our machines and software as maintenance-free and as
         easy-to-use and diagnose as possible.
    
    HC>> I agree
    
      4. The average server customer will not permit a repair person to
         spend very much time at all fixing a down machine, so the repair 
         process typically involves swapping the most suspect part; then the 
         next; then the next..
    
    HC>> Yep, leading to high service costs. This can be (partly) solved by
    HC>> better diagnosability of systems.
    
      5. We need better tools for proactive maintenance. Example: DECevent
         fault analysis capability is still not available (it's a crime
         that it wasn't available at FRS - my understanding is that six of seven
         DECevent developers were laid off many months ago! Not sure what
         the re-hire/manpower status is now..)
    
    HC>> Yep, I agree
    
    Questions for Service Engineers and anyone else with an opinion:
    
      When modules/parts are pulled and returned, are return tags used to
      explain the failure symptoms / problem being troubleshooted? If not,
      is some other mechanism for conveying valuable info to the repair
      centers used, and if so, what mechanism is used?
    
    HC>> In most instances there is some information on the Field Tag. We
    HC>> in JGO do keep track of the repair steps taken on every module and
    HC>> store this in a database. I'm in the process of writing and issuing
    HC>> a repair report on the Rawhide modules (CPUs, Saddle/Horse and
    HC>> motherboards); I'll send a copy to you, Bill, and to Ted Gent to
    HC>> use for internal distribution.
    HC>> It will take one to two weeks to finalize this.
    
    
540.3. "re: blitzes" by MOVMON::DAVIS () Fri Mar 21 1997 09:35 (11 lines)
    re: .1
    
    Your comment about the large number of field blitzes is sort of a
    double-edged sword.  This was a deliberate attempt by the Rawhide
    engineering organization (in particular Ted Gent) to provide real-time
    information to the field, in response to MCS, which had been requesting
    more and better system information.
    
    So, yes, it's a lot of information, but at least it's there if needed. 
    
    Todd
540.4. "agree" by NEMAIL::SOBECKY (Facts,tho interesting,are irrelevant) Sun Mar 23 1997 09:49 (7 lines)
    
    re -1
    
    You're right. I'd rather have this many blitzes any day than be kept in
    the dark.
    
    -john