
Conference mvblab::alphaserver_4100

Title:AlphaServer 4100
Moderator:MOVMON::DAVISS
Created:Tue Apr 16 1996
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:648
Total number of notes:3158

540.0. "Diagnosability dialogue and questions" by HARMNY::CUMMINS () Thu Mar 20 1997 16:58

What follows are some comments about diagnosability/ease-of-use on the
4100/4000 platform and some questions for persons involved in servicing
and otherwise supporting 4100/4000 machines. I'd appreciate feedback.

I've noticed what seems to me a large number of module swaps on the
4100/4000 platform (before arriving at the real fix). My guess is this
is due to a combination of five (and maybe more) reasons:

  1. We could have done a better job in a) hardware design/test,
     b) firmware design/implementation, and c) software design/rollout
     (e.g. DECevent, the NT HAL, and RCU, to name a few). Some examples
     and discussion follow:

     a. We've found several design issues after HW FRS. This means we
        could have done a better job during DVT and PVT. We also have a
        design which doesn't allow for communication of problems to the
        user unless the path from the CPU, across the motherboard, through
        the PCI bridge card, and out to the PCI motherboard's Xbus is good
        enough to report OCP or COM1 failure/status messages. Another
        example is the MDP bug which precludes capture of IOD-detected CRD
        syndrome data by PALcode (without risking user data corruption). 
        This has presumably not helped w.r.t. isolation to a memory pair
        member (though typically two IOD-detected CRD errors are followed
        by a CPU-detected CRD with valid syndrome data..)

     b. We have two consoles versus one (SRM and AlphaBIOS) - this overly
        complicates the user interface. We've also had many bugs/features 
        which have caused headaches in the Field. Examples:

        --> the SRM console MEMORY_TEST EV and the associated SRM and
            Digital UNIX bugs in this area.
        --> the AlphaBIOS NVRAM write/update bug (fast/slow NVRAMs).

        And most importantly, we need to improve our ease-of-use, and in
        both consoles (assuming we'll never get down to one console). These
        machines are very complicated, so it follows that the FW is bound
        to be more complex than what, say, a VCR or TV provides.
        
        Complexity examples: boot device support for seemingly every
        device under the sun; various networking protocols; complex
        configurations such as clusters, multi-path devices, etc.; BIOS
        emulation of Intel bits on VGA option ROMs; VGAs and serial lines;
        PCI + EISA + Xbus junk I/O. There's a lot of ugliness/complexity
        underneath it all..

     c. Examples:

        --> DECevent was not all there at FRS. It's still not (no fault
            analysis support).
        --> 4100/4000 NT HAL isn't as robust as UNIX/VMS PALcode and the 
            operating system proper as far as error reporting and handling
            are concerned. E.g. I believe some errors were not even enabled
            (at least initially, anyway.. I've lost track here..)
        --> Anyone who has used RCU on Rawhide has likely experienced some
            level of frustration.

  2. FW release notes are inadequate. On top of this, I find that they
     frequently get read only after a problem arises. We need to do a much
     better job making the release notes and readme files comprehensive,
     yet concise/clear; especially in the area of warning users about
     potential problems. Even better, we need to build features into the
     console and FW update utilities to warn users and/or fix problems
     automatically - versus depending upon the release notes and readmes
     alone to head off problems. (A rough sketch of what such a check
     might look like follows this list.)

  3. Repair/support personnel are not sufficiently trained/experienced on
     the 4100/4000 platform yet. My limited insight tells me this is due
     primarily to two problems:

     --> Reorgs/layoffs in Customer Services during the past couple years.
     --> There are so many products supported by our Service organization,
         even by single individuals, that there's not enough time to learn
         sufficient detail about each one.

     Since the second is unlikely to change, it means we in Engineering
     have to make our machines and software as maintenance-free and as
     easy-to-use and diagnose as possible.

  4. The average server customer will not permit a repair person to spend
     very much time at all fixing a down machine, so the repair process
     typically involves swapping the most suspect part; then the next;
     then the next..

  5. We need better tools for proactive maintenance. Example: DECevent
     fault analysis capability is still not available (it's a crime that
     it wasn't available at FRS - my understanding is that six of seven
     DECevent developers were laid off many months ago! Not sure what the
     re-hire/manpower status is now..)
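
To make the suggestion in item 2 a little more concrete, here is a rough
sketch of the kind of check a FW update utility could carry around with it.
This is not actual console or update-utility code - the function names,
table entries, and revision strings below are all invented - it is only
meant to illustrate warning the user automatically instead of hoping the
release notes get read:

    /*
     * Sketch only - hypothetical names and revision strings throughout.
     * The idea: the update utility knows which installed revisions have
     * known gotchas and says so before flashing, rather than leaving the
     * warning buried in the release notes.
     */
    #include <stdio.h>
    #include <string.h>

    struct known_issue {
        const char *installed_rev;  /* revision currently in the flash ROM */
        const char *advice;         /* what the release notes would say    */
    };

    static const struct known_issue issues[] = {
        { "V3.0", "hypothetical: update AlphaBIOS before updating SRM" },
        { "V4.1", "hypothetical: re-run EISA configuration after this update" },
    };

    #define NISSUES (sizeof(issues) / sizeof(issues[0]))

    /*
     * Print any warnings that apply to the revision about to be replaced;
     * return the count so the caller can ask the user for confirmation.
     */
    int warn_known_issues(const char *installed_rev)
    {
        int i, warnings = 0;

        for (i = 0; i < (int)NISSUES; i++) {
            if (strcmp(installed_rev, issues[i].installed_rev) == 0) {
                printf("WARNING: %s\n", issues[i].advice);
                warnings++;
            }
        }
        return warnings;
    }

The hard part, of course, is keeping such a table current - but that is
exactly the information we already put into the release notes, so it is a
packaging problem rather than a new one.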

Questions for Service Engineers and anyone else with an opinion:

  Are the above five on the mark? Did I miss something?

  What are the top few areas of the HW, SW, and FW design that you'd like
  to see changed/improved on Rawhide and/or on next-generation designs?

  When modules/parts are pulled and returned, are return tags used to
  explain the failure symptoms / the problem being troubleshot? If not, is
  some other mechanism used for conveying valuable info to the repair
  centers, and if so, what mechanism is it?

Thanks for your time.
BC

540.1. by NEMAIL::SOBECKY (Facts,tho interesting,are irrelevant) Thu Mar 20 1997 21:04 (35 lines)
    
    Bill
    
    First of all, thank you for providing the opportunity and a mechanism
    for feedback on maintaining this product. Hopefully, you'll get some
    good feedback to make future products easier to maintain.
    
    IMHO, your reason number 3 is the biggest culprit. Lack of training,
    coupled with the large number of products that MCS is required to
    support, causes long repair times with many parts replacements. Too
    often, the typical FS person finds it difficult to remember simple
    facts about the particular platform; things like where the firmware is
    located in the machine. Let's see, I worked on a Laser yesterday, the
    fw was on the cpu module. A Sable is on the SIO. A Flamingo is in two
    places. A Rawhide is...umm, I forget.
    
    Info overload. Too many products to support, too little time to handle
    a call (pressure from both the customer and management).
    
    How to solve this? Beef up the support organizations so that you have a
    core group of people that can keep up with, and specialize in, certain
    platforms. Then make it *easy* for the average field service tech to
    reach them; no long hold times, no wait for a callback. Note that I
    make no statement about the CSC's; I realize they are overloaded and
    have been cut deeply, too.
    
    One more thing that has made this platform particularly difficult to
    maintain has been the large number of BLITZes generated against this
    product. Not pointing fingers, but the plain truth is that no-one can
    remember all the details of all the BLITZes. I can't remember ever
    seeing a DEC product with so many BLITZes in such a short time.
    
    Just my .02,
    
    -john
540.2. "GSO Engineering's opinion" by CARECL::CASTIEN (Hans Castien 889-9225) Fri Mar 21 1997 04:20 (132 lines)
      1. We could have done a better job in the a) hardware design/test,
         b) firmware design/implementation, and c) software design/rollout
         (e.g. DECevent, NT HAL, RCU to name a few) areas. Some examples
         and discussion follow:
    
    HC>> The problem with the long path from CPU to XBUS is almost a
    HC>> generic communication problem; it exists on more platforms.
    HC>> Hopefully this can change.
    HC>> Another design problem may be the location of the XSROM and console.
    HC>> In order for a system to diagnose itself properly, the majority of
    HC>> the system bus has to be working: from the XBUS through the PCI/EISA
    HC>> bridge (Horse/Saddle) onto the CPU.
    HC>> If something in this path is broken, the XSROM can't unload (?) and
    HC>> it may be a significant problem to isolate the failing FRU.
    HC>> My belief is the XSROM (and maybe the console) should be closer to
    HC>> the CPU, and the different engineering groups have to agree on a
    HC>> standard location for this, preferably as close to the CPU as possible.
    
    HC>> A wrong setting may also be a problem. If the system is set for
    HC>> AlphaBIOS but the console is serial, with only a VT-type terminal
    HC>> attached, the console is hard to work with.
    HC>> Likewise, when the console EV is set to graphics, messages can't be
    HC>> read properly until the VGA has been initialized. Maybe a solution
    HC>> can be found.
    
    
            --> DECevent was not all there at FRS. It's still not (no fault
                analysis support).
    HC>> This is very dramatic.
    
            --> 4100/4000 NT HAL isn't as robust as UNIX/VMS PALcode and the
                operating system proper as far as error reporting and handling
                are concerned. E.g. I believe some errors were not even enabled
                (at least initially, anyway.. I've lost track here..)
    HC>> The HAL must include some sort of diagnosability.
    
      2. FW release notes are inadequate. On top of this, I find that they
         frequently only get read after a problem arises. We need to do a
         much better job making the release notes and readme files
         comprehensive, yet concise/clear; especially in the area of warning
         users about potential problems. Even better, we need to build
         features into the console and FW update utilities to warn users
         and/or fix problems automatically - versus depending upon the
         release notes and readmes alone to head off problems.
    
    HC>> My impression is that people do not have the time to read the
    HC>> release notes and README first, so if the changes can be reported
    HC>> automatically when upgrading, that will be fine.
    
      3. Repair/support personnel are not sufficiently trained/experienced
         on the 4100/4000 platform yet. My limited insight tells me this is
         due primarily to two problems:
    
    HC>> This is caused by the fact that every system uses a different HW
    HC>> implementation of the console. Some have it on the CPU module,
    HC>> others far away on the XBUS, while yet other products have the
    HC>> console split across two modules (this should never happen again;
    HC>> Flamingo caused enough problems with incompatibility of versions).
    HC>> Also, differences in console commands and the lack of a proper
    HC>> document describing the console commands and ALL the parameters are
    HC>> a contributing factor.
    HC>> My opinion is that engineering must do their utmost to achieve ONE
    HC>> common console standard (SRM commands) and issue a proper document
    HC>> describing it. A standard for the console hardware and diagnostic
    HC>> port (SPORT or IIC or the Flamingo implementation) should also be
    HC>> chosen.
    
    HC>> Also, people should be allowed to attend training (management
    HC>> should take care of this). The decision makers should realize that
    HC>> AlphaServers and most AlphaStations are NOT PC boxes and require
    HC>> more skill.
    
    HC>> Lack of (bootable) diagnostics (like we had in the VAX days) may
    HC>> lead to insufficient service and a longer time to fix the system.
    
         --> Reorgs/layoffs in Customer Services during the past couple years.
         --> There are so many products supported by our Service
             organization, even by single individuals, that there's not
             enough time to learn sufficient detail about each one.
    
    HC>> I had a conversation by mail recently with Ted Gent about this.
    HC>> We in GSO Engineering are seeing problems similar to what SBU
    HC>> Engineering sees on this.
    HC>> People simply assume that if they know one system, they know them
    HC>> all. If this is the perception (of management), maybe it is wise to
    HC>> develop them all with the same console and with firmware in the
    HC>> same location.
    
         Since the second is unlikely to change, it means we in Engineering
         have to make our machines and software as maintenance-free and as
         easy-to-use and diagnose as possible.
    
    HC>> I agree
    
      4. The average server customer will not permit a repair person to
         spend very much time at all fixing a down machine, so the repair 
         process typically involves swapping the most suspect part; then the 
         next; then the next..
    
    HC>> Yep, leading to high service costs. This can be (partly) solved by
    HC>> better diagnosability of systems.
    
      5. We need better tools for proactive maintenance. Example: DECevent
         fault analysis capability is still not available (it's a crime
         that it wasn't available at FRS - my understanding is that six of seven
         DECevent developers were laid off many months ago! Not sure what
         the re-hire/manpower status is now..)
    
    HC>> Yep, I agree
    
    Questions for Service Engineers and anyone else with an opinion:
    
      When modules/parts are pulled and returned, are return tags used to
      explain the failure symptoms / problem being troubleshooted? If not,
      is some other mechanism for conveying valuable info to the repair
      centers used, and if so, what mechanism is used?
    
    HC>> In most instances there is some information on the Field Tag. We
    HC>> in JGO do keep track of the repair steps taken on every module and
    HC>> store this in a database. I'm in the process of writing and issuing
    HC>> a repair report on the Rawhide modules (CPUs, Saddle/Horse and
    HC>> motherboards); I'll send a copy to you, Bill, and to Ted Gent to
    HC>> use for internal distribution.
    HC>> It will take one to two weeks to finalize this.
    
    
540.3. "re: blitzes" by MOVMON::DAVIS () Fri Mar 21 1997 09:35 (11 lines)
    re: .1
    
    Your comment about the large number of field blitzes is sort of a
    double-edged sword.  This was a deliberate attempt by the Rawhide
    engineering organization (in particular Ted Gent) to provide real-time
    information to the field, in response to MCS, which had been requesting
    more and better system information.
    
    So, yes, it's a lot of information, but at least it's there if needed. 
    
    Todd
540.4. "agree" by NEMAIL::SOBECKY (Facts,tho interesting,are irrelevant) Sun Mar 23 1997 09:49 (7 lines)
    
    re -1
    
    You're right. I'd rather have this many blitzes any day than be kept in
    the dark.
    
    -john