|
Bill,
First of all, thank you for providing the opportunity and a mechanism for
feedback on maintaining this product. Hopefully you'll get some good
feedback to make future products easier to maintain.
IMHO, your reason number 3 is the biggest culprit. Lack of training,
coupled with the large number of products that MCS is required to
support, causes long repair times with many parts replacements. Too
often, the typical FS person finds it difficult to remember simple
facts about the particular platform; things like where the firmware is
located in the machine. Let's see, I worked on a Laser yesterday; the
FW was on the CPU module. On a Sable it's on the SIO. On a Flamingo it's
in two places. On a Rawhide it's... umm, I forget.
Info overload. Too many products to support, too little time to handle
a call (pressure from both the customer and management).
How to solve this? Beef up the support organizations so that you have a
core group of people who can keep up with, and specialize in, certain
platforms. Then make it *easy* for the average field service tech to
reach them; no long hold times, no waiting for a callback. Note that I
make no statement about the CSCs; I realize they are overloaded and
have been cut deeply, too.
One more thing that has made this platform particularly difficult to
maintain has been the large number of BLITZes generated against this
product. Not pointing fingers, but the plain truth is that no one can
remember all the details of all the BLITZes. I can't remember ever
seeing a DEC product with so many BLITZes in such a short time.
Just my .02,
-john
|
| 1. We could have done a better job in a) hardware design/test,
b) firmware design/implementation, and c) software design/rollout
(e.g. DECevent, NT HAL, and RCU, to name a few). Some examples
and discussion follow:
HC>> The problem with the long path from CPU to XBUS is almost a
HC>> generic communication problem; it exists on other platforms as well.
HC>> Hopefully this can change.
HC>> Another design problem may be the location of XSROM and Console.
HC>> For a system to diagnose itself properly, the majority of the
HC>> system bus has to be working: from the XBUS through PCI/EISA
HC>> (Horse/Saddle) onto the CPU.
HC>> If something in this path is broken, XSROM can't unload (?) and it
HC>> may be a significant problem to isolate the failing FRU.
HC>> My belief is that XSROM (and maybe Console) should be closer to the
HC>> CPU. The different engineering groups have to agree on a standard
HC>> location for this, preferably as close to the CPU as possible.
HC>> A wrong setting may also be a problem. If the system is set for
HC>> AlphaBIOS and the console is serial, with only a VT-type terminal
HC>> attached, the console is hard to work with.
HC>> The EV Console also prevents proper reading of messages when set to
HC>> graphics, due to the fact that the VGA has to be initialized first.
HC>> Maybe a solution can be found.
--> DECevent was not all there at FRS. It's still not (no fault
analysis support).
HC>> This is very dramatic.
--> The 4100/4000 NT HAL isn't as robust as UNIX/VMS PALcode and
the operating system proper as far as error reporting and
handling are concerned. E.g., I believe some errors were not even
enabled (at least initially, anyway... I've lost track here...)
HC>> The HAL must include some sort of diagnosability.
2. FW release notes are inadequate. On top of this, I find that they
frequently only get read after a problem arises. We need to do a
much better job making the release notes and readme files
comprehensive yet concise and clear, especially in the area of
warning users about potential problems. Even better, we need to
build features into the console and FW update utilities to warn
users and/or fix problems automatically, versus depending upon the
release notes and readmes alone to head off problems.
HC>> My impression is people do not have the time to read the release
HC>> notes and readmes first. So if the changes can be reported at
HC>> upgrade time, that will be fine.
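One possible shape for that idea, sketched as a mock console session
rather than as output from any actual update utility (the device name,
version numbers, and warning text below are all invented for
illustration):

    >>> boot dka400              (boot the update medium; dka400 is only
                                  an example device name)
    *** WARNING: updating from V2.0 to V3.1 changes the default
    *** console settings. See the release notes before continuing.
    Continue with the update? [y/n]

The warning then reaches whoever runs the update at exactly the moment
it matters, whether or not the release notes were ever opened.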
3. Repair/support personnel are not sufficiently trained/experienced
on the 4100/4000 platform yet. My limited insight tells me this is
due primarily to two problems:
HC>> This is caused by the fact that every system uses a different HW
HC>> implementation of the console. Some have it on the CPU module,
HC>> others far away on an XBUS, while yet other products have the
HC>> console split across two modules (this should never happen again;
HC>> Flamingo caused enough problems with incompatibility of versions).
HC>> Also, the differences in console commands and the lack of a proper
HC>> document describing the console commands and ALL the parameters are
HC>> contributing factors.
HC>> My opinion is engineering must do their utmost to achieve ONE common
HC>> console standard (SRM commands) and issue a proper document
HC>> describing it. A standard for the console hardware and the diagnostic
HC>> port (SPORT or IIC or the Flamingo implementation) should also be
HC>> chosen.
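To make this concrete, here are a few SRM commands that already behave
much the same across Alpha platforms and could anchor such a standard
document (a sketch only; supported parameters still differ per machine,
and dka0 is just an example device name):

    >>> show config           (display the hardware configuration)
    >>> show device           (list the devices the console can see)
    >>> show memory           (report the installed memory)
    >>> set bootdef_dev dka0  (select the default boot device)
    >>> set os_type unix      (a setting of nt starts AlphaBIOS
                               instead of SRM)
    >>> test                  (run the console's built-in diagnostics)

A single document spelling out this set, with ALL the parameters for
every platform, would remove much of the guesswork described above.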
HC>> Also, people should be allowed to attend training (management should
HC>> take care of this). The decision makers should realize that
HC>> AlphaServers and most AlphaStations are NOT PC boxes and require more
HC>> skills.
HC>> Lack of (bootable) diagnostics (like we had in the VAX days) may
HC>> lead to insufficient service and a longer time to fix the system.
--> Reorgs/layoffs in Customer Services during the past couple of years.
--> There are so many products supported by our Service
organization, even by single individuals, that there's not enough
time to learn sufficient detail about each one.
HC>> I had a conversation in mail recently with Ted Gent about this.
HC>> We in GSO Engineering are seeing the same problems on this as SBU
HC>> Engineering.
HC>> People simply assume that if they know one system they know them all.
HC>> If this is the perception (of management), maybe it is wise to
HC>> develop the systems with the same console and with the firmware in
HC>> the same location.
Since the second is unlikely to change, it means we in Engineering
have to make our machines and software as maintenance-free and as
easy to use and diagnose as possible.
HC>> I agree
4. The average server customer will not permit a repair person to
spend very much time at all fixing a down machine, so the repair
process typically involves swapping the most suspect part; then the
next; then the next..
HC>> Yep, leading to high service costs. This can be (partly) solved by
HC>> better diagnosability of systems.
5. We need better tools for proactive maintenance. Example: DECevent
fault analysis capability is still not available (it's a crime
that it wasn't available at FRS - my understanding is that six of seven
DECevent developers were laid off many months ago! Not sure what
the re-hire/manpower status is now..)
HC>> Yep, I agree
Questions for Service Engineers and anyone else with an opinion:
When modules/parts are pulled and returned, are return tags used to
explain the failure symptoms / the problem being troubleshot? If not,
is some other mechanism used to convey that valuable info to the repair
centers, and if so, what is it?
HC>> In most instances there is some information on the Field Tag. We
HC>> in JGO do keep track of the repair steps taken on every module, and
HC>> we store this in a database. I'm in the process of writing and
HC>> issuing a repair report on the Rawhide modules (CPUs, Saddle/Horse,
HC>> and motherboards), and I'll send a copy to you, Bill, and to Ted
HC>> Gent to use for internal distribution.
HC>> It will take one to two weeks to finalize this.
|
| re: .1
Your comment about the large number of field BLITZes is sort of a
double-edged sword. They were a deliberate attempt by the Rawhide
engineering organization (in particular Ted Gent) to provide real-time
information to the field, in response to MCS, which had been
requesting more and better system information.
So, yes, it's a lot of information, but at least it's there if needed.
Todd
|