
Conference 7.286::digital

Title:The Digital way of working
Moderator:QUARK::LIONELON
Created:Fri Feb 14 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5321
Total number of notes:139771

4424.0. "MTBF on Alphas?" by MKOTS3::TLAPOINTE () Wed Feb 14 1996 16:53

    Does anyone know where I can get MTBF (mean time between failure) data
    on the following:
    	AS 2000 4/275
    	AW 200 4/166
    	AW 250 4/266
    
    I need this data ASAP as I'm competing against HP and they have already
    supplied data on their machine (HP715: est. 4 to 4.25 yrs MTBF, per my
    VAR).
    
    	We used to have an easy way of requesting this data but I was told
    the process was killed.  Any assistance will be greatly appreciated.
    
    Regards,
    
    Tony LaPointe
T.R  Title  User  Personal Name  Date  Lines
4424.1  + or - .01%  ODIXIE::KING  Wed Feb 14 1996 19:05  8
    No MTBF STATS...but Let's connect on Friday before another weekend
    blows us by.
    
    You might want to give Tom Walker a call for info on MTBF data. Tom
    is part of the competitive watch team covering workstations. You can
    reach Tom by calling 407-6602100.
    
    Russ the ISSMister
4424.2  MTBF/MTTR automated system  ODIXIE::MOREAU  Ken Moreau;Technical Support;Florida  Wed Feb 14 1996 19:29  57
RE: .0

I use an automated system which gives me turn-around within 24 hours:
all you need is the part #.  I last used this system about 3 months
ago, so it should still work.

Fill out the form below, and mail it to either MTBF @OGO or CSSE::MTBF.
You will get an answer via the e-mail address you specify.

-- Ken Moreau


		MTBF/MTTR Request Form
		----------------------

Fill in the information below, and send this memo to either MTBF @OGO or
CSSE::MTBF.  The information, along with the required disclaimer, will
be sent to you at the ALL-IN-1 address or VMS address you specify below.

************************************************************************
*** IMPORTANT - AN AUTOMATED SYSTEM WILL PROCESS AND REPLY TO YOUR   ***
***     REQUEST.  PLEASE DO NOT DEVIATE FROM THIS FORMAT.            ***
************************************************************************

If you have any questions, please write or call the ADEG Program Engineering
Group at GSGPROGENG @MKO, or DTN 264-4727.

YOUR NAME:

YOUR ALL-IN-1 ADDRESS:
		(Example: JOHN JONES @OGO)
or
YOUR VMS ADDRESS:
		(Example: CSSE::JONES)

YOUR COST CENTER:

YOUR BADGE NUMBER:

CUSTOMER NAME:

CUSTOMER LOCATION:

BUSINESS REASON FOR RELEASING THIS INFORMATION:


Using one line per part number, list the part numbers for which you need
reliability (MTBF/MTTR) data below between the words "BEGIN" and "END".  Please
use the 2-5-2 part number format (00-TK50-AA) or you WILL NOT receive the
information that you requested.

DO NOT REMOVE THE WORDS "BEGIN" AND "END".

BEGIN


END
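
The format rules in the form above (part numbers one per line between "BEGIN" and "END", in 2-5-2 format) can be sketched as a small checker. This is only an illustration, not the code of the original automated system: the function name is hypothetical, and the pattern (two characters, up to five characters, two characters, upper-case letters and digits) is an assumption based on the single example 00-TK50-AA.

```python
import re

# Assumed reading of "2-5-2": two characters, up to five characters,
# two characters. The real system's rules are not documented here.
PART_NUMBER = re.compile(r"^[0-9A-Z]{2}-[0-9A-Z]{1,5}-[0-9A-Z]{2}$")


def extract_part_numbers(memo):
    """Return (valid, invalid) part numbers found between BEGIN and END.

    Raises ValueError if either marker line is missing, mirroring the
    form's instruction not to remove the words BEGIN and END.
    """
    lines = [line.strip() for line in memo.splitlines()]
    begin = lines.index("BEGIN")
    end = lines.index("END", begin + 1)
    candidates = [line for line in lines[begin + 1:end] if line]
    valid = [c for c in candidates if PART_NUMBER.match(c)]
    invalid = [c for c in candidates if not PART_NUMBER.match(c)]
    return valid, invalid


memo = "YOUR NAME: J. Jones\nBEGIN\n00-TK50-AA\nTK50\n\nEND\n"
valid, invalid = extract_part_numbers(memo)
print("valid:", valid)
print("invalid:", invalid)
```

Checking a request this way before mailing it would catch entries that are not in the expected shape.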
4424.3  No automated system available  DECIDE::MOFFITT  Wed Feb 14 1996 22:20  25
Ken,

Good idea but the automated system died on Oct 31 due to lack of funding. BTW,
the data had become pretty stale over the last year or so. Here's what the 
header looked like during its last month. Trust me, it's gone.


    #14         26-OCT-1995 06:33:42.42                                     MTBF
From:   CSSE::MTBF "26-Oct-1995 0830 -0400"
To:     TURCOTTE,DECIDE::MOFFITT
CC:     MTBF
Subj:   COMPLETED_MTBF_REQUEST

*******************************************************************************

      NOTICE:  This system will END-OF-SERVICE as of OCTOBER 31, 1995.


*******************************************************************************


I made a couple of suggestions to Tony off line.

enjoy,
tim m.
4424.4  Sigh :-(  ODIXIE::MOREAU  Ken Moreau;Technical Support;Florida  Thu Feb 15 1996 00:19  0
4424.5  It's not easy...  TRUCKS::KEMPSTER  Thu Feb 15 1996 03:59  9
    I recently went through a similar exercise and for very much the same
    reasons. 
    	Firstly I found that entries in the relevant notes conferences
    helped. Secondly I was told that if I was to release these figures to a
    customer the source should be the product manager.
    
    	Hope this helps,
    
    	Tom Kempster
4424.6  Contact in APS  NETCAD::GENOVA  Thu Feb 15 1996 07:37  8
    
    Hi,
    
    Dan Riccio, wrksys::riccio was the Mechanical Engineer for the
    AlphaStation 200 and 250, he could tell you the MTBF, as I remember
    they were quite high.
    
    /art
4424.7  VANGA::KERRELL  salva res est  Thu Feb 15 1996 07:59  8
I recently had the same problem and was told the owner of MTBF info process for
the SBU is:-

Rick Howe @MRO (RELYON::HOWE)

If you mail him the part nos, he should be able to help.

Dave.
4424.8  20,000 hours and still going  BBPBV1::WALLACE  UNIX is digital. Use Digital UNIX.  Thu Feb 15 1996 15:06  8
    Ignoring the politics:
    
    if you have access to TIMA/STARS (I don't), you can often find MTBF
    figures of some sort in the Product Service Plan for the widget you're
    interested in.
    
    regards
    john
4424.9  The usual caveats...  ATLANT::SCHMIDT  See http://atlant2.zko.dec.com/  Thu Feb 15 1996 15:54  28
  Folks may already know this, but it bears repeating:

  Even if you can *GET* an MTBF number, be cautious in using it.

    1. Digital considers its MTBF numbers proprietary information
       as they can be very useful to competitors of MCS. It's a
       lot easier to quote a maintenance price on a product when
       you know (with pretty good accuracy) what the vendor con-
       siders the failure rate to be.

    2. The MTBF number is often wrapped up in a complex set of
       assumptions about the operating conditions for the unit.
       For example, we might quote an MTBF of 100K hours for
       equipment operating in the temperature range of 5-50°C,
       whereas a competitor might quote an MTBF of 300K hours
       for identical equipment, but operating over a temperature
       range of 20-35°C. This can lead to a *BIG* apparent
       difference in the "goodness" of our product versus theirs
       even though no actual difference exists.


  There's certainly no ultimate problem in quoting MTBF numbers and
  we do it all the time -- I'm just advising caution. Don't just
  find a number in some database somewhere and blindly quote it.
  The Product Manager is a good source of help when you're being
  asked for MTBF numbers.
                                   Atlant
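
  When comparing figures quoted in different units (years in .0, hours
  here), it can help to convert them and to look at what each implies
  over one year. A minimal sketch, assuming a constant failure rate (the
  exponential model); that assumption and the function names are
  illustrative and do not come from any Digital MTBF database.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mtbf_years_to_hours(years):
    """Convert an MTBF quoted in years of continuous operation to hours."""
    return years * HOURS_PER_YEAR


def failure_probability(mtbf_hours, period_hours=HOURS_PER_YEAR):
    """Probability of at least one failure within period_hours.

    Assumes a constant failure rate, so the time to failure is
    exponentially distributed with mean mtbf_hours.
    """
    return 1.0 - math.exp(-period_hours / mtbf_hours)


# The two example figures from point 2 above.
for label, mtbf in [("100K hours", 100_000), ("300K hours", 300_000)]:
    years = mtbf / HOURS_PER_YEAR
    p = failure_probability(mtbf)
    print(f"{label}: {years:.1f} years, P(failure within a year) = {p:.1%}")
```

  None of this removes the caveat in point 2: the numbers are only
  comparable if the operating conditions behind them are the same.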

4424.10  WE HAVE HIGH AVAILABILITY SERVICES  UTROP1::KOOIJMAN  LIFE IS HELL THEN YOU DIE  Fri Feb 16 1996 02:54  482
    Gentlemen,
    
    
    MTTR and MTBF figures are very interesting but are almost meaningless
    today. For some disks we specify an MTBF of 800,000 hours! (Do you believe it?)
    Also we must have corporate approval to give them to customers.
    Digital has developed a couple of services called High Availability
    Services under the leadership of Dave Varner @OGO. He is the corporate
    business manager for this. Engineering for the AVANTO application is
    located in Shrewsbury Mass. The engineer for AVANTO is Ron Rocheleau @SHR.
    In Holland we have made availability models for hundreds of systems
    during the past year and we have earned lots of revenue with it. This
    is a unique capability. A short write-up of what we can do is included.
    Please contact Dave Varner @OGO for more information. This is the best
    in the IT industry. With Availability Review and Partnership services
    we have taken out the competition many times in Holland so please 
    involve Dave.
    In its simplest form AVANTO can be used to produce hardware
    availability figures in a very professional way.
    
    
    
    Best Regards,
    
    
    Aad Kooijman @ UTO (The Netherlands, which is over in Europe)
    Business manager High Availability Services
                                                
    AVANTO
    Over the past four years, Digital's Multivendor Customer Services
    organization has developed and frequently applied an availability
    analysis application. This application is called AVANTO (AVailability
    ANalysis TOol) and has been used successfully in practice to determine
    the availability of many hundreds of configurations. It has also been
    empirically established that the predicted results are actually
    realized in 95% of all cases. This means a unique tool is now available
    to IT managers. This of course leaves aside other factors such as
    the management organization, the applications, etc. These aspects
    (domains) will be involved in an Availability Review Service conducted
    by Digital. So how do things proceed when using AVANTO?
    
    The essence of AVANTO is that it enables a system to be designed so
    that the anticipated availability can be made to correspond to the
    demands made upon it from the business. AVANTO is frequently used when
    configuring new application systems. AVANTO is an application that
    enables the availability of very complex systems to be modeled and to
    determine beforehand how the demands with respect to availability can
    be realized without working in an arbitrary way.
    In the simplest of applications it is possible to calculate the average
    availability of a system's hardware by using MTTR and MTBF data.
    However, as stated above, this will result in an incomplete picture as
    many other aspects will codetermine the availability in practice. One
    example is the quality of the environment in which the equipment has
    been installed. Also very important is the organization of the helpdesk
    and the underlying second and third-line support of the various
    suppliers.
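
    The simplest calculation mentioned above can be sketched with the
    standard steady-state formula, availability = MTBF / (MTBF + MTTR).
    This is a generic illustration, not AVANTO's method; the function
    names and the example figures (20,000 hours MTBF, 4 hours MTTR) are
    assumptions chosen for the example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected downtime per year implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical component: fails every 20,000 hours on average,
# takes 4 hours on average to repair.
a = availability(20_000, 4)
print(f"availability: {a:.5%}")
print(f"expected downtime: {downtime_hours_per_year(20_000, 4):.2f} hours/year")
```

    As the text says, such a figure covers the hardware only and gives an
    incomplete picture of what the user will experience.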
    When establishing the potential availability of an existing system, it
    will be necessary to investigate how the management, the environment,
    the software and the other domains have been set up. Besides the
    hardware, all these domains influence the level of availability. By
    incorporating parameters and setting up a business scenario, AVANTO can
    show availability as a function of the business requirements.
    
    Digital's Availability Review and Partnership Services are also
    conducted with the aid of AVANTO. An existing situation is scrutinized
    in an Availability Review and a very detailed investigation determines
    what can be done to improve availability management. Alternative
    situations can also be modelled. It is not difficult to imagine that
    this approach is far preferable to one in which measures are
    taken more or less by guesswork, after which we have to measure what
    effects these measures have had. Moreover, the costs incurred when
    improving availability in retrospect are generally much higher than
    those associated with conducting an analysis in advance. AVANTO is now
    in structural use by a large number of organizations within Change and
    Availability Management in order to determine the availability effects
    of scheduled configuration changes in advance.
    
    
    Availability investigation
    
    Let's assume that an IT organization with a given infrastructure wants
    to determine what availability can be offered to its users. Or the
    existing availability has to be increased. Digital Multivendor Customer
    Services can provide answers to these questions based on investigation
    and with the aid of AVANTO.
    So how does such an investigation proceed?
    At the start, the customer is consulted to establish which areas
    (domains) are to be involved in the investigation. If a decision is
    taken to limit the investigation to the hardware configuration, the
    result will be of limited value. In accordance with the information
    provided by ITIL (Information Technology Infrastructure Library), 
    there are five other domains besides the hardware
    that must be involved in an Availability Review:
    1. The environment, including climatic control, power supply, service
    contracts, etc.
    2. The system management, the organization, the procedures and the
    system configuration
    3. The network
    4. The system software
    5. The applications and the application management.
    
    Only when all these domains have been exhaustively charted will it be
    possible to determine what availability can be offered. If it is found
    that the availability to be offered is inadequate, alternative
    scenarios can be formulated with the aid of AVANTO. An Availability
    Review provides a complete picture of all the aspects of availability
    management.
    
    AVANTO business model
    When carrying out an Availability Review Service it will be established
    what availability can be offered with an existing configuration or a
    new one yet to be installed. Furthermore, extensive modelling
    activities are also possible using AVANTO. Based on an initial AVANTO
    model, the reference model, we can modify the redundancies and/or the
    service contracts to determine the costs at which the desired
    availability can be realized.
    In the first place, however, it will be necessary to indicate when a
    certain availability is to be offered. In this way, a central system
    with an important database may require 100% availability during the day
    while this level will also be required for the back-up equipment during
    the evening and night. It might also be so that not all the hardware is
    important for a particular application while a different application
    does require all the hardware in the system in order to operate.
    Charting these aspects is called setting up the business model or
    business scenario in AVANTO. This business scenario is entered into
    AVANTO, which always displays the availability as a function of the
    business model.
    
    Example of a simple business scenario
               Day 1   Day 2   Day 3   Day 4   Day 5   Day 6   Day 7
    Shift 1    50%     50%     100%    100%    75%     50%     25%
    Shift 2    100%    100%    100%    100%    100%    50%     50%
    Shift 3    100%    100%    100%    100%    100%    50%     50%
    Shift 4    75%     75%     75%     100%    100%    50%     25%
    The percentages indicate when and to what extent availability is
    required.
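
    One way to read such a scenario is as a set of weights: an outage
    counts for more when it falls in a cell where more availability is
    required. The sketch below encodes the example table and weights
    outage hours accordingly. It is an illustration only, not AVANTO's
    calculation; the six-hour shift length, the function names and the
    weighting rule are assumptions.

```python
# Required availability per shift (rows) and day (columns), as fractions,
# copied from the example business scenario.
SCENARIO = [
    [0.50, 0.50, 1.00, 1.00, 0.75, 0.50, 0.25],  # Shift 1
    [1.00, 1.00, 1.00, 1.00, 1.00, 0.50, 0.50],  # Shift 2
    [1.00, 1.00, 1.00, 1.00, 1.00, 0.50, 0.50],  # Shift 3
    [0.75, 0.75, 0.75, 1.00, 1.00, 0.50, 0.25],  # Shift 4
]
SHIFT_HOURS = 6  # assumption: four equal shifts cover a 24-hour day


def weighted_outage_hours(outages):
    """Sum outage hours, each weighted by the requirement of its cell.

    outages is an iterable of (shift, day, hours) with 1-based indices.
    """
    return sum(SCENARIO[shift - 1][day - 1] * hours
               for shift, day, hours in outages)


def weighted_availability(outages):
    """Availability over one week relative to the weighted demand."""
    demand = sum(sum(row) for row in SCENARIO) * SHIFT_HOURS
    return 1.0 - weighted_outage_hours(outages) / demand


# The same two-hour outage weighs differently depending on when it happens.
print(weighted_outage_hours([(1, 7, 2.0)]))  # Shift 1, Day 7: 25% required
print(weighted_outage_hours([(2, 3, 2.0)]))  # Shift 2, Day 3: 100% required
```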
    
    AVANTO enables a business scenario to be created for each part of the
    configuration. This means, for example, that a system's ideal service
    mix can be modelled.
    
    Cost of downtime
    If known, the costs of the downtime can now be entered. In several
    cases it is possible to establish which costs are associated with the
    failure of the information system. This can also be entered into AVANTO
    as part of the business scenario. If the costs of downtime cannot be
    quantified, AVANTO will express the costs of downtime in a number of
    points per hour. The customer can then call upon the assistance of his
    financial department to convert this into the costs of downtime.
    Various risk-analysis techniques are available for determining the
    costs of downtime.
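
    The arithmetic behind a cost-of-downtime figure can be sketched as
    expected downtime hours multiplied by a cost per hour, where the cost
    may be money or abstract points. The function name and the example
    values (99.9% availability, 10 points per hour) are assumptions for
    illustration; how AVANTO itself derives the figure is not described
    here.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_downtime_cost(availability, cost_per_hour, hours=HOURS_PER_YEAR):
    """Expected cost of downtime over the given number of hours.

    cost_per_hour may be in currency or in points per hour; the result
    is in the same unit.
    """
    return (1.0 - availability) * hours * cost_per_hour


# Hypothetical system: 99.9% available, downtime rated at 10 points/hour.
points = expected_downtime_cost(0.999, 10)
print(f"expected downtime cost: {points:.1f} points per year")
```

    A finance department could later replace the points with a currency
    amount per point without changing the rest of the calculation.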
    
    
    Redundancy
    
    Figure 4  Redundancy model (figures not included)
    The diagram on the left displays the topology of a simple hardware
    configuration. This diagram shows that components A, B, D, E and F must
    function correctly for the operation of the entire chain. If component
    A fails, the chain will be broken and the application will come to a
    standstill. If component B1 fails then B2 will assume functionality.
    Here, component B has been implemented redundantly: B2 will take over
    the function of B1, automatically or via a manual procedure, and vice
    versa. It will be clear that the entire chain is more likely to fail
    as a consequence of component A than as a consequence of component B,
    especially if A and B are equally reliable. The reliability of the
    entire chain will decrease as
    more non-redundant components are included in the chain.
    For configurations as in Figure 1, the number of elements in the chain
    can easily rise to many dozens.
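
    The effect described above can be sketched with the textbook rules
    for combining availabilities: components in a chain multiply, and a
    redundant pair fails only if both members fail. This assumes that
    failures are independent and that switching to the spare is instant;
    the 99% figure and the function names are illustrative and are not
    taken from AVANTO.

```python
from math import prod


def series(*availabilities):
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def parallel(*availabilities):
    """Availability of a redundant group: it fails only if all members fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


a = 0.99  # hypothetical availability of each single component

# Chain A, B, D, E, F with no redundancy at all.
chain_plain = series(a, a, a, a, a)

# Same chain, but B is implemented as the redundant pair B1/B2.
chain_redundant = series(a, parallel(a, a), a, a, a)

print(f"no redundancy:   {chain_plain:.4%}")
print(f"B made redundant: {chain_redundant:.4%}")
```

    Adding further non-redundant components multiplies in more factors
    below 1, which is why the availability of a long chain drops.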
    AVANTO also offers facilities to take account of the effects of
    activating redundancies and then switching back to the normal
    situation. If a redundancy measure is to be activated, it will often
    have consequences for the performance during the switching time. These
    effects of switching the functions on and off (invocation and
    devocation) are also included within AVANTO when determining the
    eventual availability and costs of downtime. When calculating the
    average availability, AVANTO will make use of MTBF and MTTR data of all
    components within the given configuration. The calculation normally
    takes place during a simulation period of twenty years and the result
    of the calculation represents an average expectancy. This means the
    calculated average availability will be realized in 95% of the cases.
    AVANTO does not take account of unscheduled failure as a consequence of
    human actions and other completely arbitrary factors. Nor is it
    possible to express the quality of the management organization as a
    figure. It is, however, possible to incorporate operational
    characteristics of the management organization in AVANTO. For example,
    the average throughput time at the help-desk and similar types of data.
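
    A simulation of this general kind can be sketched as a simple Monte
    Carlo model of one component over twenty years. This is a generic
    illustration and not AVANTO's algorithm: it assumes exponentially
    distributed times to failure and repair, the MTBF and MTTR values are
    hypothetical, and reporting the 5th percentile is only one possible
    reading of a result that is "realized in 95% of the cases".

```python
import random

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def simulate_availability(mtbf, mttr, years=20, rng=None):
    """Simulate alternating up and down periods for one component."""
    if rng is None:
        rng = random.Random()
    horizon = years * HOURS_PER_YEAR
    t = 0.0
    downtime = 0.0
    while True:
        t += rng.expovariate(1.0 / mtbf)       # run until the next failure
        if t >= horizon:
            break
        repair = rng.expovariate(1.0 / mttr)   # time needed for the repair
        downtime += min(repair, horizon - t)   # do not count past the horizon
        t += repair
    return 1.0 - downtime / horizon


def run_many(mtbf, mttr, runs=200, seed=1):
    """Return (mean, 5th percentile) of the simulated availability."""
    rng = random.Random(seed)
    results = sorted(simulate_availability(mtbf, mttr, rng=rng)
                     for _ in range(runs))
    mean = sum(results) / runs
    fifth_percentile = results[int(0.05 * runs)]
    return mean, fifth_percentile


mean, p5 = run_many(mtbf=10_000, mttr=8)
print(f"mean availability:  {mean:.5%}")
print(f"reached in ~95% of runs: {p5:.5%}")
```

    The mean of many runs should be close to MTBF / (MTBF + MTTR); the
    spread between runs shows how much a single twenty-year history can
    deviate from that average.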
    
    AVANTO can be used to model many hundreds of components. Each component
    can then have three redundancies.
    Levels of maintenance service
    
    Figure 5 (figure not included)
    
    AVANTO can be used to calculate which form of maintenance agreement
    best suits the business scenario of the particular customer. A wide
    coverage in the maintenance agreement will not necessarily benefit
    availability. In other words, the yield per extra Dollar spent on
    maintenance decreases as the coverage increases. In many cases in which
    the daytime availability must be 100%, but can be less at other times,
    it is sufficient to provide less coverage in the maintenance agreement.
    AVANTO can also calculate the optimum form of maintenance. A graph can
    be used to illustrate that an effect develops in which the added yield
    actually decreases. In this way, the customer can determine precisely
    where the optimum lies in respect of coverage in his maintenance
    agreement. The above graph in Figure 5 clearly indicates that there is
    no point for the customer to switch to a service contract providing
    more coverage than seven days at sixteen hours a day and a response
    time of four hours (7 x 16 + 4). A more expensive maintenance contract
    does not provide extra cost reductions for the downtime.
    If the customer in this example enters a maintenance agreement of six
    days a week at ten hours a day and a response time of four hours, then
    each extra guilder spent on maintenance will result in a 190 guilder
    reduction of the downtime costs.
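
    The diminishing yield can be made concrete by comparing contract
    levels pairwise: the downtime cost saved divided by the extra money
    spent. The sketch below uses invented figures (they are not the data
    behind Figure 5) chosen only so that the first step reproduces a
    ratio of 190 and the last step adds nothing; the "5 x 8 + 4" and
    "7 x 24 + 4" levels and the function name are hypothetical.

```python
def marginal_yields(options):
    """Downtime cost saved per extra unit spent, for each step upwards.

    options is a list of (name, annual_contract_cost,
    expected_annual_downtime_cost), sorted by increasing contract cost.
    Returns a list of (from_name, to_name, yield).
    """
    steps = []
    for (name0, cost0, down0), (name1, cost1, down1) in zip(options, options[1:]):
        extra_spend = cost1 - cost0
        saved = down0 - down1
        steps.append((name0, name1, saved / extra_spend))
    return steps


# Hypothetical figures in guilders: (coverage, contract cost, downtime cost).
OPTIONS = [
    ("5 x 8 + 4", 10_000, 500_000),
    ("6 x 10 + 4", 12_000, 120_000),
    ("7 x 16 + 4", 18_000, 60_000),
    ("7 x 24 + 4", 30_000, 60_000),
]

for lower, higher, ratio in marginal_yields(OPTIONS):
    print(f"{lower} -> {higher}: {ratio:.0f} guilders saved per extra guilder")
```

    The optimum lies at the last step where the yield is still above 1;
    beyond that, extra coverage costs more than the downtime it avoids.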
    
    Environmental factors
    
    A large number of parameters relating to environmental factors can be
    entered in AVANTO. This particularly concerns aspects relating to the
    quality of the power supply and air conditioning, no-breaks and
    possibly even diesel generators. AVANTO sees a diesel generator as a
    redundancy measure. For aspects like no-breaks it can now be clearly
    justified whether the investment provides sufficient yield. After all,
    the downtime costs will fall as a consequence of deploying a no-break.
    
    Performance aspects
    
    An important factor when establishing availability is the performance.
    It can be argued that there will be no downtime if one user can still
    work. In reality a relation exists between performance and
    availability. AVANTO will also take account of this providing it has
    been set correctly.
    Consultants using AVANTO must very clearly understand which performance
    effects arise when redundancy measures are introduced. The loss of part
    of the configuration, such as, for example, memory may also affect
    performance. Assume that when a redundant disk comes into operation the
    database performance temporarily drops by 25%. In that case, AVANTO
    will decrease availability accordingly and include this information in
    the final result of the calculation. So when installing AVANTO, we must
    have a thorough knowledge of how the system elements and components
    operate.
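
    The link between performance and availability can be sketched by
    counting each period of degraded operation as partial availability.
    This is an illustration under assumptions, not AVANTO's formula: the
    proportional weighting, the function name and the example week (six
    hours at 75% performance, two hours down) are all invented.

```python
def effective_availability(intervals):
    """Performance-weighted availability.

    intervals is a list of (hours, performance) pairs, where performance
    is the fraction of normal capacity delivered (1.0 = normal, 0.0 = down).
    """
    total_hours = sum(hours for hours, _ in intervals)
    delivered = sum(hours * performance for hours, performance in intervals)
    return delivered / total_hours


# Hypothetical week of 168 hours.
week = [
    (160, 1.00),  # normal operation
    (6, 0.75),    # redundant disk in operation, performance down by 25%
    (2, 0.00),    # system down
]
print(f"effective availability: {effective_availability(week):.3%}")
```

    Counting only the two hours of full outage would give a higher
    figure; the weighted number also reflects the degraded period.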
    
    Software and applications
    
    There are, of course, very limited possibilities for including
    quantitative data in AVANTO on the reliability of system software and
    applications. In recent years, increasing numbers of figures have been
    made available, such as Mean Time Between Crash (MTBC). However, these
    figures may differ considerably from one situation to the next. It is
    also practically impossible to take account of such aspects as
    programming errors and software bugs. When conducting an Availability
    Review, the investigation will focus primarily on the total management
    environment and the software and application management. Attention is
    paid in particular here to the procedures for reporting problems in the
    software, the applications, the help-desk organization and the second
    and third-line support. A well-designed infrastructure for
    problem-solving will make a considerably positive contribution to the
    availability of the applications.
    In several Availability Reviews, the investigation focuses on the
    availability of a certain application in the infrastructure. When
    setting up the AVANTO model and the Fault Tree analysis for this,
    account is clearly taken of the fact that the application in question
    uses only part of the hardware. In this way, AVANTO can also be used to
    create models for a certain group of users.
    
     
    System Health Check
    
    An equally important part of the Availability Review Service is the use
    of the System Health Check (SHC). Although mentioned several times
    above, the intrinsic (hardware) availability is not the only aspect
    that is important for the availability of an information system. The
    way in which the system management is exercised is particularly
    important for the eventual result. During an Availability Review, the
    Digital consultant conducts an SHC on all the systems involved in the
    investigation. This involves scrutinizing a large number of aspects in
    the areas of:
    - Security
    - Performance
    - Capacity utilization and occupied space
    - etc.
    
    Hundreds of checks are carried out in an SHC by the specially developed
    software. The result is that a detailed fingerprint of the management
    is obtained.
    The consultant carrying out the Availability Review Service indicates
    all the bottlenecks in his report and gives concise advice on how
    certain matters might be solved. Taking note of this advice will make a
    clear contribution to improving the entire system performance in all
    the fields investigated. Please refer to the brochure on SHC for more
    information in this respect.
    
    Supplementary investigation
    
    The above description sets out a clear picture of the availability an
    IT organization can offer. An Availability Review that has been
    conducted exhaustively uses questionnaires to obtain even better
    insight into the organization of the system and network management.
    Some of the additional aspects to be scrutinized are physical security,
    reporting, change management, problem management and other aspects that
    are allied to availability management.
    
    Conclusion
    
    The previous sections indicate the facilities available for using
    AVANTO to model availability. This document also indicates general
    aspects of the power of the Availability Review Service in combination
    with AVANTO. An Availability Review was conducted at an Australian
    bank. This involved charting the availability of a network of cash
    dispensers and particularly the availability at particular points of
    issue. Finally, AVANTO was used to formulate advice for improving the
    availability of certain points at the lowest possible costs. The method
    described for this is universally applicable and not limited to Digital
    hardware and Digital users.
    It almost goes without saying that this method in combination with
    AVANTO is completely unique within the IT sector. There is no other
    application like AVANTO.
    
    Reports are made to the customer by means of a management summary with
    recommendations, a detailed report containing the background
    information and all the detailed information from AVANTO and the System
    Health Check. All this is complemented with the results of the
    supplementary investigation.
    
    The Availability Review Service is applied to situations such as those
    set out below:
    - The IT management must guarantee a certain availability and looks for
      facilities to realize this.
    - There is uncertainty about the availability that can be offered with
      the existing infrastructure.
    - A system is to be expanded and an investigation is to determine what
      effects this may have on the availability.
    - A new application system is to be configured with a view to a
      particular availability.
    - The availability demands are approaching one hundred percent.
    - For ITIL implementations and determining Service Levels.
    
    Even when conducting an Availability Review in combination with the
    application of AVANTO, it will be possible to configure an application
    system so that a predetermined objective relating to availability can
    also be realized. In practice, systems have already been designed
    that exhibit, on average, less than four hours of (intrinsic)
    downtime per year. Of course, the management organization must again
    comply with the high quality requirements as, for example, described in
    ITIL.
    
    Digital's Multivendor Customer Services in the Netherlands are ISO 9001
    certified.
    
    
	Introduction
    Today, the availability of information systems is as natural as that of
    telephone and electricity facilities. Without information systems the
    greater part of today's economic activities would come to a standstill.
    Increasing numbers of business managers are aware of this and are
    taking measures to help safeguard the availability of the information
    supply. Government also plays a role here with, among other things, its
    publication of the Code for information security. One of the
    objectives of this Code is the promotion of business confidence. This
    objective clearly indicates the relation between economic activity on
    the one hand and supply of information using automated systems on the
    other. Indeed, many companies and government organizations rely
    entirely on information systems for their operations. In everyday
    situations however, very little account is unfortunately taken of the
    availability requirements that have to be imposed when developing and
    configuring new application systems. Availability management, if
    already deployed explicitly, is almost always limited in practice to
    subsequent measurements and adjustments where this is possible.
    The computer industry has been successful in developing fault-tolerant
    systems for highly critical applications, which systems can operate
    practically without unscheduled interruptions. Fault-tolerant systems
    are frequently deployed particularly by financial institutions and
    logistics organizations. However, these systems are relatively
    expensive and the alternatives are limited.
    In addition, for many years there has been a trend among suppliers to
    design normal systems that can offer very high availability. Increasing
    numbers of suppliers are also providing lifelong guarantees for certain
    components. Several years ago personal computers used to break down
    with frightening regularity, but today, we anticipate that the
    technical life span of our PCs will far outstrip their economic life
    expectancy. There are also numerous possibilities for incorporating
    redundancy into computer configurations and the use of RAID technology
    is also applied with increasing frequency.
    There is on the one hand a greater need for information systems
    configured to provide high to very high availability, while on the
    other suppliers are offering more and more facilities for building
    reliable systems. However, the question that information managers are
    having to answer with increasing regularity is: How can I determine in
    advance what kind of availability I should offer my users? This is
    partly attributable to the development in which functional/business
    management determines the conditions to which the supply of information
    must comply. And this, of course, at the lowest possible cost.
    This is complicated by the fact that until recently there was no
    effective way to determine the availability of a complex information
    system before it was actually implemented. In addition to the many
    organizational aspects, the book titled Availability Management from
    the ITIL series only ventures on a mathematical approach to this
    problem in Appendix B.
    In practice, when establishing the availability figures in Service
    Level Agreements in advance, we usually take an arbitrary approach.
    Based on experience, the inclusion of a considerable dose of
    redundancy, negotiations and a sound maintenance agreement, it is
    thought that a certain guarantee can be provided to the users. Practice
    must then show whether the agreed availability can be achieved
    and how adjustments can be made if the availability is
    inadequate. This situation is far from ideal since adjustments are
    almost always associated with a large number of frustrations,
    unexpected and often high costs. If this relates to strategic
    applications, it is an unacceptable and also unnecessary state of
    affairs. Evidently the demand for a well-established foundation for the
    availability to be expected will become increasingly important.
    The solution, however, is not simply a well-designed configuration. We
    only need read a few publications on this subject. The reasons for
    unscheduled failure of information systems can be significantly
    attributed to aspects such as environmental factors, service,
    management, the applications and the network. We will therefore need to
    take a holistic approach if we wish to find out more about the subject
    of availability and not simply take the hardware into account.
    It is not without reason that ITIL has become very popular. If we look
    at management models described in ITIL, it will become clear that the
    success or failure of availability depends on the entire system of
    measures, methods and procedures. All this must be complemented with
    the clear safeguarding of quality and security measures. Indeed, good
    security is the wall surrounding our availability. If we examine the
    situation closely, the only functions of the management organization
    are making and keeping available the applications necessary for the
    business.
    A hardware configuration is at the basis of a high availability, and
    this hardware can offer a certain availability. This basic availability
    is called the intrinsic availability of a system. Intrinsic
    availability is always better than or the same as the actual
    availability to be realized for the user.
    A system or infrastructure will only be able to approach its intrinsic
    or nominal availability if the management and all other external
    factors are at an optimum. According to ITIL, these external factors
    are the environment, the software, the network, the applications and
    the management of the entire system.
    Investigations conducted during the mid-nineteen-eighties demonstrated
    that hardware is accountable for only about twenty percent of all
    instances of non-availability. This only holds true if we look at all
    the causes of system failure. The greatest threat to the organization,
    however, is based on unscheduled failure. Most of this unscheduled
    failure is clearly attributable to the hardware, the service, the
    network and the environmental factors. If they are well-organized and
    tested, the management, the applications and the software will cause
    far fewer unscheduled failures.
    This is why, when we design a new information system, we must pay great
    attention to the topology of the hardware configuration. It is not
    difficult to imagine that the failure of one single hard disk may have
    serious consequences for a very large database. An important role is
    played here not merely by the repair or replacement of the faulty unit
    but also by the time required to restore the database.
    Moreover, increasing numbers of systems are becoming part of a much
    wider infrastructure that is closely linked to the systems of other
    organizations. These may include, for example, systems for EDI,
    logistics, telephone sales and electronic payment transactions. But
    these may also include real-time applications for telecoms companies
    and within the chemical industry. The consequences of these kinds of
    computer systems being unavailable are catastrophic. The failure of
    information systems deployed in the dealing-rooms of banks may have
    disastrous consequences for the entire banking organization and extend
    far beyond the limits of that company alone. The reservation systems of
    airline companies are a case in point. In certain cases, we will need
    to design systems so that downtime will be limited to a few hours each
    year. In one situation, it was demanded that the application may never
    fail even in the event of a disaster. These very high availability
    demands are without exception prompted by the commercial importance of
    the application. We also know that as the need for availability
    increases, so the costs will increase exponentially.
    But how do we design a configuration to comply with such high
    availability requirements? Or, what must we do when we have to make a
    very critical application operational on an existing platform? Which
    measures must be met to set up the management so that we can continue
    to meet the demands that have been set? We will be able to find the
    answers to these questions using Digital's Availability Review,
    Partnership Services and with the aid of AVANTO.
    
    
    
    
4424.11Think Terabytes. Video-on-demand, Commercial DBs, etc.ATLANT::SCHMIDTSee http://atlant2.zko.dec.com/Fri Feb 16 1996 10:1824
  Even very high MTBF numbers are not meaningless.

  It is true for mere humans like you and me, who buy and use our
  disks one or two at a time, that an MTBF of 100K hours (11.4 years)
  isn't meaningfully different than an MTBF of 800K hours (91.3 years).
  After all, the disk will be obsolete in one or two years and truly
  ridiculous in five or ten years.

  But for people who assemble huge storage arrays of hundreds or
  thousands of disks, the law of large numbers starts to come into
  play. If these disks really have a uniform failure rate throughout
  their lives, then with a hundred disks, that 100K drive array starts
  to fail every 1000 hours (42 days). And in a thousand-disk array, a
  failure occurs every 4.2 days. They'd better be using RAID! But
  RAID requires more disks, and that means more failures! Yipes!

  If, on the other hand, they buy that disk that runs 800K hours on
  average, then the hundred disk array runs an average of 8000 hours
  (333 days) between failures and even the thousand-disk array runs
  800 hours (33.3 days). RAID would still be nice-to-have, but or-
  dinary backup-to-tape might still be a sufficiently practical
  strategy.
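
  The arithmetic above can be reproduced with a short sketch. It
  assumes what the note assumes: independent disks with a uniform
  (constant) failure rate, so the array sees a failure on average
  every MTBF / N hours. The function name is made up for this
  illustration.

```python
HOURS_PER_DAY = 24


def array_failure_interval_hours(disk_mtbf_hours: float, disk_count: int) -> float:
    """Mean hours between failures anywhere in the array.

    Assumes independent disks with a constant failure rate, so the
    per-disk failure rates simply add up across the array.
    """
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return disk_mtbf_hours / disk_count


for mtbf in (100_000, 800_000):
    for disks in (1, 100, 1000):
        hours = array_failure_interval_hours(mtbf, disks)
        days = hours / HOURS_PER_DAY
        print(f"{mtbf} h MTBF, {disks} disks: {hours:.0f} h ({days:.1f} days)")
```

  Wear-out and redundancy (RAID, shadowing) are deliberately left
  out of this simple model.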

                                   Atlant
4424.12Yes, but get in touch with the people who knowUTROP1::KOOIJMANLIFE IS HELL THEN YOU DIEFri Feb 16 1996 10:5632
    
    Yes,
    
    Yes, yes, yes you are right.
    But when a customer wants to know what level of availability he can
    guarantee to his users you will need AVANTO to give him the right answer. 
    With normal common sense and a pocket calculator you will not be able 
    to solve such complex problems as you describe them. And it is not
    true that 10 disks of 800k MTBF each give 800k divided by ten, or 80k,
    MTBF. That is only true if all disks are used by the same database and
    application and are not redundant and and and.
    We have utilised AVANTO many many times in situations where we had to
    answer questions from customers like "what do I have to do in order 
    to have no more than 16 hours of downtime average per year?"
    Would you recommend RAID or volume shadowing and/or clustering?
    We have even designed systems that will never be down, even in case of
    disaster. We have done this for customers with 8 node clusters and 150
    Gbyte of disk. 
    So once again, contact Dave Varner and Ron Rocheleau and don't start
    a debate here about the real value of MTBF and MTTR. As far as I'm
    concerned only a few of us are qualified and the best one is Ron. 
    Use AVANTO with the customer and see for yourself what a great 
    thing we have. We have done it many times and customers are paying us
    big bundles of money to get the real Availability answers.
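
    To put a downtime target such as the 16 hours per year mentioned
    above into the usual percentage form, a minimal sketch follows. It
    only converts between a downtime budget and an availability figure
    over a 365-day year; it is not a model of AVANTO, and the function
    names are invented for this example.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def availability_from_downtime(downtime_hours_per_year: float) -> float:
    """Availability fraction implied by an average yearly downtime budget."""
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and 8760 hours")
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def downtime_from_availability(availability: float) -> float:
    """Average yearly downtime in hours allowed by an availability fraction."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


target = availability_from_downtime(16)
print(f"16 h/year of downtime is about {target * 100:.2f}% availability")
```

    Which design meets such a target still depends on redundancy,
    repair times and everything around the hardware.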
    
    Regards,
    
    
    Aad Kooijman.
    Business manager High Availability Services in Holland.
    So I'm not very very technical
                                  
4424.13ATLANT::SCHMIDTSee http://atlant2.zko.dec.com/Fri Feb 16 1996 12:2420
Aad:

  You're "preaching to the choir". I wasn't arguing about the use-
  fulness of a complete model. I worked for both Field Service and
  CSSE, and models were our life! We knew how to get whatever
  answer we wanted from our models. :-)

  What I was debating was the statement that such large MTBFs are
  meaningless. They're definitely not, at least not if you have
  enough disks (for example) that the law of large numbers starts
  to apply. Ten disks? You're right -- the calculation isn't
  simply MTBF/10. But a hundred disks? Maybe. And a thousand
  disks? Probably.

  And yes, disks have wear-out mechanisms as well as other sources
  of failures. But I was trying to draw a simple illustration and
  not clutter it up too much with details. And the original note
  that talked about meaningless MTBFs had used disks as an example,
  so I followed suit.
                                   Atlant
4424.14Is MTBF needed?MRKTNG::VICKERSFri Feb 16 1996 12:3328
    Re: MTBF vs. reliability/availability/maintainability - there are some
    really good replies in this notes stream, and some very valid positions.
    Unfortunately, our customers (who emerge from the great unwashed,
    uneducated masses) still ask for, and in some cases demand, equipment 
    MTBF as part of Digital's response to RFQs.  Try as one may, they can
    not be educated or coerced away from this position - in fact, they
    sometimes take on the "if you won't tell me, what are you hiding"
    attitude about the subject.
    
    Interestingly enough, most don't specify or care about the method used 
    to generate the number (DoD and other U.S. agencies being the 
    exception - MIL 217 only please), they just want the number and by not 
    providing it Digital risks being declared non-compliant in their proposal.  
    
    Also, I have never known a customer to come back with MTBF data and 
    say, " Oh, by the way, the equipment you sold me didn't meet the MTBF 
    you specified.  I think you should compensate me."  I have had them 
    cite me chapter and verse from the IBM/HP/Sun book as to why "my" 
    equipment was substantially inferior to the competing product in 
    terms of "calculated" MTBF.  Then the discussions get really mundane.
    
    My .02 worth 
    
    	Bill
    
         
    
    
4424.15It's the way they've always done itBBPBV1::WALLACEUNIX is digital. Use Digital UNIX.Sat Feb 17 1996 08:1710
    Bill gets my vote. No numbers, no sale, in much of the OEM market I
    support. It doesn't matter if the numbers are meaningless, it doesn't
    matter that some of the OEM customers and/or their end users can't tell
    the difference between availability and reliability, it just matters
    whether they can drive their (?ISO9000?) quality process which says
    they have to crank the MTBF handle on a spreadsheet and come up with
    The Answer.
    
    regards
    john
4424.16The next point in our debateUTROP1::KOOIJMANLIFE IS HELL THEN YOU DIEMon Feb 19 1996 03:0033
    
    
    Hi guys,
    
    If your customer wants/needs MTTR and MTBF, give it to them. I do not argue
    with that. Just make sure you get a non-disclosure.
    I only want to point out that:
                       
    1. These figures do not tell the whole availability story.
    2. Digital has great services and applications to help our customers
       determine the 'real' availability. 
    3. We have been very successful selling these services in Holland.
    4. We can, by using these services, make IBM and HP look stupid.
    5. By positioning our High Availability services we have a unique
       selling point.
    6. We generated a million worth of NOR with these services within a
       year. Especially in the OEM market and with partners. We can help
       partners to design systems that will meet their Availability specs.
       The account managers love it because we see a lot of product sales
       as a result of these services.
    7. Digital has expertise that is second to none that you might want to use.
    8. The Availability Analysis Tool (AVANTO) is just great and we have
       used it hundreds of times. One of our ABU accounts bought k$ 600 worth
       of hardware and software as a result of an AVANTO exercise. Just to
       improve the availability of one of his VAX clusters.
    
    
    Best regards,
    
    
    Aad Kooijman.
    
	
4424.17Remember what MTBF means...ADOV01::MANUELOver the Horizon....Mon Feb 26 1996 08:5211
    And just remember that MTBF is "mean time between failures", this
    statistical number is just that - the time between successive failures
    of the same piece of equipment. Hopefully with our latest technology you
    or your customer or the equipment will not be around to argue whether 
    the second failure exceeded the MTBF.
    
    Just replaced an RZ23 in my vaxstation after the first failure - I've
    had this faithful beasty for about 7 years, the MTBF timer started at
    13:00 today....
    
    Steve.
4424.18LILCPX::THELLENRon Thellen, DTN 522-2952Thu Oct 31 1996 10:3224
4424.19try the new call centerTROOA::MSCHNEIDERNothing witty to sayThu Oct 31 1996 12:473
4424.20Nearest ResellerSTOWOA::BLANCHARDThu Oct 31 1996 13:304