| Gentlemen,
MTTR and MTBF figures are very interesting but are almost meaningless
today. For some disks the MTBF we specify is 800,000 hours! (Do you believe it?)
Also, we must have corporate approval to give them to customers.
Digital has developed a couple of services called High Availability
Services under the leadership of Dave Varner @OGO. He is the corporate
business manager for this. Engineering for the AVANTO application is
located in Shrewsbury Mass. The engineer for AVANTO is Ron Rocheleau @SHR.
In Holland we have made availability models for hundreds of systems
during the past year and we have earned a lot of revenue with them. This
is a unique capability. A short write-up of what we can do is included.
Please contact Dave Varner @OGO for more information. This is the best
in the IT industry. With Availability Review and Partnership services
we have taken out the competition many times in Holland so please
involve Dave.
In its simplest form AVANTO can be used to produce hardware
availability figures in a very professional way.
Best Regards,
Aad Kooijman @ UTO (The Netherlands, which is over in Europe)
Business manager High Availability Services
AVANTO
Over the past four years, Digital's Multivendor Customer Services
organization has developed and frequently applied an availability
analysis application. This application is called AVANTO (AVailability
ANalysis TOol) and has been used successfully in practice to determine
the availability of many hundreds of configurations. It has also been
empirically established that the predicted results are actually
realized in 95% of all cases. This means a unique tool is now available
to IT managers. Of course, this leaves aside other factors such as
the management organization, the applications, etc. These aspects
(domains) are covered in an Availability Review Service conducted
by Digital. So how do things proceed when using AVANTO?
The essence of AVANTO is that it enables a system to be designed so
that the anticipated availability corresponds to the demands the
business makes of it. AVANTO is frequently used when configuring new
application systems. It enables the availability of very complex
systems to be modeled, so that it can be determined beforehand how the
availability demands can be met instead of working in an arbitrary way.
In the simplest of applications it is possible to calculate the average
availability of a system's hardware by using MTTR and MTBF data.
However, as stated above, this will result in an incomplete picture as
many other aspects will codetermine the availability in practice. One
example is the quality of the environment in which the equipment has
been installed. Also very important is the organization of the helpdesk
and the underlying second and third-line support of the various
suppliers.
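As a minimal sketch of this simplest case, the steady-state hardware
availability of a single component is commonly taken as
MTBF / (MTBF + MTTR). The function names and the figures below are
illustrative assumptions and do not come from AVANTO itself.

```python
HOURS_PER_YEAR = 8760.0  # assumption: a non-leap year of continuous operation


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Average hours of downtime per year implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical component: fails every 10,000 hours on average, 8 hours to repair.
a = availability(10_000, 8)
print(f"availability: {a:.5f}")
print(f"expected downtime: {downtime_hours_per_year(a):.1f} hours per year")
```

Such a figure covers the hardware only; the other domains discussed in
this document are not reflected in it.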
When establishing the potential availability of an existing system, it
will be necessary to investigate how the management, the environment,
the software and the other domains have been set up. Besides the
hardware, all these domains influence the level of availability. By
incorporating parameters and setting up a business scenario, AVANTO can
show availability as a function of the business requirements.
Digital's Availability Review and Partnership Services are also
conducted with the aid of AVANTO. An existing situation is scrutinized
in an Availability Review and a very detailed investigation determines
what can be done to improve availability management. Alternative
situations can also be modelled. It is not difficult to imagine that
this approach is far preferable to one in which measures are
taken more or less by guesswork and their effects have to be
measured afterwards. Moreover, the costs incurred when
improving availability in retrospect are generally much higher than
those associated with conducting an analysis in advance. AVANTO is now
used routinely by a large number of organizations within Change and
Availability Management to determine the availability effects
of scheduled configuration changes in advance.
Availability investigation
Let's assume that an IT organization with a given infrastructure wants
to determine what availability can be offered to its users, or that the
existing availability has to be increased. Digital Multivendor Customer
Services can provide answers to these questions based on investigation
and with the aid of AVANTO.
So how does such an investigation proceed?
At the start, the customer is consulted to establish which areas
(domains) are to be involved in the investigation. If a decision is
taken to limit the investigation to the hardware configuration, the
result will be of limited value. In accordance with the information
provided by ITIL (Information Technology Infrastructure Library),
there are five other domains besides the hardware
that must be involved in an Availability Review:
1. The environment, including climatic control, power supply, service
contracts, etc.
2. The system management, the organization, the procedures and the
system configuration
3. The network
4. The system software
5. The applications and the application management.
Only when all these domains have been exhaustively charted will it be
possible to determine what availability can be offered. If it is found
that the availability to be offered is inadequate, alternative
scenarios can be formulated with the aid of AVANTO. An Availability
Review provides a complete picture of all the aspects of availability
management.
AVANTO business model
When carrying out an Availability Review Service it will be established
what availability can be offered with an existing configuration or a
new one yet to be installed. Furthermore, extensive modelling
activities are also possible using AVANTO. Based on an initial AVANTO
model, the reference model, we can modify the redundancies and/or the
service contracts to determine the costs at which the desired
availability can be realized.
First, however, it will be necessary to indicate when a
certain availability is to be offered. For example, a central system
with an important database may require 100% availability during the day
while this level will also be required for the back-up equipment during
the evening and night. It may also be that not all the hardware is
important for a particular application while a different application
does require all the hardware in the system in order to operate.
Charting these aspects is called setting up the business model or
business scenario in AVANTO. This business scenario is entered into
AVANTO, which always displays the availability as a function of the
business model.
Example of a simple business scenario
          Day 1  Day 2  Day 3  Day 4  Day 5  Day 6  Day 7
Shift 1    50%    50%   100%   100%    75%    50%    25%
Shift 2   100%   100%   100%   100%   100%    50%    50%
Shift 3   100%   100%   100%   100%   100%    50%    50%
Shift 4    75%    75%    75%   100%   100%    50%    25%
The percentages indicate when and to what extent availability is
required.
AVANTO enables a business scenario to be created for each part of the
configuration. This means, for example, that a system's ideal service
mix can be modelled.
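The role of the business scenario can be sketched as follows. The sketch
assumes that each shift lasts six hours and that downtime is simply
weighted by the percentage required in the shift in which it occurs. The
source does not describe how AVANTO itself applies these weights, and
all names are hypothetical.

```python
SHIFT_HOURS = 6.0  # assumption: four equal shifts per day

# Required availability (%) per shift (rows) and day (columns), from the example table.
SCENARIO = [
    [50, 50, 100, 100, 75, 50, 25],      # Shift 1
    [100, 100, 100, 100, 100, 50, 50],   # Shift 2
    [100, 100, 100, 100, 100, 50, 50],   # Shift 3
    [75, 75, 75, 100, 100, 50, 25],      # Shift 4
]


def weighted_availability(scenario, downtime_hours):
    """Availability over one week, weighting each shift by its requirement.

    downtime_hours has the same shape as scenario and holds the hours of
    outage in each shift.
    """
    required = 0.0
    lost = 0.0
    for weights, outages in zip(scenario, downtime_hours):
        for weight, outage in zip(weights, outages):
            required += weight / 100.0 * SHIFT_HOURS
            lost += weight / 100.0 * min(outage, SHIFT_HOURS)
    return 1.0 - lost / required


def one_outage(shift, day, hours):
    """Downtime grid with a single outage (shift and day are zero-based)."""
    grid = [[0.0] * 7 for _ in range(4)]
    grid[shift][day] = hours
    return grid


# The same one-hour outage weighs more heavily in a 100% shift than in a 25% shift.
print(weighted_availability(SCENARIO, one_outage(1, 0, 1.0)))
print(weighted_availability(SCENARIO, one_outage(0, 6, 1.0)))
```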
Cost of downtime
If known, the costs of the downtime can now be entered. In several
cases it is possible to establish which costs are associated with the
failure of the information system. This can also be entered into AVANTO
as part of the business scenario. If the costs of downtime cannot be
quantified, AVANTO will express the costs of downtime in a number of
points per hour. The customer can then call upon the assistance of his
financial department to convert this into the costs of downtime.
Various risk-analysis techniques are available for determining the
costs of downtime.
Figure 4 Redundancy model (Figures not included)
Redundancy
The diagram on the left displays the topology of a simple hardware
configuration. This diagram shows that components A, B, D, E and F must
function correctly for the operation of the entire chain. If component
A fails, the chain will be broken and the application will come to a
standstill. If component B1 fails, then B2 will take over its function.
Here, component B has been implemented redundantly: B2 takes over the
function of B1, automatically or via a manual procedure, and vice
versa. It will be clear that the likelihood of the entire chain failing
as a consequence of component A is greater than as a consequence of
component B, especially if A and B are equally reliable. The
reliability of the entire chain will decrease as
more non-redundant components are included in the chain.
For configurations as in Figure 1, the number of elements in the chain
can easily rise to many dozens.
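The effect of redundancy on the chain can be illustrated with the
standard series and parallel formulas. This is a sketch that assumes
independent failures and an instantaneous switch-over; the availability
of 0.99 per component is a hypothetical figure, and the source does not
state that AVANTO calculates in this way.

```python
from math import prod


def series(availabilities):
    """All components must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one of the redundant components must work."""
    return 1.0 - prod(1.0 - a for a in availabilities)


A = D = E = F = 0.99   # hypothetical single components
B1 = B2 = 0.99         # hypothetical redundant pair

chain_without_redundancy = series([A, B1, D, E, F])
chain_with_redundancy = series([A, parallel([B1, B2]), D, E, F])

print(f"B single:    {chain_without_redundancy:.5f}")
print(f"B redundant: {chain_with_redundancy:.5f}")
```

Every additional non-redundant component multiplies the result by a
factor below one, which is why long chains lose availability quickly.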
AVANTO also offers facilities to take account of the effects of
activating redundancies and then switching back to the normal
situation. If a redundancy measure is to be activated, it will often
have consequences for the performance during the switching time. These
effects of switching the functions on and off (invocation and
devocation) are also included within AVANTO when determining the
eventual availability and costs of downtime. When calculating the
average availability, AVANTO will make use of MTBF and MTTR data of all
components within the given configuration. The calculation normally
covers a simulated period of twenty years and the result
of the calculation represents an average expectancy. This means the
calculated average availability will be realized in 95% of the cases.
AVANTO does not take account of unscheduled failure as a consequence of
human actions and other completely arbitrary factors. Nor is it
possible to express the quality of the management organization as a
figure. It is, however, possible to incorporate operational
characteristics of the management organization in AVANTO. For example,
the average throughput time at the help-desk and similar types of data.
AVANTO can be used to model many hundreds of components. Each component
can then have three redundancies.
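The idea of simulating a long period can be sketched for a single
component as below. The sketch assumes exponentially distributed times
to failure and to repair and fixed random seeds; these are illustrative
choices, not a description of AVANTO's internal method, and the MTBF and
MTTR values are hypothetical.

```python
import random

HOURS_PER_YEAR = 8760.0


def simulate_availability(mtbf_hours, mttr_hours, years=20, seed=0):
    """Simulate one component over the given period and return its availability."""
    rng = random.Random(seed)
    horizon = years * HOURS_PER_YEAR
    clock = 0.0
    downtime = 0.0
    while True:
        clock += rng.expovariate(1.0 / mtbf_hours)    # run until the next failure
        if clock >= horizon:
            break
        repair = rng.expovariate(1.0 / mttr_hours)    # time needed for this repair
        downtime += min(repair, horizon - clock)      # do not count past the horizon
        clock += repair
    return 1.0 - downtime / horizon


# Repeat the twenty-year simulation to see the average and the spread.
runs = sorted(simulate_availability(1000.0, 10.0, seed=s) for s in range(200))
mean = sum(runs) / len(runs)
print(f"mean availability:   {mean:.4f}")
print(f"lowest of 200 runs:  {runs[0]:.4f}")
print(f"highest of 200 runs: {runs[-1]:.4f}")
```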
Levels of maintenance service
Figure 5
AVANTO can be used to calculate which form of maintenance agreement
best suits the business scenario of the particular customer. Wide
coverage in the maintenance agreement will not necessarily benefit
availability. In other words, the yield per extra dollar spent on
maintenance decreases as the coverage increases. In many cases in which
the daytime availability must be 100%, but can be less at other times,
it is sufficient to provide less coverage in the maintenance agreement.
AVANTO can also calculate the optimum form of maintenance. A graph can
be used to illustrate how the added yield diminishes as coverage
increases. In this way, the customer can determine precisely
where the optimum lies in respect of coverage in his maintenance
agreement. The graph in Figure 5 clearly indicates that there is
no point in the customer switching to a service contract providing
more coverage than seven days at sixteen hours a day and a response
time of four hours (7 x 16 + 4). A more expensive maintenance contract
does not reduce the downtime costs any further.
If the customer in this example enters a maintenance agreement of six
days a week at ten hours a day and a response time of four hours, then
each extra guilder spent on maintenance will result in a 190 guilder
reduction of the downtime costs.
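The reasoning about the optimum can be sketched with a small comparison
of contracts. All fees and downtime costs below are invented for
illustration and are not the values behind Figure 5; the class and
function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    name: str                      # coverage as "days x hours + response time"
    annual_fee: float              # hypothetical maintenance fee per year
    expected_downtime_cost: float  # hypothetical downtime cost per year


CONTRACTS = [
    Contract("5 x 10 + 4", 10_000.0, 300_000.0),
    Contract("6 x 10 + 4", 12_000.0, 200_000.0),
    Contract("7 x 16 + 4", 20_000.0, 40_000.0),
    Contract("7 x 24 + 2", 30_000.0, 39_000.0),
]


def marginal_yields(contracts):
    """Downtime cost saved per extra unit of fee when moving up one contract."""
    ordered = sorted(contracts, key=lambda c: c.annual_fee)
    result = []
    for cheaper, dearer in zip(ordered, ordered[1:]):
        extra_fee = dearer.annual_fee - cheaper.annual_fee
        saved = cheaper.expected_downtime_cost - dearer.expected_downtime_cost
        result.append((cheaper.name, dearer.name, saved / extra_fee))
    return result


def cheapest_overall(contracts):
    """Contract with the lowest sum of fee and expected downtime cost."""
    return min(contracts, key=lambda c: c.annual_fee + c.expected_downtime_cost)


for lower, higher, gain in marginal_yields(CONTRACTS):
    print(f"{lower} -> {higher}: {gain:.1f} saved per extra unit spent")
print("optimum:", cheapest_overall(CONTRACTS).name)
```

A step is worth taking only while the yield stays above one, that is,
while each extra unit spent on maintenance saves more than one unit of
downtime cost.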
Environmental factors
A large number of parameters relating to environmental factors can be
entered in AVANTO. This particularly concerns aspects relating to the
quality of the power supply and air conditioning, no-breaks
(uninterruptible power supplies) and possibly even diesel generators.
AVANTO sees a diesel generator as a redundancy measure. For aspects like
no-breaks it can now be clearly determined whether the investment
provides sufficient yield. After all, the downtime costs will fall as a
consequence of deploying a no-break.
Performance aspects
An important factor when establishing availability is the performance.
It can be argued that there will be no downtime if one user can still
work. In reality a relation exists between performance and
availability. AVANTO will also take account of this, provided it has
been configured correctly.
Consultants using AVANTO must very clearly understand which performance
effects arise when redundancy measures are introduced. The loss of part
of the configuration, such as memory, may also affect
performance. Assume that when a redundant disk comes into operation the
database performance temporarily drops by 25%. In that case, AVANTO
will decrease availability accordingly and include this information in
the final result of the calculation. So when installing AVANTO, we must
have a thorough knowledge of how the system elements and components
operate.
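One simple way to express this relation is to count each hour in
proportion to the performance delivered during it. The sketch below
makes that assumption; the periods are hypothetical and the source does
not specify how AVANTO weights degraded performance.

```python
def effective_availability(periods):
    """periods: iterable of (hours, performance) with performance from 0.0 to 1.0."""
    total_hours = 0.0
    delivered = 0.0
    for hours, performance in periods:
        if not 0.0 <= performance <= 1.0:
            raise ValueError("performance must lie between 0.0 and 1.0")
        total_hours += hours
        delivered += hours * performance
    if total_hours == 0:
        raise ValueError("no time covered")
    return delivered / total_hours


# Hypothetical year: mostly normal operation, 48 hours on the redundant disk
# at 75% database performance, and 12 hours of complete outage.
year = [(8700.0, 1.0), (48.0, 0.75), (12.0, 0.0)]
print(f"{effective_availability(year):.5f}")
```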
Software and applications
There are, of course, very limited possibilities for including
quantitative data in AVANTO on the reliability of system software and
applications. In recent years, increasing numbers of figures have been
made available, such as Mean Time Between Crash (MTBC). However, these
figures may differ considerably from one situation to the next. It is
also practically impossible to take account of such aspects as
programming errors and software bugs. When conducting an Availability
Review, the investigation will focus primarily on the total management
environment and the software and application management. Attention is
paid in particular here to the procedures for reporting problems in the
software, the applications, the help-desk organization and the second
and third-line support. A well-designed infrastructure for
problem-solving will make a considerable positive contribution to the
availability of the applications.
In several Availability Reviews, the investigation focuses on the
availability of a certain application in the infrastructure. When
setting up the AVANTO model and the Fault Tree analysis for this,
account is clearly taken of the fact that the application in question
uses only part of the hardware. In this way, AVANTO can also be used to
create models for a certain group of users.
System Health Check
An equally important part of the Availability Review Service is the use
of the System Health Check (SHC). As mentioned several times
above, the intrinsic (hardware) availability is not the only aspect
that is important for the availability of an information system. The
way in which the system management is exercised is particularly
important for the eventual result. During an Availability Review, the
Digital consultant conducts an SHC on all the systems involved in the
investigation. This involves scrutinizing a large number of aspects in
the areas of:
• Security
• Performance
• Capacity utilization and occupied space
• etc.
Hundreds of checks are carried out in an SHC by the specially developed
software. The result is that a detailed fingerprint of the management
is obtained.
The consultant carrying out the Availability Review Service indicates
all the bottlenecks in his report and gives concise advice on how
certain matters might be solved. Taking note of this advice will make a
clear contribution to improving the entire system performance in all
the fields investigated. Please refer to the brochure on SHC for more
information in this respect.
Supplementary investigation
The above description sets out a clear picture of the availability an
IT organization can offer. An Availability Review that has been
conducted exhaustively uses questionnaires to obtain even better
insight into the organization of the system and network management.
Some of the additional aspects to be scrutinized are physical security,
reporting, change management, problem management and other aspects that
are allied to availability management.
Conclusion
The previous sections indicate the facilities available for using
AVANTO to model availability. This document also indicates general
aspects of the power of the Availability Review Service in combination
with AVANTO. An Availability Review was conducted at an Australian
bank. This involved charting the availability of a network of cash
dispensers and particularly the availability at particular points of
issue. Finally, AVANTO was used to formulate advice for improving the
availability of certain points at the lowest possible costs. The method
described for this is universally applicable and not limited to Digital
hardware and Digital users.
It almost goes without saying that this method in combination with
AVANTO is completely unique within the IT sector. There is no other
application like AVANTO.
Reports are made to the customer by means of a management summary with
recommendations, a detailed report containing the background
information and all the detailed information from AVANTO and the System
Health Check. All this is complemented with the results of the
supplementary investigation.
The Availability Review Service is applied to situations such as those
set out below:
• The IT management must guarantee a certain availability and looks for
facilities to realize this.
• There is uncertainty about the availability that can be offered with
the existing infrastructure.
• A system is to be expanded and an investigation is to determine what
effects this may have on the availability.
• A new application system is to be configured with a view to a
particular availability.
• The availability demands are approaching one hundred percent.
• For ITIL implementations and determining Service Levels.
By conducting an Availability Review in combination with the
application of AVANTO, it is possible to configure an application
system so that a predetermined availability objective can
actually be realized. In practice, designs have already been made of
systems that exhibit, on average, less than four hours of (intrinsic)
downtime per year. Of course, the management organization must again
comply with the high quality requirements as, for example, described in
ITIL.
Digital's Multivendor Customer Services in the Netherlands is ISO 9001
certified.
Introduction
Today, the availability of information systems is as natural as that of
telephone and electricity facilities. Without information systems the
greater part of today's economic activities would come to a standstill.
Increasing numbers of business managers are aware of this and are
taking measures to help safeguard the availability of the information
supply. Government also plays a role here with, among other things, its
publication of the Code for information security. One of the
objectives of this Code is the promotion of business confidence. This
objective clearly indicates the relation between economic activity on
the one hand and supply of information using automated systems on the
other. Indeed, many companies and government organizations rely
entirely on information systems for their operations. In everyday
situations however, very little account is unfortunately taken of the
availability requirements that have to be imposed when developing and
configuring new application systems. Availability management, if
already deployed explicitly, is almost always limited in practice to
subsequent measurements and adjustments where this is possible.
The computer industry has been successful in developing fault-tolerant
systems for highly critical applications, systems that can operate
practically without unscheduled interruptions. Fault-tolerant systems
are frequently deployed particularly by financial institutions and
logistics organizations. However, these systems are relatively
expensive and the alternatives are limited.
In addition, for many years there has been a trend among suppliers to
design normal systems that can offer very high availability. Increasing
numbers of suppliers are also providing lifelong guarantees for certain
components. Several years ago personal computers used to break down
with frightening regularity, but today, we anticipate that the
technical life span of our PCs will far outstrip their economic life
expectancy. There are also numerous possibilities for incorporating
redundancy into computer configurations and the use of RAID technology
is also applied with increasing frequency.
There is on the one hand a greater need for information systems
configured to provide high to very high availability, while on the
other suppliers are offering more and more facilities for building
reliable systems. However, the question that information managers are
having to answer with increasing regularity is: How can I determine in
advance what kind of availability I should offer my users? This is
partly attributable to the development in which functional/business
management determines the conditions to which the supply of information
must comply. And this, of course, at the lowest possible cost.
This is complicated by the fact that until recently there was no
effective way to determine the availability of a complex information
system before it was actually implemented. The book Availability
Management from the ITIL series deals with many organizational aspects
but ventures a mathematical approach to this problem only in
Appendix B.
In practice, when establishing the availability figures in Service
Level Agreements in advance, we usually take an arbitrary approach.
Based on experience, the inclusion of a considerable dose of
redundancy, negotiations and a sound maintenance agreement, it is
thought that a certain guarantee can be provided to the users. Practice
must then show whether the agreed availability can be achieved
and how adjustments can be made if the availability is
inadequate. This situation is far from ideal since adjustments are
almost always associated with a great deal of frustration and with
unexpected and often high costs. If this relates to strategic
applications, it is an unacceptable and also unnecessary state of
affairs. Evidently the demand for a well-established foundation for the
availability to be expected will become increasingly important.
The solution, however, is not simply a well-designed configuration. We
only need read a few publications on this subject. The reasons for
unscheduled failure of information systems can be significantly
attributed to aspects such as environmental factors, service,
management, the applications and the network. We will therefore need to
take a holistic approach if we wish to find out more about the subject
of availability and not simply take the hardware into account.
It is not without reason that ITIL has become very popular. If we look
at management models described in ITIL, it will become clear that the
success or failure of availability depends on the entire system of
measures, methods and procedures. All this must be complemented with
the clear safeguarding of quality and security measures. Indeed, good
security is the wall surrounding our availability. If we examine the
situation closely, the only functions of the management organization
are making and keeping available the applications necessary for the
business.
The hardware configuration is the basis of high availability, and
this hardware can offer a certain availability. This basic availability
is called the intrinsic availability of a system. Intrinsic
availability is always better than or the same as the actual
availability to be realized for the user.
A system or infrastructure will only be able to approach its intrinsic
or nominal availability if the management and all other external
factors are at an optimum. According to ITIL, these external factors
are the environment, the software, the network, the applications and
the management of the entire system.
Investigations conducted during the mid-nineteen-eighties demonstrated
that hardware is accountable for only about twenty percent of all
instances of non-availability. This only holds true if we look at all
the causes of system failure. The greatest threat to the organization,
however, is based on unscheduled failure. Most of this unscheduled
failure is clearly attributable to the hardware, the service, the
network and the environmental factors. If they are well-organized and
tested, the management, the applications and the software will cause
far fewer unscheduled failures.
This is why, when we design a new information system, we must pay great
attention to the topology of the hardware configuration. It is not
difficult to imagine that the failure of one single hard disk may have
serious consequences for a very large database. An important role is
played here not merely by the repair or replacement of the faulty unit
but also by the time required to restore the database.
Moreover, increasing numbers of systems are becoming part of a much
wider infrastructure that is closely linked to the systems of other
organizations. These may include, for example, systems for EDI,
logistics, telephone sales and electronic payment transactions. But
these may also include real-time applications for telecoms companies
and within the chemical industry. The consequences of these kinds of
computer systems being unavailable are catastrophic. The failure of
information systems deployed in the dealing-rooms of banks may have
disastrous consequences for the entire banking organization and extend
far beyond the limits of that company alone. The reservation systems of
airline companies are a case in point. In certain cases, we will need
to design systems so that downtime will be limited to a few hours each
year. In one situation, the demand was that the application must never
fail, even in the event of a disaster. These very high availability
demands are without exception prompted by the commercial importance of
the application. We also know that as the need for availability
increases, so the costs will increase exponentially.
But how do we design a configuration to comply with such high
availability requirements? Or, what must we do when we have to make a
very critical application operational on an existing platform? Which
measures must be taken to set up the management so that we can continue
to meet the demands that have been set? We will be able to find the
answers to these questions using Digital's Availability Review and
Partnership Services, with the aid of AVANTO.
|