My apologies if the note which follows sounds hard edged. I
did not spend the time to phrase it in consideration of those
who have worked hard on this material to get it where it is now,
but I wanted to at least get the first pass of my comments in
and shared before next week's meeting so as not to waste
time there doing what can be done here. So I hope no offense
is taken.
I have an overriding comment which I want to share before going into the
detailed comments that follow on chapters 1-5, which Dave has
forwarded for review. The original book Building Dependable Systems,
on which the context of this course so far is largely based, is very
much developed from the perspective of hardware fault-tolerance as
the means to get the highest levels of availability. This perspective
was very valid in what is now considered traditional computing, i.e.
a single computer system with terminals attached. The current style is
client/server computing, and there the traditional concept falls down.
Using hardware fault-tolerance to solve the problems of the client/server
environment is like putting a 1000 pound link in a 5 pound chain: there
are just so many other places where things fall down. Also, in
client/server the means by which availability is measured is much
different, in that it is measured in terms of service availability and
not machine availability. Things break and machines fail, but if service
is still provided then the "system" is viewed as providing continuous
availability. My comments which follow all proceed from the
perspective that it is possible to take a set of machines which are
not fault-tolerant or highly available in themselves and, through the
use of software, build systems which offer high levels of availability
and fault-tolerance, i.e. distributed fault-tolerant systems.
regards
Gregg
Section: Defining Dependable systems
pg 1-4
Requires the definition of
SYSTEM - One or more physical computers which may be network connected
to provide a service or set of services. In today's client/server
computing environments the system may be a composite of
the network and a number of client and server computers.
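To put rough numbers on why the composite definition matters, here is a
minimal sketch of the usual availability arithmetic. It is my own
illustration, not taken from the course material. It assumes component
failures are independent and that failover is instantaneous, neither of
which holds exactly in practice. The function names and the figures used
are made up for the example.

```python
def parallel_availability(machine_availability: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one replica is up.

    Assumption: replicas fail independently and takeover costs no time.
    """
    return 1.0 - (1.0 - machine_availability) ** replicas


def serial_availability(*component_availabilities: float) -> float:
    """Availability of a service that needs every listed component to be up."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


# Hypothetical figures: two ordinary 99% servers behind a 99.9% network.
servers = parallel_availability(0.99, 2)
service = serial_availability(0.999, servers)
```

With these made-up figures the pair of servers is better than either
machine alone, but the service as a whole can never be better than the
network it depends on. That is the weak link in the chain which a purely
hardware view of the server does not address.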
pg 1-7
Under hardware failure add the following bullets
o Time to find an alternate code component and activate it.
o Time to find an alternate network path and recover context.
pg 1-8
First sentence should say RECOVER not recovery.
pg 1-43
Bullet 3 under Levels of System Availability states "fault tolerant
computers" for mission critical systems. I believe this should be
"fault tolerant systems" instead of computers.
Section: Dependable System Strategies
pg 2-5
Computing Component examples - add
o A reliable client/server middleware.
Fault - add
o Link loss due to timeout in a client/server environment
o A poorly seated computer board
Failure - add
o Processor Failure
pg 2-7
The figure of the 3 operational states is in a different order than
the bullets below. The bullets should be tagged B,A,C
or reordered.
pg 2-9
Bullet 4 uses the word Cuing, - should this be Queuing?
Bullet 5 should read -
"Causing the application to failover to another application."
pg 2-11
Time redundancy can be referred to as the N-Version programming
technique. I have never heard it referred to as time redundancy
before.
Software Redundancy - I have no idea what is being referred to
here. The technique described sounds weak at best. I would write
it on the lines of:
Software redundancy can be accomplished by writing an application
framework which allows for process replication. With process
replication, processes can be created on alternate systems to
act as a hot replica of the work in progress (Shadowing)
or as a warm spare (Standby) ready to take over in the case of process
or processor failures. This method is relatively difficult to
implement, but some vendors today provide it as a product on which
to write applications.
some references:
Fault Tolerant Computing - International Academic Publishers
A Support for Robust Replication in a Distributed Object Environment
authors: A. Corradi, L. Leonardi - University of Bologna Italy
Providing High Availability using lazy replication
authors: Rivka Ladin - Digital
Barbara Liskov, Liuba Shrira, Sanjay Ghemawat - MIT
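As a rough illustration of the shadowing idea in the suggested wording
for Software Redundancy, the sketch below keeps a hot shadow in step with
a primary and promotes it when the primary fails. It is a toy model under
my own assumptions: both replicas live in one Python process, and the
Replica and ReplicatedProcess names are hypothetical, not any vendor's
API. A real framework also has to deal with failure detection, ordering
of updates and resynchronization of a recovered replica.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True

    def apply(self, key, value):
        self.state[key] = value


class ReplicatedProcess:
    """A primary plus a hot shadow; every update goes to each live replica."""

    def __init__(self, primary: Replica, shadow: Replica):
        self.primary = primary
        self.shadow = shadow

    def update(self, key, value):
        for replica in (self.primary, self.shadow):
            if replica.alive:
                replica.apply(key, value)

    def fail_primary(self):
        """Simulate a process or processor failure and promote the shadow."""
        self.primary.alive = False
        self.primary, self.shadow = self.shadow, self.primary

    def read(self, key):
        return self.primary.state[key]


process = ReplicatedProcess(Replica("node-a"), Replica("node-b"))
process.update("order-17", "accepted")
process.fail_primary()
# The promoted shadow already holds the work in progress.
current = process.read("order-17")
```

A warm standby would differ only in that it does not receive each update
as it happens and must rebuild its state (from a log or checkpoint)
before taking over.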
pg 2-12 Cost of Redundancy - refers to dividing into small multiples
to reduce the cost as "downsizing"; this can also be referred
to as "granular partitioning".
pg 2-15 Single points of failure - this is totally a hardware
approach with no view toward replicating software
components.
pg 2-20 Single points of failure - add bullet 7
o Applications
pg 2-24 Client/Server computing. There is nothing to say what is
either good or bad about this style of computing. Some
people feel that client/server reduces the potential for
faults; others feel it increases the potential, as well as
creating situations where the state of information is
indeterminate after a failure, i.e. the server crashes and burns
and my work may or may not have been completed. On a single
system I can accept that I won't know the result of the
work which was in progress since my computer is burning
in front of me. In the client/server environment I am
still running but am unaware of the result of the last
piece of work. Hence a whole new set of problems, and this
chapter does not start to go into them on either side, good
or bad.
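One common way to narrow this "did my work complete?" window, sketched
here under my own assumptions (the chapter does not describe it), is to
tag each request with an id. The client can then safely resend a request
whose reply was lost, and the server can recognise the duplicate instead
of doing the work twice. All class and function names below are
hypothetical, and the lost reply is simulated with a flag.

```python
import uuid


class LostReply(Exception):
    """The server may or may not have done the work; the client cannot tell."""


class Server:
    def __init__(self):
        self.balance = 0
        self.completed = {}  # request id -> result already returned

    def deposit(self, request_id, amount):
        if request_id in self.completed:      # duplicate: do not redo the work
            return self.completed[request_id]
        self.balance += amount
        self.completed[request_id] = self.balance
        return self.balance


def call(server, request_id, amount, drop_reply):
    result = server.deposit(request_id, amount)
    if drop_reply:                            # work done, reply never arrives
        raise LostReply()
    return result


def deposit_with_retry(server, amount, drop_first_reply=False):
    request_id = str(uuid.uuid4())            # same id is reused on the retry
    try:
        return call(server, request_id, amount, drop_first_reply)
    except LostReply:
        return call(server, request_id, amount, False)


server = Server()
result = deposit_with_retry(server, 10, drop_first_reply=True)
```

For this to survive a real server crash the table of completed requests
would have to be kept on stable storage along with the data itself, which
the sketch does not attempt.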
pg 2-25 Transaction processing.
There is discussion of client/server in this section. There
is nothing about TP that even implies C/S. This is just an
interpretation of our TP monitor's architecture. If this is
meant as a generic description these references to C/S should
be taken out.
The bigger issue on this section is the same as the comment
I had made on "Client-Server Computing". There is nothing
to qualify the benefit of using a monitor for dependable
systems. I would argue that in order to get recovery in the
case of a failure (not just rollback of the work in progress,
but failover and completion of the work) monitors are
inappropriate, since they support flat transaction managers,
which means that in the case of a failure all resources in
a transaction need to be rolled back and the transaction started
over again. This is a presumed abort view and very inappropriate
to offering completion of work.
pg 2-27
The last sentence states that "Distributed applications use
c/s and TP to produce disaster tolerant systems." - How? There
is nothing in this chapter which illustrates that I am any better
off using these techniques. In fact I could be worse off. This
needs a lot of qualification.
pg 2-28
Primary Dependability Strategies -
I disagree with bullet 2. I think there are 2 states. "Apparently
not broken" and "not broken" are the same. Is it not broken/broken,
or on/off?
Using redundancy -
First bullet - I do not agree with the last line "S/W redundancy
requires a few extra lines of code" as mentioned in comments on
pg 2-12. I think it should be removed/rewritten.
add a new bullet -
o Application redundancy is the replication of the application
either through concurrency, warm standby, or hot standby.
pg 2-30
I am not sure what the XOR exercises are intended to achieve.(?)
Section: H/W dependability strategies
pg 4-9
Loosely coupled, independent processes - describes clusters; clusters
are a software implementation.
A section should be added for S/W dependability strategies.
I have had some discussions with Dave on DEC's Software Fault Tolerant
strategy and append a set of pointers below to get more information,
as well as a Gartner research note which describes DEC's software
fault tolerant product.
TPVON::Presentation$public:
RTR_CUSTOMER_PRESENTATION.PS - Technical tutorial presentation
RTR_CUSTOMER_PRESENTATION_NOTES.PS - Technical tutorial presentation notes
RTR_GARTNER_ARTICLE1.TXT
RTR_GARTNER_ARTICLE2.TXT
RTR_GARTNER_ARTICLE3.TXT
RTR_V2-2_FEATURES.PS;1 - IM Partners presentation symposium in Jan 93
RTR_V21_FEATURES.PS;1 - IM Partners presentation symposium in Jan 93
Note: since the Gartner articles have copyrights, reprint copies may be
obtained by contacting the RTR Marketing Manager: Jill Hitchcock at
TPSYS::Hitchcock or (508) 952-4137 / DTN: 227-4137
Glossy brochures:
Order number:
EA-A1104-34 Rel#43/91 03 20 8.0 MCG/CTS - The Australian Stock Exchange -
Builds Its Nationwide Systems with Digital
EC-F1796-57 Rel# 306/92 04 72 30.0 MRO/MKO - Reliable Transaction Router -
High-Performance Enterprise Integration Software
White papers done by the Australian Stock Exchange:
DECALP::RTR$PUBLIC:
ASX_CORE_SYSTEMS.PS
ASX_LESSONS_LEARNT.PS
ASX_GROW_RELIABLE_SYSTEMS.PS
(l) GARTNER GROUP RESEARCH NOTE #P-535-1156 ENTITLED:
"UNDERSTANDING DEC'S RELIABLE TRANSACTION ROUTER"
+ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - +
| Please be advised that the information contained within this |
+ report is copyrighted material. The following policies must +
| be adhered to: |
+ +
| - No reformatting of the data segments |
+ - No external distribution +
| - Internal use only in accordance with vendor agreements |
+ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - +
GartnerGroup Midrange Computer Systems
Copyright (C) 1992 MCS : P-535-115
Midrange Computing
Products, P-535-1156
W. Melling Research Note
March 13, 1992
Reprint
Understanding DEC's Reliable Transaction Router
Enterprises implementing heterogeneous client/server networks,
high-volume OLTP applications and/or fault-tolerant systems should
understand DEC's Reliable Transaction Router.
This publication is published by Gartner Group, Inc. Reprints of this
document are available. Reprint prices are available upon request. Entire
contents, Copyright (C) 1992 Gartner Group, Inc. 56 Top Gallant Road,
P.O. Box 10212, Stamford, CT 06904-2212. Telephone : (203) 964-0096.
Facsimile : (203) 324-7901. This publication may not be reproduced in any
form or by any electronic or mechanical means including information
storage and retrieval systems without prior written permission. All
rights reserved.
The ACID Test
Transactional ACIDity refers to the following essential properties of a
transaction processing system :
o Atomicity : The system will either perform all individual
operations on the data, or will assure that no partially completed
operations leave any effects on the data.
o Consistency : Any execution of a transaction must take the
database (globally) from one consistent state to another
consistent state.
o Isolation : Operation of concurrent transactions must yield
results that are indistinguishable from the results which would be
obtained by forcing each transaction to be serially executed
(i.e., in isolation) to completion in some order.
o Durability : The ability to preserve the effects of committed
transactions and ensure database consistency after recovery from
processor, memory, media or network failures.
Source : Transaction Processing Performance Council
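As a small illustration of the atomicity property listed above, the
sketch below uses Python's standard sqlite3 module. It has nothing to do
with RTR itself; the accounts table and the transfer function are
hypothetical and exist only to show "all operations or none".

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates take effect or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = make_db()
transfer(conn, "a", "b", 30)
try:
    transfer(conn, "a", "b", 500)   # fails part way through
except ValueError:
    pass
after = balances(conn)              # the partial debit has been undone
```

The failed transfer leaves no trace of its first update, which is the
behaviour the atomicity bullet asks for.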
What is the DEC RTR?
The Reliable Transaction Router (RTR) (see Figure 1)
from Digital Equipment Corp. (DEC) is a software product designed to turn
a network of heterogeneous clients, servers and databases into a
fault-tolerant system, with dynamic message routing, location-transparent
data access, transactional ACIDity, enhanced security and global
manageability. RTR delivers run-time services at the client, server and
message-management levels of a three-level model. It can be used as an
infrastructure for client/server computing, an architecture for
high-volume transaction processing or an alternate approach to
fault-tolerant computing.
Of Additional Interest
Users who find the RTR architecture relevant to their environments should
also familiarize themselves with the capabilities of SuiteTalk from
Multinet Distributed Information Systems, Harvard, Mass.
RTR as an infrastructure for client/server computing
RTR is currently a
VAX/VMS product. We believe that, by 1Q93, it will begin managing
messages for multivendor clients and servers, with broad coverage by 4Q93
(0.7 probability). RTR permits independent selection of client and server
development tools, use of different tools for different applications and
integration of legacy systems with new applications. Developers must
decide where to make the cut between client function and server function
and must design the messages that will be passed, decisions that are
preordained by more structured client/server products.
Figure 1
** Please see hardcopy for Figure 1 **
Source : Gartner Group
RTR as an architecture for high-volume OLTP
OLTP requires : scalability,
availability, distributability, flexibility, integrity and security. In
1992, competitive price and portable application support should be
available on demand. The three-layer client/server architecture of RTR
permits granular scaling of client, server or message-manager hardware,
with, essentially, linear price/performance, up to airline or stock
exchange volumes. (Several RTR sites are stock exchanges.) Availability
is at "fault-tolerant" levels. Distributability is inherent in the
architecture (the Australian Stock Exchange has more than 200 nodes
operational). Integrity is addressed by facilities for guaranteed
delivery of transactional messages, two-phase commit (in the current
product, across heterogeneous database managers and non-data resources),
cooperative termination and roll-forward/roll-back.
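For readers unfamiliar with two-phase commit, the following is a
textbook-style sketch of the idea and not RTR's API or implementation.
The Participant class and its method names are hypothetical. Each
participant votes in the first phase, and a single "no" vote causes every
participant to roll back in the second.

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self):
        """Phase 1: promise to commit if asked, or vote no."""
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if the vote was unanimous, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


agreeing = [Participant("database"), Participant("queue")]
committed = two_phase_commit(agreeing)

vetoed = [Participant("database"), Participant("checker", will_vote_yes=False)]
rejected = two_phase_commit(vetoed)
```

The sketch ignores coordinator and participant crashes between the two
phases, which is where most of the difficulty in a real implementation
lies.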
RTR Price/Performance
We estimate that a network of VAX 4000 servers with VAX 3100 clients and
non-programmable terminals would benchmark at $11,000 to $12,000/tpsA
wide-area network (0.8 probability), which is highly competitive with
alternate OLTP systems.
Enhancing Client/Server Security
RTR provides convenient mechanisms for the introduction of user-developed
security measures. On top of normal operating system and database manager
security mechanisms, the three-level model adds compartmentalization to
such an extent that separate client and server development teams become
feasible, and their view of each other is limited to defined inputs and
outputs. Neither end users nor client software developers need know where
servers are, what technology has been used to implement them, or how they
work. In addition, RTR provides an "authentication server" mechanism for
trapping messages so that they can be examined by user code, which runs
concurrent with transaction processing to avoid a performance penalty,
and exercises its veto power by voting "no" in the commit process.
RTR as a new approach to fault-tolerant computing
As business
requirements push data centers to supporting OLTP 24x7x52, and as
transactional systems increasingly become global, our understanding of
"fault tolerant" expands. No longer is it enough that the processor never
breaks and that disks are always shadowed. By themselves, hardware
approaches are like putting a 500-pound link in a 25-pound chain. RTR
addresses processor, memory, media and network failure with shadow
servers, standby servers, replicated routers, multiple virtual networks
on top of multiple physical networks, replay of in-flight transactions
whose target server has failed, and automatic resynchronization of
servers on recovery. RTR also goes after "deliberate downtime,"
supporting rolling upgrades of systems and application software, hot
backup/restore of databases, and even relocation of systems without
service interruption. "Disaster tolerance" is dealt with by shadowing
servers across wide-area networks. We believe that a properly configured
RTR network would match the availability of the industry leaders (Tandem
Computers, Stratus and DEC's VAXcluster with fault-tolerant front ends)
and far surpass the availability of a mainframe environment (see Research
Notes T-475-1011 and T-475-1016, 3/22/91).
Glossary
24x7x52   24 hours, seven days, 52 weeks per year
OLTP      On-Line Transaction Processing
VAX       Virtual Address Extension
VMS       Virtual Memory System
RTR Availability Forecast
Full-function RTR (client/server/router)
o VMS          Now
o Alpha/VMS    4Q92 (P=0.7)
o Windows NT   4Q92 (P=0.6)
o ACE/OSF      2Q93 (P=0.7)
o SVR4         4Q93 (P=0.7)
Client only
o Ultrix       Now
o MS-DOS       Now
o Macintosh    2Q93 (P=0.6)
How does RTR fit in the DEC OLTP strategy?
DEC already ships a mature
OLTP monitor, the Application Control and Management System (ACMS). RTR
is an alternate way of doing OLTP, not a replacement for ACMS. Both offer
high volume and availability, and both are scheduled to be
multivendor-ported. However, users who want a highly flexible
infrastructure for an OLTP architecture of their own design will lean
toward RTR, while those who want a disciplined environment where
developers are shielded from architectural mistakes will lean toward
ACMS.