
Conference koolit::vms_curriculum

Title:VMS Curriculum
Moderator:SUPER::MARSH
Created:Thu Nov 01 1990
Last Modified:Sun Aug 25 1996
Last Successful Update:Fri Jun 06 1997
Number of topics:185
Total number of notes:2026

159.0. "Dependable Systems -- generic chapters" by SUPER::MATTHEWS () Wed Jan 13 1993 11:21

    This note is for discussion of the generic (system-independent)
    chapters of the Dependable Systems course.
159.1. "first pass review of chapters 1-5" by TPVON::VON (Gregg von Sternberg - RTR/SFT Engineering) Tue Apr 06 1993 14:11
	My apologies if the note that follows sounds hard-edged. I did
	not take the time to phrase it with full consideration for those
	who have worked hard to get this material where it is now, but I
	wanted to at least get my first pass of comments in and shared
	before next week's meeting, so as not to waste time there doing
	what can be done here. I hope no offense is taken.
    
	I have an overriding comment that I want to share before going into
	the detailed comments that follow on chapters 1-5, which Dave has
	forwarded for review. The original book, Building Dependable Systems,
	on which the content of this course is so far largely based, is
	developed very much from the perspective of hardware fault tolerance
	as the means to achieve the highest levels of availability. That
	perspective was valid in what is now considered traditional
	computing, i.e. a single computer system with terminals attached.
	The current style is client/server computing, and there the
	traditional concept falls down: applying hardware fault tolerance to
	solve the problems of the client/server environment is like putting
	a 1000-pound link in a 5-pound chain. There are just too many other
	places where things can fail. Also, in client/server computing
	availability is measured differently: in terms of service
	availability, not machine availability. Things break and machines
	fail, but if service is still provided, the "system" is viewed as
	providing continuous availability. The comments that follow all
	proceed from the perspective that it is possible to take a set of
	machines that are not fault tolerant or highly available in
	themselves and, through software, build systems that offer high
	levels of availability and fault tolerance, i.e. distributed
	fault-tolerant systems.

	regards
	 Gregg

	
	Section: Defining Dependable systems

	pg 1-4 
	
	Requires the definition of

	SYSTEM - One or more physical computers, possibly network-connected,
		 that provide a service or set of services. In today's
		 client/server computing environments the system may be a
		 composite of the network and a number of client and server
		 computers.


	pg 1-7 
	
	Under hardware failure add the following bullets

	o Time to find an alternate code component and activate it.

	o Time to find an alternate network path and recover context. 
		 

	pg 1-8

	First sentence should say RECOVER not recovery.

	pg 1-43

	Bullet 3 under Levels of System Availability states "fault tolerant
	computers for mission critical systems." I believe this should be
	"fault tolerant systems" instead of "computers."

	Section: Dependable System Strategies

	pg 2-5

	Computing Component examples - add

	o A reliable client/server middleware.


	Fault - add

	o Link loss due to timeout in a client/server environment

	o A poorly seated computer board


	Failure - add

	o Processor Failure

	pg 2-7

	The figure of the 3 operational states is in a different order than
	the bullets below it. The bullets should be tagged B, A, C or
	reordered.

	pg 2-9

	Bullet 4 uses the word Cuing, - should this be Queuing?

	Bullet 5 should read -
	"Causing the application to failover to another application."

	pg 2-11

	Time redundancy can be referred to as the N-Version programming
	technique. I have never heard it referred to as time redundancy
	before.

	Software Redundancy - I have no idea what is being referred to
	here. The technique described sounds weak at best. I would write
	it along these lines:

	Software redundancy can be accomplished by writing an application
	framework that allows for process replication. With process
	replication, processes can be created on alternate systems either
	to hot-replicate the work in progress (shadowing) or to wait as a
	warm standby, taking over in the case of process or processor
	failures. This method is relatively difficult to implement, but
	some vendors today provide it as a product on which to write
	applications.

	some references:

	Fault Tolerant Computing - International Academic Publishers

	A Support for Robust Replication in a Distributed Object Environment
	authors: A. Corradi, L. Leonardi - University of Bologna, Italy

	Providing High Availability Using Lazy Replication
	authors: Rivka Ladin - Digital
		 Barbara Liskov, Liuba Shrira, Sanjay Ghemawat - MIT

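	The process-replication approach described above can be sketched
	in miniature. The following is a hypothetical, single-process
	simulation (the class and variable names are mine, not from any
	product): a primary applies each piece of work and mirrors it to a
	warm standby, which takes over with the replicated state when the
	primary fails.

```python
# Minimal sketch of warm-standby process replication. In a real system the
# replicas would be processes on separate machines connected by a network;
# here everything runs in one process for illustration.

class Replica:
    """Holds application state; can act as primary or standby."""
    def __init__(self, name):
        self.name = name
        self.state = {}          # replicated work in progress
        self.alive = True

    def apply(self, key, value):
        self.state[key] = value

class ReplicatedService:
    """Routes work to the primary and mirrors each update to the standby,
    so the standby can take over (failover) with up-to-date state."""
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def submit(self, key, value):
        if not self.primary.alive:            # failover: promote the standby
            self.primary, self.standby = self.standby, self.primary
        self.primary.apply(key, value)
        if self.standby.alive:                # warm standby: mirror the update
            self.standby.apply(key, value)

svc = ReplicatedService(Replica("A"), Replica("B"))
svc.submit("order-1", "booked")
svc.primary.alive = False                     # simulate processor failure
svc.submit("order-2", "booked")               # standby takes over transparently
print(svc.primary.name, svc.primary.state)
# B {'order-1': 'booked', 'order-2': 'booked'}
```

	The client sees no interruption: the second submission succeeds
	even though the original primary is down, because the standby
	already holds the mirrored state.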

	pg 2-12 Cost of Redundancy - refers to dividing into small multiples
		to reduce the cost as "downsizing"; this can also be referred
		to as "granular partitioning".


	pg 2-15 Single points of failure - this is totally a hardware
		approach with no view toward replicating software 
		components.
	
	pg 2-20 Single points of failure - add bullet 7
		
	o Applications
	

	pg 2-24 Client/Server computing. There is nothing to say what is
		either good or bad about this style of computing. Some
		people feel that client/server reduces the potential for
		faults; others feel it increases the potential, as well as
		creating situations of indeterminate information states in
		the case of failures. I.e. the server crashes and burns, and
		my work may or may not have been completed. On a single
		system I can accept that I won't know the result of the
		work that was in progress, since my computer is burning
		in front of me. In the client/server environment I am
		still running but am unaware of the result of the last
		piece of work. Hence a whole new set of problems, and this
		chapter does not begin to go into them on either side, good
		or bad.

	pg 2-25 Transaction processing.

		There is discussion of client/server in this section. There
		is nothing about TP that even implies C/S; this is just an
		interpretation of our TP monitors' architecture. If this is
		meant as a generic description, these references to C/S
		should be taken out.

		The bigger issue with this section is the same as the comment
		I made on "Client/Server Computing": there is nothing to
		qualify the benefit of using a monitor for dependable
		systems. I would argue that, in order to get recovery in the
		case of a failure (not just rollback of the work in progress,
		but failover and completion of the work), monitors are
		inappropriate, since they support flat transaction managers.
		That means that in the case of a failure all resources in
		a transaction must be rolled back and the transaction started
		over again. This is a presumed-abort view and very
		inappropriate for offering completion of work.

	pg 2-27

		The last sentence states that "Distributed applications use
		c/s and TP to produce disaster tolerant systems." How? There
		is nothing in this chapter that shows I am any better off
		using these techniques. In fact, I could be worse off. This
		needs a lot of qualification.

	pg 2-28
		Primary Dependability Strategies -
		I disagree with bullet 2. I think there are only 2 states:
		"apparently not broken" and "not broken" are the same thing.
		Is it not just broken/not broken, or on/off?


		Using redundancy -
		First bullet - I do not agree with the last line, "S/W
		redundancy requires a few extra lines of code," as mentioned
		in my comments on pg 2-12. I think it should be removed or
		rewritten.

		add a new bullet -

		o Application redundancy is the replication of the application
		  either through concurrency, warm standby, or hot standby.

	pg 2-30

		I am not sure what the XOR exercises are intended to achieve.(?)

	Section: H/W dependability strategies

	pg 4-9

	Loosely coupled, independent processes - this describes clusters,
		and clusters are a software implementation.


	A section should be added for S/W dependability strategies.

	I have had some discussions with Dave on DEC's software fault
	tolerance strategy, and I append a set of pointers below for more
	information, as well as a Gartner research note which describes
	DEC's software fault-tolerant product.




 TPVON::Presentation$public:
  RTR_CUSTOMER_PRESENTATION.PS - Technical tutorial presentation
  RTR_CUSTOMER_PRESENTATION_NOTES.PS - Technical tutorial presentation notes
  RTR_GARTNER_ARTICLE1.TXT
  RTR_GARTNER_ARTICLE2.TXT
  RTR_GARTNER_ARTICLE3.TXT
  RTR_V2-2_FEATURES.PS;1 - IM Partners presentation symposium in Jan 93
  RTR_V21_FEATURES.PS;1 - IM Partners presentation symposium in Jan 93


  Note: since the Gartner articles have copyrights, reprint copies may be
  obtained by contacting the RTR Marketing Manager: Jill Hitchcock at
  TPSYS::Hitchcock or (508) 952-4137 / DTN: 227-4137  

 Glossy brochures:
 Order number:
 EA-A1104-34 Rel#43/91 03 20 8.0 MCG/CTS - The Australian Stock Exchange -
				Builds Its Nationwide Systems with Digital

 EC-F1796-57 Rel# 306/92 04 72 30.0 MRO/MKO - Reliable Transaction Router -
			High-Performance Enterprise Integration Software

 White papers done by the Australian Stock Exchange:

 DECALP::RTR$PUBLIC:
  ASX_CORE_SYSTEMS.PS
  ASX_LESSONS_LEARNT.PS
  ASX_GROW_RELIABLE_SYSTEMS.PS


(1) GARTNER GROUP RESEARCH NOTE #P-535-1156 ENTITLED:

		"UNDERSTANDING DEC'S RELIABLE TRANSACTION ROUTER"

     + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + 
     |  Please be advised that the information contained within this | 
     +  report is copyrighted material.  The following policies must + 
     |  be adhered to:                                               | 
     +                                                               + 
     |     -  No reformatting of the data segments                   | 
     +     -  No external distribution                               + 
     |     -  Internal use only in accordance with vendor agreements | 
     + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + 


GartnerGroup  Midrange Computer Systems

Copyright (C) 1992      MCS : P-535-1156



Midrange Computing





Products, P-535-1156

W. Melling      Research Note

March 13, 1992

Reprint



Understanding DEC's Reliable Transaction Router

Enterprises implementing heterogeneous client/server networks,
high-volume OLTP applications and/or fault-tolerant systems should
understand DEC's Reliable Transaction Router.



+--------------------------------------------------------------------+
|                                                                    |
|                                                                    |
|    GartnerGroup                                                    |
|                                                                    |
+--------------------------------------------------------------------+
   This publication is published by Gartner Group, Inc. Reprints of this
   document are available. Reprint prices are available upon request. Entire
   contents, Copyright (C) 1992 Gartner Group, Inc. 56 Top Gallant Road,
   P.O. Box 10212, Stamford, CT 06904-2212. Telephone : (203) 964-0096.
   Facsimile : (203) 324-7901. This publication may not be reproduced in any
   form or by any electronic or mechanical means including information
   storage and retrieval systems without prior written permission. All
   rights reserved.

   The ACID Test

   Transactional ACIDity refers to the following essential properties of a
   transaction processing system :

   o      Atomicity : The system will either perform all individual
          operations on the data, or will assure that no partially completed
          operations leave any effects on the data.

   o      Consistency : Any execution of a transaction must take the
          database (globally) from one consistent state to another
          consistent state.

   o      Isolation : Operation of concurrent transactions must yield
          results that are indistinguishable from the results which would be
          obtained by forcing each transaction to be serially executed
          (i.e., in isolation) to completion in some order.

   o      Durability : The ability to preserve the effects of committed
          transactions and ensure database consistency after recovery from
          processor, memory, media or network failures.

   Source : Transaction Processing Performance Council
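   The atomicity property above can be demonstrated in a few lines of
   Python against the standard-library sqlite3 module (the account table
   and amounts are purely illustrative): a simulated crash in the middle
   of a transfer leaves no partial effects behind.

```python
import sqlite3

# Atomicity: either both rows of the transfer are applied, or neither is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before the credit is applied")
        conn.execute(  # never reached: the matching credit
            "UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except RuntimeError:
    pass

# The debit was rolled back; the database only moves between consistent states.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

   The half-finished debit never becomes visible, which is exactly the
   "no partially completed operations" guarantee of the atomicity bullet.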

   What is the DEC RTR? The Reliable Transaction Router (RTR) (see Figure 1)
   from Digital Equipment Corp. (DEC) is a software product designed to turn
   a network of heterogeneous clients, servers and databases into a
   fault-tolerant system, with dynamic message routing, location-transparent
   data access, transactional ACIDity, enhanced security and global
   manageability. RTR delivers run-time services at the client, server and
   message-management levels of a three-level model. It can be used as an
   infrastructure for client/server computing, an architecture for
   high-volume transaction processing or an alternate approach to
   fault-tolerant computing.

   Of Additional Interest

   Users who find the RTR architecture relevant to their environments should
   also familiarize themselves with the capabilities of SuiteTalk from
   Multinet Distributed Information Systems, Harvard, Mass.

   RTR as an infrastructure for client/server computing

   RTR is currently a VAX/VMS product. We believe that, by 1Q93, it will
   begin managing messages for multivendor clients and servers, with broad
   coverage by 4Q93 (0.7 probability). RTR permits independent selection of
   client and server development tools, use of different tools for different
   applications and integration of legacy systems with new applications.
   Developers must decide where to make the cut between client function and
   server function and must design the messages that will be passed,
   decisions that are preordained by more structured client/server products.

   Figure 1

   ** Please see hardcopy for Figure 1 **

   Source : Gartner Group

   RTR as an architecture for high-volume OLTP

   OLTP requires : scalability, availability, distributability, flexibility,
   integrity and security. In 1992, competitive price and portable
   application support should be
   available on demand. The three-layer client/server architecture of RTR
   permits granular scaling of client, server or message-manager hardware,
   with, essentially, linear price/performance, up to airline or stock
   exchange volumes. (Several RTR sites are stock exchanges.) Availability
   is at "fault-tolerant" levels. Distributability is inherent in the
   architecture (the Australian Stock Exchange has more than 200 nodes
   operational). Integrity is addressed by facilities for guaranteed
   delivery of transactional messages, two-phase commit (in the current
   product, across heterogeneous database managers and non-data resources),
   cooperative termination and roll-forward/roll-back.
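   The two-phase commit the note refers to can be sketched roughly as
   follows. This is a toy, in-process model (real implementations add
   logging, timeouts and presumed-abort recovery): the coordinator collects
   prepare votes from every participant, and a single "no" vote aborts the
   transaction everywhere.

```python
# Toy two-phase commit: phase 1 gathers votes, phase 2 enforces a single
# global outcome across all participants. Participant names are illustrative.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.outcome = None

    def prepare(self):            # phase 1: vote on whether it can commit
        return self.will_vote_yes

    def commit(self):             # phase 2: make the work durable
        self.outcome = "committed"

    def abort(self):              # phase 2: undo the work
        self.outcome = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect all votes
    if all(votes):
        for p in participants:
            p.commit()                            # unanimous yes: commit all
        return "committed"
    for p in participants:
        p.abort()                                 # any no: abort everywhere
    return "aborted"

db = Participant("database")
queue = Participant("message-queue", will_vote_yes=False)
print(two_phase_commit([db, queue]))   # aborted
print(db.outcome, queue.outcome)       # aborted aborted
```

   The key property is that no participant ever ends up committed while
   another is aborted, which is what guaranteed delivery of transactional
   messages across heterogeneous resources depends on.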

   RTR Price/Performance

   We estimate that a network of VAX 4000 servers with VAX 3100 clients and
   non-programmable terminals would benchmark at $11,000 to $12,000/tpsA
   (wide-area network, 0.8 probability), which is highly competitive with
   alternate OLTP systems.

   Enhancing Client/Server Security

   RTR provides convenient mechanisms for the introduction of user-developed
   security measures. On top of normal operating system and database manager
   security mechanisms, the three-level model adds compartmentalization to
   such an extent that separate client and server development teams become
   feasible, and their view of each other is limited to defined inputs and
   outputs. Neither end users nor client software developers need know where
   servers are, what technology has been used to implement them, or how they
   work. In addition, RTR provides an "authentication server" mechanism for
   trapping messages so that they can be examined by user code, which runs
   concurrently with transaction processing to avoid a performance penalty
   and exercises its veto power by voting "no" in the commit process.

   RTR as a new approach to fault-tolerant computing

   As business requirements push data centers to supporting OLTP 24x7x52,
   and as transactional systems increasingly become global, our
   understanding of
   "fault tolerant" expands. No longer is it enough that the processor never
   breaks and that disks are always shadowed. By themselves, hardware
   approaches are like putting a 500-pound link in a 25-pound chain. RTR
   addresses processor, memory, media and network failure with shadow
   servers, standby servers, replicated routers, multiple virtual networks
   on top of multiple physical networks, replay of in-flight transactions
   whose target server has failed, and automatic resynchronization of
   servers on recovery. RTR also goes after "deliberate downtime,"
   supporting rolling upgrades of systems and application software, hot
   backup/restore of databases, and even relocation of systems without
   service interruption. "Disaster tolerance" is dealt with by shadowing
   servers across wide-area networks. We believe that a properly configured
   RTR network would match the availability of the industry leaders (Tandem
   Computers, Stratus and DEC's VAXcluster with fault-tolerant front ends)
   and far surpass the availability of a mainframe environment (see Research
   Notes T-475-1011 and T-475-1016, 3/22/91).

   Glossary

   24x7x52      24 hours, seven days, 52 weeks per year

   OLTP         On-Line Transaction Processing

   VAX          Virtual Address Extension

   VMS          Virtual Memory System

   RTR Availability Forecast

   Full-function RTR (client/server/router)

   o      VMS            Now

   o      Alpha/VMS      4Q92      (P=0.7)

   o      Windows NT     4Q92      (P=0.6)

   o      ACE/OSF        2Q93      (P=0.7)

   o      SVR4           4Q93      (P=0.7)

   Client only

   o      Ultrix         Now

   o      MS-DOS         Now

   o      Macintosh      2Q93      (P=0.6)

   How does RTR fit in the DEC OLTP strategy? DEC already ships a mature
   OLTP monitor, the Application Control and Management System (ACMS). RTR
   is an alternate way of doing OLTP, not a replacement for ACMS. Both offer
   high volume and availability, and both are scheduled to be
   multivendor-ported. However, users who want a highly flexible
   infrastructure for an OLTP architecture of their own design will lean
   toward RTR, while those who want a disciplined environment where
   developers are shielded from architectural mistakes will lean toward
   ACMS.