
Conference koolit::vms_curriculum

Title: VMS Curriculum
Moderator: SUPER::MARSH
Created: Thu Nov 01 1990
Last Modified: Sun Aug 25 1996
Last Successful Update: Fri Jun 06 1997
Number of topics: 185
Total number of notes: 2026

158.0. "Building Dependable Systems -- general discussion" by SUPER::MATTHEWS () Wed Nov 04 1992 09:25

    This note is for discussion of the OpenVMS System Availability and
    Integrity course.
    
    					Val
158.1. "Strategies and Requirements Doc -- Feedback Please" by TANG::RHINE () Wed Nov 04 1992 09:31 (191 lines)


--------------- TM
| | | | | | | |
|d|i|g|i|t|a|l|                          INTEROFFICE MEMO
| | | | | | | |
---------------

TO:  Dick McCarthy                       Date:  20-Oct-1992              
     Pete Buswell                        FROM:  Jack Rhine, Bill Simcox
     John Coffey                         DEPT:  Services Development
     Jim Malanson				and Training

cc:  Rick Wardrop                                                 
     Bob Sowton                                                       
     Jim Stewart                                                               
               


SUBJECT:  Qualification of OpenVMS System Availability and Integrity Course
         


As part of the OpenVMS System and Network Management Mastery Series effort, we
have a core curriculum of three system and network management "generalist"
courses that teach the system management skills necessary to successfully
manage an OpenVMS system that is part of a VAXcluster and/or is in a networked
environment. This core curriculum is followed by a set of "specialist" courses
in areas such as performance, security and troubleshooting.  We would like to
augment this set of specialist courses with an additional course in OpenVMS
System Availability and Integrity. 

This course will provide experienced system managers and technical data center
managers with the skills they need to define their requirements for system
reliability and integrity, identify performance and cost tradeoffs, and
translate them into a specification of requirements and a prototype
implementation plan for their site.  This course will introduce failure
prediction using DECamds and other tools.  Complex and multi-site
configurations will be discussed.  This course will include labs and case
studies.

There is a large installed OpenVMS customer base and a growing percentage of
that customer base is becoming more concerned about mission critical
applications and 24 by 7 operating environments.  Given the industry trend of
high end customers being willing to pay for services and training that leverage
continuous operations, this course would seem to have very focused appeal and
can be offered at a premium price. Further, this course would result in a NEW
offering (not a replacement or updated offering) that would net incremental
revenue. 

A P/L analysis was done using an estimate of 50 worldwide offerings per year
with 10 students each and a constant 3 year model, FY93 - FY95.  The MLP is
assumed to be US$1995.  The analysis shows a 3 year Total Area Margin of about
$1,000,000 @ 38% against a development and update expense of $120,000 over the
same 3 year period. 
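
As a rough cross-check of the figures above, the gross revenue at full list
price can be worked out directly.  This is only a sketch: it assumes every
seat sells at the full MLP and that the 38% rate applies to that gross
figure, neither of which the memo states.

```python
# Figures quoted in the memo above.
OFFERINGS_PER_YEAR = 50
STUDENTS_PER_OFFERING = 10
LIST_PRICE_USD = 1995
YEARS = 3
MARGIN_RATE = 0.38
DEVELOPMENT_EXPENSE_USD = 120_000


def gross_list_revenue(offerings, students, price, years):
    """Revenue if every seat sells at full list price (an assumption)."""
    return offerings * students * price * years


revenue = gross_list_revenue(OFFERINGS_PER_YEAR, STUDENTS_PER_OFFERING,
                             LIST_PRICE_USD, YEARS)
margin_at_list = revenue * MARGIN_RATE  # assumes 38% applies to gross revenue

print(f"3-year gross revenue at list: ${revenue:,}")
print(f"38% of that figure:           ${margin_at_list:,.0f}")
print(f"Development and update cost:  ${DEVELOPMENT_EXPENSE_USD:,}")
```

Under these assumptions the gross comes to $2,992,500 and 38% of it to
roughly $1.14 million, the same order as the memo's "about $1,000,000".
The difference could reflect discounting or a different margin base, which
the memo does not spell out.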

This project will be funded as a result of deferring OpenVMS for Programmers II
until next fiscal year pending further study of how to restructure the 
OpenVMS Programming curriculum.  We believe that the financial benefit of the
reliability and integrity course will be more immediate.

The Strategies and Requirements Specification is appended below.  Please
review the objectives of the course and the scope of work, and qualify the viability of the
offering so that we may start development as quickly as possible to meet a 
projected end of May introduction.  We would appreciate your response within
two weeks.

Regards,




                         Strategies and Requirements


OpenVMS System Availability & Integrity
5-Day Lecture/Lab

DESCRIPTION

This course will provide experienced system managers and technical data center
managers with the skills they need to define their requirements for system
reliability and integrity, identify performance and cost tradeoffs, and
translate them into a specification of requirements and a prototype
implementation plan for their site.  This course will introduce failure
prediction using DECamds and other tools.  Complex and multi-site
configurations will be discussed.  This course will include labs and case
studies based on actual experiences of DEC customers with mission critical
applications.

OBJECTIVES

  *  Determine mean time between interruptions for mission critical
     applications and define relevant single and multi-site 
     system configurations.
  *  Develop a prototype remedial process for a mission critical site
  *  Perform predictive failure analysis using DECamds, operating system
     tools, and Symptom Directed Diagnosis tools
  *  Define available application features that leverage availability and
     data integrity and their relevance to common application scenarios.

TARGET AUDIENCE

Experienced system managers who are responsible for mission critical systems,
technical data center managers, and application designers who design mission
critical applications that require high availability and data integrity.

TOPICS

  *  Operational definitions of reliability, availability and data integrity
  *  Cost and performance tradeoffs
  *  Computation of mean time between interruptions for complex applications
  *  Determination of single points of failure, their impact on MTBI, and how
     to minimize that impact
  *  Failure prediction and troubleshooting in mission critical environments
  *  Development of remedial processes for mission critical applications
  *  Backup and fast recovery in 24 by 7 environments
  *  Application data integrity and availability features in OpenVMS
  *  Case studies and labs
  *  DEC Mission Critical Services

Note:  These topics were distilled from a more detailed list developed by 
       several subject matter experts.  A preliminary, more detailed, topic
       list is attached.
  

SUMMARY OF REQUIREMENTS

  *  5-Day Lecture/Lab
  *  Material will be modular
  *  Case study approach combined with labs using software fault insertion
  *  Selected training centers having extensive TBD lab resources should
     be targeted for this course
  *  SYSNET III, Performance, and Troubleshooting are recommended
     prerequisites


Potential Content of an OpenVMS Reliability, Availability, and Integrity Course
resulting from a brainstorming session with subject matter experts:

1. Define the above and other terms

2. Discuss tradeoffs impacting the above, performance, and cost.

3. Discuss evaluation of needs, MTBI measurements, can't exceed the weakest
   link, etc.

4. Discuss areas of a system that can be impacted, and those that cannot.
   e.g. design issues like no parity memory that are inherent weaknesses.

5. Determine single points of failure and how to minimize them in standard
   environments, e.g. typical workstation, cluster, mainframe.

6. How to configure for availability
	- redundancy, hot spares
	- ease of management issues
	- performance tradeoffs
	- multi site (MDF and other) approaches (ability to move users, data
	  to a different site for quick recovery)
	- storage management
		* RAID, including striping and shadowing
	- faster rebooting
	- network vs. cluster file services
        - "clusters of clusters"

7. Application techniques
	- DECtp, two phase commit
	- journalling, checkpointing
	- failover

8. How to predict failure using AMDS and other tools (Polycenter?)

9. Backing up 24 x 7 systems and data recovery in this environment

10. The remedial process
	- How to work around failures
		* hot spares
		* determine applications that HAVE to run when there is
		  reduced capacity
	- Troubleshooting in a high availability environment
		* hot systems
		* site specific process, who does what
		* how to get DEC involved

11. Digital Services
	- Why mission critical services
	- other DEC service offerings

DEC mission critical sites such as Bellcore and MCI are potential case
studies.  Lab exercises with AMDS are a possibility.
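
Item 3's "can't exceed the weakest link" and item 6's redundancy can be
illustrated with the textbook series/parallel availability model.  This is a
minimal sketch, assuming independent component failures; the component names
and availability figures are made up for illustration and do not come from
the course material.

```python
from math import prod


def series_availability(availabilities):
    """All components are required: overall availability is the product."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one redundant component suffices: fails only if all fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures, for illustration only.
components = {"cpu": 0.999, "disk_controller": 0.995, "network": 0.990}

chain = series_availability(components.values())
# A chain is never more available than its weakest link.
assert chain <= min(components.values())

# Duplicating the weakest link (the network here) raises the total.
redundant_net = parallel_availability([0.990, 0.990])
improved = series_availability([0.999, 0.995, redundant_net])

print(f"single path:       {chain:.4f}")
print(f"redundant network: {improved:.4f}")
```

The same two functions can be nested to model larger configurations, as long
as the independence assumption is acceptable for the site in question.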
    
158.2. "A question -" by SUPER::MATTHEWS () Wed Nov 04 1992 17:36 (6 lines)
    I realize the tools are VMS-specific, but a lot of the topics are not. 
    Is there any thought of making this a multi-OS course? Or at least
    separating the generic from the VMS-specific material as is being done
    in the security training?
    
    					Val
158.3. "RE: Generic Availability and Data Integrity Suggestion" by TANG::RHINE (VMS Training Product Manager) Wed Nov 04 1992 18:13 (7 lines)
    Val, your suggestion is a good one to look at.  I know that there are
    products in the OSF space that provide some of the VMS availability and
    data integrity tools.  If you know someone in the OSF Course
    Development world that could look at the memo in .1 and provide
    feedback from the point of view you have suggested I would appreciate
    it.
    
158.4. "When and how complex" by SOAEDS::TRAYSER (Seniority means a bigger shovel!) Mon Nov 09 1992 23:04 (15 lines)
  1 basic question and 1 comment...
  
  What is the time table for this course?  Q3 development with Q4 delivery
   or further out?
  
  Having looked over the outline of the material I see this as a perfect
  course to either be a prerequisite or a parallel course offering to VMS
  Troubleshooting.  I saw nothing in the outline that would make me believe
  that I wouldn't understand it if I hadn't taken Troubleshooting.  This 
  might be an appropriate course for the MIS series--no significant VMS
  issues, possibly suitable for Windows NT, Unix or other operating
  systems.  I think Val's comment was right on the money; don't tie it too
  tightly to VMS or we limit our options.
  
  $
158.5. "RE:.-1" by TANG::RHINE (VMS Training Product Manager) Tue Nov 10 1992 07:37 (8 lines)
    Buck, we are looking at Q3 development and Q4 delivery.
    
    The reason that I suggested troubleshooting as a possible prerequisite
    is that troubleshooting in a high availability environment, which I
    believe is an important topic, could layer on knowledge and skills 
    that are taught in the troubleshooting course you are about to pilot.
    The prerequisite issue should be decided after the content is firmed
    up.
158.6. "New name, New format" by SUPER::SUPER::TARRY () Wed Jan 13 1993 11:51 (21 lines)
    There has been some progress on this new course.
    
    	The name has been changed to
    
    		Building Dependable Systems  - Generic chapters only
    
    		Building Dependable OpenVMS Systems - Generic + VMS specific
    
    
    	There will be two part numbers.
    
    		The first course will contain only the generic chapters
    		and will not have laboratory exercises.
    
    		The OpenVMS specific course will have laboratory exercises.
    
    
    I will be posting pointers to the project plan as soon as it is
    approved.
    
  
158.7. "Chapter 1 Ready for Review" by SUPER::SUPER::TARRY () Fri Jan 22 1993 14:36 (53 lines)
    
    
	The first draft of chapter 1 is ready for review.  This
	chapter is a generic chapter and contains:

		Terms
		Levels of dependable system
		Defining business requirements


    	There is a figure to be added at one point.  It will show the
    	following components:  hardware, software, environment and humans.
    
    	There is still one block to be written on tradeoffs between data
    	integrity, performance, and cost.
    
    	There is a very extensive case study at the end of the chapter.
    
	Copy the file from:

	SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_INSTRUCTOR.PS
		                          VMS_DS_1_INTRO_STUDENT.PS


	Note the following:

		Each chapter has a ps file for the instructor and a ps
		file for the student.  You will need to print and read
		both.  The student material is not in the instructor
		manual.

		Please do not send comments on the format of the
		materials.  I have no control over format.

	Send comments to  SUPER::TARRY  by 1-Mar-1993
    
    
	Other chapters planned for the course, not necessarily in the
    	order in which they will appear are:

		Introduction to Dependable Computing  Generic
		Environmental factors                 Generic
		System Configurations                 Generic
		Mass storage                          Generic
		Configuring OpenVMS System
		Avoid Human Errors                    Generic
		Managing Dependable OpenVMS Systems
		Disaster Recovery                     Generic
		Data Integrity                        Generic

    
    
                                                                       
158.8. "Revised Chapter 1 Posted." by SUPER::SUPER::TARRY () Thu Feb 18 1993 14:12 (17 lines)
    One person did provide some review on chapter 1.  I have revised the
    chapter according to the suggestions and placed a new copy for review.
    
    SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_STUDENT.PS
                                                    _INSTRUCTOR.PS
    
    
    Be sure to pull and print both versions.  There are two case studies
    in this chapter.  One is very short.
    
    Chapter 3 on the environment is finished, but there are so many figures
    the chapter makes little sense without them.  As soon as preliminary
    figures are ready I will post chapter 3.
    
    Chapters 2 and 4 are in the final stages of development.  They should
    both be posted early next week.
    I sure do need more reviewers.
158.9. "Chapters 2 and 4 for review" by SUPER::SUPER::TARRY () Mon Feb 22 1993 14:49 (19 lines)
    Building Dependable OpenVMS Systems
    
    Chapters 2 and 4 have been posted for review.
    
    Chapter 2 discusses strategy in general and chapter 4 discusses
    hardware strategy.  This includes fault tolerant, cluster and RAID.
    
    Both are generic chapters.
    
    
    SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_2_STRATEGY_STUDENT.PS
                                                       _INSTRUCTOR.PS
    
    
    				     VMS_DS_4_HARDWARE_STUDENT.PS
                                                      _INSTRUCTOR.PS
    
    
    
158.10. "Need Reviewers Please!!!" by SUPER::SUPER::TARRY () Thu Mar 04 1993 18:14 (28 lines)
    Chapters 1-4 have been reposted in the directory:
    
    SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
    		                                   _INSTRUCTOR.PS
    
    
    To review these chapters you must pull and print both the student and 
    instructor versions.
    
    Chapter 3 is posted for the first time.  Chapter 1 is ready for the
    pilot.
    
    Chapters 1-4 are operating system generic chapters.
    
    The final chapters will look like this:
    
    	Chapter 5   OpenVMS configurations that support high availability
    	Chapter 6   Managing OpenVMS systems for dependability
    	Chapter 7   Managing the data center (generic)
    
    	Laboratory Exercises for OpenVMS
    
    
    I am looking for a funded reviewer and a pilot instructor for this
    course.
    
    I am really desperate for some instructor review.
  
158.11. "Chapters Posted for Review" by SUPER::SUPER::TARRY () Wed Mar 17 1993 14:09 (18 lines)
    Chapters 1-5 and 7 are posted for review in:
    
    SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
                                      VMS_DS_#_name_INSTRUCTOR.PS
    
    
    Note that to review the materials you must obtain both the instructor
    and student material and read them at the same time.
    
    Some chapters are not finished.
    
    The pilot has been scheduled for 10-May at PKO.  Dave Maxwell has
    agreed to be the instructor.  Thank you for volunteering!
    
    I just must tell you that I am off for a vacation in Costa Rica.  I
    will be back 8-April.
    
    Please have review comments posted by 8-April.
158.12. "Beginning reviews of the material (starting with module 2)" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Mon Apr 05 1993 03:24 (65 lines)
I've browsed the Student Materials and read the "Plan" carefully.  I see a
very basic problem with the material--I don't think it fits well in the SysNet
curriculum.  I think of my SysNet 3 students (which according to the prereq's
they need SysNet 3 before taking this course) and I cannot envision them 
enjoying this class.  The material here is more attuned to the MIS director
or Senior System Manager--people responsible for setting up data centers, not
for your average advanced system manager.  Do we have material designed and 
written for the right audience?  Do we have a plan to get the correct students
to this class?

I've already reviewed most of Chapter 1 via 'hardcopy', so I'll begin entering 
Chapter 2 info --

Nothing notable in the Instructor's page except 2-13 "studnet" instead of 
"student".

Student Guide:

2-5 - 10th bullet, good analogy, but out of context. Move the highway example to
      the instructor's page.

    - definition of 'Failure', "...whose effects cannot be contained."  Huh?
      This is a little fuzzy, how about a better definition?!

2-7 - Very good analogy with the suitcase example!  Clear, to the point and
      covers the concept well. 

2-8 & 2-9 - Urgh.  These pages don't flow well.  The title is "Dependability
      Strategies", but there are no "strategies" on this page, only definitions
      and examples.  Please put a few STRATEGIES here, then support them with
      examples.

      Also, the examples are incomplete, overly complex or technically wrong:

        The three bullets on the top of 2-8 cover "Fault prevention", "Error
        correction" and "Failure recovery", but the examples cover "Error
        correction" twice and there is no example for "Fault prevention" (which
        in my opinion is the most critical).

        The CRC/XOR example is overly complex and, although interesting, makes
        for tedious lecture by those not entirely familiar with the process.
        Please note that the discussion only refers to the SOFTWARE supplied
        CRC mechanism, not the more useful and quicker HARDWARE implemented CRC.
        Move it to the Instructor's page.

        And lastly, there are slight errors in the Backup example:

          Numbered item "1" under the "When data is backed up to tape..." has
          the term Checksum confused with CRC.  There is a CRC algorithm used 
          to calculate a Checksum for the header block, but the DATA blocks do
          not refer to the CRC that is written in them as a Checksum.  These
          are two distinct fields in the Backup tape layout structure (known
          as BBH$L_CRC and BBH$W_CHECKSUM).

          Numbered item "2" under the "When data is backed up TO tape..." 
          discusses READING from the tape.  It should instead explain that 
          during the writing of the Group a redundancy block is calculated and 
          written at the end of the Group of blocks.

          Last paragraph should state that if the CRC cannot solve the problem,
          then the redundancy block is used.
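
          The detect-then-rebuild idea discussed above can be sketched
          generically: a per-block CRC shows which block is damaged, and an
          XOR redundancy block written at the end of the group rebuilds it.
          This is an illustration only, not the actual BACKUP save-set
          layout.  The block contents and group size are invented, zlib.crc32
          merely stands in for whatever CRC BACKUP uses, and treating the CRC
          as detection only is an assumption of the sketch.

```python
import zlib
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def redundancy_block(group):
    """XOR of every data block in the group, computed while writing."""
    return reduce(xor_blocks, group)


def find_damaged(group, crcs):
    """Indexes of blocks whose stored CRC no longer matches."""
    return [i for i, (blk, crc) in enumerate(zip(group, crcs))
            if zlib.crc32(blk) != crc]


def rebuild(group, lost_index, redundancy):
    """Recover one lost block by XORing the survivors with the redundancy block."""
    survivors = [b for i, b in enumerate(group) if i != lost_index]
    return reduce(xor_blocks, survivors, redundancy)


# Writing: store a CRC per block plus one redundancy block per group.
group = [b"AAAA", b"BBBB", b"CCCC"]
crcs = [zlib.crc32(b) for b in group]
xor_block = redundancy_block(group)

# Reading: block 1 comes back corrupted.
read_back = [group[0], b"B?BB", group[2]]
for i in find_damaged(read_back, crcs):
    read_back[i] = rebuild(read_back, i, xor_block)

assert read_back == group
```

          A single XOR block of this kind can rebuild only one damaged block
          per group; two bad blocks in the same group are unrecoverable.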

More later...

$
158.13. "continuing chapter 2 review" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Wed Apr 07 1993 03:46 (89 lines)
More from chapter 2...

Basic problem with this chapter is lots of new terms introduced but not fully
defined.  Don't use an example until the item introduced has been properly and
clearly defined.  The examples, although many of them are good, belong on the
instructor's pages, not the student material.

(No significant comments on the Instructor's pages, what follows is the student
material review)

2-11  -- 2nd sentence is out of place in the lecture, this should have been 
         covered back near page 2-8.

      -- Time Redundancy, what is it?  Need a clear definition.

      -- 5th paragraph doesn't seem to have anything to do with Time
         Redundancy.  Also, change 'can' to 'might'.

2-12  -- No discussion of Software Redundancy and how it complements Hardware
         Redundancy.  Examples might be RMS Journaling, VAXsim, etc.

      -- 2nd bullet, use VAXcluster nodes as a redundancy example and move the
         spare tire example to the instructor's page.

2-13  -- What is "N+1" and "2N"?  This idea and notation are not clearly
         defined.  Is this 'industry' notation or 'DEC' notation?

2-14  -- reverse bullets 5 & 6...determine "work-a-rounds" and their 'costs'
         (including time, money and other impact) and THEN define a recovery
         procedure.  The recovery procedure MIGHT be the work-a-round

      -- #3, "...are the most EFFECTIVE FOR YOUR NEEDS", since cost-effective
         solutions might not be the one we need, we might need the FASTEST,
         regardless of cost.

2-15  -- The description on 2-15 indicates a 6000-510, diagram on 2-16
         shows a 6000-520.  Change diagram to show 6000-510.

2-16  -- If this is a 6000-510 then ERASE CPU-1 from the diagram leaving
         only CPU-0.

      -- KDM70 should have "#1" removed (see 2-21).

      -- Indicate Data A and Data B on the disks (see 2-21).

      -- Ethernet segment isn't connected to the DEMNA, move the line over to 
         line up beneath the DEMNA (see 2-21).

2-17  -- #1 should indicate 6000-510

2-19  -- #1b, drop reference to Digital Service and replace with "Call
         maintenance" like the others.  For that matter, just drop the "b"
         question if the only significant "recovery" is to call service.  I'd
         feel really stupid saying "call service" for all the recovery
         solutions.
 
      -- #1c, the solution is to add another CPU in the cabinet, NOT add
         another system (see 2-15, last sentence).

      -- #2a, "...using the console OR BATCH."

      -- #2b,c if we had discussed Software Redundancy earlier we might have 
         mentioned having Virtual Terminal Support turned on (via SYSGEN) and 
         just reconnecting to the disconnected process via the second Ethernet 
         controller or the console.

      -- #3a, this assumes the entire controller fails, rather than just a port
         or cable in which case switching cables or ports is a solution.

      -- #3c, this requires human intervention...must move cables to the good
         controller when it fails.  Consider an HSC and dual ported disks.

      -- #4c, "N+1" just hanging out at the end of the answers.  Is this 
         intentional?  If so, it looks ugly.  Either explain it or drop it.

2-20  -- #5c, consider RMS Journaling.

      -- #6a, "...to do backup" or ANY other activity, such as file recovery, 
         tape journaling, software installations, etc.

      -- #6c&d - Where did "d" come from?  And "c" has a question not posed
         on 2-18.

2-21  -- both DECservers are labeled "#1", just drop the numbers.
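
On the notation questioned at 2-13 and 2-19: in common industry usage, "N+1"
means one spare beyond the N units the load needs, and "2N" means every unit
is duplicated.  Whether the course material intends exactly that is an
assumption.  A small sketch comparing the two follows.  It assumes identical
units that fail independently, models 2N as a pool of 2N interchangeable
units, and uses made-up values for N and the per-unit availability.

```python
from math import comb


def at_least_k_of_m(k: int, m: int, a: float) -> float:
    """Probability that at least k of m independent units are up."""
    return sum(comb(m, i) * a**i * (1 - a)**(m - i) for i in range(k, m + 1))


N = 3                 # units needed to carry the load (hypothetical)
UNIT_AVAIL = 0.99     # per-unit availability (hypothetical)

no_spare = at_least_k_of_m(N, N, UNIT_AVAIL)        # N of N
n_plus_1 = at_least_k_of_m(N, N + 1, UNIT_AVAIL)    # N of N+1
two_n = at_least_k_of_m(N, 2 * N, UNIT_AVAIL)       # N of 2N

# Each step up in redundancy improves the odds of carrying the full load.
assert no_spare < n_plus_1 < two_n
```

The comparison says nothing about cost, which is where the tradeoff between
the two schemes actually gets decided.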


More later...

$
158.14. "Last of 2" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Thu Apr 08 1993 02:42 (45 lines)
More from chapter 2...

Instructor's pages prefixed with an "I", such as I2-4 is page 2-4 Instr. Guide.

2-22  -- These are 3 strategies, but what I'd like here is a problem statement.
         Why do I need to know about these?  Will they add redundancy?  Will
         they describe a potential failure?  Use this page to 'setup' the next
         page.  (Or, put the 'setup' on I2-17.)

2-23  -- There is an assumption here that a "process" defines a "server" or
         a "client".  "Process" is strictly a VMS concept, i.e. a 'server' such 
         as a VXT doesn't use a process (yes, trust me, a VXT is generally a
         'server' and not a 'client').  On PCs, either of these concepts can 
         be implemented as a driver, not a process. (Same problem on I2-18)
 
2-25  -- ACID test not properly introduced.  It wasn't until I read it a second
         time that I figured out that ACID was an acronym.  The definition may
         be technically correct, but I'm not really sure what to do with it. I
         think it needs a better intro.

      -- Before-Image Journaling, nit -- "...cannot complete*,* the...", a 
         comma is missing.  Also, care to define the types of journaling, such
         as After-Image Journaling.

2-27  -- Arrows on diagram not clear.  I assume they represent the network or
         some "remote access".

2-30  -- 1st sentence, term "front-end" not defined.

2-31  -- More clearly mark the answers.  Since the questions are repeated on
         this page, at first glance I didn't see the answers, just the
         questions!
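
For the journaling terms raised at 2-25: before-image journaling saves the old
value of each record before changing it, so an incomplete transaction can be
undone; after-image journaling saves the new values so committed work can be
reapplied after restoring a backup.  A toy sketch of the before-image case
follows.  It is an in-memory illustration with invented names, not RMS
Journaling or any DEC API.

```python
_MISSING = object()  # marks "key did not exist before the write"


class BeforeImageJournal:
    """Toy before-image journal over a plain dict (hypothetical design)."""

    def __init__(self, store: dict):
        self.store = store
        self.journal = []  # list of (key, previous value) pairs

    def write(self, key, value):
        # Record the old value first, then apply the change.
        self.journal.append((key, self.store.get(key, _MISSING)))
        self.store[key] = value

    def commit(self):
        # The transaction completed: before-images are no longer needed.
        self.journal.clear()

    def rollback(self):
        # The transaction could not complete: undo in reverse order.
        while self.journal:
            key, old = self.journal.pop()
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old


accounts = {"checking": 100, "savings": 50}
txn = BeforeImageJournal(accounts)
txn.write("checking", 70)   # debit applied
txn.write("audit_note", 1)  # new record created
txn.rollback()              # credit never happened, so undo everything

assert accounts == {"checking": 100, "savings": 50}
```

The rollback leaves the data as it was before the transaction started, which
is the all-or-nothing behaviour the ACID definition asks for.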

---------

I2-19 -- I like the analogies being on the Instructor's page, thanks!

I2-21 -- A good example of a distributed application is VAX Notes.  The user 
         interface runs locally and the compute/storage functions are on the
         conference host system.


Chapter 3 is next...

$
158.15. "Chapter 3, needs a fair bit of work" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Fri Apr 09 1993 02:38 (177 lines)
Overall I was a little depressed with this chapter.  It addressed large data 
centers, but not office based systems.  It discussed physical security but
no mention of software security.  There are numerous terms used that are
not defined or described.  There is a weird discussion on power and electricity
and no discussions on Static electricity and its effects.  Basically a ragged
chapter with occasional interesting topics intermixed with esoteric wanderings.

(Instructor pages denoted with leading "I", such as I3-14)

3-4  -- term EMI,  can YOU define 'waveform' to a system manager with only a
        high school education?

     -- term NOISE, please don't define a term with another term that has yet
        to be defined!

     -- term NOTCH, OK, so I've got a small vocabulary.  What's "SUBTRATIVE"
        supposed to mean to me? 

I3-5 -- 2nd paragraph from bottom, I claim ignorance.  Three-phase comparison
        to single-phase needs a bit more explanation for those of us not
        strong in electronics.

3-6  -- spelling, second line "avilable" is missing an "a".

I3-6 -- "sensitive" paragraph -- oh, please!  give me a break.  And there are
        students opposed to Dams because of flooding and keeping salmon from
        migrating.  And there are students opposed to fuel-fired plants that
        spew pollutants into the air.  Drop the reference, I find it slightly
        demeaning.

     -- 6th bullet, "jepordize" needs an "o" in front of the "p".

3-7  -- Ha!  Please remember that many system managers have never taken
        college courses so may not have had 'trig'.  Terms and phrases such as 
        'conductor', 'AC Sine Wave', 'current is induced' are beyond many 
        students.  To cover them in this 'matter-of-fact' material is not
        acceptable.

     -- Hmmm, why does figure B still have a current flowing  from right to 
        left as indicated by the "+ and -"??

3-8  -- Gag!  Is this material really necessary?  I can teach this course quite
        successfully without this page.  And if I *was* to teach this material
        I couldn't answer any questions on it.  Neither my degree nor my 20
        years of computer experience has ever taught me the details of three-phase
        power, so if you expect the average SOFTWARE instructor to teach this
        material successfully, I believe you are setting us up for a failure.
        Drop pages 3-7, 3-8 and 3-9.

3-9  -- Useless, drop it.

3-10 -- 9th 'paragraph', care to tell us why the substation is getting power from
        more than one circuit?

     -- last paragraph, "At cost,", huh?  Do you mean to say "At an increased
        expense,"?

3-11 -- Diagram: What are the triangles?  What are the boxes with dots in them?
        What is a "power pool"?  What are the lettered circles (A, B, etc.)?

     -- #2, I know Texas is big, but "WHY" is it separate?  I'm sure students
        will ask, it's just too obvious!

I3-7 -- 2nd paragraph from bottom, 2 mispelins, "pulic" and "invester".

3-12 -- 6th bullet, spelling of "compnay"

I3-8 -- Item #10, care to define "Load Shedding".  I assume it means cutting
        service.

     -- It's not clear on pages I3-8 and I3-7 what parts are actually taken 
        from the quoted sources.  Please indent, italicize or otherwise change
        the font for quoted material.
        
3-13 -- 3rd bullet, "oscilating" needs a second "l".

I3-9 -- 3rd "paragraph" from bottom, should "...1 minute or more..." be "...1
        minute or LESS..."??
 
3-14 -- Neither the diagram nor the description on the instructor's page (I3-11) was
        clear or useful.  Drop it!

3-15 -- 3rd paragraph, how exactly does one spread the load evenly over 3-phase
        power?

I3-11-- 2nd paragraph under Voltage tolerance, spelling: change Emphasis to
         Emphasize.

3-16 -- 3rd paragraph under motor generators, might add an instructor note
        that PC and small VAX systems can usually ride the short power sags
        without any special hardware requirements.

I3-12-- 2nd paragraph of motor generators, What "flywheel"?
    
     -- last paragraph of motor generators, Huh?  how can sustaining power for
        an extra 1/10th of a second be enough for a "smooth" shutdown of 
        computer equipment?  Never heard of this!

3-17 -- bullets, add Power Duration and Recharging Time as two other 
        considerations

3-18 -- First paragraph puts all computers in the same situation.  My Laptop
        generates very little heat and doesn't need any special air 
        conditioning.

3-19 -- 1st line, we haven't defined solid-state circuitry and where it is
        found.  Cars have Solid-state equipment and work fine on the hot 
        asphalt parking lots.  What devices are being referred to?

     -- 3rd and 4th paragraph should only be one paragraph.

     -- 2nd bullet is poorly worded.  It implies that I don't have enough power
        to start with.

     -- 3rd bullet implies I can buy a fan at Kmart and cool my computer room.
        Mention portable A/C units that exhaust the hot air into the suspended
        ceilings (no joke, we've used them before).  

     -- The last paragraph "writes off" all the issues regarding PC and 
        other office-based workstations.  I believe this is a major short-
        coming of this material.

3-20 -- Only the first section on this page relates to Physical Security, the
        other issues are Environmental issues.  Also, it ignores small systems
        again, assuming computers are kept in computer rooms.

     -- 2nd paragraph, "controlled...access", yeah, I call it a "door".  Be
        more specific.

     -- 3rd paragraph, why no windows?  Reflective or mirrored windows are 
        a feature I suggest to managers.  Having a view to the outside
        world is a GREAT way of reducing stress, which computer people seem
        to easily generate.

     -- Keeping data center clean, 1st paragraph -- printers in computer rooms
        is not as large of an issue as it used to be.  The 'lint' or 'paper
        dust' was significantly reduced as shops moved from the high-speed, 
        continuous form impact printers to laser printers.
  
     -- Keeping data center clean, 4th paragraph -- This is a MAJOR 
        misunderstanding of VCS systems.  They were originally sold in 
        VAXcluster environments, but can have HSCs, Unix systems, stand-alone
        VAX systems, Alphas, 3rd-party systems, etc. connected to the 
        VCS.  The *name* is a poor name and should be changed to reflect its
        current capabilities.

     -- Water Problems, 1st paragraph -- "Sorry, Egon.  I'm a little fuzzy on
        this 'bad' thing.  Why is it bad to cross the streams?"  Replace
        "very bad" with "can be a problem".  I've used water to clean soda
        out of a VT200 keyboard before.  Bad is subjective.
  
     -- Water Problems, 3rd paragraph -- "SOME air conditioning units...", not
        all are water cooled anymore.  Liquid Nitrogen and Hydrogen, Freon and 
        other coolants are popular today.

3-21 -- So, what is "Power Conditioning System Plus"?  Strange header for the
        page.  Is it a product or service?  Is it a concept?  Does DEC sell it?
        
     -- 2nd paragraph, what's a H731x?  What's VAX REMS, I've never heard of 
        this.  How about a description.  If this is a product pitch move it
        to the appendix and present the "concepts" of the process here.

3-22 -- What's a REOP/EMS?  What's a J-Box?  What's a RSU?  What's the box on
        the floor near the J-Box that looks like it's 1/2 open?

Basically a chapter I'd cut-and-paste and deliver using only about 6-8 pages.
Some stuff is interesting, but the audience is wrong if we are selling this
course to System Managers.  This stuff is more for Data Center Managers or 
MIS directors.

Oh, and finally, many pages are using more than 75% of the page.  This might 
be OK for reference material or Self-paced material, but for lecture lab we
like enough space at the bottom of the page to take notes.

More later...

$
158.16. "Chapter 4 -- tools, but no real strategies" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Mon Apr 12 1993 02:41  180 lines
Chapter 4

Don't see the connection between title and contents--"Software Dependabilities
Strategies".  Sorry, but I saw no strategies, only concepts and tools.
----------


4-5  -- 1st sentence.  Don't use the term "independently" to define the 
        concept of "independently recoverable unit".

4-6  -- Needs reorganizing.  System Kernel Level Redundancy should precede
        Subsystem Level Redundancy, especially since 'kernel' is used before
        it is defined.

     -- First section, CIRCUIT level redundancy not adequately defined.

     -- 2nd sentence, makes no sense. What console processor?  what is
        "its" referring to, the CPU or the Console CPU?  How does it work?

     -- 2nd section.  Why 2 separate sentences?  If they were merged they would
        form a coherent paragraph.  It looks strange to break up a paragraph
        like that.  OR, bring back the bullets that probably were there in a
        previous draft of this material.

     -- 3rd section (other than moving it up on the page), 3rd bullet has an
        example to get the idea across, the first two bullets also need an
        example like MIRA or TANDEM, etc.

     -- "Wide Area Redundancy" -- adding the term cluster system is redundant,
        it is a system.  Drop the cluster reference, it is covered later.

     -- Last paragraph could benefit from an example like hurricane Hugo, 
        where damage along the Carolina coast was extensive, but the tornadoes
        spawned by the hurricane traveled inland to cities like Charlotte doing
        excessive damage far from the 'strike' site.  The March blizzard was
        a good example that even geographically separate locations need to be
        MUCH more separated to protect them from the same disaster.

I4-6 -- 1st sentence, what does "such circuit level redundancy" mean?  "Such"
        implies a definition preceded this sentence.  I don't see it.
 
     -- Last paragraph, "...nearest the caller Order placement..." doesn't make
        sense!  Are we missing a word here?

I4-9 -- 2nd paragraph from bottom, VMS also has this feature.

I4-11-- This definition was needed in the previous chapters.  Move it forward
        to the first occurrence of "2N or N+1".

4-11 -- Last paragraph, care to tell us a bit more about a "Watchdog Timer",
        such as part number, common uses, etc.  On instructor's page would
        be fine.

4-12 -- Diagram is in error.  CPU is missing, MIRA watchdog timer not 
        connected to the Q-bus, cable from DEQTA goes directly to the other
        DEQTA (have it go to a 'cloud' or "..." to indicate more hardware).
        What is the box labeled "Standard Q-bus Interface" supposed to be?
        Is this an old configuration?  I don't see the DSSI hardware or disks.

4-13 -- The entire page looks like it needs bullets!  Or form them into 
        paragraphs.

     -- Triple Modular, Tightly Coupled and Hardware Intensive - give an 
        example of the type of hardware required to configure to these
        descriptions.  (Stratus reference on the instructor's page is
        probably OK for the last one.)

4-14 -- Layout looks weird.  Looks like bullets were removed and a paragraph
        would be more reasonable than the current format.

     -- 2nd bullet "No single point of repair", this is a strange statement
        that the rest of the bullet fails to explain adequately.

     -- 4th bullet "Self-checking checkers", ditto.  How is this done?  Can
        you give an example on the instructor's page?

4-15 -- Neither 4-14 nor I4-15 adequately describes this diagram.  What is
        "Mass STC", what is X-Link?  Diagram shows 2 cables leaving single
        Ethernet controllers.  Why does the Processor box have two little
        un-labeled boxes in it?  What is the vertical, white box that
        connects the Mass STC, Ethernet and Console 'boxes'?

4-17 -- Overall this page is a sales pitch, not providing anything worthy
        of a 'strategy'.  This page really doesn't say anything worth 
        lecturing on in a Dependable Systems class.  How is it redundant?
        How should it be configured?  What are the costs?  Etc.!!

     -- The middle section notes differences between VAXcluster and VMScluster.
        This information is old and will not be current when this material
        begins to ship.  As of Mid-May (VAX-VMS 6.0 and AXP-VMS V1.5) mixed
        architecture VMSclusters are supported.  It might be worth noting 
        previous configurations, but ALL the documentation will be changed
        to reflect that VMSclusters is the old VAXcluster concept on various
        combinations of VAX and AXP systems.  

     -- Last sentence.  Drop it or move it to the instructor's page.  Having
        a VAX VMScluster V6.0 (AXP V1.5) SPD in the instructor's kit would be
        more useful than this line.

I4-17-- How can we get a copy of the Aberdeen Group white paper?  For
        Instructors and/or for Students.

4-18 -- 2nd paragraph, "secondary storage" -- I don't think we've mentioned
        what "secondary" would be or how that differs from Primary storage.

     -- Last paragraph -- misleading.  Seems to imply that RAID might be
        discussed in more detail in the Performance course.  Please make it
        clear that VMS performance is more clearly defined and that RAID
        concepts (including performance) are covered in more detail here.

I4-19-- Any idea how to order the Berkeley paper on RAID?

I4-20-- diagram is ugly.  Either turn it on its side (landscape) or break
        it into two tables:

          Level     Descr.     Avail.   Request rate
          0
          1
          ...

          Level     Data rate  cost     type of appl
          0
          1

4-21 -- 2nd sentence, need to define "chunk" better -- "A chunk is the I/O 
        size of RAID that usually consists of several blocks of data", etc.

     -- Availability, MTBF?  Has it been previously defined?  I think so,
        just can't find it.

     -- diagram, why is "B" shaded?
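
        To make the "chunk" idea concrete for the instructor's page, here is
        a minimal Python sketch of RAID 0 striping.  It assumes a chunk is a
        fixed number of disk blocks and that chunks are laid out round-robin
        across the member disks.  The names locate_block, chunk_blocks and
        n_disks are hypothetical; they do not come from the course materials
        or from any product.

```python
def locate_block(lbn, chunk_blocks, n_disks):
    """Map a logical block number to (disk index, block number on that disk).

    Assumes plain RAID 0: chunks of chunk_blocks blocks, placed round-robin.
    """
    chunk = lbn // chunk_blocks      # which chunk holds this block
    offset = lbn % chunk_blocks      # position of the block inside the chunk
    disk = chunk % n_disks           # chunks rotate across the member disks
    stripe = chunk // n_disks        # number of full stripes before this chunk
    return disk, stripe * chunk_blocks + offset


# With 4-block chunks on 3 disks, logical block 13 sits in chunk 3,
# which wraps around to disk 0 in the second stripe.
print(locate_block(13, chunk_blocks=4, n_disks=3))
```

        A larger chunk keeps a small request on one disk; a smaller chunk
        spreads a large request across several disks.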

4-22 -- Reference of Raid 0 and Raid 1 being combined is good, but belongs
        AFTER we have described Raid 1, possibly at the bottom of the page.
        Otherwise it seems that the Avail/Perf/Storage section is defining
        how RAID 0/1 works!

     -- Also need to stick with either "RAID 1" or "RAID-1" spelling.

4-23 -- Why is "B" shaded?

4-24 -- Under Performance, the 2nd and 3rd paragraphs - don't describe any
        performance issues.  If they are describing how RAID-3 works, then
        move them up to the top of the page.
 
     -- Under Performance, 1st sentence, "...workload is a near sequential...",
        huh?  Don't understand the sentence.

4-25 -- Can't read the labels in the diagram.  Probably need to clear the
        background shading with the D⊕E⊕F like the triangle on top of the page.

     -- Dark lines going to the far-right disk imply that only blocks A & C
        are being referenced; add a 3rd line connecting "B" to the far-right
        disk.

4-26 -- bullets 5-8, don't make sense.  How can we write a D⊕E⊕F XOR chunk
        before we have written "F"??
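
        The parity question is easier to discuss with the arithmetic written
        out.  Below is a minimal Python sketch, assuming byte-wise XOR parity
        over three equal-sized data chunks D, E and F.  The function name
        xor_chunks and the sample data are made up for illustration and are
        not taken from the course materials.  Computing the parity chunk from
        scratch needs all three data chunks, but an existing parity chunk can
        also be updated from the old and new contents of a single chunk.

```python
def xor_chunks(*chunks):
    """Return the byte-wise XOR of equal-length chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


d, e, f = b"DDDD", b"EEEE", b"FFFF"   # illustrative data chunks

# Full computation: needs D, E and F.
parity = xor_chunks(d, e, f)

# Recovery: if the disk holding E is lost, XOR the survivors with parity.
rebuilt_e = xor_chunks(d, f, parity)

# Incremental update (assumed read-modify-write scheme): rewriting F only
# needs the old parity, the old F and the new F.
new_f = b"ffff"
new_parity = xor_chunks(parity, f, new_f)
```

        Whether bullets 5-8 intend the full computation or the incremental
        update is not clear from the page.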
     
4-28 -- Reed-Solomon, never heard of it.  Where is it defined in the class
        materials?  How can I get more info on this?

     -- 3rd & 4th bullets, what are in these blocks?  Probably Reed-Solomon 
        stuff, right?  And if a student asks, I can add nothing to the
        discussion.  Either more detail or drop all references to Raid-6.

4-31 -- What is this page doing here?  Why is RAID discussed in such detail
        here?  'Can' it.

4-33 -- Only exercise is on disks?  What about clusters?  MIRA?  other 
        concepts discussed?  Considering we don't have products that cover
        RAID 3, why spend 67% of the lab working on it?

Overall some interesting topics, but not focused around Strategies.  Seems
like a Marketing tour since we only discuss concepts rather than configuration
issues, what disks should be RAID-ed, how is Cluster maintenance performed 
without halting the cluster, how about combining VAXsim with shadow disks, 
etc.

$
     
158.17. "I still owe you a few more chapters, but "thanks"!" by SOAEDS::TRAYSER (Seniority: Big Shovel, Less Breaks!) Tue Apr 20 1993 23:44  10 lines
  Emmalee, 
  
  Thanks for the mail commenting on all my comments.  It's nice to know the
  course writers really do read this stuff.  I even appreciate your "I
  disagree with you" comments, where you keep the page as you have it but
  put my comments on the instructor's pages -- a quite acceptable solution.

  Thanks!

  $
158.18. "Course Update" by SUPER::SUPER::TARRY () Thu May 13 1993 16:49  45 lines
    PLEASE NOTICE THIS IMPORTANT CHANGE!
    
    The name of the directory  ES$REVIEW is now  IDC$REVIEW.
    
    
    The chapters for review are posted in:
    
    SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
                                                    _INSTRUCTOR.PS
    
    
    Again let me remind you that you must pull both student and instructor
    versions.  We no longer have facing pages.
    
    The name of the course has been changed to:
    
    	Building Dependable Systems Using OpenVMS Products          
    
    
    The course is coming along pretty well.  By tomorrow 
    14-May I will post updated chapters 1,2,3,4,5 and 8
    (Some requested changes to the drawings are still in progress.  All
     VAX 9000's are being removed and some of the lines are being
    corrected.)
    
    Many thanks to Buck Trayser for his thoughtful and always appreciated
    comments.
    
    The remaining chapters are: 
    
    Chapter 6  Distributed Software Fault Tolerance.
    
    Chapter 7  OpenVMS Products for Dependable Systems
    
    And lab exercises which will include the following:
    
    	Performing a Rolling Upgrade of a VMScluster system
    	Building a shadowed system disk
    	Performing backup by breaking a shadow set
    	Backup data integrity exploration
    	Writing a help library module
    	Solving cluster problems using DECamds
    
     Now is the time to review these chapters and prepare to teach this
    innovative course.
158.19. "Chapter 6 posted for review" by SUPER::SUPER::TARRY () Fri May 21 1993 14:52  37 lines
    A new chapter 6   Distributed Software Fault Tolerance has
    been completed and posted in the IDC$REVIEW directory
    
    SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_6_DSFT_STUDENT.PS
                                                    _INSTRUCTOR.PS
    
    
    This chapter discusses a concept of providing fault tolerant systems
    which is new to me and very exciting.  
    
    Distributed software fault tolerance uses distributed
    client/server configurations to provide fault tolerance so that the
    failure of a single node, network link or even an entire site does not
    interrupt the availability of an application.
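
    As a rough illustration of the idea only, here is a minimal Python
    sketch of a client that tries replicated servers in turn, so that one
    unreachable node, link or site does not stop the request.  Everything
    in it (call_with_failover, the two stand-in servers, the use of
    ConnectionError) is an assumption made for the example; it is not how
    the demo or any product in the course implements fault tolerance.

```python
def call_with_failover(servers, request):
    """Send request to each replica in turn; return the first reply."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ConnectionError as err:   # node, link or site unavailable
            last_error = err
    raise RuntimeError("no server available") from last_error


def site_a(request):
    """Stand-in for a server at a failed site."""
    raise ConnectionError("site A unreachable")


def site_b(request):
    """Stand-in for a healthy server at another site."""
    return "done: " + request


reply = call_with_failover([site_a, site_b], "order 42")
print(reply)
```

    The client sees one answer and never learns that the first site was
    down, which is the property the chapter describes.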
    
    There is a very beautiful demo of this concept which is to be included
    in the media kit.  It must be run on a workstation and of course you
    will need to be able to project the workstation.  The
    chapter may not make the concept as clear without the demo. 
    Also some drawings for the chapter have not been done yet.
    
    One point to emphasize is that nothing about distributed software fault
    tolerance negates anything that is in this course except perhaps
    providing standby generators in case of power failure.
    
    I think others will like this idea as much as I do.  I will repost the
    chapter after the drawings are added and will try to make the demo
    available to instructors.  There will also be some handouts on a
    successful implementation in Australia.
    
    Next week we will post the final chapters which are:
    
    Chapter 7   OpenVMS layered products
            8   Managing complex system
            9   Laboratory exercises
    
    With the lab exercises this is a 5 day course.  
158.20. "Pilot June 14" by SUPER::SUPER::TARRY () Fri May 21 1993 14:53  3 lines
    The pilot for Building Dependable Systems has been rescheduled to
    June 14.  It will be held in ZKO instead of PKO.  Students are now
    being enrolled.
158.21. "" by BROWNY::GDAY::MAXWELL (Dave Maxwell) Fri May 21 1993 15:09  12 lines
Hello,

I have placed a saveset in SUPER::IDC$REVIEW:[DEP_SYS]RTRDEMO.BCK

You will need the files to teach the class.  These are the files for doing the
RTR demo.  It is a real neat demo.  I suggest you try it even if you don't teach
the class.  There are 4 files in the saveset.  Print out the AAA_README.TXT file.
It tells you how to run the demo.  It does require a workstation.  If you have a 
color terminal it is even nicer.

Good luck,
Dave
158.22. "Building Dependable Systems Finished" by SUPER::SUPER::TARRY () Thu Jul 08 1993 09:59  36 lines
    The course is finished.  It is available in 3 formats:
    
    EY-N438  Building and Managing Dependable Systems Using OpenVMS
             Products   L/L  5 days  Includes operating system generic 
             materials, OpenVMS specific materials and laboratory
             exercises.
    
    EY-Q141  Building and Managing Dependable Systems Using OpenVMS
             Products   Sem  3 days  Generic and OpenVMS specific 
             materials.
    
    EY-N439  Building and Managing Dependable Systems
             2 days  Seminar   Generic materials only
    
    
    The course did not have a pilot.
    
    Materials are posted in the IDC$REVIEW directory
    
    
    SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]
    
    
    AMDS.BCK;1                Demo for DECamds         
    EY-N438E-EX-0001.PS;3     Student lab book
    EY-N438E-IG-0001.PS;1     Instructor guide              
    EY-N438E-SG-0001.PS;1     Student guide
    EY-N438E-TS-0001.PS;1     Pre test                  
    RTRDEMO.BCK;1             RTR demo  (fun-try it out)
    
    To prepare for the course you need both the student guide and the
    instructor guide.
    
    Look for an overhead package to be posted next week.
     
    
158.23. "Y" by SUPER::SUPER::TARRY () Thu Jul 08 1993 10:00  5 lines
    Discussion regarding the course Building Dependable Systems has been
    moved to the notes file.
    
    SUPER::VMS_PERFORMANCE