
Conference koolit::vms_curriculum

Title:	VMS Curriculum

Moderator:	SUPER::MARSH

Created:	Thu Nov 01 1990
Last Modified:	Sun Aug 25 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	185
Total number of notes:	2026

158.0. "Building Dependable Systems -- general discussion" by SUPER::MATTHEWS () Wed Nov 04 1992 09:25

    This note is for discussion of the OpenVMS System Availability and
    Integrity course.
    
    					Val

T.R	Title	User	Personal Name	Date	Lines
158.1	Strategies and Requirements Doc -- Feedback Please	TANG::RHINE		`Wed Nov 04 1992 09:31`	191
	|d|i|g|i|t|a|l| TM                      INTEROFFICE MEMO

	TO:   Dick McCarthy                     Date: 20-Oct-1992
	      Pete Buswell                      FROM: Jack Rhine, Bill Simcox
	      John Coffey                       DEPT: Services Development
	      Jim Malanson                            and Training
	cc:   Rick Wardrop
	      Bob Sowton
	      Jim Stewart

	SUBJECT: Qualification of OpenVMS System Availability and Integrity Course

	As part of the OpenVMS System and Network Management Mastery Series effort, we have a core curriculum of three system and network management "generalist" courses that teach the system management skills necessary to successfully manage an OpenVMS system that is part of a VAXcluster and/or is in a networked environment. This core curriculum is followed by a set of "specialist" courses in areas such as performance, security and troubleshooting. We would like to augment this set of specialist courses with an additional course in OpenVMS System Availability and Integrity. This course will provide experienced system managers and technical data center managers with the skills they need to define their requirements for system reliability and integrity, identify performance and cost tradeoffs, and translate them into a specification of requirements and a prototype implementation plan for their site. This course will introduce failure prediction using DECamds and other tools. Complex and multi-site configurations will be discussed. This course will include labs and case studies. There is a large installed OpenVMS customer base and a growing percentage of that customer base is becoming more concerned about mission critical applications and 24 by 7 operating environments. Given the industry trend for high end customers' willingness to pay for services and training that leverage continuous operations, this course would seem to have very focused appeal and can be offered at a premium price.
Further, this course would result in a NEW offering (not a replacement or updated offering) that would net incremental revenue. A P/L analysis was done using an estimate of 50 worldwide offerings per year with 10 students each and a constant 3 year model, FY93 - FY95. The MLP is assumed to be US$1995. The analysis shows a 3 year Total Area Margin of about $1,000,000 @ 38% against a development and update expense of $120,000 over the same 3 year period. This project will be funded as a result of deferring OpenVMS for Programmers II until next fiscal year pending further study of how to restructure the OpenVMS Programming curriculum. We believe that the financial benefit of the reliability and integrity course will be more immediate. The Strategies and Requirements Specification is appended below. Please review the objectives of the course, scope of work, and qualify viability of the offering so that we may start development as quickly as possible to meet a projected end of May introduction. We would appreciate your response within two weeks. Regards, Strategies and Requirements OpenVMS System Availability & Integrity 5-Day Lecture/Lab DESCRIPTION This course will provide experienced system managers and technical data center managers the skills they need to define their requirements for system reliability and integrity, identify performance and cost tradeoffs, and translate them into a specification of requirements and a prototype implementation plan for their site. This course will introduce failure prediction using DECamds and other tools. Complex and multi-site configurations will be discussed. This course will include labs and case studies based on actual experiences of DEC customers with mission critical applications. OBJECTIVES * Determine mean time between interruptions for mission critical applications and define relevant single and multi-site system configurations. 
	* Develop a prototype remedial process for a mission critical site
	* Perform predictive failure analysis using DECamds, operating system tools, and Symptom Directed Diagnosis tools
	* Define available application features that leverage availability and data integrity and their relevance to common application scenarios.

	TARGET AUDIENCE

	Experienced system managers who are responsible for mission critical systems, technical data center managers, and application designers who design mission critical applications that require high availability and data integrity.

	TOPICS

	* Operational definitions of reliability, availability and data integrity
	* Cost and performance tradeoffs
	* Computation of mean time between interruptions for complex applications
	* Determination of single points of failure, their impact to MTBI, and how to minimize that impact
	* Failure prediction and troubleshooting in mission critical environments
	* Development of remedial processes for mission critical applications
	* Backup and fast recovery in 24 by 7 environments
	* Application data integrity and availability features in OpenVMS
	* Case studies and labs
	* DEC Mission Critical Services

	Note: These topics were distilled from a more detailed list developed by several subject matter experts. A preliminary, more detailed, topic list is attached.

	SUMMARY OF REQUIREMENTS

	* 5-Day Lecture/Lab
	* Material will be modular
	* Case study approach combined with labs using software fault insertion
	* Selected training centers having extensive TBD lab resources should be targeted for this course
	* SYSNET III is a prerequisite; Performance and Troubleshooting are recommended prerequisites

	Potential Content of an OpenVMS Reliability, Availability, and Integrity Course resulting from a brainstorming session with subject matter experts:

	1. Define the above and other terms
	2. Discuss tradeoffs impacting the above, performance, and cost.
	3. Discuss evaluation of needs, MTBI measurements, can't exceed the weakest link, etc.
4. Discuss areas of a system that can be impacted, and those that cannot. i.e. design issues like no parity memory that are inherent weaknesses. 5. Determine single points of failure and how to minimize them in standard environments, i.e. typical workstation, cluster, mainframe. 6. How to configure for availability - redundancy, hot spares - ease of management issues - performance tradeoffs - multi site (MDF and other) approaches (ability to move users, data to a different site for quick recovery) - storage management * RAID, including striping and shadowing - faster rebooting - network vs. cluster file services - "clusters of clusters" 7. Application techniques - DECtp, two phase commit - journalling, checkpointing - failover 8. How to predict failure using AMDS and other tools (Polycenter?) 9. Backing up 24 x 7 systems and data recovery in this environment 10. The remedial process - How to work around failures * hot spares * determine applications that HAVE to run when there is reduced capacity - Troubleshooting in a high availability environment * hot systems * site specific process, who does what * how to get DEC involved 11. Digital Services - Why mission critical services - other DEC service offerings DEC mission critical sites such as Bellcore and MCI are potential for case studies. Lab exercises with AMDS are a possibility.
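The P/L figures quoted in the memo can be sanity-checked with a few lines of arithmetic. The sketch below is only a rough check under two assumptions that the memo does not state: that revenue is counted at the full list price, and that "@ 38%" means margin as a fraction of that revenue. The function names are made up for illustration.

```python
# Inputs taken from the memo.
OFFERINGS_PER_YEAR = 50
STUDENTS_PER_OFFERING = 10
LIST_PRICE_USD = 1995
YEARS = 3
DEVELOPMENT_COST_USD = 120_000

# Assumption (not stated in the memo): margin is 38% of list-price revenue.
MARGIN_RATE = 0.38


def three_year_revenue() -> int:
    """Gross revenue over the constant 3-year model, at list price."""
    return OFFERINGS_PER_YEAR * STUDENTS_PER_OFFERING * LIST_PRICE_USD * YEARS


def three_year_margin() -> float:
    """Margin over the same period under the assumed 38% rate."""
    return three_year_revenue() * MARGIN_RATE


if __name__ == "__main__":
    print(f"revenue: ${three_year_revenue():,}")
    print(f"margin:  ${three_year_margin():,.0f}")
    ratio = three_year_margin() / DEVELOPMENT_COST_USD
    print(f"margin / development cost: {ratio:.1f}x")
```

Under these assumptions the margin comes out somewhat above the memo's "about $1,000,000", which would be consistent with discounts or other costs that the memo does not itemize.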
158.2	A question -	SUPER::MATTHEWS		`Wed Nov 04 1992 17:36`	6
	I realize the tools are VMS-specific, but a lot of the topics are not. Is there any thought of making this a multi-OS course? Or at least separating the generic from the VMS-specific material as is being done in the security training? Val
158.3	RE: Generic Availability and Data Integrity Suggestion	TANG::RHINE	VMS Training Product Manager	`Wed Nov 04 1992 18:13`	7
	Val, your suggestion is a good one to look at. I know that there are products in the OSF space that provide some of the VMS availability and data integrity tools. If you know someone in the OSF Course Development world that could look at the memo in .1 and provide feedback from the point of view you have suggested I would appreciate it.
158.4	When and how complex	SOAEDS::TRAYSER	Seniority means a bigger shovel!	`Mon Nov 09 1992 23:04`	15
	1 basic question and 1 comment... What is the time table for this course? Q3 development with Q4 delivery or further out? Having looked over the outline of the material I see this as a perfect course to either be a prerequisite or a parallel course offering to VMS Troubleshooting. I saw nothing in the outline that would make me believe that I wouldn't understand it if I hadn't taken Troubleshooting. This might be an appropriate course for the MIS series--no significant VMS issues, possibly suitable for Windows NT, Unix or other operating systems. I think Val's comment was right on the money; don't tie it too tightly to VMS or we limit our options. $
158.5	RE:.-1	TANG::RHINE	VMS Training Product Manager	`Tue Nov 10 1992 07:37`	8
	Buck, we are looking at Q3 development and Q4 delivery. The reason that I suggested troubleshooting as a possible prerequisite is that troubleshooting in a high availability environment, which I believe is an important topic, could layer on knowledge and skills that are taught in the troubleshooting course you are about to pilot. The prerequisite issue should be decided after the content is firmed up.
158.6	New name, New format	SUPER::SUPER::TARRY		`Wed Jan 13 1993 11:51`	21
	There has been some progress on this new course. The name has been changed to

	    Building Dependable Systems         - Generic chapters only
	    Building Dependable OpenVMS System  - Generic + VMS specific

	There will be two part numbers. The first course will contain only the generic chapters and will not have laboratory exercises. The OpenVMS specific course will have laboratory exercises. I will be posting pointers to the project plan as soon as it is approved.
158.7	Chapter 1 Ready for Review	SUPER::SUPER::TARRY		`Fri Jan 22 1993 14:36`	53
	The first draft of chapter 1 is ready for review. This chapter is a generic chapter and contains:

	    Terms
	    Levels of dependable system
	    Defining business requirements

	There is a figure to be added at one point. It will show the following components: hardware, software, environment and humans. There is still one block to be written on tradeoffs between data integrity, performance, and cost. There is a very extensive case study at the end of the chapter.

	Copy the files from:

	    SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_INSTRUCTOR.PS
	                                      VMS_DS_1_INTRO_STUDENT.PS

	Note the following: Each chapter has a ps file for the instructor and a ps file for the student. You will need to print and read both. The student material is not in the instructor manual. Please do not send comments on the format of the materials. I have no control over format. Send comments to SUPER::TARRY by 1-Mar-1993.

	Other chapters planned for the course, not necessarily in the order in which they will appear, are:

	    Introduction to Dependable Computing    Generic
	    Environmental factors                   Generic
	    System Configurations                   Generic
	    Mass storage                            Generic
	    Configuring OpenVMS System
	    Avoid Human Errors                      Generic
	    Managing Dependable OpenVMS Systems
	    Disaster Recovery                       Generic
	    Data Integrity                          Generic
158.8	Revised Chapter 1 Posted.	SUPER::SUPER::TARRY		`Thu Feb 18 1993 14:12`	17
	One person did provide some review on chapter 1. I have revised the chapter according to the suggestions and placed a new copy for review. SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_STUDENT.PS _INSTRUCTOR.PS Be sure to pull and print both versions. There are two case studies in this chapter. One is very short. Chapter 3 on the environment is finished, but there are so many figures the chapter makes little sense without them. As soon as preliminary figures are ready I will post chapter 3. Chapters 2 and 4 are in the final stages of development. They should both be posted early next week. I sure do need more reviewers.
158.9	Chapters 2 and 4 for review	SUPER::SUPER::TARRY		`Mon Feb 22 1993 14:49`	19
	Building Dependable OpenVMS Systems Chapters 2 and 4 have been posted for review. Chapter 2 discusses strategy in general and chapter 4 discusses hardware strategy. This includes fault-tolerant systems, clusters and RAID. Both are generic chapters. SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_2_STRATEGY_STUDENT.PS _INSTRUCTOR.PS VMS_DS_4_HARDWARE_STUDENT.PS _INSTRUCTOR.PS
158.10	Need Reviewers Please!!!	SUPER::SUPER::TARRY		`Thu Mar 04 1993 18:14`	28
	Chapters 1-4 have been reposted in the directory: SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS _INSTRUCTOR.PS To review these chapters you must pull and print both the student and instructor versions. Chapter 3 is posted for the first time. Chapter 1 is ready for the pilot. Chapters 1-4 are operating system generic chapters. The final chapters will look like this: Chapter 5 OpenVMS configurations that support high availability Chapter 6 Managing OpenVMS systems for dependability Chapter 7 Managing the data center (generic) Laboratory Exercises for OpenVMS I am looking for a funded reviewer and a pilot instructor for this course. I am really desperate for some instructor review.
158.11	Chapters Posted for Review	SUPER::SUPER::TARRY		`Wed Mar 17 1993 14:09`	18
	Chapters 1-5 and 7 are posted for review in: SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS VMS_DS_#_name_INSTRUCTOR.PS Note that to review the materials you must obtain both the instructor and student material and read them at the same time. Some chapters are not finished. The pilot has been scheduled for 10-May at PKO. Dave Maxwell has agreed to be the instructor. Thank you for volunteering! I just must tell you that I am off for a vacation in Costa Rica. I will be back 8-April. Please have review comments posted by 8-April.
158.12	Beginning reviews of the material (starting with module 2)	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Mon Apr 05 1993 02:24`	65
	I've browsed the Student Materials and read the "Plan" carefully. I see a very basic problem with the material--I don't think it fits well in the SysNet curriculum. I think of my SysNet 3 students (according to the prereq's they need SysNet 3 before taking this course) and I cannot envision them enjoying this class. The material here is more attuned to the MIS director or Senior System Manager--people responsible for setting up data centers, not for your average advanced system manager. Do we have material designed and written for the right audience? Do we have a plan to get the correct students to this class?

	I've already reviewed most of Chapter 1 via 'hardcopy', so I'll begin entering Chapter 2 info --

	Nothing notable in the Instructor's page except 2-13 "studnet" instead of "student".

	Student Guide:

	2-5 - 10th bullet, good analogy, but out of context. Move the highway example to the instructor's page.
	    - definition of 'Failure', "...whose effects cannot be contained." Huh? This is a little fuzzy, how about a better definition?!
	2-7 - Very good analogy with the suitcase example! Clear, to the point and covers the concept well.
	2-8 & 2-9 - Urgh. These pages don't flow well. The title is "Dependability Strategies", but there are no "strategies" on this page, only definitions and examples. Please put a few STRATEGIES here, then support them with examples. Also, the examples are incomplete, overly complex or technically wrong:
	    The three bullets on the top of 2-8 cover "Fault prevention", "Error correction" and "Failure recovery", but the examples cover "Error correction" twice and there is no example for "Fault prevention" (which in my opinion is the most critical).
	    The CRC/XOR example is overly complex and, although interesting, makes for tedious lecture by those not entirely familiar with the process. Please note that the discussion only refers to the SOFTWARE supplied CRC mechanism, not the more useful and quicker HARDWARE implemented CRC.
Move it to the Instructor's page. And lastly, there are slight errors in the Backup example: Numbered item "1" under the "When data is backed up to tape..." has the term Checksum confused with CRC. There is a CRC algorithm used to calculate a Checksum for the header block, but the DATA blocks do not refer to the CRC that is written in them as a Checksum. These are two distinct fields in the Backup tape layout structure (known as BBH$L_CRC and BBH$W_CHECKSUM). Numbered item "2" under the "When data is backed up TO tape..." discusses READING from the tape. It should instead explain that during the writing of the Group a redundancy block is calculated and written at the end of the Group of blocks. Last paragraph should state that if the CRC cannot solve the problem, then the redundancy block is used. More later... $
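The review above points out that a checksum and a CRC are distinct check values held in separate fields. As a rough illustration of why a format might carry both kinds, the sketch below compares a simple additive checksum with a CRC on a block where two bytes have been transposed. It is not BACKUP's algorithm: the 16-bit additive sum and the standard-library CRC-32 are stand-ins chosen for the example, and the sample data is invented.

```python
import zlib


def additive_checksum16(data: bytes) -> int:
    """Sum of all bytes, truncated to 16 bits (insensitive to byte order)."""
    return sum(data) & 0xFFFF


def crc32(data: bytes) -> int:
    """CRC-32 from the standard library, used here only as a stand-in CRC."""
    return zlib.crc32(data) & 0xFFFFFFFF


original = b"SAVE SET DATA BLOCK"
swapped = b"SAVE SET DATA BLCOK"  # two adjacent bytes transposed

# The additive checksum cannot see the transposition; the CRC can.
same_checksum = additive_checksum16(original) == additive_checksum16(swapped)
same_crc = crc32(original) == crc32(swapped)

if __name__ == "__main__":
    print("checksums match:", same_checksum)
    print("CRCs match:     ", same_crc)
```

A detected error still has to be repaired from somewhere else, which is the role the review assigns to the redundancy block written at the end of each group.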
158.13	continuing chapter 2 review	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Wed Apr 07 1993 02:46`	89
	More from chapter 2...

	Basic problem with this chapter is lots of new terms introduced but not fully defined. Don't use an example until the item introduced has been properly and clearly defined. The examples, although many of them are good, belong on the instructor's pages, not the student material.

	(No significant comments on the Instructor's pages, what follows is the student material review)

	2-11 -- 2nd sentence is out of place in the lecture, this should have been covered back near page 2-8.
	     -- Time Redundancy, what is it? Need a clear definition.
	     -- 5th paragraph doesn't seem to have anything to do with Time Redundancy. Also, change 'can' to 'might'.
	2-12 -- No discussion of Software Redundancy and how it complements Hardware Redundancy. Examples might be RMS Journaling, VAXsim, etc.
	     -- 2nd bullet, use VAXcluster nodes as a redundancy example and move the spare tire example to the instructor's page.
	2-13 -- What are "N+1" and "2N"? This idea and notation are not clearly defined. Is this 'industry' notation or 'DEC' notation?
	2-14 -- reverse bullets 5 & 6...determine "work-a-rounds" and their 'costs' (including time, money and other impact) and THEN define a recovery procedure. The recovery procedure MIGHT be the work-a-round.
	     -- #3, "...are the most EFFECTIVE FOR YOUR NEEDS", since cost-effective solutions might not be the one we need, we might need the FASTEST, regardless of cost.
	2-15 -- The description on 2-15 indicates a 6000-510, diagram on 2-16 shows a 6000-520. Change diagram to show 6000-510.
	2-16 -- If this is a 6000-510 then ERASE CPU-1 from the diagram leaving only CPU-0.
	     -- KDM70 should have "#1" removed (see 2-21).
	     -- Indicate Data A and Data B on the disks (see 2-21).
	     -- Ethernet segment isn't connected to the DEMNA, move the line over to line up beneath the DEMNA (see 2-21).
	2-17 -- #1 should indicate 6000-510
	2-19 -- #1b, drop reference to Digital Service and replace with "Call maintenance" like the others.
	        For that matter, just drop the "b" question if the only significant "recovery" is to call service. I'd feel really stupid saying "call service" for all the recovery solutions.
	     -- #1c, the solution is to add another CPU in the cabinet, NOT add another system (see 2-15, last sentence).
	     -- #2a, "...using the console OR BATCH."
	     -- #2b,c if we had discussed Software Redundancy earlier we might have mentioned having Virtual Terminal Support turned on (via SYSGEN) and just reconnecting to the disconnected process via the second Ethernet controller or the console.
	     -- #3a, this assumes the entire controller fails, rather than just a port or cable in which case switching cables or ports is a solution.
	     -- #3c, this requires human intervention...must move cables to the good controller when it fails. Consider an HSC and dual ported disks.
	     -- #4c, "N+1" just hanging out at the end of the answers. Is this intentional? If so, it looks ugly. Either explain it or drop it.
	2-20 -- #5c, consider RMS Journaling.
	     -- #6a, "...to do backup" or ANY other activity, such as file recovery, tape journaling, software installations, etc.
	     -- #6c&d - Where did "d" come from? And "c" has a question not posed on 2-18.
	2-21 -- both DECservers are labeled "#1", just drop the numbers.

	More later... $
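The review asks what "N+1" and "2N" mean. In common industry usage (an assumption here, since the course's own definition is not quoted in this thread), N is the number of units needed to carry the load, "N+1" installs one spare, and "2N" installs a full duplicate set. The sketch below compares the three under two further simplifying assumptions: units fail independently, and any N working units out of those installed are enough.

```python
from math import comb


def k_of_n_availability(needed: int, installed: int, unit_availability: float) -> float:
    """Probability that at least `needed` of `installed` independent units are up."""
    a = unit_availability
    return sum(
        comb(installed, up) * a**up * (1 - a) ** (installed - up)
        for up in range(needed, installed + 1)
    )


def compare(n: int, unit_availability: float) -> dict:
    """Availability of the bare N, N+1 and 2N configurations."""
    return {
        "N": k_of_n_availability(n, n, unit_availability),
        "N+1": k_of_n_availability(n, n + 1, unit_availability),
        "2N": k_of_n_availability(n, 2 * n, unit_availability),
    }


if __name__ == "__main__":
    for name, value in compare(n=2, unit_availability=0.9).items():
        print(f"{name:>3}: {value:.4f}")
```

With deliberately poor 90% units, the single spare in N+1 already removes most of the exposure, which is one way to frame the cost tradeoff the chapter is reviewed for. A 2N design built as two separate complete sets, where spares cannot be pooled, would need a different formula.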
158.14	Last of 2	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Thu Apr 08 1993 01:42`	45
	More from chapter 2...

	Instructor's pages prefixed with an "I", such as I2-4 is page 2-4 Instr. Guide.

	2-22 -- These are 3 strategies, but what I'd like here is a problem statement. Why do I need to know about these? Will they add redundancy? Will they describe a potential failure? Use this page to 'setup' the next page. (Or, put the 'setup' on I2-17.)
	2-23 -- There is an assumption here that a "process" defines a "server" or a "client". "Process" is strictly a VMS concept, i.e. a 'server' such as a VXT doesn't use a process (yes, trust me, a VXT is generally a 'server' and not a 'client'). On PCs, either of these concepts can be implemented as a driver, not a process. (Same problem on I2-18)
	2-25 -- ACID test not properly introduced. It wasn't until I read it a second time that I figured out that ACID was an acronym. The definition may be technically correct, but I'm not really sure what to do with it. I think it needs a better intro.
	     -- Before-Image Journaling, nit -- "...cannot complete, the...", a comma is missing. Also, care to define the types of journaling, such as After-Image Journaling?
	2-27 -- Arrows on diagram not clear. I assume they represent the network or some "remote access".
	2-30 -- 1st sentence, term "front-end" not defined.
	2-31 -- More clearly mark the answers. Since the questions are repeated on this page, at first glance I didn't see the answers, just the questions!

	---------

	I2-19 -- I like the analogies being on the Instructor's page, thanks!
	I2-21 -- A good example of a distributed application is VAX Notes. The user interface runs locally and the compute/storage functions are on the conference host system.

	Chapter 3 is next... $
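The review asks for a better introduction to before-image journaling and the ACID test. As one possible way to make the idea concrete, the toy sketch below saves the old value of each item before changing it, so that a transaction that cannot complete is undone as a whole (the atomicity part of ACID). It is an in-memory illustration only; the class and function names are hypothetical, and it is not the RMS Journaling interface.

```python
class BeforeImageJournal:
    """Toy in-memory store that records old values so a transaction can be undone."""

    def __init__(self, data):
        self.data = dict(data)
        self._journal = []  # entries of (key, key_existed, old_value)
        self._in_transaction = False

    def begin(self):
        self._journal.clear()
        self._in_transaction = True

    def write(self, key, value):
        if not self._in_transaction:
            raise RuntimeError("write outside a transaction")
        # Save the before-image first, then change the data.
        self._journal.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        self._journal.clear()
        self._in_transaction = False

    def rollback(self):
        # Undo in reverse order so the earliest before-image is applied last.
        for key, existed, old_value in reversed(self._journal):
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)
        self._journal.clear()
        self._in_transaction = False


def transfer(store, src, dst, amount):
    """Move `amount` between two balances; either both writes happen or neither."""
    store.begin()
    try:
        store.write(src, store.data[src] - amount)
        if store.data[src] < 0:
            raise ValueError("insufficient funds")
        store.write(dst, store.data.get(dst, 0) + amount)
        store.commit()
        return True
    except ValueError:
        store.rollback()
        return False
```

After-image journaling, which the review also mentions, works the other way around: it records the new values so that committed changes can be reapplied to a restored copy of the data.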
158.15	Chapter 3, needs a fair bit of work	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Fri Apr 09 1993 01:38`	177
	Overall I was a little depressed with this chapter. It addressed large data centers, but not office based systems. It discussed physical security but no mention of software security. There are numerous terms used that are not defined or described. There is a weird discussion on power and electricity and no discussions on Static electricity and its effects. Basically a ragged chapter with occasional interesting topics intermixed with esoteric wanderings.

	(Instructor pages denoted with leading "I", such as I3-14)

	3-4  -- term EMI, can YOU define 'waveform' to a system manager with only a high school education?
	     -- term NOISE, please don't define a term with another term that has yet to be defined!
	     -- term NOTCH, OK, so I've got a small vocabulary. What's "SUBTRATIVE" supposed to mean to me?
	I3-5 -- 2nd paragraph from bottom, I claim ignorance. Three-phase comparison to single-phase needs a bit more explanation for those of us not strong in electronics.
	3-6  -- spelling, second line "avilable" is missing an "a".
	I3-6 -- "sensitive" paragraph -- oh, please! give me a break. And there are students opposed to Dams because of flooding and keeping salmon from migrating. And there are students opposed to fuel-fired plants that spew pollutants into the air. Drop the reference, I find it slightly demeaning.
	     -- 6th bullet, "jepordize" needs an "o" in front of the "p".
	3-7  -- Ha! Please remember that many system managers have never taken college courses so may not have had 'trig'. Terms and phrases such as 'conductor', 'AC Sine Wave', 'current is induced' are beyond many students. To cover them in this 'matter-of-fact' material is not acceptable.
	     -- Hmmm, why does figure B still have a current flowing from right to left as indicated by the "+ and -"??
	3-8  -- Gag! Is this material really necessary? I can teach this course quite successfully without this page. And if I was to teach this material I couldn't answer any questions on it.
	        Neither my degree nor my 20 years of computer experience has ever taught me the details of three-phase power, so if you expect the average SOFTWARE instructor to teach this material successfully, I believe you are setting us up for a failure. Drop pages 3-7, 3-8 and 3-9.
	3-9  -- Useless, drop it.
	3-10 -- 9th 'paragraph', care to tell us why the substation is getting power from more than one circuit?
	     -- last paragraph, "At cost,", huh? Do you mean to say "At an increased expense,"?
	3-11 -- Diagram: What are the triangles? What are the boxes with dots in them? What is a "power pool"? What are the lettered circles (A, B, etc.)?
	     -- #2, I know Texas is big, but "WHY" is it separate? I'm sure students will ask, it's just too obvious!
	I3-7 -- 2nd paragraph from bottom, 2 mispelins, "pulic" and "invester".
	3-12 -- 6th bullet, spelling of "compnay"
	I3-8 -- Item #10, care to define "Load Shedding"? I assume it means cutting service.
	     -- It's not clear on pages I3-8 and I3-7 what parts are actually taken from the quoted sources. Please indent, italicize or otherwise change the font for quoted material.
	3-13 -- 3rd bullet, "oscilating" needs a second "l".
	I3-9 -- 3rd "paragraph" from bottom, should "...1 minute or more..." be "...1 minute or LESS..."??
	3-14 -- Neither the diagram nor the description on the instructor's page (I3-11) was clear or useful. Drop it!
	3-15 -- 3rd paragraph, how exactly does one spread the load evenly over 3-phase power?
	I3-11-- 2nd paragraph under Voltage tolerance, spelling: change Emphasis to Emphasize.
	3-16 -- 3rd paragraph under motor generators, might add an instructor note that PC and small VAX systems can usually ride the short power sags without any special hardware requirements.
	I3-12-- 2nd paragraph of motor generators, What "flywheel"?
	     -- last paragraph of motor generators, Huh? how can sustaining power for an extra 1/10th of a second be enough for a "smooth" shutdown of computer equipment? Never heard of this!
	3-17 -- bullets, add Power Duration and Recharging Time as two other considerations
	3-18 -- First paragraph puts all computers in the same situation. My Laptop generates very little heat and doesn't need any special air conditioning.
	3-19 -- 1st line, we haven't defined solid-state circuitry and where it is found. Cars have Solid-state equipment and work fine on the hot asphalt parking lots. What devices are being referred to?
	     -- 3rd and 4th paragraph should only be one paragraph.
	     -- 2nd bullet is poorly worded. It implies that I don't have enough power to start with.
	     -- 3rd bullet implies I can buy a fan at Kmart and cool my computer room. Mention portable A/C units that exhaust the hot air into the suspended ceilings (no joke, we've used them before).
	     -- The last paragraph "writes off" all the issues regarding PC and other office-based workstations. I believe this is a major shortcoming of this material.
	3-20 -- Only the first section on this page relates to Physical Security, the other issues are Environmental issues. Also, it ignores small systems again, assuming computers are kept in computer rooms.
	     -- 2nd paragraph, "controlled...access", yeah, I call it a "door". Be more specific.
	     -- 3rd paragraph, why no windows? Reflective or mirrored windows are a feature I suggest to managers. Having a view to the outside world is a GREAT way of reducing stress, which computer people seem to easily generate.
	     -- Keeping data center clean, 1st paragraph -- printers in computer rooms are not as large an issue as they used to be. The 'lint' or 'paper dust' was significantly reduced as shops moved from the high-speed, continuous form impact printers to laser printers.
	     -- Keeping data center clean, 4th paragraph -- This is a MAJOR misunderstanding of VCS systems. They were originally sold in VAXcluster environments, but can have HSCs, Unix systems, stand-alone VAX systems, Alphas, 3rd-party systems, etc. connected to the VCS.
	        The name is a poor name and should be changed to reflect its current capabilities.
	     -- Water Problems, 1st paragraph -- "Sorry, Egon. I'm a little fuzzy on this 'bad' thing. Why is it bad to cross the streams?" replace "very bad" with "can be a problem". I've used water to clean soda out of a VT200 keyboard before. Bad is subjective.
	     -- Water Problems, 3rd paragraph -- "SOME air conditioning units...", not all are water cooled anymore. Liquid Nitrogen and Hydrogen, Freon and other coolants are popular today.
	3-21 -- So, what is "Power Conditioning System Plus"? Strange header for the page. Is it a product or service? Is it a concept? Does DEC sell it?
	     -- 2nd paragraph, what's a H731x? What's VAX REMS, I've never heard of this. How about a description. If this is a product pitch move it to the appendix and present the "concepts" of the process here.
	3-22 -- What's a REOP/EMS? What's a J-Box? What's a RSU? What's the box on the floor near the J-Box that looks like it's 1/2 open?

	Basically a chapter I'd cut-and-paste and deliver using only about 6-8 pages. Some stuff is interesting, but the audience is wrong if we are selling this course to System Managers. This stuff is more for Data Center Managers or MIS directors.

	Oh, and finally, many pages are using more than 75% of the page. This might be OK for reference material or Self-paced material, but for lecture lab we like enough space at the bottom of the page to take notes.

	More later... $
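The 3-15 comment asks how one spreads load evenly over 3-phase power. One way to see why an even spread matters is to add up the three phase currents as vectors 120 degrees apart: when the loads are equal the sum returning on the neutral conductor cancels to zero, and when they are not it does not. The sketch below assumes a wye-connected supply with purely resistive loads, which is a simplification not taken from the course material, and the function name is made up for the example.

```python
import math


def neutral_current(i_a: float, i_b: float, i_c: float) -> float:
    """Magnitude of the neutral current for resistive loads on three phases.

    Each phase current is treated as a vector; the phases sit 120 degrees apart.
    """
    angles = (0.0, -2 * math.pi / 3, 2 * math.pi / 3)
    currents = (i_a, i_b, i_c)
    real = sum(i * math.cos(t) for i, t in zip(currents, angles))
    imag = sum(i * math.sin(t) for i, t in zip(currents, angles))
    return math.hypot(real, imag)


if __name__ == "__main__":
    # Same 30 A total load, distributed three different ways.
    for loads in ((10, 10, 10), (20, 10, 0), (30, 0, 0)):
        print(loads, f"neutral = {neutral_current(*loads):.2f} A")
```

In practice this amounts to assigning equipment to circuits so that each phase carries roughly the same current, which is a job for the site electrician rather than the system manager.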
158.16	Chapter 4 -- tools, but no real strategies	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Mon Apr 12 1993 01:41`	180
	Chapter 4

	Don't see the connection between title and contents--"Software Dependabilities Strategies". Sorry, but I saw no strategies, only concepts and tools.

	----------

	4-5  -- 1st sentence. Don't use the term "independently" to define the concept of "independently recoverable unit".
	4-6  -- Needs reorganizing. System Kernel Level Redundancy should precede Subsystem Level Redundancy, especially since 'kernel' is used before it is defined.
	     -- First section, CIRCUIT level redundancy not adequately defined.
	     -- 2nd sentence, makes no sense. What console processor? What is "its" referring to, the CPU or the Console CPU? How does it work?
	     -- 2nd section. Why 2 separate sentences? If they were merged they would form a coherent paragraph. It looks strange to break up a paragraph like that. OR, bring back the bullets that probably were there in a previous draft of this material.
	     -- 3rd section (other than moving it up on the page), 3rd bullet has an example to get the idea across, the first two bullets also need an example like MIRA or TANDEM, etc.
	     -- "Wide Area Redundancy" -- adding the term cluster system is redundant, it is a system. Drop the cluster reference, it is covered later.
	     -- Last paragraph could benefit from an example like Hurricane Hugo, where damage along the Carolina coast was extensive, but the tornadoes spawned by the hurricane traveled inland to cities like Charlotte doing excessive damage far from the 'strike' site. The March blizzard was a good example that even geographically separate locations need to be MUCH more separated to protect them from the same disaster.
	I4-6 -- 1st sentence, what does "such circuit level redundancy" mean? "Such" implies a definition preceded this sentence. I don't see it.
	     -- Last paragraph, "...nearest the caller Order placement..." doesn't make sense! Are we missing a word here?
	I4-9 -- 2nd paragraph from bottom, VMS also has this feature.
	I4-11-- This definition was needed in the previous chapters.
	        Move it forward to the first occurrence of "2N or N+1".
	4-11 -- Last paragraph, care to tell us a bit more about a "Watchdog Timer", such as part number, common uses, etc. On instructor's page would be fine.
	4-12 -- Diagram is in error. CPU is missing, MIRA watchdog timer not connected to the Q-bus, cable from DEQTA goes directly to the other DEQTA (have it go to a 'cloud' or "..." to indicate more hardware). What is the box labeled "Standard Q-bus Interface" supposed to be? Is this an old configuration? I don't see the DSSI hardware or disks.
	4-13 -- The entire page looks like it needs bullets! Or form them into paragraphs.
	     -- Triple Modular, Tightly Coupled and Hardware Intensive - give an example of the type of hardware required to configure to these descriptions. (Stratus reference on the instructor's page is probably OK for the last one.)
	4-14 -- Layout looks weird. Looks like bullets were removed and a paragraph would be more reasonable than the current format.
	     -- 2nd bullet "No single point of repair", this is a strange statement that the rest of the bullet fails to explain adequately.
	     -- 4th bullet "Self-checking checkers", ditto. How is this done? Can you give an example on the instructor's page?
	4-15 -- Neither 4-14 nor I4-15 adequately describes this diagram. What is "Mass STC", what is X-Link? Diagram shows 2 cables leaving single Ethernet controllers. Why does the Processor box have two little un-labeled boxes in it? What is the vertical, white box that connects the Mass STC, Ethernet and Console 'boxes'?
	4-17 -- Overall this page is a sales pitch, not providing anything worthy of a 'strategy'. This page really doesn't say anything worth lecturing on in a Dependable Systems class. How is it redundant? How should it be configured? What are the costs? Etc.!!
	     -- The middle section notes differences between VAXcluster and VMScluster. This information is old and will not be current when this material begins to ship.
As of Mid-May (VAX-VMS 6.0 and AXP-VMS V1.5) mixed architecture VMSclusters are supported. It might be worth noting previous configurations, but ALL the documentation will be changed to reflect that VMSclusters is the old VAXcluster concept on various combinations of VAX and AXP systems. -- Last sentence. Drop it or move it to the instructor's page. Having a VAX VMScluster V6.0 (AXP V1.5) SPD in the instructor's kit would be more useful than this line. I4-17 -- How can we get a copy of the Aberdeen Group white paper? For Instructors and/or for Students. 4-18 -- 2nd paragraph, "secondary storage" -- I don't think we've mentioned what "secondary" would be or how that differs from Primary storage. -- Last paragraph -- misleading. Seems to imply that RAID might be discussed in more detail in the Performance course. Please make it clear that VMS performance is more clearly defined and that RAID concepts (including performance) are covered in more detail here. I4-19 -- Any idea how to order the Berkeley paper on RAID? I4-20 -- Diagram is ugly. Either turn it on its side (landscape) or break it into two tables: one with the columns Level, Descr., Avail., Request rate, and a second with the columns Level, Data rate, Cost, Type of appl, each with one row per RAID level (0, 1, ...). 4-21 -- 2nd sentence, need to define "chunk" better -- "A chunk is the I/O size of RAID that usually consists of several blocks of data", etc. -- Availability, MTBF? Has it been previously defined? I think so, just can't find it. -- Diagram, why is "B" shaded? 4-22 -- Reference of RAID 0 and RAID 1 being combined is good, but belongs AFTER we have described RAID 1, possibly at the bottom of the page. Otherwise it seems that the Avail/Perf/Storage section is defining how RAID 0/1 works! -- Also need to stick with either "RAID 1" or "RAID-1" spelling. 4-23 -- Why is "B" shaded? 4-24 -- Under Performance, the 2nd and 3rd paragraphs - don't describe any performance issues. If they are describing how RAID-3 works, then move them up to the top of the page. 
-- Under Performance, 1st sentence, "...workload is a near sequential...", huh? Don't understand the sentence. 4-25 -- Can't read the labels in the diagram. Probably need to clear the background shading with the D⊕E⊕F like the triangle on top of the page. -- Dark lines going to the far right disk imply only blocks A & C are being referenced, add a 3rd line connecting "B" to the far right disk. 4-26 -- Bullets 5-8 don't make sense. How can we write a D⊕E⊕F XOR chunk before we have written "F"?? 4-28 -- Reed-Solomon, never heard of it. Where is it defined in the class materials? How can I get more info on this? -- 3rd & 4th bullets, what is in these blocks? Probably Reed-Solomon stuff, right? And if a student asks, I can add nothing to the discussion. Either more detail or drop all references to RAID-6. 4-31 -- What is this page doing here? Why is RAID discussed in such detail here? 'Can' it. 4-33 -- Only exercise is on disks? What about clusters? MIRA? Other concepts discussed? Considering we don't have products that cover RAID 3, why spend 67% of the lab working on it? Overall some interesting topics, but not focused around Strategies. Seems like a Marketing tour since we only discuss concepts rather than configuration issues: what disks should be RAID-ed, how is Cluster maintenance performed without halting the cluster, how about combining VAXsim with shadow disks, etc.
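On the 4-26 question about the XOR chunk, a small sketch may help in class. This is not taken from the course materials; it only illustrates the general XOR parity idea, and the function name `xor_chunks` and the sample chunk contents are made up for illustration. It assumes the parity chunk is computed in memory from the data chunks of a stripe, so the parity value can exist before every data chunk has reached disk. Whether the bullets on 4-26 intend that ordering is something the course writers would need to confirm.

```python
from functools import reduce


def xor_chunks(*chunks: bytes) -> bytes:
    """XOR equal-length chunks together, byte by byte (hypothetical helper)."""
    if len({len(c) for c in chunks}) != 1:
        raise ValueError("chunks must all be the same length")
    # zip(*chunks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Illustrative data chunks for one stripe (contents are arbitrary).
d, e, f = b"DDDD", b"EEEE", b"FFFF"

# The parity chunk is computed from the chunks held in memory.
parity = xor_chunks(d, e, f)

# If the disk holding E is lost, E is rebuilt from the survivors plus parity.
rebuilt_e = xor_chunks(d, f, parity)
assert rebuilt_e == e
```

The same property works for any one missing chunk, which is why a single parity disk can cover the loss of any single data disk in the stripe.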
158.17	I still owe you a few more chapters, but "thanks"!	SOAEDS::TRAYSER	Seniority: Big Shovel, Less Breaks!	`Tue Apr 20 1993 22:44`	10
	Emmalee, Thanks for the mail commenting on all my comments. It's nice to know the course writers really do read this stuff. I even appreciate your "I disagree with you" comments, where you keep the page as you have it but put my comments on the instructor's pages -- a quite acceptable solution. Thanks!
158.18	Course Update	SUPER::SUPER::TARRY		`Thu May 13 1993 15:49`	45
	PLEASE NOTICE THIS IMPORTANT CHANGE! The name of the directory ES$REVIEW is now IDC$REVIEW. The chapters for review are posted in: SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS _INSTRUCTOR.PS Again let me remind you that you must pull both student and instructor versions. We no longer have facing pages. The name of the course has been changed to: Building Dependable Systems Using OpenVMS Products. The course is coming along pretty well. By tomorrow, 14-May, I will post updated chapters 1, 2, 3, 4, 5 and 8. (Some requested changes to the drawings are still in progress. All VAX 9000's are being removed and some of the lines are being corrected.) Many thanks to Buck Trayser for his thoughtful and always appreciated comments. The remaining chapters are: Chapter 6, Distributed Software Fault Tolerance; Chapter 7, OpenVMS Products for Dependable Systems; and lab exercises, which will include the following: Performing a Rolling Upgrade of a VMScluster system; Building a shadowed system disk; Performing backup by breaking a shadow set; Backup data integrity exploration; Writing a help library module; Solving cluster problems using DECamds. Now is the time to review these chapters and prepare to teach this innovative course.
158.19	Chapter 6 posted for review	SUPER::SUPER::TARRY		`Fri May 21 1993 13:52`	37
	A new chapter 6, Distributed Software Fault Tolerance, has been completed and posted in the IDC$REVIEW directory SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_6_DSFT_STUDENT.PS _INSTRUCTOR.PS This chapter discusses a concept of providing fault tolerant systems which is new to me and very exciting. Distributed software fault tolerance uses distributed client/server configurations to provide fault tolerance so that the failure of a single node, network link or even an entire site does not interrupt the availability of an application. There is a very beautiful demo of this concept which is to be included in the media kit. It must be run on a workstation and of course you will need to be able to project the workstation. The chapter may not make the concept as clear without the demo. Also some drawings for the chapter have not been done yet. One point to emphasize is that nothing about distributed software fault tolerance negates anything that is in this course except perhaps providing standby generators in case of power failure. I think others will like this idea as much as I do. I will repost the chapter after the drawings are added and will try to make the demo available to instructors. There will also be some handouts on a successful implementation in Australia. Next week we will post the final chapters, which are: Chapter 7, OpenVMS layered products; Chapter 8, Managing complex system; Chapter 9, Laboratory exercises. With the lab exercises this is a 5 day course.
158.20	Pilot June 14	SUPER::SUPER::TARRY		`Fri May 21 1993 13:53`	3
	The pilot for Building Dependable Systems has been rescheduled to June 14. It will be held in ZKO instead of PKO. Students are now being enrolled.
158.21		BROWNY::GDAY::MAXWELL	Dave Maxwell	`Fri May 21 1993 14:09`	12
	Hello, I have placed a saveset in SUPER::IDC$REVIEW:[DEP_SYS]RTRDEMO.BCK You will need the files to teach the class. These are the files for doing the RTR demo. It is a real neat demo. I suggest you try it even if you don't teach the class. There are 4 files in the saveset. Print out the AAA_README.TXT file. It tells you how to run the demo. It does require a workstation. If you have a color terminal it is even nicer. Good luck, Dave
158.22	Building Dependable Systems Finished	SUPER::SUPER::TARRY		`Thu Jul 08 1993 08:59`	36
	The course is finished. It is available in 3 formats: EY-N438 Building and Managing Dependable Systems Using OpenVMS Products L/L 5 days Includes operating system generic materials, OpenVMS specific materials and laboratory exercises. EY-Q141 Building and Managing Dependable Systems Using OpenVMS Products Sem 3 days Generic and OpenVMS specific materials. EY-N439 Building and Managing Dependable Systems 2 days Seminar Generic materials only The course did not have a pilot. Materials are posted in the IDC$REVIEW directory SUPER::$1$DU6:[IDC$REVIEW.DEP_SYS] AMDS.BCK;1 Demo for DECamds EY-N438E-EX-0001.PS;3 Student lab book EY-N438E-IG-0001.PS;1 Instructor guide EY-N438E-SG-0001.PS;1 Student guide EY-N438E-TS-0001.PS;1 Pre test RTRDEMO.BCK;1 RTR demo (fun-try it out) To prepare for the course you need both the student guide and the instructor guide. Look for an overhead package to be posted next week.
158.23	Y	SUPER::SUPER::TARRY		`Thu Jul 08 1993 09:00`	5
	Discussion regarding the course Building Dependable Systems has been moved to the notes file SUPER::VMS_PERFORMANCE.