T.R | Title | User | Personal Name | Date | Lines |
---|
158.1 | Strategies and Requirements Doc -- Feedback Please | TANG::RHINE | | Wed Nov 04 1992 09:31 | 191 |
|
--------------- TM
| | | | | | | |
|d|i|g|i|t|a|l| INTEROFFICE MEMO
| | | | | | | |
---------------
TO: Dick McCarthy Date: 20-Oct-1992
Pete Buswell FROM: Jack Rhine, Bill Simcox
John Coffey DEPT: Services Development
Jim Malanson and Training
cc: Rick Wardrop
Bob Sowton
Jim Stewart
SUBJECT: Qualification of OpenVMS System Availability and Integrity Course
As part of the OpenVMS System and Network Management Mastery Series effort, we
have a core curriculum of three system and network management "generalist"
courses that teach the system management skills necessary to successfully
manage an OpenVMS system that is part of a VAXcluster and/or is in a networked
environment. This core curriculum is followed by a set of "specialist" courses
in areas such as performance, security and troubleshooting. We would like to
augment this set of specialist courses with an additional course in OpenVMS
System Availability and Integrity.
This course will provide experienced system managers and technical data center
managers with the skills they need to define their requirements for system
reliability and integrity, identify performance and cost tradeoffs, and
translate them into a specification of requirements and a prototype
implementation plan for their site. This course will introduce failure
prediction using DECamds and other tools. Complex and multi-site
configurations will be discussed. This course will include labs and case
studies.
There is a large installed OpenVMS customer base and a growing percentage of
that customer base is becoming more concerned about mission critical
applications and 24 by 7 operating environments. Given the industry trend for
high end customers' willingness to pay for services and training that leverage
continuous operations, this course would seem to have very focused appeal and
can be offered at a premium price. Further, this course would result in a NEW
offering (not a replacement or updated offering) that would net incremental
revenue.
A P/L analysis was done using an estimate of 50 worldwide offerings per year
with 10 students each and a constant 3 year model, FY93 - FY95. The MLP is
assumed to be US$1995. The analysis shows a 3 year Total Area Margin of about
$1,000,000 @ 38% against a development and update expense of $120,000 over the
same 3 year period.
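The arithmetic behind these figures can be roughly checked as follows. This is
only an illustrative sketch: the offering count, class size, MLP, margin
percentage, and development expense come from the paragraph above, but the
assumption that gross revenue is simply offerings x students x MLP (no
discounts or regional pricing) is not stated in the analysis, and the names
used are made up for the example.

```python
# Rough check of the P/L figures quoted above.
# Assumption (not stated in the memo): gross revenue is simply
# offerings x students x list price, with no discounting.
OFFERINGS_PER_YEAR = 50
STUDENTS_PER_OFFERING = 10
MLP_USD = 1995
YEARS = 3                    # FY93 - FY95
MARGIN_RATE = 0.38           # the "@ 38%" figure
DEVELOPMENT_COST = 120_000   # development and update expense


def gross_revenue():
    """Three-year list-price revenue under the assumption above."""
    return OFFERINGS_PER_YEAR * STUDENTS_PER_OFFERING * MLP_USD * YEARS


def area_margin():
    """Margin obtained by applying the quoted margin rate to gross revenue."""
    return gross_revenue() * MARGIN_RATE


if __name__ == "__main__":
    print(f"3-year gross revenue: ${gross_revenue():,}")
    print(f"3-year margin at 38%: ${area_margin():,.0f}")
    print(f"Margin / development expense: {area_margin() / DEVELOPMENT_COST:.1f}x")
```

Under these assumptions the margin lands somewhat above the "about $1,000,000"
quoted, so the actual analysis presumably included adjustments that are not
described here.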
This project will be funded as a result of deferring OpenVMS for Programmers II
until next fiscal year pending further study of how to restructure the
OpenVMS Programming curriculum. We believe that the financial benefit of the
reliability and integrity course will be more immediate.
The Strategies and Requirements Specification is appended below. Please
review the objectives of the course, scope of work, and qualify viability of the
offering so that we may start development as quickly as possible to meet a
projected end of May introduction. We would appreciate your response within
two weeks.
Regards,
Strategies and Requirements
OpenVMS System Availability & Integrity
5-Day Lecture/Lab
DESCRIPTION
This course will provide experienced system managers and technical data center
managers with the skills they need to define their requirements for system
reliability and integrity, identify performance and cost tradeoffs, and
translate them into a specification of requirements and a prototype
implementation plan for their site. This course will introduce failure
prediction using DECamds and other tools. Complex and multi-site
configurations will be discussed. This course will include labs and case
studies based on actual experiences of DEC customers with mission critical
applications.
OBJECTIVES
* Determine mean time between interruptions for mission critical
applications and define relevant single and multi-site
system configurations.
* Develop a prototype remedial process for a mission critical site
* Perform predictive failure analysis using DECamds, operating system
tools, and Symptom Directed Diagnosis tools
* Define available application features that leverage availability and
data integrity and their relevance to common application scenarios.
TARGET AUDIENCE
Experienced system managers who are responsible for mission critical systems,
technical data center managers, and application designers who design mission
critical applications that require high availability and data integrity.
TOPICS
* Operational definitions of reliability, availability and data integrity
* Cost and performance tradeoffs
* Computation of mean time between interruptions for complex applications
* Determination of single points of failure, their impact on MTBI, and how
to minimize that impact
* Failure prediction and troubleshooting in mission critical environments
* Development of remedial processes for mission critical applications
* Backup and fast recovery in 24 by 7 environments
* Application data integrity and availability features in OpenVMS
* Case studies and labs
* DEC Mission Critical Services
Note: These topics were distilled from a more detailed list developed by
several subject matter experts. A preliminary, more detailed, topic
list is attached.
SUMMARY OF REQUIREMENTS
* 5-Day Lecture/Lab
* Material will be modular
* Case study approach combined with labs using software fault insertion
* Selected training centers having extensive TBD lab resources should
be targeted for this course
* Prerequisite: SYSNET III. Performance and Troubleshooting are
recommended prerequisites
Potential Content of an OpenVMS Reliability, Availability, and Integrity Course
resulting from a brainstorming session with subject matter experts:
1. Define the above and other terms
2. Discuss tradeoffs impacting the above, performance, and cost.
3. Discuss evaluation of needs, MTBI measurements, can't exceed the weakest
link, etc.
4. Discuss areas of a system that can be impacted, and those that cannot.
i.e. design issues like no parity memory that are inherent weaknesses.
5. Determine single points of failure and how to minimize them in standard
environments, i.e. typical workstation, cluster, mainframe.
6. How to configure for availability
- redundancy, hot spares
- ease of management issues
- performance tradeoffs
- multi site (MDF and other) approaches (ability to move users, data
to a different site for quick recovery)
- storage management
* RAID, including striping and shadowing
- faster rebooting
- network vs. cluster file services
- "clusters of clusters"
7. Application techniques
- DECtp, two phase commit
- journalling, checkpointing
- failover
8. How to predict failure using AMDS and other tools (Polycenter?)
9. Backing up 24 x 7 systems and data recovery in this environment
10. The remedial process
- How to work around failures
* hot spares
* determine applications that HAVE to run when there is
reduced capacity
- Troubleshooting in a high availability environment
* hot systems
* site specific process, who does what
* how to get DEC involved
11. Digital Services
- Why mission critical services
- other DEC service offerings
DEC mission critical sites such as Bellcore and MCI are potential candidates
for case studies. Lab exercises with AMDS are a possibility.
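Items 3, 5, and 6 above (MTBI measurements, "can't exceed the weakest link",
single points of failure, redundancy) can be illustrated with a small
calculation. This is a sketch under textbook assumptions that are not part of
the memo: component interruptions are independent, interruption rates are
constant, and the component names and MTBI figures are invented for the
example.

```python
def series_mtbi(mtbi_hours):
    """MTBI of a chain in which every component is a single point of failure.

    With independent, constant interruption rates, the rates (1 / MTBI) add,
    so the chain is always worse than its weakest link.
    """
    return 1.0 / sum(1.0 / m for m in mtbi_hours)


def availability(mtbi, mttr):
    """Steady-state availability from MTBI and mean time to repair."""
    return mtbi / (mtbi + mttr)


def redundant_availability(a, copies=2):
    """Availability of `copies` independent units when any one is sufficient."""
    return 1.0 - (1.0 - a) ** copies


if __name__ == "__main__":
    # Hypothetical figures, in hours.
    components = {"cpu": 20000.0, "disk controller": 8000.0, "ethernet": 15000.0}
    system = series_mtbi(components.values())
    print(f"weakest link MTBI: {min(components.values()):.0f} h")
    print(f"system MTBI:       {system:.0f} h")
    single = availability(8000.0, 4.0)
    print(f"one controller:    {single:.6f}")
    print(f"two controllers:   {redundant_availability(single):.6f}")
```

The point of the first function is that removing a single point of failure
helps only if the component removed is a large contributor to the summed
interruption rate.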
|
158.2 | A question - | SUPER::MATTHEWS | | Wed Nov 04 1992 17:36 | 6 |
| I realize the tools are VMS-specific, but a lot of the topics are not.
Is there any thought of making this a multi-OS course? Or at least
separating the generic from the VMS-specific material as is being done
in the security training?
Val
|
158.3 | RE: Generic Availability and Data Integrity Suggestion | TANG::RHINE | VMS Training Product Manager | Wed Nov 04 1992 18:13 | 7 |
| Val, your suggestion is a good one to look at. I know that there are
products in the OSF space that provide some of the VMS availability and
data integrity tools. If you know someone in the OSF Course
Development world that could look at the memo in .1 and provide
feedback from the point of view you have suggested I would appreciate
it.
|
158.4 | When and how complex | SOAEDS::TRAYSER | Seniority means a bigger shovel! | Mon Nov 09 1992 23:04 | 15 |
| 1 basic question and 1 comment...
What is the time table for this course? Q3 development with Q4 delivery
or further out?
Having looked over the outline of the material I see this as a perfect
course to either be a prerequisite or a parallel course offering to VMS
Troubleshooting. I saw nothing in the outline that would make me believe
that I wouldn't understand it if I hadn't taken Troubleshooting. This
might be an appropriate course for the MIS series--no significant VMS
issues, possibly suitable for Windows NT, Unix or other operating
systems. I think Val's comment was right on the money; don't tie it too
tightly to VMS or we limit our options.
$
|
158.5 | RE:.-1 | TANG::RHINE | VMS Training Product Manager | Tue Nov 10 1992 07:37 | 8 |
| Buck, we are looking at Q3 development and Q4 delivery.
The reason that I suggested troubleshooting as a possible prerequisite
is that troubleshooting in a high availability environment, which I
believe is an important topic, could layer on knowledge and skills
that are taught in the troubleshooting course you are about to pilot.
The prerequisite issue should be decided after the content is firmed
up.
|
158.6 | New name, New format | SUPER::SUPER::TARRY | | Wed Jan 13 1993 11:51 | 21 |
| There has been some progress on this new course.
The name has been changed to
Building Dependable Systems - Generic chapters only
Building Dependable OpenVMS Systems - Generic + VMS specific
There will be two part numbers.
The first course will contain only the generic chapters
and will not have laboratory exercises.
The OpenVMS specific course will have laboratory exercises.
I will be posting pointers to the project plan as soon as it is
approved.
|
158.7 | Chapter 1 Ready for Review | SUPER::SUPER::TARRY | | Fri Jan 22 1993 14:36 | 53 |
|
The first draft of chapter 1 is ready for review. This
chapter is a generic chapter and contains:
Terms
Levels of dependable system
Defining business requirements
There is a figure to be added at one point. It will show the
following components: hardware, software, environment, and humans.
There is still one block to be written on tradeoffs between data
integrity, performance, and cost.
There is a very extensive case study at the end of the chapter.
Copy the file from:
SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_INSTRUCTOR.PS
VMS_DS_1_INTRO_STUDENT.PS
Note the following:
Each chapter has a ps file for the instructor and a ps
file for the student. You will need to print and read
both. The student material is not in the instructor
manual.
Please do not send comments on the format of the
materials. I have no control over format.
Send comments to SUPER::TARRY by 1-Mar-1993
Other chapters planned for the course, not necessarily in the
order in which they will appear are:
Introduction to Dependable Computing Generic
Environmental factors Generic
System Configurations Generic
Mass storage Generic
Configuring OpenVMS System
Avoid Human Errors Generic
Managing Dependable OpenVMS Systems
Disaster Recovery Generic
Data Integrity Generic
|
158.8 | Revised Chapter 1 Posted. | SUPER::SUPER::TARRY | | Thu Feb 18 1993 14:12 | 17 |
| One person did provide some review on chapter 1. I have revised the
chapter according to the suggestions and placed a new copy for review.
SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_1_INTRO_STUDENT.PS
_INSTRUCTOR.PS
Be sure to pull and print both versions. There are two case studies
in this chapter. One is very short.
Chapter 3 on the environment is finished, but there are so many figures
the chapter makes little sense without them. As soon as preliminary
figures are ready I will post chapter 3.
Chapters 2 and 4 are in the final stages of development. They should
both be posted early next week.
I sure do need more reviewers.
|
158.9 | Chapters 2 and 4 for review | SUPER::SUPER::TARRY | | Mon Feb 22 1993 14:49 | 19 |
| Building Dependable OpenVMS Systems
Chapters 2 and 4 have been posted for review.
Chapter 2 discusses strategy in general and chapter 4 discusses
hardware strategy. This includes fault tolerant, cluster, and RAID.
Both are generic chapters.
SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_2_STRATEGY_STUDENT.PS
_INSTRUCTOR.PS
VMS_DS_4_HARDWARE_STUDENT.PS
_INSTRUCTOR.PS
|
158.10 | Need Reviewers Please!!! | SUPER::SUPER::TARRY | | Thu Mar 04 1993 18:14 | 28 |
| Chapters 1-4 have been reposted in the directory:
SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
_INSTRUCTOR.PS
To review these chapters you must pull and print both the student and
instructor versions.
Chapter 3 is posted for the first time. Chapter 1 is ready for the
pilot.
Chapters 1-4 are operating system generic chapters.
The final chapters will look like this:
Chapter 5 OpenVMS configurations that support high availability
Chapter 6 Managing OpenVMS systems for dependability
Chapter 7 Managing the data center (generic)
Laboratory Exercises for OpenVMS
I am looking for a funded reviewer and a pilot instructor for this
course.
I am really desperate for some instructor review.
|
158.11 | Chapters Posted for Review | SUPER::SUPER::TARRY | | Wed Mar 17 1993 14:09 | 18 |
| Chapters 1-5 and 7 are posted for review in:
SUPER::$1$DUA6:[ES$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
VMS_DS_#_name_INSTRUCTOR.PS
Note that to review the materials you must obtain both the instructor
and student material and read them at the same time.
Some chapters are not finished.
The pilot has been scheduled for 10-May at PKO. Dave Maxwell has
agreed to be the instructor. Thank you for volunteering!
I just must tell you that I am off for a vacation in Costa Rica. I
will be back 8-April.
Please have review comments posted by 8-April.
|
158.12 | Beginning reviews of the material (starting with module 2) | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Mon Apr 05 1993 03:24 | 65 |
| I've browsed the Student Materials and read the "Plan" carefully. I see a
very basic problem with the material--I don't think it fits well in the SysNet
curriculum. I think of my SysNet 3 students (which according to the prereq's
they need SysNet 3 before taking this course) and I cannot envision them
enjoying this class. The material here is more attuned to the MIS director
or Senior System Manager--people responsible for setting up data centers, not
for your average advanced system manager. Do we have material designed and
written for the right audience? Do we have a plan to get the correct students
to this class?
I've already reviewed most of Chapter 1 via 'hardcopy', so I'll begin entering
Chapter 2 info --
Nothing notable in the Instructor's page except 2-13 "studnet" instead of
"student".
Student Guide:
2-5 - 10th bullet, good analogy, but out of context. Move the highway example to
the instructor's page.
- definition of 'Failure', "...whose effects cannot be contained." Huh?
This is a little fuzzy, how about a better definition?!
2-7 - Very good analogy with the suitcase example! Clear, to the point and
covers the concept well.
2-8 & 2-9 - Urgh. These pages don't flow well. The title is "Dependability
Strategies", but there are no "strategies" on this page, only definitions
and examples. Please put a few STRATEGIES here, then support them with
examples.
Also, the examples are incomplete, overly complex or technically wrong:
The three bullets on the top of 2-8 cover "Fault prevention", "Error
correction" and "Failure recovery", but the examples cover "Error
correction" twice and there is no example for "Fault prevention" (which
in my opinion is the most critical).
The CRC/XOR example is overly complex and, although interesting, makes
for tedious lecture by those not entirely familiar with the process.
Please note that the discussion only refers to the SOFTWARE supplied
CRC mechanism, not the more useful and quicker HARDWARE implemented CRC.
Move it to the Instructor's page.
And lastly, there are slight errors in the Backup example:
Numbered item "1" under the "When data is backed up to tape..." has
the term Checksum confused with CRC. There is a CRC algorithm used
to calculate a Checksum for the header block, but the DATA blocks do
not refer to the CRC that is written in them as a Checksum. These
are two distinct fields in the Backup tape layout structure (known
as BBH$L_CRC and BBH$W_CHECKSUM).
Numbered item "2" under the "When data is backed up TO tape..."
discusses READING from the tape. It should instead explain that
during the writing of the Group a redundancy block is calculated and
written at the end of the Group of blocks.
Last paragraph should state that if the CRC cannot solve the problem,
then the redundancy block is used.
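The write-time sequence described in these Backup comments (a CRC per block,
then a redundancy block written at the end of each group) can be sketched as
below. This is an illustration only, not the actual Backup save set layout:
it assumes the redundancy block is a bytewise XOR of the data blocks in the
group, uses Python's zlib.crc32 as a stand-in for the CRC, and all function
names and block sizes are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_group(data_blocks):
    """Writing: store a CRC with each block, then compute the redundancy block."""
    records = [(zlib.crc32(block), block) for block in data_blocks]
    return records, xor_blocks(data_blocks)


def read_group(records, redundancy_block):
    """Reading: the CRC detects a bad block; the redundancy block rebuilds it."""
    bad = [i for i, (crc, data) in enumerate(records) if zlib.crc32(data) != crc]
    if len(bad) > 1:
        raise ValueError("more than one bad block in the group; cannot recover")
    blocks = [data for _, data in records]
    if bad:
        good = [b for i, b in enumerate(blocks) if i != bad[0]]
        blocks[bad[0]] = xor_blocks(good + [redundancy_block])
    return blocks


if __name__ == "__main__":
    original = [b"AAAA", b"BBBB", b"CCCC"]
    records, redundancy = write_group(original)
    records[1] = (records[1][0], b"BxBB")      # simulate a damaged block
    print(read_group(records, redundancy) == original)
```

In this sketch a single XOR block can rebuild only one damaged block per
group, which is why the group size is a tradeoff between tape overhead and
protection.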
More later...
$
|
158.13 | continuing chapter 2 review | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Wed Apr 07 1993 03:46 | 89 |
| More from chapter 2...
Basic problem with this chapter is lots of new terms introduced but not fully
defined. Don't use an example until the item introduced has been properly and
clearly defined. The examples, although many of them are good, belong on the
instructor's pages, not the student material.
(No significant comments on the Instructor's pages, what follows is the student
material review)
2-11 -- 2nd sentence is out of place in the lecture, this should have been
covered back near page 2-8.
-- Time Redundancy, what is it? Need a clear definition.
-- 5th paragraph doesn't seem to have anything to do with Time
Redundancy. Also, change 'can' to 'might'.
2-12 -- No discussion of Software Redundancy and how it complements Hardware
Redundancy. Examples might be RMS Journaling, VAXsim, etc.
-- 2nd bullet, use VAXcluster nodes as a redundancy example and move the
spare tire example to the instructor's page.
2-13 -- What is "N+1" and "2N"? This idea and notation are not clearly
defined. Is this 'industry' notation or 'DEC' notation?
2-14 -- reverse bullets 5 & 6...determine "work-a-rounds" and their 'costs'
(including time, money and other impact) and THEN define a recovery
procedure. The recovery procedure MIGHT be the work-a-round.
-- #3, "...are the most EFFECTIVE FOR YOUR NEEDS", since cost-effective
solutions might not be the one we need, we might need the FASTEST,
regardless of cost.
2-15 -- The description on 2-15 indicates a 6000-510, diagram on 2-16
shows a 6000-520. Change diagram to show 6000-510.
2-16 -- If this is a 6000-510 then ERASE CPU-1 from the diagram leaving
only CPU-0.
-- KDM70 should have "#1" removed (see 2-21).
-- Indicate Data A and Data B on the disks (see 2-21).
-- Ethernet segment isn't connected to the DEMNA, move the line over to
line up beneath the DEMNA (see 2-21).
2-17 -- #1 should indicate 6000-510
2-19 -- #1b, drop reference to Digital Service and replace with "Call
maintenance" like the others. For that matter, just drop the "b"
question if the only significant "recovery" is to call service. I'd
feel really stupid saying "call service" for all the recovery
solutions.
-- #1c, the solution is to add another CPU in the cabinet, NOT add
another system (see 2-15, last sentence).
-- #2a, "...using the console OR BATCH."
-- #2b,c if we had discussed Software Redundancy earlier we might have
mentioned having Virtual Terminal Support turned on (via SYSGEN) and
just reconnecting to the disconnected process via the second Ethernet
controller or the console.
-- #3a, this assumes the entire controller fails, rather than just a port
or cable in which case switching cables or ports is a solution.
-- #3c, this requires human intervention...must move cables to the good
controller when it fails. Consider an HSC and dual ported disks.
-- #4c, "N+1" just hanging out at the end of the answers. Is this
intentional? If so, it looks ugly. Either explain it or drop it.
2-20 -- #5c, consider RMS Journaling.
-- #6a, "...to do backup" or ANY other activity, such as file recovery,
tape journaling, software installations, etc.
-- #6c&d - Where did "d" come from? And "c" has a question not posed
on 2-18.
2-21 -- both DECservers are labeled "#1", just drop the numbers.
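On the "N+1" and "2N" question raised under 2-13: in common industry usage
(an assumption here, since the course material's own definition is not quoted
in this note), N is the number of units needed to carry the load, N+1 adds one
spare unit, and 2N duplicates every unit. A minimal sketch, with invented load
and capacity figures:

```python
import math


def units_required(load, unit_capacity):
    """N: the smallest number of units that can carry the load."""
    return math.ceil(load / unit_capacity)


def n_plus_1(load, unit_capacity):
    """N+1: enough units for the load, plus one spare."""
    return units_required(load, unit_capacity) + 1


def two_n(load, unit_capacity):
    """2N: a complete duplicate of the units needed for the load."""
    return 2 * units_required(load, unit_capacity)


if __name__ == "__main__":
    load, capacity = 25, 10   # hypothetical: 25 units of load, 10 per device
    print("N   =", units_required(load, capacity))
    print("N+1 =", n_plus_1(load, capacity))
    print("2N  =", two_n(load, capacity))
```

The difference shows up in cost: N+1 survives the loss of any one unit, while
2N survives the loss of an entire set at roughly twice the hardware.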
More later...
$
|
158.14 | Last of 2 | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Thu Apr 08 1993 02:42 | 45 |
| More from chapter 2...
Instructor's pages prefixed with an "I", such as I2-4 is page 2-4 Instr. Guide.
2-22 -- These are 3 strategies, but what I'd like here is a problem statement.
Why do I need to know about these? Will they add redundancy? Will
they describe a potential failure? Use this page to 'setup' the next
page. (Or, put the 'setup' on I2-17.)
2-23 -- There is an assumption here that a "process" defines a "server" or
a "client". "Process" is strictly a VMS concept, i.e. a 'server' such
as a VXT doesn't use a process (yes, trust me, a VXT is generally a
'server' and not a 'client'). On PCs, either of these concepts can
be implemented as a driver, not a process. (Same problem on I2-18)
2-25 -- ACID test not properly introduced. It wasn't until I read it a second
time that I figured out that ACID was an acronym. The definition may
be technically correct, but I'm not really sure what to do with it. I
think it needs a better intro.
-- Before-Image Journaling, nit -- "...cannot complete*,* the...", a
comma is missing. Also, care to define the types of journaling, such
as After-Image Journaling.
2-27 -- Arrows on diagram not clear. I assume they represent the network or
some "remote access".
2-30 -- 1st sentence, term "front-end" not defined.
2-31 -- More clearly mark the answers. Since the questions are repeated on
this page, at first glance I didn't see the answers, just the
questions!
---------
I2-19 -- I like the analogies being on the Instructor's page, thanks!
I2-21 -- A good example of a distributed application is VAX Notes. The user
interface runs locally and the compute/storage functions are on the
conference host system.
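For the Before-Image Journaling item under 2-25, the general idea (save the
old value before each change so that a transaction that cannot complete can be
backed out) can be sketched as follows. This is a toy in-memory illustration
with hypothetical names, not a description of how RMS Journaling or any DEC
product implements it.

```python
class BeforeImageStore:
    """Key/value store that journals before-images for one open transaction."""

    def __init__(self, initial=None):
        self.data = dict(initial or {})
        self._journal = None          # None means no transaction is open

    def begin(self):
        self._journal = []

    def write(self, key, value):
        if self._journal is None:
            raise RuntimeError("no transaction open")
        # Record the before-image, or the fact that the key did not exist.
        self._journal.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        # Changes are kept; the before-images are no longer needed.
        self._journal = None

    def rollback(self):
        if self._journal is None:
            raise RuntimeError("no transaction open")
        # Undo in reverse order so the oldest before-image is applied last.
        for key, existed, old in reversed(self._journal):
            if existed:
                self.data[key] = old
            else:
                self.data.pop(key, None)
        self._journal = None


if __name__ == "__main__":
    store = BeforeImageStore({"checking": 100, "savings": 50})
    store.begin()
    store.write("checking", 70)
    store.write("savings", 80)
    store.rollback()                  # the transaction could not complete
    print(store.data)
```

After-image journaling works the other way around: the new values are logged
so that committed work can be reapplied to a restored copy of the data.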
Chapter 3 is next...
$
|
158.15 | Chapter 3, needs a fair bit of work | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Fri Apr 09 1993 02:38 | 177 |
| Overall I was a little depressed with this chapter. It addressed large data
centers, but not office based systems. It discussed physical security but
no mention of software security. There are numerous terms used that are
not defined or described. There is a weird discussion on power and electricity
and no discussions on Static electricity and its effects. Basically a ragged
chapter with occasional interesting topics intermixed with esoteric wanderings.
(Instructor pages denoted with leading "I", such as I3-14)
3-4 -- term EMI, can YOU define 'waveform' to a system manager with only a
high school education?
-- term NOISE, please don't define a term with another term that has yet
to be defined!
-- term NOTCH, OK, so I've got a small vocabulary. What's "SUBTRATIVE"
supposed to mean to me?
I3-5 -- 2nd paragraph from bottom, I claim ignorance. Three-phase comparison
to single-phase needs a bit more explanation for those of us not
strong in electronics.
3-6 -- spelling, second line "avilable" is missing an "a".
I3-6 -- "sensitive" paragraph -- oh, please! give me a break. And there are
students opposed to Dams because of flooding and keeping salmon from
migrating. And there are students opposed to fuel-fired plants that
spew pollutants into the air. Drop the reference, I find it slightly
demeaning.
-- 6th bullet, "jepordize" needs an "o" in front of the "p".
3-7 -- Ha! Please remember that many system managers have never taken
college courses so may not have had 'trig'. Terms and phrases such as
'conductor', 'AC Sine Wave', and 'current is induced' are beyond many
students. To cover them in this 'matter-of-fact' material is not
acceptable.
-- Hmmm, why does figure B still have a current flowing from right to
left as indicated by the "+ and -"??
3-8 -- Gag! Is this material really necessary? I can teach this course quite
successfully without this page. And if I *was* to teach this material
I couldn't answer any questions on it. Neither my degree nor my 20 years of
computer experience has ever taught me the details of three-phase
power, so if you expect the average SOFTWARE instructor to teach this
material successfully, I believe you are setting us up for a failure.
Drop pages 3-7, 3-8 and 3-9.
3-9 -- Useless, drop it.
3-10 -- 9th 'paragraph', care to tell us why the substation is getting power
from more than one circuit?
-- last paragraph, "At cost,", huh? Do you mean to say "At an increased
expense,"?
3-11 -- Diagram: What are the triangles? What are the boxes with dots in them?
What is a "power pool"? What are the lettered circles (A, B, etc.)?
-- #2, I know Texas is big, but "WHY" is it separate? I'm sure students
will ask, it's just too obvious!
I3-7 -- 2nd paragraph from bottom, 2 mispelins, "pulic" and "invester".
3-12 -- 6th bullet, spelling of "compnay"
I3-8 -- Item #10, care to define "Load Shedding". I assume it means cutting
service.
-- It's not clear on pages I3-8 and I3-7 what parts are actually taken
from the quoted sources. Please indent, italicize or otherwise change
the font for quoted material.
3-13 -- 3rd bullet, "oscilating" needs a second "l".
I3-9 -- 3rd "paragraph" from bottom, should "...1 minute or more..." be "...1
minute or LESS..."??
3-14 -- Neither the diagram nor the description on the instructor's page
(I3-11) was clear or useful. Drop it!
3-15 -- 3rd paragraph, how exactly does one spread the load evenly over 3-phase
power?
I3-11-- 2nd paragraph under Voltage tolerance, spelling: change Emphasis to
Emphasize.
3-16 -- 3rd paragraph under motor generators, might add an instructor note
that PC and small VAX systems can usually ride the short power sags
without any special hardware requirements.
I3-12-- 2nd paragraph of motor generators, What "flywheel"?
-- last paragraph of motor generators, Huh? how can sustaining power for
an extra 1/10th of a second be enough for a "smooth" shutdown of
computer equipment? Never heard of this!
3-17 -- bullets, add Power Duration and Recharging Time as two other
considerations
3-18 -- First paragraph puts all computers in the same situation. My Laptop
generates very little heat and doesn't need any special air
conditioning.
3-19 -- 1st line, we haven't defined solid-state circuitry and where it is
found. Cars have Solid-state equipment and work fine on the hot
asphalt parking lots. What devices are being referred to?
-- 3rd and 4th paragraph should only be one paragraph.
-- 2nd bullet is poorly worded. It implies that I don't have enough power
to start with.
-- 3rd bullet implies I can buy a fan at Kmart and cool my computer room.
Mention portable A/C units that exhaust the hot air into the suspended
ceilings (no joke, we've used them before).
-- The last paragraph "writes off" all the issues regarding PC and
other office-based workstations. I believe this is a major short-
coming of this material.
3-20 -- Only the first section on this page relates to Physical Security, the
other issues are Environmental issues. Also, it ignores small systems
again, assuming computers are kept in computer rooms.
-- 2nd paragraph, "controlled...access", yeah, I call it a "door". Be
more specific.
-- 3rd paragraph, why no windows? Reflective or mirrored windows are
a feature I suggest to managers. Having a view to the outside
world is a GREAT way of reducing stress, which computer people seem
to easily generate.
-- Keeping data center clean, 1st paragraph -- printers in computer rooms
is not as large of an issue as it used to be. The 'lint' or 'paper
dust' was significantly reduced as shops moved from the high-speed,
continuous form impact printers to laser printers.
-- Keeping data center clean, 4th paragraph -- This is a MAJOR
misunderstanding of VCS systems. They were originally sold in
VAXcluster environments, but can have HSCs, Unix systems, stand-alone
VAX systems, Alphas, 3rd-party systems, etc. connected to the
VCS. The *name* is a poor name and should be changed to reflect its
current capabilities.
-- Water Problems, 1st paragraph -- "Sorry, Egon. I'm a little fuzzy on
this 'bad' thing. Why is it bad to cross the streams?" Replace
"very bad" with "can be a problem". I've used water to clean soda
out of a VT200 keyboard before. Bad is subjective.
-- Water Problems, 3rd paragraph -- "SOME air conditioning units...", not
all are water cooled anymore. Liquid Nitrogen and Hydrogen, Freon and
other coolants are popular today.
3-21 -- So, what is "Power Conditioning System Plus"? Strange header for the
page. Is it a product or service? Is it a concept? Does DEC sell it?
-- 2nd paragraph, what's a H731x? What's VAX REMS, I've never heard of
this. How about a description. If this is a product pitch move it
to the appendix and present the "concepts" of the process here.
3-22 -- What's a REOP/EMS? What's a J-Box? What's a RSU? What's the box on
the floor near the J-Box that looks like it's 1/2 open?
Basically a chapter I'd cut-and-paste and deliver using only about 6-8 pages.
Some stuff is interesting, but the audience is wrong if we are selling this
course to System Managers. This stuff is more for Data Center Managers or
MIS directors.
Oh, and finally, many pages are using more than 75% of the page. This might
be OK for reference material or Self-paced material, but for lecture lab we
like enough space at the bottom of the page to take notes.
More later...
$
|
158.16 | Chapter 4 -- tools, but no real strategies | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Mon Apr 12 1993 02:41 | 180 |
| Chapter 4
Don't see the connection between title and contents--"Software Dependabilities
Strategies". Sorry, but I saw no strategies, only concepts and tools.
----------
4-5 -- 1st sentence. Don't use the term "independently" to define the
concept of "independently recoverable unit".
4-6 -- Needs reorganizing. System Kernel Level Redundancy should precede
Subsystem Level Redundancy, especially since 'kernel' is used before
it is defined.
-- First section, CIRCUIT level redundancy not adequately defined.
-- 2nd sentence, makes no sense. What console processor? what is
"its" referring to, the CPU or the Console CPU? How does it work?
-- 2nd section. Why 2 separate sentences? If they were merged they would
form a coherent paragraph. It looks strange to break up a paragraph
like that. OR, bring back the bullets that probably were there in a
previous draft of this material.
-- 3rd section (other than moving it up on the page), 3rd bullet has an
example to get the idea across, the first two bullets also need an
example like MIRA or TANDEM, etc.
-- "Wide Area Redundancy" -- adding the term cluster system is redundant,
it is a system. Drop the cluster reference, it is covered later.
-- Last paragraph could benefit from an example like hurricane Hugo,
where damage along the Carolina coast was extensive, but the tornadoes
spawned by the hurricane traveled inland to cities like Charlotte doing
excessive damage far from the 'strike' site. The March blizzard was
a good example that even geographically separate locations need to be
MUCH more separated to protect them from the same disaster.
I4-6 -- 1st sentence, what does "such circuit level redundancy" mean? "Such"
implies a definition preceded this sentence. I don't see it.
-- Last paragraph, "...nearest the caller Order placement..." doesn't make
sense! Are we missing a word here?
I4-9 -- 2nd paragraph from bottom, VMS also has this feature.
I4-11-- This definition was needed in the previous chapters. Move it forward
to the first occurrence of "2N or N+1".
4-11 -- Last paragraph, care to tell us a bit more about a "Watchdog Timer",
such as part number, common uses, etc. On instructor's page would
be fine.
4-12 -- Diagram is in error. CPU is missing, MIRA watchdog timer not
connected to the Q-bus, cable from DEQTA goes directly to the other
DEQTA (have it go to a 'cloud' or "..." to indicate more hardware).
What is the box labeled "Standard Q-bus Interface" supposed to be?
Is this an old configuration? I don't see the DSSI hardware or disks.
4-13 -- The entire page looks like it needs bullets! Or form them into
paragraphs.
-- Triple Modular, Tightly Coupled and Hardware Intensive - give an
example of the type of hardware required to configure to these
descriptions. (Stratus reference on the instructor's page is
probably OK for the last one.)
4-14 -- Layout looks weird. Looks like bullets were removed and a paragraph
would be more reasonable than the current format.
-- 2nd bullet "No single point of repair", this is a strange statement
that the rest of the bullet fails to explain adequately.
-- 4th bullet "Self-checking checkers", ditto. How is this done? Can
you give an example on the instructor's page?
4-15 -- Neither 4-14 nor I4-15 adequately describes this diagram. What is
"Mass STC", what is X-Link? Diagram shows 2 cables leaving single Ethernet
controllers. Why does the Processor box have two little un-labeled
boxes in it? What is the vertical, white box that connects the
Mass STC, Ethernet and Console 'boxes'?
4-17 -- Overall this page is a sales pitch, not providing anything worthy
of a 'strategy'. This page really doesn't say anything worth
lecturing on in a Dependable Systems class. How is it redundant?
How should it be configured? What are the costs? Etc.!!
-- The middle section notes differences between VAXcluster and VMScluster.
This information is old and will not be current when this material
begins to ship. As of mid-May (VAX-VMS 6.0 and AXP-VMS V1.5) mixed
architecture VMSclusters are supported. It might be worth noting
previous configurations, but ALL the documentation will be changed
to reflect that VMSclusters is the old VAXcluster concept on various
combinations of VAX and AXP systems.
-- Last sentence. Drop it or move it to the instructor's page. Having
a VAX VMScluster V6.0 (AXP V1.5) SPD in the instructor's kit would be
more useful than this line.
I4-17-- How can we get a copy of the Aberdeen Group white paper? For
Instructors and/or for Students.
4-18 -- 2nd paragraph, "secondary storage" -- I don't think we've mentioned
what "secondary" would be or how that differs from Primary storage.
-- Last paragraph -- misleading. Seems to imply that RAID might be
discussed in more detail in the Performance course. Please make it
clear that VMS performance is more clearly defined and that RAID
concepts (including performance) are covered in more detail here.
I4-19-- Any idea how to order the Berkeley paper on RAID?
I4-20-- diagram is ugly. Either turn it on its side (landscape) or break
it into two tables:

    Level   Descr.   Avail.   Request rate
      0
      1
     ...

    Level   Data rate   Cost   Type of appl.
      0
      1
4-21 -- 2nd sentence, need to define "chunk" better -- "A chunk is the I/O
size of RAID that usually consists of several blocks of data", etc.
-- Availability, MTBF? Has it been previously defined? I think so,
just can't find it.
-- diagram, why is "B" shaded?
4-22 -- Reference of Raid 0 and Raid 1 being combined is good, but belongs
AFTER we have described Raid 1, possibly at the bottom of the page.
Otherwise it seems that the Avail/Perf/Storage section is defining
how RAID 0/1 works!
-- Also need to stick with either "RAID 1" or "RAID-1" spelling.
4-23 -- Why is "B" shaded?
4-24 -- Under Performance, the 2nd and 3rd paragraphs - don't describe any
performance issues. If they are describing how RAID-3 works, then
move them up to the top of the page.
-- Under Performance, 1st sentence, "...workload is a near sequential...",
huh? Don't understand the sentence.
4-25 -- Can't read the labels in the diagram. Probably need to clear the
background shading with the D⊕E⊕F like the triangle on top of the page.
-- Dark lines going to the far-right disk imply only blocks A & C are
being referenced; add a 3rd line connecting "B" to the far-right
disk.
4-26 -- bullets 5-8 don't make sense. How can we write a D⊕E⊕F XOR chunk
before we have written "F"??
4-28 -- Reed-Solomon, never heard of it. Where is it defined in the class
materials? How can I get more info on this?
-- 3rd & 4th bullets, what are in these blocks? Probably Reed-Solomon
stuff, right? And if a student asks, I can add nothing to the
discussion. Either more detail or drop all references to Raid-6.
4-31 -- What is this page doing here? Why is RAID discussed in such detail
here? 'Can' it.
4-33 -- Only exercise is on disks? What about clusters? MIRA? other
concepts discussed? Considering we don't have products that cover
RAID 3, why spend 67% of the lab working on it?
Overall some interesting topics, but not focused around Strategies. Seems
like a Marketing tour, since we only discuss concepts rather than configuration
issues: what disks should be RAID-ed, how Cluster maintenance is performed
without halting the cluster, how to combine VAXsim with shadow disks,
etc.
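
For reference on the "chunk" and XOR questions raised under 4-21 and 4-26, here is a minimal sketch of striping and XOR parity as RAID is generally described. It is not taken from the course pages. The chunk size, disk count, byte values and the function names locate and xor_chunks are all assumptions made for illustration.

```python
from functools import reduce


def locate(block, chunk_blocks, data_disks):
    """Map a logical block number to (disk, stripe, offset within chunk).

    Assumes plain striping: chunk 0 on disk 0, chunk 1 on disk 1, and so on.
    """
    chunk = block // chunk_blocks
    return chunk % data_disks, chunk // data_disks, block % chunk_blocks


def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Hypothetical stripe: data chunks D, E, F on three disks, parity on a fourth.
D, E, F = b"DDDD", b"EEEE", b"FFFF"

# Full-stripe write: the parity chunk is computed from the data held in
# memory, so it can be produced before chunk F has reached its disk.
parity = xor_chunks([D, E, F])

# One disk lost: the missing chunk is the XOR of the surviving chunks.
recovered_F = xor_chunks([D, E, parity])

# Small write that changes only E: new parity = old parity XOR old E XOR new E.
new_E = b"eeee"
new_parity = xor_chunks([parity, E, new_E])
```

Under these assumptions, the parity chunk does not depend on the order in which the disks are written, only on the data being available to whatever computes it. Whether that is what bullets 5-8 on page 4-26 intend is not something this sketch can confirm.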
$
|
158.17 | I still owe you a few more chapters, but "thanks"! | SOAEDS::TRAYSER | Seniority: Big Shovel, Less Breaks! | Tue Apr 20 1993 23:44 | 10 |
| Emmalee,
Thanks for the mail commenting on all my comments. It's nice to know the
course writers really do read this stuff. I even appreciate your "I
disagree with you" comments, where you keep the page as you have it but
put my comments on the instructors pages -- a quite acceptable solution.
Thanks!
$
|
158.18 | Course Update | SUPER::SUPER::TARRY | | Thu May 13 1993 16:49 | 45 |
| PLEASE NOTICE THIS IMPORTANT CHANGE!
The name of the directory ES$REVIEW is now IDC$REVIEW.
The chapters for review are posted in:
SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_#_name_STUDENT.PS
_INSTRUCTOR.PS
Again let me remind you that you must pull both student and instructor
versions. We no longer have facing pages.
The name of the course has been changed to:
Building Dependable Systems Using OpenVMS Products
The course is coming along pretty well. By tomorrow,
14-May, I will post updated chapters 1, 2, 3, 4, 5 and 8.
(Some requested changes to the drawings are still in progress. All
VAX 9000's are being removed and some of the lines are being
corrected.)
Many thanks to Buck Trayser for his thoughtful and always appreciated
comments.
The remaining chapters are:
Chapter 6 Distributed Software Fault Tolerance.
Chapter 7 OpenVMS Products for Dependable Systems
And lab exercises which will include the following:
Performing a Rolling Upgrade of a VMScluster system
Building a shadowed system disk
Performing backup by breaking a shadow set
Backup data integrity exploration
Writing a help library module
Solving cluster problems using DECamds
Now is the time to review these chapters and prepare to teach this
innovative course.
|
158.19 | Chapter 6 posted for review | SUPER::SUPER::TARRY | | Fri May 21 1993 14:52 | 37 |
| A new chapter 6 Distributed Software Fault Tolerance has
been completed and posted in the IDC$REVIEW directory
SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]VMS_DS_6_DSFT_STUDENT.PS
_INSTRUCTOR.PS
This chapter discusses an approach to providing fault-tolerant systems
that is new to me and very exciting.
Distributed software fault tolerance uses distributed
client/server configurations to provide fault tolerance so that the
failure of a single node, network link or even an entire site does not
interrupt the availability of an application.
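
As a rough illustration of the idea in the paragraph above, the following sketch shows a client that tries a list of servers in turn and fails over when one is down. It is a toy model only. The names make_server, call_with_failover and ServerDown are hypothetical and do not represent the interface of any product covered in the course.

```python
class ServerDown(Exception):
    """Raised by a simulated server that is unreachable."""


def make_server(name, up=True):
    """Return a callable standing in for a server at one site (hypothetical)."""
    def handle(request):
        if not up:
            raise ServerDown(name)
        return f"{name} handled {request}"
    return handle


def call_with_failover(servers, request):
    """Try each server in order and return the first successful reply."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ServerDown as err:
            last_error = err  # remember the failure and try the next site
    raise RuntimeError("no server available") from last_error


# Site A is down, so the request is served by site B instead.
servers = [make_server("site_a", up=False), make_server("site_b")]
reply = call_with_failover(servers, "order 42")
print(reply)
```

The client sees a normal reply even though the first site failed. A real implementation would also have to deal with requests that were partly processed before the failure, which this sketch ignores.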
There is a very beautiful demo of this concept which is to be included
in the media kit. It must be run on a workstation, and of course you
will need to be able to project the workstation display. The
chapter may not make the concept as clear without the demo.
Also some drawings for the chapter have not been done yet.
One point to emphasize is that nothing about distributed software fault
tolerance negates anything that is in this course except perhaps
providing standby generators in case of power failure.
I think others will like this idea as much as I do. I will repost the
chapter after the drawings are added and will try to make the demo
available to instructors. There will also be some handouts on a
successful implementation in Australia.
Next week we will post the final chapters which are:
Chapter 7 OpenVMS layered products
8 Managing complex systems
9 Laboratory exercises
With the lab exercises this is a 5 day course.
|
158.20 | Pilot June 14 | SUPER::SUPER::TARRY | | Fri May 21 1993 14:53 | 3 |
| The pilot for Building Dependable Systems has been rescheduled to
June 14. It will be held in ZKO instead of PKO. Students are now
being enrolled.
|
158.21 | | BROWNY::GDAY::MAXWELL | Dave Maxwell | Fri May 21 1993 15:09 | 12 |
| Hello,
I have placed a saveset in SUPER::IDC$REVIEW:[DEP_SYS]RTRDEMO.BCK
You will need the files to teach the class. These are the files for doing the
RTR demo. It is a real neat demo. I suggest you try it even if you don't teach
the class. There are 4 files in the saveset. Print out the AAA_README.TXT file.
It tells you how to run the demo. It does require a workstation. If you have a
color terminal it is even nicer.
Good luck,
Dave
|
158.22 | Building Dependable Systems Finished | SUPER::SUPER::TARRY | | Thu Jul 08 1993 09:59 | 36 |
| The course is finished. It is available in 3 formats:
EY-N438 Building and Managing Dependable Systems Using OpenVMS
Products L/L 5 days Includes operating system generic
materials, OpenVMS specific materials and laboratory
exercises.
EY-Q141 Building and Managing Dependable Systems Using OpenVMS
Products Sem 3 days Generic and OpenVMS specific
materials.
EY-N439 Building and Managing Dependable Systems
2 days Seminar Generic materials only
The course did not have a pilot.
Materials are posted in the IDC$REVIEW directory
SUPER::$1$DUA6:[IDC$REVIEW.DEP_SYS]
AMDS.BCK;1 Demo for DECamds
EY-N438E-EX-0001.PS;3 Student lab book
EY-N438E-IG-0001.PS;1 Instructor guide
EY-N438E-SG-0001.PS;1 Student guide
EY-N438E-TS-0001.PS;1 Pre test
RTRDEMO.BCK;1 RTR demo (fun-try it out)
To prepare for the course you need both the student guide and the
instructor guide.
Look for an overhead package to be posted next week.
|
158.23 | Y | SUPER::SUPER::TARRY | | Thu Jul 08 1993 10:00 | 5 |
| Discussion regarding the course Building Dependable Systems has been
moved to the notes file.
SUPER::VMS_PERFORMANCE
|