
Conference 7.286::space

Title:Space Exploration
Notice:Shuttle launch schedules, see Note 6
Moderator:PRAGMA::GRIFFIN
Created:Mon Feb 17 1986
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:974
Total number of notes:18843

193.0. "Feynman's CHALLENGER 51-L Report" by VICKI::LEKAS (Tony Lekas) Tue Jul 22 1986 18:53

The current Scientific American has some information in the
Science and the Citizen section on the Rogers report.  It also has
some quotes from Richard Feynman's separate report.  I have a
great deal of respect for Feynman and would like to read his
report.  Does anyone know how to get a copy?

Tony 
193.1. "Volume 2 of Rogers Commission report?" by DSSDEV::SAUTER (John Sauter) Wed Jul 23 1986 09:11
    The document you get from the government for $18 is only volume
    1 of the full report.  Could Feynman's separate report be in one
    of the other volumes?  According to the Table of Contents volume
    2 contains the following:
    
    Appendix E	Independent Test Team Report to the Commission
    Appendix F	Personal Observations on Reliability of Shuttle
    Appendix G	Human Factors Analysis
    Appendix H	Flight Readiness Review Treatment of O-ring Problems
    Appendix I	NASA Pre-Launch Activities Team Report
    Appendix J	NASA Mission Planning and Operations Team Report
    Appendix K	NASA Development and Production Team Report
    Appendix L	NASA Accident Analysis Team Report
    Appendix M	Comments by Morton Thiokol on NASA Report
    
    The other volumes don't look likely to contain Feynman's separate
    report.  If I had to guess I'd say it would be Appendix F.
    
    Unfortunately, I don't know how to obtain volume 2.
        John Sauter
193.2. "Try the GPO" by LATOUR::DZIEDZIC Wed Jul 23 1986 09:44
    I believe Feynman published his report separately, so it was
    not contained in the "official" report.  Try the GPO for the
    remainder of the report; they might know where to get a
    copy of Feynman's comments.
    
193.3. "Not at the GPO" by MARY::LEKAS (Tony Lekas) Wed Jul 23 1986 15:32
I also got the first volume of the report.  The reference in
Scientific American seemed to indicate that Feynman's report was
separate.  Sort of his dissenting opinion.  I called the Government
Printing Office but the person at the order desk said that they
did not have it.  I am going to write to Scientific American and
ask where they got it.

Tony 
193.4. "Feynman's report" by FRSBEE::FARRINGTON (a Nuclear wonderland !) Wed Jul 23 1986 16:10
    Feynman's report is to be issued as a separate 'appendix', due,
    if I recall correctly, to some major disagreement over Feynman's
    conclusions (or analyses or some such).  The separate issue was
    to allow all parties to 'be happy'.
    
    The silliness of the scientific thinker !  I mean _really_; who
    wants to hear the unvarnished truth with hard core science to
    back it up ?!  There are (political) careers to consider...  
    
    Dwight
193.5. "Response from Scientific American" by MARY::LEKAS (Tony Lekas) Tue Aug 05 1986 12:26
I received a response from Scientific American.  They suggested
that I contact: 

California Institute of Technology
Public Relations Department
315 S. Hille Avenue
Pasadena, California, 91125

I just wrote them a letter.  I'll post the response here if and
when I get it.

		Tony 
193.6. "Response from Cal Tech" by MARY::LEKAS (Tony Lekas) Fri Aug 15 1986 18:13
Cal Tech sent me a copy of Feynman's observations and as I
promised I will post their response here.  They sent a copy of a
13 page typewritten document. 

It gives some insights into how people and organizations can fool
themselves.  It is interesting to note that, of the groups
responsible for the Solid Rocket Boosters, the Main Engines, the
computer system, the sensors, and the reaction control system,
Feynman believes that only the computer system software
organization gave appropriate attention to reliability and
safety.  Maybe this is because everyone knows that computer
software usually has bugs, and as a result they were more careful.

Feynman's observations follow in the next response.

		Tony
193.7. "Feynman's Observations" by MARY::LEKAS (Tony Lekas) Fri Aug 15 1986 18:15


        Personal observations on the reliability of  the  Shuttle
        by R.  P.  Feynman


        Introduction
        ------------

             It appears that there are  enormous  differences  of
        opinion  as  to the probability of a failure with loss of
        vehicle and of human  life.   The  estimates  range  from
        roughly  1  in  100  to 1 in 100,000.  The higher figures
        come from the working engineers, and the very low figures
        from management.  What are the causes and consequences of
        this lack of agreement?  Since 1 part  in  100,000  would
        imply  that  one  could put a Shuttle up each day for 300
        years expecting to lose only one, we could  properly  ask
        "What is the cause of management's fantastic faith in the
        machinery?"

             We have also found that certification criteria  used
        in  Flight  Readiness  Reviews  often develop a gradually
        decreasing strictness.  The argument that the  same  risk
        was  flown before without failure is often accepted as an
        argument for the safety of accepting it  again.   Because
        of this, obvious weaknesses are accepted again and again,
        sometimes  without  a  sufficiently  serious  attempt  to
        remedy  them,  or  to  delay  a  flight  because of their
        continued presence.

             There are several sources of information.  There are
        published criteria for certification, including a history
        of modifications in the form of waivers  and  deviations.
        In  addition, the records of the Flight Readiness Reviews
        for each flight document the arguments used to accept the
        risks  of  the flight.  Information was obtained from the
        direct testimony and the  reports  of  the  range  safety
        officer, Louis J.  Ullian, with respect to the history of
        success of solid fuel rockets.  There was a further study
        by  him  (as  chairman  of  the launch abort safety panel
        (LASP)) in an attempt to determine the risks involved  in
        possible  accidents  leading to radioactive contamination
        from attempting to fly a plutonium power supply (RTG) for
        future  planetary  missions.   The NASA study of the same
        question is also available.  For the history of the Space
        Shuttle  Main  Engines,  interviews  with  management and
        engineers  at  Marshall,  and  informal  interviews  with
        engineers  at Rocketdyne, were made.  An independent (Cal
        Tech) mechanical engineer who consulted  for  NASA  about
        engines  was  also  interviewed  informally.   A visit to
        Johnson was made to gather information on the reliability
        of  the  avionics  (computers,  sensors,  and effectors).
        Finally there is a  report  "A  Review  of  Certification
        Practices,  Potentially  Applicable to Man-rated Reusable
        Rocket  Engines,"  prepared   at   the   Jet   Propulsion
        Laboratory  by  N.  Moore, et al., in February, 1986, for
        NASA Headquarters, Office of Space Flight.  It deals with
        the  methods  used by the FAA and the military to certify
        their gas turbine and rocket engines.  These authors were
        also interviewed informally.


        Solid Rockets (SRB)
        -------------------

             An estimate of the reliability of solid rockets  was
        made  by  the  range  safety  officer,  by  studying  the
        experience of all previous  rocket  flights.   Out  of  a
        total  of  nearly  2,900  flights,  121 failed (1 in 25).
        This includes, however, what may be called early errors:
        rockets  flown  for  the  first few times in which design
        errors are  discovered  and  fixed.   A  more  reasonable
        figure  for  the  mature  rockets might be 1 in 50.  With
        special care in the selection of parts and in inspection,
        a  figure  of  below  1 in 100 might be achieved but 1 in
        1,000 is probably not attainable with today's technology.
        (Since there are two rockets on the Shuttle, these rocket
        failure rates must be  doubled  to  get  Shuttle  failure
        rates from Solid Rocket Booster failure.)
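
The range safety officer's estimate reduces to simple ratios.  A sketch of the numbers quoted above (an editorial addition):

```python
# Failure statistics for solid rockets as quoted in the text.
total_flights = 2900
failures = 121
print(f"raw rate: 1 in {total_flights / failures:.0f}")   # about 1 in 24

# With two boosters per Shuttle, the per-flight chance that at least one
# fails is, to first order, double the single-booster rate.
mature_rate = 1 / 50                        # Feynman's "more reasonable" figure
shuttle_rate = 1 - (1 - mature_rate) ** 2   # ~ 2 * mature_rate for small rates
print(f"Shuttle failure rate from SRBs: about 1 in {1 / shuttle_rate:.0f}")
```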

             NASA officials argue that the figure is much  lower.
        They  point  out  that  these  figures  are  for unmanned
        rockets but since the Shuttle is a  manned  vehicle  "the
        probability  of mission success is necessarily very close
        to 1.0." It is not very clear  what  this  phrase  means.
        Does  it  mean  it  is  close to 1 or that it ought to be
        close to 1?  They go on  to  explain  "Historically  this
        extremely  high  degree of mission success has given rise
        to a difference in philosophy between manned space flight
        programs   and   unmanned   programs;   i.e.,   numerical
        probability usage versus  engineering  judgment."  (These
        quotations  are  from  "Space  Shuttle Data for Planetary
        Mission RTG Safety Analysis," Pages  3-1,  3-1,  February
        15,  1985, NASA, JSC.) It is true that if the probability
        of failure was as low as 1 in 100,000 it  would  take  an
        inordinate number of tests to determine it (you would
        get nothing but a string of perfect flights, from which
        no precise figure could be obtained other than that the
        probability of failure is likely less than 1 in the
        number of such flights in the string so far).  But, if
        the real probability is not so small, flights would show
        troubles, near failures, and possible actual failures
        with a reasonable number of trials, and standard
        statistical methods could give a reasonable
        estimate.   In  fact, previous NASA experience had shown,
        on occasion, just such difficulties, near accidents,  and
        accidents,  all  giving  warning  that the probability of
        flight failure was not so very small.  The  inconsistency
        of  the  argument  not  to  determine reliability through
        historical experience, as the range safety  officer  did,
        is   that   NASA   also  appeals  to  history,  beginning
        "Historically this high  degree  of  mission  success..."
        Finally,   if   we  are  to  replace  standard  numerical
        probability usage with engineering judgment,  why  do  we
        find  such  an  enormous disparity between the management
        estimate and the judgment of  the  engineers?   It  would
        appear  that, for whatever purpose, be it for internal or
        external consumption, the management of NASA  exaggerates
        the reliability of its product, to the point of fantasy.
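
Feynman's point that a string of successes only bounds the failure probability can be made concrete.  The sketch below (an editorial addition; the flight count of 25 is a round hypothetical figure) shows that a perfect record is consistent with wildly different failure rates:

```python
# Probability of observing n consecutive successful flights
# when each flight fails independently with probability p.
def prob_of_clean_string(p, n):
    return (1 - p) ** n

n = 25  # a hypothetical pre-accident flight count
for p in (1 / 100_000, 1 / 1_000, 1 / 100, 1 / 25):
    print(f"p = 1/{round(1 / p)}: P(all {n} succeed) = {prob_of_clean_string(p, n):.3f}")
# Only p near 1/25 or larger would likely have produced a failure by now;
# the clean record cannot distinguish 1/1,000 from 1/100,000, so it
# cannot support management's 1-in-100,000 claim.
```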

             The  history  of  the   certification   and   Flight
        Readiness  Reviews will not be repeated here.  (See other
        part of Commission reports.) The phenomenon of  accepting
        for  flight,  seals that had shown erosion and blow-by in
        previous flights, is very clear.  The  Challenger  flight
        is an excellent example.  There are several references to
        flights that had gone before.  The acceptance and success
        of  these  flights  is  taken as evidence of safety.  But
        erosion and blow-by are not  what  the  design  expected.
        They are warnings that something is wrong.  The equipment
        is not operating as expected, and therefore  there  is  a
        danger  that it can operate with even wider deviations in
        this unexpected and not thoroughly understood  way.   The
        fact  that  this  danger  did  not  lead to a catastrophe
        before is no guarantee that it will not  the  next  time,
        unless it is completely understood.  When playing Russian
        roulette the fact that the first shot got off  safely  is
        little comfort for the next.  The origin and consequences
        of the erosion and blow-by were not understood.  They did
        not   occur  equally  on  all  flights  and  all  joints;
        sometimes more, and sometimes less.   Why  not  sometime,
        when  whatever conditions determined it were right, still
        more leading to catastrophe?

             In spite of these  variations  from  case  to  case,
        officials  behaved  as  if  they  understood  it,  giving
        apparently  logical  arguments  to   each   other   often
        depending  on  the  "success"  of  previous flights.  For
        example, in determining if flight 51-L was safe  to  fly
        in  the face of ring erosion in flight 51-C, it was noted
        that the erosion depth was only one-third of the  radius.
        It  had been noted in an experiment cutting the ring that
        cutting it as deep as one radius was necessary before the
        ring  failed.   Instead  of  being  very  concerned  that
        variations  of   poorly   understood   conditions   might
        reasonably  create  a  deeper  erosion  this time, it was
        asserted that there was "a safety factor of three."  This is a
        strange use of the engineer's term, "safety factor."  If a
        bridge is built to withstand a certain load  without  the
        beams  permanently  deforming,  cracking, or breaking, it
        may be designed for the materials used to actually  stand
        up  under  three times the load.  This "safety factor" is
        to allow for uncertain excesses of load, or unknown extra
        loads,  or  weaknesses  in  the  material that might have
        unexpected flaws, etc.  If now the expected load comes on
        to  the new bridge and a crack appears in a beam, this is
        a failure of the design.  There was no safety  factor  at
        all;  even  though  the  bridge did not actually collapse
        because the crack went only one-third of the way  through
        the  beam.  The O-rings of the Solid Rocket Boosters were
        not designed to erode.  Erosion was a clue that something
        was  wrong.   Erosion was not something from which safety
        can be inferred.

             There was no way, without full  understanding,  that
        one  could  have confidence that conditions the next time
        might not produce erosion three times  more  severe  than
        the   time   before.    Nevertheless,   officials  fooled
        themselves into thinking they had such understanding  and
        confidence, in spite of the peculiar variations from case
        to case.  A mathematical  model  was  made  to  calculate
        erosion.    This  was  a  model  based  not  on  physical
        understanding but on empirical curve fitting.  To be more
        detailed, it was supposed a stream of hot gas impinged on
        the O-ring material, and the heat was determined  at  the
        point  of  stagnation  (so far, with reasonable physical,
        thermodynamic laws).  But to determine  how  much  rubber
        eroded  it was assumed this depended only on this heat by
        a formula suggested by data on  a  similar  material.   A
        logarithmic  plot  suggested  a  straight line, so it was
        supposed that the erosion varied as the .58 power of  the
        heat,  the .58 being determined by a nearest fit.  At any
        rate, adjusting some other  numbers,  it  was  determined
        that  the  model  agreed  with  the  erosion (to depth of
        one-third the radius of the ring).  There is nothing much
        so   wrong   with   this   as   believing   the   answer!
        Uncertainties appear  everywhere.   How  strong  the  gas
        stream  might  be was unpredictable, it depended on holes
        formed in the putty.  Blow-by showed that the ring  might
        fail  even  though not, or only partially eroded through.
        The empirical formula was known to be uncertain,  for  it
        did not go directly through the very data points by which
        it was determined.  There were a  cloud  of  points  some
        twice  above,  and  some twice below the fitted curve, so
        erosions twice predicted were reasonable from that  cause
        alone.    Similar   uncertainties  surrounded  the  other
        constants in  the  formula,  etc.,  etc.   When  using  a
        mathematical  model  careful  attention  must be given to
        uncertainties in the model.
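
The curve-fitting criticism is easy to reproduce.  The sketch below (an editorial addition; the data points are invented to scatter a factor of about two around a power law, mimicking the cloud of points Feynman describes) fits a straight line in log-log space, the standard way such an exponent as 0.58 is obtained:

```python
import math
import random

# Hypothetical erosion-vs-heat data: depth = c * heat**0.58, scattered a
# factor of up to 2 above or below, as described for the real data.
random.seed(1)
heats = [10, 20, 50, 100, 200, 500]
depths = [0.1 * h ** 0.58 * random.uniform(0.5, 2.0) for h in heats]

# Least-squares line in log-log space: log d = log c + k * log h.
xs = [math.log(h) for h in heats]
ys = [math.log(d) for d in depths]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c = math.exp(my - k * mx)
print(f"fitted exponent: {k:.2f}")

# The scatter means individual points sit well off the fitted curve, so a
# predicted erosion depth is uncertain by at least that same factor.
ratios = [d / (c * h ** k) for h, d in zip(heats, depths)]
print(f"worst point off the fit by a factor of {max(max(ratios), 1 / min(ratios)):.1f}")
```

The fitted exponent looks precise, but as the text says, the uncertainty of the points that produced it is what matters.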


        Liquid Fuel Engine (SSME)
        ------------------------- 

             During the flight of 51-L the  three  Space  Shuttle
        Main  Engines  all  worked  perfectly,  even, at the last
        moment, beginning to shut down the engines  as  the  fuel
        supply  began  to fail.  The question arises, however, as
        to whether, had it failed, and we were to investigate  it
        in  as much detail as we did the Solid Rocket Booster, we
        would find a similar lack of attention to  faults  and  a
        deteriorating  reliability.   In  other  words,  were the
        organization weaknesses that contributed to the  accident
        confined  to the Solid Rocket Booster sector or were they
        a more general characteristic of NASA?  To that  end  the
        Space  Shuttle  Main  Engines  and the avionics were both
        investigated.  No similar study of the  Orbiter  or  the
        External Tank was made.

             The engine is a much more complicated structure than
        the  Solid Rocket Booster, and a great deal more detailed
        engineering goes into  it.   Generally,  the  engineering
        seems  to  be of high quality and apparently considerable
        attention is paid to deficiencies  and  faults  found  in
        operation.

             The usual way that such engines  are  designed  (for
        military   or   civilian  aircraft)  may  be  called  the
        component system,  or  bottom-up  design.   First  it  is
        necessary  to  thoroughly  understand  the properties and
        limitations of the materials  to  be  used  (for  turbine
        blades, for example), and tests are begun in experimental
        rigs to determine  those.   With  this  knowledge  larger
        component  parts  (such  as  bearings)  are  designed and
        tested individually.  As deficiencies and  design  errors
        are  noted  they  are corrected and verified with further
        testing.  Since one tests only  parts  at  a  time  these
        tests   and   modifications  are  not  overly  expensive.
        Finally one works up to the final design  of  the  entire
        engine, to the necessary specifications.  There is a good
        chance, by this time, that  the  engine  will  generally
        succeed,  or  that  any  failures are easily isolated and
        analyzed  because  the  failure  modes,  limitations   of
        materials, etc., are so well understood.  There is a very
        good chance that the modifications to the engine  to  get
        around  the final difficulties are not very hard to make,
        for most  of  the  serious  problems  have  already  been
        discovered and dealt with in the earlier, less expensive,
        stages of the process.

             The Space Shuttle  Main  Engine  was  handled  in  a
        different manner, top down, we might say.  The engine was
        designed and put together all  at  once  with  relatively
        little  detailed  preliminary  study  of the material and
        components.   Then  when  troubles  are  found   in   the
        bearings, turbine blades, coolant pipes, etc., it is more
        expensive and difficult to discover the causes  and  make
        changes.   For  example,  cracks  have  been found in the
        turbine blades of the  high  pressure  oxygen  turbopump.
        Are  they  caused by flaws in the material, the effect of
        the oxygen atmosphere on the properties of the  material,
        the   thermal   stresses  of  startup  or  shutdown,  the
        vibration and stresses of steady running,  or  mainly  at
        some  resonance at certain speeds, etc.?  How long can we
        run from crack initiation to crack failure, and how  does
        this  depend  on power level?  Using the completed engine
        as a test bed to  resolve  such  questions  is  extremely
        expensive.  One does not wish to lose an entire engine in
        order to find out where and how failure occurs.  Yet,  an
        accurate  knowledge  of  this information is essential to
        acquire a confidence in the engine  reliability  in  use.
        Without  detailed  understanding,  confidence  can not be
        attained.

             A further disadvantage of  the  top-down  method  is
        that,  if  an  understanding  of  a  fault is obtained, a
        simple fix, such as a new shape for the turbine  housing,
        may  be impossible to implement without a redesign of the
        entire engine.

             The Space Shuttle Main Engine is a  very  remarkable
        machine.  It has a greater ratio of thrust to weight than
        any previous engine.  It is built  at  the  edge  of,  or
        outside  of, previous engineering experience.  Therefore,
        as  expected,  many  different   kinds   of   flaws   and
        difficulties  have turned up.  Because, unfortunately, it
        was built in the top-down manner, they are  difficult  to
        find and fix.  The design aim of a lifetime of 55
        mission-equivalent firings (27,000 seconds of operation,
        either in missions of 500 seconds each, or on a test stand)
        has not been obtained.   The  engine  now  requires  very
        frequent  maintenance and replacement of important parts,
        such as turbopumps, bearings, sheet metal housings,  etc.
        The high-pressure fuel turbopump had to be replaced every
        three or four mission equivalents (although that may have
        been  fixed,  now) and the high pressure oxygen turbopump
        every five or six.  This is at most ten  percent  of  the
        original specification.  But our main concern here is the
        determination of reliability.

             In a total of about 250,000  seconds  of  operation,
        the  engines  have  failed  seriously  perhaps  16 times.
        Engineering pays close attention to  these  failings  and
        tries  to  remedy  them  as quickly as possible.  This it
        does by  test  studies  on  special  rigs  experimentally
        designed for the flaws in question, by careful inspection
        of the engine for suggestive clues (like cracks), and  by
        considerable  study  and analysis.  In this way, in spite
        of the difficulties  of  top-down  design,  through  hard
        work, many of the problems have apparently been solved.

             A list of  some  of  the  problems  follows.   Those
        followed by an asterisk (*) are probably solved:

        1.  Turbine blade cracks in high pressure fuel turbopumps
            (HPFTP).  (May have been solved.)

        2.  Turbine  blade  cracks  in   high   pressure   oxygen
            turbopumps (HPOTP).

        3.  Augmented Spark Igniter (ASI) line rupture.*

        4.  Purge check valve failure.*

        5.  ASI chamber erosion.*

        6.  HPFTP turbine sheet metal cracking.

        7.  HPFTP coolant liner failure.*

        8.  Main combustion chamber outlet elbow failure.*

        9.  Main combustion chamber inlet elbow weld offset.*

       10.  HPOTP subsynchronous whirl.*

       11.  Flight acceleration  safety  cutoff  system  (partial
            failure in a redundant system).*

       12.  Bearing spalling (partially solved).

       13.  A  vibration  at  4,000  Hertz  making  some  engines
            inoperable, etc.


             Many  of  these  solved  problems  are   the   early
        difficulties  of a new design, for 13 of them occurred in
        the first 125,000 seconds and only three  in  the  second
        125,000  seconds.   Naturally, one can never be sure that
        all the bugs are out, and, for some, the fix may not have
        addressed  the  true cause.  Thus, it is not unreasonable
        to guess there may be at least one surprise in  the  next
        250,000  seconds,  a  probability of 1/500 per engine per
        mission.  On a mission there are three engines, but  some
        accidents  would  possibly  be contained, and only affect
        one engine.  The system can abort with only two  engines.
        Therefore let us say that the unknown surprises do not,
        even of themselves, permit us to guess that the
        probability of mission failure due to the Space Shuttle
        Main Engine is less than 1/500.  To this we must add the
        chance  of  failure  from  known,  but  as  yet unsolved,
        problems (those without the asterisk in the list  above).
        These  we  discuss  below.  (Engineers at Rocketdyne, the
        manufacturer, estimate the total probability as 1/10,000.
        Engineers at Marshall  estimate  it  as 1/300, while NASA
        management, to whom these engineers report, claims it  is
        1/100,000.   An  independent engineer consulting for NASA
        thought 1 or 2 per 100 a reasonable estimate.)
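
The 1/500 guess can be traced arithmetically from the figures in the text: 250,000 seconds of experience at roughly 500 seconds of burn per engine per mission is 500 engine-missions, so "at least one surprise in the next 250,000 seconds" means one failure per 500 engine-missions.  A sketch (an editorial addition):

```python
# Arithmetic behind Feynman's 1/500-per-engine-per-mission guess.
total_seconds = 250_000       # operating experience so far
serious_failures = 16         # 13 in the first 125,000 s, 3 in the second
mission_seconds = 500         # one mission's burn per engine

engine_missions = total_seconds / mission_seconds   # 500 engine-missions
# "At least one surprise" in the NEXT 250,000 seconds:
p_per_engine_mission = 1 / engine_missions
print(p_per_engine_mission)   # 0.002, i.e. 1/500

# The raw historical rate is considerably worse than this guess,
# though many of those failures reflect since-fixed early problems.
print(serious_failures / engine_missions)   # 0.032, i.e. about 1/31
```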

             The history  of  the  certification  principles  for
        these  engines  is  confusing  and  difficult to explain.
        Initially the rule seems to have  been  that  two  sample
        engines  must  each  have  had  twice  the time operating
        without failure as the operating time of the engine to be
        certified  (rule  of  2x).   At  least  that  is  the FAA
        practice, and NASA seems to have adopted  it,  originally
        expecting  the certified time to be 10 missions (hence 20
        missions for each sample).  Obviously the best engines to
        use  for  comparison  would  be  those  of greatest total
        (flight plus test) operating time -- the so-called "fleet
        leaders."  But  what if a third sample and several others
        fail in a short time?  Surely we will not be safe because
        two were unusual in lasting longer.  The short time might
        be more representative of the real possibilities, and  in
        the  spirit  of  the  safety  factor of 2, we should only
        operate at half the time of the short-lived samples.
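
The 2x rule, with Feynman's caveat that a short-lived sample should pull the certified time down rather than be excused by the fleet leaders, fits in a few lines.  A hypothetical sketch (the function name and sample numbers are illustrative, not from the report):

```python
# FAA-style "rule of 2x" with Feynman's caveat: certify only up to half
# the failure-free running time of the WORST comparison sample, not the
# fleet leaders' time.
def certified_time(sample_times_without_failure):
    return min(sample_times_without_failure) / 2

# Two fleet leaders alone would certify 10 missions' worth of time...
print(certified_time([20, 20]))      # 10.0
# ...but a third sample failing early should cut that sharply.
print(certified_time([20, 20, 6]))   # 3.0
```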

             The slow shift toward decreasing safety  factor  can
        be  seen  in  many  examples.   We take that of the HPFTP
        turbine blades.  First of all  the  idea  of  testing  an
        entire  engine was abandoned.  Each engine number has had
        many important parts  (like  the  turbopumps  themselves)
        replaced  at frequent intervals, so that the rule must be
        shifted from engines to components.  We accept  an  HPFTP
        for  a  certification  time  if two samples have each run
        successfully for twice that time (and  of  course,  as  a
        practical  matter,  no longer insisting that this time be
        as large as 10 missions).  But what is "successfully"?
        The  FAA calls a turbine blade crack a failure, in order,
        in practice, to really provide a  safety  factor  greater
        than  2.   There  is  some  time  that  an engine can run
        between the time a crack originally starts until the time
        it  has  grown  large  enough  to  fracture.  (The FAA is
        contemplating new rules that take this extra safety  time
        into  account,  but only if it is very carefully analyzed
        through known models within a known range  of  experience
        and  with  materials  thoroughly  tested.   None of these
        conditions apply to the Space Shuttle Main Engine.)

             Cracks were found in many second stage HPFTP turbine
        blades.   In  one  case  three  were  found  after  1,900
        seconds, while in another they were not found after 4,200
        seconds,   although  usually  these  longer  runs  showed
        cracks.  To follow this story further we  shall  have  to
        realize that the stress depends a great deal on the power
        level.  The Challenger flight was to be at, and  previous
        flights  had  been at, a power level called 104% of rated
        power level during most of  the  time  the  engines  were
        operating.    Judging  from  some  material  data  it  is
        supposed that at the level 104% of rated power level, the
        time  to  crack is about twice that at 109% or full power
        level (FPL).  Future flights were to  be  at  this  level
        because  of heavier payloads, and many tests were made at
        this level.  Therefore dividing time at  104%  by  2,  we
        obtain  units  called equivalent full power level (EFPL).
        (Obviously, some uncertainty is introduced by  that,  but
        it  has  not been studied.) The earliest cracks mentioned
        above occurred at 1,375 EFPL.
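
The EFPL bookkeeping described above is a one-line conversion: seconds at 104% of rated power count at half weight, seconds at 109% (full power level) count in full.  A sketch (an editorial addition; the function name and the 7,600-second split are illustrative):

```python
# Equivalent Full Power Level (EFPL) accounting: blade stress at 104%
# power is supposed to crack blades about half as fast as at 109% (FPL),
# so time at 104% is divided by 2.
def efpl_seconds(seconds_at_104, seconds_at_fpl=0.0):
    return seconds_at_104 / 2 + seconds_at_fpl

# A turbine credited with 3,800 s EFPL could, for example, have run
# 7,600 s entirely at 104% power:
print(efpl_seconds(7600))   # 3800.0
```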

             Now the certification rule becomes "limit all second
        stage  blades to a maximum of 1,375 seconds EFPL." If one
        objects that the safety factor of 2 is lost it is pointed
        out  that  the  one  turbine  ran  for 3,800 seconds EFPL
        without cracks, and half of this is 1,900 so we are being
        more  conservative.   We  have  fooled ourselves in three
        ways.  First we have only one sample, and it is  not  the
        fleet  leader, for the other two samples of 3,800 or more
        seconds had 17 cracked blades between them.   (There  are
        59  blades  in the engine.) Next we have abandoned the 2x
        rule and substituted equal time.  And finally,  1,375  is
        where  we  did see a crack.  We can say that no crack had
        been found below 1,375, but the last time we  looked  and
        saw  no  cracks  was  1,100 seconds EFPL.  We do not know
        when the crack formed between these  times,  for  example
        cracks   may   have   formed   at   1,150  seconds  EFPL.
        (Approximately 2/3 of the blade sets tested in excess  of
        1,375  seconds  EFPL had cracks.  Some recent experiments
        have, indeed, shown cracks as early as 1,150 seconds.) It
        was important to keep the number high, for the Challenger
        was to fly an engine very close to the limit by the  time
        the flight was over.

             Finally it is claimed  that  the  criteria  are  not
        abandoned,  and  the system is safe, by giving up the FAA
        convention  that  there  should   be   no   cracks,   and
        considering  only a completely fractured blade a failure.
        With this definition no engine has yet failed.  The  idea
        is  that  since  there  is sufficient time for a crack to
        grow to a fracture we can insure  that  all  is  safe  by
        inspecting  all  blades  for  cracks.  If they are found,
        replace them, and if none are found we have  enough  time
        for  a  safe mission.  This makes the crack problem not a
        flight safety problem, but merely a maintenance problem.

             This may in fact be true.  But how well do  we  know
        that  cracks  always  grow slowly enough that no fracture
        can occur in a mission?  Three engines have run for  long
        times  with  a  few  cracked  blades (about 3,000 seconds
        EFPL) with no blades broken off.

             But a fix for this cracking may have been found.  By
        changing  the  blade shape, shot-peening the surface, and
        covering with insulation to exclude  thermal  shock,  the
        blades have not cracked so far.

             A very similar  story  appears  in  the  history  of
        certification  of  the  HPOTP,  but we shall not give the
        details here.

             It is evident, in summary, that the Flight Readiness
        Reviews  and certification rules show a deterioration for
        some of the problems of the  Space  Shuttle  Main  Engine
        that  is  closely  analogous to the deterioration seen in
        the rules for the Solid Rocket Booster.

        Avionics
        --------

             By "avionics" is meant the computer  system  on  the
        Orbiter   as   well  as  its  input  sensors  and  output
        actuators.  At first we will restrict  ourselves  to  the
        computers   proper   and   not   be  concerned  with  the
        reliability of the input information from the sensors  of
        temperature,   pressure,   etc.,  nor  with  whether  the
        computer output is faithfully followed by  the  actuators
        of  rocket  firings,  mechanical  controls,  displays  to
        astronauts, etc.

             The computer system is very elaborate,  having  over
        250,000  lines  of  code.   It is responsible, among many
        other things, for the automatic  control  of  the  entire
        ascent  to orbit, and for the descent until well into the
        atmosphere (below Mach  1)  once  one  button  is  pushed
        deciding  the landing site desired.  It would be possible
        to make the entire landing automatically (except that the
        landing  gear  lowering  signal  is expressly left out of
        computer control, and must  be  provided  by  the  pilot,
        ostensibly  for  safety  reasons)  but  such  an entirely
        automatic landing is probably not  as  safe  as  a  pilot
        controlled  landing.  During orbital flight it is used in
        the control of payloads, in displaying information to the
        astronauts,  and  the  exchange  of  information  to  the
        ground.  It is evident that the safety of flight requires
        guaranteed  accuracy of this elaborate system of computer
        hardware and software.

             In brief, the hardware  reliability  is  ensured  by
        having  four  essentially  independent identical computer
        systems.  Where possible each sensor  also  has  multiple
        copies, usually four, and each copy feeds all four of the
        computer lines.  If the inputs from the sensors disagree,
        depending   on  circumstances,  certain  averages,  or  a
        majority selection is used as the effective  input.   The
        algorithm  used  by each of the four computers is exactly
        the same, so their inputs (since each sees all copies  of
        the  sensors)  are  the same.  Therefore at each step the
        results in each computer should be identical.  From  time
        to time they are compared, but because they might operate
        at slightly different speeds a  system  of  stopping  and
        waiting  at  specific  times  is  instituted  before each
        comparison is made.  If one of the  computers  disagrees,
        or  is  too  late  in  having its answer ready, the three
        which do agree are assumed to be correct and  the  errant
        computer is taken completely out of the system.  If, now,
        another computer fails, as judged by the agreement of the
        other two, it is taken out of the system, and the rest of
        the flight canceled, and descent to the landing  site  is
        instituted,  controlled  by  the two remaining computers.
        It is seen that this is  a  redundant  system  since  the
        failure of only one computer does not affect the mission.
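The voting scheme described above can be sketched in a few lines. This is a simplified illustration, not the actual flight software; the function name `vote` and the data shapes are invented for the example, and the real system also averages redundant sensor inputs and handles the two-computers-remaining descent case:

```python
def vote(results):
    """Majority-select among redundant computer outputs.

    `results` maps a computer id to its answer, or to None if that
    computer missed the synchronization deadline.  A computer that is
    late, or whose answer disagrees with the majority, is taken out
    of the system; the survivors' common answer is returned.
    """
    # Computers that met the deadline.
    answered = {cid: r for cid, r in results.items() if r is not None}

    # Count identical answers and take the majority.
    tally = {}
    for r in answered.values():
        tally[r] = tally.get(r, 0) + 1
    majority_answer = max(tally, key=tally.get)

    survivors = {cid for cid, r in answered.items() if r == majority_answer}
    failed = set(results) - survivors
    return majority_answer, survivors, failed

# One computer late, three in agreement: the three carry the mission
# and the errant computer is removed.
answer, survivors, failed = vote({1: 42, 2: 42, 3: 42, 4: None})
```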

        Finally, as an extra feature of safety, there is a  fifth
        independent  computer,  whose  memory is loaded with only
        the programs of ascent and descent, and which is  capable
        of  controlling the descent if there is a failure of more
        than two of the computers of the main line four.

             There is not enough room in the memory of  the  main
        line  computers  for all the programs of ascent, descent,
        and payload programs in flight, so the memory  is  loaded
         about four times from tapes, by the astronauts.

             Because of the enormous effort required  to  replace
        the  software  for  such  an  elaborate  system,  and for
        checking a new system out, no change has been made to the
        hardware  since the system began about fifteen years ago.
        The  actual  hardware  is  obsolete;  for  example,   the
        memories  are  of  the  old  ferrite  core  type.   It is
        becoming more difficult to find manufacturers  to  supply
        such   old-fashioned   computers  reliably  and  of  high
        quality.  Modern computers are very much  more  reliable,
        can  run  much faster, simplifying circuits, and allowing
        more to be done, and would not require so much loading of
        memory, for the memories are much larger.

             The  software  is  checked  very  carefully   in   a
        bottom-up  fashion.   First,  each  new  line  of code is
        checked, then sections of code or  modules  with  special
        functions  are  verified.  The scope is increased step by
        step until  the  new  changes  are  incorporated  into  a
        complete  system  and  checked.   This complete output is
        considered  the  final  product,  newly  released.    But
        completely   independently   there   is   an  independent
        verification group, that takes an adversary  attitude  to
        the  software  development  group, and tests and verifies
        the software as if it were a customer  of  the  delivered
        product.   There  is additional verification in using the
        new programs in simulators, etc.  A discovery of an error
        during  verification  testing is considered very serious,
        and its origin  studied  very  carefully  to  avoid  such
        mistakes in the future.  Such unexpected errors have been
        found only about six times in  all  the  programming  and
        program  changing  (for new or altered payloads) that has
        been done.  The principle that is followed  is  that  all
        the  verification  is not an aspect of program safety, it
        is merely a test of that safety,  in  a  non-catastrophic
        verification.   Flight  safety  is to be judged solely on
        how well the programs do in the  verification  tests.   A
        failure here generates considerable concern.

             To summarize then, the  computer  software  checking
        system  and  attitude  is  of the highest quality.  There
        appears to be no process  of  gradually  fooling  oneself
        while  degrading standards so characteristic of the Solid
        Rocket  Booster  or  Space  Shuttle  Main  Engine  safety
         systems.   To be sure, there have been recent suggestions
        by management to curtail  such  elaborate  and  expensive
        tests  as  being unnecessary at this late date in Shuttle
        history.   This  must  be  resisted  for  it   does   not
        appreciate  the  mutual subtle influences, and sources of
        error generated by even small changes of one  part  of  a
        program  on  another.   There  are perpetual requests for
        changes as new payloads and new demands and modifications
        are  suggested  by  the  users.   Changes  are  expensive
        because they require extensive testing.  The  proper  way
        to  save  money  is  to  curtail  the number of requested
        changes, not the quality of testing for each.

             One might add that the  elaborate  system  could  be
        very   much   improved   by   more  modern  hardware  and
        programming techniques.  Any  outside  competition  would
        have  all  the  advantages  of starting over, and whether
        that is a good idea for  NASA  now  should  be  carefully
        considered.

             Finally, returning to the sensors and  actuators  of
        the  avionics system, we find that the attitude to system
        failure and reliability is not nearly as good as for  the
        computer  system.   For  example,  a difficulty was found
        with certain temperature sensors sometimes failing.   Yet
        18  months  later the same sensors were still being used,
        still  sometimes  failing,  until  a  launch  had  to  be
        scrubbed  because  two  of  them failed at the same time.
        Even on a succeeding flight this  unreliable  sensor  was
        used  again.   Again reaction control systems, the rocket
        jets used for reorienting and control in flight still are
        somewhat  unreliable.   There is considerable redundancy,
        but a long history of failures, none  of  which  has  yet
        been  extensive  enough  to seriously affect flight.  The
        action of the jets is checked by sensors,  and,  if  they
        fail  to  fire  the computers choose another jet to fire.
        But they are not designed to fail, and the problem should
        be solved.

        Conclusions
        -----------

             If a reasonable launch schedule is to be maintained,
        engineering  often  cannot be done fast enough to keep up
        with  the   expectations   of   originally   conservative
        certification  criteria designed to guarantee a very safe
        vehicle.  In these situations,  subtly,  and  often  with
        apparently logical arguments, the criteria are altered so
        that flights  may  still  be  certified  in  time.   They
        therefore  fly  in  a relatively unsafe condition, with a
        chance of failure of  the  order  of  a  percent  (it  is
        difficult to be more accurate).
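The practical gap between the two estimates (roughly 1 in 100 per flight versus a figure a thousand times smaller) is easy to make concrete. A quick calculation, assuming for illustration that flights are independent, over the 25 missions flown up to and including Challenger:

```python
def p_at_least_one_failure(p_per_flight, n_flights):
    """Probability of at least one failure in n independent flights."""
    return 1.0 - (1.0 - p_per_flight) ** n_flights

# Working engineers' order-of-magnitude estimate: ~1 in 100 per flight.
engineers = p_at_least_one_failure(1e-2, 25)    # roughly a 1-in-5 chance

# Management's claimed figure, a thousand times smaller.
management = p_at_least_one_failure(1e-5, 25)   # well under 1 in 1,000
```

At one percent per flight, a loss somewhere in the first 25 missions is not a freak event but close to an even-money bet over a few more years of flying; at management's figure it would be essentially impossible.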

             Official management, on the other  hand,  claims  to
        believe  the  probability  of failure is a thousand times
        less.  One reason for this may be an  attempt  to  assure
        the government of NASA perfection and success in order to
        ensure the supply of funds.  The other may be  that  they
        sincerely believed it to be true, demonstrating an almost
        incredible lack of communication between  themselves  and
        their working engineers.

             In  any  event  this  has   had   very   unfortunate
        consequences,  the  most serious of which is to encourage
        ordinary citizens to fly in such a dangerous machine,  as
        if  it  had  attained the safety of an ordinary airliner.
        The astronauts,  like  test  pilots,  should  know  their
        risks,  and  we  honor  them  for their courage.  Who can
        doubt that  McAuliffe  was  equally  a  person  of  great
        courage,  who was closer to an awareness of the true risk
        than NASA management would have us believe?

             Let us make  recommendations  to  ensure  that  NASA
        officials  deal  in  a  world of reality in understanding
        technological weaknesses and imperfections well enough to
        be  actively trying to eliminate them.  They must live in
        reality in comparing the costs and utility of the Shuttle
        to  other  methods  of  entering space.  And they must be
        realistic in making contracts, in estimating  costs,  and
        the  difficulty  of  the projects.  Only realistic flight
        schedules should  be  proposed,  schedules  that  have  a
        reasonable  chance  of  being  met.   If  in this way the
        government would not support them, then so be  it.   NASA
        owes  it  to the citizens from whom it asks support to be
        frank, honest, and informative, so  that  these  citizens
        can  make  the  wisest  decisions  for  the  use of their
        limited resources.

             For  a  successful  technology,  reality  must  take
        precedence  over  public  relations, for nature cannot be
        fooled.
193.8. "Thanks Tony; lesson here for us" by DSSDEV::SAUTER (John Sauter) Tue Aug 19 1986 16:11
    Thank you very much, Tony, for typing this in.  I have just had time to
    read it.  If anybody doesn't want to read the whole thing, at least
    read the last page, which is the Conclusions.  There is one part that
    struck a particular chord with me: 
    
    ``Only realistic flight schedules should be proposed, schedules that
    have a reasonable chance of being met.  If in this way the government
    would not support them, then so be it.  NASA owes it to the citizens
    from whom it asks support to be frank, honest, and informative, so that
    these citizens can make the wisest decisions for the use of their
    limited resources.'' 
    
    Change "flight" to "project", "the government" to "higher management",
    "NASA" to "engineering management" and "citizens" to "the Corporation",
    and the quotation applies too well to Digital.  Engineering management
    sometimes demands an unrealistic schedule, knowing that it cannot be
    met, because a realistic schedule would cause the project to be
    cancelled.  I don't think that's right--it amounts to lying to higher
    management in order to keep your funding.  I may have gone along with
    this in the past, but I'm not going to do it in the future. 
    
    My management can call me unreasonable, inflexible and bullheaded, but
    I am not going to sign my name to a schedule (or agree verbally to a
    schedule, or be silent when a schedule is described in my presence)
    unless I feel that it is a realistic schedule: one that I can meet. 
    If I cause a project to be cancelled, ``so be it''.  If my attitude 
    gets me fired, at least I will not feel ashamed. 
    
    Yeah, I know I'm a very small cog.  Nothing that I do will prevent
    somebody from being killed, or millions of dollars of property from
    being lost, but the principle is the same.
        John Sauter                          
193.9. by VIKING::BANKS (Dawn Banks) Tue Aug 19 1986 18:40
    I found Feynman's comments fascinating.  It sort of makes me think
    along the following lines:
    
    {Beginning of extreme radical opinions}
    
    From everything I've read so far (Rogers Commission, Feynman's
    comments, AW&ST, etc.), the events that led up to the shuttle
    disaster precisely followed normally accepted management techniques.  If
    not accepted, at least techniques that are used in every single
    large organization I've ever gotten close enough to to know the
    difference.  Or, to put it another way, I cannot find a single
    occurrence that I'd be surprised to see happen anywhere else.
    
    It seems to me, then, that allowing Feynman's commentary to be a
    part of the original Rogers Commission report would be to publish
    an indictment of commonly practised management techniques.  As a
    matter of fact, the reason that they didn't want to make such an
    indictment was that some of the board members would be indicting
    themselves as well as NASA, as I'm sure more than a few board members
    may have been guilty of the same techniques.
    
    In other words: Cover up.
193.10. "Good but scary point" by ODIXIE::VICKERS (Don Vickers, Notes DIG member) Tue Aug 19 1986 23:47
    Re: .9
    
    FORTUNE magazine did a cover story on just your ideas relative to
    'normal' management being the same as NASA.
    
    It is indeed frightening.
    
    Don
193.11. "The 'Science' of Management" by GALLO::DZIEDZIC Wed Aug 20 1986 10:59
    Let me interject a note of calmness (sanity?) into this discussion.
    
    Back in the late 60's, when the Apollo program was winding down,
    the person in charge of NASA (some general, whose name I forget)
    had never been exposed to "management science".  He was a guy
    who was perfectly willing to barge into anyone's office looking
    for answers to problems.  None of this barrage of memos for him!
    
    Eventually, the "old guard" management at NASA was replaced by
    "management scientists".  (I personally find the term contradictory,
    but that's another topic.)  These people were more concerned with
    making the "right management decision" than the "right decision".
    For many of them, the distinction blurred as time went by.
    
    However, I still feel the majority of the technical people involved
    with NASA are dedicated individuals who are working toward goals
    to which they have a strong personal commitment.  They are doing
    the best they can under less than ideal conditions.
    
    It is very easy to spout platitudes about how you'd never work
    under such conditions, but let's be realistic.  Where can these
    people go and still be able to contribute to the space effort?
    (Let's not even consider the financial end of things, like how
    do you support your family if you quit your job?)  Any place
    in the industry will likely have the same shortcomings.  I
    suspect many have made a painful decision they will have to
    "work within the system" to effect any changes.
    
    Of course, there is the slight problem with funding for NASA
    which has also contributed toward the shuttle disaster.  In
    the years since Apollo has wound down, NASA's budget (in
    dollars adjusted for inflation) has decreased consistently
    (with maybe a few exceptions).  When the money starts getting
    tight managers look for shortcuts, with results of which we
    have become painfully aware.
    
    What can be done?  Perhaps James Fletcher might make a difference,
    although he always maintained the cost-to-orbit for a pound of
    payload on the shuttle would drop to the hundreds of dollars
    range and it never got within an order of magnitude of that.
    Maybe, until the collective mind of the legislature forgets,
    there will be a push to restore NASA to the necessary funding
    level to successfully support a national space effort.
    
    In the long run, however, I think until the "management science"
    trend starts to lose favor, there will always be problems like
    those which caused the loss of the shuttle, and seven people
    who gave a damn.  That's what hurts the most.
    
193.12. "which issue?" by COIN::ELKIND (Steve Elkind) Thu Aug 21 1986 12:27
re .10

Do you remember which issue of Fortune magazine?  Would it be possible to
steal or borrow a copy of the article?
193.13. "decision failures" by ALIEN::MCCARTHY Thu Aug 21 1986 18:35
    I would contend that there are very few instances in which "the
    right decision" and "the right management decision" are different,
    since attention to detail generally means success in the long run.
    
    The problem is basically that decision making is very hard work,
    and the ability to make decisions is a skill that is distributed
    through the population a lot like other skills.
    
    With the growth of the American economy and the onslaught of
    automation, the U.S. is consuming managers a lot faster than it
    can produce them. This leads to a lot of people in the field who
    don't have a flair for it. For them, the "right decision" is the
    one that makes the problem go away the fastest, since the complexity
    of dealing with it bothers them.
    
    Some will contend that there are "right management decisions" which
    aren't "right decisions" for economic reasons. I contend that that
    argument is horse-hockey. A perfectly legitimate solution to a problem
    is to report to one's superiors that it can't be done for the amount
    desired. We do that all the time at DEC (thank heaven!). Many managers
    don't like to do it (again it just makes the problem take longer)
    and many managers encourage their underlings not to give those
    solutions (because then it becomes THEIR problem).
    
    The shuttle program has fallen victim to mediocrity, as has much
    of the American economy. The problem won't go away until we breed
    more smart managers and get them into circulation.
    							-Brian
193.14. "Memory failure" by NUTMEG::VICKERS (Don Vickers, Notes DIG member) Fri Aug 22 1986 21:32
    I wish I could remember which issue it was.  I always throw the
    silly things out when I'm done with them.
    
    It was in June or thereabouts, I believe.  One way to get the article
    is through your nearest public library in the Periodical Review.
    
    Sorry about that,
    
    Don
193.15. "COVER UP?" by MDSUPT::EATON (Dan Eaton) Wed Jul 26 1989 21:52
    In the May 1989 issue of DefenseScience magazine there's an interesting
    commentary by Yale Jay Lubkin. The title of the commentary is
    'CHALLENGER-The Truth Slowly Outs'. In it it mentions Feynman's book
    'What do you care what other people think?' having a lot to say
    about the CYA mentality of NASA management. The commentary also
    mentions an Ali AbuTaha and credits him with finding the real cause
    of the disaster. AbuTaha found that NASA used the wrong figure 
    in designing the booster strut nearest the failure. They used a
    figure that was too low by about 150%. The commentary goes on to
    say the O-ring design was not a problem and makes some interesting
    comparisons to the same design used in TITAN.
    
    Has anyone else seen this article? Has anyone ever heard of AbuTaha?
    Comments or pointers would be appreciated.
    
    Dan Eaton
193.16. "what does 150% too low mean?" by CHRCHL::GERMAIN (Down to the Sea in Ships) Thu Jul 27 1989 10:18
	Note 193.15
	MDSUPT::EATON "Dan Eaton"

    >in designing the booster strut nearest the failure. They used a
    >figure that was too low by about 150%. 

	I wonder what the article meant by "150% too low"? Does this mean, 
	for example, the strength was "10" and it needed to be "25"?

	Thanks,
    
    	Gregg
193.17. by SPMFG1::CHARBONND (I'm the NRA) Fri Jul 28 1989 10:29
    That would make it 60% too low, not 150%
193.18. "ex" by CHRCHL::GERMAIN (Down to the Sea in Ships) Mon Jul 31 1989 17:24
    .17
    
     That's what I thought, but 150% too low sounds like it is less
    than zero.
    
    Gregg
193.19. "The actual numbers" by MDSUPT::EATON (Dan Eaton) Mon Jul 31 1989 18:43
    RE:last few on the numbers
    
    The numbers used in the commentary were this:
    140,000 pounds - the number NASA decided was the max load on the strut.
    190,000 pounds - the number at which the strut fails.
    330,000 pounds - the number AbuTaha found was the actual load.

    They came up with the 150% figure.                                  
    
    The thing that bugs me about this report is that if the actual load
    was 330,000 pounds and the strut fails at 190,000, then how did the
    strut last through as many shuttle missions as it did? That's why
    I was wondering if any one else had heard this.
    
    Dan Eaton
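The confusion in .16 through .18 about what "150% too low" means comes down to which baseline the difference is divided by. A quick check using the figures listed in .19 (variable names are invented for the example):

```python
design_load  = 140_000   # NASA's assumed maximum load on the strut (lb)
failure_load = 190_000   # load at which the strut fails (lb)
actual_load  = 330_000   # load AbuTaha claims the strut actually sees (lb)

# "The actual load is about 136% higher than the design figure"
# (the article's ~150% appears to be a rough rounding of this):
overshoot = (actual_load - design_load) / design_load * 100    # ~135.7%

# Equivalently, "the design figure is about 58% lower than the actual
# load" (close to the 60% suggested in .17):
undershoot = (actual_load - design_load) / actual_load * 100   # ~57.6%
```

So nothing is "less than zero": a quantity can be more than 100% too low only when the difference is expressed as a fraction of the smaller number.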