
Conference turris::decc

Title:DECC
Notice:General DEC C discussions
Moderator:TLE::D_SMITHNTE
Created:Fri Nov 13 1992
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2212
Total number of notes:11045

2078.0. "Latent 64-bit bugs can get you" by DECC::SULLIVAN (Jeff Sullivan) Thu Jan 30 1997 18:35

             <<< HUMANE::DISK$SCSI:[NOTES$LIBRARY]DIGITAL.NOTE;1 >>>
                        -< The Digital way of working >-
================================================================================
Note 5111.0        From the loss of a nail, a battle was lost          7 replies
LJSRV1::ENGBROCK                                    120 lines  29-JAN-1997 15:41
--------------------------------------------------------------------------------
This was sent to me without the source but it is interesting reading
    
    If they had only used an on-board Alpha system then.....
                                                            
SUBJECT: MINOR SOFTWARE BUG

 It took the European Space Agency 10 years and $7 billion to produce
 Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites
into orbit with each launch and intended to give Europe overwhelming
supremacy in the commercial space business.

 All it took to explode that rocket less than a minute into its maiden
voyage last June, scattering fiery rubble across the mangrove swamps of
French Guiana, was a small computer program trying to stuff a 64-bit number
into a 16-bit space.

One bug, one crash. Of all the careless lines of code recorded in the annals
of computer science, this one may stand as the most devastatingly
 efficient. From interviews with rocketry experts and an analysis prepared
for the space agency, a clear path from an arithmetic error to total
destruction emerges.

 To play the tape backward:
 At 39 seconds after launch, as the rocket reached an altitude of two and a
half miles, a self-destruct mechanism finished off Ariane 5, along with its
payload of four expensive and uninsured scientific satellites.

 Self-destruction was triggered automatically because aerodynamic forces
were ripping the boosters from the rocket. This disintegration had begun
instantaneously when the spacecraft swerved off course under the pressure of
the three powerful nozzles in its boosters and main engine. The rocket was
making an abrupt course correction that was not needed, compensating for a
wrong turn that had not taken place.

 Steering was controlled by the on-board computer, which mistakenly thought
 the rocket needed a course change because of numbers coming from the
inertial guidance system. That device uses gyroscopes and accelerometers to
track motion. The numbers looked like flight data -- bizarre and
impossible flight data -- but were actually a diagnostic error message.

 The guidance system had in fact shut down. This shutdown occurred 36.7
 seconds after launch, when the guidance system's own computer tried to
convert one piece of data -- the sideways velocity of the rocket -- from a
64-bit format to a 16-bit format. The number was too big, and an overflow
 error resulted.

 When the guidance system shut down, it passed control to an identical,
 redundant unit, which was there to provide backup in case of just such a
failure. But the second unit had failed in the identical manner a few
milliseconds before. It was running the same software.

 This bug belongs to a species that has existed since the first computer
programmers realized they could store numbers as sequences of bits, atoms of
data, ones and zeroes: 1001010001101001. . . . A bug like this might crash a
spreadsheet or word processor on a bad day.

 Ordinarily, though, when a program converts data from one form to another,
 the conversions are protected by extra lines of code that watch for errors
 and recover gracefully. Indeed, many of the data conversions in the
guidance system's programming included such protection.

 But in this case, the programmers had decided that this particular velocity
figure would never be large enough to cause trouble. After all,  it never
had been before. Unluckily, Ariane 5 was a faster rocket than
 Ariane 4. One extra absurdity: the calculation containing the bug, which
shut down the guidance system, which confused the on-board computer, which
forced the rocket off course, actually served no purpose once the rocket
 was in the air. Its only function was to align the system before launch.

 So it should have been turned off. But engineers chose long ago, in an
earlier version of the Ariane, to leave this function running for the first
40 seconds of flight -- a "special feature" meant to make it easy to
 restart the system in the event of a brief hold in the countdown.

 The Europeans hope to launch a new Ariane 5 next spring, this time with a
newly designated "software architect" who will oversee a process of more
intensive and, they hope, realistic ground simulation.
Simulation is the great hope of software debuggers everywhere, though it
 can never anticipate every feature of real life. "Very tiny details can
have terrible consequences," says Jacques Durand, head of the project, in
Paris. "That's not surprising, especially in a complex software system  such
as this is."

 These days, we have complex software systems everywhere. We have them in
our dishwashers and in our wristwatches, though they're not quite so
mission-critical. We have computers in our cars -- from 15 to 50
microprocessors, depending how you count: in the engine, the transmission,
 the suspensions, the steering, the brakes and every other major subsystem.
 Each runs its own software, thoroughly tested, simulated and debugged, no
doubt.

 Bill Powers, vice president for research at Ford, says that cars' computing
power is increasingly devoted not just to actual control but to diagnostics
and contingency planning -- "Should I abort the mission, and if I abort,
where would I go?" he says. "We also have what's called a  limp-home
strategy." That is, in the worst case, the car is supposed to behave more or
less normally, like a car of the pre-computer era, instead of, say, taking
it upon itself to swerve into the nearest tree.

 The European investigators chose not to single out any particular
contractor or department for blame. "A decision was taken," they wrote.  "It
was not analyzed or fully understood." And "the possible implications of
allowing it to continue to function during flight were not realized."  They
did not attempt to calculate how much time or money was saved by omitting
the standard error-protection code.

 "The board wishes to point out," they added, with the magnificent blandness
of many official accident reports, "that software is an expression of a
highly detailed design and does not fail in the same sense
as a mechanical system." No. It fails in a different sense.  Software built
up over years from millions of lines of code, branching and unfolding and
intertwining, comes to behave more like an organism than a
machine.

 "There is no life today without software," says Frank Lanza, an executive
 vice president of the American rocket maker Lockheed Martin.  "The world
 would probably just collapse." Fortunately, he points out, really important
software has a reliability of 99.9999999 percent. At least, until it
doesn't.
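
As a rough illustration of the "extra lines of code that watch for errors"
the article mentions, here is a minimal C sketch of a guarded conversion from
a 64-bit floating-point value to a 16-bit signed integer. The function names
and the saturating fallback are assumptions made up for this example; this is
not the Ariane flight code (which was not written in C) and not the fix ESA
adopted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a 64-bit double to a 16-bit signed integer
 * only when the value lies within [INT16_MIN, INT16_MAX].
 * Returns false and leaves *out untouched if the value is out of range
 * or is NaN. */
bool to_int16_checked(double value, int16_t *out)
{
    assert(out != NULL);

    /* Written as a negated in-range test so that NaN, which fails every
     * comparison, is rejected along with out-of-range values. */
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;
    }

    /* The cast is now well defined: it truncates toward zero and the
     * result is known to fit in 16 bits. */
    *out = (int16_t)value;
    return true;
}

/* Hypothetical recovery policy: clamp to the nearest representable value
 * instead of failing. Whether clamping is acceptable depends entirely on
 * what the number is used for. */
int16_t to_int16_saturating(double value)
{
    int16_t result;

    if (to_int16_checked(value, &result)) {
        return result;
    }
    if (value > 0.0) {
        return INT16_MAX;
    }
    if (value < 0.0) {
        return INT16_MIN;
    }
    return 0; /* NaN: neither positive nor negative */
}
```

Without the range test, converting an out-of-range double to a 16-bit integer
type is undefined behavior in C, so the check has to come before the cast,
not after it.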


             <<< HUMANE::DISK$SCSI:[NOTES$LIBRARY]DIGITAL.NOTE;1 >>>
                        -< The Digital way of working >-
================================================================================
Note 5111.4        From the loss of a nail, a battle was lost             4 of 7
DECWET::FARLEE "Insufficient Virtual um...er...."    23 lines  29-JAN-1997 18:00
--------------------------------------------------------------------------------
You can get the full, original Ariane5 report from ESA at:

http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

There are many lessons for us in this story, and many of them fall
into two corners:

Arrogance can be dangerous:
	The software philosophy was that systemic software failures were
	not a possibility.  If a component failed, it must be a physical 
	failure, thus the only response is to shut it down and switch to
	the alternate.  This is useless in the case of a software bug which
	will faithfully fail on both processors.

Cutting corners in testing can also be dangerous:
	Ariane5 re-used Ariane4 software.  It worked on Ariane4, right?
	Unfortunately, it was deemed to be "too expensive, too much trouble"
	to actually do an end-to-end simulation with anticipated Ariane5
	flight data.  So the software was NEVER tested against an Ariane5
	flight profile...  Until after the accident...  At that time, the
	bug was faithfully demonstrated, with exactly the same results as in
	the real flight.  This testing would probably have been a lot cheaper
	than the rocket and satellite that they lost...

2078.1. "The Uneven Migration to 64-bit UNIX" by D.H. Brown Associates, Inc. -- DECC::SULLIVAN (Jeff Sullivan) Fri Feb 21 1997 13:42 (24 lines)

There is a good analysis of 64-bit UNIX operating systems online by D.H. Brown 
Associates, Inc. 

See http://www-unix.zk3.dec.com/digital_unix/v4/64-bit/migration/contents.htm

Also, other interesting Digital competitive info at
    http://www-unix.zk3.dec.com/www/competition.html

Quoting from the D.H. Brown conclusion:
"
Digital UNIX and 64-bit SGI IRIX 6.x provide full 64-bit capabilities today,
while IBM's AIX 4.1, HP's HP-UX 10.1, and Sun's Solaris 2.5 are only just
beginning the ramp to full 64-bit capability -- a move expected to last 12 to 24
months. SGI attempts to reap the best of both worlds by supporting 32-bit and
64-bit worlds simultaneously -- a feat all other vendors will emulate. While SGI
alone today provides strong backwards compatibility with 32-bit systems, Digital
avoids many migration-related pitfalls since its Digital UNIX operating system
and all its applications were written or ported with a 64-bit
environment in mind. Thus, Digital already incurred the cost of adopting a
64-bit environment in going to the faster Alpha chip. Not incidental, for those
end-users moving early on inevitable 64-bit porting, Digital provides not only
tools, but surrounding services and expertise backed by porting centers
dedicated to assisting customers.
"