T.R | Title | User | Personal Name | Date | Lines |
---|
5111.1 | wait til Jan. 1 2000... | CSC32::S_WASKEWICZ | | Wed Jan 29 1997 17:08 | 2 |
|
The year 2000 will REALLY bring em out of the woodwork...
|
5111.2 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Wed Jan 29 1997 17:13 | 11 |
| Take all your money out of the bank well before 2000, and wait a few
months before depositing it again...
I haven't really followed the Ariane story - what I initially read here
(in Munich) said it _was_ a software error. Whether it was like
described in .0 I don't know though I have the impression whoever wrote
it used his/her journalistic freedom.
Ariane 5 is different from Ariane 4 and apparently they saved some
money in not recertifying all the software.
|
5111.3 | | DECWET::LYON | Bob Lyon, DECmessageQ Engineering | Wed Jan 29 1997 17:57 | 18 |
| re: .2
> I haven't really followed the Ariane story - what I initially read here
> (in Munich) said it _was_ a software error. Whether it was like
> described in .0 I don't know though I have the impression whoever wrote
> it used his/her journalistic freedom.
The full report can be read at:
http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
> Ariane 5 is different from Ariane 4 and apparently they saved some
> money in not recertifying all the software.
One wonders if they saved more than the several hundred million that was lost
when Ariane 501 went boom ...
Bob
|
5111.4 | | DECWET::FARLEE | Insufficient Virtual um...er.... | Wed Jan 29 1997 18:00 | 23 |
| You can get the full, original Ariane5 report from ESA at:
http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
There are many lessons for us in this story, and many of them fall
into two corners:
Arrogance can be dangerous:
The software philosophy was that systemic software failures were
not a possibility. If a component failed, it must be a physical
failure, thus the only response is to shut it down and switch to
the alternate. This is useless in the case of a software bug which
will faithfully fail on both processors.
Cutting corners in testing can also be dangerous:
Ariane5 re-used Ariane4 software. It worked on Ariane4, right?
Unfortunately, it was deemed to be "too expensive, too much trouble"
to actually do an end-to-end simulation with anticipated Ariane5
flight data. So the software was NEVER tested against an Ariane5
flight profile... Until after the accident... At that time, the
bug was faithfully demonstrated, with exactly the same results as in
the real flight. This testing would probably have been a lot cheaper
than the rocket and satellite that they lost...
|
5111.5 | ... and did you hear John Wayne died? | vaxcpu.zko.dec.com::michaud | Jeff Michaud - ObjectBroker | Wed Jan 29 1997 21:16 | 3 |
| From the date on the story pointed to in the URL this looks like
old news (the rocket exploded June 4, 1996, and the inquiry board
report is dated July 19, 1996).
|
5111.6 | Still a good story | PERFOM::HENNING | | Thu Jan 30 1997 04:21 | 6 |
| Even if 6-month-old-news, it's still a very good story to rub
into the noses of all who could use a dose of humility.
Sign me
/software_writer
|
5111.7 | good reference... | DOODL1::FISCHER | | Thu Jan 30 1997 08:24 | 20 |
| There is a more thorough treatment of this bug and the issues involved
with the software sharing between Ariane 4 and 5 in Aviation Week
Magazine (from last summer). As usual, a problem like this is more
complicated than this article implies. The issue of the alignment
software running after liftoff was a real design constraint, not a
simple "feature": it allowed for flexibility in the countdown process
(the sequencing of launch events is extremely complicated, especially
around handling and restarting holds).
From what I recall, analysis showed the real problem was introduced when,
in analyzing the data sets being sent to this routine for conversion,
engineers knowingly used Ariane 4 launch profile data instead of
Ariane 5 (similar, though slightly different) and then underestimated
the risk of the real flight values being different from these
"estimated" values. The incorrect risk assessment then drove the
decision not to add an error handler for this conversion.
The picture in the Aviation Week article underscores the seriousness
of such a "minor" oversight.
|
5111.8 | That launch attempt was classified an experimental launch... | NETCAD::BATTERSBY | | Thu Jan 30 1997 12:38 | 14 |
| One thing has not been mentioned here (unless it is buried in
the full report); it was not part of the report posted in the
base note.
That Ariane 5 flight was the first flight in the Ariane 5 series.
As such it was deemed an experimental flight.
The satellite payloads got a free ride on this launch because
it was deemed an experimental flight instead of a full operational
production flight. The payload owners, knowing the launch success
of the Ariane 4 series, presumed a (relatively speaking) low risk
of a launch failure. Little did they know, nor would they have expected,
that a software bug and the lack of pre-flight simulation would manifest
themselves in such a totally destructive failure mode.
Bob
|
5111.9 | see my personal name | DECWET::ONO | Software doesn't break-it comes broken | Thu Jan 30 1997 12:47 | 0 |
5111.10 | since this topic is a soapbox to begin with :-) | vaxcpu.zko.dec.com::michaud | Jeff Michaud - ObjectBroker | Thu Jan 30 1997 13:54 | 25 |
| > Even if 6-month-old-news, it's still a very good story to rub
> into the noses of all who could use a dose of humility.
and what does this story have to do with Digital? :-)
The author of the base note implied that if they used an Alpha
that this wouldn't have happened. But that's a stretch. While
Alpha is considered a 64-bit processor, the problem of trying
to stuff a 64-bit quantity into a 16-bit quantity will also
fail using an Alpha cpu. The problem was not that the processor
used couldn't deal with 64-bit numbers (it was dealing with them
to begin with), it's that the software assumed (and correctly
it sounds like for what it was originally designed for, the model
4 rocket) it could, using C language lingo, cast the 64-bit number
to a 16-bit number without losing any resolution. For example, on
Digital UNIX a long is 64 bits and a short is 16 bits, and they
tried to do this:
long l = 123456;
short s;
s = l;
even on an all-powerful Alpha, this will not work (well it
will work, but the value of s won't be 123456).
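To spell out what "won't be 123456" means: on a typical two's-complement
machine only the low 16 bits survive, so s would usually end up as -7616.
A minimal sketch of a checked conversion follows. The helper name
narrow_to_short is made up for illustration and is not anything from the
Ariane code (which was not written in C); it only shows the idea of
testing the range before narrowing.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: store `value` in `*out` only if it fits in a short.
 * Returns true on success; on failure `*out` is left untouched and the
 * caller has to decide what to do about the out-of-range value. */
bool narrow_to_short(long value, short *out)
{
    if (value < SHRT_MIN || value > SHRT_MAX) {
        return false;        /* would not fit: refuse instead of truncating */
    }
    *out = (short)value;     /* safe: the value is known to be in range */
    return true;
}
```

The check itself is cheap, but it forces the caller to have an answer for
the "false" case, which is the part that needs real design thought.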
|
5111.11 | | NETCAD::MORRISON | Bob M. LKG2-A/R5 226-7570 | Thu Jan 30 1997 17:32 | 9 |
| > The satellite payloads got a free ride on this launch because
> it was deemed an experimental flight instead of a full operational
> production flight.
Interesting. I wonder if the payloads were uninsured because the under-
writers thought an "experimental" flight was too risky?
I am surprised that the software didn't have range checks for ALL numbers.
Is there an issue that doing so would make the software too large to fit in
memory? Or too slow?
|
5111.12 | Many "but it works on ..." bugs are really source code problems... | DECC::SULLIVAN | Jeff Sullivan | Thu Jan 30 1997 18:23 | 9 |
| In the compiler and Digital UNIX groups, we get a lot of code that happens to
work on other vendors' UNIXes. Many times, the problem can be traced back to
stuffing a 64-bit (pointer) value into a 32-bit variable. On other machines,
pointers are generally not 64-bit, so this is not a problem there. The code is
broken, but just happens to work.
If we only had $7 billion for each one of those we've seen...
-Jeff
|
5111.13 | | DECWET::LYON | Bob Lyon, DECmessageQ Engineering | Thu Jan 30 1997 18:23 | 12 |
| > Interesting. I wonder if the payloads were uninsured because the under-
>writers thought an "experimental" flight was too risky?
Yes, but that's not all that uncommon even for "routine" flights. The
underwriting costs are phenomenal.
> I am surprised that the software didn't have range checks for ALL numbers.
>Is there an issue that doing so would make the software too large to fit in
>memory? Or too slow?
It would have used up more CPU than the design parameters allowed (85% maximum
utilization).
|
5111.14 | Quality assurance needed seriously | 33102::JAUNG | Dave Bowers @WHO | Fri Jan 31 1997 09:15 | 19 |
| ref .0
Some 30 or more years ago, the US launched a satellite into orbit. The
mission failed after the satellite had completed one orbit. Engineers
went back to check every part and the simulation code. They found
that the compiler had not flagged the following FORTRAN statement:
DO 100 I=1.3
...
The simulation ran flawlessly, but the mission lasted only one orbit.
Had the code been corrected to:
DO 100 I=1,3
the design defects would have been found later, when I=3, and the
mission would not have failed.
|
5111.15 | | QUARK::LIONEL | Free advice is worth every cent | Fri Jan 31 1997 12:51 | 6 |
| Re: .14
A popular story, and one also attributed to the failure of a Venus mission,
but it never happened.
Steve
|
5111.16 | | DANGER::ARRIGHI | Life is an else-if construct | Fri Jan 31 1997 14:00 | 7 |
| re .15
And since we're talking FORTRAN, I'm sure you know the facts. :)
That language is like an old lover that you never quite get over.
Tony (certainly I=1.3 would have produced a compile error)
|
5111.17 | | AXEL::FOLEY | http://axel.zko.dec.com | Fri Jan 31 1997 14:17 | 7 |
| RE: .16
Depends on which compiler.. You can write bad FORTRAN in
any language. :)
mike
|
5111.18 | | DECWET::ONO | Software doesn't break-it comes broken | Fri Jan 31 1997 14:49 | 10 |
| re: .16
Shouldn't produce an error. Fortran ignores whitespace on a line
(or at least it used to), so you end up with
DO100I = 1.3
It looks strange to see DO100I in the variable list.
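For anyone who wants to see the effect without a Fortran compiler, here
is a rough sketch in C of the blank-stripping step and the resulting
ambiguity. It is a toy, not how any real compiler is written, and the
function names strip_blanks and looks_like_do_loop are invented for the
illustration. The rule of thumb it encodes: after the blanks are gone, a
statement starting with DO is only a loop if a comma follows the "="
outside any parentheses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Copy `src` into `dst` without blanks, the way fixed-form Fortran
 * effectively treats a statement. Output is always NUL-terminated
 * (as long as dst_size is at least 1). */
void strip_blanks(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    if (dst_size == 0) {
        return;
    }
    for (; *src != '\0' && n + 1 < dst_size; ++src) {
        if (*src != ' ') {
            dst[n++] = *src;
        }
    }
    dst[n] = '\0';
}

/* Crude classifier: true if the statement reads as a DO loop, false if
 * it is something else (for example an assignment to a variable whose
 * name happens to start with DO). */
bool looks_like_do_loop(const char *stmt)
{
    char buf[80];
    int depth = 0;
    bool seen_equals = false;
    const char *p;

    strip_blanks(stmt, buf, sizeof buf);
    if (buf[0] != 'D' || buf[1] != 'O') {
        return false;
    }
    for (p = buf + 2; *p != '\0'; ++p) {
        if (*p == '(') {
            depth++;
        } else if (*p == ')') {
            depth--;
        } else if (*p == '=' && depth == 0) {
            seen_equals = true;
        } else if (*p == ',' && depth == 0 && seen_equals) {
            return true;     /* "DO label var = start, end" form */
        }
    }
    return false;            /* no comma after '=': plain assignment */
}
```

With this, "DO 100 I=1,3" is classified as a loop and "DO 100 I=1.3" is
not, which is exactly the one-character difference being discussed.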
Wes
|
5111.19 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Fri Jan 31 1997 14:52 | 32 |
| .16>Tony (certainly I=1.3 would have produced a compile error)
Nope! DO loop counter can either be real or integer.
A story here:
The technical inspection authority in Germany (TÜV) is responsible for
checking all aspects of security requirements in nuclear power plants.
It starts with verifying the calculations for the construction of the
security core element. TÜV had the program written and executed
by 3 different companies, in 3 different countries, in 3 different
programming languages. One day, the company in Germany reported
differences in their data. This was on a VAX-11/780. It turned out that
the FP processor had started generating random numbers instead of correct
FP results. This would not have been detected without the comparison
data. I was called in to verify that this VAX could do 1+1 correctly.
Or another one (or Next Unseen to skip):
A huge steam turbine in a power plant was started for the first time. The
FORTRAN program running in the PDP-11 controlling the beast had the
task of bringing this beast through devilish resonances on its way up to
test speed. Lots of sensors reported housing and shaft vibrations to
one FORTRAN program. Whenever vibrations increased, the program opened up
the steam intake valve to get past the resonance quickly. Due to
a sign error in one equation the program kept the turbine steady at
the first hot spot. Only an emergency steam release triggered by a human
being avoided a catastrophe. I was allowed to visit the cracks in the
building before it was rebuilt.
Enough of the stories...back to testing
Guenther
|
5111.20 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Fri Jan 31 1997 17:13 | 33 |
| >Nope! DO loop counter can either be real or integer.
Günther, you should know better!
It's not a question of an integer or real DO loop counter - as someone
else said, spaces have no meaning in FORTRAN, and a variable beginning
with D normally wouldn't be an integer, so
DO 100 I=1.3
is perfectly legal FORTRAN (and doesn't initiate a loop). I haven't
written any FORTRAN for a looong time - I don't know whether a newer
compiler might warn saying "this is basically correct, but probably not
what you wanted to do". I haven't seen a C/C++ compiler either that
warns about something like
if (xxx = 0)
{
//blow up the nuclear plant
}
Whether the stories about this bug are urban legend I don't know...
I've heard them since my FORTRAN II times, and it's fairly long ago.
BTW, those of you who have tried to write parsers for these languages
(BASIC is another one where in the classic version whitespace was
meaningless) it's a pain in the a** - apparently someone thought 20-40
years ago it would be easier to parse, but it ain't.
And, if FORTRAN were invented in Germany, it would probably be called
FORMÜB or something... ;-)
|
5111.21 | Old space programming errors | DECCXX::AMARTIN | Alan H. Martin | Fri Jan 31 1997 18:22 | 53 |
| Re .20:
>I haven't seen a C/C++ compiler either that warns about something like
>
> if (xxx = 0)
> {
> //blow up the nuclear plant
> }
$ TYPE CCW.C
int status;
extern int GetRadarInfo(void);
extern void LaunchMissiles(void);
int main(void)
{
while (1) {
status = GetRadarInfo();
if (status = 1)
LaunchMissiles();
}
return 0;
}
$ CC/DECC CCW/WARN:ENABLE=CHECK
if (status = 1)
........^
%CC-W-CONTROLASSIGN, In this statement, the assignment expression "status=1" is
used as the controlling expression of an if, while or for statement.
At line number 9 in CCW.C.
...
$ CC/DECC CCW/WARN:ENABLE=CHECK/VERSION
DEC C V5.3-006 on OpenVMS VAX V6.2
$
> Whether the stories about this bug are urban legend I don't know...
> I've heard them since my FORTRAN II times, and it's fairly long ago.
The missing superscript bar which caused ground-commanded self-destruct of the
Venus probe "Mariner 1" during early ascent is discussed in
http://catless.ncl.ac.uk/Risks/5.66.html#subj1 .
Re .15
>A popular story, and one also attributed to the failure of a Venus mission,
>but it never happened.
The "DO 10 I=1.10" which caused orbit prediction errors during Project Mercury
is discussed by the NASA employee who found it in
http://www.op.net/docs/Computer-Folklore/mariner_bug .
/AHM
|
5111.22 | | COOKIE::FROEHLIN | Let's RAID the Internet! | Fri Jan 31 1997 18:57 | 5 |
| Right Ora...this "DO 100 I=1.3" is an assignment and not a DO loop
statement. The old FORTRAN trap.
Thanks
Guenther
|
5111.23 | | QUARK::LIONEL | Free advice is worth every cent | Fri Jan 31 1997 19:06 | 4 |
| In Fortran 90 free-form source, the compiler would reject the malformed
statement.
Steve
|
5111.24 | random rambling | vaxcpu.zko.dec.com::michaud | Jeff Michaud - ObjectBroker | Fri Jan 31 1997 19:23 | 25 |
| > I haven't seen a C/C++ compiler either that warns about something like
> if (xxx = 0)
> {
> //blow up the nuclear plant
> }
Also FWIW, I knew a group of programmers whose coding standard
for equality comparison tests was to transpose the arguments.
Ie.
if( 0 == xxx ) ....
this way if they accidentally used = instead of ==, like:
if( 0 = xxx ) ....
they'd get a compiler error (never mind a warning).
I personally however don't follow that coding standard as I find
it visually unappealing and harder to read when reading code
in English. Ex. "if xxx is equal to 0" I don't even need to
think about, but if the arguments to the comparison operator
are reversed it reads "if 0 is equal to xxx", which to me at
least is not natural sounding. It's like saying "32 years old is
john" instead of "john is 32 years old".
|
5111.25 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Sat Feb 01 1997 07:29 | 15 |
| re .21: Ok, ok, Alan, I admit I haven't used DEC C for ages.. FWIW,
even V4.2 of Visual C++ doesn't complain:
void foo(int i)
{
if(i=1)
{
i=0;
}
}
Compiling...
foo.cpp
foo.obj - 0 error(s), 0 warning(s)
|
5111.26 | | RUSURE::EDP | Always mount a scratch monkey. | Mon Feb 03 1997 09:05 | 17 |
| Re .11:
> I am surprised that the software didn't have range checks for ALL
> numbers.
There were range checks. The overflowing numbers were dutifully caught
and reported. That error message output appeared in place of the
expected output. It was interpreted as data, which caused the rocket
to veer off course. Since the rocket was off course, safety mechanisms
destroyed it.
-- edp
Public key fingerprint: 8e ad 63 61 ba 0c 26 86 32 0a 7d 28 db e7 6f 75
To find PGP, read note 2688.4 in Humane::IBMPC_Shareware.
|
5111.27 | They knew the theory but didn't have the equipment. | ULYSSE::sbudhcp23.sbu.vbe.dec.com::Mike | | Mon Feb 03 1997 10:02 | 13 |
| >> BTW, those of you who have tried to write parsers for these languages
>> (BASIC is another one where in the classic version whitespace was
>> meaningless) it's a pain in the a** - apparently someone thought 20-40
>> years ago it would be easier to parse, but it ain't.
The reasons for eliminating white space had little to do with the simplicity
of parsing, real or imagined. The problem was one of cost - in this case the
cost of storage, particularly temporary storage between passes on a
multi-pass compiler. Deleting all the white space could give a 20% decrease
in the space required.
Mike.
|
5111.28 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Mon Feb 03 1997 10:33 | 17 |
| re .27:
>The problem was one of cost - in this case the
>cost of storage, particularly temporary storage between passes on a
>multi-pass compiler. Deleting all the white space could give a 20% decrease
>in the space required.
I can buy that argument for interpreters (like BASIC usually) - when
the C64 was popular it was common to see long BASIC programs without a
single space in them... pretty difficult to read.
However, I don't see why a normal compiler would save any whitespace in
its intermediate code - whitespace is useful for tokenising the input,
but then you throw it away anyway. Also, even though spaces have (more
or less) no meaning in FORTRAN, it was customary to use them anyway, to
make the code more readable.
|
5111.29 | Token delimiters? Sheer luxury, lad! | ULYSSE::sbudhcp23.sbu.vbe.dec.com::Mike | | Mon Feb 03 1997 10:58 | 18 |
|
Re: .-1
One system I used allowed input from cards, paper tape or magnetic
tape and stripped the white space before writing to drum as an
intermediate file. The drum was about 100Kb. If the drum space
overflowed the compiler crashed.
It was possible to use it without drum by feeding the paper tape in
multiple times or load it to mag tape first but this would usually
exceed your allotted time on the machine: half an hour per day.
I don't think that this was untypical during the late 60s, early
70s.
Mike.
|
5111.30 | Gotta use Warning Level 4 | DECCXX::AMARTIN | Alan H. Martin | Mon Feb 03 1997 11:49 | 8 |
| Re .25:
--------------------Configuration: foo - Win32 Debug--------------------
Compiling...
foo.cxx
foo.cxx(3) : warning C4706: assignment within conditional expression
foo.obj - 0 error(s), 1 warning(s)
/AHM
|
5111.31 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Mon Feb 03 1997 13:55 | 2 |
| re .29: You had drums? Sheer luxury! :-)
|
5111.32 | | BIGUN::nessus.cao.dec.com::Mayne | Wake up, time to die | Tue Feb 04 1997 01:34 | 19 |
| In the good old days when BASIC was interpreted and possibly in ROM, and
compilers were for mainframes, BASIC was often tokenised on input to the
interpreter. Thus, the line
10 FOR I = 1 TO 10
woubd be stored as
[binary line number 10] [1 byte token FOR] I = 1 [1 byte token TO] 10
thus not only saving bytes, but running faster, since the interpreter's parsing
job was a lot simpler. The LIST command would expand the tokens, so you'd never
know it was happening. (Unless, like me, you wrote BASIC programs that read in
the tokenised versions and rewrote them.)
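A rough sketch of that tokenising step, in C rather than in the assembler
such interpreters were really written in. The token values 0x80 and 0x81
and the two-entry keyword table are made up for the example; every real
BASIC had its own table. The binary line number is left out to keep it
short.

```c
#include <stddef.h>
#include <string.h>

/* Made-up token values; real interpreters each had their own table. */
enum { TOK_FOR = 0x80, TOK_TO = 0x81 };

static const struct {
    const char   *word;
    unsigned char token;
} keywords[] = {
    { "FOR", TOK_FOR },
    { "TO",  TOK_TO  },
};

/* Tokenise the text of one BASIC line into `dst`.
 * Each keyword becomes a single byte with the high bit set; everything
 * else is copied through unchanged. Returns the number of bytes written. */
size_t tokenise(const char *src, unsigned char *dst, size_t dst_size)
{
    size_t n = 0;

    while (*src != '\0' && n < dst_size) {
        size_t k;
        int matched = 0;

        for (k = 0; k < sizeof keywords / sizeof keywords[0]; ++k) {
            size_t len = strlen(keywords[k].word);
            if (strncmp(src, keywords[k].word, len) == 0) {
                dst[n++] = keywords[k].token;   /* 1 byte for the keyword */
                src += len;
                matched = 1;
                break;
            }
        }
        if (!matched) {
            dst[n++] = (unsigned char)*src++;   /* ordinary character */
        }
    }
    return n;
}
```

For the text "FOR I = 1 TO 10" this turns 15 characters into 12 bytes.
Note that a matcher this naive will also find TO inside a longer name,
which is the kind of thing that made variable naming in those BASICs
interesting.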
(Wow, I had *two* 5¼" floppy disk drives in those days, because I was the one
writing the database software. 8-)
PJDM
|
5111.33 | JAVA brings back the good old days | STAR::jacobi.zko.dec.com::jacobi | Paul A. Jacobi - OpenVMS Systems Group | Tue Feb 04 1997 14:07 | 7 |
| >>> In the good old days when BASIC was interpreted
The "good old days" are back with a new name -- JAVA!
-Paul
|
5111.34 | | BHAJEE::JAERVINEN | Ora, the Old Rural Amateur | Tue Feb 04 1997 18:35 | 8 |
| re .32:
That was the case e.g. on the first Novas by Data General... but it
wasn't true for the C-64. (I've had more to do with BASIC than I
like...).
re .33: There's a huge difference between BASIC and Java...
|
5111.35 | | SKYLAB::FISHER | Gravity: Not just a good idea. It's the law! | Mon Feb 10 1997 16:13 | 34 |
| re .26:
> Re .11:
>
> > I am surprised that the software didn't have range checks for ALL
> > numbers.
>
> There were range checks. The overflowing numbers were dutifully caught
> and reported. That error message output appeared in place of the
> expected output. It was interpreted as data, which caused the rocket
> to veer off course. Since the rocket was off course, safety mechanisms
> destroyed it.
Not quite, according to my understanding of the Aviation Week articles cited
earlier. It was explicitly decided for many cases not to trap overflows and
underflows because of the analysis which showed that the only way such things
could happen was if there were a hardware failure. If there is a hardware
failure, there is a redundant processor to take over. Therefore, the right
thing to do if you get an overflow trap is to HALT the current processor to
ensure that the redundant processor takes over. The HALT put some data out on
the lines that was essentially an error code showing why the halt had occurred.
That all would have been fine except the problem was not a random h/w failure,
but a systematic s/w error. Microseconds after the first processor halted, the
second one did too, leaving the error code as the only data available to the
engine computer.
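The failure pattern can be shown with a toy model in C. This is only a
sketch of the logic as described above, not the flight software (which
was not written in C); the Unit type, the function names and the
diagnostic value are all invented for the illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Invented diagnostic value that a halted unit leaves on its output. */
enum { DIAG_OVERFLOW = -9999 };

typedef struct {
    bool halted;
    int  output;   /* flight data, or the diagnostic code once halted */
} Unit;

/* One processing step. An out-of-range value is assumed to mean broken
 * hardware, so the unit halts and leaves a diagnostic code behind. */
void unit_step(Unit *u, double value)
{
    if (u->halted) {
        return;
    }
    if (value > SHRT_MAX || value < SHRT_MIN) {
        u->halted = true;
        u->output = DIAG_OVERFLOW;
        return;
    }
    u->output = (short)value;   /* in range: normal conversion */
}

/* Feed the same value to both units, as identical software on redundant
 * hardware would see it. Returns true while at least one unit is alive. */
bool redundant_step(Unit *primary, Unit *backup, double value)
{
    unit_step(primary, value);
    unit_step(backup, value);
    return !primary->halted || !backup->halted;
}
```

Redundancy of this kind protects against one unit breaking at random. It
does nothing when both units run the same code on the same input, because
the value that halts the first one halts the second one as well.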
Another thing from the AvWeek article to add:
Not only was the code in question not used after liftoff; it was not used AT
ALL in the Ariane V. They decided not to remove it because they wanted to
change as few things as possible. (How many times have we all made decisions
like that?)
Burns
|
5111.36 | | RUSURE::EDP | Always mount a scratch monkey. | Wed Feb 12 1997 08:48 | 14 |
| Re .35:
> Not quite, according to my understanding of the Aviation Week
> articles cited earlier.
Your version is more accurate. The inquiry board report is at
http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html.
-- edp
Public key fingerprint: 8e ad 63 61 ba 0c 26 86 32 0a 7d 28 db e7 6f 75
To find PGP, read note 2688.4 in Humane::IBMPC_Shareware.
|