T.R | Title | User | Personal Name | Date | Lines |
---|
2203.1 | This problem fixed in 1.4 | STAR::SWEENEY | | Wed Apr 02 1997 09:54 | 5 |
|
The time change problem is corrected in version 1.4 of DCE for OpenVMS.
You should be aware that V1.3A is not a supported version of DCE.
Dave
|
2203.2 | Is it really fixed? | ESME::SPENCE | Bugs? You mean insects? | Thu Apr 03 1997 05:22 | 19 |
| I've got the same problem with DCE V1.4 and OpenVMS/AXP V6.2.
Looking in DCE$SPECIFIC:[VAR.ADM.TIME]DCE$DTSD.OUT:
Fatal error at line 782 in file DECW$DCERESD:[DTS.SRC]TIMERS.C;1
%CMA-F-EXCCOP, exception raised; VMS condition code follows
-SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=8059457C, PS=0000001B
The image identification information of SYS$SYSTEM:DTSD.EXE is:
image name:"DCE$DTSD"
image file identification: "DCE V1.4-961030"
link date/time: 30-OCT-1996 14:24:11.47
linker identification: "T10-58"
This has broken our internal DCE cell; I'd really value any other ideas
as to how we can fix it.
Thanks,
Cameron
|
2203.3 | Is this the real culprit? | ESME::SPENCE | Bugs? You mean insects? | Thu Apr 03 1997 08:03 | 16 |
| As I said in .-1, I'm running DCE V1.4, and I had the same problem.
I'd also previously read the note concerning the problem due to hit DCE
on 19th May (see note 2132 in this conference), and wondered
if it could be related. So, I installed the patch for that particular
problem, and guess what?
The problem with DTSD went away - it doesn't crash anymore.
This would tend to suggest to me that this 'timebomb' in OpenVMS has,
in fact, 'exploded' NOW (at least, it did on my system), not next month.
Perhaps someone from DCE engineering might like to comment. If my
hypothesis is true, then we need to get this patch kit to our DCE/OpenVMS
customers rather more urgently than we previously thought...
- Cameron
|
2203.4 | Hmmm... | STAR::SWEENEY | | Thu Apr 03 1997 10:22 | 14 |
|
I believe if your hypothesis were true, I would have been inundated with cases
from Europe with this error. Any other European field engineers encounter this
problem?
In .1 it states "he didn't change the time which meant that he couldn't start
dce". Did you do anything else along with installing the patch? So you mean
DCE was not running when the time change occurred?
If DCE was not running, then in addition to changing the time you need to
update the timezone configuration as described in section 18.2 of the DCE for
OpenVMS product guide. Please let me know if this fixes the problem.
Dave
|
2203.5 | here's one more site | COMICS::HOWLAND | | Thu Apr 03 1997 17:15 | 6 |
| Hi Dave,
I can confirm this on at least one other site in the UK.
They could not start dtsd after the time change.
Systems already running the "May 19th" patch were not affected;
installing the patch cleared the problem on affected nodes.
Graham
|
2203.6 | Thanks Graham... | STAR::SWEENEY | | Thu Apr 03 1997 17:23 | 4 |
|
Thanks for confirming the hypothesis. I believe now it can be deemed a theory.
Dave
|
2203.7 | pthread_cond_timedwait set to expire after 19-May | STAR::SWEENEY | | Thu Apr 03 1997 17:56 | 5 |
|
This appears to be caused by a pthread_cond_timedwait being set to expire after
19-May-1997.
Dave
|
2203.8 | Um...not so fast... | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 03 1997 19:37 | 32 |
| Folks,
This is not a "10,000 Day" (i.e., 19 May 1997) problem. It's wonderful that the
patch for that problem -appears- to address this problem, but I expect that it
is merely masking it.
From .2:
SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=8059457C, PS=0000001B
This is the real symptom of the problem. (The fact that it's chained to the
"CMA-F-EXCCOP" primary is inconsequential!!) There is a memory corruptor in
this application which is presumably corrupting the return address in some frame
on some thread's stack (resulting in the thread attempting to execute something
which is not code...).
Presumably, installing the 10,000 Day ECO kit changed the environment enough to
make the problem seem to "go away". Perhaps, with new images of slightly
different sizes the memory layout has changed so that now the corruption isn't
fatal (i.e., maybe it's now benign or maybe it's simply going unnoticed at the
moment!). Or, possibly, the new code paths have changed the execution order so
that a race condition which has resulted in the corruption no longer occurs.
Either way, there is a bug somewhere in the application, and there is no reason
to believe that the 10,000 Day ECO actually fixes it.
Sorry,
Webb Scales
DECthreads
|
2203.9 | I still doubt it. | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 03 1997 19:40 | 9 |
| .7> This appears to be caused by a pthread_cond_timedwait being set to expire
.7> after 19-May-1997.
Dave, is there really reason to believe that DTSD would be setting timeouts more
than a month into the future? (Regardless, I don't see how you get from
DECthreads returning an EINVAL status to the process taking an OPCCUS trap!)
Webb
|
2203.10 | Not too slow. If we err, let's err on the side of protecting our customers | STAR::SWEENEY | | Thu Apr 03 1997 21:11 | 62 |
|
Webb,
To try to address Webb's concern:
Can any field engineer acknowledge that DCE systems without the patch functioned
properly after the time change? All my systems have had the patch installed for
quite some time. Can you jump in here, Marco, Bill?
To try to address my concern:
As you can see in the initial notes I did not originally think this problem was
related to the 10,000 day limit patch. I believe the confirmation from two
additional customers is reason enough to take action informing other customers
how to work around the problem if it occurs. So what do you propose? Are you
proposing we not inform our customers and hope we come up with another
workaround and a patch kit tomorrow? Or are you saying the problem will not
occur? According to our field engineers we have a solution, and it's already
gone to many of our production customer sites. You better believe there are
customers out there who will put this patch on if they encounter the problem.
After it "alleviates" the problem, they will ask why we did not recommend
putting the patch on before the time change.
Even if the patch "masks" the problem, a workaround IS a fix to a large customer
production site that risks losing bundles of money when the application does not
work.
-----
DCE is a rather large accumulation of code; it's hard to know the internals of
all its components. As DTS is a persistent process, I understand why it would
set timers to expire far out in the future. For example, it will set a timer to
adjust the time zone differential when daylight savings time expires.
From .8:
>Presumably, installing the 10,000 Day ECO kit changed the environment enough to
>make the problem seem to "go away". Perhaps, with new images of slightly
>different sizes the memory layout has changed so that now the corruption isn't
>fatal (i.e., maybe it's now benign or maybe it's simply going unnoticed at the
>moment!). Or, possibly, the new code paths have changed the execution order so
>that a race condition which has resulted in the corruption no longer occurs.
I do not know of anyone with internals knowledge of DTS who could point to
suspicious sections of code where your analysis may apply. Why did the exact
same DCE code not exhibit the problem last year? I have heard this exact same
analysis concerning other problems encountered with DCE's use of threads in the
past. Excuse my skepticism, but this analysis has not rung true in many cases.
From .8:
>Either way, there is a bug somewhere in the application, and there is no reason
>to believe that the 10,000 Day ECO actually fixes it.
Or there is a bug in something the application uses. So you do not believe
Graham that the patch fixed the problem?
Dave
|
2203.11 | Better safe than sorry, IMHO | CSC32::J_MORTON | O8-OO-2b || ! 2b | Thu Apr 03 1997 22:08 | 18 |
| Hi,
While the DECthreads engineer may well be correct that the patch isn't
the "fix" to the problem, I think Dave's concern for our credibility
is valid. Can we risk that customers will have the problem while we
have a "workaround" on the shelf ready to go?
If we find a site with the problem that will allow us to do further
analysis possibly we can provide a real "fix" at a later date.
Also, given that the 10,000 day ECO will be necessary soon anyway, it
certainly won't hurt for the customers to install it early if they
haven't already.
.02,
Jim
CSC/CS
|
2203.12 | TDCE may be masking the 10,000 day problem... | STAR::SWEENEY | | Thu Apr 03 1997 23:09 | 24 |
|
Below is the "offending" DCE code. Any error besides EAGAIN is
bugchecked. My review of the code generated by the bugcheck() call
leads me to believe the "real" error, invalid argument, is not bubbling
up the stack. Looks like it simply prints the line number and calls
abort(). I'll have to look at it with fresher eyes tomorrow. It has been
common in my experience with DCE that any unhandled condition ends up
being reported as ACCVIO or OPCDEC.
Dave
    if (pthread_cond_timedwait (&timerPtr->timerEventCond,
                                &timerPtr->mutex,
                                &expirationTime) < 0) {
        if (errno == EAGAIN) {          /* normal timer expiration */
            if (timerPtr->timerEvent == K_TIMER_EVENT_NULL)
            {
                timerPtr->timerEvent = K_TIMER_EVENT_EXPIRE;
            }
        }
        else
            BugCheck();                 /* << this is the line aborting */
    }
|
2203.13 | OK, now I'm satisfied that this IS a 10,000 day problem. | WTFN::SCALES | Despair is appropriate and inevitable. | Fri Apr 04 1997 13:20 | 83 |
| .12> Looks like it simply prints the line number and calls abort().
OK, there's the missing piece of data! As I recall, the result of calling
abort() -is- the OPCCUS trap (which is REALLY gross, but there you are).
.10> For example, it will set a timer to adjust the time zone differential
.10> when daylight savings time expires.
Yes, this _would_ be subject to the 10,000 day problem, resulting in the call
to abort() above, and _would_ be fixed by applying the ECO. So, I think we
are all squared away, now.
Nevertheless, I'd like to make clear my position on this situation. I never
indicated that anyone should not apply the ECO now or in the future. What I
said was (in the absence of knowing that DTSD was calling abort()) the
reported symptoms did not match the known symptoms of the problem which the
ECO addresses. Thus, while applying the patch SEEMED to address the problem,
there was no sound basis for believing it fixed the problem. Thus, it was
critical that we understood this and set our customer's expectations
accordingly: that the patch would seem to provide a workaround for this
problem but that it was possible that other problems (data corruptions!)
might result. If any such problems HAD cropped up after we claimed that the
ECO was a "fix", it would have severely compromised Digital's credibility!!
Furthermore, without the patch, the problem seemed to be reasonably
reproducible. That is, without the patch, it might have been possible for
Digital engineers to isolate the source of the apparent corruption. Thus,
installing the ECO on -all- systems (internal as well as customers') would
have been a _disservice_ to our customers, because it might have made it
nearly impossible for Digital to locate and fix the real problem.
.10> I do not know of anyone with internals knowledge of DTS who could point to
.10> suspicious sections of code where your analysis may apply. Why did the exact
.10> same DCE code not exhibit the problem last year. I have heard this exact same
.10> analysis concerning other problems encountered with DCE's use of threads in the
.10> past. Excuse my skepticism, but this analysis has not rung true in many cases.
*grin* Just because the DCE code hasn't changed doesn't mean that nothing
else has changed. Other shared images activated in the process might have
changed size. This changes the virtual addresses at which allocations end
up, potentially changing the location and effects of a data corruption. This
also changes code path lengths. This could affect when certain events (such
as locking or unlocking a mutex) occur relative to events in other threads.
Changes in memory layout change the pagefault patterns (as do accesses to
sharable images from other processes on the system) which also affect timing.
Finally, simple things like system and network load and I/O latencies alter
when a thread gets to run and when its timeslices occur. Thus, if there are
*any* synchronization bugs ("race conditions") in your multithreaded code,
they could remain hidden throughout all of your testing and even long
stretches of deployment at multiple customer sites, but, when the right
factors line up, the race suddenly goes the other way and a data corruption
results (which can look like an ACCVIO, or an OPCDEC, or, yes, an OPCCUS,
depending on the exact nature of the corruption).
In multithreaded programming, you are subject to all of the same classes of
errors that you encounter in sequential programming and, in -addition- you're
subject to errors in synchronization in which timing is a factor. In
sequential programming, it is possible to test all possible code paths;
however, in multithreaded programming (and other forms of asynchronous
programming) it is simply not possible to test the code completely, because
you cannot control the timing factor from outside the code.
Dave, I'm glad that you and no one that you know with internals knowledge of
DTS have experienced this class of problem. That speaks highly of your code
(or your luck, but hopefully it's the former ;-). Nevertheless, we on the
DECthreads team -do- encounter this sort of thing in our consumers' code from
time to time, and it's always a pain for us and for them. For them because
it's very hard to find (it almost always has to be found via code-review),
and for us because they always say "well, the exact same code did not exhibit
the problem until we [changed something external to it]."
In closing, I'd like to apologise for having had to push back when this
problem did turn out to be a 10,000 day problem after all (everything would
have been clearer if abort() resulted in an SS$_ABORT instead of
SS$_OPCCUS!!). Nevertheless, it would have been inappropriate for Digital to
claim that the ECO "fixed" the problem until we understood what the problem
actually was and how the ECO fixed it. And, doing so would have been a
disservice to our customers and possibly damaging to Digital.
Webb
|
2203.14 | FLASH article that we sent to customers. | CSC32::R_WILLIAMS | | Fri Apr 04 1997 14:32 | 42 |
|
[DCE] DCE$DTSD Fails to Start After Daylight Savings Time
Any party granted access to the following copyrighted information
(protected under Federal Copyright Laws), pursuant to a duly executed
Digital Service Agreement may, under the terms of such agreement copy
all or selected portions of this information for internal use and
distribution only. No other copying or distribution for any other
purpose is authorized.
Copyright (c) Digital Equipment Corporation 1997. All rights reserved.
PRODUCT: DIGITAL Distributed Computing Environment (DCE) for OpenVMS,
Versions 1.3A through 1.4
OP/SYS: DIGITAL OpenVMS VAX, Versions 5.5-2 through 7.0
DIGITAL OpenVMS Alpha, Versions 6.1 through 7.0
SOURCE: Digital Equipment Corporation
OVERVIEW:
OpenVMS Sustaining engineering strongly recommends the installation of
VAXLIBR06_070 and ALPLIBR05_070 patch kits prior to the daylight savings
time change on the weekend of April 5 and 6, 1997.
INFORMATION:
After the daylight savings time change on the weekend of April 5 and 6,
the DCE time service daemon, DCE$DTSD, may fail to start. The following
error is written to the DCE$DTSD.OUT file:
"Fatal error at line 782 in file decw$dceresd:[dts.src]timers.c;1"
We believe this problem is caused by a pthread_cond_timedwait being set
to expire after 19-May-1997. Refer to the "OpenVMS Delta-Time Restrictions
and 19-May-1997" Blitz documentation for a description of the 19-May-1997
time problem. Additional information will be provided in future updates
concerning the OpenVMS Delta-Time Restrictions issue.
|
2203.15 | Thanks for your input | STAR::SWEENEY | | Fri Apr 04 1997 14:56 | 32 |
|
I think it is good that you pushed back as your input helped me identify the
10,000 day limit as the real cause of the problem. The DTS code, and the DCE
code in general, does a poor job of supplying the real condition that causes
image terminations. I knew about this limitation of the code; you did not.
>Dave, I'm glad that you and no one that you know with internals knowledge of
>DTS have experienced this class of problem. That speaks highly of your code
>(or your luck, but hopefully it's the former ;-). Nevertheless, we on the
>DECthreads team -do- encounter this sort of thing in our consumers' code from
>time to time, and it's always a pain for us and for them. For them because
>it's very hard to find (it almost always has to be found via code-review),
>and for us because they always say "well, the exact same code did not exhibit
>the problem until we [changed something external to it]."
Well I can't take credit for writing the DCE DTS code. I'm just one of the
lucky guys who gets to support it!
>In closing, I'd like to apologise for having had to push back when this
>problem did turn out to be a 10,000 day problem after all (everything would
>have been clearer if abort() resulted in an SS$_ABORT instead of
>SS$_OPCCUS!!). Nevertheless, it would have been inappropriate for Digital to
>claim that the ECO "fixed" the problem until we understood what the problem
>actually was and how the ECO fixed it. And, doing so would have been a
>disservice to our customers and possibly damaging to Digital.
Yes, I agree it would have been a disservice if the patch did not correct the
problem. However given the time criticality of the issue, I was willing to take
action without an absolute understanding of the problem.
dave
|