T.R | Title | User | Personal Name | Date | Lines |
---|
2203.1 | This problem fixed in 1.4 | STAR::SWEENEY | | Wed Apr 02 1997 09:54 | 5 |
|
The time change problem is corrected in version 1.4 of DCE for OpenVMS.
You should be aware that V1.3A is not a supported version of DCE.
Dave
|
2203.2 | Is it really fixed? | ESME::SPENCE | Bugs? You mean insects? | Thu Apr 03 1997 05:22 | 19 |
| I've got the same problem with DCE V1.4 and OpenVMS/AXP V6.2.
Looking in DCE$SPECIFIC:[VAR.ADM.TIME]DCE$DTSD.OUT:
Fatal error at line 782 in file DECW$DCERESD:[DTS.SRC]TIMERS.C;1
%CMA-F-EXCCOP, exception raised; VMS condition code follows
-SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=8059457C, PS=0000001B
The image identification information of SYS$SYSTEM:DTSD.EXE is:
image name:"DCE$DTSD"
image file identification: "DCE V1.4-961030"
link date/time: 30-OCT-1996 14:24:11.47
linker identification: "T10-58"
This has broken our internal DCE cell; I'd really value any other ideas
as to how we can fix it.
Thanks,
Cameron
|
2203.3 | Is this the real culprit? | ESME::SPENCE | Bugs? You mean insects? | Thu Apr 03 1997 08:03 | 16 |
| As I said in .-1, I'm running DCE V1.4, and I had the same problem.
I'd also previously read the note concerning the problem due to hit DCE
on 19th May (see note 2132 in this conference), and wondered
if it could be related. So, I installed the patch for that particular
problem, and guess what?
The problem with DTSD went away - it doesn't crash anymore.
This would tend to suggest to me that this 'timebomb' in OpenVMS has,
in fact, 'exploded' NOW (at least, it did on my system), not next month.
Perhaps someone from DCE engineering might like to comment. If my
hypothesis is true, then we need to get this patch kit to our DCE/OpenVMS
customers rather more urgently than we previously thought...
- Cameron
|
2203.4 | Hmmm... | STAR::SWEENEY | | Thu Apr 03 1997 10:22 | 14 |
|
I believe if your hypothesis were true, I would have been inundated with cases
from Europe with this error. Any other European field engineers encounter this
problem?
In .1 it states "he didn't change the time which meant that he couldn't start
dce". Did you do anything else along with installing the patch? So you mean
DCE was not running when the time change occurred?
If DCE was not running, then in addition to changing the time you need to
update the timezone configuration as described in section 18.2 of the DCE for
OpenVMS product guide. Please let me know if this fixes the problem.
Dave
|
2203.5 | here's one more site | COMICS::HOWLAND | | Thu Apr 03 1997 17:15 | 6 |
| Hi Dave,
I can confirm this on at least one other site in the UK.
They could not start dtsd after the time change.
Systems already running the "May 19th" patch were not affected;
installing the patch cleared the problem on affected nodes.
Graham
|
2203.6 | Thanks Graham... | STAR::SWEENEY | | Thu Apr 03 1997 17:23 | 4 |
|
Thanks for confirming the hypothesis. I believe now it can be deemed a theory.
Dave
|
2203.7 | pthread_cond_timedwait set to expire after 19-May | STAR::SWEENEY | | Thu Apr 03 1997 17:56 | 5 |
|
This appears to be caused by a pthread_cond_timedwait being set to expire after
19-May-1997.
Dave
|
2203.8 | Um...not so fast... | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 03 1997 19:37 | 32 |
| Folks,
This is not a "10,000 Day" (i.e., 19 May 1997) problem. It's wonderful that the
patch for that problem -appears- to address this problem, but I expect that it
is merely masking it.
From .2:
SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=8059457C, PS=0000001B
This is the real symptom of the problem. (The fact that it's chained to the
"CMA-F-EXCCOP" primary is inconsequential!!) There is a memory corruptor in
this application which is presumably corrupting the return address in some frame
on some thread's stack (resulting in the thread attempting to execute something
which is not code...).
Presumably, installing the 10,000 Day ECO kit changed the environment enough to
make the problem seem to "go away". Perhaps, with new images of slightly
different sizes the memory layout has changed so that now the corruption isn't
fatal (i.e., maybe it's now benign or maybe it's simply going unnoticed at the
moment!). Or, possibly, the new code paths have changed the execution order so
that a race condition which has resulted in the corruption no longer occurs.
Either way, there is a bug somewhere in the application, and there is no reason
to believe that the 10,000 Day ECO actually fixes it.
Sorry,
Webb Scales
DECthreads
|
2203.9 | I still doubt it. | WTFN::SCALES | Despair is appropriate and inevitable. | Thu Apr 03 1997 19:40 | 9 |
| .7> This appears to be caused by a pthread_cond_timedwait being set to expire
.7> after 19-May-1997.
Dave, is there really reason to believe that DTSD would be setting timeouts more
than a month into the future? (Regardless, I don't see how you get from
DECthreads returning an EINVAL status to the process taking an OPCCUS trap!)
Webb
|
2203.10 | Not too slow. If we err, let's err on the side of protecting our customers | STAR::SWEENEY | | Thu Apr 03 1997 21:11 | 62 |
|
Webb,
To try to address Webb's concern:
Can any field engineer acknowledge that DCE systems without the patch functioned
properly after the time change? All my systems have had the patch installed for
quite some time. Can you jump in here, Marco, Bill?
To try to address my concern:
As you can see in the initial notes I did not originally think this problem was
related to the 10,000 day limit patch. I believe the confirmation from two
additional customers is reason enough to take action informing other customers
how to work around the problem if it occurs. So what do you propose? Are you
proposing we not inform our customers and hope we come up with another
workaround and a patch kit tomorrow? Or are you saying the problem will not
occur? According to our field engineers we have a solution, and it's already
gone to many of our production customer sites. You better believe there are
customers out there who will put this patch on if they encounter the problem.
After it "alleviates" the problem, they will ask why we did not recommend
putting the patch on before the time change.
Even if the patch "masks" the problem, a workaround IS a fix to a large customer
production site that risks losing bundles of money when the application does not
work.
-----
DCE is a rather large accumulation of code; it's hard to know the internals of
all its components. As DTS is a persistent process, I understand why it would
set timers to expire far out in the future. For example, it will set a timer to
adjust the time zone differential when daylight savings time expires.
From .8:
>Presumably, installing the 10,000 Day ECO kit changed the environment enough to
>make the problem seem to "go away". Perhaps, with new images of slightly
>different sizes the memory layout has changed so that now the corruption isn't
>fatal (i.e., maybe it's now benign or maybe it's simply going unnoticed at the
>moment!). Or, possibly, the new code paths have changed the execution order so
>that a race condition which has resulted in the corruption no longer occurs.
I do not know of anyone with internals knowledge of DTS who could point to
suspicious sections of code where your analysis may apply. Why did the exact
same DCE code not exhibit the problem last year? I have heard this exact same
analysis concerning other problems encountered with DCE's use of threads in the
past. Excuse my skepticism, but this analysis has not rung true in many cases.
From .8:
>Either way, there is a bug somewhere in the application, and there is no reason
>to believe that the 10,000 Day ECO actually fixes it.
Or there is a bug in something the application uses. So you do not believe
Graham that the patch fixed the problem?
Dave
|
2203.11 | Better safe than sorry, IMHO | CSC32::J_MORTON | O8-OO-2b || ! 2b | Thu Apr 03 1997 22:08 | 18 |
| Hi,
While the DECthreads engineer may well be correct that the patch isn't
the "fix" to the problem, I think Dave's concern for our credibility
is valid. Can we risk that customers will have the problem while we
have a "workaround" on the shelf ready to go?
If we find a site with the problem that will allow us to do further
analysis possibly we can provide a real "fix" at a later date.
Also, given that the 10,000 day ECO will be necessary soon anyway, it
certainly won't hurt for the customers to install it early if they
haven't already.
.02,
Jim
CSC/CS
|
2203.12 | TDCE may be masking the 10,000 day problem... | STAR::SWEENEY | | Thu Apr 03 1997 23:09 | 24 |
|
Below is the "offending" DCE code. Any error besides EAGAIN is
bugchecked. My review of the code generated by the bugcheck() call
leads me to believe the "real" error, invalid argument, is not bubbling
up the stack. Looks like it simply prints the line number and calls
abort(). I'll have to look at it with fresher eyes tomorrow. It has been
common in my experience with DCE that any unhandled condition ends up
being reported as ACCVIO or OPCDEC.
Dave
    if (pthread_cond_timedwait (&timerPtr->timerEventCond,
                                &timerPtr->mutex,
                                &expirationTime) < 0) {
        if (errno == EAGAIN) {          /* normal timer expiration */
            if (timerPtr->timerEvent == K_TIMER_EVENT_NULL)
            {
                timerPtr->timerEvent = K_TIMER_EVENT_EXPIRE;
            }
        }
        else
            BugCheck();                 /* << this is the line aborting */
    }
|
2203.13 | OK, now I'm satisfied that this IS a 10,000 day problem. | WTFN::SCALES | Despair is appropriate and inevitable. | Fri Apr 04 1997 13:20 | 83 |
| .12> Looks like it simply prints the line number and calls abort().
OK, there's the missing piece of data! As I recall, the result of calling
abort() -is- the OPCCUS trap (which is REALLY gross, but there you are).
.10> For example, it will set a timer to adjust the time zone differential
.10> when daylight savings time expires.
Yes, this _would_ be subject to the 10,000 day problem, resulting in the call
to abort() above, and _would_ be fixed by applying the ECO. So, I think we
are all squared away, now.
Nevertheless, I'd like to make clear my position on this situation. I never
indicated that anyone should not apply the ECO now or in the future. What I
said was (in the absence of knowing that DTSD was calling abort()) the
reported symptoms did not match the known symptoms of the problem which the
ECO addresses. Thus, while applying the patch SEEMED to address the problem,
there was no sound basis for believing it fixed the problem. Thus, it was
critical that we understood this and set our customer's expectations
accordingly: that the patch would seem to provide a workaround for this
problem but that it was possible that other problems (data corruptions!)
might result. If any such problems HAD cropped up after we claimed that the
ECO was a "fix", it would have severely compromised Digital's credibility!!
Furthermore, without the patch, the problem seemed to be reasonably
reproducible. That is, without the patch, it might have been possible for
Digital engineers to isolate the source of the apparent corruption. Thus,
installing the ECO on -all- systems (internal as well as customers') would
have been a _disservice_ to our customers, because it might have made it
nearly impossible for Digital to locate and fix the real problem.
.10> I do not know of anyone with internals knowledge of DTS who could point to
.10> suspicious sections of code where your analysis may apply. Why did the exact
.10> same DCE code not exhibit the problem last year. I have heard this exact same
.10> analysis concerning other problems encountered with DCE's use of threads in the
.10> past. Excuse my skepticism, but this analysis has not rung true in many cases.
*grin* Just because the DCE code hasn't changed doesn't mean that nothing
else has changed. Other shared images activated in the process might have
changed size. This changes the virtual addresses at which allocations end
up, potentially changing the location and effects of a data corruption. This
also changes code path lengths. This could affect when certain events (such
as locking or unlocking a mutex) occur relative to events in other threads.
Changes in memory layout change the pagefault patterns (as do accesses to
sharable images from other processes on the system) which also affect timing.
Finally, simple things like system and network load and I/O latencies alter
when a thread gets to run and when its timeslices occur. Thus, if there are
*any* synchronization bugs ("race conditions") in your multithreaded code,
they could remain hidden throughout all of your testing and even long
stretches of deployment at multiple customer sites, but, when the right
factors line up, the race suddenly goes the other way and a data corruption
results (which can look like an ACCVIO, or an OPCDEC, or, yes, an OPCCUS,
depending on the exact nature of the corruption).
In multithreaded programming, you are subject to all of the same classes of
errors that you encounter in sequential programming and, in -addition- you're
subject to errors in synchronization in which timing is a factor. In
sequential programming, it is possible to test all possible code paths;
however, in multithreaded programming (and other forms of asynchronous
programming) it is simply not possible to test the code completely, because
you cannot control the timing factor from outside the code.
Dave, I'm glad that you and no one that you know with internals knowledge of
DTS have experienced this class of problem. That speaks highly of your code
(or your luck, but hopefully it's the former ;-). Nevertheless, we on the
DECthreads team -do- encounter this sort of thing in our consumers' code from
time to time, and it's always a pain for us and for them. For them because
it's very hard to find (it almost always has to be found via code-review),
and for us because they always say "well, the exact same code did not exhibit
the problem until we [changed something external to it]."
In closing, I'd like to apologise for having had to push back when this
problem did turn out to be a 10,000 day problem after all (everything would
have been clearer if abort() resulted in an SS$_ABORT instead of
SS$_OPCCUS!!). Nevertheless, it would have been inappropriate for Digital to
claim that the ECO "fixed" the problem until we understood what the problem
actually was and how the ECO fixed it. And, doing so would have been a
disservice to our customers and possibly damaging to Digital.
Webb
|
2203.14 | FLASH article that we sent to customers. | CSC32::R_WILLIAMS | | Fri Apr 04 1997 14:32 | 42 |
|
[DCE] DCE$DTSD Fails to Start After Daylight Savings Time
Any party granted access to the following copyrighted information
(protected under Federal Copyright Laws), pursuant to a duly executed
Digital Service Agreement may, under the terms of such agreement copy
all or selected portions of this information for internal use and
distribution only. No other copying or distribution for any other
purpose is authorized.
Copyright (c) Digital Equipment Corporation 1997. All rights reserved.
PRODUCT: DIGITAL Distributed Computing Environment (DCE) for OpenVMS,
Versions 1.3A through 1.4
OP/SYS: DIGITAL OpenVMS VAX, Versions 5.5-2 through 7.0
DIGITAL OpenVMS Alpha, Versions 6.1 through 7.0
SOURCE: Digital Equipment Corporation
OVERVIEW:
OpenVMS Sustaining engineering strongly recommends the installation of
VAXLIBR06_070 and ALPLIBR05_070 patch kits prior to the daylight savings
time change on the weekend of April 5 and 6, 1997.
INFORMATION:
After the daylight savings time change on the weekend of April 5 and 6,
the DCE time service daemon, DCE$DTSD, may fail to start. The following
error is written to the DCE$DTSD.OUT file:
"Fatal error at line 782 in file decw$dceresd:[dts.src]timers.c;1"
We believe this problem is caused by a pthread_cond_timedwait being set
to expire after 19-May-1997. Refer to the "OpenVMS Delta-Time Restrictions
and 19-May-1997" Blitz documentation for a description of the 19-May-1997
time problem. Additional information will be provided in future updates
concerning the OpenVMS Delta-Time Restrictions issue.
|
2203.15 | Thanks for your input | STAR::SWEENEY | | Fri Apr 04 1997 14:56 | 32 |
|
I think it is good that you pushed back as your input helped me identify the
10,000 day limit as the real cause of the problem. The DTS code, and the DCE
code in general, does a poor job of supplying the real condition that causes
image terminations. I knew about this limitation of the code; you did not.
>Dave, I'm glad that you and no one that you know with internals knowledge of
>DTS have experienced this class of problem. That speaks highly of your code
>(or your luck, but hopefully it's the former ;-). Nevertheless, we on the
>DECthreads team -do- encounter this sort of thing in our consumers' code from
>time to time, and it's always a pain for us and for them. For them because
>it's very hard to find (it almost always has to be found via code-review),
>and for us because they always say "well, the exact same code did not exhibit
>the problem until we [changed something external to it]."
Well I can't take credit for writing the DCE DTS code. I'm just one of the
lucky guys who gets to support it!
>In closing, I'd like to apologise for having had to push back when this
>problem did turn out to be a 10,000 day problem after all (everything would
>have been clearer if abort() resulted in an SS$_ABORT instead of
>SS$_OPCCUS!!). Nevertheless, it would have been inappropriate for Digital to
>claim that the ECO "fixed" the problem until we understood what the problem
>actually was and how the ECO fixed it. And, doing so would have been a
>disservice to our customers and possibly damaging to Digital.
Yes, I agree it would have been a disservice if the patch did not correct the
problem. However given the time criticality of the issue, I was willing to take
action without an absolute understanding of the problem.
dave
|