
Conference clt::cma

Title:	DECthreads Conference

Moderator:	PTHRED::MARYSTEON

Created:	Mon May 14 1990
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1553
Total number of notes:	9541

1516.0. "enter_kernel: deadlock" by PRSSOS::MAILLARD (Denis MAILLARD) Mon Apr 07 1997 11:54

	Can anybody tell me whether the following DECthreads bugcheck is
something already known and hopefully corrected in DECthreads or if it is
unknown and must be investigated? In the latter case it could well be an RSM
problem, which might make it difficult to escalate as I understand that RSM was
among the Polycenter products sold to Computer Associates.

	This bugcheck has happened three times in the last month on my
customer's site, while nothing was changed software-wise and it had never
occurred before. It might be the result of a load increase, but the customer is
not sure that the cpu load has really increased.

	Once the problem occurs, the AlphaStation is completely hung and does
not respond to CTRL/P or even the reset button, so it is not possible to obtain
a forced crash dump. The only way out is an electrical power down followed by a
power up. Monitoring software on this machine has each time signaled a number
of processes going into RWMBX state just before the hang.

	Just in order to be up-to-date, I've recommended installing the
ALPCMAR04_062 patch, but I've seen nothing in its release notes that might even
remotely be connected to such a problem.
	Thanks for any help,
		Denis.

%DECthreads bugcheck (version T2.12-296), terminating execution.
% Running on OpenVMS AXP [OpenVMS V6.2-1H2; AlphaStation 255 4/232, 1 cpu,
%  64Mb]
% Reason: enter_kernel: deadlock at $64$DUA3210:[CMARTL.SRC]CMA_THREAD.C;1:1119
%     
%     The DECthreads library has detected an inconsistency in its internal
%   state and cannot continue execution. The inconsistency may be due to a bug
%   within the DECthreads library, the application program, or in any library
%   active in the address space. Common causes are unchecked stack overflows,
%   writes through uninitialized pointers, and synchronization races that
%   result in use of invalid data by some thread.
%     Application and library developers are requested to please check for
%   such problems before reporting a DECthreads library problem.
%     The information in this file may aid application program, library, or
%   DECthreads developers in determining the state of the process at the time
%   of this bugcheck. When the problem is reported, this information should be
%   included, along with a detailed procedure for reproducing the problem, if
%   that is possible. The 'detailed procedure' most likely to be of use to
%   developers is a complete program.
% 
% The bugcheck occurred at 04-APR-1997 19:39:47.89 running image
%  $20$DKA0:[SYS0.SYSCOMMON.][SYSEXE]RSM$NETDCL.EXE;3 in process 00000F69
%  (named "RSM$NETDC_24700"), under username "ROBOMON$MGR". AST delivery is
%  enabled for all modes; no ASTs active.
% The current thread is 2 (address 0x0054D920)
% Current thread traceback:
%     0:  PC 0x00429CF8, FP 0x0058EA90, DESC 0x003EAAC8
%     1:  PC 0x0043A2D0, FP 0x0058EB60, DESC 0x003EE258
%     2:  PC 0x003A9E24, FP 0x0058EBB0, DESC 0x0035CD18
%     3:  PC 0x8002B3F4, FP 0x0058ED40, DESC 0x86F252C8

T.R	Title	User	Personal Name	Date	Lines
1516.1		DCETHD::BUTENHOF	Dave Butenhof, DECthreads	`Mon Apr 07 1997 12:31`	37
	Historically, the most common cause of this bugcheck is an application
that tries to do something requiring DECthreads scheduling from within an AST:
for example, locking a mutex that's already locked. HOWEVER, as of 6.2, the
DECthreads bugcheck log includes the current process AST state -- and the dump
you've included clearly shows that no ASTs are active.

	Other possible application-related causes would include a memory
corruption that set the DECthreads scheduling spinlock. (It's a likely target
for memory corruption, since in a program that calls DECthreads a lot, the
address of the spinlock is likely to be scattered across the thread stacks,
where uninitialized pointer operations can find them pretty easily.)

	Of course, I can't rule out the possibility of a DECthreads bug. But, on
the other hand, 6.2 has been out a long time now.

	ALPCMAR04_062 isn't really a "bug fix" at all, by the way. It was just a
pragmatic way to address a problem that came up in using a certain layered
product under high load -- it linked against DECthreads but didn't actually use
threads most of the time, and the kernel wasn't dealing well with the overhead
of the DECthreads timeslice AST in a large number of these processes. It would
have been "hard" to fix the kernel or to modify the layered product to avoid
using DECthreads in a single threaded process while still supporting everything
it needed to support, and Webb figured out a fairly easy fix to just defer
starting the timeslicer until a thread is created. In any process that creates
at least one thread, the patch doesn't really do anything at all.

>	Once the problem occurs, the AlphaStation is completely hung and does
>not respond to CTRL/P or even the reset button, so it is not possible to
>obtain a forced crash dump.

	This, of course, implies that something's wrong with the SYSTEM.
DECthreads is completely non-privileged user code. It seems unlikely that your
bugcheck could be a symptom of these system problems -- but then, one never
knows...

	/dave
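
	As an illustration of the two application-side causes described in .1,
here is a minimal Python sketch. It is not DECthreads code and makes no claim
about how the library is implemented: SchedulerLock, enter_kernel, and
simulated_ast are hypothetical names, and the "spinlock" is a plain integer
standing in for a test-and-set word on a single CPU. The assumption is that
on one CPU nothing else can clear the word while the current code runs, so
finding it already set is reported as a deadlock rather than waited on.

```python
class SchedulerLock:
    """Hypothetical non-recursive lock word (0 = free, 1 = held)."""

    def __init__(self):
        self.word = 0

    def try_enter(self):
        """Test-and-set: True if we took the lock, False if it was already set."""
        was_set = self.word
        self.word = 1
        return was_set == 0

    def exit(self):
        self.word = 0


def enter_kernel(lock):
    """Take the lock, or report a deadlock if it is already set.

    Assumption: single CPU, so nobody else can release the lock while we run.
    """
    if not lock.try_enter():
        raise RuntimeError("enter_kernel: deadlock")


def simulated_ast(lock):
    """Stand-in for an AST routine that needs the scheduler."""
    enter_kernel(lock)
    lock.exit()


def outcome(action):
    """Run action and return 'ok' or the error text."""
    try:
        action()
        return "ok"
    except RuntimeError as err:
        return str(err)


lock = SchedulerLock()

# Case 1: the AST is delivered while no code holds the lock.
ast_outside = outcome(lambda: simulated_ast(lock))

# Case 2: the AST interrupts code that already holds the lock.
enter_kernel(lock)
ast_inside = outcome(lambda: simulated_ast(lock))
lock.exit()

# Case 3: a stray write (simulated here) sets the lock word; no AST involved.
lock.word = 1
after_corruption = outcome(lambda: enter_kernel(lock))

print(ast_outside, ast_inside, after_corruption, sep=" | ")
```

	Cases 2 and 3 end in the same report, which is why the "no ASTs active"
line in the dump matters: it rules out the second case and leaves corruption
of the lock word (or a library bug) as the candidates.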
1516.2	Several problems, I'd guess...	WTFN::SCALES	Despair is appropriate and inevitable.	`Mon Apr 07 1997 13:06`	15
	FWIW, the indicated line is in cma_thread_set_sched().

	At this point, it's much more likely to be a memory corruptor in the
customer's application than a bug in DECthreads.

>	Once the problem occurs, the AlphaStation is completely hung and does
> not respond to CTRL/P or even the reset button, so it is not possible to
> obtain a forced crash dump.

	Doesn't respond to the reset button?? That sounds like a hardware
problem to me... (Either that or maybe it's something stuck in a VERY high
interrupt priority loop...which could be the result of hardware problems in
some device.)

		Webb
1516.3		PRSSOS::MAILLARD	Denis MAILLARD	`Tue Apr 08 1997 05:17`	14
	Re .1, .2: Thanks for the info. I'd like it to be a hardware problem,
but, except for the fact that even pushing the reset button does not get a
response, I don't have anything that points in that direction. I'll ask for the
errorlog.

	Besides that, another thing that points to a software problem is that,
if I understood the customer correctly, each time the problem occurred, it was
the same application that was involved, and it is not a customer application,
but a Digital layered product: RSM. I might have to escalate an RSM IPMT
soon...

	Thanks again for your help, I'll try to update this note if we make any
progress.
		Denis
1516.4	Software can "cause" hardware problems...	WTFN::SCALES	Despair is appropriate and inevitable.	`Tue Apr 08 1997 15:14`	13
.3> another thing that points to a software problem

	There's nothing to say that software (correctly or erroneously) cannot
provoke a hardware problem. (The WAR_STORY notes conference has several
marvelous examples... :-)

	We were seeing a problem for a while where one of our VAXes would lock
up like you were describing, and we think it was related to the cluster disk
back-ups. We never did prove anything (although we swapped lots of components
around). (I think we got rid of the problem by upgrading the machine.)

		Webb
1516.5		PRSSOS::MAILLARD	Denis MAILLARD	`Wed Apr 09 1997 10:57`	15
	I asked the customer for the errorlog, but he has just informed me that
as he was going to copy it, the system disk crashed. He's going to restore an
old backup on a new disk, but the errorlog is lost (the backup dates from
before the first occurrence of the problem). So he asked me to close the call
until it happens again. I'd like to think that it's the last we'll hear about
it, but I wouldn't bet anything on the chances... I'll enter a new reply to
this call if/when the customer calls me back.

	BTW, when I asked him for the two other CMA_DUMP.LOG files, he told me
that they had been automatically purged before he thought to save them, and
he's not so sure anymore that the software running in the first two cases was
also RSM. We'll need one or more new occurrences to be able to indict RSM...

		Denis.