Conference clt::cma

Title: DECthreads Conference
Moderator: PTHRED::MARYSTEON
Created: Mon May 14 1990
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1553
Total number of notes: 9541

1516.0. "enter_kernel: deadlock" by PRSSOS::MAILLARD (Denis MAILLARD) Mon Apr 07 1997 12:54

	Can anybody tell me whether the following DECthreads bugcheck is
something already known and hopefully corrected in DECthreads or if it is
unknown and must be investigated? In the latter case it could well be an RSM
problem, which might make it difficult to escalate as I understand that RSM was
among the Polycenter products sold to Computer Associates.

	This bugcheck has happened three times in the last month on my
customer's site, while nothing was changed software-wise and it had never
occurred before. It might be the result of a load increase, but the customer is
not sure that the cpu load has really increased.

	Once the problem occurs, the AlphaStation is completely hung and does
not respond to CTRL/P or even the reset button, so it is not possible to obtain
a forced crash dump. The only way out is an electrical power down followed by a
power up. Monitoring software on this machine has each time reported a number
of processes going into RWMBX state just before the hang.

	Just in order to be up-to-date, I've recommended installing the
ALPCMAR04_062 patch, but I've seen nothing in its release notes that might even
remotely be connected to such a problem.
	Thanks for any help,
		Denis.

%DECthreads bugcheck (version T2.12-296), terminating execution.
% Running on OpenVMS AXP [OpenVMS V6.2-1H2; AlphaStation 255 4/232, 1 cpu,
%  64Mb]
% Reason: enter_kernel: deadlock at $64$DUA3210:[CMARTL.SRC]CMA_THREAD.C;1:1119
%     
%     The DECthreads library has detected an inconsistency in its internal
%   state and cannot continue execution. The inconsistency may be due to a bug
%   within the DECthreads library, the application program, or in any library
%   active in the address space. Common causes are unchecked stack overflows,
%   writes through uninitialized pointers, and synchronization races that
%   result in use of invalid data by some thread.
%     Application and library developers are requested to please check for
%   such problems before reporting a DECthreads library problem.
%     The information in this file may aid application program, library, or
%   DECthreads developers in determining the state of the process at the time
%   of this bugcheck. When the problem is reported, this information should be
%   included, along with a detailed procedure for reproducing the problem, if
%   that is possible. The 'detailed procedure' most likely to be of use to
%   developers is a complete program.
% 
% The bugcheck occurred at 04-APR-1997 19:39:47.89 running image
%  $20$DKA0:[SYS0.SYSCOMMON.][SYSEXE]RSM$NETDCL.EXE;3 in process 00000F69
%  (named "RSM$NETDC_24700"), under username "ROBOMON$MGR". AST delivery is
%  enabled for all modes; no ASTs active.
% The current thread is 2 (address 0x0054D920)
% Current thread traceback:
%     0:  PC 0x00429CF8, FP 0x0058EA90, DESC 0x003EAAC8
%     1:  PC 0x0043A2D0, FP 0x0058EB60, DESC 0x003EE258
%     2:  PC 0x003A9E24, FP 0x0058EBB0, DESC 0x0035CD18
%     3:  PC 0x8002B3F4, FP 0x0058ED40, DESC 0x86F252C8

1516.1. by DCETHD::BUTENHOF (Dave Butenhof, DECthreads) Mon Apr 07 1997 13:31 (37 lines)

Historically, the most common cause of this bugcheck is an application that
tries to do something requiring DECthreads scheduling from within an AST: for
example, locking a mutex that's already locked. HOWEVER, as of 6.2, the
DECthreads bugcheck log includes the current process AST state -- and the
dump you've included clearly shows that no ASTs are active.
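
A minimal C sketch of why scheduling from within an AST can deadlock,
assuming a simple test-and-set scheduling lock. The name enter_kernel is
borrowed from the bugcheck text, but the bodies and every other name here
(exit_kernel, in_ast, ast_routine, ast_during_kernel) are hypothetical; this
is not the DECthreads source. An AST interrupts the main-line code, so if the
main line holds the lock when the AST arrives, spinning inside the AST could
never succeed and the only sane response is to report a deadlock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not DECthreads source. */
typedef enum { KERNEL_ENTERED, KERNEL_DEADLOCK } enter_result;

static atomic_flag kernel_lock = ATOMIC_FLAG_INIT; /* scheduling lock */
static bool in_ast = false;      /* true while an AST-like routine runs */

/* Try to take the scheduling lock. In AST context the interrupted code
 * cannot run until the AST returns, so a held lock means deadlock. */
enter_result enter_kernel(void)
{
    while (atomic_flag_test_and_set(&kernel_lock)) {
        if (in_ast)
            return KERNEL_DEADLOCK;
        /* normal context: another thread holds the lock, keep spinning */
    }
    return KERNEL_ENTERED;
}

void exit_kernel(void)
{
    atomic_flag_clear(&kernel_lock);
}

/* Simulated AST that attempts an operation needing the scheduler. */
enter_result ast_routine(void)
{
    in_ast = true;
    enter_result r = enter_kernel();
    if (r == KERNEL_ENTERED)
        exit_kernel();
    in_ast = false;
    return r;
}

/* Scenario: the AST is delivered while the main line holds the lock. */
enter_result ast_during_kernel(void)
{
    enter_result r = enter_kernel();
    assert(r == KERNEL_ENTERED);
    r = ast_routine();           /* "interrupt" arrives here */
    exit_kernel();
    return r;
}
```

In this model ast_routine() succeeds when the lock is free, while
ast_during_kernel() reports KERNEL_DEADLOCK, which is the situation the
"no ASTs active" line in the dump rules out.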

Other possible application-related causes would include a memory corruption
that set the DECthreads scheduling spinlock. (It's a likely target for memory
corruption, since in a program that calls DECthreads a lot, the address of
the spinlock is likely to be scattered across the thread stacks, where
uninitialized pointer operations can find them pretty easily.)
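
The corruption scenario can be sketched in C under some loud assumptions:
the lock layout (a lock word plus an owner field), and the names sched_lock,
try_enter, leave, library_call, buggy_store and lock_is_wedged are all
invented for illustration, and the "stack slot" is passed explicitly so the
example stays well defined. A legitimate call leaves the lock's address
behind in a slot; buggy code later treats the leftover value as its own
pointer and stores through it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock layout, single-threaded model for illustration. */
typedef struct {
    int locked;   /* 0 = free, 1 = held */
    int owner;    /* id of the holding thread, -1 when free */
} sched_lock;

static sched_lock sched = { 0, -1 };

bool try_enter(int thread_id)
{
    if (sched.locked)
        return false;
    sched.locked = 1;
    sched.owner = thread_id;
    return true;
}

void leave(void)
{
    assert(sched.locked);
    sched.locked = 0;
    sched.owner = -1;
}

/* A legitimate call that leaves the lock's address in a stack slot. */
void library_call(int **stack_slot, int thread_id)
{
    *stack_slot = &sched.locked;
    if (try_enter(thread_id))
        leave();
}

/* Buggy code: models an uninitialized local that reuses the same slot
 * and writes through whatever value was left there. */
void buggy_store(int **stack_slot)
{
    int *p = *stack_slot;
    *p = 1;
}

/* The lock is wedged when it is set but no thread owns it. */
bool lock_is_wedged(void)
{
    return sched.locked && sched.owner == -1;
}
```

After library_call() followed by buggy_store() on the same slot, the lock
reads as held with no owner, so every later try_enter() fails even though no
thread is inside the scheduler.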

Of course, I can't rule out the possibility of a DECthreads bug. But, on the
other hand, 6.2 has been out a long time now.

ALPCMAR04_062 isn't really a "bug fix" at all, by the way. It was just a
pragmatic way to address a problem that came up in using a certain layered
product under high load -- it linked against DECthreads but didn't actually
use threads most of the time, and the kernel wasn't dealing well with the
overhead of the DECthreads timeslice AST in a large number of these
processes. It would have been "hard" to fix the kernel or to modify the
layered product to avoid using DECthreads in a single threaded process while
still supporting everything it needed to support, and Webb figured out a
fairly easy fix to just defer starting the timeslicer until a thread is
created. In any process that creates at least one thread, the patch doesn't
really do anything at all.
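
The deferral idea might look roughly like this in C. It is only a sketch of
the approach described above, not the patch itself: library_init,
create_thread, start_timeslicer and timeslicer_active are hypothetical
names, and a real timeslicer would arm a periodic timer AST rather than set
a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of deferring the timeslicer, not the real patch. */
static bool timeslicer_running = false;
static int  thread_count = 1;            /* the initial thread */

static void start_timeslicer(void)
{
    /* Real code would arm a periodic timer here. */
    timeslicer_running = true;
}

/* Old behaviour: start at initialization. Deferred behaviour: do nothing. */
void library_init(bool defer_timeslicer)
{
    if (!defer_timeslicer)
        start_timeslicer();
}

/* The first thread creation starts the timeslicer if it was deferred. */
int create_thread(void)
{
    if (!timeslicer_running)
        start_timeslicer();
    thread_count++;
    assert(thread_count >= 2);
    return thread_count;
}

bool timeslicer_active(void)
{
    return timeslicer_running;
}
```

A process that calls library_init(true) and never creates a thread pays no
timeslice overhead; once create_thread() runs, the state is the same as if
the timeslicer had been started at initialization.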

>	Once the problem occurs, the AlphaStation is completely hung and does
>not respond to CTRL/P or even the reset button, so it is not possible to
>obtain a forced crash dump.

This, of course, implies that something's wrong with the SYSTEM.
DECthreads is completely non-privileged user code. It seems unlikely that
your bugcheck could be a symptom of these system problems -- but then, one
never knows...

	/dave

1516.2. "Several problems, I'd guess..." by WTFN::SCALES (Despair is appropriate and inevitable.) Mon Apr 07 1997 14:06 (15 lines)

FWIW, the indicated line is in cma_thread_set_sched().

At this point, it's much more likely to be a memory corruptor in the customer's
application than a bug in DECthreads.

> Once the problem occurs, the AlphaStation is completely hung and does
> not respond to CTRL/P or even the reset button, so it is not possible to 
> obtain a forced crash dump.

Doesn't respond to the reset button??  That sounds like a hardware problem to
me...  (Either that or maybe it's something stuck in a *VERY* high interrupt
priority loop...which could be the result of hardware problems in some device.)


				Webb

1516.3. by PRSSOS::MAILLARD (Denis MAILLARD) Tue Apr 08 1997 06:17 (14 lines)

    Re .1, .2: Thanks for the info. I'd like it to be a hardware problem,
    but, except for the fact that even pushing the reset button does not
    get a response, I don't have anything that points in that direction.
    I'll ask for the errorlog.
    
    	Besides that, another thing that points to a software problem is
    that, if I understood the customer correctly, each time the problem
    occurred, it was the same application that was involved, and it is not
    a customer application, but a Digital layered product: RSM. I might
    have to escalate an RSM IPMT soon...
    
    	Thanks again for your help, I'll try to update this note if we make
    any progress.
    			Denis

1516.4. "Software can "cause" hardware problems..." by WTFN::SCALES (Despair is appropriate and inevitable.) Tue Apr 08 1997 16:14 (13 lines)

.3> another thing that points to a software problem

There's nothing to say that software (correctly or erroneously) cannot provoke a
hardware problem.  (The WAR_STORY notes conference has several marvelous
examples... :-) 

We were seeing a problem for a while where one of our VAXes would lock up like
you were describing, and we think it was related to the cluster disk back-ups. 
We never did prove anything (although we swapped lots of components around).  (I
think we got rid of the problem by upgrading the machine.)


				Webb

1516.5. by PRSSOS::MAILLARD (Denis MAILLARD) Wed Apr 09 1997 11:57 (15 lines)

    	I asked the customer for the errorlog, but he has just informed me
    that as he was going to copy it, the system disk crashed. He's going to
    restore an old backup on a new disk, but the errorlog is lost (the
    backup dates from before the first occurrence of the problem). So he
    asked me to close the call until it happens again. I'd like to think
    that it's the last we'll hear about it, but I wouldn't bet anything on
    the chances... I'll enter a new reply to this call if/when the customer
    calls me back.
    
    	BTW, when I asked him for the two other CMA_DUMP.LOG files, he told
    me that they had been automatically purged before he thought to save
    them, and he's not so sure anymore that the software running in the
    first two cases was also RSM. We'll need one or more new occurrences to be
    able to indict RSM...
    		Denis.