T.R | Title | User | Personal Name | Date | Lines |
---|
1516.1 | | DCETHD::BUTENHOF | Dave Butenhof, DECthreads | Mon Apr 07 1997 13:31 | 37 |
| Historically, the most common cause of this bugcheck is an application that
tries to do something requiring DECthreads scheduling from within an AST: for
example, locking a mutex that's already locked. HOWEVER, as of 6.2, the
DECthreads bugcheck log includes the current process AST state -- and the
dump you've included clearly shows that no ASTs are active.
Other possible application-related causes would include a memory corruption
that set the DECthreads scheduling spinlock. (It's a likely target for memory
corruption, since in a program that calls DECthreads a lot, copies of the
address of the spinlock are likely to be scattered across the thread stacks,
where uninitialized pointer operations can find them pretty easily.)
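To make the stale-pointer scenario concrete, here is a toy model in Python. It is only a sketch: a flat list stands in for process memory, and every name and address in it (SPINLOCK_ADDR, STACK_SLOT, call_into_library, buggy_function) is hypothetical and has nothing to do with the real DECthreads internals.

```python
# Toy model only: a flat list stands in for process memory.
# All names and addresses are hypothetical, not DECthreads internals.
SPINLOCK_ADDR = 3    # assumed location of the scheduling spinlock word
STACK_SLOT = 10      # assumed stack slot that gets reused between calls

memory = [0] * 16    # memory[SPINLOCK_ADDR]: 0 = free, 1 = held


def try_lock():
    """Take the spinlock if it is free; report whether we got it."""
    if memory[SPINLOCK_ADDR] == 0:
        memory[SPINLOCK_ADDR] = 1
        return True
    return False


def unlock():
    memory[SPINLOCK_ADDR] = 0


def call_into_library():
    """A library call leaves the spinlock's address behind in a dead stack slot."""
    memory[STACK_SLOT] = SPINLOCK_ADDR


def buggy_function():
    """An 'uninitialized' local pointer picks up whatever the slot last held."""
    stale_pointer = memory[STACK_SLOT]
    memory[stale_pointer] = 1    # stray write lands on the spinlock word


def demo():
    memory[:] = [0] * 16
    before = try_lock()          # lock is free, so this succeeds
    unlock()
    call_into_library()
    buggy_function()
    after = try_lock()           # nobody holds the lock, yet it looks held
    return before, after


if __name__ == "__main__":
    print(demo())
```

After the stray write, the lock word reads as held although no thread owns it, which is the kind of state a scheduler could only report as an internal inconsistency.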
Of course, I can't rule out the possibility of a DECthreads bug. But, on the
other hand, 6.2 has been out a long time now.
ALPCMAR04_062 isn't really a "bug fix" at all, by the way. It was just a
pragmatic way to address a problem that came up in using a certain layered
product under high load -- it linked against DECthreads but didn't actually
use threads most of the time, and the kernel wasn't dealing well with the
overhead of the DECthreads timeslice AST in a large number of these
processes. It would have been "hard" to fix the kernel or to modify the
layered product to avoid using DECthreads in a single threaded process while
still supporting everything it needed to support, and Webb figured out a
fairly easy fix to just defer starting the timeslicer until a thread is
created. In any process that creates at least one thread, the patch doesn't
really do anything at all.
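The idea of deferring the timeslicer can be sketched as lazy initialization. This is a minimal Python illustration, not the patch itself: ToyRuntime, create_thread, and _start_timeslicer are hypothetical names, and the "timeslicer" here is just a flag where a real library would arm a periodic timer.

```python
import threading


class ToyRuntime:
    """Hypothetical stand-in for a thread library; not the DECthreads API."""

    def __init__(self):
        # Nothing periodic is started at load time, so a process that never
        # creates a thread never pays for timeslice interrupts.
        self.timeslicer_started = False
        self.threads_created = 0
        self._lock = threading.Lock()

    def _start_timeslicer(self):
        # A real library would arm a periodic timer here; the sketch only
        # records that it happened.
        self.timeslicer_started = True

    def create_thread(self, target):
        with self._lock:
            if not self.timeslicer_started:
                self._start_timeslicer()   # first thread: start it now
            self.threads_created += 1
        thread = threading.Thread(target=target)
        thread.start()
        return thread


if __name__ == "__main__":
    runtime = ToyRuntime()
    print(runtime.timeslicer_started)      # no threads yet
    runtime.create_thread(lambda: None).join()
    print(runtime.timeslicer_started)      # started by the first create
```

Once a process has created one thread, this behaves the same as starting the timeslicer eagerly, which matches the observation that the patch changes nothing for genuinely threaded processes.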
> Once the problem occurs, the AlphaStation is completely hung and does
>not respond to CTRL/P or even the reset button, so it is not possible to
>obtain a forced crash dump.
This, of course, implies that something's wrong with the SYSTEM.
DECthreads is completely non-privileged user code. It seems unlikely that
your bugcheck could be a symptom of these system problems -- but then, one
never knows...
/dave
|
1516.2 | Several problems, I'd guess... | WTFN::SCALES | Despair is appropriate and inevitable. | Mon Apr 07 1997 14:06 | 15 |
| FWIW, the indicated line is in cma_thread_set_sched().
At this point, it's much more likely to be a memory corruptor in the customer's
application than a bug in DECthreads.
> Once the problem occurs, the AlphaStation is completely hung and does
> not respond to CTRL/P or even the reset button, so it is not possible to
> obtain a forced crash dump.
Doesn't respond to the reset button?? That sounds like a hardware problem to
me... (Either that or maybe it's something stuck in a *VERY* high interrupt
priority loop...which could be the result of hardware problems in some device.)
Webb
|
1516.3 | | PRSSOS::MAILLARD | Denis MAILLARD | Tue Apr 08 1997 06:17 | 14 |
| Re .1, .2: Thanks for the info. I'd like it to be a hardware problem,
but, except for the fact that even pushing the reset button does not
get a response, I don't have anything that points in that direction.
I'll ask for the errorlog.
Besides that, another thing that points to a software problem is
that, if I understood the customer correctly, each time the problem
occurred, it was the same application that was involved, and it is not
a customer application, but a Digital layered product: RSM. I might
have to escalate an RSM IPMT soon...
Thanks again for your help, I'll try to update this note if we make
any progress.
Denis
|
1516.4 | Software can "cause" hardware problems... | WTFN::SCALES | Despair is appropriate and inevitable. | Tue Apr 08 1997 16:14 | 13 |
| .3> another thing that points to a software problem
There's nothing to say that software (correctly or erroneously) cannot provoke a
hardware problem. (The WAR_STORY notes conference has several marvelous
examples... :-)
We were seeing a problem for a while where one of our VAXes would lock up like
you were describing, and we think it was related to the cluster disk back-ups.
We never did prove anything (although we swapped lots of components around). (I
think we got rid of the problem by upgrading the machine.)
Webb
|
1516.5 | | PRSSOS::MAILLARD | Denis MAILLARD | Wed Apr 09 1997 11:57 | 15 |
| I asked the customer for the errorlog, but he has just informed me
that as he was going to copy it, the system disk crashed. He's going to
restore an old backup on a new disk, but the errorlog is lost (the
backup dates from before the first occurrence of the problem). So he
asked me to close the call until it happens again. I'd like to think
that it's the last we'll hear about it, but I wouldn't bet anything on
the chances... I'll enter a new reply to this call if/when the customer
calls me back.
BTW, when I asked him for the two other CMA_DUMP.LOG files, he told
me that they had been automatically purged before he thought to save
them, and he's not so sure anymore that the software running in the
first two cases was also RSM. We'll need one or more new occurrences to be
able to indict RSM...
Denis.
|