T.R | Title | User | Personal Name | Date | Lines |
---|
8992.1 | you're on the right track... | SMURF::PETERT | rigidly defined areas of doubt and uncertainty | Fri Feb 28 1997 08:41 | 20 |
| Your assumptions of 1) different .so files, and 3) not enough file
space, are both valid. Though on 3) you should get some sort of
warning message from the system that the device is full. As for
2) it seems to me I've gone through this before and never come
up with a decent answer. Certainly one way would be to replace
the .so files on your system with theirs, but that is not what
you are looking for. Checking the man page for the loader(5),
it appears the best bet would be to set the environment variable
_RLD_LIST with an explicit list of libraries, which will get
loaded in that order. There is also a mention of LD_LIBRARY_PATH
which might work, but it seems to me that this is the one that
didn't work the last time I looked at this. You should peruse
the loader man page and see if it helps out.
But if the core is corrupt, nothing is going to help. If they can
get a traceback on their system, before you take a look at it,
then the library mismatch is likely the problem.
PeterT
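
For reference, the two loader settings mentioned above could be tried along these lines. This is only a sketch under stated assumptions: the directory /usr/tmp/custlibs and the library name libc.so are hypothetical placeholders, and the DEFAULT keyword in _RLD_LIST is an assumption that should be checked against loader(5) on your release. Normally you would pick one of the two options rather than set both.

```shell
# Hypothetical directory holding copies of the customer's .so files.
CUSTLIBS=${CUSTLIBS:-/usr/tmp/custlibs}

# Option 1: have the loader search that directory first.
# The ${VAR:+...} form avoids a trailing colon when the variable was empty.
LD_LIBRARY_PATH=$CUSTLIBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH

# Option 2: name the libraries explicitly, in load order.
# libc.so is only an example name; DEFAULT is assumed to stand for the
# program's normal library list (verify this in loader(5)).
_RLD_LIST=$CUSTLIBS/libc.so:DEFAULT
export _RLD_LIST

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "_RLD_LIST=$_RLD_LIST"
```

Start the debugger from the same shell afterwards so that it inherits these settings.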
|
8992.2 | Turn off the setuid bit | WTFN::SCALES | Despair is appropriate and inevitable. | Fri Feb 28 1997 09:28 | 14 |
| .1> Checking the man page for the loader(5), it appears the best bet would be
.1> to set the environment variable _RLD_LIST with an explicit list of
.1> libraries, which will get loaded in that order. There is also a mention
.1> of LD_LIBRARY_PATH which might work
The key is whether the debugger image has the setuid bit set -- if it does,
then the loader will ignore LD_LIBRARY_PATH, and it probably skips _RLD_LIST,
too, for the same reason. (We use LD_LIBRARY_PATH, and we hit this problem
with Ladebug.) However, if you turn off the setuid bit, things should work
well (provided you remember to set up LD_LIBRARY_PATH ;-). By turning off
the setuid bit, you lose only the ability to do remote debugging, I think.
Webb
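
A quick way to check for the setuid bit before touching anything might look like the sketch below. The debugger path /usr/bin/ladebug is an assumption (adjust it for your installation); the -u test and chmod u-s are standard shell and chmod usage, and clearing the bit would normally need root.

```shell
# has_setuid FILE: succeed if FILE exists and has its set-user-ID bit set.
has_setuid() {
    [ -u "$1" ]
}

# Assumed location of the debugger image; override DEBUGGER if it differs.
DEBUGGER=${DEBUGGER:-/usr/bin/ladebug}

if has_setuid "$DEBUGGER"; then
    echo "$DEBUGGER is setuid: the loader will ignore LD_LIBRARY_PATH for it"
    echo "to clear the bit (as root): chmod u-s $DEBUGGER"
else
    echo "$DEBUGGER is not setuid (or does not exist)"
fi
```

The script only reports the state and prints the chmod command instead of running it, since turning off the bit changes an installed image.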
|
8992.3 | Thanks, | EDSCLU::KELLY | | Fri Feb 28 1997 12:44 | 5 |
| Thanks for your help and input. This is what we were looking for.
Regards,
Jim Kelly
|
8992.4 | | SMURF::DENHAM | Digital UNIX Kernel | Fri Feb 28 1997 14:13 | 7 |
| There are also a couple of other possibilities. The memory the core file
recorded may well be corrupt. In other words, if the stack is
corrupted or the code took a wild jump somewhere, you can see
similar unhelpful tracebacks.
Also, before V4.0, having more than 15 or so threads in a core
dump would corrupt the core file.
|
8992.5 | | DCEIDL::BUTENHOF | Dave Butenhof, DECthreads | Mon Mar 03 1997 07:33 | 18 |
| .2: The key is whether the debugger image has the setuid bit set -- if it
.2: does, then the loader will ignore LD_LIBRARY_PATH, and it probably skips
.2: _RLD_LIST, too, for the same reason.
This is only a problem when you're trying to affect the shared libraries USED
BY THE DEBUGGER. It's not a problem when you only want to affect the shared
libraries used by the program that you're debugging. (Unless, of course, it,
too, has setuid [or setgid] set.)
DECthreads runs into the problem Webb mentions because, with Digital UNIX
4.0, ladebug links against a library to facilitate debugging threaded
programs, and that library must match exactly with the version of DECthreads
used by the program you're debugging. Since we tend to debug against versions
of the thread library more recent than that installed on the system, we also
need ladebug to use the matching libpthreaddebug.so image -- which means
shutting off ladebug's setuid.
/dave
|
8992.6 | help! | EDSCLU::WANG | | Fri Mar 28 1997 09:13 | 16 |
| Hi,
We received a couple more core dumps from the same customer. We
also asked them to get the dbx output on "where,tlist,tstack" from their
system (V3.2C) where the core was generated. We got same results as the
base note .0 . We got the same results, and we really cannot determine what
caused the core from the dbx information. They checked that the disk was big
enough for the core and that the user was root.
Why did all the core files get corrupted? Is there anything
that the customer should be aware of, so that the next time the process
dies it will generate a useful core dump?
Thanks,
Danqing
|
8992.7 | Stack overwritten? | QUARRY::petert | rigidly defined areas of doubt and uncertainty | Fri Mar 28 1997 09:40 | 13 |
| Hmmm, are you saying the customer can't do the tracebacks either?
If so, it seems likely that the core file is getting corrupted, but
I'm not sure why. One possibility is that the program is crashing
because it has a memory bug someplace (a bad pointer or buffer
overrun, say) and it is overwriting its own stack. In this case, all
the information that a debugger looks for to determine the cause of
the crash has been overwritten with data which is meaningless to the
debugger. This may well
up. In this case, you might have more luck analyzing the program
with some of the atom tools. 3rd (or is it third?) comes to mind,
but I don't know as much about it as some others might.
PeterT
|
8992.8 | | EDSCLU::WANG | | Fri Mar 28 1997 10:48 | 7 |
| Yes, the customer can't do the tracebacks either. I've never run atom
tools before; would you mind showing me how? I did "atom -tool third
lu62_server" and it created lu62_server.third*. Where do I go from
here?
Thanks for your help
|
8992.9 | | SMURF::DENHAM | Digital UNIX Kernel | Fri Mar 28 1997 11:39 | 7 |
| So, how many threads in the application? More than 15 or
so? Then you need a kernel patch to get good core files.
I'm looking in the support pool sources and I'm not seeing
the patch to kern_sig.o to fix this. I gave the support
team a fix for this problem months ago, but it never seems
to have made it into the pools. Again! Argh.
|
8992.10 | | EDSCLU::WANG | | Fri Mar 28 1997 11:55 | 7 |
| Yes, we have more than 15 threads.
So if the customer installs this new kernel patch, they may get good
core files? Would you please let us know where we can get this new
patch?
Thanks a lot.
|
8992.11 | Looking for more info | EDSCLU::GARROD | IBM Interconnect Engineering | Fri Mar 28 1997 14:51 | 12 |
| Re .9
Could you please give more information on this fix, i.e. which
versions of UNIX it applies to, what the patch kit ident is, and whether
it is incorporated after a certain version of UNIX?
Also, could you tell us how to identify whether we're being hit
by the lack of this patch when we receive core files?
Thanks,
|
8992.12 | | QUARRY::petert | rigidly defined areas of doubt and uncertainty | Fri Mar 28 1997 15:00 | 11 |
| My knowledge of third is limited. I know most of what I do from reading
the third(5) man page. After producing the xxx.third file, set LD_LIBRARY_PATH
to point to the current directory (third should have produced a libc.so.third
file too, unless your application is unshared, which would seem unusual for
a threaded application), and then run the program. It will produce an
xxx.3log file and you can peruse that for information on where you might have
memory problems, uninitialized data, etc.
But if it's the problem with the # of threads, this info may be moot.
PeterT
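
Put together, the steps above might be scripted roughly as follows. Treat it as a sketch, not a verified procedure: the atom command is the one quoted in .8, the .third and .3log names follow the pattern described above, and run_third is a hypothetical helper name of my own.

```shell
# run_third PROG [ARGS...]: instrument PROG with third, run the
# instrumented copy, and report where the log should be.
run_third() {
    prog=$1
    shift
    # Instrument the program (the command quoted in .8).
    atom -tool third "$prog" || return 1
    # Let the loader find libc.so.third in the current directory.
    LD_LIBRARY_PATH=.${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
    # Run the instrumented image, passing any remaining arguments through.
    "./$prog.third" "$@"
    status=$?
    echo "third log: $prog.3log"
    return $status
}
```

It would be invoked from the directory holding the program, for example as "run_third lu62_server", followed by whatever arguments the server normally takes.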
|
8992.13 | | SMURF::DENHAM | Digital UNIX Kernel | Fri Mar 28 1997 17:15 | 18 |
| Re. .11.
The problem applies to all releases before V4.0. There is *NO* patch
for the problem. A test patch was generated for a customer in Hong
Kong, and I foolishly assumed that this would get turned into
a V3.2-based patch. Bzzzzt. Wrong. I'm investigating what happened
there. In the meantime, send mail to [email protected] and
ask for the test patch. Tell him I sent you. ;^)
Identifying the problem isn't too hard, if you know the application
well enough to know that it generally uses more than 15-16 threads
at some point. Basically, the core file is corrupt. It will give
some tracebacks but the stacks may be pretty bizarre, and it
won't show all the stacks by a long shot.
If we had the darn patch, this would be academic. You'd apply it and
then get on with life or eliminate the problem as irrelevant.
But I'm stating the obvious (out of frustration).
|