Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

8992.0. "traceback will not work" by EDSCLU::KELLY () Thu Feb 27 1997 18:24

We receive a lot of core files from customers that we cannot read in dbx or
DECladebug.  The output from one of these dbx sessions is as follows:

# dbx $lu62_bin/lu62_server /kkk/lu62v3/test/citilu62_core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "lu62_server"

thread 0xfffffc000742cdc0 signal Segmentation fault at
warning: PC value 0xfffffc0011cc43c0 not valid, trying RA

warning: RA value 0xfffffc0011cc4140 not valid, trying text start

warning: text start 0x120000000 not valid, trying data start

warning: Using data start as a text address -- traceback will not work
> [., 0x140000000]      call_pal        cflush


QUESTIONS:

1) If the .so files on the customer system and the .so files on the 
   system we are using to debug the core file do not match, could this 
   cause the above problem?

2) If this is a possibility, is there a way to change the .so files that
   dbx uses if we have a copy of the customer's .so files?

3) Will the core file become corrupt if there is not enough space on the
   file system that the core is written to?





Thanks,

Jim Kelly
T.R  Title  User  Personal Name  Date  Lines
8992.1. "you're on the right track..." by SMURF::PETERT (rigidly defined areas of doubt and uncertainty) Fri Feb 28 1997 08:41  20 lines
    Your assumptions of 1) different .so files, and 3) not enough file
    space, are both valid.  Though on 3) you should get some sort of
    warning message from the system that the device is full.  As for
    2) it seems to me I've gone through this before and never come
    up with a decent answer.  Certainly one way would be to replace
    the .so files on your system with theirs, but that is not what 
    you are looking for.  Checking the man page for the loader(5),
    it appears the best bet would be to set the environment variable
    _RLD_LIST with an explicit list of libraries, which will get
    loaded in that order.  There is also a mention of LD_LIBRARY_PATH
    which might work, but it seems to me that this is the one that 
    didn't work the last time I looked at this.  You should peruse 
    the loader man page and see if it helps out.  
    
    But if the core is corrupt, nothing is going to help.  If they can
    get a traceback on their system, before you take a look at it,
    then the library mismatch is likely the problem.
    
    PeterT
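
For reference, the LD_LIBRARY_PATH approach from .1 might look like the
following in a Bourne-style shell.  This is a minimal sketch, not a tested
recipe: the directory /tmp/customer_libs and the helper prepend_lib_path are
hypothetical names, and whether dbx honors the variable when reading a core
file is exactly what .1 is unsure about.  _RLD_LIST is left out because its
exact syntax should be taken from loader(5).

```shell
# Hypothetical directory holding copies of the customer's .so files.
CUSTOMER_LIBS="${CUSTOMER_LIBS:-/tmp/customer_libs}"

# Print a search path with directory $1 placed ahead of existing path $2.
prepend_lib_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Put the customer's libraries ahead of whatever is already on the path.
LD_LIBRARY_PATH=$(prepend_lib_path "$CUSTOMER_LIBS" "$LD_LIBRARY_PATH")
export LD_LIBRARY_PATH

# Then start the debugger from this same shell, as in the base note:
#   dbx $lu62_bin/lu62_server /kkk/lu62v3/test/citilu62_core
```

The customer's directory goes first so that it is searched before any
directory already on the path.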
    
8992.2. "Turn off the setuid bit" by WTFN::SCALES (Despair is appropriate and inevitable.) Fri Feb 28 1997 09:28  14 lines
.1> Checking the man page for the loader(5), it appears the best bet would be
.1> to set the environment variable _RLD_LIST with an explicit list of
.1> libraries, which will get loaded in that order.  There is also a mention
.1> of LD_LIBRARY_PATH which might work

The key is whether the debugger image has the setuid bit set -- if it does,
then the loader will ignore LD_LIBRARY_PATH, and it probably skips _RLD_LIST,
too, for the same reason.  (We use LD_LIBRARY_PATH, and we hit this problem
with Ladebug.)  However, if you turn off the setuid bit, things should work
well (provided you remember to set up LD_LIBRARY_PATH ;-).  By turning off
the setuid bit, you lose only the ability to do remote debugging, I think.


				Webb
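
Checking for and clearing the setuid bit, as .2 suggests, could be sketched as
below.  The function names are made up for illustration, and the ladebug path
in the comment is an assumption; locate the debugger on your own system and
expect to need root to change it.

```shell
# Succeed if the file named by $1 has the setuid bit set.
has_setuid() {
    [ -u "$1" ]
}

# Clear the setuid bit on $1 if it is set; otherwise leave the file alone.
# Needs write permission on the file (typically root for an installed debugger).
clear_setuid() {
    if has_setuid "$1"; then
        chmod u-s "$1"
    fi
}

# Example (hypothetical path):
#   clear_setuid /usr/bin/ladebug
```

Per .2, the likely cost of clearing the bit is losing remote debugging, so it
may be worth noting the original mode (ls -l) before changing it.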
8992.3. "Thanks," by EDSCLU::KELLY () Fri Feb 28 1997 12:44  5 lines
Thanks for your help and input.  This is what we were looking for.

Regards,

Jim Kelly
8992.4. "" by SMURF::DENHAM (Digital UNIX Kernel) Fri Feb 28 1997 14:13  7 lines
    There are also a couple of other possibilities. The memory the core
    file recorded may well be corrupt. In other words, if the stack is
    corrupted or the code took a wild jump somewhere, you can see
    similar unhelpful tracebacks. 
    
    Also, before V4.0, having more than 15 or so threads in a core
    dump would corrupt the core file.
8992.5. "" by DCEIDL::BUTENHOF (Dave Butenhof, DECthreads) Mon Mar 03 1997 07:33  18 lines
.2: The key is whether the debugger image has the setuid bit set -- if it 
.2: does, then the loader will ignore LD_LIBRARY_PATH, and it probably skips 
.2: _RLD_LIST, too, for the same reason.

This is only a problem when you're trying to affect the shared libraries USED
BY THE DEBUGGER. It's not a problem when you only want to affect the shared
libraries used by the program that you're debugging. (Unless, of course, it,
too, has setuid [or setgid] set.)

DECthreads runs into the problem Webb mentions because, with Digital UNIX
4.0, ladebug links against a library to facilitate debugging threaded
programs, and that library must match exactly with the version of DECthreads
used by the program you're debugging. Since we tend to debug against versions
of the thread library more recent than that installed on the system, we also
need ladebug to use the matching libpthreaddebug.so image -- which means
shutting off ladebug's setuid.

	/dave
8992.6. "help!" by EDSCLU::WANG () Fri Mar 28 1997 09:13  16 lines
    Hi,
    
            We received a couple more core dumps from the same customer. We
    also asked them to get the dbx output of "where,tlist,tstack" on their
    system (V3.2C), where the core was generated. We got the same results as
    in the base note .0. We really cannot determine what caused the core from
    the dbx information. They checked that the disk was big enough for the
    core and that the user was root.
    
            Why did all the core files get corrupted? Is there anything
    that the customer should be aware of, so that the next time the process 
    dies it will generate a useful core dump?
    
    Thanks,                                 
    Danqing
    
8992.7. "Stack overwritten?" by QUARRY::petert (rigidly defined areas of doubt and uncertainty) Fri Mar 28 1997 09:40  13 lines
Hmmm, are you saying the customer can't do the tracebacks either?
If so, it seems likely that the core file is getting corrupted, but
I'm not sure why.  One possibility is that the program is crashing 
because it has a memory bug someplace (a stray pointer or buffer
overrun, say) and is overwriting its own stack.  In this case, all
the information that a debugger looks for to determine the cause of
the crash has been overwritten with data which is meaningless to the
debugger.  This may well be the cause of the error messages you see
as the debugger starts up.  In this case, you might have more luck
analyzing the program with some of the atom tools.  3rd (or is it
third?) comes to mind, but I don't know as much about it as some
others might.

PeterT
8992.8. "" by EDSCLU::WANG () Fri Mar 28 1997 10:48  7 lines
    Yes, the customer can't do the tracebacks either. I've never run the
    atom tools before; would you mind showing me how? I did "atom -tool third
    lu62_server" and it created lu62_server.third*. Where do I go from
    here? 
    
    Thanks for your help
    
8992.9. "" by SMURF::DENHAM (Digital UNIX Kernel) Fri Mar 28 1997 11:39  7 lines
    So, how many threads in the application? More than 15 or
    so? Then you need a kernel patch to get good core files.
    
    I'm looking in the support pool sources and I'm not seeing
    the patch to kern_sig.o to fix this. I gave the support
    team a fix for this problem months ago, but it never seems
    to have made it into the pools. Again! Argh.
8992.10. "" by EDSCLU::WANG () Fri Mar 28 1997 11:55  7 lines
    Yes, we have more than 15 threads. 
    
    So if the customer installs this new kernel patch, they may get good
    core files? Would you please let us know where we can get this new
    patch? 
    
    Thanks a lot.
8992.11. "Looking for more info" by EDSCLU::GARROD (IBM Interconnect Engineering) Fri Mar 28 1997 14:51  12 lines
    Re .9
    
    Could you please give more information on this fix? I.e., things like
    which versions of UNIX it applies to, what the patch kit ident is, and
    whether it is incorporated after a certain version of UNIX.
    
    Also, could you give more info on how we can identify whether we're
    being hit by the lack of this patch when we receive core files?
    
    Thanks,
    
    
8992.12. "" by QUARRY::petert (rigidly defined areas of doubt and uncertainty) Fri Mar 28 1997 15:00  11 lines
My knowledge of third is limited.  Most of what I know comes from reading
the third(5) man page.  After producing the xxx.third file, set LD_LIBRARY_PATH
to point to the current directory (third should have produced a libc.so.third
file too, unless your application is unshared, which would seem unusual for
a threaded application) and then run the program.  It will produce an
xxx.3log file, and you can peruse that for information on where you might have
memory problems, uninitialized data, etc.

But if it's the problem with the # of threads, this info may be moot.

PeterT
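
The steps in .12 could be strung together roughly as follows.  This is only a
sketch built from what .8 and .12 say: the function name run_under_third is
made up, the program is assumed to sit in the current directory, and running
the xxx.third image directly is one reading of "run the program".  Check
third(5) before relying on any of it.

```shell
# Instrument a program with third, run it, and report where the log landed.
# Usage: run_under_third <program> [program arguments...]
run_under_third() {
    prog=$1
    shift
    # Same invocation as in .8; expected to produce $prog.third (and, for a
    # shared application, libc.so.third in the current directory).
    atom -tool third "$prog" || return 1
    # Let the loader find the instrumented libraries in the current directory.
    LD_LIBRARY_PATH=.${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
    # Run the instrumented image; per .12 the report goes to $prog.3log.
    # Its exit status is ignored, since the program may well crash.
    "./$prog.third" "$@"
    if [ -f "$prog.3log" ]; then
        echo "third log: $prog.3log"
    else
        echo "no $prog.3log found" >&2
        return 1
    fi
}

# Example:
#   run_under_third lu62_server
```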
8992.13. "" by SMURF::DENHAM (Digital UNIX Kernel) Fri Mar 28 1997 17:15  18 lines
    Re. .11.
    
    The problem applies to all releases before V4.0. There is *NO* patch
    for the problem. A test patch was generated for a customer in Hong
    Kong, and I foolishly assumed that this would get turned into
    a V3.2-based patch. Bzzzzt. Wrong. I'm investigating what happened
    there. In the meantime, send mail to [email protected] and
    ask for the test patch. Tell him I sent you. ;^)
    
    Identifying the problem isn't too hard, if you know the application
    well enough to know that it generally uses more than 15-16 threads
    at some point. Basically, the core file is corrupt. It will give
    some tracebacks but the stacks may be pretty bizarre, and it
    won't show all the stacks by a long shot.
    
    If we had the darn patch, this would be academic. You'd apply it and
    then get on with life or eliminate the problem as irrelevant.
    But I'm stating the obvious (out of frustration).