Conference turris::digital_unix

Title:	DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:	Welcome to the Digital UNIX Conference
Moderator:	SMURF::DENHAM

Created:	Thu Mar 16 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	10068
Total number of notes:	35879

8992.0. "traceback will not work" by EDSCLU::KELLY () Thu Feb 27 1997 18:24

We receive a lot of core files from customers that we cannot read in dbx or
DECladebug.  The output from one of these dbx sessions is as follows:

# dbx $lu62_bin/lu62_server /kkk/lu62v3/test/citilu62_core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "lu62_server"

thread 0xfffffc000742cdc0 signal Segmentation fault at
warning: PC value 0xfffffc0011cc43c0 not valid, trying RA

warning: RA value 0xfffffc0011cc4140 not valid, trying text start

warning: text start 0x120000000 not valid, trying data start

warning: Using data start as a text address -- traceback will not work
> [., 0x140000000]      call_pal        cflush


QUESTIONS:

1) If the .so files on the customer system and the .so files on the 
   system we are using to debug the core file do not match, could this 
   cause the above problem?

2) If this is a possibility, is there a way to change the .so files that
   dbx uses if we have a copy of the customer's .so files?

3) Will the core file become corrupt if there is not enough space on the
   file system that the core is written to?





Thanks,

Jim Kelly

T.R	Title	User	Personal Name	Date	Lines
8992.1	you're on the right track...	SMURF::PETERT	rigidly defined areas of doubt and uncertainty	`Fri Feb 28 1997 08:41`	20
	Your assumptions of 1) different .so files and 3) not enough file space are both valid, though for 3) you should get some sort of warning message from the system that the device is full.
	As for 2), it seems to me I've gone through this before and never come up with a decent answer. Certainly one way would be to replace the .so files on your system with theirs, but that is not what you are looking for. Checking the loader(5) man page, it appears the best bet would be to set the environment variable _RLD_LIST to an explicit list of libraries, which will get loaded in that order. There is also a mention of LD_LIBRARY_PATH, which might work, but it seems to me that this is the one that didn't work the last time I looked at this. You should peruse the loader man page and see if it helps out.
	But if the core is corrupt, nothing is going to help. If they can get a traceback on their system before you take a look at it, then the library mismatch is likely the problem.
	PeterT
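A minimal sketch of the environment setup suggested in .1, assuming the customer's .so files have been copied into one directory. The helper name `build_lib_list` and the directory `./customer_libs` are hypothetical, and the colon separator for _RLD_LIST is an assumption that should be checked against loader(5) before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: join every .so file in a directory into a
# colon-separated list. The separator is an assumption; see loader(5).
build_lib_list() {
    dir=$1
    list=""
    for lib in "$dir"/*.so; do
        [ -f "$lib" ] || continue        # skip the unexpanded pattern when no .so exists
        list="${list:+$list:}$lib"       # add a colon only between entries
    done
    echo "$list"
}

# Usage sketch (./customer_libs is a placeholder, not from the thread):
#   _RLD_LIST=`build_lib_list ./customer_libs`; export _RLD_LIST
#   LD_LIBRARY_PATH=./customer_libs; export LD_LIBRARY_PATH
#   dbx $lu62_bin/lu62_server /kkk/lu62v3/test/citilu62_core
```

Whether dbx actually honors either variable when it maps the libraries for a core file is exactly what .1 is unsure about, so treat this as something to try rather than a known fix.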
8992.2	Turn off the setuid bit	WTFN::SCALES	Despair is appropriate and inevitable.	`Fri Feb 28 1997 09:28`	14
	.1> Checking the man page for the loader(5), it appears the best bet would be
	.1> to set the environment variable _RLD_LIST with an explicit list of
	.1> libraries, which will get loaded in that order. There is also a mention
	.1> of LD_LIBRARY_PATH which might work
	The key is whether the debugger image has the setuid bit set -- if it does, then the loader will ignore LD_LIBRARY_PATH, and it probably skips _RLD_LIST, too, for the same reason. (We use LD_LIBRARY_PATH, and we hit this problem with Ladebug.) However, if you turn off the setuid bit, things should work well (provided you remember to set up LD_LIBRARY_PATH ;-). By turning off the setuid bit, you lose only the ability to do remote debugging, I think.
	Webb
8992.3	Thanks,	EDSCLU::KELLY		`Fri Feb 28 1997 12:44`	5
	Thanks for your help and input. This is what we were looking for. Regards, Jim Kelly
8992.4		SMURF::DENHAM	Digital UNIX Kernel	`Fri Feb 28 1997 14:13`	7
	There are also a couple of other possibilities. The memory the core file recorded may well be corrupt. In other words, if the stack is corrupted or the code took a wild jump somewhere, you can see similarly unhelpful tracebacks. Also, before V4.0, having more than 15 or so threads in a core dump would corrupt the core file.
8992.5		DCEIDL::BUTENHOF	Dave Butenhof, DECthreads	`Mon Mar 03 1997 07:33`	18
	.2: The key is whether the debugger image has the setuid bit set -- if it
	.2: does, then the loader will ignore LD_LIBRARY_PATH, and it probably skips
	.2: _RLD_LIST, too, for the same reason.
	This is only a problem when you're trying to affect the shared libraries USED BY THE DEBUGGER. It's not a problem when you only want to affect the shared libraries used by the program that you're debugging. (Unless, of course, it, too, has setuid [or setgid] set.)
	DECthreads runs into the problem Webb mentions because, with Digital UNIX 4.0, ladebug links against a library to facilitate debugging threaded programs, and that library must match exactly the version of DECthreads used by the program you're debugging. Since we tend to debug against versions of the thread library more recent than the one installed on the system, we also need ladebug to use the matching libpthreaddebug.so image -- which means shutting off ladebug's setuid.
	/dave
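The setuid check described in .2 and .5 can be sketched with the standard `test -u` and `test -g` file tests. The function name `has_setid_bit` is hypothetical, and the commented commands assume the debugger is on your PATH; nothing here is taken from the ladebug or dbx documentation.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file has its setuid or setgid bit set.
# Per .2, the loader ignores LD_LIBRARY_PATH for images with the setuid bit.
has_setid_bit() {
    if [ -u "$1" ] || [ -g "$1" ]; then
        echo "setid"
    else
        echo "plain"
    fi
}

# Usage sketch (run as root to change the mode; keep a note of the original mode):
#   dbg=`command -v ladebug`
#   has_setid_bit "$dbg"
#   chmod u-s "$dbg"        # per .2, this probably costs only remote debugging
```

As .5 points out, this matters only for libraries loaded by the debugger itself, or when the program being debugged is also setuid or setgid.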
8992.6	help!	EDSCLU::WANG		`Fri Mar 28 1997 09:13`	16
	Hi, We received a couple more core dumps from the same customer. We also asked them to get the dbx output of "where", "tlist", and "tstack" on their system (V3.2C), where the core was generated. We got the same results as in the base note .0. We really cannot determine what caused the core from the dbx information. They checked that the disk was big enough for the core and that the user was root. Why did all the core files get corrupted? Is there anything the customer should be aware of, so that the next time the process dies it will generate a useful core dump? Thanks, Danqing
8992.7	Stack overwritten?	QUARRY::petert	rigidly defined areas of doubt and uncertainty	`Fri Mar 28 1997 09:40`	13
	Hmmm, are you saying the customer can't do the tracebacks either? If so, it seems likely that the core file is getting corrupted, but I'm not sure why. One possibility is that the program is crashing because it has a memory bug someplace (a buffer overrun, say) and is overwriting its own stack. In this case, all the information that a debugger looks for to determine the cause of the crash has been overwritten with data that is meaningless to the debugger. This may well be the cause of the error messages you see as the debugger starts up. In this case, you might have more luck analyzing the program with some of the atom tools. 3rd (or is it third?) comes to mind, but I don't know as much about it as some others might. PeterT
8992.8		EDSCLU::WANG		`Fri Mar 28 1997 10:48`	7
	Yes, the customer can't do the tracebacks either. I've never run the atom tools before; would you mind showing me how? I ran "atom -tool third lu62_server" and it created lu62_server.third*. Where do I go from here? Thanks for your help.
8992.9		SMURF::DENHAM	Digital UNIX Kernel	`Fri Mar 28 1997 11:39`	7
	So, how many threads are in the application? More than 15 or so? Then you need a kernel patch to get good core files. I'm looking in the support pool sources and I'm not seeing the patch to kern_sig.o to fix this. I gave the support team a fix for this problem months ago, but it never seems to have made it into the pools. Again! Argh.
8992.10		EDSCLU::WANG		`Fri Mar 28 1997 11:55`	7
	Yes, we have more than 15 threads. So if the customer installs this new kernel patch, they may get good core files? Would you please let us know where we can get this new patch? Thanks a lot.
8992.11	Looking for more info	EDSCLU::GARROD	IBM Interconnect Engineering	`Fri Mar 28 1997 14:51`	12
	Re .9 Please could you give more information on this fix? I.e., things like which versions of UNIX it applies to, what the patch kit ident is, and whether it is incorporated after a certain version of UNIX. Also, could you give more info on how we can identify whether we're being hit by the lack of this patch when we receive core files? Thanks,
8992.12		QUARRY::petert	rigidly defined areas of doubt and uncertainty	`Fri Mar 28 1997 15:00`	11
	My knowledge of third is limited. Most of what I know comes from reading the third(5) man page. After producing the xxx.third file, set LD_LIBRARY_PATH to point to the current directory (third should have produced a libc.so.third file too, unless your application is unshared, which would seem unusual for a threaded application) and then run the program. It will produce an xxx.3log file, and you can peruse that for information on where you might have memory problems, uninitialized data, etc. But if it's the problem with the number of threads, this info may be moot. PeterT
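The steps from .8 and .12 can be strung together roughly as follows. This is a sketch only: the wrapper name `run_third` is hypothetical, the `atom -tool third` command line is the one quoted in .8, and the output names (`.third`, `.3log`) are taken from the notes above rather than verified against third(5).

```shell
#!/bin/sh
# Hypothetical wrapper around the Third Degree steps described in .8 and .12.
# Assumes the program is a bare file name in the current directory.
run_third() {
    prog=$1
    if ! command -v atom >/dev/null 2>&1; then
        echo "atom not found on this system" >&2
        return 1
    fi
    # Instrument the program; per .8 this creates $prog.third*
    atom -tool third "$prog" || return 1
    # Per .12, point LD_LIBRARY_PATH at the current directory so the
    # instrumented libraries are found, then run the instrumented program.
    LD_LIBRARY_PATH=. "./$prog.third"
    # Per .12, the log should be written here.
    echo "$prog.3log"
}

# Usage sketch:
#   run_third lu62_server
```

Setting LD_LIBRARY_PATH on the command line limits the change to the instrumented run, so the rest of the session keeps its normal library search path.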
8992.13		SMURF::DENHAM	Digital UNIX Kernel	`Fri Mar 28 1997 17:15`	18
	Re. .11. The problem applies to all releases before V4.0. There is NO patch for the problem. A test patch was generated for a customer in Hong Kong, and I foolishly assumed that this would get turned into a V3.2-based patch. Bzzzzt. Wrong. I'm investigating what happened there. In the meantime, send mail to [email protected] and ask for the test patch. Tell him I sent you. ;^)
	Identifying the problem isn't too hard, if you know the application well enough to know that it generally uses more than 15-16 threads at some point. Basically, the core file is corrupt. It will give some tracebacks, but the stacks may be pretty bizarre, and it won't show all the stacks by a long shot.
	If we had the darn patch, this would be academic. You'd apply it and then either get on with life or eliminate the problem as irrelevant. But I'm stating the obvious (out of frustration).