
Conference turris::digital_unix

Title:DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

9523.0. "Corrupted core dump" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Wed Apr 16 1997 15:00

  Hi all,

  One of our customers is complaining about the following problem:

  One of their production processes (rih) core-dumps at least daily.

  They would like to analyze the core dump and find out what's wreaking havoc, 
but what they get is a corrupted core dump, as seen in the following:

=========================================================

/utenti/eostore:83> ls -l
total 4656
 -rwxr-xr-x   1 eostore  mdi       491520 Apr 16 15:51 core
 -rwxr-xr-x   1 eostore  mdi      4276224 Apr 16 15:51 rih
/utenti/eostore:84> file *
core:   core dump, core file is incomplete
rih:    COFF format alpha demand paged executable or object module not 
stripped - version 3.11-8
/utenti/eostore:85> dbx rih core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "rih"


warning: cannot get register (number = 64)

signal Segmentation fault at
warning: cannot get register (number = 64)


warning: cannot get register (number = 64)


warning: PC value 0x0 not valid, trying RA

warning: cannot get register (number = 26)


warning: RA value 0x0 not valid, trying text start

warning: text start 0x120000000 not valid, trying data start

warning: Using data start as a text address -- traceback will not work
>
warning: cannot get register (number = 64)

 [., 0x140000000]       call_pal        cflush
(dbx) where

warning: cannot get register (number = 64)


warning: cannot get register (number = 26)


warning: cannot get register (number = 30)

   0 (noname)() [0x120000000]
(dbx) quit
/utenti/eostore:86>

=========================================================

  How can a core dump become corrupted ? Isn't it a "copy" of the process
runtime environment at the time of the crash ? Is it possible for the runtime
environment of the process to be so garbled ?

  We know from the manual pages that there's a limit on the core size:


>  The maximum size of a core file is limited. Files that would be larger than
>  the limit are not created.

  Well, all the system parameters we know of seem OK.
  Anyway, on the very same system, we get "ordinary" and much bigger cores.
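
  (For what it's worth, the per-process limit the man page refers to can be
checked with a little program along these lines. This is only a sketch, and it
assumes the standard getrlimit(2)/RLIMIT_CORE interface; any kernel-wide
parameters are another matter.)

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_CORE is the per-process cap on core file size (presumably
     * the limit the man page excerpt above refers to). */
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return 1;
    }
    printf("core file limit: soft=%lu hard=%lu bytes\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}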

  Is there anything we can check (application code, system parameters,
operating environment) in order to make the core behave like a core?
                                           
  If needed, I can retrieve a copy of both "core" and "rih" and put'em on 
the net.

  Thanks in advance,

               Ezio (NSIS-CIS Italy)
                                           
9523.1. "some possibilities to consider..." by SMURF::PETERT (rigidly defined areas of doubt and uncertainty) Wed Apr 16 1997 22:52
    By any chance is this a threaded program?  With more than 15 threads?
    It looks from the rev numbers on the executable and dbx that you are
    running a 3.2 version of Digital Unix.  There was a recent case 
    on a similar OS where just this thing happened.
    
    The other possibility is that when the program crashes, it is doing so
    by overwriting its stack.  With an invalid stack, as it seems 
    dbx is indicating, you'd have to catch the program before it crashes,
    possibly by running it under the debugger.  But the debugger might
    not catch it until the stack was already corrupted.
    Overwriting of this kind can often be traced to writing beyond the 
    bounds of an array or allocated memory, or writing into memory
    that was already freed.  Analyzing the program with the Atom tool
    "third" (or is it "3rd"?) might help.  I know how to set that up, but
    I'm not all that good at interpreting the results.
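
    Just to illustrate the kind of overwrite I mean, here's a contrived
    fragment (nothing to do with rih, purely an example): an unchecked
    copy into a stack buffer tramples the caller's saved return address,
    and after that no debugger can reconstruct the call chain.

    #include <string.h>

    /* Contrived stack corruption: the unchecked strcpy() overruns buf
     * and tramples the saved registers/return address on the stack,
     * so the eventual crash leaves dbx nothing sensible to unwind. */
    static void parse(const char *input)
    {
        char buf[16];

        strcpy(buf, input);            /* no bounds check */
    }

    int main(void)
    {
        char big[256];

        memset(big, 'A', sizeof(big) - 1);
        big[sizeof(big) - 1] = '\0';
        parse(big);                    /* stack is garbage from here on */
        return 0;
    }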
    
    PeterT
     
9523.2. "More on core" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Thu Apr 17 1997 10:28
  Hi everybody,

  Let me explain first and ask questions next ... sorry, the explanation may be 
lengthy.

  Well, using hints I found elsewhere in this conference (i.e. looking at 
sys/core.h) I was able to write a "quick & dirty" program to analyze the core 
file (a sketch of it follows the structure description below).

  Using this to scan both a good core and the corrupted one, with some 
guesswork, trial and error and a flavour of good luck I was able to determine 
the structure of the core file.

  1) One Header record containing
       magic #
       dumped process name
       signal that terminated the process
       number of sections = nscns
       etc

  2) nscns Section Header records containing:
      section type (registers, stack, text etc.)
      section size
      section offset in core file
      etc.

   3) nscns Sections containing
      The dead process memory data, only  readable using a REAL debugger
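
  In case it's of use to anybody else, the reader boils down to something like
the following. Mind that the struct layouts are my own reconstruction (guesswork
from sys/core.h plus comparing a good core with the bad one), NOT the official
definitions, so field sizes and padding may well be off:

#include <stdio.h>

/* Approximate core file header -- modelled on the fields shown in the
 * gdb dumps below, not copied from <sys/core.h>. */
struct core_hdr {
    char           magic[4];      /* "Core"                             */
    unsigned short version;
    unsigned short nscns;         /* number of sections in the file     */
    long           tid;
    int            nthreads;
    int            signo;         /* signal that killed the process     */
    char           name[16];      /* name of the dumped process         */
};

/* Approximate section header -- one per dumped section. */
struct core_scnhdr {
    int            scntype;       /* SCNREGS, stack, SCNRGN, ...        */
    int            pad;
    long           tid_or_prot;   /* really a union in the real header  */
    unsigned long  vaddr;         /* virtual address of the section     */
    unsigned long  size;          /* bytes of process data dumped       */
    unsigned long  scnptr;        /* offset of that data in the core    */
};

int main(int argc, char **argv)
{
    FILE *fp;
    struct core_hdr h;
    struct core_scnhdr sh;
    int i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s corefile\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[1], "r")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    if (fread(&h, sizeof(h), 1, fp) != 1) {
        fprintf(stderr, "short read on core header\n");
        return 1;
    }
    printf("magic=%.4s version=%u nscns=%u signo=%d name=%.16s\n",
           h.magic, h.version, h.nscns, h.signo, h.name);

    /* The section headers follow the file header; an all-zero entry
     * marks the start of the padding before the first section's data. */
    for (i = 1; fread(&sh, sizeof(sh), 1, fp) == 1; i++) {
        if (sh.scntype == 0 && sh.size == 0 && sh.scnptr == 0)
            break;
        printf("section %d: type=%d vaddr=0x%lx size=%lu scnptr=%lu\n",
               i, sh.scntype, sh.vaddr, sh.size, sh.scnptr);
    }
    fclose(fp);
    return 0;
}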

  Well, what I found out is that the corrupted file contained the following
Header Record:

(gdb) p core_H
$1 = {magic = "Core", version = 1, nscns = 0, tid = -4397145619360,
  nthreads = 1, signo = 11,
  name = "rih\000\226\001\000\000\b\000\000\000p:\001\000"}

then a number of Section Header records:

(gdb) p core_SH1
$2 = {scntype = 5, c_u = {tid = -4397145619360, prot = 900891744},
  vaddr = 0x0, size = 520, scnptr = 8192}
(gdb) p core_SH2
$3 = {scntype = 4, c_u = {tid = -4398046511101, prot = 3},
  vaddr = 0x11ffe0000, size = 131072, scnptr = 16384}
(gdb) p core_SH3
$4 = {scntype = 3, c_u = {tid = -4398046511101, prot = 3},
  vaddr = 0x140000000, size = 221184, scnptr = 147456}
(gdb) p core_SH4
$5 = {scntype = 3, c_u = {tid = -4398046511097, prot = 7},
  vaddr = 0x140036000, size = 122880, scnptr = 368640}
(gdb) p core_SH5
$6 = {scntype = 0, c_u = {tid = 0, prot = 0}, vaddr = 0x0, size = 0,
  scnptr = 0}

  This told me a number of things:

     1) Section 1 is of SCNREGS type (registers?)
        Section 2 is stack
        Sections 3 and 4 are SCNRGN ... what's that? Never mind

     2) Adding up scnptr  and size gives scnptr  of the next section ... makes 
        sense

     3) scnptr of sec. 4 + size of sec. 4
        = 368640 + 122880 = 491520
        = core file size, right to the byte
        this means my core file contains 4 sections!

     4) Section headers from 5 up contain zeroes ... I guess I've finished
        reading the section headers and am now reading the padding of the
        first 8192 bytes: section 1's content starts at offset 8192, as
        stated in core_SH1.scnptr
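
  (Points 2 and 3 are easy to check mechanically. Here is a tiny sketch fed
with the numbers gdb printed above; since the data of the first (register)
section apparently gets padded out to the next page boundary, I only insist
that sections don't overlap and that the last one ends exactly at the file
size.)

#include <stdio.h>

struct scn { unsigned long size, scnptr; };

/* Sanity check: sections must not overlap, and the last one must end
 * exactly at the size of the core file. */
static int layout_ok(const struct scn *s, int n, unsigned long filesize)
{
    int i;

    for (i = 0; i + 1 < n; i++)
        if (s[i].scnptr + s[i].size > s[i + 1].scnptr)
            return 0;                          /* overlap */
    return n > 0 && s[n - 1].scnptr + s[n - 1].size == filesize;
}

int main(void)
{
    /* The four sections of the corrupted core, as printed by gdb. */
    struct scn s[4] = {
        { 520,    8192   },
        { 131072, 16384  },
        { 221184, 147456 },
        { 122880, 368640 },
    };

    printf("layout %s\n", layout_ok(s, 4, 491520UL) ? "OK" : "BAD");
    return 0;
}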

   Now I'm quite confident of having cracked a part of the core structure; only
one thing doesn't make sense: according to the assumptions above, 
core_H.nscns should read 4 instead of 0.

  This guess is confirmed by the fact that, reading a correct core, I found
10 Section Headers and core_H.nscns was actually 10 ... was it by chance? ;-)

  The next step is to use a binary editor to patch byte 7 (core_H.nscns) of 
the core file: I write 4 into it.
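
  (The same patch can be applied programmatically; again only a sketch, and it
relies on the approximate core_hdr layout shown earlier, so double-check the
offset against your own sys/core.h before poking a real core file.)

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

/* Same approximate header layout as in the reader sketch above. */
struct core_hdr {
    char           magic[4];
    unsigned short version;
    unsigned short nscns;
    long           tid;
    int            nthreads;
    int            signo;
    char           name[16];
};

int main(int argc, char **argv)
{
    FILE *fp;
    unsigned short nscns;

    if (argc != 3) {
        fprintf(stderr, "usage: %s corefile nscns\n", argv[0]);
        return 1;
    }
    nscns = (unsigned short)atoi(argv[2]);     /* 4, in my case */

    if ((fp = fopen(argv[1], "r+")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* Poke the recovered section count into the header, in place. */
    if (fseek(fp, (long)offsetof(struct core_hdr, nscns), SEEK_SET) != 0
        || fwrite(&nscns, sizeof(nscns), 1, fp) != 1) {
        perror("patching nscns");
        return 1;
    }
    fclose(fp);
    return 0;
}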

  Then I issue the very same commands reported in yesterday's note ... and 
.... YUPPIIEEEEE !!!!!!

========================================================================
/utenti/eostore:28> ls -l
total 4656
 -rwxr-xr-x   1 eostore  mdi       491520 Apr 17 11:54 core
 -rwxr-xr-x   1 eostore  mdi      4276224 Apr 16 15:51 rih
/utenti/eostore:29> file *
core:   core dump, generated from 'rih'
rih:    COFF format alpha demand paged executable or object module not 
stripped - version 3.11-8
/utenti/eostore:30> dbx rih core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "rih"

signal Segmentation fault at   [check_micro_cell:182 +0x8,0x12010cba0] 
  Source not available
(dbx) where
>  0 check_micro_cell(prg = 0x11ffff7d8 = "rih", xref_pool = 0x11ffe6ae8, 
org_cgi = 0x11ffe4c88 = "2221084D2986D", dialed_digits = 0x1400351f0 = "", 
time_stamp = 861138604, mcvectab = 0x14df0b988, mcind = 0x11ffe4ea0, 
mcdeftab = 0x11ffe4fe0, error_list = 0x140016dc8, error_pageno = 
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: 
RIH Error List") ["checkmc.c":182, 0x12010cba0]
   1 process_utx_moc(prg = 0x11ffff7d8 = "rih", rih_workarea = 0x11ffe5c18, 
rtx_gsm_base_memory = 0x140038648, rtx_supl_serv_memory = 0x1400446d0, 
mpufctab = 0x11ffe5a70, log_header = 0x11ffe5790, error_found = 0x11ffe5760, 
filtered_out = 0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id = 
0x11ffe5740, rejected = 0x11ffff298, utxhdr = 0x11ffe5440, utx_gsm_base = 
0x11ffe5378, utxptr = 0x14df0614a, error_list = 0x140016dc8, error_pageno = 
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: 
RIH Error List") ["utxmoc.c":1862, 0x1200f5b28]
   2 process_utx_record(prg = 0x11ffff7d8 = "rih", rih_workarea = 
0x11ffe5c18, rtx_gsm_base_memory = 0x140038648, rtx_gsm_base_memory_mpp = 
0x1400444e8, rtx_supl_serv_memory = 0x1400446d0, mpufctab = 0x11ffe5a70, 
log_header = 0x11ffe5790, error_found = 0x11ffe5760, filtered_out = 
0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id = 0x11ffe5740, 
error_list = 0x140016dc8, error_pageno = 0x11ffe65e0, error_lineno = 
0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: RIH Error List") 
["prcutx.c":284, 0x1200eb714]
More (n if no)?n
(dbx) q
/utenti/eostore:31>
========================================================================

  Well well, the stack seems to be consistent and the culprit is found: line 
182 of checkmc.c (function check_micro_cell) will be checked by the programmer 
to see what happens there; everybody's happy ... you see, "rih" is a process 
that allows our customer to bill around ONE MILLION of their customers, 
therefore this problem had some degree of importance 8*)

  End of the explanation; now for the questions.

  My understanding is that the key to the corruption is the header record 
stating there are ZERO sections in the core where actually there were four ...
can anybody explain this?

  Then, why only four sections? Other tests I did showed a core having ten.

  Let me close by offering a hypothesis.

  The customer tells me that at crash time the process has > 1GB of 
allocated memory; he hasn't checked the available disk space yet.
    
  I'm thinking of this possible scenario:

     1) Process does a quite common SEGV (i.e. sprintf with some null 
        arguments)

     2) Operating system starts the core generation (1GB+ of memory to 
        flush)

     3) Op. Sys. finds out disk space is not enough

     4) Op. Sys. writes down the sections he can (this is OK)

     5) Op. Sys. "forgets" to write the proper value of core_H.nscns (this 
        is not OK)

  Can this be a possible explanation? If so, I believe behaviour 5 should 
be fixed.

  Thanks a lot for your attention,


               Ezio


P.S.

  Wrestling with this problem has been interesting and, I would say, somewhat 
fun ... me being only slightly acquainted with unix "weirdnesses".

  BUT: does anybody who has a similar problem with cores go through the same
cat-and-mouse game? Is there any documentation?
  Sorting through this notesfile, the only useful pointer I got was core.h; 
that's really rather poor.
9523.3. "Nice work..." by QUARRY::petert (rigidly defined areas of doubt and uncertainty) Thu Apr 17 1997 12:46
I did a bit of testing and your explanation sounds a bit plausible.
I'm not familiar with the program that dumps the core in the first
place, so I'm really just guessing.  I was a bit concerned that, even
after patching the core file, you might be misled since you were
missing data.  What if the stack had more than one section, for
instance?  A brief test shows that this is not likely to be the 
case, but then I only tested one core file.  The core file is still
incomplete, of course, and many things cannot be accessed by a debugger,
but it looks like useful information.  I don't know the kernel group's stand
on this, but it might make a useful extension to a debugger...

PeterT
9523.4. by QUARRY::neth (Craig Neth) Thu Apr 17 1997 14:14
>     3) Op. Sys. finds out disk space is not enough
>
>     4) Op. Sys. writes down the sections he can (this is OK)

Not quite. The OS doesn't check the disk space before it starts writing, it 
just starts writing.   At some point during that process, it discovers
there isn't enough room and aborts.  

>     5) Op. Sys. "forgets" to write the proper value of core_H.nscns (this 
>        is not OK)

It doesn't 'forget' - the decision to not update the filehdr
is quite deliberate, and likely made to protect coredump analysis tools
that expect the corefile to be complete.   

In your case, enough was written that you could get something useful
out of the partial dump, but I would be surprised if that were always
the case.   The coredumper really has no way of knowing if it's managed
to dump out what you need...

I agree with Peter that it might be useful to have a tool (or debugger
extension) that tries to 'salvage' core files, especially ones that appear
to have a 'valid' stack section.    Sounds like a fun 'midnight' project
for someone...
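
For what it's worth, the heart of such a tool might not be much more than
this (a rough sketch only -- it reuses the approximate layouts from .2, and
the real definitions live in <sys/core.h>): count the section headers whose
data actually made it into the file, and that is the value to patch back
into the file header's nscns field.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Approximate layouts, borrowed from the sketch in .2. */
struct core_hdr {
    char           magic[4];
    unsigned short version;
    unsigned short nscns;
    long           tid;
    int            nthreads;
    int            signo;
    char           name[16];
};

struct core_scnhdr {
    int            scntype;
    int            pad;
    long           tid_or_prot;
    unsigned long  vaddr;
    unsigned long  size;
    unsigned long  scnptr;
};

int main(int argc, char **argv)
{
    FILE *fp;
    struct stat st;
    struct core_scnhdr sh;
    int n = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s corefile\n", argv[0]);
        return 1;
    }
    if (fstat(fileno(fp), &st) != 0) {
        perror("fstat");
        return 1;
    }
    /* Skip the file header, then count the section headers that look
     * sane and whose data lies entirely within the (truncated) file.
     * That count is the candidate value for nscns. */
    fseek(fp, (long)sizeof(struct core_hdr), SEEK_SET);
    while (fread(&sh, sizeof(sh), 1, fp) == 1) {
        if (sh.scntype == 0 && sh.size == 0 && sh.scnptr == 0)
            break;                     /* end of the header table */
        if (sh.scnptr + sh.size > (unsigned long)st.st_size)
            break;                     /* this data never made it to disk */
        n++;
    }
    printf("%d usable section(s)\n", n);
    fclose(fp);
    return 0;
}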

>Is there any documentation?
>  Sorting through this notesfile, the only useful pointer I got was core.h; 
>that's really rather poor.

I read the source to the 'core()' kernel function.    I suppose you could
too if you got a source license and the kit.  I can't comment on whether or
not there is better documentation somewhere or whether or not there are
plans to change what is available.
9523.5. "Last thoughts" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Fri Apr 18 1997 09:25
  Thanks Peter and Craig,

Re: .3

>I was a bit concerned that, even after patching the core file, you might
>be misled since you were missing data.  What if the stack had more than
>one section, for instance?

  You're right; as I said in my reply, we had "a flavour of good luck": the
information in the core was at least consistent, if not complete, therefore
it was enough for the programmer to find the bug ... I believe this is the very
reason why a core is dumped at all.

Re: .4

>Not quite. The OS doesn't check the disk space before it starts writing, it 
>just starts writing.   At some point during that process, it discovers
>there isn't enough room and aborts.  

  Well, I'd say the OS is doing quite a clever job, i.e. it writes down only the
sections it can fit on the disk and not partial sections; on the other hand,
somebody told me that if the core size is limited by system parameters and the
size exceeds the parameter (what's this bloody parameter name ???) then the core
is not written at all ... even this makes sense, you play by the rules.

>It doesn't 'forget' - the decision to not update the filehdr
>is quite deliberate, and likely made to protect coredump analysis tools
>that expect the corefile to be complete.   

  Let me disagree on this: either the OS writes something useful, or it doesn't
write anything at all and sends a loud warning: "HEY, I couldn't write this
damned thing because the disk was choking!".
  An even better possibility would be for the OS to write down a readable,
partial core with a warning in it: "WATCH OUT: you get only this and not that;
debugging a partial core may be harmful to your health".

  I think it's unfair for the programmer/application administrator to see
information he so badly needs (the core file) just to be disappointed later on
when he finds out this information is not accessible.

  Yes, I believe all this is an academic discussion; I just wanted to be
positively critical.

  Thanks again,

			Ezio
9523.6. by SMURF::DENHAM (Digital UNIX Kernel) Sat Apr 19 1997 23:23
    Did anybody see any mention at all of the OS version involved
    here? A fair amount of work went into cleaning up the
    core dumping of *large* address spaces in V4.0. There were
    a number of 32-bit limitations in the core routine prior
    to V4. Smells like we might have hit that one, which
    means those changes would need to be backported, if that's
    the case....
9523.7. "See 9523.1" by MLNSI3::OSTORERO Mon Apr 21 1997 03:59
    Yes,
    
    It happened on Digital UNIX V3.2G.
    
    As .1 suggested, it might be a known problem.
    
    .1>It looks from the rev numbers on the executable and dbx that you are
    .1>running a 3.2 version of Digital Unix.  There was a recent case 
    .1>on a similar OS where just this thing happened.
    
    So, you say it's fixed in 4.0 ? Just great!
    
    			Ezio
9523.8. by SMURF::DENHAM (Digital UNIX Kernel) Mon Apr 21 1997 08:09
    Yes, I know all about what .1 is saying. My point in .6 is
    that there are *two* potential forms of core file corruption
    in V3.2x -- "too many" threads and too much data, usually in
    the form of large shared memory segments. This case sounds like
    it could be either one. And yes both are fixed in v4.0. Only
    the first case, I believe, is currently being patched for v3.2.