
Conference turris::digital_unix

Title:DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

9523.0. "Corrupted core dump" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Wed Apr 16 1997 15:00

  Hi all,

  One of our customers is complaining about the following problem:

  One of their production processes (rih) core-dumps at least daily.

  They would like to analyze the core dump and find out what's wreaking havoc, 
but what they get is a corrupted core dump, as seen in the following:

=========================================================

/utenti/eostore:83> ls -l
total 4656
 -rwxr-xr-x   1 eostore  mdi       491520 Apr 16 15:51 core
 -rwxr-xr-x   1 eostore  mdi      4276224 Apr 16 15:51 rih
/utenti/eostore:84> file *
core:   core dump, core file is incomplete
rih:    COFF format alpha demand paged executable or object module not 
stripped - version 3.11-8
/utenti/eostore:85> dbx rih core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "rih"


warning: cannot get register (number = 64)

signal Segmentation fault at
warning: cannot get register (number = 64)


warning: cannot get register (number = 64)


warning: PC value 0x0 not valid, trying RA

warning: cannot get register (number = 26)


warning: RA value 0x0 not valid, trying text start

warning: text start 0x120000000 not valid, trying data start

warning: Using data start as a text address -- traceback will not work
>
warning: cannot get register (number = 64)

 [., 0x140000000]       call_pal        cflush
(dbx) where

warning: cannot get register (number = 64)


warning: cannot get register (number = 26)


warning: cannot get register (number = 30)

   0 (noname)() [0x120000000]
(dbx) quit
/utenti/eostore:86>

=========================================================

  How can a core dump become corrupted ? Isn't it a "copy" of the process
runtime environment at the time of the crash ? Is it possible for the runtime
environment of the process to be so garbled ?

  We know from the manual pages that there's a limit on the core size:


>  The maximum size of a core file is limited. Files that would be larger than
>  the limit are not created.

  Well, all the system parameters we know of seem OK.
  Anyway, on the very same system, we get "ordinary" and much bigger cores.
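
  (For what it's worth, the per-process limit the man page refers to can be
checked with a little program along these lines. This is only a sketch, and it
assumes the standard getrlimit(2)/RLIMIT_CORE interface; any kernel-wide
parameters are another matter.)

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_CORE is the per-process cap on core file size (presumably
     * the limit the man page excerpt above refers to). */
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return 1;
    }
    printf("core file limit: soft=%lu hard=%lu bytes\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}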

  Is there anything we can check (application code, system parameters,
operating environment) in order to make the core behave like a core?
                                           
  If needed, I can retrieve a copy of both "core" and "rih" and put'em on 
the net.

  Thanks in advance,

               Ezio (NSIS-CIS Italy)
                                           
9523.1. "some possibilities to consider..." by SMURF::PETERT (rigidly defined areas of doubt and uncertainty) Wed Apr 16 1997 22:52
    By any chance is this a threaded program?  With more than 15 threads?
    It looks from the rev numbers on the executable and dbx that you are
    running a 3.2 version of Digital Unix.  There was a recent case 
    on a similar OS where just this thing happened.
    
    The other possibility is that when the program crashes, it is doing so
    by overwriting its stack.  With an invalid stack, as it seems 
    dbx is indicating, you'd have to catch the program before it crashes,
    possibly by running it under the debugger.  But the debugger might
    not catch it until the stack was already corrupted.
    Overwriting of this kind can often be traced to writing beyond the 
    bounds of an array or allocated memory, or writing into memory
    that was already freed.  Analyzing the program with the Atom tool
    "third" (or is it "3rd"?) might help.  I know how to set that up, but
    I'm not all that good at interpreting the results.
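
    Just to illustrate the kind of overwrite I mean, here's a contrived
    fragment (nothing to do with rih, purely an example): an unchecked
    copy into a stack buffer tramples the caller's saved return address,
    and after that no debugger can reconstruct the call chain.

    #include <string.h>

    /* Contrived stack corruption: the unchecked strcpy() overruns buf
     * and tramples the saved registers/return address on the stack,
     * so the eventual crash leaves dbx nothing sensible to unwind. */
    static void parse(const char *input)
    {
        char buf[16];

        strcpy(buf, input);            /* no bounds check */
    }

    int main(void)
    {
        char big[256];

        memset(big, 'A', sizeof(big) - 1);
        big[sizeof(big) - 1] = '\0';
        parse(big);                    /* stack is garbage from here on */
        return 0;
    }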
    
    PeterT
     
9523.2. "More on core" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Thu Apr 17 1997 10:28
  Hi everybody,

  Let me explain first and ask questions next ... sorry, the explanation may be 
lengthy.

  Well, using hints I found elsewhere in this conference (i.e. looking at 
sys/core.h) I was able to write a "quick & dirty" program to analyze the core 
file (a sketch of it follows the structure description below).

  Using this to scan both a good core and the corrupted one, with some 
guesswork, trial and error and a flavour of good luck I was able to determine 
the structure of the core file.

  1) One Header record containing
       magic #
       dumped process name
       signal that terminated the process
       number of sections = nscns
       etc

  2) nscns Section Header records containing:
      section type (registers, stack, text etc.)
      section size
      section offset in core file
      etc.

   3) nscns Sections containing
      The dead process memory data, only  readable using a REAL debugger
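
  In case it's of use to anybody else, the reader boils down to something like
the following. Mind that the struct layouts are my own reconstruction (guesswork
from sys/core.h plus comparing a good core with the bad one), NOT the official
definitions, so field sizes and padding may well be off:

#include <stdio.h>

/* Approximate core file header -- modelled on the fields shown in the
 * gdb dumps below, not copied from <sys/core.h>. */
struct core_hdr {
    char           magic[4];      /* "Core"                             */
    unsigned short version;
    unsigned short nscns;         /* number of sections in the file     */
    long           tid;
    int            nthreads;
    int            signo;         /* signal that killed the process     */
    char           name[16];      /* name of the dumped process         */
};

/* Approximate section header -- one per dumped section. */
struct core_scnhdr {
    int            scntype;       /* SCNREGS, stack, SCNRGN, ...        */
    int            pad;
    long           tid_or_prot;   /* really a union in the real header  */
    unsigned long  vaddr;         /* virtual address of the section     */
    unsigned long  size;          /* bytes of process data dumped       */
    unsigned long  scnptr;        /* offset of that data in the core    */
};

int main(int argc, char **argv)
{
    FILE *fp;
    struct core_hdr h;
    struct core_scnhdr sh;
    int i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s corefile\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[1], "r")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    if (fread(&h, sizeof(h), 1, fp) != 1) {
        fprintf(stderr, "short read on core header\n");
        return 1;
    }
    printf("magic=%.4s version=%u nscns=%u signo=%d name=%.16s\n",
           h.magic, h.version, h.nscns, h.signo, h.name);

    /* The section headers follow the file header; an all-zero entry
     * marks the start of the padding before the first section's data. */
    for (i = 1; fread(&sh, sizeof(sh), 1, fp) == 1; i++) {
        if (sh.scntype == 0 && sh.size == 0 && sh.scnptr == 0)
            break;
        printf("section %d: type=%d vaddr=0x%lx size=%lu scnptr=%lu\n",
               i, sh.scntype, sh.vaddr, sh.size, sh.scnptr);
    }
    fclose(fp);
    return 0;
}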

  Well, what I found out is that the corrupted file contained the following
Header Record:

(gdb) p core_H
$1 = {magic = "Core", version = 1, nscns = 0, tid = -4397145619360,
  nthreads = 1, signo = 11,
  name = "rih\000\226\001\000\000\b\000\000\000p:\001\000"}

then a number of Section Header records:

(gdb) p core_SH1
$2 = {scntype = 5, c_u = {tid = -4397145619360, prot = 900891744},
  vaddr = 0x0, size = 520, scnptr = 8192}
(gdb) p core_SH2
$3 = {scntype = 4, c_u = {tid = -4398046511101, prot = 3},
  vaddr = 0x11ffe0000, size = 131072, scnptr = 16384}
(gdb) p core_SH3
$4 = {scntype = 3, c_u = {tid = -4398046511101, prot = 3},
  vaddr = 0x140000000, size = 221184, scnptr = 147456}
(gdb) p core_SH4
$5 = {scntype = 3, c_u = {tid = -4398046511097, prot = 7},
  vaddr = 0x140036000, size = 122880, scnptr = 368640}
(gdb) p core_SH5
$6 = {scntype = 0, c_u = {tid = 0, prot = 0}, vaddr = 0x0, size = 0,
  scnptr = 0}

  This told me a number of things:

     1) Section 1 is of SCNREGS type (registers?)
        Section 2 is stack
        Sections 3 and 4 are SCNRGN ... what's that? Never mind

     2) Adding up scnptr  and size gives scnptr  of the next section ... makes 
        sense

     3) scnptr of sec. 4 + size of sec. 4
        = 368640 + 122880 = 491520
        = core file size, right to the byte
        this means my core file contains 4 sections!

     4) Section headers from 5 up contain zeroes ... I guess I've finished
        reading the section headers and am now reading the padding of the
        first 8192 bytes: section 1's content starts at offset 8192, as
        stated in core_SH1.scnptr
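
  (Points 2 and 3 are easy to check mechanically. Here is a tiny sketch fed
with the numbers gdb printed above; since the data of the first (register)
section apparently gets padded out to the next page boundary, I only insist
that sections don't overlap and that the last one ends exactly at the file
size.)

#include <stdio.h>

struct scn { unsigned long size, scnptr; };

/* Sanity check: sections must not overlap, and the last one must end
 * exactly at the size of the core file. */
static int layout_ok(const struct scn *s, int n, unsigned long filesize)
{
    int i;

    for (i = 0; i + 1 < n; i++)
        if (s[i].scnptr + s[i].size > s[i + 1].scnptr)
            return 0;                          /* overlap */
    return n > 0 && s[n - 1].scnptr + s[n - 1].size == filesize;
}

int main(void)
{
    /* The four sections of the corrupted core, as printed by gdb. */
    struct scn s[4] = {
        { 520,    8192   },
        { 131072, 16384  },
        { 221184, 147456 },
        { 122880, 368640 },
    };

    printf("layout %s\n", layout_ok(s, 4, 491520UL) ? "OK" : "BAD");
    return 0;
}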

   Now I'm quite confident of having cracked a part of the core structure; only
one thing doesn't make sense: according to the assumptions above, 
core_H.nscns should read 4 instead of 0.

  This guess is confirmed by the fact that, reading a correct core, I found
10 Section Headers and core_H.nscns was actually 10 ... was it by chance? ;-)

  The next step is to use a binary editor to patch byte 7 (core_H.nscns) of 
the core file: I write 4 into it.
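
  (The same patch can be applied programmatically; again only a sketch, and it
relies on the approximate core_hdr layout shown earlier, so double-check the
offset against your own sys/core.h before poking a real core file.)

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>

/* Same approximate header layout as in the reader sketch above. */
struct core_hdr {
    char           magic[4];
    unsigned short version;
    unsigned short nscns;
    long           tid;
    int            nthreads;
    int            signo;
    char           name[16];
};

int main(int argc, char **argv)
{
    FILE *fp;
    unsigned short nscns;

    if (argc != 3) {
        fprintf(stderr, "usage: %s corefile nscns\n", argv[0]);
        return 1;
    }
    nscns = (unsigned short)atoi(argv[2]);     /* 4, in my case */

    if ((fp = fopen(argv[1], "r+")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    /* Poke the recovered section count into the header, in place. */
    if (fseek(fp, (long)offsetof(struct core_hdr, nscns), SEEK_SET) != 0
        || fwrite(&nscns, sizeof(nscns), 1, fp) != 1) {
        perror("patching nscns");
        return 1;
    }
    fclose(fp);
    return 0;
}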

  Then I issue the very same commands reported in yesterday's note ... and 
.... YUPPIIEEEEE !!!!!!

========================================================================
/utenti/eostore:28> ls -l
total 4656
 -rwxr-xr-x   1 eostore  mdi       491520 Apr 17 11:54 core
 -rwxr-xr-x   1 eostore  mdi      4276224 Apr 16 15:51 rih
/utenti/eostore:29> file *
core:   core dump, generated from 'rih'
rih:    COFF format alpha demand paged executable or object module not 
stripped - version 3.11-8
/utenti/eostore:30> dbx rih core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "rih"

signal Segmentation fault at   [check_micro_cell:182 +0x8,0x12010cba0] 
  Source not available
(dbx) where
>  0 check_micro_cell(prg = 0x11ffff7d8 = "rih", xref_pool = 0x11ffe6ae8, 
org_cgi = 0x11ffe4c88 = "2221084D2986D", dialed_digits = 0x1400351f0 = "", 
time_stamp = 861138604, mcvectab = 0x14df0b988, mcind = 0x11ffe4ea0, 
mcdeftab = 0x11ffe4fe0, error_list = 0x140016dc8, error_pageno = 
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: 
RIH Error List") ["checkmc.c":182, 0x12010cba0]
   1 process_utx_moc(prg = 0x11ffff7d8 = "rih", rih_workarea = 0x11ffe5c18, 
rtx_gsm_base_memory = 0x140038648, rtx_supl_serv_memory = 0x1400446d0, 
mpufctab = 0x11ffe5a70, log_header = 0x11ffe5790, error_found = 0x11ffe5760, 
filtered_out = 0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id = 
0x11ffe5740, rejected = 0x11ffff298, utxhdr = 0x11ffe5440, utx_gsm_base = 
0x11ffe5378, utxptr = 0x14df0614a, error_list = 0x140016dc8, error_pageno = 
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: 
RIH Error List") ["utxmoc.c":1862, 0x1200f5b28]
   2 process_utx_record(prg = 0x11ffff7d8 = "rih", rih_workarea = 
0x11ffe5c18, rtx_gsm_base_memory = 0x140038648, rtx_gsm_base_memory_mpp = 
0x1400444e8, rtx_supl_serv_memory = 0x1400446d0, mpufctab = 0x11ffe5a70, 
log_header = 0x11ffe5790, error_found = 0x11ffe5760, filtered_out = 
0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id = 0x11ffe5740, 
error_list = 0x140016dc8, error_pageno = 0x11ffe65e0, error_lineno = 
0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: RIH Error List") 
["prcutx.c":284, 0x1200eb714]
More (n if no)?n
(dbx) q
/utenti/eostore:31>
========================================================================

  Well well, the stack seems to be consistent and the culprit is found: line 
182 of checkmc.c (function check_micro_cell) will be checked by the programmer 
to see what happens there; everybody's happy ... you see, "rih" is a process 
that allows our customer to bill around ONE MILLION of their customers, 
therefore this problem had some degree of importance 8*)

  End of the explanation; now for the questions.

  My understanding is that the key to the corruption is the header record 
stating there are ZERO sections in the core where actually there were four ...
can anybody explain this?

  Then, why only four sections? Other tests I did showed a core having ten.

  Let me close by offering a hypothesis.

  The customer tells me that at crash time the process has > 1GB of 
allocated memory; he hasn't checked the available disk space yet.
    
  I'm thinking of this possible scenario:

     1) Process does a quite common SEGV (i.e. sprintf with some null 
        arguments)

     2) Operating system starts the core generation (1GB+ of memory to 
        flush)

     3) Op. Sys. finds out disk space is not enough

     4) Op. Sys. writes down the sections he can (this is OK)

     5) Op. Sys. "forgets" to write the proper value of core_H.nscns (this 
        is not OK)

  Can this be a possible explanation? If so, I believe behaviour 5 should 
be fixed.

  Thanks a lot for your attention,


               Ezio


P.S.

  Wrestling with this problem has been interesting and, I would say, somewhat 
fun ... me being only slightly acquainted with unix "weirdnesses".

  BUT: does anybody who has a similar problem with cores go through the same
cat-and-mouse game? Is there any documentation?
  Sorting through this notesfile, the only useful pointer I got was core.h; 
that's really rather poor.
9523.3. "Nice work..." by QUARRY::petert (rigidly defined areas of doubt and uncertainty) Thu Apr 17 1997 12:46
I did a bit of testing and your explanation sounds a bit plausible.
I'm not familiar with the program that dumps the core in the first
place, so I'm really just guessing.  I was a bit concerned that, even
after patching the core file, you might be misled since you were
missing data.  What if the stack had more than one section, for
instance?  A brief test shows that this is not likely to be the 
case, but then I only tested one core file.  The core file is still
incomplete, of course, and many things cannot be accessed by a debugger,
but it looks like useful information.  I don't know the kernel group's stand
on this, but it might make a useful extension to a debugger...

PeterT
9523.4. by QUARRY::neth (Craig Neth) Thu Apr 17 1997 14:14
>     3) Op. Sys. finds out disk space is not enough
>
>     4) Op. Sys. writes down the sections he can (this is OK)

Not quite. The OS doesn't check the disk space before it starts writing, it 
just starts writing.   At some point during that process, it discovers
there isn't enough room and aborts.  

>     5) Op. Sys. "forgets" to write the proper value of core_H.nscns (this 
>        is not OK)

It doesn't 'forget' - the decision to not update the filehdr
is quite deliberate, and likely made to protect coredump analysis tools
that expect the corefile to be complete.   

In your case, enough was written that you could get something useful
out of the partial dump, but I would be surprised if that were always
the case.   The coredumper really has no way of knowing if it's managed
to dump out what you need...

I agree with Peter that it might be useful to have a tool (or debugger
extension) that tries to 'salvage' core files, especially ones that appear
to have a 'valid' stack section.    Sounds like a fun 'midnight' project
for someone...
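
For what it's worth, the heart of such a tool might not be much more than
this (a rough sketch only -- it reuses the approximate layouts from .2, and
the real definitions live in <sys/core.h>): count the section headers whose
data actually made it into the file, and that is the value to patch back
into the file header's nscns field.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Approximate layouts, borrowed from the sketch in .2. */
struct core_hdr {
    char           magic[4];
    unsigned short version;
    unsigned short nscns;
    long           tid;
    int            nthreads;
    int            signo;
    char           name[16];
};

struct core_scnhdr {
    int            scntype;
    int            pad;
    long           tid_or_prot;
    unsigned long  vaddr;
    unsigned long  size;
    unsigned long  scnptr;
};

int main(int argc, char **argv)
{
    FILE *fp;
    struct stat st;
    struct core_scnhdr sh;
    int n = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s corefile\n", argv[0]);
        return 1;
    }
    if (fstat(fileno(fp), &st) != 0) {
        perror("fstat");
        return 1;
    }
    /* Skip the file header, then count the section headers that look
     * sane and whose data lies entirely within the (truncated) file.
     * That count is the candidate value for nscns. */
    fseek(fp, (long)sizeof(struct core_hdr), SEEK_SET);
    while (fread(&sh, sizeof(sh), 1, fp) == 1) {
        if (sh.scntype == 0 && sh.size == 0 && sh.scnptr == 0)
            break;                     /* end of the header table */
        if (sh.scnptr + sh.size > (unsigned long)st.st_size)
            break;                     /* this data never made it to disk */
        n++;
    }
    printf("%d usable section(s)\n", n);
    fclose(fp);
    return 0;
}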

>Is there any documentation?
>  Sorting through this notesfile, the only useful pointer I got was core.h; 
>that's really rather poor.

I read the source to the 'core()' kernel function.    I suppose you could
too if you got a source license and the kit.  I can't comment on whether or
not there is better documentation somewhere or whether or not there are
plans to change what is available.
9523.5. "Last thoughts" by IMPERO::OSTORERO (Per fe ven-i 'd SW a-ij va tanta drugia) Fri Apr 18 1997 09:25
  Thanks Peter and Craig,

Re: .3

>I was a bit concerned that, even after patching the core file, you might
>be misled since you were missing data.  What if the stack had more than
>one section, for instance?

  You're right; as I said in my reply, we had "a flavour of good luck": the
information in the core was at least consistent, if not complete, therefore
it was enough for the programmer to find the bug ... I believe this is the very
reason why a core is dumped at all.

Re: .4

>Not quite. The OS doesn't check the disk space before it starts writing, it 
>just starts writing.   At some point during that process, it discovers
>there isn't enough room and aborts.  

  Well, I'd say the OS is doing quite a clever job, i.e. it writes down only the
sections it can fit on the disk and not partial sections; on the other hand,
somebody told me that if the core size is limited by system parameters and the
size exceeds the parameter (what's this bloody parameter name ???) then the core
is not written at all ... even this makes sense, you play by the rules.

>It doesn't 'forget' - the decision to not update the filehdr
>is quite deliberate, and likely made to protect coredump analysis tools
>that expect the corefile to be complete.   

  Let me disagree on this: either the OS writes something useful, or it doesn't
write anything at all and sends a loud warning: "HEY, I couldn't write this
damned thing because the disk was choking!".
  An even better possibility would be for the OS to write down a readable,
partial core with a warning in it: "WATCH OUT: you get only this and not that;
debugging a partial core may be harmful to your health".

  I think it's unfair for the programmer/application administrator to see
information he so badly needs (the core file) just to be disappointed later on
when he finds out this information is not accessible.

  Yes, I believe all this is an academic discussion; I just wanted to be
positively critical.

  Thanks again,

			Ezio
9523.6. by SMURF::DENHAM (Digital UNIX Kernel) Sat Apr 19 1997 23:23
    Did anybody see any mention at all of the OS version involved
    here? A fair amount of work went into cleaning up the
    core dumping of *large* address spaces in V4.0. There were
    a number of 32-bit limitations in the core routine prior
    to V4. Smells like we might have hit that one, which
    means those changes would need to be backported, if that's
    the case....
9523.7. "See 9523.1" by MLNSI3::OSTORERO Mon Apr 21 1997 03:59
    Yes,
    
    It happened on Digital UNIX V3.2G.
    
    As .1 suggested, it might be a known problem.
    
    .1>It looks from the rev numbers on the executable and dbx that you are
    .1>running a 3.2 version of Digital Unix.  There was a recent case 
    .1>on a similar OS where just this thing happened.
    
    So, you say it's fixed in 4.0 ? Just great!
    
    			Ezio
9523.8. by SMURF::DENHAM (Digital UNIX Kernel) Mon Apr 21 1997 08:09
    Yes, I know all about what .1 is saying. My point in .6 is
    that there are *two* potential forms of core file corruption
    in V3.2x -- "too many" threads and too much data, usually in
    the form of large shared memory segments. This case sounds like
    it could be either one. And yes both are fixed in v4.0. Only
    the first case, I believe, is currently being patched for v3.2.