|
Hi everybody,
Let me explain first and ask questions next ... sorry, the explanation may be
lengthy.
Well, using hints I found elsewhere on this conference (i.e. looking at
sys/core.h) I was able to write a "quick & dirty" program to analyze the core
file.
Using this to scan both a good core and the corrupted one, with some
guesswork, trial and error and a flavour of good luck I was able to determine
the structure of the core file.
1) One Header record containing
magic #
dumped process name
signal that terminated the process
number of sections = nscns
etc
2) nscns Section Header records containing:
section type (registers, stack, text etc.)
section size
section offset in core file
etc.
3) nscns Sections containing
The dead process memory data, only readable using a REAL debugger
Well, what I found out is that the corrupted file contained the following
Header Record:
(gdb) p core_H
$1 = {magic = "Core", version = 1, nscns = 0, tid = -4397145619360,
nthreads = 1, signo = 11,
name = "rih\000\226\001\000\000\b\000\000\000p:\001\000"}
then a number of Section Header records:
(gdb) p core_SH1
$2 = {scntype = 5, c_u = {tid = -4397145619360, prot = 900891744},
vaddr = 0x0, size = 520, scnptr = 8192}
(gdb) p core_SH2
$3 = {scntype = 4, c_u = {tid = -4398046511101, prot = 3},
vaddr = 0x11ffe0000, size = 131072, scnptr = 16384}
(gdb) p core_SH3
$4 = {scntype = 3, c_u = {tid = -4398046511101, prot = 3},
vaddr = 0x140000000, size = 221184, scnptr = 147456}
(gdb) p core_SH4
$5 = {scntype = 3, c_u = {tid = -4398046511097, prot = 7},
vaddr = 0x140036000, size = 122880, scnptr = 368640}
(gdb) p core_SH5
$6 = {scntype = 0, c_u = {tid = 0, prot = 0}, vaddr = 0x0, size = 0,
scnptr = 0}
This told me a number of things:
1) Section 1 is of SCNREGS type (registers ?)
Section 2 is stack
Section 3 and 4 are SCNRGN ... what's that ? Never mind
2) Adding up scnptr and size gives scnptr of the next section ... makes
sense (for section 1 only after rounding 8192 + 520 up to the next
8192-byte boundary, i.e. 16384)
3) scnptr of sec. 4 + size of sec. 4
= 368640 + 122880 = 491520
= core file size, right to the byte
this means my core file contains 4 sections !
4) Section headers from 5 up contain zeroes ... I guess I finished reading
the section headers and I'm reading the padding of the first 8192 bytes:
section 1 content starts at offset 8192 as stated in core_SH1.scnptr
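The arithmetic in points 2) and 3) can be re-checked mechanically. What
follows is only a sketch in Python: the field names and values are copied from
the gdb output above, while the SectionHeader class, the check_layout helper
and the 8192-byte alignment are my own assumptions and are not taken from
sys/core.h.

```python
from dataclasses import dataclass

PAGE = 8192  # ASSUMED alignment; it matches the scnptr of section 1


@dataclass
class SectionHeader:
    scntype: int
    size: int
    scnptr: int


# Values copied from the gdb output above (core_SH1 .. core_SH5).
HEADERS = [
    SectionHeader(scntype=5, size=520, scnptr=8192),
    SectionHeader(scntype=4, size=131072, scnptr=16384),
    SectionHeader(scntype=3, size=221184, scnptr=147456),
    SectionHeader(scntype=3, size=122880, scnptr=368640),
    SectionHeader(scntype=0, size=0, scnptr=0),
]


def round_up(n, align=PAGE):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def check_layout(headers, file_size):
    """Return the number of non-empty sections, or raise if they do not chain."""
    used = [h for h in headers if h.size > 0]
    if not used:
        return 0
    # Each section must start where the previous one ends (page aligned).
    for cur, nxt in zip(used, used[1:]):
        if round_up(cur.scnptr + cur.size) != nxt.scnptr:
            raise ValueError(f"gap or overlap after offset {cur.scnptr}")
    # The last section must end exactly at the end of the file.
    last = used[-1]
    if last.scnptr + last.size != file_size:
        raise ValueError("last section does not end at end of file")
    return len(used)


print(check_layout(HEADERS, 491520))  # prints 4
```

The value it returns is the section count that the file layout itself implies,
which can then be compared with core_H.nscns.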
Now I'm quite confident of having cracked a part of the core structure, only
one thing doesn't make sense: according to the assumptions above,
core_H.nscns should read 4 instead of 0.
This guess is confirmed by the fact that reading a correct core, I found
10 Section Headers and core_H.nscns was actually 10 ... was it by chance ? ;-)
The next step is to use a binary editor to patch byte 7 (core_H.nscns) of the
core file: I write 4 into it.
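For anyone who wants to repeat the patch without a binary editor, here is a
rough sketch in Python. It only works on a synthetic buffer held in memory;
patch_byte is a hypothetical helper, and both the two-byte field widths and
the reading of "byte 7" as zero-based offset 6 are my assumptions, to be
checked against sys/core.h on the target system before touching a real core.

```python
def patch_byte(data: bytes, offset: int, value: int) -> bytes:
    """Return a copy of data with the single byte at offset replaced."""
    if not 0 <= offset < len(data):
        raise IndexError("offset outside the data")
    if not 0 <= value <= 255:
        raise ValueError("value does not fit in one byte")
    patched = bytearray(data)
    patched[offset] = value
    return bytes(patched)


# Synthetic stand-in for the start of the header: the magic "Core", then
# version = 1 and nscns = 0 as two-byte little-endian fields (ASSUMED widths).
header = b"Core" + (1).to_bytes(2, "little") + (0).to_bytes(2, "little")

NSCNS_OFFSET = 6  # ASSUMPTION: "byte 7" counted from one

fixed = patch_byte(header, NSCNS_OFFSET, 4)
print(int.from_bytes(fixed[6:8], "little"))  # prints 4
```

On a real file the same helper would be applied to the bytes read from a copy
of the core, so that the original dump stays untouched.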
Then I issue the very same commands reported in yesterday's note ... and
.... YUPPIIEEEEE !!!!!!
========================================================================
/utenti/eostore:28> ls -l
total 4656
-rwxr-xr-x 1 eostore mdi 491520 Apr 17 11:54 core
-rwxr-xr-x 1 eostore mdi 4276224 Apr 16 15:51 rih
/utenti/eostore:29> file *
core: core dump, generated from 'rih'
rih: COFF format alpha demand paged executable or object module not
stripped - version 3.11-8
/utenti/eostore:30> dbx rih core
dbx version 3.11.8
Type 'help' for help.
Core file created by program "rih"
signal Segmentation fault at [check_micro_cell:182 +0x8,0x12010cba0]
Source not available
(dbx) where
> 0 check_micro_cell(prg = 0x11ffff7d8 = "rih", xref_pool = 0x11ffe6ae8,
org_cgi = 0x11ffe4c88 = "2221084D2986D", dialed_digits = 0x1400351f0 = "",
time_stamp = 861138604, mcvectab = 0x14df0b988, mcind = 0x11ffe4ea0,
mcdeftab = 0x11ffe4fe0, error_list = 0x140016dc8, error_pageno =
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP:
RIH Error List") ["checkmc.c":182, 0x12010cba0]
1 process_utx_moc(prg = 0x11ffff7d8 = "rih", rih_workarea = 0x11ffe5c18,
rtx_gsm_base_memory = 0x140038648, rtx_supl_serv_memory = 0x1400446d0,
mpufctab = 0x11ffe5a70, log_header = 0x11ffe5790, error_found = 0x11ffe5760,
filtered_out = 0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id =
0x11ffe5740, rejected = 0x11ffff298, utxhdr = 0x11ffe5440, utx_gsm_base =
0x11ffe5378, utxptr = 0x14df0614a, error_list = 0x140016dc8, error_pageno =
0x11ffe65e0, error_lineno = 0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP:
RIH Error List") ["utxmoc.c":1862, 0x1200f5b28]
2 process_utx_record(prg = 0x11ffff7d8 = "rih", rih_workarea =
0x11ffe5c18, rtx_gsm_base_memory = 0x140038648, rtx_gsm_base_memory_mpp =
0x1400444e8, rtx_supl_serv_memory = 0x1400446d0, mpufctab = 0x11ffe5a70,
log_header = 0x11ffe5790, error_found = 0x11ffe5760, filtered_out =
0x11ffe5758, is_hplmn = 0x11ffe5750, tableset_no = 0, rb_id = 0x11ffe5740,
error_list = 0x140016dc8, error_pageno = 0x11ffe65e0, error_lineno =
0x11ffe65e4, err_id = 0x1400005b0 = "Omnitel MP: RIH Error List")
["prcutx.c":284, 0x1200eb714]
More (n if no)?n
(dbx) q
/utenti/eostore:31>
========================================================================
Well well, the stack seems to be consistent and the culprit is found: line
182 of checkmc.c (function check_micro_cell) will be checked by the programmer
to see what happens there; everybody's happy ... you see, "rih" is a process
that allows our customer to bill around ONE MILLION of their customers,
therefore this problem had some degree of importance 8*)
End of explanation and start asking questions.
My understanding is that the key to the corruption is the header record
stating there are ZERO sections in the core when actually there were four ...
can anybody explain this ?
Then, why only four sections ? Other tests I did showed a core having ten.
Let me close by offering a hypothesis.
The customer tells me that at crash time the process had > 1GB of
allocated memory; he hasn't checked the available disk space yet.
I'm thinking of this possible scenario:
1) Process does a quite common SEGV (i.e. sprintf with some null
arguments)
2) Operating system starts the core generation (1GB+ of memory to
flush)
3) Op. Sys. finds out disk space is not enough
4) Op. Sys. writes down the sections it can (this is OK)
5) Op. Sys. "forgets" to write the proper value of core_H.nscns (this
is not OK)
Can this be a possible explanation ? If yes, I believe behaviour 5 should
be fixed.
Thanks a lot for your attention,
Ezio
P.S.
Wrestling with this problem has been interesting and I would say somewhat fun
... being myself little acquainted with unix "weirdnesses".
BUT: does anybody with a similar core problem have to go through the same
cat-and-mouse game ? Is there any documentation ?
Sorting through this notesfile, the only useful pointer I got was core.h,
and that's really rather poor.
|
Thanks Peter and Craig,
Re: .3
>I was a bit concerned that even
>though after patching the core file, you might be mislead since you
>were missing data. What if the stack had more than one section for
>instance?
You're right, as I said in my reply we had "a flavour of good luck": the
information in the core was at least consistent, if not complete, therefore
it was enough for the programmer to find the bug ... I believe this is the very
reason why a core is dumped at all.
Re: .4
>Not quite. The OS doesn't check the disk space before it starts writing, it
>just starts writing. At some point during that process, it discovers
>there isn't enough room and aborts.
Well, I'd say the OS is doing quite a clever job, i.e. it writes down only the
sections it can fit on the disk and no partial sections; on the other hand,
somebody told me that if core size is limited by sys parameters and the size
exceeds the parameter (what's this bloody parameter name ???) then the core is
not written at all ... even this makes sense, you play by the rules.
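For the record, the parameter I could not name is most likely the per-process
core file size resource limit (RLIMIT_CORE in setrlimit(2), "ulimit -c" in
sh-style shells, "limit coredumpsize" in csh). I have not verified what this
particular system does when a dump exceeds it. A small Python sketch to display
the limit for the current process, assuming a UNIX platform where the standard
resource module is available (describe is just a helper name of mine):

```python
import resource


def describe(limit):
    """Render a resource limit value in readable form."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


# Soft limit is the one enforced; hard limit is the ceiling it can be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core size soft limit:", describe(soft))
print("core size hard limit:", describe(hard))
```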
>It doesn't 'forget' - the decision to not update the filehdr
>is quite deliberate, and likely made to protect coredump analysis tools
>that expect the corefile to be complete.
Let me disagree on this: either the OS writes something useful, or it doesn't
write anything at all and sends a loud warning "HEY, I couldn't write this
damned thing because disk was choking!".
An even better possibility would be for the OS to write down a readable, partial
core with a warning in it "WATCH OUT: you get only this and not that, debugging
a partial core may be harmful for your health".
I think it's unfair for the programmer/application administrator to see the
information he so badly needs (the core file) just to be disappointed later on
when he finds out this information is not accessible.
Yes, I believe all this is academic discussion, just wanted to be positively
critical.
Thanks again,
Ezio
|