
Conference hydra::axp-developer

Title:Alpha Developer Support
Notice:[email protected], 800-332-4786
Moderator:HYDRA::SYSTEM
Created:Mon Jun 06 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:3722
Total number of notes:11359

3231.0. "LMS Numerical Technologies (Belgium)" by RDGENG::ASAP () Tue Feb 25 1997 08:21

    Company Name :  LMS Numerical Technologies (Belgium)
    Contact Name :  Erwin Glassee
    Phone        :  +32/16/384527 (direct) or +32/16/384500
    Fax          :  +32/16/384550
    Email        :  [email protected]
    Date/Time in :  25-FEB-1997 13:20:10
    Entered by   :  Nick Hudson
    SPE center   :  REO

    Category     :  unix
    OS Version   :  4.0
    System H/W   :  


    Brief Description of Problem:
    -----------------------------

From:	ESSB::ESSB::MRGATE::"ILO::ESSC::bsanting" 25-FEB-1997 07:56:27.94
To:	RDGENG::ASAP
CC:	
Subj:	ESCALATION: POINT No., Company TO ASAP READING:    

From:	NAME: ESCTECH@ILO          
	TEL: (822-)6704          
	ADDR: ILO                  <bsanting@ESSC@ILO>
To:	ASAP@RDGENG@MRGATE

Hello - 

POINT Log Number	 20844

Company Name 	LMS Numerical Technologies (Belgium)

Engineers name	Erwin Glassee  

Telephone Number 	+32/16/384527 (direct) or +32/16/384500 

Fax Number		+32/16/384550	

E-mail Address	[email protected]

Operating System, Version	Digital Unix V4.0 386 alpha

Platform			

Problem Statement		

ASAP Membership Number: A60828

I have a mixed C++/C/FORTRAN program on Digital Unix.
My version numbers are:

OS:

mario238-src/odb>uname -a
OSF1 mario V4.0 386 alpha

C++ Compiler:

mario237-src/odb>cxx -V
cxx  (cxx)
DEC C++ V5.4-006 on Digital UNIX (Alpha)

FORTRAN Compiler:

mario243-src/odb>which f77
/usr/ucb/f77
mario244-src/odb>what /usr/ucb/f77
/usr/ucb/f77:
        $RCSfile: crt0.s,v $ $Revision: 1.1.18.3 $ (DEC) $Date: 1994/08/24 20:26:25 $
        DEC Fortran Compiler Driver V4.0-1

For the C compiler this is quite a list, so I won't include it here
unless you really need it.

When I compile a program called rsm and run it with three different
input files, I get:

1) Everything runs fine.

2) SEGV inside malloc

Thread received signal SEGV
stopped at [???: ??? 0x3ff82914480]
(ladebug) where
>0  0x3ff82914480 in /usr/shlib/libc.so
#1  0x3ff82851968 in malloc(0x3ff8285196c, 0x0, 0x12002a924, 0x0, 0x0, 0x0) DebugInformationStrippedFromFile22:???
#2  0x12002a920 in MemoryMalloc(size=16) nitmem.cxx:38
#3  0x12002bd00 in FunctionListCreate(size=0) functionlist.cxx:52

3) SEGV inside malloc, but at another position.

Thread received signal SEGV
stopped at [???: ??? 0x3ff82914480]
(ladebug) where
>0  0x3ff82914480 in /usr/shlib/libc.so
#1  0x3ff82851968 in malloc(0x3ff8285196c, 0x420, 0x12002a924, 0x4, 0x0, 0x11ffffc50) DebugInformationStrippedFromFile22:???
#2  0x12002a920 in MemoryMalloc(size=4) nitmem.cxx:38
#3  0x12002a984 in operator new(size=4) nitmem.cxx:48
#4  0x12002ad4c in ((Str*)0x140068900)->Init(cp=0x12000c658="") str.cxx:77
#5  0x12002aaf0 in ((Str*)0x140068900)->Str(cp=0x12000c658="") str.cxx:41
#6  0x12002c3a0 in ((Project*)0x140068900)->Project() project.cxx:36

At first I thought this could be caused by mixing C malloc and C++
new, but I introduced a global new operator that calls malloc, and
the behaviour hasn't changed.

Could you suggest something else that can be wrong? Is this a bug
in the standard libraries (after all, calling malloc should never
cause a SEGV, should it?)? On all other platforms this code
runs fine, and I have no idea what else I could try.

I wrote some smaller tests which all run fine. It's just the
integrated executable that fails.

Greetings,

Erwin.
--


Regards,


Ben


		   QED Qualitas Est Demonstrandum
		   ==============================
Ben Santing, Technical Consultant		Phone:  DTN 822 4330
European Customer Service Centre		Phone:  DTN 822 4269
Digital Equipment International B.V.		FAX:    DTN 822 4445

	     In replying, please use [email protected]


3231.1. by KZIN::HUDSON (That's what I think) Tue Feb 25 1997 08:27, 127 lines
From:	DEC:.REO.REOVTX::HUDSON       "[email protected] - UK Software
Partner Engineering 830-4121" 25-FEB-1997 11:13:19.04
To:	nm%vbormc::"[email protected]"
CC:	HUDSON
Subj:	re:ASAP Question:segv inside malloc 

Hello Erwin Glassee

Thank you for your ASAP mail question on SEGV errors from malloc() calls.

I have seen many cases of programs that fail in calls to malloc/free/etc., and
they are almost always due to programming errors.  I personally have never seen
a case where the problem was caused by a bug in the operating system libraries;
the most recent such bug I can find reported for OSF was fixed in version 2.1.

> Is this a bug
>in the standard libraries (after all, calling malloc should never
>cause a SEGV, should it?)? On all other platforms this code
>runs fine, and I have no idea what else I could try.

The routines that manipulate the heap typically use some kind of linked list
data structure to keep track of how much memory you have requested and
freed, in an attempt to allow for efficient memory use.  Different versions of
Unix may implement these routines in different ways, so if your program
mistakenly overwrites heap data that it doesn't own, subsequent calls to
malloc() will not necessarily fail.  But they might.

For example, if you malloc() a 20-element array and write into element 25,
then it could be that

	- you won't notice anything wrong (no-one else is using that memory)
	- some data structure of yours will be corrupted
	- some data structure used by the heap routines will be corrupted

It is not possible to predict what will happen, as it depends on the state
of the heap, which other routines have been using it recently, etc.

Given that malloc() will be traversing a linked list to find suitable bits of
memory to give you, it is quite easy to see that a SEGV could occur if one of
the pointers it is following has been corrupted.
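To make the linked-list point concrete, here is a toy heap written as a minimal C++ sketch. It is hypothetical and is not how the Digital UNIX libc malloc is actually implemented; the names (toy::Header, toy::allocate, toy::count_blocks, toy::one_past_end) are invented for illustration. Every block is preceded by a header, the headers are chained together, and in this layout the bytes just past the end of one block are the next block's header, i.e. exactly the kind of pointer the allocator later follows.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical toy heap, for illustration only: real malloc implementations
// differ in detail, but many keep bookkeeping right next to the user data.
namespace toy {

struct Header {
    std::size_t size;  // usable bytes that follow this header
    Header*     next;  // next block in the chain, or nullptr
};

constexpr std::size_t kAlign = alignof(std::max_align_t);
constexpr std::size_t round_up(std::size_t n) { return (n + kAlign - 1) / kAlign * kAlign; }
constexpr std::size_t kHeader = round_up(sizeof(Header));

alignas(std::max_align_t) unsigned char arena[4096];
std::size_t used = 0;        // bytes of the arena handed out so far
Header*     head = nullptr;  // first block in the chain
Header*     tail = nullptr;  // last block, so new blocks can be appended

// Carve the next block out of the arena and link its header into the chain.
void* allocate(std::size_t n) {
    assert(n > 0 && "toy allocator does not handle zero-size requests");
    const std::size_t need = kHeader + round_up(n);
    if (used + need > sizeof arena) return nullptr;
    unsigned char* block = arena + used;
    Header* h = new (block) Header{round_up(n), nullptr};
    if (tail != nullptr) tail->next = h; else head = h;
    tail = h;
    used += need;
    return block + kHeader;  // the caller's data starts right after the header
}

// Walk the chain the way an allocator does when it searches for space. If a
// program had overwritten a header, 'next' could hold garbage and this loop
// would dereference a wild pointer: the "SEGV inside malloc" symptom.
std::size_t count_blocks() {
    std::size_t count = 0;
    for (const Header* h = head; h != nullptr; h = h->next) ++count;
    return count;
}

// Address just past the (rounded-up) usable bytes of a block from allocate(n).
unsigned char* one_past_end(void* p, std::size_t n) {
    return static_cast<unsigned char*>(p) + round_up(n);
}

}  // namespace toy
```

In this toy, an out-of-bounds store such as a[20] on a block of 20 ints would overwrite the size and next fields of the following block, and the damage would only show up the next time the chain is walked.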

>At first I thought this could be caused by calling a mixture of 
>C malloc and C++ new, but I introduced a global new operator that
>calls malloc, and the behaviour hasn't altered.

The C++ global "new" operator uses the standard malloc() underneath.  There is
no problem mixing "new" and "malloc()" in the same program; the only
restriction is that you can't mix use of them for the same data structure. 
E.g. if you "new" something, you must use "delete" when you've finished with
it, and if you "malloc" something, you must use "free".  You can't do:

	int	*a = (int *)malloc(sizeof(int));
	delete a;				// !! wrong !!
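For reference, a replacement global operator new of the kind Erwin describes might look like the following minimal sketch. The counters g_news and g_deletes are hypothetical, added only so the behaviour can be observed. Since the default operator new already uses malloc() underneath, it is not surprising that such a replacement changed nothing; it also does not make mixing legal: an object created with new must still be released with delete (which runs its destructor), and malloc()ed memory with free().

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical counters, only so the sketch's behaviour can be observed.
std::size_t g_news = 0;
std::size_t g_deletes = 0;

// Replacement global allocation function built directly on malloc().
void* operator new(std::size_t size) {
    if (size == 0) size = 1;  // operator new must return a unique, non-null pointer
    void* p = std::malloc(size);
    if (p == nullptr) throw std::bad_alloc();
    ++g_news;
    return p;
}

// Matching deallocation function: memory from the operator new above goes back via free().
void operator delete(void* p) noexcept {
    if (p == nullptr) return;
    ++g_deletes;
    std::free(p);
}

// C++14 sized deallocation; forward to the unsized version above.
void operator delete(void* p, std::size_t) noexcept { ::operator delete(p); }
```

The array forms (new[] and delete[]) are not replaced here; by default they call the two functions above.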


>Could you suggest something else that can be wrong?

The most common things that could be happening are:

1) using more memory than you've allocated

	e.g. 
		int	*a = new int[10];

		a[10] = 1;		// !! wrong, max subscript is 9
	

2) using a pointer to memory which you've released

	e.g.

		int	*a = new int[10];
		myfunc(a);
		delete[] a;
		a[1] = 1;		// !! wrong, a no longer points to valid memory

3) freeing memory more than once

	e.g.

		int	*a;
		a = (int *)malloc(sizeof(int));
		free(a);
		free(a);		// !! wrong, a already freed


All of the above will compile with no errors.  At runtime, you may get a
problem or you may not; it is impossible to predict.


Unfortunately these errors are not always so easy to spot in a real
application.  So usually you will need to do some extra work.

There are a couple of things I would suggest.  One thing that some people do is
to incorporate extra error checking.  For example, in the case of C, you could
define your own versions of malloc() and free(), which keep a log of which
memory your application has been using, so that when the program fails you
could look in the log for any obvious discrepancies.
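A minimal sketch of such logging wrappers follows. The names logged_malloc, logged_free, LOG_MALLOC and LOG_FREE are hypothetical, not a system API. Each call is logged to stderr with its source location, and a table of live blocks lets the free wrapper report pointers it never handed out or has already released (scenario 3 above) instead of passing them to free().

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>

// Hypothetical debugging wrappers around malloc()/free().
struct Allocation {
    std::size_t size;
    const char* file;
    int         line;
};

static std::map<void*, Allocation> live_blocks;  // blocks currently handed out

void* logged_malloc(std::size_t size, const char* file, int line) {
    void* p = std::malloc(size);
    std::fprintf(stderr, "malloc(%zu) = %p at %s:%d\n", size, p, file, line);
    if (p != nullptr) live_blocks[p] = Allocation{size, file, line};
    return p;
}

// Releases p only if it is a live block. Returns false, without calling
// free(), for double frees and stray pointers. (If malloc later reuses the
// same address, a stale free can still slip past this check.)
bool logged_free(void* p, const char* file, int line) {
    if (p == nullptr) return true;  // free(NULL) is a legal no-op
    auto it = live_blocks.find(p);
    if (it == live_blocks.end()) {
        std::fprintf(stderr, "BAD free(%p) at %s:%d: not a live block\n", p, file, line);
        return false;
    }
    std::fprintf(stderr, "free(%p) [%zu bytes from %s:%d] at %s:%d\n",
                 p, it->second.size, it->second.file, it->second.line, file, line);
    live_blocks.erase(it);
    std::free(p);
    return true;
}

// Convenience macros that record the caller's location automatically.
#define LOG_MALLOC(n) logged_malloc((n), __FILE__, __LINE__)
#define LOG_FREE(p)   logged_free((p), __FILE__, __LINE__)
```

Anything still in live_blocks when the program exits is a candidate memory leak.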

Another approach that some people use is to try to make the program fail very
soon after the illegal operation (in your case, it may be that the SEGV is
happening many millions of instructions after the error occurred, which
complicates the investigation).  What you could do is ensure that whenever you
malloc() memory, you fill it with a known pattern, say "0xff".  When you free
memory, you fill it with another known pattern, say "0xee", just before you free
it.  In a "legal" program this will not cause a problem, but because the heap
contains many pointers, and values such as "0xeeee..." and "0xffff..." are
unlikely to be valid addresses, a program that uses a stale or uninitialised
pointer is much more likely to fail quickly.
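The fill-pattern idea can be sketched in C++ as below. pattern_malloc and pattern_free are hypothetical names, and the size prefix is an assumption of this sketch: free() is not told how big a block is, so each block records its own size just in front of the user data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical wrappers: fresh blocks are filled with 0xff, and blocks are
// overwritten with 0xee just before they are released.
constexpr std::size_t kPrefix = sizeof(std::max_align_t);  // keeps user data suitably aligned
static_assert(kPrefix >= sizeof(std::size_t), "prefix must hold the block size");

void* pattern_malloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(kPrefix + size));
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, &size, sizeof size);    // remember the size for pattern_free
    std::memset(raw + kPrefix, 0xff, size);  // "newly allocated" pattern
    return raw + kPrefix;
}

void pattern_free(void* p) {
    if (p == nullptr) return;
    unsigned char* raw = static_cast<unsigned char*>(p) - kPrefix;
    std::size_t size;
    std::memcpy(&size, raw, sizeof size);
    std::memset(p, 0xee, size);              // "freed" pattern, written just before free()
    std::free(raw);
}
```

Note that after free() the allocator may reuse part of the block for its own bookkeeping, so the 0xee pattern is not guaranteed to survive intact; it only makes early use-after-free reads more likely to stand out.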

The other thing to look at is a tool called "Third Degree", which is one of the
Atom tools.  Third Degree will keep track of when you are misusing the heap,
and will give you a list of all the things your application is doing wrong
(e.g. it will catch any of the 3 scenarios I mentioned above).  This tool is
documented in the "Programmer's Guide", chapter 7 of the Unix documentation
set.  It is a very powerful tool, but you may find it picks up a lot more
errors than you were expecting (including some memory leaks in the standard
libraries), so you might spend a long time fixing everything it reports!

I hope this information is helpful to you

Regards

Nick Hudson
Digital Software Partner Engineering.


3231.2. by KZIN::HUDSON (That's what I think) Wed Feb 26 1997 03:50, 72 lines
From:	VBORMC::"[email protected]" "Erwin Glassee" 26-FEB-1997 08:46:57.84
To:	"[email protected] - UK Software Partner Engineering 830-4121
25-Feb-1997 1113 +0000" <[email protected]>
CC:	
Subj:	Re: ASAP Question:segv inside malloc

[email protected] - UK Software Partner Engineering 830-4121
25-Feb-1997 1113 +0000 wrote:
> 
> Hello Erwin Glassee
> 
> Thank-you for your ASAP mail question on SEGV errors from malloc() calls.
> 
> The routines that manipulate the heap typically use some kind of linked list
> data structure to keep track of how much memory you have requested and
> free'd, in an attempt to allow for efficient memory use.  Different versions of
> Unix may implement these routines in different ways, and so if your program
> mistakenly overwrites data in the heap that it doesn't own, it doesn't
> necessarily mean that subsequent malloc()'s will fail.  But it might do.

> I hope this information is helpful to you
> 
> Regards
> 
> Nick Hudson
> Digital Software Partner Engineering.

I was able to track the problem down, and it was indeed due to 
allocating insufficient memory for an array. On DEC Alpha, the end
of the array must have been located in the free block list of the 
malloc routines, while this was not the case on the other platforms,
which would explain the symptoms of the problem.

Thank you very much for your help and your suggestions.

Regards

Erwin.
--
Erwin Glassee [email protected] 
tel. +32/16/384527 (direct) or +32/16/384500 fax +32/16/384550
LMS Numerical Technologies, Interleuvenlaan 70, 
Researchpark Haasrode Z1, B-3001 Leuven, Belgium

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: from mail.vbo.dec.com (mail.vbo.dec.com [16.36.208.34]) by
vbormc.vbo.dec.com (8.7.3/8.7) with ESMTP id JAA03713 for
<[email protected]>; Wed, 26 Feb 1997 09:43:51 +0100
% Received: from server21.digital.fr (server21.digital.fr [193.56.15.21]) by
mail.vbo.dec.com (8.7.3/8.7) with ESMTP id JAA09260 for
<[email protected]>; Wed, 26 Feb 1997 09:48:35 +0100 (MET)
% Received: from ironman.lms.be ([193.121.76.65]) by server21.digital.fr
(8.7.5/8.7) with ESMTP id JAA08453 for <[email protected]>; Wed, 26 Feb
1997 09:51:03 +0100 (MET)
% Received: from lmsnit.be (jedi.lmsnit.be [192.168.1.20]) by ironman.lms.be
with SMTP (8.7.1/8.7.1) id JAA14590 for <[email protected]>; Wed, 26
Feb 1997 09:40:35 +0100 (MET)
% Received: from joske by lmsnit.be (4.1/SMI-4.1) id AA14130; Wed, 26 Feb 97
09:40:21 +010
% Sender: [email protected]
% Message-Id: <[email protected]>
% Date: Wed, 26 Feb 1997 10:40:34 +0100
% From: Erwin Glassee <[email protected]>
% Organization: LMS Numerical Technologies
% X-Mailer: Mozilla 3.01 (X11; I; Linux 2.0.0 i686)
% Mime-Version: 1.0
% To: "[email protected] - UK Software Partner Engineering 830-4121
25-Feb-1997 1113 +0000" <[email protected]>
% Subject: Re: ASAP Question:segv inside malloc
% References: <[email protected]>
% Content-Type: text/plain; charset=us-ascii
% Content-Transfer-Encoding: 7bit