
Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

9986.0. "Multithreaded application crashes on 8400, but not on 4100" by NESBIT::BGIRVAN () Thu May 29 1997 15:31

    We are benchmarking Sybase 11.0.2.2 with Open Client 10.0.0.3 on 3
    machines.
    
    Machine 1: 8400 with 4 5/300MHz CPU's, 4GB Memory, KFTHA, 2 PCI buses,
    2 HSZ50 dual pairs with 128 MB cache, 24 RZ29B's in 4 RAID 0 + 1 arrays, 
    Digital UNIX V3.2G with full DUV32GAS00001-19970501 patches applied.
    
    Machine 2: 8400 with 4 5/6/440MHz CPU's, 4GB Memory, KFTHA, 2 PCI buses, 
    2 HSZ50 dual pairs with 128 MB cache, 24 RZ29B's in 4 RAID 0 + 1 arrays, 
    Digital UNIX V3.2G with full DUV32GAS00001-19970501 patches applied.
    
    Machine 3: 4100 with 4 5/400MHz CPU's, 1 GB Memory, 2 HSZ50 dual pairs
    with 128 MB cache, 24 RZ29B's in 4 RAID 0 + 1 arrays, Digital UNIX
    V3.2G with full DUV32GAS00001-19970501 patches applied.
    
    The customer has an in-house application, which is multi-threaded
    (badly, their words not mine) running 15 threads, which sorts through the
    database and outputs the relevant data. Their current system is an 8400
    with 4 300MHz CPU's, 2GB Memory running Digital UNIX V3.2D and Sybase
    10.
    
    When the application was first run on the 440MHz system it caused a
    panic with either of the following errors:
    
    panic (cpu 0): pciaerror
    
    or
    
    panic (cpu 3): xpt_callback: callback_on freed CCB
    
    
    Installing the latest patch kit and recompiling and linking the
    application to the 3.2G libraries stopped the system crashes. The
    application then crashed when run but the core was always corrupted.
    It always failed at the same point in the code.
    
    We then ran it using ladebug which indicated that the code was crashing
    during a free() function call. I then tried setting old_obreak=0 in the
    kernel but it made no difference. The call to free() was then removed from
    the code but the application then failed at the next free() call.
    
    The application behaved the same on the 300MHz system as the 440MHz
    system.
    
    But when we ran the original code (compiled and linked against 3.2D)
    on the 4100 system it ran to completion.
    
    Has anyone any idea what is going on here?
    
    Currently we are running the 3.2G compiled and linked code on the 4100
    and trying a run on the 300MHz 8400 with the memory reduced to 1GB. It
    takes a while to run the benchmark so they are set to run overnight.
    
    The kernel parameters which have been modified are:
    
    
    
    4100
    
    ipc:
      shm-max=2118123520
      sem-mni=32
      num-of-sems=120
    
    proc:
            max-proc-per-user=2048
            max-threads-per-user=2048
            per-proc-data-size=4294967296
            max-per-proc-data-size=4294967296
            per-proc-address-space=4294967296
            max-per-proc-address-space=4294967296
            per-proc-stack-size=1073741824
            max-per-proc-stack-size=1073741824
    
    rt:
      aio-max-num=1024
    
    vm:
      vm-maxvas=4294967296
      ubc-maxpercent=30
      vm-ubcseqstartpercent=20
      vm-vpagemax=131072
    
    
    
    8400's
    
    ipc:
      shm-max=2147483647
      sem-mni=1024
      num-of-sems=120
    #
    rt:
      aio-max-num=1024
      aio-task-max-num=1024
    
    vm:
      vm-maxvas=4294967296
      ubc-maxpercent=30
      vm-ubcseqstartpercent=20
      vm-vpagemax=4294967296
      vm-maxwire=2147483648
      vm-kentry_zone_size=33554432
      contig-malloc-percent=2
    
    proc:
            max-proc-per-user=5000
            max-threads-per-user=5000
            per-proc-data-size=1073741824
            max-per-proc-data-size=4294967296
            per-proc-address-space=4294967296
            max-per-proc-address-space=4294967296
            per-proc-stack-size=134217728
            max-per-proc-stack-size=1073741824
            sched-min-idle=10
    
    
    Billy.
          
9986.1. "Un-official simport patches seems to address this panic problem.." by NNTPD::"[email protected]" (Sri) Thu May 29 1997 17:31 (9 lines)

Hi,
	
	There are SIMport patches that address some of these
panics. The patches can be obtained from
decatl.alf.dec.com:/pub/patches/misc/simport_patches

Regards
Sri
[Posted by WWW Notes gateway]

9986.2. "Nothing new here..." by WTFN::SCALES (Despair is appropriate and inevitable.) Fri May 30 1997 19:35 (23 lines)

I'm afraid that there's not much information here to go on.

.0> We then ran it using ladebug which indicated that the code was crashing
.0> during a free() function call.

I presume you were seeing a SEGV inside free().  Can you tell if the memory
management data structures have been corrupted?  

Check the customer's code for instances of using memory after it's been freed,
freeing memory twice, writing beyond the end of an array in dynamically
allocated memory, or use of uninitialized local pointer variables.  Any of these
sorts of things in the customer's application could result in these symptoms.
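
One way to hunt for the kinds of errors listed above is to route the application's allocations through a thin checking layer. The following is only a minimal sketch under stated assumptions: the names `dbg_malloc`, `dbg_check` and `dbg_free` are hypothetical and are not part of the customer's code or of any Digital UNIX library. It places a guard word after each block so that a write past the end can be detected before free() trips over it, and it clears the caller's pointer on release so that a second release through the same variable is reported instead of reaching free().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DBG_LIVE 0x4C495645424C4B21ULL  /* header magic while the block is allocated */
#define DBG_TAIL 0x5441494C47554152ULL  /* guard word placed right after the user data */

typedef struct {
    uint64_t magic;  /* DBG_LIVE while the block is in use */
    uint64_t size;   /* number of bytes the caller asked for */
} dbg_header;

typedef enum { DBG_OK = 0, DBG_NULL, DBG_BAD_HEADER, DBG_OVERRUN } dbg_status;

/* Allocate size bytes surrounded by a header and a trailing guard word. */
void *dbg_malloc(size_t size)
{
    dbg_header hdr;
    uint64_t tail = DBG_TAIL;
    unsigned char *raw = malloc(sizeof hdr + size + sizeof tail);

    if (raw == NULL)
        return NULL;
    hdr.magic = DBG_LIVE;
    hdr.size = size;
    memcpy(raw, &hdr, sizeof hdr);
    memcpy(raw + sizeof hdr + size, &tail, sizeof tail);
    return raw + sizeof hdr;
}

/* Verify the header and the guard word of a block returned by dbg_malloc(). */
dbg_status dbg_check(const void *user)
{
    const unsigned char *raw;
    dbg_header hdr;
    uint64_t tail;

    if (user == NULL)
        return DBG_NULL;
    raw = (const unsigned char *)user - sizeof hdr;
    memcpy(&hdr, raw, sizeof hdr);
    if (hdr.magic != DBG_LIVE)
        return DBG_BAD_HEADER;
    memcpy(&tail, raw + sizeof hdr + hdr.size, sizeof tail);
    return tail == DBG_TAIL ? DBG_OK : DBG_OVERRUN;
}

/* Check the block, release it if its header is intact, and clear the
   caller's pointer so a repeated release is reported as DBG_NULL. */
dbg_status dbg_free(void **slot)
{
    dbg_status st;

    if (slot == NULL)
        return DBG_NULL;
    st = dbg_check(*slot);
    if (st == DBG_OK || st == DBG_OVERRUN) {
        unsigned char *raw = (unsigned char *)*slot - sizeof(dbg_header);
        memset(raw, 0, sizeof(dbg_header));  /* wipe the magic before release */
        free(raw);
        *slot = NULL;
    }
    return st;
}
```

A call such as `dbg_free(&seg_net)` in place of `free(seg_net)` would report an overrun at the point of release rather than faulting inside the allocator. The sketch cannot catch a stale copy of the pointer held elsewhere, and it is not thread-safe beyond whatever the underlying malloc() provides.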

.0> But when we ran the original code (compiled and linked against 3.2D)
.0> on the 4100 system it ran to completion.

It could easily be the case that in the 3.2D image the corruption either doesn't
happen (i.e., because the timing of the threads' execution is different) or it
corrupts an otherwise benign location (because of different timing, or different
bits in the uninitialized pointer value).


				Webb

9986.3. "Latest Update" by NESBIT::BGIRVAN () Mon Jun 02 1997 05:54 (29 lines)

    I applied the simport patch suggested by Sri. We then ran the
    executable and it ran to completion. So we ran it again to be sure, and
    this time it fell over exactly as it had before.
    
    We also managed to get it to run successfully on an 8000 without the
    patch installed, but with the memory down to 1GB. After the run,
    when I examined shared memory with 'ipcs', I got a load of rubbish
    returned to the screen, as if shared memory had been corrupted. I've
    also noticed this after a failed run.
    
    One other strange parameter I've noticed: the base address of the
    kernel's virtual address space is different between the 8400's and the
    4100, i.e.,
    
    4100
    
    		vm-min-kernel-address = 18446744071562067968
    
    
    8000
    
    		vm-min-kernel-address = 18446744065119617024
    
    
    The 4100 address is the default according to the tuning guide, so why
    is the 8400 different and is it significant?
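
The two decimal values are easier to compare in hexadecimal. The sketch below (helper names are hypothetical) converts each value and computes how far it sits below the top of a 64-bit address space. It only does the arithmetic on the numbers quoted above; it does not explain why the two systems differ or whether the difference matters.

```c
#include <stdint.h>
#include <stdio.h>

/* Distance in bytes from addr to the top of a 64-bit address space.
   Unsigned arithmetic wraps modulo 2^64, so 0 - addr equals 2^64 - addr. */
uint64_t bytes_below_top(uint64_t addr)
{
    return (uint64_t)0 - addr;
}

/* Print the address in hex together with its distance from the top in GB. */
void print_kernel_base(const char *label, uint64_t addr)
{
    printf("%-5s base 0x%016llx, %llu GB below the top of the address space\n",
           label,
           (unsigned long long)addr,
           (unsigned long long)(bytes_below_top(addr) >> 30));
}
```

Calling `print_kernel_base("4100", 18446744071562067968ULL)` gives 0xffffffff80000000 (2 GB below the top), and `print_kernel_base("8400", 18446744065119617024ULL)` gives 0xfffffffe00000000 (8 GB below the top).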
    
    Billy
    

9986.4. "ladebug trace" by NESBIT::BGIRVAN () Mon Jun 02 1997 06:40 (398 lines)

    Here's the trace from ladebug on the problem.
    
    
    asset_type1 : Starting segval netting asset_max is [79]
    
    [asset_type1] : Retrieving segvals [exec p_rptdb_asset1_net  A0019]
    
    Executing procedure exec p_rptdb_asset1_net  A0019 and retrieving data
    
    Completed Executing procedure exec p_rptdb_asset1_net  A0019 and
    retrieving data rows 79
    
    [asset_type1] : Completed Retrieval of segvals [exec p_rptdb_asset1_net 
    A0019]
    
    [asset_type1] : Assign Category for [exec p_rptdb_asset1_net  A0019]
    
    [asset_type1] : Completed Assign Category for [exec p_rptdb_asset1_net 
    A0019]
    
    [asset_type1] : Assigning customer_type and category for [exec
    p_rptdb_asset1_net  A0019] vehicle_max = [877], customer_max=[49051]
    
    [asset_type1] : Completed Assigning customer_type and category for
    [exec p_rptdb_asset1_net  A0019]
    
    [asset_type1] : Assigning Trans Type Group for [exec p_rptdb_asset1_net 
    A0019]
    
    [asset_type1] : completed Assigning Trans Type Group for [exec
    p_rptdb_asset1_net  A0019]
    
    [asset_type1] : completed cleaning trans_type btree for [exec
    p_rptdb_asset1_net  A0019]
    
    [asset_type1] : completed cleaning customer btree for [exec
    p_rptdb_asset1_net  A0019]
    
    [asset_type1] : completed cleaning vehicle btree for [exec
    p_rptdb_asset1_net  A0019]
    
    [asset_type1] : Completed and release segval_netting mutex
    
    asset_type1 : Walking through seg_net array
    
    asset_type1 : array_number [0], accnode [31080], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [1], accnode [31081], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [2], accnode [32447], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [3], accnode [32448], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [4], accnode [32449], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [5], accnode [32450], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [6], accnode [32451], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [7], accnode [32452], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [8], accnode [32453], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [9], accnode [32454], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [10], accnode [32455], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [11], accnode [32458], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [12], accnode [32459], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [13], accnode [32460], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [14], accnode [32463], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [15], accnode [32464], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [16], accnode [32465], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [17], accnode [32466], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [18], accnode [32467], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [19], accnode [32468], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [20], accnode [32469], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [21], accnode [32471], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [22], accnode [32472], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [23], accnode [32473], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [24], accnode [32474], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [25], accnode [32475], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [26], accnode [32476], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [27], accnode [32477], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [28], accnode [32478], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [29], accnode [32479], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [30], accnode [32480], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [31], accnode [32481], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [32], accnode [32482], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [33], accnode [32483], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [34], accnode [32484], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [35], accnode [32487], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [36], accnode [32488], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [37], accnode [32489], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [38], accnode [32490], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [39], accnode [32491], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [40], accnode [32492], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [41], accnode [32493], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [42], accnode [32494], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [43], accnode [32495], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [44], accnode [32496], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [45], accnode [32497], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [46], accnode [32498], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [47], accnode [32499], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [48], accnode [32500], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [49], accnode [32501], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [50], accnode [36206], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [51], accnode [36210], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [52], accnode [60908], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [53], accnode [60915], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [54], accnode [60931], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [55], accnode [60936], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [56], accnode [81128], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [57], accnode [81130], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [58], accnode [81479], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [59], accnode [84737], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [60], accnode [84739], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [61], accnode [84742], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [62], accnode [84744], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [63], accnode [138068], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [64], accnode [138070], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [65], accnode [150283], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [66], accnode [150286], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [67], accnode [150405], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [68], accnode [164141], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [69], accnode [164143], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [70], accnode [177090], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [71], accnode [203592], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [72], accnode [205942], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [73], accnode [205944], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [74], accnode [208001], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [75], accnode [208002], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [76], accnode [212447], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [77], accnode [212449], s1 [ABS], cust_type
    [], active [1]
    
    asset_type1 : array_number [78], accnode [212526], s1 [ABS], cust_type
    [], active [1]
    
    [1] stopped at [asset_type1:2080 0x12003d390]
    
       2080         printf("asset_type1 : deallocating seg_net. pointer
    value is [%x]\n", seg_net);
    
    (ladebug) next
    asset_type1 : deallocating seg_net. pointer value is [b37a6008]
    
    stopped at [asset_type1:2081 0x12003d3ac]
    
       2081         fflush(stdout);
    
    (ladebug) next
    
    stopped at [asset_type1:2082 0x12003d3c4]
    
       2082         root_seg=*rs;
    
    (ladebug) step
    
    stopped at [asset_type1:2083 0x12003d3d0]
    
       2083         segval_dealloc(seg_net); seg_net=NULL;
    
    (ladebug) step
    
    stopped at [segval_dealloc:210 0x12002ed94]
    
    (ladebug) step
    
    stopped at [segval_dealloc:210 0x12002ed94]
    
        210         if (s != NULL)
    
    (ladebug) step
    
    stopped at [segval_dealloc:212 0x12002ed9c]
    
        212                 printf("Deallocating segval pointer value
    [%x]\n", s);
    
    (ladebug) step
    
    Deallocating segval pointer value [b37a6008]
    
    stopped at [segval_dealloc:213 0x12002edb8]
    
        213                 fflush(stdout);
    
    (ladebug) step
    
    stopped at [segval_dealloc:214 0x12002edd0]
    
        214                 free(s);
    
    (ladebug) step
    
    Thread received signal SEGV
    
    stopped at [free: ??? 0x3ff810323e4]
    
    (ladebug) listobj
    
    ObjectName                        Start Addr           Size       
    Symbols
    
                                                        (bytes)        
    Loaded
    
    ----------------------------------------------------------------------------
    
    rptbal_create                     0x120000000         303104         
    Yes
    
    /usr/shlib/libm.so              0x3ff80800000         991232         
    Yes
    
    /usr/local/sybase11/lib/libsybdb.so
    
                                    0x3ffbff40000         786432         
    Yes
    
    /usr/shlib/libpthreads.so
    
                                    0x3ff81000000         311296         
    Yes
    
    /usr/shlib/libmach.so           0x3ff81800000          65536         
    Yes
    
    /usr/shlib/libc_r.so            0x3ff82000000         589824         
    Yes
    
    /usr/shlib/libc.so              0x3ff82800000         925696         
    Yes
    
    (ladebug) where
    
    >0  0x3ff810323e4 in free(0x3ffc28002b8, 0xffffffffb37a6008, 0x40, 0x0,
    0x0, 0x8) DebugInformationStrippedFromFile92:???
    
    #1  0x12002edd8 in segval_dealloc(s=-1283825656)
    /usr/project/cord/rptdb/c/m3_alloc.c:214
    
    #2  0x12003d3d8 in asset_type1(arg=0x0)
    /usr/project/cord/rptdb/c/netting.c:2083
    
    #3  0x3ff8104285c in /usr/shlib/libpthreads.so
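
The values in this trace are numerically related: the program's own printf with %x shows b37a6008, ladebug reports s=-1283825656, and free() receives 0xffffffffb37a6008. All three are what results when only the low 32 bits of an address survive and are then sign-extended back to 64 bits. The sketch below checks that arithmetic and nothing more. Whether the application really passes a pointer through a 32-bit int somewhere is an assumption that this thread does not confirm, and the starting address 0x00000001b37a6008 used in the usage note is hypothetical.

```c
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit address and interpret them as a
   signed 32-bit integer, as happens when a pointer is stored in an int on
   a platform with 64-bit pointers and 32-bit ints. */
int32_t pointer_as_int(uint64_t addr)
{
    uint32_t low = (uint32_t)(addr & 0xffffffffu);

    if (low >= 0x80000000u)
        return (int32_t)((int64_t)low - 4294967296LL);  /* two's complement value */
    return (int32_t)low;
}

/* Widen the int back to 64 bits; the sign bit fills the upper half. */
uint64_t int_as_pointer(int32_t value)
{
    return (uint64_t)(int64_t)value;
}
```

With the hypothetical address, `pointer_as_int(0x00000001b37a6008ULL)` yields -1283825656 and `int_as_pointer(-1283825656)` yields 0xffffffffb37a6008, matching the figures in the trace. Printing pointers with %p rather than %x would show all 64 bits in the application's diagnostic output.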
    
    
    
    
    
    
    
    

9986.5. "SYBASE parameters" by NESBIT::BGIRVAN () Mon Jun 02 1997 06:41 (276 lines)

    And here are the SYBASE initialization parameters.
    
    
    bggibx0015 # pwd
    
    /usr/local/sybase
    
    bggibx0015 # ls *.cfg
    
    RMTGLSPRD01.cfg
    
    bggibx0015 # cat RMTGLSPRD01.log
    
    cat: cannot open RMTGLSPRD01.log
    
    bggibx0015 # cat RMTGLSPRD01.cfg
    
    ##############################################################################
    
    #
    
    #               Configuration File for the Sybase SQL Server
    
    #
    
    #               Please read the System Administration Guide (SAG)
    
    #               before changing any of the values in this file.
    
    #
    
    ##############################################################################
    
     
    
     
    
     
    
    [Configuration Options]
    
     
    
    [General Information]
    
     
    
    [Backup/Recovery]
    
            recovery interval in minutes = DEFAULT
    
            print recovery information = DEFAULT
    
            tape retention in days = DEFAULT
    
     
    
    [Cache Manager]
    
            number of oam trips = DEFAULT
    
            number of index trips = DEFAULT
    
            procedure cache percent = DEFAULT
    
            memory alignment boundary = DEFAULT
    
     
    
    [Named Cache:default data cache]
    
            cache size = DEFAULT
    
            cache status = default data cache
    
     
    
    [4K I/O Buffer Pool]
    
            pool size = 20.0000M
    
            wash size = DEFAULT
    
     
    
    [16K I/O Buffer Pool]
    
            pool size = 20.0000M
    
            wash size = DEFAULT
    
     
    
    [Disk I/O]
    
            allow sql server async i/o = DEFAULT
    
            disk i/o structures = DEFAULT
    
            page utilization percent = DEFAULT
    
            number of devices = 25
    
            disable character set conversions = DEFAULT
    
     
    
    [Network Communication]
    
            default network packet size = DEFAULT
    
            max network packet size = DEFAULT
    
            remote server pre-read packets = DEFAULT
    
            number of remote connections = DEFAULT
    
            allow remote access = DEFAULT
    
            number of remote logins = DEFAULT
    
            number of remote sites = DEFAULT
    
            max number network listeners = DEFAULT
    
            tcp no delay = DEFAULT
    
            allow sendmsg = DEFAULT
    
            syb_sendmsg port number = DEFAULT
    
     
    
    [O/S Resources]
    
            max async i/os per engine = 1024
    
            max async i/os per server = 1024
    
     
    
    [Physical Resources]
    
     
    
    [Physical Memory]
    
            total memory = 256000
    
            additional network memory = DEFAULT
    
            lock shared memory = DEFAULT
    
            shared memory starting address = DEFAULT
    
     
    
    [Processors]
    
            max online engines = DEFAULT
    
            min online engines = DEFAULT
    
     
    
    [SQL Server Administration]
    
            number of open objects = 800
    
            number of open databases = 30
    
            audit queue size = DEFAULT
    
            default database size = DEFAULT
    
            identity burning set factor = DEFAULT
    
            allow nested triggers = DEFAULT
    
            allow updates to system tables = DEFAULT
    
            print deadlock information = DEFAULT
    
            default fill factor percent = DEFAULT
    
            number of mailboxes = DEFAULT
    
            number of messages = DEFAULT
    
            number of alarms = DEFAULT
    
            number of pre-allocated extents = DEFAULT
    
            event buffers per engine = DEFAULT
    
            cpu accounting flush interval = DEFAULT
    
            i/o accounting flush interval = DEFAULT
    
            sql server clock tick length = DEFAULT
    
            runnable process search count = DEFAULT
    
            i/o polling process count = DEFAULT
    
            time slice = DEFAULT
    
            deadlock retries = DEFAULT
    
            cpu grace time = 200
    
            number of sort buffers = DEFAULT
    
            sort page count = DEFAULT
    
            number of extent i/o buffers = DEFAULT
    
            size of auto identity column = DEFAULT
    
            identity grab size = DEFAULT
    
            lock promotion HWM = DEFAULT
    
            lock promotion LWM = DEFAULT
    
            lock promotion PCT = DEFAULT
    
            housekeeper free write percent = DEFAULT
    
            partition groups = DEFAULT
    
            partition spinlock ratio = DEFAULT
    
     
    
    [User Environment]
    
            number of user connections = 200
    
            stack size = DEFAULT
    
            stack guard size = DEFAULT
    
            systemwide password expiration = DEFAULT
    
            permission cache entries = DEFAULT
    
            user log cache size = DEFAULT
    
            user log cache spinlock ratio = DEFAULT
    
     
    
    [Lock Manager]
    
            number of locks = DEFAULT
    
            deadlock checking period = DEFAULT
    
            freelock transfer block size = DEFAULT
    
            max engine freelocks = DEFAULT
    
            address lock spinlock ratio = DEFAULT
    
            page lock spinlock ratio = DEFAULT
    
            table lock spinlock ratio = DEFAULT
    
    bggibx0015 #
    
    bggibx0015 #