
Conference orarep::nomahs::rdb_60

Title: Oracle Rdb - Still a strategic database for DEC on Alpha AXP!
Notice: RDB_60 is archived, please use RDB_70..
Moderator: NOVA::SMITHISON
Created: Fri Mar 18 1994
Last Modified: Fri May 30 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 5118
Total number of notes: 28246

4961.0. "diofetch$fetch_one_line and corrupt frag ptrs." by M5::PSOEHL (Go see THE RELIC!!!) Fri Jan 24 1997 18:59

    
Customer was getting bugcheck dumps @ 
DIOFETCH$FETCH_ONE_LINE + 3B3.  This only occurred on one table. Customer
realized that some subset of the 7k rows in this table caused the problem. 
Other processes could continue to work against the rest of the database. 

A verify of the area returns the following:

--------------------------------------------------------------

%RMU-W-BADFRACHN, area MOVERCVAREA, page 2010, line 12
                  unexpected non-secondary fragment
                  storage record UNKNOWN,
%RMU-I-FRACHNPOS, pointed to by fragment on page 2010, line 12
%RMU-E-ERRGATFRG, error gathering fragmented record at 33:2010:12
	
----------------------------------------------------------------

If I go into alter and display page 2010 line 12 I get the following:

                            8049  151C  line 12 (33:2010:12) record type 73
                   000007D9 0011  151E  primary fragment, next is 33:2009:17
                            003C  1524  total record length is 60 bytes
                          000001  1526  control '...'
38434530313832313931303031000122  1529  data '"..1001912810EC8'
0000A000996E41B13531313037393931  1539  data '199701151An.....'
                      9F21E041B7  1549  data '7A`!.'

----------------------------------------------------------------
The "pointed to" line number:

                            0049  13C0  line 17 (33:2009:17) record type 73
                         00 0001  13C2  Control information
                                  ....  57 bytes of static data
33434B31413433313331303031000116  13C5   data '...10013134A1KC3'
9F8480414E0300873531313037393931  13D5   data '19970115...NA...'
BC414E0300838480414E030082010000  13E5   data '......NA.....NA<'
              0000FC4E4E04009F7A  13F5   data 'z...NN|..'

----------------------------------------------------------------


As I understand, the second line should say "secondary fragment" if it truly
is.  Looking at the data and the definition of the table, it seems to be 
a record, not a fragment. 
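
The hex fields in the two dumps above can be cross-checked against their annotations. This is only a sketch: the field layout (a hex page number followed by a hex line number, and a record type in the low 15 bits of the line header) is inferred from the dump annotations, not from any Rdb documentation, and the helper name is made up for illustration. The area number 33 is copied from the dump label rather than decoded.

```python
def decode_next_fragment(area, page_hex, line_hex):
    """Build an area:page:line dbkey string from hex page and line fields.

    Hypothetical helper: the layout is inferred from the ALTER dump above.
    """
    return f"{area}:{int(page_hex, 16)}:{int(line_hex, 16)}"


# Fields copied from the dump of 33:2010:12
next_dbkey = decode_next_fragment(33, "000007D9", "0011")
total_length = int("003C", 16)

# Line headers of 33:2010:12 (8049) and 33:2009:17 (0049).
# Masking off the top bit is an assumption made for this comparison.
type_primary = int("8049", 16) & 0x7FFF
type_target = int("0049", 16) & 0x7FFF

print(next_dbkey)                  # 33:2009:17
print(total_length)                # 60
print(type_primary, type_target)   # 73 73
```

The decoded values agree with the annotations: the primary fragment points at 33:2009:17, the record length is 60 bytes, and both lines carry record type 73. The only difference between the two line headers is the top bit, which may be what marks a line as fragmented, though that is a guess.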

We tried to unload the table and, guess what, we couldn't. Bugcheck. A select *
caused a bugcheck.  We couldn't drop the table. Bugcheck.  My guess was that an
export/import would bugcheck and that would have taken a long time. 

I suggested that we restore from backup and rollforward to the time before the
processes started dumping. That would have involved losing several hours worth
of data, and they didn't want to do that because it would have cost them LOTS
of money in lost financial transactions and fines from the FED.  

What they did was to extract the defs, unload all of the other tables, unload
all of this table but one row (via an application), drop the db (deleting all
of the files), recreate based on the extract and rebuild the database. 

This seems to be closely related to notes 3850 and 3342.  I read the bug for 
3342 (324435).  It seems that since it was not a reproducible case, there was
no resolution.  I asked for the backup that they did before the problem
occurred and the aij, and they ended up having to delete it as a result of 
all of the unloads.  I've got a backup of the db from after the corruption
occurred and have restored it on BLANCA. 

I don't know what we can do without a reproducible case, but since this is a
gold customer, I'm raising the issue.  

Is there anything that we can do with this?

TIA
                                                              




    
4961.1. by DUCATI::LASTOVICA (Is it possible to be totally partial?) Fri Jan 24 1997 19:25
presuming that they are already at the current version of Rdb (you
didn't mention it), it might be interesting to dump the AIJ file and
then look for all references to the DBKEY(s) of interest.
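
A rough sketch of that search, assuming the journal has already been dumped to a text file and that dbkeys show up in it in the same area:page:line form the verify output uses. The real dump layout may well differ, the function name is made up, and the sample lines below are placeholder text, not actual AIJ dump output.

```python
import re

# Matches area:page:line triples such as 33:2010:12
DBKEY_RE = re.compile(r"\b(\d+):(\d+):(\d+)\b")


def find_dbkey_lines(lines, wanted):
    """Yield (line_number, text) for every line mentioning a wanted dbkey."""
    wanted = set(wanted)
    for number, text in enumerate(lines, start=1):
        keys = {tuple(int(part) for part in match.groups())
                for match in DBKEY_RE.finditer(text)}
        if keys & wanted:
            yield number, text.rstrip("\n")


# Placeholder lines standing in for a text dump (NOT real AIJ output)
sample = [
    "record stored at 33:2010:12\n",
    "record stored at 33:2011:3\n",
    "fragment written to 33:2009:17\n",
]
hits = list(find_dbkey_lines(sample, {(33, 2010, 12), (33, 2009, 17)}))
for number, text in hits:
    print(number, text)
```

With a real dump, `sample` would be replaced by an open file object, and the two dbkeys of interest here would be 33:2010:12 and 33:2009:17.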
4961.2. by M5::PSOEHL (Go see THE RELIC!!!) Fri Jan 24 1997 20:03
    You're right I didn't say what version:  it's 6.1-4.  
    
    As I did mention in my note, I would like to see the AIJ as well, but
    regrettably they blew it away.  
    
    Thanks
4961.3. by HOTRDB::LASTOVICA (Is it possible to be totally partial?) Sat Jan 25 1997 01:01
    >    regrettably they blew it away.
    
    	along with all hope of figuring anything out.
4961.4. "Progress ..." by NOVA::JIANG (Oracle Corporation (603) 881-0815) Sat Jan 25 1997 10:06
    FYI, a day-one problem in collect locked space code has been found in
    7.0 code and the fix has been backported to 6.0 and 6.1. 
    
    Our lab tests so far showed very positive results with heavy loads and
    DELPRC. The fix will be available in the next ECO.
4961.5. "i hope we found it..." by NOVA::SMITHI (Don't understate or underestimate Rdb!) Sat Jan 25 1997 15:19
I'd just like to set correct expectations here.  This (and other reports) of
occasional corruption have been very hard to track down.  The problem we just
discovered *could* produce the type of problem described by this note. 
Obviously without an AIJ we can not say for sure.  However, we have some
optimism.

The problem we found deals with collecting lock free space, in the area of
reusing the LDX/TSN index entries.  Under high load environments it is
possible (although extremely unlikely) that two different processes would
reuse the same line on a page.  This requires (a) high concurrency, (b)
multiple processes trying to use the same *page* and (c) the right conditions
present in the LDX vector itself.  This final point in particular is why this
problem is so rare.
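
To picture the general shape of such a race, here is a toy model of two processes choosing a free line on one page. It is purely illustrative: the `Page` class and both functions are invented for this sketch and have nothing to do with the actual LDX/TSN code or with the fix itself.

```python
class Page:
    """Toy page holding a fixed number of line slots (None means free)."""

    def __init__(self, nlines):
        self.lines = [None] * nlines

    def find_free(self):
        # Index of the first free line on the page
        return self.lines.index(None)

    def store(self, line, owner):
        self.lines[line] = owner


def racy_interleaving():
    """Both processes look for a free line before either one stores."""
    page = Page(4)
    a = page.find_free()
    b = page.find_free()      # B looks before A has stored anything
    page.store(a, "A")
    page.store(b, "B")        # B overwrites A's record on the same line
    return a, b, page.lines


def serialized():
    """Each process finishes its find-and-store before the next one starts."""
    page = Page(4)
    a = page.find_free()
    page.store(a, "A")
    b = page.find_free()
    page.store(b, "B")
    return a, b, page.lines


print(racy_interleaving())   # (0, 0, ['B', None, None, None])
print(serialized())          # (0, 1, ['A', 'B', None, None])
```

In the racy interleaving both processes end up with line 0 and one record silently replaces the other, while the serialized version gives each process its own line.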

I believe this problem has existed for many versions. It is only in the last
year with faster processors and faster I/O that the symptoms have appeared.

Thanks go to Rick and Richard (and others) for working so hard on these
problems.

Ian
4961.6. by M5::DGROBERT Thu Jan 30 1997 14:54
    Would the bugcheck dump that Pat has for this customer be of any value
    to look at?  There were multiple processes writing to this page at the
    time.  The records that were targeted to this page and fragmented
    wrote to nearby pages in the buffer.  Actually it is the fragment page
    that is missing the rest of the record.  There were multiple records
    that fragmented and the fragments went to the same two pages prior to
    the target page.  If we can determine that the suspected problem caused
    this customer's corruption it would be great.  It would be even greater
    if we could get them a fix.  This cost them a lot, my hide will grow
    back.
4961.7. by HOTRDB::PMEAD (Paul, [email protected], 719-577-8032) Thu Jan 30 1997 15:29
    If you have an AIJ then there could be some interesting tidbits. 
    Realistically, there is little chance we will really be able to say for
    sure what happened.
4961.8. "May have db & aijs for this..." by M5::BLITTIN Tue Feb 18 1997 11:29
    
    I've got a ct with the same bugcheck, diff offset (+34D) with the db
    and aijs available.  Waiting to see if they will be able to send these
    on tape.  RDB version is 6.0-1, but believe info may be of help.  Will
    post updates...
4961.9. "+8E4" by M5::BLITTIN Thu Feb 20 1997 10:01
    
    Another ct called yesterday with same bugcheck (+8E4).  Running
    a report and can dup it from interactive sql.  Bugcheck available. 
    Also on an Alpha OpenVMS 6.1; RDB6.1-04.
4961.10. by M5::DGROBERT Mon Feb 24 1997 12:51
    It would seem to me that it's possible, if the conditions are right, that
    entire record(s) could be lost due to this problem.  We are only
    seeing it discovered/reported on the retrieval of fragmented records. 
    Ouch!  How does rmu/recover of an aij that contains records referencing
    the same line react?  Does it die with a specific error or write the
    last entry referencing the line?  Anyone write an ALERT on this yet?
4961.11. "Another one" by svrav1.au.oracle.com::MBRADLEY (I was dropped on my head as a baby. What's your excuse?) Wed Mar 05 1997 01:17
I have one of these on 6.0-12.

Do we have a projection on what version/ECO may resolve the known problem?

Is Eng. interested in the DB and AIJ (which would be fairly sizable in this 
case - 8M rows for the table)?

Cheers,

Mark.
4961.12. "Another one bites the dust" by NOMAHS::SECRIST (Rdb WWS; [email protected]) Thu Mar 13 1997 16:20
    
    I've got a customer who just got bit for the second time by this,
    allegedly in the same table !  Rdb V6.1-1 and VMS 6.1 with ACMS.  
    Always at DIOFETCH$FETCH_ONE_LINE + 3A5.  Has anyone submitted a
    reproducible case yet ?
    
    Regards,
    rcs
    
4961.13. "diofetch$fetch_one_line + 3a5" by NOMAHS::SECRIST (Rdb WWS; [email protected]) Fri Mar 14 1997 15:58
    
    BUGCHECK AT DIOFETCH$FETCH_ONE_LINE + 3A5 has bit the same table
    twice, and contention and fragmentation are a factor just like
    in bug 352454, only this is a VAX.  This may be worthy of an
    attempt to reproduce it if anyone wants the table information,
    etc.
    
    Regards,
    rcs
4961.14. "Which ECO please" by svrav1.au.oracle.com::MBRADLEY (I was dropped on my head as a baby. What's your excuse?) Sun Mar 16 1997 21:57
>     <<< Note 4961.4 by NOVA::JIANG "Oracle Corporation (603) 881-0815" >>>
>                               -< Progress ... >-
>
>    FYI, a day-one problem in collect locked space code has been found in
>    7.0 code and the fix has been backported to 6.0 and 6.1. 
>    
>    Our lab tests so far showed very positive results with heavy loads and
>    DELPRC. The fix will be available in the next ECO.

I have seen this on 6.0-12, and the customer was wondering which ECO may 
fix the problem?

Thanks,

Mark.
4961.15. "Anybody seen this in V7.0 ?" by NOMAHS::SECRIST (Rdb WWS; [email protected]) Mon Mar 17 1997 09:22
    
    ; I have seen this on 6.0-12, and the customer was wondering which
    ; ECO may fix the problem? 
    
    I have seen this on 6.1-1 and I have a customer that wonders that
    same thing ;-)
    
    Regards,
    rcs
    
4961.16. by M5::LWILCOX (Chocolate in January!!) Tue Mar 18 1997 13:30
    <<< Note 4961.13 by NOMAHS::SECRIST "Rdb WWS; [email protected]" >>>
                       -< diofetch$fetch_one_line + 3a5 >-

    
>>    This may be worthy of an
>>    attempt to reproduce it if anyone wants the table information,
>>    etc.
    
Richard, I suspect that *you* will be the one tasked with this if it needs
to be bugged.

:-).