
Conference orarep::nomahs::rdb_60

Title:	Oracle Rdb - Still a strategic database for DEC on Alpha AXP!
Notice:	RDB_60 is archived, please use RDB_70..
Moderator:	NOVA::SMITHISON

Created:	Fri Mar 18 1994
Last Modified:	Thu May 29 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5118
Total number of notes:	28246

4961.0. "diofetch$fetch_one_line and corrupt frag ptrs." by M5::PSOEHL (Go see THE RELIC!!!) Fri Jan 24 1997 18:59

    
Customer was getting bugcheck dumps @ 
DIOFETCH$FETCH_ONE_LINE + 3B3.  This only occurred on one table. Customer
realized that some subset of the 7k rows in this table caused the problem. 
Other processes could continue to work against the rest of the database. 

A verify of the area returns the following:

--------------------------------------------------------------

%RMU-W-BADFRACHN, area MOVERCVAREA, page 2010, line 12
                  unexpected non-secondary fragment
                  storage record UNKNOWN,
%RMU-I-FRACHNPOS, pointed to by fragment on page 2010, line 12
%RMU-E-ERRGATFRG, error gathering fragmented record at 33:2010:12
	
----------------------------------------------------------------
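
The locations named in the verify messages are the ones worth tracking through any later dump or journal search. A minimal sketch, assuming the verify output has been saved as plain text, that collects them (the function name and regular expressions are illustrative, not part of RMU):

```python
import re

# Sample text copied from the verify output above.
VERIFY_OUTPUT = """\
%RMU-W-BADFRACHN, area MOVERCVAREA, page 2010, line 12
                  unexpected non-secondary fragment
                  storage record UNKNOWN,
%RMU-I-FRACHNPOS, pointed to by fragment on page 2010, line 12
%RMU-E-ERRGATFRG, error gathering fragmented record at 33:2010:12
"""

# A dbkey as displayed in these messages: area:page:line, all decimal.
DBKEY_RE = re.compile(r"\b(\d+):(\d+):(\d+)\b")
# The "page N, line M" form used by the other messages.
PAGE_LINE_RE = re.compile(r"page (\d+), line (\d+)")


def suspect_locations(text):
    """Return (dbkeys, page_lines) mentioned in the verify messages."""
    dbkeys = {tuple(int(g) for g in m.groups())
              for m in DBKEY_RE.finditer(text)}
    page_lines = {tuple(int(g) for g in m.groups())
                  for m in PAGE_LINE_RE.finditer(text)}
    return dbkeys, page_lines


dbkeys, page_lines = suspect_locations(VERIFY_OUTPUT)
print(sorted(dbkeys))
print(sorted(page_lines))
```

The patterns only cover the message shapes shown here; other RMU messages may format locations differently.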

If I go into alter and display page 2010 line 12 I get the following:

                            8049  151C  line 12 (33:2010:12) record type 73
                   000007D9 0011  151E  primary fragment, next is 33:2009:17
                            003C  1524  total record length is 60 bytes
                          000001  1526  control '...'
38434530313832313931303031000122  1529  data '"..1001912810EC8'
0000A000996E41B13531313037393931  1539  data '199701151An.....'
                      9F21E041B7  1549  data '7A`!.'

----------------------------------------------------------------
The "pointed to" line number:

                            0049  13C0  line 17 (33:2009:17) record type 73
                         00 0001  13C2  Control information
                                  ....  57 bytes of static data
33434B31413433313331303031000116  13C5   data '...10013134A1KC3'
9F8480414E0300873531313037393931  13D5   data '19970115...NA...'
BC414E0300838480414E030082010000  13E5   data '......NA.....NA<'
              0000FC4E4E04009F7A  13F5   data 'z...NN|..'

----------------------------------------------------------------


As I understand, the second line should say "secondary fragment" if it truly
is.  Looking at the data and the definition of the table, it seems to be 
a record, not a fragment. 
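
The two displays can be cross-checked by hand. The sketch below decodes the hex words copied from them. The fragment flag and record-type mask are assumptions inferred only from these two lines (8049 versus 0049, both shown as record type 73), not taken from Rdb documentation:

```python
# Values copied from the two displays above.
LINE_12_HEADER = 0x8049   # line 12 (33:2010:12)
LINE_17_HEADER = 0x0049   # line 17 (33:2009:17)
NEXT_PAGE = 0x000007D9    # from "000007D9 0011 ... next is 33:2009:17"
NEXT_LINE = 0x0011
TOTAL_LENGTH = 0x003C     # "total record length is 60 bytes"

# ASSUMPTION: the high bit marks a fragmented line and the remaining
# bits carry the record type. Inferred from this dump only.
FRAGMENT_FLAG = 0x8000
RECORD_TYPE_MASK = 0x7FFF


def decode_header(word):
    """Split a line header word into the assumed flag and record type."""
    return {
        "fragmented": bool(word & FRAGMENT_FLAG),
        "record_type": word & RECORD_TYPE_MASK,
    }


print(decode_header(LINE_12_HEADER))       # pointing line
print(decode_header(LINE_17_HEADER))       # pointed-to line
print(NEXT_PAGE, NEXT_LINE, TOTAL_LENGTH)  # pointer and length in decimal
```

Under that assumption, line 12 carries the flag and points at page 2009, line 17, while line 17 has the same record type with no flag set. That would be consistent with line 17 holding an ordinary record rather than a secondary fragment.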

We tried to unload the table and, guess what, we couldn't. Bugcheck. A select *
caused a bugcheck.  We couldn't drop the table. Bugcheck.  My guess was that an
export/import would bugcheck and that would have taken a long time. 

I suggested that we restore from backup and rollforward to the time before the
processes started dumping. That would have involved losing several hours worth
of data, and they didn't want to do that because it would have cost them LOTS
of money in lost financial transactions and fines from the FED.  

What they did was to extract the defs, unload all of the other tables, unload
all of this table but one row (via an application), drop the db (deleting all
of the files), recreate based on the extract and rebuild the database. 

This seems to be closely related to notes 3850 and 3342.  I read the bug for
3342 (324435).  It seems that since it was not a reproducible case, there was
no resolution.  I asked for the backup that they did before the problem
occurred and the aij, and they ended up having to delete it as a result of 
all of the unloads.  I've got a backup of the db from after the corruption
occurred and have restored it on BLANCA. 

I don't know what we can do without a reproducible case, but since this is a
gold customer, I'm raising the issue.  

Is there anything that we can do with this?

TIA

T.R	Title	User	Personal Name	Date	Lines
4961.1		DUCATI::LASTOVICA	Is it possible to be totally partial?	`Fri Jan 24 1997 19:25`	3
	presuming that they are already at the current version of Rdb (you didn't mention it), it might be interesting to dump the AIJ file and then look for all references to the DBKEY(s) of interest.
4961.2		M5::PSOEHL	Go see THE RELIC!!!	`Fri Jan 24 1997 20:03`	6
	You're right I didn't say what version: it's 6.1-4. As I did mention in my note, I would like to see the AIJ as well, but regrettably they blew it away. Thanks
4961.3		HOTRDB::LASTOVICA	Is it possible to be totally partial?	`Sat Jan 25 1997 01:01`	3
	> regrettably they blew it away.
	along with all hope of figuring anything out.
4961.4	Progress ...	NOVA::JIANG	Oracle Corporation (603) 881-0815	`Sat Jan 25 1997 10:06`	5
	FYI, a day-one problem in collect locked space code has been found in 7.0 code and the fix has been backported to 6.0 and 6.1. Our lab tests so far showed very positive results with heavy loads and DELPRC. The fix will be available in the next ECO.
4961.5	i hope we found it...	NOVA::SMITHI	Don't understate or underestimate Rdb!	`Sat Jan 25 1997 15:19`	21
	I'd just like to set correct expectations here. This (and other reports) of occasional corruption have been very hard to track down. The problem we just discovered could produce the type of problem described by this note. Obviously without an AIJ we can not say for sure. However, we have some optimism. The problem we found deals with collecting lock free space, in the area of reusing the LDX/TSN index entries. Under high load environments it is possible (although extremely unlikely) that two different processes would reuse the same line on a page. This requires (a) high concurrency, (b) multiple processes trying to use the same page and (c) the right conditions present in the LDX vector itself. This final point in particular is why this problem is so rare. I believe this problem has existed for many versions. It is only in the last year with faster processors and faster I/O that the symptoms have appeared. Thanks go to Rick and Richard (and others) for working so hard on these problems. Ian
4961.6		M5::DGROBERT		`Thu Jan 30 1997 14:54`	10
	Would the bugcheck dump that Pat has for this customer be of any value to look at? There were multiple processes writing to this page at the time. The records that were targeted to this page and fragmented wrote to nearby pages in the buffer. Actually it is the fragment page that is missing the rest of the record. There were multiple records that fragmented and the fragments went to the same two pages prior to the target page. If we can determine that the suspected problem caused this customer's corruption it would be great. It would be even greater if we could get them a fix. This cost them a lot, my hide will grow back.
4961.7		HOTRDB::PMEAD	Paul, [email protected], 719-577-8032	`Thu Jan 30 1997 15:29`	3
	If you have an AIJ then there could be some interesting tidbits. Realistically, there is little chance we will really be able to say for sure what happened.
4961.8	May have db & aijs for this...	M5::BLITTIN		`Tue Feb 18 1997 11:29`	5
	I've got a ct with the same bugcheck, diff offset (+34D) with the db and aijs available. Waiting to see if they will be able to send these on tape. RDB version is 6.0-1, but believe info may be of help. Will post updates...
4961.9	+8E4	M5::BLITTIN		`Thu Feb 20 1997 10:01`	4
	Another ct called yesterday with same bugcheck (+8E4). Running a report and can dup it from interactive sql. Bugcheck available. Also on an Alpha OpenVMS 6.1; RDB6.1-04.
4961.10		M5::DGROBERT		`Mon Feb 24 1997 12:51`	6
	It would seem to me that it's possible, if the conditions are right, that entire record(s) could be lost due to this problem. We are only seeing it discovered/reported on the retrieval of fragmented records. Ouch! How does rmu/recover of an aij that contains records referencing the same line react? Does it die with a specific error or write the last entry referencing the line? Anyone write an ALERT on this yet?
4961.11	Another one	svrav1.au.oracle.com::MBRADLEY	I was dropped on my head as a baby. What's your excuse?	`Wed Mar 05 1997 01:17`	10
	I have one of these on 6.0-12. Do we have a projection on what version/ECO may resolve the known problem? Is Eng. interested in the DB and AIJ (which would be fairly sizable in this case - 8M rows for the table)? Cheers, Mark.
4961.12	Another one bites the dust	NOMAHS::SECRIST	Rdb WWS; [email protected]	`Thu Mar 13 1997 16:20`	9
	I've got a customer who just got bit for the second time by this, allegedly in the same table ! Rdb V6.1-1 and VMS 6.1 with ACMS. Always at DIOFETCH$FETCH_ONE_LINE + 3A5. Has anyone submitted a reproducible case yet ? Regards, rcs
4961.13	diofetch$fetch_one_line + 3a5	NOMAHS::SECRIST	Rdb WWS; [email protected]	`Fri Mar 14 1997 15:58`	9
	BUGCHECK AT DIOFETCH$FETCH_ONE_LINE + 3A5 has bit the same table twice, and contention and fragmentation are a factor just like in bug 352454, only this is a VAX. This may be worthy of an attempt to reproduce it if anyone wants the table information, etc. Regards, rcs
4961.14	Which ECO please	svrav1.au.oracle.com::MBRADLEY	I was dropped on my head as a baby. What's your excuse?	`Sun Mar 16 1997 21:57`	15
	> <<< Note 4961.4 by NOVA::JIANG "Oracle Corporation (603) 881-0815" >>>
	> -< Progress ... >-
	>
	> FYI, a day-one problem in collect locked space code has been found in
	> 7.0 code and the fix has been backported to 6.0 and 6.1.
	>
	> Our lab tests so far showed very positive results with heavy loads and
	> DELPRC. The fix will be available in the next ECO.

	I have seen this on 6.0-12, and the customer was wondering which ECO may fix the problem?

	Thanks, Mark.
4961.15	Anybody seen this in V7.0 ?	NOMAHS::SECRIST	Rdb WWS; [email protected]	`Mon Mar 17 1997 09:22`	10
	; I have seen this on 6.0-12, and the customer was wondering which
	; ECO may fix the problem?

	I have seen this on 6.1-1 and I have a customer that wonders that same thing ;-)

	Regards, rcs
4961.16		M5::LWILCOX	Chocolate in January!!	`Tue Mar 18 1997 13:30`	12
	<<< Note 4961.13 by NOMAHS::SECRIST "Rdb WWS; [email protected]" >>>
	-< diofetch$fetch_one_line + 3a5 >-

	>> This may be worthy of an
	>> attempt to reproduce it if anyone wants the table information,
	>> etc.

	Richard, I suspect that you will be the one tasked with this if it needs to be bugged. :-).