
Conference eps::oracle

Title:	Oracle
Notice:	For product status see topics: UNIX 1008, OpenVMS 1009, NT 1010
Moderator:	EPS::VANDENHEUVEL

Created:	Fri Aug 10 1990
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1574
Total number of notes:	4428

1541.0. "Oracle OPS + DU 4.0B + TCR 1.4 problems" by VAXRIO::LEO () Mon Apr 07 1997 13:37

Hi ,

	This is Leo from Digital Brazil.

	I have just installed a TCR Production Server plus OPS environment at
a very large medical insurance company called UNIMED.

	That was 2 weeks ago. Until one week ago everything was just fine, but
since then some strange problems have started.

	The Digital UNIX version is 4.0B, the Oracle version is 7.3.2.3.0 and
the TCR Production Server version is 1.4.

	The hardware consists of 2 AlphaServer 4100 systems, each with one
5/400 CPU and 512 MB of memory.

	The Memory Channel boards are correctly installed and working properly.
I have checked their position on the PCI bus and the VH jumpers as well.

	The KZPSA adapters are working fine; I have verified them using the
cnfgdiag utility that comes on the board diskette.

	The clu_ivp and cnxshow commands report that the cluster is fine.
The tie breaker disk is properly configured.

	ntpsetup was run and the clocks of both machines are well
synchronized.

	I have defined IP addresses for the Memory Channel and for the systems
(using different subnet masks).

	I created some DRD services using shared raw devices. In the near 
future the customer will start to use LSM volumes on the DRD services instead
of using just raw partitions. LSM will be really necessary in order to have
more partitions available to Oracle objects.

	If I turn off one machine, all the services that were served by that
machine become available on the other one, and everything looks fine from the
cluster's perspective.
	
	I have verified and tested all the shared and local disks and all of 
them are ok.

	Two weeks ago, when I installed OPS, I faced the first problem.

	On both Oracle accounts I entered the following command :

	svrmgrl > startup parallel

	The two instances (of the same database) were started but the svrmgrl
prompts did not return. The instances were not opened.

	I have contacted Oracle and received a patch that fixed it.

	After that everything was working fine.

	But as I said, one week later something strange started to happen.

	Oracle has started to lose indexes and data. Indexes are crashing
frequently.

	During the first week the level of concurrency was not very high. A lot
of heavy tasks were being run, but there was not much concurrency among them.

	During the second week (when the problem started) more users began
to use the environment, and there is now much more concurrency on the shared
disks.

	At least twice a day Oracle is still losing and crashing indexes and
data.

	All the patches available for Digital UNIX 4.0B have already been
applied at the customer site, but I didn't see any among them that could
really be related to this kind of problem.

	I contacted Oracle and some Oracle patches were applied as well.

	The only message that I got from Digital UNIX is :

	cdisk_op_spin ...
	unit reserved ...

	I have checked this message and it seems to be a normal one. It seems
to be just a warning (patrol agent) saying that this disk is being served
by the other machine.

	Have you ever seen a problem like this before ?

	Do you think there is any fix (patch) available for
Oracle 7.3.2.3.0 for this behavior ?

	And Digital UNIX fixes ?

	This cluster holds UNIMED's entire production environment, which
means that they are really worried about this.

	I have already applied all these 6 Oracle patches :

	424307 - fixes a svrmgrm bug
	425425 - fixes the 424581 base bug
	433173 - fixes the 350174 base bug
	397524 - No description found
	424355 - It seems to be equivalent to 406711 (DU 4.0A). The only
		 difference seems to be that this one is for version 4.0B.
		 This patch is intended to solve OPS problems in our cluster
		 environment.
	420001 - It seems to be equivalent to 396674 (DU 4.0A). The only
		 difference seems to be that this one is for version 4.0B.
		 This patch allows the database to be opened on both machines
		 after issuing the "startup parallel" command.

	Is there any other important patch to be applied ?
	
	What else can I do ? The production site is completely stopped right
now.
	
	Could you give me any hint ?

	Any help or pointer would be much appreciated.
	
	Best regards,

	Leo
	Digital Technical Support
	Brazil

T.R	Title	User	Date	Lines
1541.1	What is crashing ?	AXPBIZ::RANJAN	Wed Apr 09 1997 12:49	13

	Leo,

	Could you be a little more specific about the table and index crashing
you mentioned ? Is the index getting corrupted ? Or does using an index in a
query never return anything ? Or does it return garbage ? Are there any error
numbers associated ? Did you create the index with the parallel option or the
unrecoverable option ?

	I assume all your DRDs are accessible at all times. Please post the
relevant error messages or data corruption symptoms/messages.

	We have been using OPS for a long time on two 8400s and we haven't
seen any corruption so far.

	- Ranjan.

1541.2	The data corruption is still there ...	VAXRIO::LEO	Wed Apr 09 1997 14:01	43

	Hi Ranjan,

	Thank you for your prompt reply.

> Is this index getting corrupt ?

	Yes.

> Index in a query never returns anything ?

	No, that is not the problem.

> Does it return garbage ?

	No.

> Are there any error numbers associated ?

	Yes. A lot of them. The first one, which seems to generate all the
others, is ORA-600. The most common argument to the ORA-600 error is 12700.
It seems that the indexes start to corrupt and then the corruption cascades.

	No, neither the parallel nor the unrecoverable option is associated
with the indexes.

> Post the relevant error messages ...

	A lot of ORA-600 errors and core files are generated.

	To isolate the problem we have already tried running Oracle in
exclusive mode with the DRD services spread between the 2 cluster members.
It didn't help.

	After that we decided to turn one machine off and relocate all the
DRD services to a single server. That didn't help either.

	We have also tried setting drd-data-compare=3 in the /etc/sysconfigtab
of both servers in order to find out whether the problem was caused by
hardware or by software. The data corruption is still there even with
drd-data-compare=3, which indicates that a software problem is more likely
than a hardware one.

	Do you have any idea ?

	Best regards,

	Leo
	Digital Technical Support

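When many ORA-600 errors are being generated, tallying them by their first argument shows quickly whether one argument (such as the 12700 reported in 1541.2) dominates. The following is only a minimal Python sketch. It assumes the usual "ORA-00600: internal error code, arguments: [...]" line layout, and both the helper name tally_ora600 and the sample lines are made up for illustration; they are not taken from the customer site.

```python
import re
from collections import Counter

# Captures the first bracketed argument of an ORA-600 line, e.g.
# "ORA-00600: internal error code, arguments: [12700], [...]".
# Assumption: the log uses this standard message layout.
ORA600_RE = re.compile(r"ORA-0*600\b.*?arguments:\s*\[([^\]]*)\]")


def tally_ora600(lines):
    """Count ORA-600 occurrences, keyed by their first argument."""
    counts = Counter()
    for line in lines:
        match = ORA600_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "ORA-00600: internal error code, arguments: [12700], [1], [2], [], [], [], [], []",
    "ORA-01555: snapshot too old",
    "ORA-00600: internal error code, arguments: [12700], [3], [4], [], [], [], [], []",
    "ORA-00600: internal error code, arguments: [4519], [5], [], [], [], [], [], []",
]

for argument, count in tally_ora600(sample).most_common():
    print(argument, count)
```

In practice the input would be the lines of the instance's alert or trace files rather than the hard-coded sample list.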
1541.3	Removing DRD	VAXRIO::LEO	Wed Apr 09 1997 15:16	15

	Hi Ranjan,

	What I'm doing right now is storing all Oracle data on raw devices
without using DRD anymore.

	The intent is to check whether there is any compatibility problem
between Oracle 7 and DRD devices.

	Any idea ?

	Best regards,

	Leo
	Digital Technical Support

1541.4		VAXRIO::LEO	Mon Apr 21 1997 13:36	19

	Hi Ranjan,

	I have changed the new-wired-method parameter from 1 (the default on
Digital UNIX 4.0, 4.0A and 4.0B) to 0.

	It seems to have solved the problem. I did it five days ago and so far
so good.

	I think this hint may also fix several data corruption problems I have
faced using Informix and Sybase on Digital UNIX 4.0 or above.

	There is a real shared memory inconsistency problem that arises when
new-wired-method is enabled.

	Best regards,

	Leo
	Digital Technical Support
	Brazil
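
To confirm that both cluster members carry the same setting, the value can be read back from a copy of each server's /etc/sysconfigtab. The sketch below is a minimal Python example, not a supported tool. It assumes the stanza layout of a subsystem name followed by a colon and then "attribute = value" lines, and it assumes that new-wired-method lives in the vm stanza; the thread does not name the subsystem, so check that against your system. The function name read_attribute and the sample text are hypothetical.

```python
def read_attribute(text, subsystem, attribute):
    """Return the value of `attribute` inside the `subsystem` stanza, or None."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and (assumed) '#' comment lines.
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # A stanza header such as "vm:" starts a new subsystem.
            current = line[:-1].strip()
        elif current == subsystem and "=" in line:
            name, value = line.split("=", 1)
            if name.strip() == attribute:
                return value.strip()
    return None


# Illustrative stanza only; the 'vm' subsystem name is an assumption.
sample = """\
vm:
    new-wired-method = 0
"""

print(read_attribute(sample, "vm", "new-wired-method"))
```

Running the same check against the file from each member makes it easy to spot a node where the change was missed. A changed attribute generally takes effect only after the system is rebooted, so the file content alone does not prove what the running kernel is using.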