
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

1991.0. "Data corruption with Oracle. Help !!" by VAXRIO::63008::lamotte (Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE) Mon Apr 07 1997 09:59


   
      Hi,

         I have a customer with a TruCluster configuration of two AlphaServer 4000 systems
running Digital UNIX 4.0B and TCR Production Server 1.4. Since they started running Oracle 7.3.2.3.0,
they have been reporting data corruption problems in their database.
     Does anyone know if there is a patch for Digital UNIX or Oracle that prevents this
problem?
   Any clue will be welcome.

T.R	Title	User	Personal Name	Date	Lines
1991.1		KITCHE::schott	Eric R. Schott USG Product Management	`Mon Apr 07 1997 10:36`	9
	Oracle has several patches for their database...you should ensure both the OS and Oracle are to proper patch levels. Oracle should be able to provide the oracle data. Patch info for Digital UNIX is in http://webkits.zk3.dec.com/
1991.2	All patches are already installed.	VAXRIO::63008::lamotte	Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE	`Mon Apr 07 1997 12:40`	11
	Hi, Thank you for your prompt reply. The whole customer site is stopped now, waiting on us. We have already applied the OPS patches 424307, 425425, 433173, 397524, 424355 and 420001. They seem to be all the patches we have for OPS. All patches for Digital UNIX 4.0B are also installed. Does anyone have any more suggestions? We are in a critical situation. Best regards.
1991.3	All Patches?	KYOSS1::GREEN		`Mon Apr 07 1997 15:17`	5
	Did you use "dupatch" to install "ALL" patches? We just upgraded to 4.0B and TCR 1.4. The customer also applied a patch to Oracle. All seems to be running fine. The patch to Oracle was supplied by Oracle.
1991.4	What Oracle patch ???	VAXRIO::63008::lamotte	Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE	`Mon Apr 07 1997 16:31`	8
	Hi, Do you know which patch was installed on the system? Is your customer running Digital UNIX 4.0B without any patches? Thanks for your attention.
1991.5	try this first	SMURF::MARSHALL	Rob Marshall - USEG	`Mon Apr 07 1997 22:18`	35
	Hi,

	Have you tried setting drd-data-compare to see if data is being corrupted as it is transferred over the MEMORY CHANNEL? To set the value so that corrupted data will cause a system to panic (which is better than corrupting the data on disk), you will need to edit /etc/sysconfigtab on ALL the members so that it looks something like:

	drd:
		drd-data-compare=3

	Then reboot all of the members and see if you start getting panics because of corrupted data. If you do, then your problem is most likely bad hardware, either a MEMORY CHANNEL board, hub line card (you didn't say if they are using a real hub, or virtual hub) or possibly a bad PCI backplane.

	This is the first thing you need to check. This will be the major deciding factor as to whether this is a hardware, or software, problem. I have seen a number of situations where a bad backplane caused data to be corrupted and caused TruCluster to crash, etc.

	Another thing, do you have rev 11 MEMORY CHANNEL boards, or rev 14? If you end up replacing the MEMORY CHANNEL modules, try to replace them with rev. 14 boards. This isn't essential, but may make things better for the customer in the long run.

	Another question is: has this ever worked? In other words, was this working before, and just recently started having problems? Or, has the customer always had this problem?

	Rob Marshall
	USEG
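A quick way to confirm the setting on a member before rebooting is to read it back from the file. The following is a minimal sketch in POSIX sh, not a Digital-supplied tool. The function name get_sysconfig_attr is hypothetical, and the parsing assumes the stanza layout shown above: a subsystem name followed by a colon, then attribute=value lines.

```shell
#!/bin/sh
# get_sysconfig_attr FILE STANZA ATTR
# Print the value of ATTR inside STANZA of a sysconfigtab-style file.
# Prints nothing when the stanza or the attribute is missing.
# (Hypothetical helper; the file layout is assumed, not taken from a spec.)
get_sysconfig_attr() {
    awk -v stanza="$2" -v attr="$3" '
        # A stanza header starts in column one and ends with a colon.
        /^[^ \t#][^:]*:/ { cur = $0; sub(/:.*/, "", cur); next }
        # Inside the wanted stanza, strip blanks and split on the first "=".
        cur == stanza {
            line = $0
            gsub(/[ \t]/, "", line)
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == attr)
                print substr(line, n + 1)
        }
    ' "$1"
}
```

Running `get_sysconfig_attr /etc/sysconfigtab drd drd-data-compare` on each member should print 3 if the file was edited as described. An empty result means the stanza or the attribute was not found.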
1991.6	The data/index corruption is still there ...	VAXRIO::LEO		`Wed Apr 09 1997 13:53`	28
	Hi Rob, This is Leo from Digital Brazil. I am working with Lamotte at the same customer. We have already added the drd-data-compare=3 line to /etc/sysconfigtab on both cluster members. After that, the customer rebooted both machines and the data corruption occurred again without any panic message. Does that mean we can rule out hardware problems? The customer then started working in exclusive mode (without OPS). He was using Oracle in exclusive mode, but the DRD services were spread across both cluster members. The data corruption was still there. After that, the customer turned one machine off, and right now all DRD services are being offered by the second cluster member. But the data/index corruption is still there. Do you have any idea? Regards, Leo Digital Technical Support
1991.7	Urgent support needed ...	VAXRIO::LEO		`Wed Apr 09 1997 16:40`	18
	Hi, Is there any patch available to be applied on TCR Production Server 1.4 on Digital UNIX version 4.0B ? I have applied all Digital Unix v4.0B patches and all Oracle 7 patches available as well. Is there any compatibility problem between Oracle 7.3.2.3 and the DRD services offered by TCR 1.4 ? What else can I do ? Best regards, Leo Digital Technical Support
1991.8	test different components	usr406.zko.dec.com::Marshall	Rob Marshall	`Wed Apr 09 1997 21:53`	72
	Hi Leo,

	Have you tried simply writing data to the disks, and comparing it, without using a DRD? We need to find out if the problem is with DRD or somewhere else. You are going to have to step through each of the components one at a time.

	First try writing to a disk on the same bus and see if the data gets corrupted. Next try using dd to write, and then read back, a file written to a DRD disk. Compare the file read with the file you wrote, and then increase the I/O load with multiple dd's all writing/reading/comparing. Make sure, though, that when writing to the DRD disk, you don't write over any customer data. It would be best to create a DRD on a new disk just for this test.

	I will attach the shell script that I just used to do this at the end of this note. What I did was create two 8k files and use those as test files (I'll also put a short ksh loop example at the end to show you how I did it).

	The problem here is that it is unclear where things are getting corrupted. Is it the disk? The BA356? The bus? The controller?... So, the only way to find out is to try different tests at each level to see where the problem is reproducible. My first inclination would be that it is not a DRD problem, but that's because I haven't seen any problems like this.

	Rob

	These are real simple, and you may do better to make your own, but...

------------------------------ create two 8k files -------------------------------
#!/usr/bin/ksh
# Build two 8k test files: 128 lines of 63 characters plus a newline.
rm -f 8kfile 8kfile2
integer c=1
while [[ $c -le 8192 ]]
do
	print -n "a" >> 8kfile
	print -n "b" >> 8kfile2
	((c=c+1))
	((m=c%64))
	if [[ $m -eq 0 ]]
	then
		# The newline takes the 64th byte of each line, so count it too.
		print >> 8kfile
		print >> 8kfile2
		((c=c+1))
		print "c=$c"
	fi
done
------------------------------ write/read/compare to DRD ------------------------------
#!/usr/bin/ksh
# Write each 8k file (16 blocks of 512 bytes) to the DRD, read it back, and compare.
MAX=100
integer c=1
while [[ $c -le $MAX ]]
do
	dd if=8kfile of=/dev/rdrd/drd1 seek=10000 2>/dev/null
	dd if=/dev/rdrd/drd1 of=read8k iseek=10000 count=16 2>/dev/null
	dd if=8kfile2 of=/dev/rdrd/drd1 oseek=11000 2>/dev/null
	dd if=/dev/rdrd/drd1 of=read8k2 iseek=11000 count=16 2>/dev/null
	print "Diff'ing the files...number of iterations: $c"
	diff 8kfile read8k
	diff 8kfile2 read8k2
	((c=c+1))
done
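The suggestion to increase the I/O load with multiple dd's could be sketched roughly as below. This is an illustration in POSIX sh, not the script Rob ran. The function names, the offsets, the use of mktemp, and conv=notrunc are all assumptions, and the target must be a scratch file or a DRD created only for testing, never a device that holds customer data.

```shell
#!/bin/sh
# make_pattern FILE CHAR
# Build an 8 KB pattern file: 128 lines of 63 characters plus a newline.
make_pattern() {
    awk -v ch="$2" 'BEGIN {
        row = ""
        for (i = 0; i < 63; i++) row = row ch
        for (i = 0; i < 128; i++) print row
    }' > "$1"
}

# worker TARGET PATTERN OFFSET ITERATIONS
# Write the pattern (16 blocks of 512 bytes) at block OFFSET, read it back,
# compare, and print how many iterations did not match.
worker() {
    target=$1 pattern=$2 offset=$3 iterations=$4
    readback="${pattern}.read.$offset"
    bad=0 i=0
    while [ "$i" -lt "$iterations" ]; do
        dd if="$pattern" of="$target" bs=512 seek="$offset" conv=notrunc 2>/dev/null
        dd if="$target" of="$readback" bs=512 skip="$offset" count=16 2>/dev/null
        cmp -s "$pattern" "$readback" || bad=$((bad + 1))
        i=$((i + 1))
    done
    rm -f "$readback"
    echo "$bad"
}

# run_workers TARGET PATTERN COUNT ITERATIONS
# Start COUNT workers in parallel at non-overlapping offsets and print the
# total number of mismatches seen by all of them.
run_workers() {
    rw_target=$1 rw_pattern=$2 rw_count=$3 rw_iter=$4
    rw_dir=$(mktemp -d) || return 1
    n=0
    while [ "$n" -lt "$rw_count" ]; do
        worker "$rw_target" "$rw_pattern" $((10000 + n * 1000)) "$rw_iter" \
            > "$rw_dir/$n" &
        n=$((n + 1))
    done
    wait
    cat "$rw_dir"/* | awk '{ total += $1 } END { print total + 0 }'
    rm -rf "$rw_dir"
}
```

For example, `make_pattern pat a` followed by `run_workers /path/to/scratch/target pat 4 100` would run four workers for 100 iterations each. A total of 0 means every read-back matched. Any other number is the count of mismatched iterations and is worth repeating at the next layer down (raw disk, different bus) to narrow the fault.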
1991.9	The production environment is stopped ...	VAXRIO::LEO		`Thu Apr 10 1997 08:30`	69
	Hi Rob,

	First of all I would like to thank you for all your interest. We are trying to identify and isolate the real problem.

	We have Oracle 7.3.2.3 with all available patches installed by Oracle. We have Digital UNIX 4.0B with all patches installed as well (dupatch). The documentation says that TCR V1.4 is supported on Digital UNIX 4.0B. Is there any special patch for TCR V1.4? Could you please tell me the Oracle, TCR and Digital UNIX versions that you have running on the 8200 cluster?

	In the meantime we are trying to figure out what is really going on. We are organizing several different tests in order to better understand this very uncommon problem. The configurations tested were:

	1- Oracle exclusive server (no OPS) with data distributed across DRDs offered by both systems. -> Data was corrupted
	2- Oracle exclusive server (no OPS) with data located on DRD devices offered by a single system. -> Data was corrupted
	3- Oracle exclusive server (no OPS) with data located on raw devices (no DRD configured). Notice that this way the customer is using only half of the resources available (CPU and memory). -> They are preparing this configuration right now

	The customer still can't run his applications. He is trying different configurations, as described above, with no success. This is a very important customer, used as a reference for TruCluster environments here in Brazil, and all their applications are based on this database. We still don't know if the configuration described in item 3 will corrupt data, but I can assure you that the performance will be too poor.

	I know that so far we cannot guarantee whether it's a Digital problem or an Oracle problem, but we need to work on it to be able to identify the real cause of all this data corruption. Don't forget that the customer is still down and is taking all TruCluster resources out of his configuration to try to run his applications, even precariously.

	Could you please tell me if there is any place in the whole world that uses TCR 1.4 + DU 4.0B + Oracle 7.3.2.3? Do you know of other OPS configurations on DU around the world? If so, could you please tell me the TCR, DU and Oracle versions that they are using? This info is being required by the customer. I really need this answer.

	Best regards,
	Leo
	Digital Technical Support Brazil
1991.10		KITCHE::schott	Eric R. Schott USG Product Management	`Thu Apr 10 1997 13:19`	19
	Hi Have you checked all firmware revs and board revs on the systems (including disks)? Have you checked all cable lengths? Anything in the error logs? Are you using LSM to mirror the data on an HSZ? Do you have the latest HSZ patches... My guess is you have a hardware problem somewhere... Have you run sys_check http://www-unix.zk3.dec.com/tuning/tools/sys_check/sys_check.html?
1991.11	configuration info	NETRIX::"[email protected]"	Brian Stevens	`Thu Apr 10 1997 13:20`	14
	I am aware of successful OPS installations with TCR 1.4 and DU 4.0A. I highly doubt that this is a problem introduced in 4.0B. Would you be able to supply the configuration information? Especially for the tables that you know to have been corrupted? For example, which DRD device, whether it is over LSM, and the underlying hardware. Is there hardware RAID involved? If so, which controller? We have seen corruption with HSZ40. Regards, Brian Stevens [Posted by WWW Notes gateway]
1991.12	Barcelona similar problem	VAXRIO::LEO		`Thu Apr 10 1997 13:38`	22
	Hi, > .11 As far as I know, the "Instituto Municipal de Informatica" located in Barcelona, Spain had a similar problem, getting data corruption on DRD devices. They were using Digital UNIX 4.0B, TCR 1.4 and Oracle 7.3.2.3. They decided to downgrade Digital UNIX from 4.0B to 4.0A, and the problem was solved. What I'm trying to do right now is confirm this information, which I received from my local Oracle Support. Do you know anything about that? Regards, Leo
1991.13		NETRIX::"[email protected]"	Brian Stevens	`Thu Apr 10 1997 13:55`	9
	I wasn't aware of the Barcelona problem. Bernard Laforgue just completed a benchmark in Valbonne. I was going back through mail and saw it was with 4.0B and TCR 1.4. They had memory channel failover problems, but not data corruption. You might send him mail to see what oracle version and patches they used. I still suspect hardware though. Brian [Posted by WWW Notes gateway]
1991.14		VAXRIO::LEO		`Thu Apr 10 1997 16:54`	34
	Hi Eric,

	> .10

	KZPSA-BB -> Hardware revision P01. Firmware revision A10.
	RZ29B-VW -> Firmware revision DEC 0016.
	The BN21K-03 cables (between the KZPSA-BBs) are 3 meters long.
	There are no messages related to the problem in the error logs.
	Neither HSZ40 nor LSM is being used yet. We haven't received the HSZ40 so far due to some import problems. This means that they don't have any kind of RAID configured right now. They are using just BA356-JC (with DWZZB-VW and H885-AA) as the disk cabinet.

	Do you think there is any problem with this kind of configuration and these firmware revisions?

	Thank you in advance,
	Regards,
	Leo
	Digital Technical Support Brazil
1991.15		VAXRIO::LEO		`Mon Apr 21 1997 13:38`	21
	Hi, We have changed the new-wired-method parameter from the default 1 to 0. It seems to solve the problem. We did it five days ago and so far everything is going fine. We have also seen data corruption problems with other databases such as Informix and Sybase using Digital UNIX 4.0, 4.0A or 4.0B. I think setting new-wired-method to 0 can fix several data corruption problems caused by inconsistency in shared memory. Best regards, Digital Technical Support Brazil
1991.16	Urgent, please help !!	HGOM22::MAHUAHSIN		`Wed May 21 1997 11:47`	15
	Hi: We learned a lot from this topic (1991). Currently we have a case using an AS2100 TruCluster and have done everything mentioned in 1991. Something very strange: we ran Rob's test program using 8M instead of 8K, and the program crashed the system. We swapped in a new system and the program ran fine. Unfortunately, when OPS started up, the system crashed again. The crash always occurred on the same system no matter where the OPS service was located. This is an urgent case, so please help. Regards, Hua-Hsin Ma
1991.17		dust.zk3.dec.com::Marshall	Rob Marshall USEG	`Wed May 21 1997 13:58`	20
	Hi, I have sent you mail. Unfortunately, there is not enough information in your note to be able to help. As a minimum we need to know what kinds of panics you are getting. Also be sure that, if you have set drd-data-compare on one of the systems, this must be set on all of them. If not, you could get panics because one system is looking for the checksum, but the other one hasn't calculated it. So, please be sure it is set on all the members before doing any testing. Also be sure that you turned off the new wire method. This is what fixed the problem originally brought up in this note. To guarantee a response to your problem, please open an IPMT case. Rob Marshall USEG
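Since a mismatch in drd-data-compare between members can itself cause panics, it may be worth checking that every member reports the same value before testing. The sketch below is a hypothetical POSIX sh helper, not part of TruCluster. It assumes the values have already been gathered, by whatever means, into "member value" lines on standard input.

```shell
#!/bin/sh
# check_agreement
# Read "member value" pairs on stdin. Report every member whose value differs
# from the first member's value, and exit non-zero if any member differs.
check_agreement() {
    awk '
        NR == 1 { want = $2 }
        $2 != want {
            printf "%s has %s, expected %s\n", $1, $2, want
            bad = 1
        }
        END { exit bad }
    '
}
```

For instance, feeding it the two lines `member1 3` and `member2 0` would report member2 and return a non-zero status, which is the situation to fix before running any further tests.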