Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

1991.0. "Data corruption with Oracle. Help !!" by VAXRIO::63008::lamotte (Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE) Mon Apr 07 1997 10:59


   
      Hi,

         I have a customer who has a TruCluster configuration with two AlphaServer 4000
systems running Digital UNIX 4.0B and TCR Production Server 1.4. Since they started running
Oracle 7.3.2.3.0, they have been reporting data corruption problems in their database.
     Does anyone know if there is a patch to be applied to Digital UNIX or Oracle to prevent
this problem from happening?
   Any clue will be welcome.

T.R  Title  User  Personal Name  Date  Lines
1991.1. by KITCHE::schott (Eric R. Schott USG Product Management) Mon Apr 07 1997 11:36 (9 lines)
Oracle has several patches for their database...you should ensure
both the OS and Oracle are at the proper patch levels.

Oracle should be able to provide the oracle data.

Patch info for Digital UNIX is in

http://webkits.zk3.dec.com/

1991.2. "All patches are already installed." by VAXRIO::63008::lamotte (Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE) Mon Apr 07 1997 13:40 (11 lines)

     Hi,

       Thank you for your prompt reply. The whole customer site is stopped now, waiting on us.
 We have already applied the OPS patches: 424307, 425425, 433173, 397524, 424355 and 420001.
 They seem to be all the patches available for OPS. All patches for Digital UNIX 4.0B are
 also installed.
    Does anyone else have suggestions? We are in a critical situation.
 Best regards.
  
1991.3. "All Patches?" by KYOSS1::GREEN Mon Apr 07 1997 16:17 (5 lines)
    	Did you use "dupatch" to install "ALL" patches?
    	We just upgraded to 4.0b and TCR 1.4. Customer also applied a patch
    to Oracle. All seems to be running fine.
    	The patch to Oracle was supplied by Oracle.
    
1991.4. "What Oracle patch ???" by VAXRIO::63008::lamotte (Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE) Mon Apr 07 1997 17:31 (8 lines)

    Hi,

        Do you know which patch was installed on the system?
  Is your customer running Digital UNIX 4.0B without any patches?

  Thanks for your attention.
1991.5. "try this first" by SMURF::MARSHALL (Rob Marshall - USEG) Mon Apr 07 1997 23:18 (35 lines)
    Hi,
    
    Have you tried setting drd-data-compare to see if data is being
    corrupted as it is transferred over the MEMORY CHANNEL?
    
    To set the value so that corrupted data will cause a system to panic
    (which is better than corrupting the data on disk), you will need to
    edit /etc/sysconfigtab on *ALL* the members so that it looks something
    like:
    
    drd:
        drd-data-compare=3
    
    Then reboot all of the members and see if you start getting panics
    because of corrupted data.  If you do, then your problem is most likely
    bad hardware, either a MEMORY CHANNEL board, hub line card (you didn't 
    say if they are using a real hub, or virtual hub) or possibly a bad PCI
    backplane.
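
After editing the file, it can help to confirm that the value really is in the stanza you expect. The following is only a minimal sketch: `get_attr` is a hypothetical helper (not a Digital UNIX command), and it assumes the simple layout shown above, a `name:` stanza header followed by indented `attribute=value` lines.

```shell
#!/bin/sh
# Hypothetical helper, not part of Digital UNIX: print the value of one
# attribute in one stanza of a sysconfigtab-style file.
# Usage: get_attr FILE STANZA ATTRIBUTE
# Returns 0 and prints the value if found, returns 1 otherwise.
get_attr() {
    awk -v stanza="$2" -v attr="$3" '
        /^[^ \t#][^:]*:[ \t]*$/ {          # stanza header such as "drd:"
            name = $0
            sub(/:[ \t]*$/, "", name)
            in_stanza = (name == stanza)
            next
        }
        in_stanza {
            line = $0
            gsub(/[ \t]/, "", line)        # accept "name = value" spacing
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == attr) {
                print substr(line, n + 1)
                found = 1
                exit
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, `get_attr /etc/sysconfigtab drd drd-data-compare` run on each member should print 3 once the change described above is in place.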
    
    This is the first thing you need to check.  This will be the major
    deciding factor as to whether this is a hardware, or software, problem.
    I have seen a number of situations where a bad backplane caused data
    to be corrupted and caused TruCluster to crash, etc.
    
    Another thing, do you have rev 11 MEMORY CHANNEL boards, or rev 14?  If
    you end up replacing the MEMORY CHANNEL modules, try to replace them
    with rev. 14 boards.  This isn't essential, but may make things better
    for the customer in the long run.
    
    Another question is: has this ever worked?  In other words, was this
    working before, and just recently started having problems?  Or, has the
    customer always had this problem?
    
    Rob Marshall
    USEG
1991.6. "The data/index corruption is still there ..." by VAXRIO::LEO Wed Apr 09 1997 14:53 (28 lines)
    Hi Rob,
    
    This is Leo from Digital Brazil.
    I am working with Lamotte at the same customer. We have already
    added the drd-data-compare=3 line to /etc/sysconfigtab on both
    cluster members. After that, the customer rebooted both machines and
    the data corruption occurred again without any panic message being
    displayed. Does that mean we can rule out hardware problems?
    
    So the customer started working in exclusive mode (without OPS).
    He was using Oracle in exclusive mode, but the DRD services were
    spread across both cluster members. The data corruption was still
    there.
    
    After that, the customer turned one machine off, and right now all DRD
    services are being offered by the second cluster member.
    But the data/index corruption is still there.
    
    Do you have any idea ?
    
    Regards,
    
    Leo 
    Digital Technical Support
    
1991.7. "Urgent support needed ..." by VAXRIO::LEO Wed Apr 09 1997 17:40 (18 lines)
    Hi,
    
    Is there any patch available to be applied on TCR Production Server 1.4
    on Digital UNIX version 4.0B ?
    
    I have applied all Digital Unix v4.0B patches and all Oracle 7 patches
    available as well.
    
    Is there any compatibility problem between Oracle 7.3.2.3 and the DRD
    services offered by TCR 1.4 ?
    
    What else can I do ?
    
    Best regards,
    
    Leo
    Digital Technical Support
    
1991.8. "test different components" by usr406.zko.dec.com::Marshall (Rob Marshall) Wed Apr 09 1997 22:53 (72 lines)
Hi Leo,

Have you tried simply writing data, and comparing it, to the disks without
using a DRD?  We need to find out if the problem is with DRD or somewhere
else.

You are going to have to step through each of the components one at a time.
First try writing to a disk on the same bus and see if the data gets 
corrupted.  Next try using dd to write, and then read back, a file written
to a DRD disk.  Compare the file read with the file you wrote, and then
increase the I/O load with multiple dd's all writing/reading/comparing.
Make sure, though, that when writing to the DRD disk, you don't write on
any customer data.  It would be best to create a DRD on a new disk just
for this test.

I will attach a shell script that I just used to do this at the end of this
note.  What I did was create two 8k files and use those as test files (I'll
also put a short ksh loop example at the end to show you how I did it).

The problem here is that it is unclear where things are getting corrupted.
Is it the disk?  The BA356?  The bus?  The controller?...  So, the only way
to find out is to try different tests at each level to see where the problem
is reproducible.

My first inclination would be that it is not a DRD problem, but that's because
I haven't seen any problems like this.

Rob

These are real simple, and you may do better to make your own, but...

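------------------------------ create two 8k files -------------------------------
#!/usr/bin/ksh
# Build two 8192-byte test files: 8kfile is filled with "a" and 8kfile2
# with "b".  Each line is 63 characters plus a newline (64 bytes).

rm -f 8kfile 8kfile2

integer c=1
while [[ $c -le 8192 ]]
do
        print -n "a" >> 8kfile
        print -n "b" >> 8kfile2
        ((c=c+1))
        ((m=c%64))
        if [[ $m -eq 0 ]]
        then
                # End the line; the newline counts as the 64th byte.
                print >> 8kfile
                print >> 8kfile2
                ((c=c+1))
                print "c=$c"    # progress indicator
        fi
done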
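
The loop above appends one byte at a time, which gets slow for larger test files. A quicker way to produce the same layout (128 lines of 63 characters plus a newline, 8192 bytes in total) can be sketched with awk. `make_pattern` is a hypothetical name used only for this example.

```shell
#!/bin/sh
# Hypothetical helper: write 128 lines, each made of 63 copies of CHAR
# followed by a newline.  128 * 64 = 8192 bytes, the same layout that the
# ksh loop above produces.
# Usage: make_pattern FILE CHAR
make_pattern() {
    awk -v ch="$2" 'BEGIN {
        line = ""
        for (i = 0; i < 63; i++) line = line ch    # one 63-character line
        for (n = 0; n < 128; n++) print line       # print adds the newline
    }' > "$1"
}
```

Calling `make_pattern 8kfile a` and `make_pattern 8kfile2 b` gives two files that can be used in place of the ones built by the loop.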

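------------------------------ write/read/compare to DRD ------------------------------
#!/usr/bin/ksh
# WARNING: this overwrites blocks 10000-10015 and 11000-11015 of
# /dev/rdrd/drd1.  Only run it against a scratch DRD, never customer data.
MAX=100
integer c=1

while [[ $c -le $MAX ]]
do
        # Write the first pattern at block 10000, then read back 16 blocks
        # (16 x 512 bytes = 8k).
        dd if=8kfile of=/dev/rdrd/drd1 oseek=10000 2>/dev/null
        dd if=/dev/rdrd/drd1 of=read8k iseek=10000 count=16 2>/dev/null

        # Same for the second pattern at block 11000.
        dd if=8kfile2 of=/dev/rdrd/drd1 oseek=11000 2>/dev/null
        dd if=/dev/rdrd/drd1 of=read8k2 iseek=11000 count=16 2>/dev/null

        # diff prints nothing when the data read back matches what was written.
        print "Diff'ing the files...number of iterations: $c"
        diff 8kfile read8k
        diff 8kfile2 read8k2

        ((c=c+1))
done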
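
To step through the components one at a time and to raise the I/O load, the same write/read/compare idea can be wrapped in a function that takes the target as a parameter. This is only a sketch under stated assumptions: `verify_io` is a hypothetical name, it uses the POSIX `seek=`/`skip=` spellings rather than `oseek=`/`iseek=`, and the pattern file must be a whole number of 512-byte blocks (as the 8k files are).

```shell
#!/bin/sh
# Hypothetical helper: write PATTERN to TARGET at block OFFSET (512-byte
# blocks), read the same blocks back, and compare.  Repeats COUNT times.
# Returns 0 if every pass matched, 1 on the first mismatch, 2 if dd failed.
# PATTERN must be a multiple of 512 bytes long.
# Usage: verify_io TARGET PATTERN OFFSET COUNT
verify_io() {
    target=$1 pattern=$2 offset=$3 count=$4
    size=$(wc -c < "$pattern")
    blocks=$(( size / 512 ))
    # One read-back file per offset, so parallel workers do not collide.
    readback=${TMPDIR:-/tmp}/readback.$$.$offset
    i=1
    while [ "$i" -le "$count" ]; do
        # conv=notrunc keeps dd from truncating TARGET when it is a plain file.
        dd if="$pattern" of="$target" bs=512 seek="$offset" conv=notrunc 2>/dev/null || return 2
        dd if="$target" of="$readback" bs=512 skip="$offset" count="$blocks" 2>/dev/null || return 2
        if ! cmp -s "$pattern" "$readback"; then
            echo "MISMATCH at offset $offset, iteration $i" >&2
            rm -f "$readback"
            return 1
        fi
        i=$((i + 1))
    done
    rm -f "$readback"
    return 0
}

# Example of running two workers at once against a scratch target
# (give each worker its own offset):
#   verify_io /dev/rdrd/drd1 8kfile  10000 100 & p1=$!
#   verify_io /dev/rdrd/drd1 8kfile2 11000 100 & p2=$!
#   wait "$p1" || echo "worker 1 failed"
#   wait "$p2" || echo "worker 2 failed"
```

Running the same function first against a plain file, then a local raw disk, then a DRD device keeps the test identical while only the layer under test changes. As with the script above, the target is overwritten, so use a scratch disk.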

1991.9. "The production environment is stopped ..." by VAXRIO::LEO Thu Apr 10 1997 09:30 (69 lines)
Hi Rob,

	First of all I would like to thank you for all your interest.

	We are trying to identify and isolate the real problem.

	We have Oracle 7.3.2.3 with all available patches installed by
Oracle. We have Digital Unix 4.0B with all patches installed as well (dupatch).

	The documentation says that TCR v1.4 is supported by Digital Unix 4.0B.
	
	Is there any special patch to TCR V1.4 ?

	Could you please inform me about the Oracle, TCR and Digital Unix 
versions that you have running on the 8200 cluster ?

	In the meantime we are trying to figure out what is really going on.

	We are organizing several different tests in order to get a better 
understanding of this very uncommon problem.

 The configurations tested were:

   1- Run Oracle exclusive server (no OPS) with data distributed by DRD
      offered by both systems.
      -> Data was corrupted

   2- Run Oracle exclusive server (no OPS) with data located in DRD devices
      offered by a single system.
      -> Data was corrupted

   3- Run Oracle exclusive server (no OPS) with data located in raw devices
      (no DRD configured). Notice that this way the customer is using only
      half of the resources available (CPU and memory).
      -> They are preparing this configuration  right now


  The customer still can't run his applications.
  He is trying different configurations, as described above, with no success.
  This is a very important customer used as reference for TruCluster 
  environment here in Brazil and all their applications are based in this
  database.
   We still don't know if the configuration described in item 3 will 
  corrupt data but I can assure you that the performance will be too
  poor.

  I know that so far we cannot be sure whether it's a Digital problem
  or an Oracle problem, but we need to work on it to be able to identify
  the real cause of all this data corruption. Don't forget that the customer
  is still down and is taking all TruCluster resources out of his configuration
  to try to run his applications, even precariously.

    Could you please tell me whether there is any site anywhere in the world
    that uses TCR 1.4 + DU 4.0B + Oracle 7.3.2.3 ?
    
    Do you know of other OPS configurations on DU anywhere in the world ?
    
    If so could you please tell me the TCR, DU and Oracle versions that
    they are using ?
    
    This info is being required by the customer.
    
    I really need this answer. 
    
  Best regards,

  Leo
  Digital Technical Support 
  Brazil 
1991.10. by KITCHE::schott (Eric R. Schott USG Product Management) Thu Apr 10 1997 14:19 (19 lines)
Hi

 Have you checked all firmware revs and board revs on
the systems (including disks)?

Have you checked all cable lengths?

Anything in the error logs?

Are you using LSM to mirror the data on an HSZ?

Do you have the latest HSZ patches...

My guess is you have a hardware problem somewhere...

Have you run sys_check?

http://www-unix.zk3.dec.com/tuning/tools/sys_check/sys_check.html

1991.11. "configuration info" by NETRIX::"[email protected]" (Brian Stevens) Thu Apr 10 1997 14:20 (14 lines)
I am aware of successful OPS installations with TCR 1.4 and 
DU 4.0A. I doubt highly that this is a 4.0B introduced problem.

Would you be able to supply the configuration information? Especially
for the tables that you know to have been corrupted? For example,
which DRD device, whether over LSM, and the underlying hardware. Is
there hardware RAID involved? If so, which controller?

We have seen corruption with HSZ40.

Regards,
Brian Stevens
[Posted by WWW Notes gateway]
1991.12. "Barcelona similar problem" by VAXRIO::LEO Thu Apr 10 1997 14:38 (22 lines)
    Hi,
    
    > .11
    
    As far as I know the "Instituto Municipal de Informatica" located in
    Barcelona/Spain had a similar problem, with data corruption on 
    DRD devices. They were using Digital UNIX 4.0B, TCR 1.4 and 
    Oracle 7.3.2.3. 
    
    They decided to downgrade Digital UNIX from 4.0B to 4.0A,
    and that solved the problem.
    
    What I'm trying to do right now is confirm this information, which I
    received from my local Oracle Support.
    
    Do you know anything about that ?
    
    Regards,
    
    Leo
    
    
1991.13. by NETRIX::"[email protected]" (Brian Stevens) Thu Apr 10 1997 14:55 (9 lines)
I wasn't aware of the Barcelona problem. Bernard Laforgue just
completed a benchmark in Valbonne. I was going back through
mail and saw it was with 4.0B and TCR 1.4. They had memory
channel failover problems, but not data corruption. You might
send him mail to see what oracle version and patches they 
used. I still suspect hardware though.

Brian
[Posted by WWW Notes gateway]
1991.14. by VAXRIO::LEO Thu Apr 10 1997 17:54 (34 lines)
    Hi Eric,
    
    > .10
    
    KZPSA-BB -> Hardware revision P01.
    		Firmware revision A10.
    
    RZ29B-VW -> Firmware revision DEC 0016.
    
    The BN21K-03 cables (between the KZPSA-BBs) are 3 meters long.
    
    There are no messages related to the problem in the error logs.
    
    Neither HSZ40 nor LSM is being used yet. We haven't received the HSZ40
    so far due to some import problems. This means that they don't have 
    any kind of RAID configured right now. 
    
    They are using just BA356-JC (with DWZZB-VW and H885-AA) as the disk
    cabinet.
    
    Do you think there is any problem with this kind of configuration and 
    these firmware revisions ?
    
    Thank you in advance,
    
    Regards,
    
    Leo
    Digital Technical Support
    Brazil
    
    
    
          
1991.15. by VAXRIO::LEO Mon Apr 21 1997 14:38 (21 lines)
    Hi,
    
    	We have changed the new-wired-method parameter from the default of 1
    to 0.
    
    	That seems to have solved the problem.
    
        We made the change five days ago and so far everything is going fine.
    
    	We have also had data corruption problems with other databases, such as
    Informix and Sybase, on Digital UNIX 4.0, 4.0A, and 4.0B.
    	
    	I think setting new-wired-method to 0 can fix several data
    corruption problems caused by inconsistency in shared memory.
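
For anyone applying the same change, the entry would presumably go in /etc/sysconfigtab in the same stanza format shown in .5. Note that this note does not say which subsystem the parameter belongs to; the `vm` stanza below is an assumption, so check it against your own system before relying on it.

```
vm:
    new-wired-method=0
```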
    
    	Best regards,
    
    	Digital Technical Support
    	Brazil
    
    
1991.16. "Urgent, please help !!" by HGOM22::MAHUAHSIN Wed May 21 1997 12:47 (15 lines)
    Hi:
    
    We learned a lot from this topic (1991). We currently have a case with
    an AS2100 TruCluster and have done everything mentioned in 1991.
    Something very strange: we ran Rob's test program using 8M
    instead of 8K, and it crashed the system. We changed to a new
    system and the program ran fine. Unfortunately, when OPS started up,
    the system crashed again. The crash always happens on the
    same system, no matter where the OPS service is located.
    
    This is an urgent case and please help.
    
    
    Regards,
    Hua-Hsin Ma
1991.17. by dust.zk3.dec.com::Marshall (Rob Marshall USEG) Wed May 21 1997 14:58 (20 lines)
Hi,

I have sent you mail.  Unfortunately, there is not enough information
in your note to be able to help.  As a minimum we need to know what
kinds of panics you are getting.

Also be sure that, if you have set drd-data-compare on one of the
systems, this *must* be set on all of them.  If not, you could get
panics because one system is looking for the checksum, but the other
one hasn't calculated it.  So, please be sure it is set on all the
members before doing any testing.
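
One way to check this is to collect a copy of /etc/sysconfigtab from each member and compare the setting across the copies. The sketch below assumes the copies are already on one machine; `same_on_all` is a hypothetical helper, and it only understands the simple `name:` / `attribute=value` layout used in this topic.

```shell
#!/bin/sh
# Hypothetical helper: report the value of ATTRIBUTE in STANZA for each
# sysconfigtab-style file given, and fail if the files do not all agree.
# A file that does not set the attribute is reported as "<unset>".
# Usage: same_on_all STANZA ATTRIBUTE FILE...
# Returns 0 if all files agree, 1 if they differ, 2 if no file was given.
same_on_all() {
    [ "$#" -ge 3 ] || return 2
    stanza=$1 attr=$2
    shift 2
    awk -v stanza="$stanza" -v attr="$attr" '
        FNR == 1 {                          # first line of each file
            in_stanza = 0
            nfiles++
            fname[nfiles] = FILENAME
            value[nfiles] = "<unset>"
        }
        /^[^ \t#][^:]*:[ \t]*$/ {          # stanza header such as "drd:"
            s = $0
            sub(/:[ \t]*$/, "", s)
            in_stanza = (s == stanza)
            next
        }
        in_stanza {
            line = $0
            gsub(/[ \t]/, "", line)        # accept "name = value" spacing
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == attr)
                value[nfiles] = substr(line, n + 1)
        }
        END {
            bad = 0
            for (i = 1; i <= nfiles; i++) {
                printf "%s: %s=%s\n", fname[i], attr, value[i]
                if (value[i] != value[1]) bad = 1
            }
            exit bad
        }
    ' "$@"
}
```

For example, with copies saved as member1.tab and member2.tab (hypothetical file names), `same_on_all drd drd-data-compare member1.tab member2.tab` prints the value found in each file and returns non-zero if they differ.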

Also be sure that you have turned off the new wired method (new-wired-method
set to 0, see .15).  This is what fixed the problem originally brought up in
this note.

To guarantee a response to your problem, please open an IPMT case.

Rob Marshall
USEG