Title: | Oracle |
Notice: | For product status see topics: UNIX 1008, OpenVMS 1009, NT 1010 |
Moderator: | EPS::VANDENHEUVEL |
Created: | Fri Aug 10 1990 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1574 |
Total number of notes: | 4428 |
Hi,

This is Leo from Digital Brazil. Two weeks ago I installed a TCR Production Server plus OPS environment at a very large medical insurance company called UNIMED. Until one week ago everything was fine, but since then some strange problems have started.

The Digital UNIX version is 4.0B, the Oracle version is 7.3.2.3.0 and the TCR Production Server version is 1.4. The hardware consists of 2 AlphaServer 4100s, each with 1 CPU 5/400 and 512 MB of memory.

The Memory Channel boards are installed and working properly. I have checked their position on the PCI bus and the VH jumpers as well. The KZPSA adapters are working fine and I have verified them using cnfgdiag, which comes on the board diskette. The clu_ivp and cnxshow commands report that the cluster is fine. The tie breaker disk is properly configured. ntpsetup was run and the clocks of both machines are well synchronized. I have defined IP addresses for the Memory Channel and for the systems (using different subnet masks).

I created some DRD services using shared raw devices. In the near future the customer will start to use LSM volumes on the DRD services instead of just raw partitions. LSM will be necessary in order to have more partitions available for Oracle objects.

If I turn off one machine, all the services that were served by it become available on the other one, and everything looks fine from the cluster's perspective. I have verified and tested all the shared and local disks and all of them are OK.

Two weeks ago, when I installed OPS, I hit the first problem. On both Oracle accounts I entered the following command:

svrmgrl > startup parallel

The two instances (of the same database) were started but the svrmgrl prompts did not return. The instances were not opened. I contacted Oracle and received a patch that fixed it. After that everything worked fine. But, as I said, one week later something strange started to happen.
Oracle has started to lose indexes and data. Indexes are crashing frequently. In the first week the level of concurrency was not very high. A lot of heavy tasks were being run, but with little concurrency between them. In the second week (when the problem started) more users began to use the environment, and there is now much more concurrent access to the shared disks. At least twice per day Oracle is still losing and crashing indexes and data.

All the patches available for Digital UNIX 4.0B have already been applied at the customer site, but I didn't see any among them that could really be related to this kind of problem. I have contacted Oracle and some Oracle patches were applied as well.

The only message that I get from Digital UNIX is:

cdisk_op_spin ... unit reserved ...

I have checked this message and it seems to be a normal one. It seems to be just a warning (patrol agent) saying that this disk is being served by the other machine.

Have you ever seen a problem like this before? Do you think there is any fix (patch) available for Oracle 7.3.2.3.0 for this behavior? And Digital UNIX fixes? This cluster holds all of UNIMED's production environment, which means that they are really worried about it.

I have already applied all 6 of these Oracle patches:

424307 - fixes a svrmgrm bug
425425 - fixes the 424581 base bug
433173 - fixes the 350174 base bug
397524 - no description found
424355 - seems to be equivalent to 406711 (DU 4.0a); the only difference seems to be that this one is for version 4.0b. This patch is intended to solve OPS problems in our cluster environment.
420001 - seems to be equivalent to 396674 (DU 4.0a); the only difference seems to be that this one is for version 4.0b. This patch allows the database to be opened on both machines after issuing the "startup parallel" command.

Is there any other important patch to be applied? What else can I do? The production site is completely stopped right now.
Could you give me any hint? Any help or pointer would be much appreciated.

Best regards,
Leo
Digital Technical Support Brazil
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1541.1 | What is crashing ? | AXPBIZ::RANJAN | Wed Apr 09 1997 13:49 | 13 | |
Leo, Could you be a little more specific about the table and index crashing (as you mentioned)? Is the index getting corrupted? Or does using an index in a query never return anything? Or does it return garbage? Are there any error numbers associated? Did you create the index with the parallel option or the unrecoverable option? I assume all your DRDs are accessible at all times. Please post the relevant error messages or data corruption symptoms/messages. We have been using OPS for a long time on two 8400s and we haven't seen any corruptions so far. - Ranjan. | |||||
1541.2 | The data corruption is still there ... | VAXRIO::LEO | Wed Apr 09 1997 15:01 | 43 | |
Hi Ranjan, Thank you for your prompt reply. > Is the index getting corrupt ? Yes. > Index in a query never returns anything ? No, that is not the problem. > Does it return garbage ? No. > Are there any error numbers associated ? Yes. A lot of them. The first one, which seems to trigger all the others, is ORA-600. The most common argument to the ORA-600 error is 12700. It seems that the indexes start to become corrupt and then the corruption cascades. > Parallel or unrecoverable option ? No, neither the parallel nor the unrecoverable option was used for the indexes. > Post the relevant error messages ... A lot of ORA-600 errors and core files are generated. In order to isolate the problem we have already tried to run Oracle in exclusive mode with the DRD services spread between the 2 cluster members. It didn't work. After that we decided to turn one machine off and relocate all the DRD services to a single server. It didn't work. We have also tried drd-data-compare=3 in the /etc/sysconfigtab of both servers in order to find out whether the problem was caused by hardware or software. The data corruption is still there even with drd-data-compare=3, which indicates that a software problem is more likely than a hardware one. Do you have any idea? Best regards, Leo Digital Technical Support | |||||
1541.3 | Removing DRD | VAXRIO::LEO | Wed Apr 09 1997 16:16 | 15 | |
Hi Ranjan, What I'm doing right now is storing all Oracle data on raw devices, without using DRD anymore. The intent is to check whether there is any compatibility problem between Oracle 7 and DRD devices. Any idea? Best regards, Leo Digital Technical Support | |||||
1541.4 | VAXRIO::LEO | Mon Apr 21 1997 14:36 | 19 | ||
Hi Ranjan, I have changed the new-wired-method parameter from 1 (the default on Digital UNIX 4.0, 4.0a and 4.0b) to 0. It seems to solve the problem. I did it five days ago and so far so good. I think this hint may also fix several data corruption problems I have faced using Informix and Sybase on Digital UNIX 4.0 or above. There is a real problem of shared memory inconsistency when new-wired-method is enabled. Best regards, Leo Digital Technical Support Brazil |
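
For readers who want to try the workaround from reply 1541.4, the change might look like the fragment below in /etc/sysconfigtab on each cluster member. This is only a sketch. The thread does not say which kernel subsystem new-wired-method belongs to; the "vm" stanza name is an assumption and should be checked against the system's own documentation before use. The lines starting with "#" are annotations for the reader. Because /etc/sysconfigtab is read at boot, expect the new value to take effect only after a reboot.

```text
# Assumption: new-wired-method is an attribute of the "vm" subsystem.
# The value 0 is the one reported to work in reply 1541.4; 1 is the
# default on Digital UNIX 4.0, 4.0a and 4.0b according to that reply.
vm:
        new-wired-method = 0
```

After the reboot, the value in effect can be queried with sysconfig -q followed by the subsystem name and the attribute name.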