T.R | Title | User | Personal Name | Date | Lines |
---|
1991.1 | | KITCHE::schott | Eric R. Schott USG Product Management | Mon Apr 07 1997 11:36 | 9 |
| Oracle has several patches for their database. You should ensure
both the OS and Oracle are at the proper patch levels.
Oracle should be able to provide the Oracle patch data.
Patch info for Digital UNIX is in
http://webkits.zk3.dec.com/
|
1991.2 | All patches are already installed. | VAXRIO::63008::lamotte | Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE | Mon Apr 07 1997 13:40 | 11 |
|
Hi,
Thank you for your prompt reply. The entire customer site is stopped now, waiting on us.
We have already applied the OPS patches: 424307, 425425, 433173, 397524, 424355 and 420001.
They seem to be all the patches we have for OPS. All patches for Digital UNIX 4.0B are also
installed.
Does anyone else have suggestions? We are in a critical situation.
Best regards.
|
1991.3 | All Patches? | KYOSS1::GREEN | | Mon Apr 07 1997 16:17 | 5 |
| Did you use "dupatch" to install "ALL" patches?
We just upgraded to 4.0b and TCR 1.4. Customer also applied a patch
to Oracle. All seems to be running fine.
The patch to Oracle was supplied by Oracle.
|
1991.4 | What Oracle patch ??? | VAXRIO::63008::lamotte | Alexandre Lamotte - MCS/BRASIL Mail to VAXRIO::LAMOTTE | Mon Apr 07 1997 17:31 | 8 |
|
Hi,
Do you know what patch was installed on the system?
Is your customer running Digital UNIX 4.0B without any patches?
Thanks for your attention.
|
1991.5 | try this first | SMURF::MARSHALL | Rob Marshall - USEG | Mon Apr 07 1997 23:18 | 35 |
| Hi,
Have you tried setting drd-data-compare to see if data is being
corrupted as it is transferred over the MEMORY CHANNEL?
To set the value so that corrupted data will cause a system to panic
(which is better than corrupting the data on disk), you will need to
edit /etc/sysconfigtab on *ALL* the members so that it looks something
like:
    drd:
        drd-data-compare=3
Then reboot all of the members and see if you start getting panics
because of corrupted data. If you do, then your problem is most likely
bad hardware, either a MEMORY CHANNEL board, hub line card (you didn't
say if they are using a real hub, or virtual hub) or possibly a bad PCI
backplane.
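Before rebooting, it can help to confirm that every member really carries the same setting. The following is only a minimal sketch, not a Digital-supplied tool. It assumes you have copied each member's /etc/sysconfigtab to a local file (the file names in the usage note are hypothetical) and that the file uses the simple "stanza:" / "attribute=value" layout shown above. The function names get_attr and check_members are invented for this illustration.

```shell
#!/bin/sh
# get_attr FILE STANZA ATTR
# Print the value of ATTR inside the "STANZA:" block of a
# sysconfigtab-style file. Prints nothing if the attribute is not set.
# If the attribute appears more than once, the last value wins.
get_attr() {
    awk -v stanza="$2" -v attr="$3" '
        /^[^ \t#][^:]*:[ \t]*$/ {           # stanza header such as "drd:"
            name = $0
            sub(/:[ \t]*$/, "", name)
            in_stanza = (name == stanza)
            next
        }
        in_stanza {
            line = $0
            gsub(/[ \t]/, "", line)         # tolerate "attr = value"
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == attr)
                val = substr(line, n + 1)
        }
        END { if (val != "") print val }
    ' "$1"
}

# check_members VALUE FILE...
# Return 0 only if every file sets drd-data-compare to VALUE in its
# drd stanza; otherwise report each file that disagrees and return 1.
check_members() {
    want=$1
    shift
    rc=0
    for f in "$@"; do
        got=$(get_attr "$f" drd drd-data-compare)
        if [ "$got" != "$want" ]; then
            echo "$f: drd-data-compare='$got' (expected $want)"
            rc=1
        fi
    done
    return $rc
}
```

For example, `check_members 3 member1.sysconfigtab member2.sysconfigtab` reports every copy whose drd stanza does not set the value to 3 and returns non-zero if any of them disagree.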
This is the first thing you need to check. This will be the major
deciding factor as to whether this is a hardware, or software, problem.
I have seen a number of situations where a bad backplane caused data
to be corrupted and caused TruCluster to crash, etc.
Another thing, do you have rev 11 MEMORY CHANNEL boards, or rev 14? If
you end up replacing the MEMORY CHANNEL modules, try to replace them
with rev. 14 boards. This isn't essential, but may make things better
for the customer in the long run.
Another question is: has this ever worked? In other words, was this
working before, and just recently started having problems? Or, has the
customer always had this problem?
Rob Marshall
USEG
|
1991.6 | The data/index corruption is still there ... | VAXRIO::LEO | | Wed Apr 09 1997 14:53 | 28 |
| Hi Rob,
This is Leo from Digital Brazil.
I am working with Lamotte at the same customer. We have already
added the drd-data-compare=3 line to /etc/sysconfigtab on both
cluster members. After that, the customer rebooted both machines and
the data corruption occurred again without any panic
message. Does that mean we can rule out hardware problems?
The customer then started working in exclusive mode (without OPS).
He was running Oracle in exclusive mode, but the DRD services were
spread across both cluster members. The data corruption was still
there.
After that, the customer turned one machine off, and right now all DRD
services are being offered by the second cluster member.
But the data/index corruption is still there.
Do you have any ideas?
Regards,
Leo
Digital Technical Support
|
1991.7 | Urgent support needed ... | VAXRIO::LEO | | Wed Apr 09 1997 17:40 | 18 |
| Hi,
Is there any patch available to be applied on TCR Production Server 1.4
on Digital UNIX version 4.0B ?
I have applied all Digital Unix v4.0B patches and all Oracle 7 patches
available as well.
Is there any compatibility problem between Oracle 7.3.2.3 and the DRD
services offered by TCR 1.4 ?
What else can I do ?
Best regards,
Leo
Digital Technical Support
|
1991.8 | test different components | usr406.zko.dec.com::Marshall | Rob Marshall | Wed Apr 09 1997 22:53 | 72 |
| Hi Leo,
Have you tried simply writing data, and comparing it, to the disks without
using a DRD? We need to find out if the problem is with DRD or somewhere
else.
You are going to have to step through each of the components one at a time.
First try writing to a disk on the same bus and see if the data gets
corrupted. Next try using dd to write, and then read back, a file written
to a DRD disk. Compare the file read with the file you wrote, and then
increase the I/O load with multiple dd's all writing/reading/comparing.
Make sure, though, that when writing to the DRD disk, you don't write on
any customer data. It would be best to create a DRD on a new disk just
for this test.
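The write, read back, compare idea can be rehearsed on an ordinary scratch file before pointing anything at a DRD. The sketch below is an illustration under stated assumptions, not the script used on the cluster. It uses the POSIX dd operands seek= and skip= with 512-byte blocks, adds conv=notrunc so parallel writers do not truncate each other's regions in a regular file, and compares with cmp. The function names rw_check and run_parallel, and all paths in the usage note, are hypothetical.

```shell
#!/bin/sh
# rw_check TARGET PATTERN_FILE BLOCK_OFFSET BLOCKS ITERATIONS
# Write PATTERN_FILE to TARGET at BLOCK_OFFSET (512-byte blocks), read the
# same region back, and compare it with the pattern.
# Returns 1 on the first mismatch, 2 if dd itself fails, 0 otherwise.
# On a mismatch the read-back file is left in place for inspection.
rw_check() {
    target=$1 pattern=$2 offset=$3 blocks=$4 iters=$5
    readback="${TMPDIR:-/tmp}/readback.$$.$offset"
    i=1
    while [ "$i" -le "$iters" ]; do
        dd if="$pattern" of="$target" bs=512 seek="$offset" \
            conv=notrunc 2>/dev/null || return 2
        dd if="$target" of="$readback" bs=512 skip="$offset" \
            count="$blocks" 2>/dev/null || return 2
        if ! cmp -s "$pattern" "$readback"; then
            echo "MISMATCH at block $offset, iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    rm -f "$readback"
}

# run_parallel TARGET PATTERN_FILE BLOCKS ITERATIONS OFFSET...
# Start one rw_check per offset in the background to raise the I/O load,
# then wait for all of them. Returns non-zero if any worker failed.
run_parallel() {
    target=$1 pattern=$2 blocks=$3 iters=$4
    shift 4
    pids=""
    for off in "$@"; do
        rw_check "$target" "$pattern" "$off" "$blocks" "$iters" &
        pids="$pids $!"
    done
    status=0
    for p in $pids; do
        wait "$p" || status=1
    done
    return $status
}
```

With an 8192-byte pattern file (16 blocks), a call such as `run_parallel /tmp/scratch.img 8kfile 16 100 0 16 32 48` runs four workers on non-overlapping regions. Choose offsets at least BLOCKS apart so the workers never write over each other, and only substitute a raw device for the scratch file once you are sure it holds no customer data.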
I will attach a shell script that I just used to do this at the end of this
note. What I did was create two 8k files and use those as test files (I'll
also put a short ksh loop example at the end to show you how I did it).
The problem here is that it is unclear where things are getting corrupted.
Is it the disk? The BA356? The bus? The controller?... So, the only way
to find out is to try different tests at each level to see where the problem
is reproducible.
My first inclination would be that it is not a DRD problem, but that's because
I haven't seen any problems like this.
Rob
These are real simple, and you may do better to make your own, but...
------------------------------ create two 8k files -------------------------------
#!/usr/bin/ksh
# Build two 8192-byte test files: 128 lines of 63 characters plus a newline.
rm -f 8kfile 8kfile2
integer c=1
while [[ $c -le 8192 ]]
do
    print -n "a" >> 8kfile
    print -n "b" >> 8kfile2
    ((c=c+1))
    ((m=c%64))
    if [[ $m -eq 0 ]]
    then
        # End of a 64-byte line: the newline counts as one byte.
        print >> 8kfile
        print >> 8kfile2
        ((c=c+1))
        print "c=$c"
    fi
done
------------------------------ write/read/compare to DRD ------------------------------
#!/usr/bin/ksh
# Write each 8k file (16 blocks of 512 bytes) to the raw DRD device,
# read the same blocks back, and diff the result against the original.
MAX=100
integer c=1
while [[ $c -le $MAX ]]
do
    dd if=8kfile of=/dev/rdrd/drd1 seek=10000 2>/dev/null
    dd if=/dev/rdrd/drd1 of=read8k iseek=10000 count=16 2>/dev/null
    dd if=8kfile2 of=/dev/rdrd/drd1 oseek=11000 2>/dev/null
    dd if=/dev/rdrd/drd1 of=read8k2 iseek=11000 count=16 2>/dev/null
    print "Diff'ing the files...number of iterations: $c"
    diff 8kfile read8k
    diff 8kfile2 read8k2
    ((c=c+1))
done
|
1991.9 | The production environment is stopped ... | VAXRIO::LEO | | Thu Apr 10 1997 09:30 | 69 |
| Hi Rob,
First of all I would like to thank you for all your interest.
We are trying to identify and isolate the real problem.
We have Oracle 7.3.2.3 with all available patches installed by
Oracle. We have Digital Unix 4.0B with all patches installed as well (dupatch).
The documentation says that TCR v1.4 is supported by Digital Unix 4.0B.
Is there any special patch to TCR V1.4 ?
Could you please inform me about the Oracle, TCR and Digital Unix
versions that you have running on the 8200 cluster ?
In the meantime we are trying to figure out what is really going on.
We are organizing several different tests in order to get a better
understanding of this very uncommon problem.
The configurations tested were:
1- Run Oracle exclusive server (no OPS) with data distributed by DRD
offered by both systems.
-> Data was corrupted
2- Run Oracle exclusive server (no OPS) with data located in DRD devices
offered by a single system.
-> Data was corrupted
3- Run Oracle exclusive server (no OPS) with data located in raw devices
(no DRD configured). Notice that this way the customer is using only
half of the resources available (CPU and memory).
-> They are preparing this configuration right now
The customer still can't run his applications.
He is trying different configurations, as described above, with no success.
This is a very important customer, used as a reference for the TruCluster
environment here in Brazil, and all their applications are based on this
database.
We still don't know if the configuration described in item 3 will
corrupt data, but I can assure you that the performance will be very
poor.
I know that so far we cannot guarantee whether it's a Digital problem
or an Oracle problem, but we need to work on it to identify
the real cause of all this data corruption. Don't forget that the customer
is still down and is taking all TruCluster resources out of his configuration
to try to run his applications, even precariously.
Could you please tell me if there is any site in the world that
uses TCR 1.4 + DU 4.0B + Oracle 7.3.2.3?
Do you know of other OPS configurations on DU around the world?
If so, could you please tell me the TCR, DU and Oracle versions that
they are using?
The customer is asking for this information.
I really need this answer.
Best regards,
Leo
Digital Technical Support
Brazil
|
1991.10 | | KITCHE::schott | Eric R. Schott USG Product Management | Thu Apr 10 1997 14:19 | 19 |
| Hi
Have you checked all firmware revs and board revs on
the systems (including disks)?
Have you checked all cable lengths?
Anything in the error logs?
Are you using LSM to mirror the data on an HSZ?
Do you have the latest HSZ patches...
My guess is you have a hardware problem somewhere...
Have you run sys_check
http://www-unix.zk3.dec.com/tuning/tools/sys_check/sys_check.html?
|
1991.11 | configuration info | NETRIX::"[email protected]" | Brian Stevens | Thu Apr 10 1997 14:20 | 14 |
|
I am aware of successful OPS installations with TCR 1.4 and
DU 4.0A. I doubt highly that this is a 4.0B introduced problem.
Would you be able to supply the configuration information, especially
for the tables that you know to have been corrupted? For example,
which DRD device, whether it is over LSM, and the underlying hardware. Is
there hardware RAID involved? If so, which controller?
We have seen corruption with HSZ40.
Regards,
Brian Stevens
[Posted by WWW Notes gateway]
|
1991.12 | Barcelona similar problem | VAXRIO::LEO | | Thu Apr 10 1997 14:38 | 22 |
| Hi,
> .11
As far as I know, the "Instituto Municipal de Informatica" located in
Barcelona, Spain had a similar problem, getting data corruption on
DRD devices. They were using Digital UNIX 4.0B, TCR 1.4 and
Oracle 7.3.2.3.
They decided to downgrade Digital UNIX from 4.0B to 4.0A,
and that solved the problem.
What I'm trying to do right now is confirm this information, which I
received from my local Oracle Support.
Do you know anything about that?
Regards,
Leo
|
1991.13 | | NETRIX::"[email protected]" | Brian Stevens | Thu Apr 10 1997 14:55 | 9 |
| I wasn't aware of the Barcelona problem. Bernard Laforgue just
completed a benchmark in Valbonne. I was going back through
mail and saw it was with 4.0B and TCR 1.4. They had memory
channel failover problems, but not data corruption. You might
send him mail to see what oracle version and patches they
used. I still suspect hardware though.
Brian
[Posted by WWW Notes gateway]
|
1991.14 | | VAXRIO::LEO | | Thu Apr 10 1997 17:54 | 34 |
| Hi Eric,
> .10
KZPSA-BB -> Hardware revision P01.
            Firmware revision A10.
RZ29B-VW -> Firmware revision DEC 0016.
The BN21K-03 cables (between the KZPSA-BBs) are 3 meters long.
No messages related to the problem in the error logs.
Neither HSZ40 nor LSM is being used yet. We haven't received the HSZ40
so far due to some import problems. This means that they don't have
any kind of RAID configured right now.
They are using just BA356-JC (with DWZZB-VW and H885-AA) as the disk
cabinet.
Do you think there is any problem with this kind of configuration and
these firmware revisions?
Thank you in advance,
Regards,
Leo
Digital Technical Support
Brazil
|
1991.15 | | VAXRIO::LEO | | Mon Apr 21 1997 14:38 | 21 |
| Hi,
We have changed the new-wired-method parameter from the default 1
to 0.
It seems to have solved the problem.
We did it five days ago and so far everything is going fine.
We have had data corruption problems with other databases such as
Informix and Sybase using Digital UNIX 4.0, 4.0A or 4.0B.
I think setting new-wired-method to 0 can fix several data
corruption problems caused by inconsistency in shared memory.
Best regards,
Digital Technical Support
Brazil
|
1991.16 | Urgent, please help !! | HGOM22::MAHUAHSIN | | Wed May 21 1997 12:47 | 15 |
| Hi:
We learned a lot from this note (1991). Currently we have a case using
an AS2100 TruCluster and have done everything mentioned in 1991.
Something very strange: we ran Rob's test program using 8M
instead of 8K. The program crashed the system. We switched to a new
system and the program ran fine. Unfortunately, when OPS started up,
the system crashed again. The crash always occurred on the
same system no matter where the OPS service was located.
This is an urgent case and please help.
Regards,
Hua-Hsin Ma
|
1991.17 | | dust.zk3.dec.com::Marshall | Rob Marshall USEG | Wed May 21 1997 14:58 | 20 |
| Hi,
I have sent you mail. Unfortunately, there is not enough information
in your note to be able to help. As a minimum we need to know what
kinds of panics you are getting.
Also be sure that, if you have set drd-data-compare on one of the
systems, this *must* be set on all of them. If not, you could get
panics because one system is looking for the checksum, but the other
one hasn't calculated it. So, please be sure it is set on all the
members before doing any testing.
Also be sure that you turned off the new wired method (new-wired-method=0,
see .15). This is what fixed the problem originally brought up in this note.
To guarantee a response to your problem, please open an IPMT case.
Rob Marshall
USEG
|