T.R | Title | User | Personal Name | Date | Lines |
---|
2068.1 | more info | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Mon May 19 1997 10:59 | 25 |
| Hi,
We need more information to help you.
When the two systems are up and running, can you
1) Stop the service ?
2) Start the service ?
3) Relocate the service to the other member ?
Is it a message "device busy" in the daemon.log ?
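A minimal sketch for answering the "device busy" question, assuming you pass in the path of the daemon.log you want to inspect (no default location is hard-coded here). The function name find_device_busy is made up for illustration.

```shell
#!/bin/sh
# Sketch: show every "device busy" line in a given daemon.log file.
# The caller supplies the log path; nothing about its location is assumed.
find_device_busy() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # -i: ignore case; -n: prefix each hit with its line number
    grep -i -n 'device busy' "$log"
}
```

Call it as `find_device_busy /path/to/daemon.log`. An exit status of 1 with no output means no such message was logged.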
When you try "init 0", do you wait for a long time before doing reset ?
(ASE should stop every service running on that host, and that can take
time. Also, if something is going wrong, the ASE internal timeout is
quite long).
Please tell us whether it has worked before, whether something has been
changed, or what you think triggered the problem.
What is working and what is not working ?
Manu.
|
2068.2 | | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Mon May 19 1997 11:26 | 18 |
| Hi again,
I just read the note 1975.2 which seems to concern the same customer
and saw that it is a SAP config. Did you finally resolve the original
problem ?
If the start/stop of the service is NOT working when the two members
are connected, then you can try to disconnect the shared SCSI bus
cables from the other node and repeat the test. If it works, then you
likely have either a mis-configured system (same scsi id for the
controllers ?) or a hardware problem on one of the SCSI Buses.
Take a progressive approach from a working starting point until you
see the problem appear. The last step could give you the beginning of
the solution.
Manu.
|
2068.3 | | MEOC02::LEE | | Tue May 20 1997 04:07 | 96 |
| Hi,
We (...the local CSS, TSC, NSIS) have a long and painful saga with
this particular DECsafe installation. The cluster was put in about
18 months ago. When it was first installed, DECsafe did work to a
certain extent. It will failover the disks from HX11 to HX12 but
there were some bugs in the ASE SAP scripts that prevented the SAP
R/3 application from starting on HX12. The customer employs about 70
SAP contractors and did not want to hand the machines over to test
the scripts then.
Six months down the track, we got a call from their Sys Admin. He tried
adding some disks to the cluster and it somehow got into a hung state.
They eventually got HX11 up and running by disabling DECsafe on HX12.
HX11 will panic whenever HX12 is rebooted with DECsafe turned on.
Since then, DECsafe on HX12 has been turned off. Despite the problems,
the machines have been upgraded:-
DU3.2C -> DU3.2G
SAP 3.0C-> SAP 3.0D+
HX12 which was a AS2100 was replaced with an AS8400.
(we preserved the LSM/DECsafe environment from the AS2100 by
moving the system disks across and the placement of the SCSI
controllers. The new HX12 came up without a problem...with
DECsafe turned off, that is).
During the Easter Weekend, we turned on DECsafe on HX12 and of course,
it caused HX11 to panic. We also got ourselves into a knot...as described
in note 1975.
Last weekend, we applied the DECsafe V1.3 patches, deleted, re-added
HX12. This fixed the problems with the machine panics. However, we tripped
over again...with another problem.
This is what happened....
The DECsafe patches solved the problem with the machine
panics. But, we encountered another DECsafe problem. With
DECsafe turned on on both HX11 and HX12, we cannot shut down
HX12 cleanly via an 'init 0'. It appears to hang and requires
a physical reset. The LSM disks on HX12 need to resync on
the next reboot.
Test Results
When Chris & I finally got the nod, we did the following:-
1) applied the following DECsafe V1.3 patches:-
ASE130-005
ASE130-013
ASE130-014
ASE130-015
2) Rebuild the kernels for HX11 and HX12.
3) Shutdown HX11 and HX12
4) Boot HX11, disable the ASE service, sapdb
5) Boot HX12, turn on and reinitialize DECsafe.
6) On HX11, delete member HX12 and re-add HX12.
Everything seems OK at this point. We now have
DECsafe up and running on both HX11 and HX12.
7) On HX11, enable the ASE service. This mounted
the shared File Systems, started ORACLE and the
SAP R/3 Application.
8) We decided to reboot HX11 and HX12 to check if
they will both come up without any problems.
We issued an 'init 0' on HX12 and waited, waited
and waited........... (20 minutes)
After physically resetting HX12, the following
messages were logged in daemon.log:-
May 17 18:49:52 hx12 DECsafe: local Asemgr Error:
Can't connect to HSM
May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
msgSvcOpenChannel: Agent not in target's port map
May 17 18:49:52 hx12 DECsafe: deregister_fd: not registered
May 17 18:49:52 hx12 DECsafe: local Asemgr notice:
can't connect to local agent, retrying...
May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
msgSvcOpenChannel: Agent not in target's port map
May 17 18:49:57 hx12 DECsafe: deregister_fd: not registered
May 17 18:49:57 hx12 DECsafe: local Asemgr notice:
can't connect to local agent, retrying...
The last 3 messages were repeated at 5-second intervals.
9) HX12 will still hang even if we shut down DECsafe first via the commands
/sbin/init.d/asemember stop
/sbin/init.d/aseam stop
before the 'init 0'
10) With DECsafe turned off on HX12, it will reboot cleanly.
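To see which messages dominate a retry loop like the one in step 8, a small sketch that strips the timestamp and host from each log line and counts identical messages. It assumes each entry sits on one line starting with four fields (month, day, time, host), as in the excerpt above. The function name summarize_log is made up.

```shell
#!/bin/sh
# Sketch: count identical messages in a syslog-style file, most frequent first.
# Assumes each line begins with four fields: month, day, time, host.
summarize_log() {
    # Blank the first four fields, trim the leading spaces left behind,
    # then count duplicates.
    awk '{ $1=""; $2=""; $3=""; $4=""; sub(/^ +/, ""); print }' "$1" |
        sort | uniq -c | sort -rn
}
```

Run as `summarize_log /path/to/daemon.log`. A message whose count keeps growing between two runs is the one being retried.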
I re-checked the SCSI controllers, cabling and disks this morning
and I am certain that they are correctly configured. The shared
disks are on SCSI buses 2,3,4 & 5 on both machines.
Thanks for the replies....and please keep it coming.
|
2068.4 | Some comments and suggestions ... | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Tue May 20 1997 07:58 | 122 |
| < Hi,
<
< We (...the local CSS, TSC, NSIS) have a long and painful saga with
< this particular DECsafe installation. The cluster was put in about
< 18 months ago. When it was first installed, DECsafe did work to a
< certain extent. It will failover the disks from HX11 to HX12 but
< there were some bugs in the ASE SAP scripts that prevented the SAP
< R/3 application from starting on HX12. The customer employs about 70
< SAP contractors and did not want to hand the machines over to test
< the scripts then.
So, you are still using the original (wrong) scripts ? or did you change them?
< Six months down the track, we got a call from their Sys Admin. He tried
< adding some disks to the cluster and it somehow got into a hung state.
As the stop/start scripts are needed in each service modification (you know the
famous sequence: stopping-deleting-adding-starting the service), a non-working
stop script can lead to an apparent "hung" state (which generally exits after a
very long time [36']).
< They eventually got HX11 up and running by disabling DECsafe on HX12.
< HX11 will panic whenever HX12 is rebooted with DECsafe turned on.
If HX12 causes HX11 to panic when it boots, this is caused either by a
mis-configured SCSI controller ID/termination OR by an ASE database that is OUT of
sync with the running system. Each member can then try to start a director and
consequently to start the service.
< Since then, DECsafe on HX12 has been turned off. Despite the problems,
< the machines have been upgraded:-
< DU3.2C -> DU3.2G
< SAP 3.0C-> SAP 3.0D+
< HX12 which was a AS2100 was replaced with an AS8400.
This is not a criticism (I too know the pressure a customer can place on us!),
but consecutive changes to a NOT-WORKING environment can only make the problems
virtually impossible to cure. The step-by-step (and thus lengthy) process is
the only valid approach to solving complicated problems.
< (we preserved the LSM/DECsafe environment from the AS2100 by
< moving the system disks across and the placement of the SCSI
< controllers. The new HX12 came up without a problem...with
< DECsafe turned off, that is).
???, very strange to me!!! It is so simple to delete the Member from the
cluster and then to re-add it. You are then sure that the last (most up to date)
version of the ASE database is used on the new (added) system. But, maybe you
had tried that because ASE had been disabled on that system ???
<
< During the Easter Weekend, we turned on DECsafe on HX12 and of course,
< it caused HX11 to panic. We also got ourselves into a knot...as described
< in note 1975.
<
< Last weekend, we applied the DECsafe V1.3 patches, deleted, re-added
< HX12. This fixed the problems with the machine panics. However, we tripped
< over again...with another problem.
<
< This is what happened....
<
< The DECsafe patches solved the problem with the machine
< panics. But, we encountered another DECsafe problem. With
< DECsafe turned on on both HX11 and HX12, we cannot shut down
< HX12 cleanly via an 'init 0'. It appears to hang and requires
< a physical reset.
< The LSM disks on HX12 need to resync on the next reboot.
The resync is absolutely normal and expected. An LSM mirrored volume which is not
stopped (init 0 should stop it, but you reset the system before it reached that
point) should always resynchronize its mirrors.
SO, your last problem is 'init 0' not finishing in a reasonable time.
To debug that problem, I suggest you modify the /sbin/rc0 script such that you
can monitor at the console the progression of 'init 0'. Edit the file to place
an 'echo $f' just before the invocation of the stop shell script. See the next
example:
if [ -d /sbin/rc0.d ]; then
    # KILL procedure
    for f in /sbin/rc0.d/K*
    do
        if [ -s $f ]; then
            echo "Starting $f ..."    # <--------- Here!
            /sbin/sh $f stop
        fi
    done
...
You can now try an 'init 0' and see on the console the name of each stop script
just before it is executed. Once you find the blocking script, insert a "set -x"
in that script to be able to monitor its execution at the console. I am sure you
can find the offending command/problem...
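The same idea can be rehearsed without touching /sbin/rc0. The sketch below is not the real rc0: it is a stand-alone function (the name run_stop_scripts is made up) that walks the K* files of whatever directory you give it, prints a timestamp and the script name before each one, and reports the exit status afterwards. It uses plain `sh` rather than `/sbin/sh` so it can be tried on a scratch directory.

```shell
#!/bin/sh
# Sketch: run the K* stop scripts of a directory one by one, announcing each.
# Intended for a scratch copy of the scripts, not for the live rc0.d.
run_stop_scripts() {
    dir=$1
    for f in "$dir"/K*; do
        # -s: skip missing or empty files, like the loop quoted above
        if [ -s "$f" ]; then
            echo "$(date '+%H:%M:%S') running: $f stop"
            sh "$f" stop
            rc=$?    # save the status before anything else overwrites it
            echo "$(date '+%H:%M:%S') finished: $f (exit $rc)"
        fi
    done
}
```

The last "running:" line with no matching "finished:" line names the blocking script. Replacing `sh "$f" stop` with `sh -x "$f" stop` traces every command without editing the script itself.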
<
< 9) HX12 will still hang even if we shut down DECsafe first via the commands
< /sbin/init.d/asemember stop
< /sbin/init.d/aseam stop
< before the 'init 0'
This proves that DECsafe is not a player here !!!
< 10) With DECsafe turned off on HX12, it will reboot cleanly.
What do you mean by "reboot cleanly" ? Do you mean that 'init 0' is working when
you booted the system with ASE=off and NOT working when booted with ASE=on or
that you can boot "cleanly" when ASE=off ?
<
< I re-checked the SCSI controllers, cabling and disks this morning
< and I am certain that they are correctly configured. The shared
< disks are on SCSI buses 2,3,4 & 5 on both machines.
<
< Thanks for the replies....and please keep it coming.
<
<
Finally, try to run all your tests in a script session, so you can give us all
the evidence.
Manu.
|
2068.5 | Let me explain... | MEOC02::LEE | | Tue May 20 1997 21:00 | 39 |
| Hi,
I have another 12 hour window this weekend for further tests. I am
trying to get as many ideas as I can to help pinpoint or solve the
problem.
<<So, you are still using the original (wrong) scripts ? or did you
<<change them?
Yes, we made some minor changes to rc_service and db_service at this
point in time. The script manually maintains a list of file systems to
mount. We intend to make some additional changes later.
<<What do you mean by "reboot cleanly" ? Do you mean that 'init 0' is
<<working when
<<you booted the system with ASE=off and NOT working when booted with
<<ASE=on or
<<that you can boot "cleanly" when ASE=off ?
<<This proves that DECsafe is not a player here !!!
When HX12 is booted with ASE=off, init 0 will halt the system. The chevron
prompt '>>> ' is displayed at the console. HX12 will boot and shutdown
without a problem.
If I now set ASE=on, by booting HX12 into single user mode, editing
rc.config, HX12 will startup fine. No resync of local LSM disk.
If an 'init 0' is issued, HX12 will appear hung during the execution of
the K* scripts (I will work out which one this weekend). I was quick to
jump to the conclusion that it was DECsafe related, as the 'init 0' executes
the same K* scripts. Every 5 seconds, the LED on the system disk
flashes and seems to go on and on....until I hit <CONTROL P> at the console.
This activity corresponded with the entries in daemon.log.
Rebooting at this point (ASE=on) will bring HX12 up and LSM needs to
resync the disk, as expected. An 'init 0' now gets HX12 into the same
hung state.
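When comparing runs it helps to record which setting was actually in effect. A small sketch, assuming the setting appears as a plain `NAME=value` line in an rc.config-style file and using `ASE` as the variable name only because that is the shorthand used in this thread. The helper name rc_value is made up.

```shell
#!/bin/sh
# Sketch: print the value of the last NAME=value assignment in a config file.
# The variable name "ASE" used in the thread is shorthand, so it is a
# parameter here rather than being hard-coded.
rc_value() {
    file=$1
    var=$2
    # Keep only lines that start with VAR=, take the last one, strip quotes.
    sed -n "s/^$var=//p" "$file" | tail -1 | tr -d '"'
}
```

For example, `rc_value /etc/rc.config ASE` before each test records whether that boot was done with the setting on or off.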
Thanks...for the patience
|
2068.6 | Further clarification | MEOC02::JANKOWSKI | | Wed May 21 1997 08:49 | 55 |
| Hi,
I am the other guy working on the system with Kay Lee - the author of
.0
I would like to provide the following clarification.
We only have one service - sapdb.
This service is normally working on HX11 as this is the preferred
member.
The status of the service is that it cleanly brings up SAP when it
starts and cleanly stops SAP when it stops.
We have *not* got to the stage yet that we actually tested the failover
to HX12. This is our objective but we need to get to clean state first and
be very careful - this is a production system with lots of storage.
At the moment we stabilised the system - we have DECsafe running on
both machines and the service comes up fine on HX11 on startup.
To progress further we need to be able to shut down HX12 cleanly.
If we cannot do it we will have 200 GB of LSM disks resyncing
and it takes about 8hrs to do.
Note that our window is 12 hours.
Just to summarize:
The current problem is:
HX12 will not complete - init 0 - with the errors as per .0
with *no* service running on HX12.
However, if we disable DECsafe by setting ASE=off in rc.config
then after next reboot the machine will shut down cleanly.
Also note that just shutting down the daemons by running asemember stop
and aseam stop does not remove the problem.
This is strange.
Our current plan for our 12hr window is:
0. Activate HX12 (ASE=ON)
1. delete HX12 from ASE configuration.
2. Remove ASE subsets from HX12
3. reinstall ASE on HX12 and apply patches.
4. add HX12 to existing ASE configuration
Any comments?
Chris Jankowski
Melbourne Australia
|
2068.7 | | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Wed May 21 1997 18:13 | 29 |
| Hi again Chris and Kay,
I don't know if this can help you, but I strongly recommend the
following approach:
As your time window is only 12 hours, I suggest that you stop the
application by calling the DECsafe stop script (outside of ASE, of
course), and if it is successful, that you stop the ASE service with
asemgr. Then you can try "init 0" and check if it works. If it is
hanging, at least your LSM volumes will not have to be resynchronized.
Then, you can narrow the problem by modifying the /sbin/rc0 script as
described some replies before.
By the way, I was wondering whether "init 0" on HX12 was done while the
sapdb service was running on HX12 itself or on HX11. If the service is
running on HX11, doing "init 0" and resetting HX12 SHOULD NOT CAUSE the
service to resynchronize its mirrors !!!
Also, some time ago I faced a serious problem debugging an ASE 1.2A
cluster to which someone had applied the ASE 1.3 patches !!! Do the
patches you have installed apply to your ASE version?
Did you rebuild the kernel after the patches ?
Forgive my naive questions, but they are only intended to help you !
Manu.
|
2068.8 | Further clarification. | MEOC02::JANKOWSKI | | Wed May 21 1997 21:26 | 16 |
| Re. 7
The ASE is V1.3 and the patches are for V1.3
Kernel has been rebuilt.
The sapdb service runs on HX11.
(unsuccessful) shutdown of HX12 causes only the local system disks
to be resynchronized after a forced halt.
However, our test plan calls for failover of the service to HX12.
We would prefer to do this when the machine can be shutdown cleanly
as otherwise we may be left with those disks there and having
to resynchronize them.
Cheers,
Chris
|
2068.9 | broken script | GIDDAY::SCHWARZ | | Mon May 26 1997 01:35 | 20 |
|
I went to site on Saturday to help isolate the shutdown problems.
Thanks to Manu for his suggestions - they were great. It turned out
that in one of the sap shutdown scripts there was a call to asemgr.
Unfortunately the sap script was called AFTER asemember stop had been
run. Thus with the aseagent not running asemgr just sat there and the
shutdown did not complete. This call to asemgr was not supposed to be
called during shutdown - only during boot. Modifying the script to
reflect this allowed the system to shutdown cleanly.
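Since the root cause was an ordering problem, it can help to list the stop scripts in the order the shutdown loop would visit them. A minimal sketch, assuming the loop expands a `K*` glob as in the rc0 excerpt in .4, so that the shell's sort order is the execution order. The function name stop_order is made up.

```shell
#!/bin/sh
# Sketch: print the K* entries of a directory in glob (execution) order,
# numbered from 1, so the relative position of two scripts is obvious.
stop_order() {
    dir=$1
    n=0
    for f in "$dir"/K*; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        n=$((n + 1))
        echo "$n ${f##*/}"
    done
}
```

Running `stop_order /sbin/rc0.d` and comparing the position of the application's entry with that of the ASE entries shows whether the application is stopped before or after the agent goes away.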
Lesson to learn:
1) separate your boot and shutdown scripts
2) check the order in which things happen before assuming your scripts
work.
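A sketch of lesson 1, keeping the boot-only work on the start path so that the stop path never needs the ASE agent. All names here (query_ase, start_app, stop_app, service_action) are placeholders and are not taken from the real SAP or ASE scripts. The asemgr call itself is left as a stub because its arguments are not given in this thread.

```shell
#!/bin/sh
# Hypothetical skeleton of a start/stop script with separate code paths.

query_ase() {
    # Placeholder for the step that talks to asemgr. Boot path only.
    echo "querying ASE"
}

start_app() { echo "starting application"; }
stop_app()  { echo "stopping application"; }

service_action() {
    case "$1" in
        start)
            query_ase
            start_app
            ;;
        stop)
            # No ASE calls here: the agent may already be stopped.
            stop_app
            ;;
        *)
            echo "usage: service_action {start|stop}" >&2
            return 1
            ;;
    esac
}
```

With this shape, `service_action stop` cannot block waiting for an agent that asemember stop has already shut down.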
Kym Schwarz
Unix Support
CSC Sydney
|
2068.10 | other results of the recent debugging session. | MEOC02::JANKOWSKI | | Thu May 29 1997 05:58 | 16 |
| As per .9 the immediate problem of not being able to shutdown is
solved. Thanks to Manu for his excellent suggestions in .4.
The fact that the machine would shutdown cleanly with ASE disabled
was what put us on a wrong track.
Anyway, we also made excellent progress on debugging and testing
of the start and stop scripts.
At the moment we can failover manually from HX11 to HX12 and back
reliably. We also get correct actions when we boot and shutdown
machines in all combinations of situations and order.
Regards,
Chris Jankowski
Melbourne Australia
|
2068.11 | are you using the "official" scripts | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Thu May 29 1997 18:04 | 19 |
| Hi Chris and the team ...
First, I am glad to hear good news. Notes conferences are of great help
for all of us!
> Anyway, we also made excellent progress on debugging and testing
> of the start and stop scripts.
> At the moment we can failover manually from HX11 to HX12 and back
> reliably. We also get correct actions when we boot and shutdown
> machines in all combinations of situations and order.
Do you know that the DIGITAL-SAP team in Walldorf (Germany) produced
"official" and "supported" start and stop scripts for DECsafe? I think
the team leader is Thomas Heinz. I don't have his e-mail address here
at home, but I am sure that, if you mail the question to "Marc Dubois @BRO",
he can answer.
Regards, Manu.
|