T.R | Title | User | Personal Name | Date | Lines |
---|
2099.1 | any disk access failures ??? | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Thu May 29 1997 19:08 | 16 |
|
My first reaction to your note is that your drawing shows a single point
of failure, as there is only one SCSI bus.
Now, concerning the described behaviour, you must provide us with more
information. As you correctly mentioned, the AM detects when the
"other" member is no longer responding to SCSI "pings". BUT, this is
NOT enough to cause a failover. After all, maybe this bus is not used
by any service running on that host. A failover is only started if an
access to shared data fails, and only if that data is not
accessible from another disk (i.e. from another plex of the LSM
volume). So, you should also tell us whether you receive a notification
that ASE is not able to access a specific disk.
Regards, Manu
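
The condition described in this reply can be sketched as a small decision
function. This is only an illustration of the logic as stated above, not
ASE's actual implementation; the names PlexState and should_fail_over and
their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlexState:
    """Hypothetical view of one plex (mirror copy) of an LSM volume."""
    name: str
    accessible: bool  # True if I/O to this plex currently succeeds


def should_fail_over(ping_lost: bool, io_failed: bool,
                     plexes: List[PlexState]) -> bool:
    """Return True only when the conditions described in the note are met.

    ping_lost is deliberately not consulted: a lost SCSI "ping" alone is
    not enough. An access to shared data must have failed, and no other
    plex may still be able to serve that data.
    """
    if not io_failed:
        # No failed access to shared data, so nothing triggers a failover.
        return False
    return not any(p.accessible for p in plexes)
```

With a mirrored volume, should_fail_over(True, True, ...) stays False as
long as one plex is still accessible, and a lost ping with no failed I/O
never returns True.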
|
2099.2 | ASE AND SCSI BUS PARTITION | SOSGPX::FIORINI | | Tue Jun 03 1997 07:35 | 38 |
|
Hi Manu,
thanks for the answer to my note.
As you have noticed, there is a single point of failure, because each
system has only one SCSI interface, and the two shared disks (one is
used as a mirror) are on that interface.
I cannot change the configuration right now, but I have informed the
Customer, and I think the configuration will be changed soon.
In the current configuration, only one service (which uses both disks)
has been defined, and the first system provides the service.
The second system is in "stand by" and will take over the service if the
first system fails.
If the SCSI cable is disconnected from the system that provides the
service, that system is not able to access any data on the shared bus.
AM detects the failure and reports that there is a SCSI bus partition.
There is no entry in the error log regarding access to the
disks (the logging severity level has been set to log notices, warnings
and errors).
The Customer's application is run via the start action script.
In it there are two lines that call two other scripts in order to
start the database (Oracle) and the application.
Once the application is started, it accesses the data on the shared
disks continuously.
I think that ASE does not relocate the service because it does not
ping the disks itself, and does not know that the application is unable
to reach the data.
Is there any way to solve the problem?
Regards, Moreno
|
2099.3 | How is the service defined? | HERON::BLOMBERG | Trapped inside the universe | Tue Jun 03 1997 09:23 | 1 |
|
|
2099.4 | .0: It seems that ASE1.4 can't handle SCSI single point failure | EPS::NGUYEN | Without fools there would be no wisdom. | Tue Jun 03 1997 11:17 | 18 |
|
Hi there,
I have the same problem: my application service does not fail over when
the SCSI bus is disconnected. My software versions are the same as those in
.0, and I use an HSZ40 instead of the BA box. I configured the application
with favored members "system1,system2", and it is set NOT to fail over to
the more highly favored member automatically.
In order to trigger the failover in the old version, ASE 1.3, I used to
"cd" into the directory on the shared disk and do an "ls", but it seems this
does NOT work anymore: the application cannot get to the data on the shared
disk and just hangs there without failing over to the other system.
Any recommendations/suggestions are highly appreciated.
Regards,
Gina Nguyen
|
2099.5 | | COMICS::CORNEJ | What's an Architect? | Tue Jun 03 1997 12:24 | 12 |
| >In order to trigger the failover in the old version, ASE 1.3, I used to
>"cd" into the directory on the shared disk and do an "ls", but it seems this
>does NOT work anymore: the application cannot get to the data on the shared
>disk and just hangs there without failing over to the other system.
The service will not fail over if you "cd" to the filesystem on the
shared disk (look in daemon.log - it will show the umount failing
because the device is still busy).
Jc
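
A script that has to touch the shared filesystem can make sure it does not
leave its own working directory there. The sketch below is only an
illustration; is_inside and leave_if_inside are hypothetical helper names,
and the check covers only the calling process. Any other process with its
working directory or open files on the filesystem would still keep the
device busy.

```python
import os


def is_inside(path: str, mount_point: str) -> bool:
    """Return True if path is the mount point itself or lies below it."""
    path = os.path.realpath(path)
    mount_point = os.path.realpath(mount_point)
    return os.path.commonpath([path, mount_point]) == mount_point


def leave_if_inside(mount_point: str, safe_dir: str = "/") -> bool:
    """chdir to safe_dir when the current directory is under mount_point.

    Returns True if the working directory was changed.
    """
    if is_inside(os.getcwd(), mount_point):
        os.chdir(safe_dir)
        return True
    return False
```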
|
2099.6 | SCSI full partition changed in 1.4 ? | BACHUS::DEVOS | Manu Devos NSIS Brussels 856-7539 | Wed Jun 04 1997 17:34 | 20 |
| Hi Jc,
> The service will not fail over if you "cd" to the filesystem on the
> shared disk (look in daemon.log - it will show the umount failing
> because the device is still busy).
I think .4 meant that, after disconnecting the SCSI bus
cable, he had to do a "cd - ls" to cause an I/O on the disconnected disk
and so trigger the failover. This simply confirms my answer that a
cable disconnection is not sufficient to cause a failover; an I/O is
also needed.
But the interesting information in .0, .3 and .4 seems to be that this
no longer works with version 1.4...
Has anything changed in version 1.4 regarding a full SCSI bus partition
compared with version 1.3?
Manu.
|
2099.7 | :-) | COMICS::CORNEJ | What's an Architect? | Thu Jun 05 1997 08:45 | 5 |
| Ooops! Sorry about that. It is what comes from doing what I wrote in
.4 too often that day myself:-)
Jc
|
2099.8 | ScSi fail over for less than 1-minute | EPS::NGUYEN | Without fools there would be no wisdom. | Thu Jun 05 1997 10:29 | 23 |
| Hello there,
Thank you very much for the discussion of this case.
What .6 has written is precisely what I meant. Since ASE has already
discovered that there is a failure on the SCSI bus, it does not hurt to "cd"
into the directory (after all, it is only an empty bucket on which to mount
the shared disk). It only fails to work, as .5 pointed out, under normal
conditions.
Well, after some more extensive testing, I've found that the
system will fail over if any I/O happens quickly enough (I mean within
less than 2 minutes or so) around the time of the SCSI failure. The longer
the delay, the more difficult it is for the system to do anything.
What I mean, again, is that after "1" minute one should expect the
system to hang. However, I believe that if our customers buy our
product, they expect it to work unless it is stated
otherwise. What if the SCSI bus fails at night, or at some other time when
there is not much I/O traffic?
Any suggestions from the product team?
Gina Nguyen
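
One way to avoid depending on application traffic is a small periodic probe
that forces a real I/O to the shared disk. The sketch below is only an
illustration of that idea; probe_write and the probe file are hypothetical,
and nothing in this thread confirms that a failed probe makes ASE 1.4
relocate the service.

```python
import os
import threading


def probe_write(probe_path: str, timeout_s: float = 10.0) -> str:
    """Write one byte to probe_path and force it to disk.

    Returns "ok", "error" (the I/O raised OSError) or "timeout"
    (the I/O did not complete in time, for example because it hangs).
    """
    result = {}

    def worker() -> None:
        try:
            fd = os.open(probe_path,
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, b"x")
                os.fsync(fd)  # push the write past the buffer cache
            finally:
                os.close(fd)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    # Daemon thread: a hung I/O must not stop the caller from returning.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    return result.get("status", "timeout")
```

Such a probe would be called with an absolute path from a working directory
outside the shared filesystem, given the "device busy" point in .5. Note
that a worker stuck in a hung I/O still holds the probe file open.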
|
2099.9 | | KITCHE::schott | Eric R. Schott USG Product Management | Thu Jun 05 1997 14:15 | 9 |
| Hi
If you have behavior that you think is incorrect (or different between
releases), I suggest you file a QAR or CLD/IPMT.
The system should not hang...so I think this is a serious problem. I
think to get the attention this deserves, you should escalate as
required by your customer.
|