T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---
659.1 | | MSE1::PCOTE | Rebuilt NT: 163, Rebuilt VMS:1 | Thu Feb 27 1997 10:42 | 15 |
|
> Is that true? Why the recommendation to use 2 network cards at the
> 2 servers of the cluster?
The network is the interconnect for the cluster 'heartbeat'. If
you rely on only one network connection and that happens to
fail, then anarchy will prevail as the cluster software attempts
to decide who should take ownership of the shared disks. The 2nd
(private) network alleviates this.
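The role of the second interconnect can be sketched as follows. This is a toy model, not the product's actual heartbeat code: the link-state function and probe order are illustrative assumptions.

```python
def partner_alive(links, send_heartbeat):
    """Probe each interconnect in binding order; declare the partner
    down only if the heartbeat fails on every available network."""
    for link in links:
        if send_heartbeat(link):
            return True   # partner answered on this path
    return False          # lost on all paths -> risk of disk-ownership anarchy

# Hypothetical link states: the enterprise LAN is broken, the private one is up.
link_up = lambda link: link == "private"

# With only the enterprise LAN, the partner looks dead; with the
# private interconnect as a second path, the heartbeat survives.
print(partner_alive(["enterprise"], link_up))             # False
print(partner_alive(["enterprise", "private"], link_up))  # True
```

The point of the sketch: a single failed NIC never makes both servers conclude the other is gone, so neither side tries to seize the shared disks unilaterally.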
Your point concerning a disk failover when the serving host incurs
a network failure is well taken. I don't understand why the cluster
software (or some value-added software) could not discern this
situation. Just a simple matter of programming, right :-)
|
659.2 | Some points to discuss about net card failure... | MDR01::MONJE | MCS Madrid | Thu Feb 27 1997 17:12 | 57 |
| Mmmh...
Two points to discuss on that:
1/ What is the difference between a CPU failure (i.e. an NT shutdown) and a
network card failure? In both cases one server of the cluster
appears down to the other one, and shouldn't the surviving server detect the
failure and take ownership of all the disks available in the cluster?
I think there is a different behaviour in these two cases. Maybe it is the
way the two servers exchange cluster protocol information
between them?
2/ Today I have been able to simulate a network card failure in an
NT cluster (thanks to my colleagues) with the Digital Clusters for
NT beta 1.1 SW.
I disconnected the network card from the LAN on one of the cluster
servers!
Around 1 min. 30 sec. after the failure, the server disconnected from the LAN
reports that the other server is down. At 2 minutes after the failure, the
second server (still connected to the LAN) identifies the first server as down
but doesn't take any disk failover action.
I tried a manual failover but it didn't work the first time. After
some operations I identified a workaround to get the manual failover to work
when one server has a network card failure (I don't know if the problem
is due to my using the beta v1.1 of the cluster software):
a. At the server where the network card failed, do a manual failover
operation from "manual failover" in the Cluster Administrator tool.
The operation gives an error message that network communication is
not available and the operation cannot be completed.
b. At the second server, still connected to the LAN, do an "online disk"
operation from the "disk failover" menu option in the Cluster
Administrator tool. The server takes control of all available
cluster disks.
It is very important to run the failover command on the first server's
cluster disk only once. If you do the failover operation more than once
and then try an "online disk" from the second server, it doesn't
work (error message: "disk is in use").
Any ideas ??
Thanks and best regards,
Antonio M:-)
|
659.3 | network must be running for manual failover | AEOENG::16.40.240.154::annecy::lehy | | Fri Feb 28 1997 03:07 | 41 |
| As far as I understand, the second (private) network is recommended for
the following reasons (Carl will correct me if I am wrong):
1- Cluster heartbeats will fail over to the second network if communication
over the first one is broken. Here, "first network" does not mean the
enterprise network, and "second network" does not mean the private network
("first" is the first network listed in the network bindings, "second" the
next one listed).
2- The 1.0 documentation says that at least one network must be running for
a manual failover to succeed (i.e. the cluster members must be able to
exchange data over the net).
A- If you have two networks
If a server has a broken network adapter, the cluster software
will use the second network for its cluster communications. But it will not
trigger any failover (this has been discussed elsewhere in this notesfile).
Because of this, you may have to perform a manual failover (if the broken
adapter is the one connecting a cluster server to the enterprise network,
and if this server has online groups).
B- One network
You only have one net, and this net is broken. Here I am not sure. My
understanding is that no failover will happen. Each cluster member will
react the same way:
b1- try to allocate the offline shared disks
(Q: by sending which type of SCSI command?)
b2- I suspect/expect the allocation to fail because the offline disks are
mounted on the other server
Then try b1 and b2 again, forever? What is the role of the quorum disk?
Chris
|
659.4 | I'll try to explain a bit. | MSE1::MASTRANGELO | | Fri Feb 28 1997 09:34 | 47 |
|
I'll try to explain the case where there is only one network between
the two servers.
In this case, if serverA loses network connectivity (i.e. it gets
disconnected from the network) to serverB, the following happens:
A connection timeout will occur - the communication infrastructure will
wait for a defined amount of time before declaring the network
connection between the servers is down. This interval of time is
defined by
  CurrentControlSet\Services\ClusterFailoverManager\Parameters\ConnectionTimeout
The default for ConnectionTimeout is 30 seconds and it is initially not
defined. You have to create this key if you want to change it.
Next, a stabilization delay will occur - this allows the infrastructure
to attempt to communicate over a second adapter before declaring the
other server is down. This delay is enabled by default. Enabling or
disabling this delay is controlled by
  CurrentControlSet\Services\ClusterFailoverManager\Parameters\DisableFailoverNetDelay
This registry key is also initially not defined.
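As a concrete illustration, the two values above could be created with a .reg file along these lines. The HKEY_LOCAL_MACHINE\SYSTEM root, the REG_DWORD value type, and the meaning of a non-zero DisableFailoverNetDelay are my assumptions, not confirmed by the product documentation:

```
REGEDIT4

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusterFailoverManager\Parameters]
; ConnectionTimeout in seconds (defaults to 30 when not defined); 0x3c = 60
"ConnectionTimeout"=dword:0000003c
; non-zero value assumed to disable the stabilization delay
"DisableFailoverNetDelay"=dword:00000001
```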
After the connection timeout and the failover net delay have expired, a
disk arbitration will occur. ServerA will issue a SCSI bus reset to
clear the reservation on the disk and then wait to see if the other server
"expresses interest" in the disk. If the other server does not, serverA
will grab the disk again, and thus a failover will not occur.
DiskArbitrationInterval is a registry parameter (also under
ClusterFailoverManager\Parameters and initially not defined) which is
used to control how frequently the owning server polls the disk to
verify its ownership.
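The detection timeline described above can be sketched as a small calculation. Only the 30-second ConnectionTimeout default comes from this note; the stabilization-delay length and the behaviour encoded here are assumptions for illustration:

```python
CONNECTION_TIMEOUT = 30   # ConnectionTimeout default, per the note (seconds)
FAILOVER_NET_DELAY = 30   # assumed length of the stabilization delay (seconds)

def seconds_until_arbitration(second_network_ok, net_delay_disabled=False):
    """Elapsed time before disk arbitration (SCSI bus reset and
    re-reservation) begins, or None if the heartbeat recovers."""
    if second_network_ok:
        return None  # communication resumed on the second adapter
    t = CONNECTION_TIMEOUT
    if not net_delay_disabled:
        t += FAILOVER_NET_DELAY  # wait for the second-adapter attempt
    return t

print(seconds_until_arbitration(True))                            # None
print(seconds_until_arbitration(False))                           # 60
print(seconds_until_arbitration(False, net_delay_disabled=True))  # 30
```

This also shows why setting DisableFailoverNetDelay only makes sense on single-network clusters: with two networks, skipping the delay forfeits the chance to recover over the second adapter.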
It makes no sense to migrate resources to the other server since there
is no way to guarantee that the network problem does not also affect
the other server.
Looking at the network card counters to determine if it is a problem
with the card has been discussed but my guess is that this technique
will not be implemented.
|