Conference decwet::winnt-clusters

Title:	WinNT-Clusters
Notice:	Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator:	DECWET::CAPPELLOF

Created:	Thu Oct 19 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	863
Total number of notes:	3478

659.0. "Network card, not failover ? Do a manual recover ?" by MDR01::MONJE (MCS Madrid) Wed Feb 26 1997 06:40

    Hi,
    
    From previous topics about a network card failure on one of the servers
    of an NT cluster, I understand that there is no failover in this case.
    
    If each cluster server has a single network card and that card fails on
    one server, the cluster software does not fail over automatically: the
    software on the second server does not recognize the problem, so it does
    not take control of the disks served by the first server.
    
    Is that true? If so, why is it recommended to use two network cards in
    each of the two cluster servers?
    
    Would the answer to the lack of automatic failover be to monitor the
    cluster and, when we detect that one server is down, do a manual
    failover to the second server?
    
    
    Thanks and best regards,
    
    Antonio M:-)

T.R	Title	User	Personal Name	Date	Lines
659.1		MSE1::PCOTE	Rebuilt NT: 163, Rebuilt VMS:1	`Thu Feb 27 1997 10:42`	15
	> Is that true ? Why the recommendations of use 2 network cards at the
	> 2 Servers of the cluster ?

	The network is the interconnect for the cluster 'heartbeat'. If you
	rely on only one network connection and it happens to fail, then
	anarchy will prevail as the cluster software attempts to decide who
	should take ownership of the shared disks. The second (private)
	network alleviates this.

	Your point concerning a disk failover when the serving host incurs a
	network failure is well taken. I don't understand why the cluster
	software (or some value-added software) could not discern this
	situation. Just a simple matter of programming, right :-)
659.2	Some points to discuss about net card failure...	MDR01::MONJE	MCS Madrid	`Thu Feb 27 1997 17:12`	57
	Mmmh... Two points to discuss on that:

	1/ What is the difference between a server failure (e.g. an NT
	shutdown) and a network card failure? In both cases one server of the
	cluster is down as far as the other one is concerned, so shouldn't the
	surviving server detect the failure and take ownership of all the
	disks available in the cluster? I think the behaviour differs between
	those two cases. Maybe it has to do with the way the two servers
	exchange cluster protocol information?

	2/ Today I was able to simulate a network card failure in an NT
	cluster (thanks to my colleagues) with the Digital Clusters for NT
	beta 1.1 software. I disconnected the network card from the LAN on one
	of the cluster servers!

	About 1 min 30 s after the failure, the server disconnected from the
	LAN reports that the other server is down. About 2 minutes after the
	failure, the second server (still connected to the LAN) identifies the
	first server as down but takes no disk failover action.

	I tried a manual failover, but it did not work the first time. After
	some operations I identified a workaround to get a manual failover
	when one server has a network card failure (I don't know whether the
	problem is due to my using the beta v1.1 of the cluster software):

	a. On the server whose network card failed, do a manual failover
	   from "manual failover" in the Cluster administrator tool. The
	   operation returns an error message saying that network
	   communication is not available and the operation cannot be done.

	b. On the second server, connected to the LAN, do an "online disk"
	   operation from the "disk failover" menu option in the Cluster
	   administrator tool. The server takes control of all the available
	   cluster disks.

	It is very important to issue the failover command for the cluster
	disk on the first server only once. If you do the failover operation
	more than once and then try an "online disk" from the second server,
	it doesn't work (error message: "disk is in use").

	Any ideas ??

	Thanks and best regards,

	Antonio M:-)
659.3	network must be running for manual failover	AEOENG::16.40.240.154::annecy::lehy		`Fri Feb 28 1997 03:07`	41
	As far as I understand, the second (private) network is recommended
	for the following reasons (Carl will correct me if I am wrong):

	1- Cluster heartbeats will fail over to the second network if
	   communication over the first one is broken. Here, "first network"
	   does not mean the enterprise network, and "second network" does not
	   mean the private network. (first is the first network listed in
	   the network bindings, second ...)

	2- The 1.0 documentation says that at least one network must be
	   running for a manual failover to succeed (i.e. the cluster members
	   must be able to exchange data over the net).

	A- Two networks
	   If we have a broken network adapter on a server, the cluster
	   software will use the second network for its cluster
	   communications. But it will not trigger any failover (this has
	   been discussed elsewhere in this notesfile). Because of this, you
	   may have to perform a manual failover (if the broken adapter is
	   the one connecting a cluster server to the enterprise network, and
	   if this server has online groups).

	B- One network
	   You only have one net, and this net is broken. Here I am not sure.
	   My understanding is that no failover will happen. Each cluster
	   member will react the same way:
	   b1- try to allocate the offline shared disks
	       (Q: by sending which type of SCSI command ?)
	   b2- I suspect/expect the allocation to fail because the offline
	       disks are mounted on the other server
	   Try b1 and b2 again, for ever ? Role of the quorum disk ?

	Chris
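	The two-network and one-network cases described in 659.3 can be
	sketched as a small decision helper. This is only an illustration of
	the reasoning in the note, not the cluster software's code: the
	function names and the network names are hypothetical, and picking
	"the first network in binding order that is still up" is an assumed
	generalization of the first/second rule.

```python
from typing import Dict, List, Optional


def heartbeat_network(bindings: List[str], up: Dict[str, bool]) -> Optional[str]:
    """Return the first network in binding order that is still up, else None.

    ASSUMPTION: heartbeats simply move to the next bound network that works.
    """
    for name in bindings:
        if up.get(name, False):
            return name
    return None


def manual_failover_possible(bindings: List[str], up: Dict[str, bool]) -> bool:
    """Per the 1.0 documentation cited in the note, a manual failover needs
    at least one network over which the cluster members can exchange data."""
    return heartbeat_network(bindings, up) is not None


if __name__ == "__main__":
    bindings = ["net0", "net1"]  # hypothetical names, in binding order
    # Case A: first network broken, second still running.
    print(heartbeat_network(bindings, {"net0": False, "net1": True}))
    # Case B: a single network, and it is broken.
    print(manual_failover_possible(["net0"], {"net0": False}))
```

	Note that neither case triggers an automatic failover according to
	the note; the helper only says whether cluster communication, and
	therefore a manual failover, is still possible.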
659.4	I'll try to explain a bit.	MSE1::MASTRANGELO		`Fri Feb 28 1997 09:34`	47
	I'll try to explain the case where there is only one network between
	the two servers. In this case, if serverA loses network connectivity
	to serverB (i.e. it gets disconnected from the network), the
	following happens:

	A connection timeout will occur - the communication infrastructure
	will wait for a defined amount of time before declaring that the
	network connection between the servers is down. This interval of time
	is defined by

	  CurrentControlSet\Services\ClusterFailoverManager\Parameters\ConnectionTimeout

	The default for ConnectionTimeout is 30 seconds and it is initially
	not defined. You have to create this key if you want to change it.

	Next, a stabilization delay will occur - this allows the
	infrastructure to attempt to communicate over a second adapter before
	declaring that the other server is down. This delay is enabled by
	default. Enabling or disabling this delay is controlled by

	  CurrentControlSet\Services\ClusterFailoverManager\Parameters\DisableFailoverNetDelay

	This registry key is also initially not defined.

	After the connection timeout and the failover net delay have expired,
	a disk arbitration will occur. ServerA will issue a SCSI bus reset to
	clear the reservation on the disk and wait to see if the other server
	"expresses interest" in the disk. It will grab the disk again if the
	other server does not, and thus a failover will not occur.
	DiskArbitrationInterval is a registry parameter (also under
	ClusterFailoverManager\Parameters and initially not defined) which is
	used to control how frequently the owning server polls the disk to
	verify its ownership.

	It makes no sense to migrate resources to the other server, since
	there is no way to guarantee that the network problem does not also
	affect the other server. Looking at the network card counters to
	determine whether it is a problem with the card has been discussed,
	but my guess is that this technique will not be implemented.
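	The sequence described in 659.4 (connection timeout, then
	stabilization delay, then disk arbitration) can be sketched as a
	small timeline model. It is a rough illustration under stated
	assumptions, not the product's logic: the class and function names
	are hypothetical, and the note does not give the length of the
	stabilization delay, so the caller has to supply it. Only the
	30-second ConnectionTimeout default comes from the note.

```python
from dataclasses import dataclass


@dataclass
class FailoverTimers:
    # ASSUMPTION: the note gives no duration for the stabilization delay,
    # so it has no default here and must be passed explicitly.
    stabilization_delay_s: float
    # Default of 30 seconds is the ConnectionTimeout default from the note.
    connection_timeout_s: float = 30.0
    # Mirrors DisableFailoverNetDelay: the delay is enabled by default.
    disable_failover_net_delay: bool = False


def arbitration_start_s(timers: FailoverTimers) -> float:
    """Seconds after the network loss at which disk arbitration begins."""
    delay = 0.0 if timers.disable_failover_net_delay else timers.stabilization_delay_s
    return timers.connection_timeout_s + delay


def owner_after_arbitration(current_owner: str, other_server: str,
                            other_expresses_interest: bool) -> str:
    """After the bus reset, the disk goes to the other server only if that
    server expresses interest; otherwise the current owner grabs it again."""
    return other_server if other_expresses_interest else current_owner


if __name__ == "__main__":
    timers = FailoverTimers(stabilization_delay_s=10.0)  # illustrative value
    print(arbitration_start_s(timers))
    print(owner_after_arbitration("serverA", "serverB", False))
```

	With the other server showing no interest, the owner keeps the disk,
	which matches the note's point that no failover takes place.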