
Conference decwet::winnt-clusters

Title:WinNT-Clusters
Notice:Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator:DECWET::CAPPELLOF
Created:Thu Oct 19 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:863
Total number of notes:3478

659.0. "Network card, not failover ? Do a manual recover ?" by MDR01::MONJE (MCS Madrid) Wed Feb 26 1997 06:40

    Hi,
    
    If I understand correctly some previous topics about a network card
    failure on one of the servers of an NT cluster, there is no failover
    in this case.
    
    If each NT cluster server has one network card and that network card
    fails in one of the servers, the NT cluster SW doesn't fail over
    automatically, because the cluster SW on the second server does not
    identify the problem, so it doesn't take control of the disks served
    from the first server and no failover happens.
    
    Is that true? And why is it recommended to use 2 network cards in
    each of the 2 servers of the cluster?
    
    Could the answer to the missing automatic failover be to monitor
    the cluster and, when we detect that one server is down,
    do a manual failover to the second server of the cluster?
    
    
    Thanks and best regards,
    
    Antonio M:-)
    
      
T.R  Title  User  Personal Name  Date  Lines
659.1. by MSE1::PCOTE (Rebuilt NT: 163, Rebuilt VMS:1) Thu Feb 27 1997 10:42 (15 lines)
>    Is that true? And why is it recommended to use 2 network cards in
>    each of the 2 servers of the cluster?

     The network is the interconnect for the cluster 'heartbeat'. If 
     you rely only on one network connection and that happens to
     fail then anarchy will prevail as the cluster software attempts
     to decide who should take ownership of the shared disks. The 2nd
     (private) network alleviates this.

     Your point concerning a disk failover if the serving host incurs
     a network failure is well taken. I don't understand why the cluster
     software (or some value-added software) could not discern this
     situation. Just a simple matter of programming, right :-)

659.2. "Some points to discuss about net card failure..." by MDR01::MONJE (MCS Madrid) Thu Feb 27 1997 17:12 (57 lines)
    Mmmh...
    
    Two points to discuss on that:
    
    1/ What is the difference between a CPU failure (i.e. NT shutdown) and
    a network card failure? In both cases one server of the cluster is
    down as far as the other one can tell, so shouldn't the available
    server detect the failure and take ownership of all the disks
    available in the cluster?
    
    I think the behaviour is different in those two cases. Maybe it is
    because of the way the two servers exchange cluster protocol
    information between them?
    
    
    2/ Today I was able to simulate a network card failure in an
    NT cluster (thanks to my colleagues) with the Digital Clusters for
    NT beta 1.1 SW.
    I disconnected the network card of one of the cluster servers
    from the LAN!
    
    Around 1 min. 30 sec. after the failure, the server disconnected from
    the LAN reports that the other server is down. At 2 minutes after the
    failure, the second server (still connected to the LAN) identifies that
    the first server is down but doesn't take any disk failover action.
    
    I tried a manual failover but it didn't work the first time. After
    some operations I identified a workaround to get the manual failover
    when one server has a network card failure (I don't know if the problem
    is due to my using the beta v1.1 of the cluster software):
    
        a. At the server where the network card failed, do a manual
           failover operation from "manual failover" in the Cluster
           administrator tool. The operation gives an error message saying
           that network communication is not available and that it can't
           do the operation.
    
    
        b. At the second server, connected to the LAN, do an "online disk"
           operation from the "disk failover" menu option in the Cluster
           administrator tool. The server takes control of all available
           cluster disks.
    
    It is very important to do the failover command on the first server's
    cluster disk only once. If you do the failover operation more than
    once and then try an "online disk" from the second server, it doesn't
    work (error message: "disk is in use").
    
    Any ideas ??
    
    
    
    Thanks and best regards,
    
    Antonio M:-)
    
    
    
    
659.3. "network must be running for manual failover" by AEOENG::16.40.240.154::annecy::lehy Fri Feb 28 1997 03:07 (41 lines)
As far as I understand, the second network (private) is recommended for
the following reasons (Carl will correct me if I am wrong)

1- cluster heartbeats will fail over to the second network if communication
over the first one is broken. Here, first network does not mean enterprise
network, and second network does not mean private network. (first is the
first network listed in the network bindings, second ...)

2- The 1.0 documentation says that at least one network must be running for
a manual failover to succeed (i.e. the cluster members must be able to
exchange data over the net).

A- Two networks

If we have a broken network adapter on a server, the cluster software
will use the second network for its cluster communications. But it will not
trigger any failover (this has been discussed elsewhere in this notesfile).

Because of this, you may have to perform a manual failover (if the broken
adapter is the one connecting a cluster server to the enterprise network,
and if this server has online groups).
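
The reasoning in case A can be sketched in a few lines of Python. This is
only an illustration of the logic described above, not how the cluster
software is implemented; the `Server` class, its field names and both
functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical model of one cluster member (names are assumptions)."""
    name: str
    enterprise_nic_ok: bool   # adapter on the enterprise network works
    private_nic_ok: bool      # adapter on the private network works
    has_online_groups: bool   # this server currently serves failover groups


def heartbeat_path_exists(a: Server, b: Server) -> bool:
    # Cluster communication needs at least one network on which the
    # adapters of BOTH servers are working.
    return (a.enterprise_nic_ok and b.enterprise_nic_ok) or (
        a.private_nic_ok and b.private_nic_ok
    )


def needs_manual_failover(server: Server, peer: Server) -> bool:
    # No automatic failover is triggered by a broken adapter, so the
    # operator has to act when the enterprise adapter is broken on a
    # server that still has groups online. A manual failover also needs
    # a working network between the members.
    return (
        not server.enterprise_nic_ok
        and server.has_online_groups
        and heartbeat_path_exists(server, peer)
    )


a = Server("serverA", enterprise_nic_ok=False, private_nic_ok=True,
           has_online_groups=True)
b = Server("serverB", enterprise_nic_ok=True, private_nic_ok=True,
           has_online_groups=False)
print(needs_manual_failover(a, b))
```

Here serverA has lost its enterprise adapter but still reaches serverB
over the private network, which is the situation where a manual failover
is both needed and possible.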



B- One network
You only have one net, and this net is broken. Here I am not sure. My 
understanding is that no failover will happen. Each cluster member will
react the same way:

b1- try to allocate offline shared disks 
	(Q: by sending which type of SCSI command ?)

b2- I suspect/expect the allocation to fail because the offline disks are
mounted on the other server

Then try b1 and b2 again, forever? What is the role of the quorum disk?

Chris



659.4. "I'll try to explain a bit." by MSE1::MASTRANGELO Fri Feb 28 1997 09:34 (47 lines)
    
    I'll try to explain the case where there is only one network between
    the two servers.
    
    In this case, if serverA loses network connectivity (i.e. it gets
    disconnected from the network) to serverB, the following happens:
    
    A connection timeout will occur - the communication infrastructure will
    wait for a defined amount of time before declaring the network
    connection between the servers is down.  This interval of time is
    defined by
    CurrentControlSet
        \Services
            \ClusterFailoverManager
                \Parameters
                    \ConnectionTimeout
    
    The default for ConnectionTimeout is 30 seconds and it is initially not
    defined.  You have to create this key if you want to change it.
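
    As a rough sketch of the "not defined means default" rule above, here
    is how a tool could resolve the effective timeout. The function name
    and the dictionary standing in for the Parameters key are hypothetical.
    The HKEY_LOCAL_MACHINE\SYSTEM prefix and the assumption that the value
    is a whole number of seconds are not stated in this note.

```python
from typing import Mapping

# Assumed full location: the note only gives the path from
# CurrentControlSet down; the SYSTEM prefix (under HKEY_LOCAL_MACHINE)
# is added here as an assumption.
PARAMETERS_KEY = (
    r"SYSTEM\CurrentControlSet\Services\ClusterFailoverManager\Parameters"
)

# Default stated in the note: 30 seconds when the value is not defined.
DEFAULT_CONNECTION_TIMEOUT_S = 30


def connection_timeout(params: Mapping[str, int]) -> int:
    """Return the effective ConnectionTimeout in seconds.

    `params` stands in for the values found under PARAMETERS_KEY; on a
    real system they would be read from the registry instead.
    """
    return params.get("ConnectionTimeout", DEFAULT_CONNECTION_TIMEOUT_S)


print(connection_timeout({}))                         # value not created
print(connection_timeout({"ConnectionTimeout": 10}))  # value created by hand
```

    On Windows the dictionary could be filled with Python's winreg module;
    it is kept as a plain mapping here so the sketch runs anywhere.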
    
    Next, a stabilization delay will occur - this allows the infrastructure
    to attempt to communicate over a second adapter before declaring the
    other server is down.  This delay is enabled by default.  Enabling or
    disabling this delay is controlled by
    CurrentControlSet
        \Services
            \ClusterFailoverManager
                \Parameters
                    \DisableFailoverNetDelay
    This registry key is also initially not defined.
    
    After the connection timeout and failover net delay have expired, a
    disk arbitration will occur.  ServerA will issue a SCSI bus reset to
    clear the reservation on the disk and wait to see if the other server
    "expresses interest" in the disk.  If the other server does not,
    serverA will grab the disk again and thus a failover will not occur.
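
    The arbitration round described above can be modelled as a small
    function. This is a toy model only, assuming that the disk goes to the
    other server when it does express interest (the note only spells out
    the opposite case). The function and parameter names are hypothetical,
    and no real SCSI commands are issued.

```python
from typing import Optional


def arbitrate(owner: str, other: str, other_expresses_interest: bool) -> str:
    """Return the name of the server holding the disk after one round."""
    # Step 1: the current owner resets the bus, which clears the
    # reservation, so for a moment nobody holds the disk.
    reservation: Optional[str] = None

    # Step 2: the owner waits to see whether the other server claims it.
    if other_expresses_interest:
        # Assumption: an interested peer ends up with the disk (failover).
        reservation = other
    else:
        # Nobody else wants it: the owner grabs the disk again, so no
        # failover happens.
        reservation = owner
    return reservation


# serverB cannot see the reset as a request it should answer, so it stays
# silent and serverA keeps the disk.
print(arbitrate("serverA", "serverB", other_expresses_interest=False))
```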
    
    DiskArbitrationInterval is a registry parameter (also under
    ClusterFailoverManager\Parameters and initially not defined) which is
    used to control how frequently the owning server polls the disk to
    verify its ownership.
    
    It makes no sense to migrate resources to the other server since there
    is no way to guarantee that the network problem does not also affect
    the other server.
    
    Looking at the network card counters to determine if it is a problem
    with the card has been discussed but my guess is that this technique
    will not be implemented.