
Conference decwet::winnt-clusters

Title:WinNT-Clusters
Notice:Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator:DECWET::CAPPELLOF
Created:Thu Oct 19 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:863
Total number of notes:3478

726.0. "Dual HSZ40 and Battery-Fail?" by UTRTSC::VISSER () Thu Apr 03 1997 04:36

    
    Anybody with suggestions or solutions on this issue?
    
CONFIGURATION:

Windows NT V3.51
Digital Clusters for Windows NT 1.0-1134 SP2.

 Term+ +---------------------   -----   -----------------+ +Term
      \|                     \ /     \ /                 |/
  +----+-----+            +---+---+---+---+         +----+-----+
  |  KZPSA   |            |HSZ40-B|HSZ40-B|         |  KZPSA   |
  |          |            |       |       |         |          |
  |          |            |ID=4,5 |ID=4,5 |         |          |
  |          |            |Pref=4 |Pref=5 |         |          |
  |          |            |       |       |         |          |
  | AS1000A  |            |32mbWBC|32mbWBC|         | AS1000A  |
  +----------+            +-------+-------+         +----------+
                           2x RAID sets:
                             D400, D500
                          HSOF V3.0Z-2

The following has been observed:
HSZ40 ID#5 suffered a Battery-failure, which caused this HSZ40 to
do a full stop (goes down, as expected).
RAIDset D500 failed over to the other HSZ40 (ID#4) and this unit
was available AFTER this incident.
However, D400 became inaccessible and the associated application became
unavailable. From the HSZ point of view we do not have any indication
why D400 became unavailable.

WHAT is the expected behaviour in such a cluster environment if the
battery of one HSZ40 fails and the associated unit(s) fail over to the
other HSZ40?

Failover of UNITS from one to another HSZ40 SHOULD be transparent for
the WNT cluster-software, EXCEPT when the failover would take TOO long
or any other complications occur.

What recommendations are there for approaching this problem?
(The cluster is a HIGHLY CRITICAL production environment, where we cannot
play around....).

			Jan Visser


726.1. by BSS::F_BLANDO (Je suis grand, beau, et fort!) Thu Apr 03 1997 12:59, 1 line

What version of HSZDISK.SYS?

726.2. by MSE1::PCOTE (press one now for personal name) Thu Apr 03 1997 18:05, 48 lines

  A low battery condition will cause an hsz controller failover
  to occur (in dual-redundant configurations) with firmware V3.0
  or greater. Older versions of the HSOF did not support this.
  (see the extract below).

  HSZ controller failovers should be transparent to Windows NT 
  and to the NT cluster software. This is one of the functions of
  the hszdisk.sys filter driver. 

  What was the state of D400 from the HSZ's point of view? Was
  it operative? If so, could NT still see the disk partition(s)
  associated with the storageset? What was the status of the
  cluster failover group that was associated with D400?

  You should check the FMlog files for any anomalies around the time
  of the hsz failover. 
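
  A rough way to narrow the FMlog files to the minutes around the
  controller failover is sketched below. The FMlog line format is not
  shown in this thread, so the timestamp layout (YYYY-MM-DD HH:MM:SS at
  the start of each line), the function name lines_near and the sample
  lines are all assumptions for illustration; adjust the parsing to
  whatever the real logs contain.

```python
from datetime import datetime, timedelta

# ASSUMPTION: each log line starts with "YYYY-MM-DD HH:MM:SS "; the real
# FMlog layout is not shown in this thread and may differ.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
STAMP_LENGTH = 19


def lines_near(lines, event_time, window_minutes=5):
    """Return the lines stamped within +/- window_minutes of event_time."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - event_time) <= window:
            selected.append(line)
    return selected


# Hypothetical sample data, not taken from a real FMlog.
sample = [
    "1997-04-02 22:10:00 routine status check",
    "1997-04-02 23:58:30 disk group offline",
    "1997-04-03 00:01:10 failover attempt timed out",
    "1997-04-03 03:00:00 routine status check",
]
failover_at = datetime(1997, 4, 3, 0, 0, 0)
for line in lines_near(sample, failover_at):
    print(line)
```

  Running the same filter over the logs of both cluster nodes with the
  same event time makes it easier to compare what each node saw.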

  Post a note in the hsz40_product notes conference since this
  seems to be an hsz failover issue and not a cluster issue.

  Better yet, upgrade to the hsz50. The cache battery (ECB) design
  is MUCH improved. 







HSZ40 Array Controller Operating Software (HSOF), Version 3.0 

SPD 53.54.09 

DESCRIPTION 



Cache Battery Diagnostic 

Software Version 3.0 checks the condition of the optional write-back cache 
batteries every 24 hours. If a low capacity or failure is detected, write-back
cache data is flushed from cache and depending on the pre-defined cache policy,
selected RAIDsets and disk mirrorsets may become inoperative. In dual redundant
configurations, failover to the redundant controller will occur.
Refer to the HSZ40 Array Controller Operating System Software Release Notes,
EK-HSZ40-RN. K01, for further information. 
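
The behaviour described in this extract can be read as the decision
sketch below. It is only a paraphrase of the SPD text: the function
name, the flags and the returned action strings are invented for
illustration and are not HSOF interfaces, and the cache policy is
reduced to a single boolean because the extract does not spell out
the details.

```python
def battery_check_actions(battery_ok, dual_redundant, policy_disables_units):
    """List the actions the SPD extract describes for one battery check.

    ASSUMPTION: all names and action strings are illustrative only;
    this is not HSOF code.
    """
    if battery_ok:
        return []  # healthy battery: nothing to do until the next check
    actions = ["flush write-back cache"]
    if policy_disables_units:
        # Depends on the pre-defined cache policy (details not in the extract).
        actions.append("selected RAIDsets/mirrorsets may become inoperative")
    if dual_redundant:
        actions.append("fail over to redundant controller")
    return actions


# The configuration in this topic is dual-redundant.
for action in battery_check_actions(
    battery_ok=False, dual_redundant=True, policy_disables_units=False
):
    print(action)
```

For the dual-redundant case this lists the cache flush followed by the
failover to the redundant controller.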

726.3. "Details" by UTRTSC::VISSER () Wed Apr 30 1997 11:15, 25 lines
    
    From the HSZ point of view (during the problem), D400 was
    AVAILABLE.
    From the Cluster point of view, D400 was OFFLINE on BOTH systems
    and when trying to force online, an UNKNOWN ERROR code popped up.
    
    Looking through the FMlogs (which contain a lot of information),
    the cluster software tries to fail over the D400 shared disk to the
    other system, which also fails. Timeouts occur and the error threshold
    is exceeded.
    
    For a still unknown reason, during the start of the problems, BOTH
    system's FMlogs report "This system has lost connectivity to node
    <other_node>".
    They do have a TAPE on the shared SCSI-bus and are doing a backup of
    the Shared disk on the other system via the network.
    Although I did NOT find any statement that TAPES are NOT supported
    on the shared bus, I assume the release notes must be read as: it
    does NOT mention any tapes, so NOT supported on the shared bus.
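
    To check whether the "lost connectivity" reports on the two systems
    really coincide, the two FMlogs could be lined up roughly as in the
    sketch below. The message text is the one quoted above; everything
    else (the timestamp layout at the start of each line, the function
    names, the 30-second tolerance and the sample lines with nodes
    NODE_A and NODE_B) is an assumption made for illustration.

```python
from datetime import datetime, timedelta

MESSAGE = "This system has lost connectivity to node"
# ASSUMPTION: lines start with "YYYY-MM-DD HH:MM:SS "; the real FMlog
# layout may differ.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
STAMP_LENGTH = 19


def loss_times(lines):
    """Return the timestamps of all 'lost connectivity' lines."""
    times = []
    for line in lines:
        if MESSAGE not in line:
            continue
        try:
            times.append(datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT))
        except ValueError:
            pass  # message present but no parsable timestamp
    return times


def simultaneous_losses(log_a, log_b, tolerance_seconds=30):
    """Pair up loss reports from both logs that fall within the tolerance."""
    tolerance = timedelta(seconds=tolerance_seconds)
    return [
        (a, b)
        for a in loss_times(log_a)
        for b in loss_times(log_b)
        if abs(a - b) <= tolerance
    ]


# Hypothetical sample lines, not taken from the real logs.
log_node_a = [
    "1997-04-03 00:00:05 This system has lost connectivity to node NODE_B",
    "1997-04-03 02:00:00 This system has lost connectivity to node NODE_B",
]
log_node_b = [
    "1997-04-03 00:00:12 This system has lost connectivity to node NODE_A",
]
for a, b in simultaneous_losses(log_node_a, log_node_b):
    print(a, "<->", b)
```

    Pairs that fall inside the tolerance point at an event both nodes
    saw at about the same time, which fits a problem on something they
    share rather than on one node alone.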
    
    Hszdisk.sys is at V2.51 and we found it's NOT stable. V2.71 behaves
    much better.
    So that is at least ONE step which MUST be done.
    
    			Jan Visser.