
Conference decwet::winnt-clusters

Title:	WinNT-Clusters
Notice:	Info directories moved to DECWET::SHARE1$:[NT_CLSTR]
Moderator:	DECWET::CAPPELLOF

Created:	Thu Oct 19 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	863
Total number of notes:	3478

726.0. "Dual HSZ40 and Battery-Fail?" by UTRTSC::VISSER () Thu Apr 03 1997 03:36

    
    Anybody with suggestions or solutions on this issue?
    
CONFIGURATION:

Windows NT V3.51
Digital Clusters for Windows NT 1.0-1134 SP2.

 Term+ +---------------------   -----   -----------------+ +Term
      \|                     \ /     \ /                 |/
  +----+-----+            +---+---+---+---+         +----+-----+
  |  KZPSA   |            |HSZ40-B|HSZ40-B|         |  KZPSA   |
  |          |            |       |       |         |          |
  |          |            |ID=4,5 |ID=4,5 |         |          |
  |          |            |Pref=4 |Pref=5 |         |          |
  |          |            |       |       |         |          |
  | AS1000A  |            |32mbWBC|32mbWBC|         | AS1000A  |
  +----------+            +-------+-------+         +----------+
                           2x RAID sets:
                             D400, D500
                          HSOF V3.0Z-2

The following has been observed:
HSZ40 ID#5 suffered a Battery-failure, which caused this HSZ40 to
do a full stop (goes down, as expected).
RAIDset D500 failed over to the other HSZ40 (ID#4) and this unit
was available AFTER this incident.
However, D400 became inaccessible and the associated application became
unavailable. From the HSZ point of view we do not have any indication
why D400 became unavailable.

WHAT is the expected behaviour in such a cluster environment if the
battery of one HSZ40 fails and the associated unit(s) fail over to the
other HSZ40?

Failover of UNITS from one to another HSZ40 SHOULD be transparent for
the WNT cluster-software, EXCEPT when the failover would take TOO long
or any other complications occur.

What recommendations are there for approaching this problem?
(The cluster is a HIGHLY CRITICAL production environment, where we cannot
play around....).

			Jan Visser

T.R	Title	User	Personal Name	Date	Lines
726.1		BSS::F_BLANDO	Je suis grand, beau, et fort!	`Thu Apr 03 1997 11:59`	1
	What version of HSZDISK.SYS?
726.2		MSE1::PCOTE	press one now for personal name	`Thu Apr 03 1997 17:05`	48
	A low battery condition will cause an HSZ controller failover to occur (in dual-redundant configurations) with firmware V3.0 or greater. Older versions of HSOF did not support this (see the extract below).

	HSZ controller failovers should be transparent to Windows NT and to the NT cluster software. This is one of the functions of the hszdisk.sys filter driver.

	What was the state of D400 from the HSZ's point of view? Was it operative? If so, could NT still see the disk partition(s) associated with the storageset? What was the status of the cluster failover group that was associated with D400? You should check the FMlog files for any anomalies around the time of the HSZ failover.

	Post a note in the hsz40_product notes conference, since this seems to be an HSZ failover issue and not a cluster issue.

	Better yet, upgrade to the HSZ50. The cache battery (ECB) design is MUCH improved.

	HSZ40 Array Controller Operating Software (HSOF), Version 3.0
	SPD 53.54.09

	DESCRIPTION

	Cache Battery Diagnostic Software Version 3.0 checks the condition of the optional write-back cache batteries every 24 hours. If a low capacity or failure is detected, write-back cache data is flushed from cache and depending on the pre-defined cache policy, selected RAIDsets and disk mirrorsets may become inoperative. In dual redundant configurations, failover to the redundant controller will occur. Refer to the HSZ40 Array Controller Operating System Software Release Notes, EK-HSZ40-RN. K01, for further information.
726.3	Details	UTRTSC::VISSER		`Wed Apr 30 1997 10:15`	25
	From the HSZ point of view (during the problem), D400 was AVAILABLE. From the cluster point of view, D400 was OFFLINE on BOTH systems, and when trying to force it online, an UNKNOWN ERROR code popped up.

	Looking through the FMlogs (which contain a lot of information), the cluster software tries to fail over the D400 shared disk to the other system, which also fails. Timeouts occur and the error threshold is exceeded.

	For a still unknown reason, at the start of the problems, BOTH systems' FMlogs report "This system has lost connectivity to node <other_node>".

	They do have a TAPE on the shared SCSI bus and are doing a backup of the shared disk on the other system via the network. Although I did NOT find any statement that TAPES are NOT supported on the shared bus, I assume the release notes must be read as: they do NOT mention any tapes, so tapes are NOT supported on the shared bus.

	Hszdisk.sys is at V2.51 and we found it is NOT stable. V2.71 behaves much better. So that is at least ONE step which MUST be done.

	Jan Visser.
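
	Since the FMlogs hold a lot of information, one way to narrow them down to the minutes around the HSZ failover might look like the sketch below. It is only an illustration: the FMlog line layout is not described anywhere in this thread, so the timestamp format (TS_FORMAT, TS_LEN) and the helper name lines_near are assumptions that would need adjusting to the real files. Only the "lost connectivity" message text comes from the logs quoted above.

```python
from datetime import datetime, timedelta

# ASSUMPTION: every log line starts with a timestamp such as
# "1997-04-03 03:10:05". The real FMlog layout is not given in this
# thread, so TS_FORMAT and TS_LEN are placeholders to adapt.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LEN = 19


def lines_near(lines, event_time, window_minutes=10, keywords=()):
    """Return log lines stamped within window_minutes of event_time.

    If keywords is non-empty, keep only lines containing at least one
    keyword (case-insensitive). Lines whose first TS_LEN characters do
    not parse as a timestamp are skipped.
    """
    window = timedelta(minutes=window_minutes)
    wanted = [k.lower() for k in keywords]
    hits = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TS_LEN], TS_FORMAT)
        except ValueError:
            continue  # no parsable timestamp on this line
        if abs(stamp - event_time) > window:
            continue  # outside the time window of interest
        if wanted and not any(k in line.lower() for k in wanted):
            continue  # inside the window but no keyword match
        hits.append(line.rstrip("\n"))
    return hits
```

	Running this over the FMlog of each node with the same event_time, first without keywords and then with keywords=("lost connectivity",), would show whether the connectivity loss was logged before or after the first D400 timeouts on each system.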