
Conference ssdevo::hsz40_product

Title:HSZ40 Product Conference
Moderator:SSDEVO::EDMONDS
Created:Mon Apr 11 1994
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:902
Total number of notes:3319

831.0. "Hsz40 behaviour when battery fails" by ROMTSS::MATTACCHIONE () Wed Apr 02 1997 11:29

Hi,

I have several basic questions about the HSZ40.

1) In redundancy mode, how can I see which controller
   actually has access to the disk?

2) Using cache policy B and firmware version 2.7,
   when a battery fails, why are some RAID 5 and mirror disks not switched
   to the good controller?

3) In this condition I have to shut down both controllers to replace the
   defective batteries.
   (This is the procedure that is given to replace the batteries.)

   Then:
   Why should I (the customer) use a redundant configuration if I
   have to shut down both controllers?

   In an emergency situation I shut down the defective controller and
   disconnected the trilink adapter, and the system (DEC UNIX 3.2g)
   crashed. So in effect I MUST shut down both controllers!

As everyone can understand, this behaviour is very critical for our
big customers.


Thanks for any suggestions and comments.

Gabriele 
                                                                    
831.1. "some thoughts" by UTOPIE::OETTL (hide bug until worst time) Wed Apr 02 1997 14:53

> 1) In redundancy mode, how can I see which controller
>    actually has access to the disk?

HSZ> sho unit full

Unit XXX .... ONLINE to THIS CONTROLLER (OTHER CONTROLLER)
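
If you capture that output to a file, a few lines of script can sort the units by owning controller. This is only a rough sketch: it assumes the abbreviated one-line layout shown above (the real "show unit full" display is more verbose and may put the unit name on a different line), and the unit names D100 and D200 are made up for illustration.

```python
import re

# Hypothetical capture, in the abbreviated one-line form used in this note.
# The unit names are invented; real controller output may be laid out differently.
SAMPLE = """\
Unit D100 .... ONLINE to THIS CONTROLLER
Unit D200 .... ONLINE to OTHER CONTROLLER
"""

# Match the ownership phrase regardless of letter case.
PATTERN = re.compile(r"ONLINE to (THIS|OTHER) CONTROLLER", re.IGNORECASE)


def units_by_controller(text):
    """Group unit names by the controller they are reported online to."""
    owners = {"THIS": [], "OTHER": []}
    for line in text.splitlines():
        match = PATTERN.search(line)
        if not match:
            continue
        fields = line.split()
        # Assumption: the second whitespace-separated field is the unit name.
        unit = fields[1] if len(fields) > 1 else line.strip()
        owners[match.group(1).upper()].append(unit)
    return owners


if __name__ == "__main__":
    for controller, units in units_by_controller(SAMPLE).items():
        print(controller, units)
```

Run it against a capture taken from each controller's console if you want to cross-check that every unit is claimed by exactly one side.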

> 2) Using cache policy B and firmware version 2.7,
>    when a battery fails, why are some RAID 5 and mirror disks not switched
>    to the good controller?

Bug in V2.7Z. Corrected in V3.0Z.
With CACHE_POLICY=B, the units will stay online if the battery is GOOD or LOW.
RAIDsets and mirrorsets need battery-backed cache because of the data integrity
problems that may occur if power fails while a write is in progress.

> 3) In this condition I have to shut down both controllers to replace the
>    defective batteries.
>    (This is the procedure that is given to replace the batteries.)

No. Use "set this preferred_id=(all_the_id's_you_have_configured)" on the HSZ
with the good batteries or issue a shutdown to the HSZ with the failed
batteries.
I think the shutdown is the best method, because you don't have to reconfigure
your controllers after the swap.

>   Then:
>   Why should I (the customer) use a redundant configuration if I
>   have to shut down both controllers?

Where is this stated?

>   In an emergency situation I shut down the defective controller and
>   disconnected the trilink adapter, and the system (DEC UNIX 3.2g)
>   crashed. So in effect I MUST shut down both controllers!

I think the OS will most likely panic (an AdvFS domain panic only, if you're on
V3.2g and lucky :-) ) when your batteries fail.
To swap the batteries, I would use C_SWAP.

> As everyone can understand, this behaviour is very critical for our
> big customers.


> Thanks for any suggestions and comments.

Use V3.0-3 immediately.


> Gabriele 
                                                                    
Ötzi

831.2. "Failover time?" by UTRTSC::VISSER () Thu Apr 03 1997 02:52
    
    Recently we had an incident with a dual-redundant HSZ40-B, running
    V30Z-2 in a Windows NT cluster environment.
    One HSZ40 failed due to a Bad Battery and, as expected with HSOF V30Z,
    it went down ("//"-LED solid on).
    The other HSZ40 took over, BUT the RAID5-set on that other controller
    became unavailable for the application for some unknown reason.
    Also the WNT Cluster Manager got confused somehow because of this,
    trying to get this UNIT back on any of the two Alpha-systems.
    
    Question:
        If a battery fails on ONE HSZ40, HOW LONG will any UNIT be
        unavailable due to housekeeping being done by the controller?
        If this is a SHORT time (seconds), then a unit failover from one
        HSZ to another SHOULD be TRANSPARENT to the Windows NT cluster
        software. If it takes TOO long, timeouts will occur within the
        WNT Cluster Manager and it will start working to get the unit back.
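
    One way to put a number on this, sketched below, is to poll the unit
    from a host during a test failover and log whether each probe succeeded.
    The log format (timestamp in seconds, reachable yes/no) and the function
    name are my own assumptions, not anything the HSZ or WNT provides. The
    longest gap can then be compared with whatever timeout the cluster
    software is configured to use.

```python
def longest_outage(samples):
    """Return the longest unreachable window, in seconds.

    samples: list of (timestamp_seconds, reachable) tuples sorted by time.
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end of the log is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            # First failed probe of a new outage window.
            outage_start = timestamp
        elif reachable and outage_start is not None:
            # Unit is back: close the window and keep the maximum.
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


if __name__ == "__main__":
    # Hypothetical probe log: unit unreachable from t=1 until t=4.
    probes = [(0, True), (1, False), (2, False), (4, True), (5, True)]
    print(longest_outage(probes))
```

    The result is only as fine-grained as the polling interval, so probe
    more often than the timeout you are worried about.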
    
    		Jan Visser.
    
831.3. "Big doubt !" by NETRIX::"[email protected]" () Thu Apr 03 1997 13:29

Thanks for your reply,
but a big doubt remains, as it does for the colleagues in topic 1202.

To replace the battery backup, do we have to follow the instructions in
document AV-QPDXA-TE, which say that with mirrorsets and RAID 5 sets in
redundancy mode we have to shut down both controllers before replacing the
bad batteries, or is a C_SWAP / controller shutdown procedure certified for
replacing them?

This is too important not to have a definite answer.

Thanks Gabriele
  
[Posted by WWW Notes gateway]