Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

399.0. "Shadowing is not removing a "bad" member of a shadowset..." by BACHUS::CAERELS () Fri Mar 28 1997 10:15

Hello,

My customer (Volvo) lost 250 cars due to Volume Shadowing not doing what it should do: if a member
of a shadowset has problems, remove it from the shadowset and continue working on 1 member.

Scenario:

		        
	+-------+-O-O-O-O			+-------+-O-O-O-O
	| H | H |				| H | H |
	| S | S |				| S | S |
	| J | J |				| J | J |
	| 4 | 4 |				| 4 | 4 |
	| 0 | 0 |				| 0 | 0 |
	|   |   |				|   |   |
	| A | B |				| C | D |
	+-------+				+-------+
		\				/
		 \-------------O---------------/
		 /			       \
		/			        \
	+-------+				+-------+
	|  VAX  |				|  VAX  |
	+-------+				+-------+

	Disks are shadowed by VMS between HSJA/B and HSJC/D.
	One disk on HSJA generated errors, resulting in a mount verification on the shadowset. This mount
	verification never completed although only one member of the shadowset was "bad". The customer didn't
	think of powering off the HSJ and rebooted one node holding its pagefiles on that disk, resulting in a
	SHADDETINCON bugcheck on all other cluster members.

	Question: Why did shadowing not remove this physical member from its shadowset?

	I've included an SDA output of one of the nodes that had this SHADDETINCON crash. You can see the
	dsa1110 device in mount verification, with one member having an MVIRP (special mount verification IRP),
	while the other member seems to be OK...

thank you for any input...

Rik Caerels

                                                                                
DSA1110                                 HSX00             UCB address:  9F21D9C0

Device status:   00064810 online,valid,mntverip,lcl_valid,supmvmsg
Characteristics: 1C4D4008 dir,fod,shr,avl,mnt,elg,idv,odv,rnd
                 00082021 clu,mscp,loc,vrt

Owner UIC [000001,000004]   Operation count        852   ORB address    9EB00380
      PID        00000000   Error count              0   DDB address    9DC00780
Alloc. lock ID   080002D7   Reference count          1   DDT address    9FCCC018
Alloc. class            0   Online count             1   VCB address    9F1E3500
Class/Type          01/8D   BOFF                  0000   CRB address    9DC00A00
Def. buf. size        512   Byte count            7E00   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            A6DE6468   CDDB address   9DC44040
DEVDEPND2        00000000   DEVSTS                010C   SHAD address   9F389440
FLCK index             34   RWAITCNT              0001   I/O wait queue    empty
DLCK address     00000000                                                       

Shadow Virtual Unit DEVSTS status:   010C nocnvrt,du_shmv_strtd,mscp_mntverip

		----- Shadow Descriptor Block (SHAD) 9F389440 -----

Virtual Unit status:              0000 

Members                0    Act user IRPs          1    VU UCB          9F21D9C0
Devices                0    SCB LBN         00000000    MMB             050003EE
Fcpy Targets         192    Generation Num  9F3897C8    Master FL       30313131
Mcpy Targets           0                    9FCD5BEC    Restart FL      564E4944
Last Read Index        0    Virtual Unit Id 00000001                            
Master Index           0                    050003EB                            

	    ----- SHAD Device summary for DSA1110  -----


		--- Primary Class Driver Data Block (CDDB) 9DC44040 ---

Status:              0000 
Status2:             0000 
Controller Flags:    00D0 cf_this,cf_misc,cf_attn

Allocation class       0    CDRP Queue         empty    DDB address     9DC00780
System ID       00000000    Restart Queue      empty    CRB address     9DC00A00
                    0000    DAP Count              0    CDDB link       9DC4F7C0
Contrl. ID      00000000    Contr. timeout         0    PDT address     00000000
                00000000    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC44180
MSCP Cmd status 00000000                                                        

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F1E3500 ---

Volume: PR110DISK        Lock name: PR110DISK   
Status:  A0 extfid,system
Status2: 14 mountver,nohighwater
Status3: 00000000 
Shadow status: 01 shadmast

Mount count            1    Rel. volume            0    AQB address     9DCE6E40
Transactions           1    Max. files        410947    RVT address     9F21D9C0
Free blocks       856632    Rsvd. files           10    FCB queue       9F21DB00

OpenVMS (TM) VAX V6.1     -- System Dump Analysis				28-MAR-1997 15:46:33.93			Page 2
I/O data structures



Window size            7    Cluster size           4    Cache blk.      9F368D40
Vol. lock ID    050003F7    Def. extend sz.        5    Shadow mem. FL  9F1E3600
Shadow lock ID  080003FC    Record size            0    Shadow mem. BL  9F1E3700

	    --- Shadow set DSA1110 member summary ---

Volume: PR110DISK   

Physical unit     Primary path      Secondary path    Member status
-------------     ------------      --------------    -------------
$1$DUA110         HSJ10             HSJ11             Merge copy in progress    
$1$DUA210         HSJ20             HSJ21             Merge copy in progress    

		    --- ACP Queue Block (AQB) 9DCE6E40 ---

ACP requests are serviced by the eXtended Qio Processor (XQP)

Status: 14 defsys,xqioproc

Mount count            4    ACP type           f11v2    Linkage         9DC44000
                            ACP class            157    Request queue   00000000

	*** ACP request queue is empty ***
NVJ$DUA110 (HSJ11$DUA110)               HSX00             UCB address:  9DC52B40

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        639   ORB address    9DC5B500
      PID        00000000   Error count              0   DDB address    9BFCE7F0
Alloc. lock ID   050003BF   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3600
Class/Type          01/8D   BOFF                  0080   CRB address    9DC42F80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE549698   CDDB address   9DC4F7C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9DCE8BC0
DLCK address     00000000                                2P_DDB address 9DCE7A80
                                                         I/O wait queue    empty
				I/O request queue
				-----------------

STATE CDRP/IRP    PID   MODE CHAN  FUNC    WCB     EFN    AST     IOSB    STATUS

 C    9EDEF820  9FCCD90A  K  0000  000C  9EDEF640   0  9EDEF700  00000000  2102
	readpblk func,physio,mvirp 


OpenVMS (TM) VAX V6.1     -- System Dump Analysis				28-MAR-1997 15:46:33.93			Page 3
I/O data structures




		--- Volume Control Block (VCB) 9F1E3600 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC52B40    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E3700
RVT address     9F21D9C0                              Shadow member BL  9F1E359C
NVJ$DUA210 (HSJ21$DUA210)               HSX00             UCB address:  9DC5CD80

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        634   ORB address    9DCE0D80
      PID        00000000   Error count              0   DDB address    9DCE1580
Alloc. lock ID   050003CB   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3700
Class/Type          01/8D   BOFF                  0080   CRB address    9DCE2C80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE54969C   CDDB address   9DCE37C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9F355800
DLCK address     00000000                                2P_DDB address 9DDB3F80
                                                         I/O wait queue    empty

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F1E3700 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC5CD80    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E359C
RVT address     9F21D9C0                              Shadow member BL  9F1E3600
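
The "Device status" longwords in the listing above can be cross-checked against the flag names SDA prints next to them. The following is a minimal sketch in Python; the bit values are inferred purely from the three status lines in this dump (00064810 and 00020810), not taken from the VMS sources, and the names `UCB_STS_BITS`, `decode_status`, and `in_mount_verification` are hypothetical helpers.

```python
# Bit values inferred from the SDA output in this note (an assumption,
# not copied from the VMS definitions).
UCB_STS_BITS = {
    0x00000010: "online",
    0x00000800: "valid",
    0x00004000: "mntverip",
    0x00020000: "lcl_valid",
    0x00040000: "supmvmsg",
}


def decode_status(value):
    """Return (flag names in ascending bit order, bits this table cannot name)."""
    names = [name for bit, name in sorted(UCB_STS_BITS.items()) if value & bit]
    known = sum(bit for bit in UCB_STS_BITS if value & bit)
    return names, value & ~known


def in_mount_verification(value):
    """True when the mntverip bit (as inferred above) is set."""
    return bool(value & 0x00004000)


# Virtual unit DSA1110 versus shadow set member $1$DUA110.
print(decode_status(0x00064810))
print(decode_status(0x00020810))
```

Under these assumptions, only the virtual unit DSA1110 carries the mount verification flag; both physical members report plain online,valid,lcl_valid.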
    
T.R  Title  User  Personal Name  Date  Lines
399.1. "" by VMSSG::FRIEDRICHS (Ask me about Young Eagles) Fri Mar 28 1997 12:27 (50 lines)
   (.0 reformatted)
    
                       <<< Note 399.0 by BACHUS::CAERELS >>>
        -< Shadowing is not removing a "bad" member of a shadowset... >-


Hello,

My customer (Volvo) lost 250 cars due to Volume Shadowing not doing what it
should do: if a member of a shadowset has problems, remove it from the shadowset
and continue working on 1 member.

Scenario:

		        
	+-------+-O-O-O-O			+-------+-O-O-O-O
	| H | H |				| H | H |
	| S | S |				| S | S |
	| J | J |				| J | J |
	| 4 | 4 |				| 4 | 4 |
	| 0 | 0 |				| 0 | 0 |
	|   |   |				|   |   |
	| A | B |				| C | D |
	+-------+				+-------+
		\				/
		 \-------------O---------------/
		 /			       \
		/			        \
	+-------+				+-------+
	|  VAX  |				|  VAX  |
	+-------+				+-------+

	Disks are shadowed by VMS between HSJA/B and HSJC/D.
One disk on HSJA generated errors, resulting in a mount verification on the
shadowset. This mount verification never completed although only one member of
the shadowset was "bad". The customer didn't think of powering off the HSJ and
rebooted one node holding its pagefiles on that disk, resulting in a
SHADDETINCON bugcheck on all other cluster members.

Question: Why did shadowing not remove this physical member from its shadowset?

I've included an SDA output of one of the nodes that had this SHADDETINCON crash.
You can see the dsa1110 device in mount verification, with one member having an
MVIRP (special mount verification IRP), while the other member seems to be OK...

thank you for any input...

Rik Caerels
    
399.2. "IPMT if you really want it looked at!" by VMSSG::FRIEDRICHS (Ask me about Young Eagles) Fri Mar 28 1997 12:37 (7 lines)
    What version of VMS are you running?  patches?
    
    What is SHADOW_MBR_TMO?  What is MVTIMOUT?
    
    Cheers,
    jeff
    
399.3. "Can you post SHO DEV for DUA110 and DUA210 from Crash?" by CSC32::M_DIFABIO (MOVL #OPINION,EXE$GL_BLAKHOLE) Fri Mar 28 1997 19:32 (11 lines)
    I wish you had done a SHO DEV DUA for each of the physical disks. Then
    we would have gotten the Primary CDDB info for each. It's a mount verify, so
    we're likely doing PACKACKs to a controller. I won't speculate without
    looking at the dump, but if you have the dump could you post the output
    of SHO DEV DUA110 and DUA210. And do you have VAXDRIV04_070 installed?
    
      For shadowing to expel a member, we would need to get some response/
    end message from the HSJ during the PACKACK. We check whether it's time to
    expel a member when we get a response from a command. 
    
                      Mark d. 
399.4. "Sounds normal to me" by RICKS::OPP () Mon Mar 31 1997 21:34 (13 lines)
    	In my experience with fault-tolerant VAX systems, the behavior
    you described for Host-Based Volume Shadowing is *normal*, assuming
    you've used default SYSGEN parameters.  This is because VMS favors
    data integrity over response time with respect to disk
    errors, such as bad block replacements, etc.  You need to determine
    how aggressive your customer needs/wants to be in the other direction.
    For example, the SYSGEN shadow member time-out parameter can be
    significantly reduced from the default (formerly 20 seconds).  
    
    Regards,
    
    Greg
    
399.5. "Still need a response from the I/O" by CSC32::M_DIFABIO (MOVL #OPINION,EXE$GL_BLAKHOLE) Tue Apr 01 1997 13:34 (4 lines)
    ...But if there is no response from the I/O, a shorter SHADOW_MBR_TMO
    buys you nothing. 
    
                Mark d.
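
The point made in .3 and .5 can be illustrated with a toy model in Python. This is only a sketch of one reading of those replies, not SHDRIVER logic: the `Response` record, the `surviving_members` function, and the timeout values used below are all hypothetical. The only property it tries to capture is that the expel decision is reached from response handling, so a member that never answers is never evaluated, whatever the timeout.

```python
from dataclasses import dataclass


@dataclass
class Response:
    member: str   # physical unit the response belongs to
    time: float   # seconds since mount verification began
    ok: bool      # False for an error end message


def surviving_members(members, responses, shadow_mbr_tmo):
    """Toy model: the timeout is examined only when a response arrives."""
    active = list(members)
    for r in sorted(responses, key=lambda r: r.time):
        # The expel check lives here, inside response handling.
        if not r.ok and r.time >= shadow_mbr_tmo and r.member in active:
            active.remove(r.member)
    return active


members = ["$1$DUA110", "$1$DUA210"]

# A failing response that arrives after the timeout leads to an expel.
print(surviving_members(members, [Response("$1$DUA110", 130.0, False)], 120))

# No response at all: the result is the same for a long or a short timeout.
print(surviving_members(members, [], 120))
print(surviving_members(members, [], 5))
```

In this model, lowering the timeout changes nothing when the response list is empty, which is the situation .5 describes.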
399.6. "VAXDRIV04_070 - that is most probably the cure..." by BACHUS::CAERELS () Wed Apr 02 1997 02:11 (351 lines)
Thanks for all replies, sorry for responding so late - we had Easter holidays.

re .2

The customer runs version 6.1 with several patches, including VAXSHAD09_061 and
VAXDRIV02_070.
Shadow sysgen parameters and MVTIMOUT are at default values.

re .3
I've included an SDA output of your requested info.

>>>   looking at the dump, but if you have the dump could you post the output
>>>   of SHO DEV DUA110 and DUA210. And do you have VAXDRIV04_070 installed?

Great input. At the time of the 6.1 upgrade we installed VAXDRIV02_070. Due to
time constraints (reboot = downtime, and this is a 24x24 production system), no major
revision of patches has been done since; this is certainly something we have to
review! Anyway, in the release notes of VAXDRIV04_070 I saw:

      o  A problem exists with HSJ/HSD30,40 and  50  controllers  where,
         after an event that initiates Mountverfication, a Pack-Ack will
         fail to complete.  The controller will report that it is making
         progress  on  the  command, but will never finish.  This causes
         all IO to the affected devices to be hung.

which is exactly the problem we had. Can you confirm this?

Thanks again to Mark D. for this swift response,

regards,

Rik Caerels


================================================================================

HSJ10$DUA110 (HSJ11$DUA110)             HSX00             UCB address:  9DC52B40

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        639   ORB address    9DC5B500
      PID        00000000   Error count              0   DDB address    9BFCE7F0
Alloc. lock ID   050003BF   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3600
Class/Type          01/8D   BOFF                  0080   CRB address    9DC42F80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE549698   CDDB address   9DC4F7C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9DCE8BC0
DLCK address     00000000                                2P_DDB address 9DCE7A80
                                                         I/O wait queue    empty
Shadow Member Device DEVSTS status:   4004 nocnvrt,mscp_ignsrv

		----- Shadow Descriptor Block (SHAD) 9F389440 -----

Virtual Unit status:              0000 

Members                0    Act user IRPs          1    VU UCB          9F21D9C0
Devices                0    SCB LBN         00000000    MMB             050003EE
Fcpy Targets         192    Generation Num  9F3897C8    Master FL       30313131
Mcpy Targets           0                    9FCD5BEC    Restart FL      564E4944
Last Read Index        0    Virtual Unit Id 00000001                            
Master Index           0                    050003EB                            

	    ----- SHAD Device summary for $1$DUA110  -----


		--- Primary Class Driver Data Block (CDDB) 9DC4F7C0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue      9EDF86A0    DDB address     9BFCE7F0
System ID       10073520    Restart Queue      empty    CRB address     9DC42F80
                    4200    DAP Count              7    CDDB link       9DCDD0C0
Contrl. ID      54411859    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           1    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC52B40
MSCP Cmd status 00000000                                                        

OpenVMS (TM) VAX V6.1     -- System Dump Analysis				 1-APR-1997 10:19:38.68			Page 2
I/O data structures




		--- Secondary Class Driver Data Block (CDDB) 9DCE8BC0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue         empty    DDB address     9DCE7A80
System ID       10083522    Restart Queue      empty    CRB address     9DCE7C80
                    4200    DAP Count              8    CDDB link       9F355800
Contrl. ID      53210933    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DCE9340
MSCP Cmd status 00000000                                                        
				I/O request queue
				-----------------

STATE CDRP/IRP    PID   MODE CHAN  FUNC    WCB     EFN    AST     IOSB    STATUS

 C    9EDEF820  9FCCD90A  K  0000  000C  9EDEF640   0  9EDEF700  00000000  2102
	readpblk func,physio,mvirp 


		--- Volume Control Block (VCB) 9F1E3600 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC52B40    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E3700
RVT address     9F21D9C0                              Shadow member BL  9F1E359C

(HSJ11$DUA110) HSJ10$DUA110             HSX00             UCB address:  9DC52B40

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        639   ORB address    9DC5B500
      PID        00000000   Error count              0   DDB address    9BFCE7F0
Alloc. lock ID   050003BF   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3600
Class/Type          01/8D   BOFF                  0080   CRB address    9DC42F80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE549698   CDDB address   9DC4F7C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9DCE8BC0
DLCK address     00000000                                2P_DDB address 9DCE7A80
                                                         I/O wait queue    empty
Shadow Member Device DEVSTS status:   4004 nocnvrt,mscp_ignsrv

		----- Shadow Descriptor Block (SHAD) 9F389440 -----

Virtual Unit status:              0000 

Members                0    Act user IRPs          1    VU UCB          9F21D9C0
Devices                0    SCB LBN         00000000    MMB             050003EE
Fcpy Targets         192    Generation Num  9F3897C8    Master FL       30313131
Mcpy Targets           0                    9FCD5BEC    Restart FL      564E4944
Last Read Index        0    Virtual Unit Id 00000001                            
Master Index           0                    050003EB                            

	    ----- SHAD Device summary for $1$DUA110  -----


		--- Primary Class Driver Data Block (CDDB) 9DC4F7C0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue      9EDF86A0    DDB address     9BFCE7F0
System ID       10073520    Restart Queue      empty    CRB address     9DC42F80
                    4200    DAP Count              7    CDDB link       9DCDD0C0
Contrl. ID      54411859    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           1    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC52B40
MSCP Cmd status 00000000                                                        

		--- Secondary Class Driver Data Block (CDDB) 9DCE8BC0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue         empty    DDB address     9DCE7A80
System ID       10083522    Restart Queue      empty    CRB address     9DCE7C80
                    4200    DAP Count              8    CDDB link       9F355800
Contrl. ID      53210933    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DCE9340
MSCP Cmd status 00000000                                                        
				I/O request queue
				-----------------

STATE CDRP/IRP    PID   MODE CHAN  FUNC    WCB     EFN    AST     IOSB    STATUS

 C    9EDEF820  9FCCD90A  K  0000  000C  9EDEF640   0  9EDEF700  00000000  2102
	readpblk func,physio,mvirp 


		--- Volume Control Block (VCB) 9F1E3600 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC52B40    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E3700
RVT address     9F21D9C0                              Shadow member BL  9F1E359C
HSJ20$DUA210 (HSJ21$DUA210)             HSX00             UCB address:  9DC5CD80

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        634   ORB address    9DCE0D80
      PID        00000000   Error count              0   DDB address    9DCE1580
Alloc. lock ID   050003CB   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3700
Class/Type          01/8D   BOFF                  0080   CRB address    9DCE2C80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE54969C   CDDB address   9DCE37C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9F355800
DLCK address     00000000                                2P_DDB address 9DDB3F80
                                                         I/O wait queue    empty
Shadow Member Device DEVSTS status:   4004 nocnvrt,mscp_ignsrv

		----- Shadow Descriptor Block (SHAD) 9F389440 -----

Virtual Unit status:              0000 

Members                0    Act user IRPs          1    VU UCB          9F21D9C0
Devices                0    SCB LBN         00000000    MMB             050003EE
Fcpy Targets         192    Generation Num  9F3897C8    Master FL       30313131
Mcpy Targets           0                    9FCD5BEC    Restart FL      564E4944
Last Read Index        0    Virtual Unit Id 00000001                            
Master Index           0                    050003EB                            

	    ----- SHAD Device summary for $1$DUA210  -----


		--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue      9EDFA1A0    DDB address     9DCE1580
System ID       10093920    Restart Queue      empty    CRB address     9DCE2C80
                    4200    DAP Count              8    CDDB link       9DCE53C0
Contrl. ID      54600015    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC5CD80
MSCP Cmd status FFFFFFFF                                                        

		--- Secondary Class Driver Data Block (CDDB) 9F355800 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue         empty    DDB address     9DDB3F80
System ID       100A3922    Restart Queue      empty    CRB address     9DDB4000
                    4200    DAP Count              8    CDDB link       00000000
Contrl. ID      54411986    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9F209C40
MSCP Cmd status 00000000                                                        

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F1E3700 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC5CD80    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E359C
RVT address     9F21D9C0                              Shadow member BL  9F1E3600
(HSJ21$DUA210) HSJ20$DUA210             HSX00             UCB address:  9DC5CD80

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count        634   ORB address    9DCE0D80
      PID        00000000   Error count              0   DDB address    9DCE1580
Alloc. lock ID   050003CB   Reference count          1   DDT address    9FD22C98
Alloc. class            1   Online count             1   VCB address    9F1E3700
Class/Type          01/8D   BOFF                  0080   CRB address    9DCE2C80
Def. buf. size        512   Byte count            0200   PDT address    9DC44990
DEVDEPEND        0BCE1055   SVAPTE            CE54969C   CDDB address   9DCE37C0
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F389440
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9F355800
DLCK address     00000000                                2P_DDB address 9DDB3F80
                                                         I/O wait queue    empty
Shadow Member Device DEVSTS status:   4004 nocnvrt,mscp_ignsrv

		----- Shadow Descriptor Block (SHAD) 9F389440 -----

Virtual Unit status:              0000 

Members                0    Act user IRPs          1    VU UCB          9F21D9C0
Devices                0    SCB LBN         00000000    MMB             050003EE
Fcpy Targets         192    Generation Num  9F3897C8    Master FL       30313131
Mcpy Targets           0                    9FCD5BEC    Restart FL      564E4944
Last Read Index        0    Virtual Unit Id 00000001                            
Master Index           0                    050003EB                            

	    ----- SHAD Device summary for $1$DUA210  -----


		--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue      9EDFA1A0    DDB address     9DCE1580
System ID       10093920    Restart Queue      empty    CRB address     9DCE2C80
                    4200    DAP Count              8    CDDB link       9DCE53C0
Contrl. ID      54600015    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC5CD80
MSCP Cmd status FFFFFFFF                                                        

		--- Secondary Class Driver Data Block (CDDB) 9F355800 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue         empty    DDB address     9DDB3F80
System ID       100A3922    Restart Queue      empty    CRB address     9DDB4000
                    4200    DAP Count              8    CDDB link       00000000
Contrl. ID      54411986    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9F209C40
MSCP Cmd status 00000000                                                        

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F1E3700 ---

Volume: PR110DISK     (Member of shadow set DSA1110)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9DC5CD80    Virtual unit UCB  9F21D9C0
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F1E3500
AQB address     9DCE6E40                  00000000    Shadow member FL  9F1E359C
RVT address     9F21D9C0                              Shadow member BL  9F1E3600

================================================================================
    
399.7. "Follow-up questions" by GREGOR::OPP () Wed Apr 02 1997 08:30 (10 lines)
    RE: .5
    
    	And if the I/O device is not responding, what's the design
    rationale for keeping it in the shadow set (unless it's the last
    surviving member)?  If the I/O device is a storage controller and
    it's not responding, why would HBVS not attempt to fail-over?  
    Thanks,
    
    Greg
    
399.8. "Doesn't look like DUDRIVER is the fix" by VMSSPT::JENKINS (Kevin M Jenkins VMS Support Engineering) Wed Apr 02 1997 10:06 (19 lines)
        RE .0

    	I don't believe that you are seeing the Pack-Ack hang problem.
    The outstanding IO is a read. Also, your version of SDA appears to
    be incompatible with your version of SHDRIVER. It is not interpreting
    the structures and bits properly. There may/should be a newer version
    available somewhere. Perhaps in one of the BOOT or DOSD kits?

    RE .7

    	The SHDRIVER code thread responsible for mount verification is
    "stalled" waiting for one of its IOs to complete. If the IO never
    completes, the thread is never resumed and the mount verification
    hangs. When/if DUDRIVER resets the controller, all SHDRIVER IOs are
    returned with an error status, the mount verification thread is
    resumed, and SHDRIVER can decide what to do about the shadow set's
    membership.
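
    The behaviour described above can be pictured with a toy model. This is
    only a sketch in Python, not SHDRIVER code: the class, its method names
    and the "remove the member on error" outcome are all hypothetical and
    exist purely to show that nothing except an IO completion (or a
    controller reset that forces one) moves the thread forward.

```python
class MountVerificationModel:
    """Toy model only; names are hypothetical, not SHDRIVER internals."""

    def __init__(self):
        self.state = "stalled"       # thread is waiting on its outstanding IO
        self.io_status = None        # None means the IO has not completed yet
        self.removed_member = False  # illustrative outcome, assumed for the sketch

    def io_completed(self, status):
        """An IO completion, good or bad, is the only event that resumes the thread."""
        self.io_status = status
        self.state = "resumed"
        # Assumption for illustration: an error status leads to member removal.
        self.removed_member = (status == "error")

    def controller_reset(self):
        """A controller reset returns a still-outstanding IO with an error status."""
        if self.io_status is None:
            self.io_completed("error")

    def timer_tick(self):
        """Time passing changes nothing: this model has no IO timeout."""
        return self.state


mv = MountVerificationModel()
for _ in range(1000):
    mv.timer_tick()
print(mv.state)                      # the thread has not moved
mv.controller_reset()
print(mv.state, mv.removed_member)
```

    In this model the only ways out of the stalled state are the IO
    finishing by itself or something, such as a controller reset, forcing
    it to finish with an error.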


399.9. "Not likely the DRIV04 issue" by CSC32::M_DIFABIO (MOVL #OPINION,EXE$GL_BLAKHOLE) Wed Apr 02 1997 11:55, 12 lines
     What Kevin looked at was the Response ID and the MSCP Cmd status
    fields in the CDDB. What we look for is a non-zero Response ID and an 
    MSCP Cmd status that is negative, starting at FFFFFFFE and becoming
    more negative once per controller timeout period. (He also looked at
    your Master FL and Restart FL and saw they were not valid addresses,
    hence the SDA mismatch he mentioned.)
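
    For anyone who wants to apply that check to the hex values printed by
    SDA, here is a minimal sketch in Python. The function names are made up
    for illustration and the values are typed in by hand from the display;
    this is not an SDA feature, and the second sample pair is hypothetical.

```python
def to_signed32(hex_text):
    """Interpret a hex longword as printed by SDA as a signed 32-bit integer."""
    value = int(hex_text, 16) & 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def shows_polling_symptom(response_id_hex, cmd_status_hex):
    """True when the Response ID is non-zero and the MSCP Cmd status is
    FFFFFFFE or more negative (signed value of -2 or lower)."""
    response_id = int(response_id_hex, 16)
    cmd_status = to_signed32(cmd_status_hex)
    return response_id != 0 and cmd_status <= -2


# Response ID and MSCP Cmd status from a CDDB display earlier in this topic.
print(shows_polling_symptom("00000000", "00000000"))
# Hypothetical values, invented only to show a positive match.
print(shows_polling_symptom("00000001", "FFFFFFFD"))
```

    The first call reports no match because both fields are zero; the
    second pair would match because the Response ID is non-zero and the
    status is below FFFFFFFE.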
    
      So yes, that is exactly what I was looking for and no, it doesn't 
    appear that you had that specific problem.  In your info, there was 
    a read outstanding to DUA110 at the time of the crash.
    
                  Mark d. 
    
399.10. "more info" by BACHUS::CAERELS () Thu Apr 03 1997 03:19, 225 lines
    
    
	There's a whole story behind this problem and I'll spare you the 
    details. One important fact is that one disk was replaced (due to
    excessive errors on that disk), and this could have caused all the
    problems: access to all disks on the HSJ controller serving that
    particular disk became impossible because of the mount verify,
    mounted status.
    Of the 4 nodes in the cluster, 1 system was rebooted and hung
    in its boot process for 3 hours (MOUNTV image in STARTUP).

===============================================================================

HSJ20$DUA213                            HSX00             UCB address:  9DC5D3C0

Device status:   00020010 online,lcl_valid
Characteristics: 1CC54008 dir,fod,shr,avl,elg,all,idv,odv,rnd
                 00002221 clu,mscp,nnm,loc

Owner UIC [000001,000004]   Operation count          0   ORB address    9DCE19C0
      PID        00010004   Error count              0   DDB address    9DCE1580
Alloc. lock ID   030004CE   Reference count          2   DDT address    9FD22C98
Alloc. class            1   Online count             1   CRB address    9DCE2C80
Class/Type          01/8D   BOFF                  0000   PDT address    9DC44990
Def. buf. size        512   Byte count            0000   CDDB address   9DCE37C0
DEVDEPEND        00000000   SVAPTE            00000000   I/O wait queue    empty
DEVDEPND2        00000000   DEVSTS                1004                          
FLCK index             34   RWAITCNT              0001                          
DLCK address     00000000                                                       
Device   DEVSTS   status:   1004 nocnvrt,mscp_pkack

		--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---

Status:              0040 alcls_set
Status2:             0002 crnset
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc

Allocation class       1    CDRP Queue      9EDFA1A0    DDB address     9DCE1580
System ID       10093920    Restart Queue      empty    CRB address     9DCE2C80
                    4200    DAP Count              8    CDDB link       9DCE53C0
Contrl. ID      54600015    Contr. timeout       200    PDT address     9DC44990
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9DC5CD80
MSCP Cmd status FFFFFFFF                                                        
				I/O request queue
				-----------------

STATE CDRP/IRP    PID   MODE CHAN  FUNC    WCB     EFN    AST     IOSB    STATUS

 C    9EDFA9E0  00010004  E  FF30  0808  00000000  26  00000000  7FFE95B8  0101
	packack bufio,physio 

===============================================================================

SDA output from another node for this device (taken with the SDA.EXE from DOSD)
does in fact show the non-zero Response ID and the negative MSCP Cmd status
field, although the DSA device is NOT in mount verification. Is this related to
the problem solved in DRIV04? The reason I insist is that the problem
description in DRIV04 is very much like the situation we saw at the customer
site (the customer agrees on this as well). If they don't match, I'll escalate
the problem.

Rik




HSJ20$DUA213 (HSJ21$DUA213)             HSX00             UCB address:  9E70BB00

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count     104373   ORB address    9E71AAC0
      PID        00000000   Error count              2   DDB address    9E719480
Alloc. lock ID   14000B27   Reference count          1   DDT address    A0AFD498
Alloc. class            1   Online count             1   VCB address    9F7D9780
Class/Type          01/8D   BOFF                  0000   CRB address    9E719540
Def. buf. size        512   Byte count            0000   PDT address    9E6F2B90
DEVDEPEND        0BCE1055   SVAPTE            00000000   CDDB address   9E71A540
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F981D40
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9E71F340
DLCK address     00000000                                2P_DDB address 9E71E880
                                                         I/O wait queue    empty

Shadow Device status:   4004 nocnvrt,shd_wlgsta_cha

		----- Shadow Descriptor Block (SHAD) 9F981D40 -----

Virtual Unit status:              0041 normal,merging

Members                2    Act user IRPs          0    VU UCB          9F83F580
Devices                2    SCB LBN         001F5A50    Master FL          empty
Fcpy Targets           0    Generation Num  B106E760    Restart FL         empty
Mcpy Targets           2                    009B1BB2                            
Last Read Index        0    Virtual Unit Id 00000000                            
Master Index           0                    12610459                            

	    ----- SHAD Device summary for Virtual Unit  $1$DUA213  -----

Device $1$DUA113
  Index 0 Device Status    AE merge,cip,master,src,valid
  UCB 9E701300   VCB 9F7D9680   Unit Id. 12A10071 00000001   WLT: 9F8924D8
	Copy LBN FFFFFFFF
Device $1$DUA213
  Index 1 Device Status    A6 merge,cip,src,valid
  UCB 9E70BB00   VCB 9F7D9780   Unit Id. 12A100D5 00000001   WLT: 9F8A8318
	Copy LBN FFFFFFFF

		--- Primary Class Driver Data Block (CDDB) 9E71A540 ---

Status:              0040 alcls_set
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc

Allocation class       1    CDRP Queue      9F7C6160    DDB address     9E719480
System ID       10093920    Restart Queue      empty    CRB address     9E719540
                    4200    DAP Count              5    CDDB link       9E71E1C0
Contrl. ID      54600015    Contr. timeout       200    PDT address     9E6F2B90
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9E70B740
MSCP Cmd status FFFFFFFF                                                        

		--- Secondary Class Driver Data Block (CDDB) 9E71F340 ---

Status:              0040 alcls_set
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc

Allocation class       1    CDRP Queue      9F78F320    DDB address     9E71E880
System ID       100A3922    Restart Queue      empty    CRB address     9E71E940
                    4200    DAP Count              5    CDDB link       9E721780
Contrl. ID      54411986    Contr. timeout       200    PDT address     9E6F2B90
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     853D0065    Wait UCB Count         0    UCB chain       9E70B880
MSCP Cmd status FFFFFF94                                                        

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F7D9780 ---

Volume: PR113DISK     (Member of shadow set DSA1113)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9E70BB00    Virtual unit UCB  9F83F580
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F7D9580
AQB address     9E70F540                  00000000    Shadow member FL  9F7D961C
RVT address     9F83F580                              Shadow member BL  9F7D9680
(HSJ21$DUA213) HSJ20$DUA213             HSX00             UCB address:  9E70BB00

Device status:   00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                 02042231 clu,2p,mscp,nnm,loc,shd,wlg

Owner UIC [000001,000004]   Operation count     104373   ORB address    9E71AAC0
      PID        00000000   Error count              2   DDB address    9E719480
Alloc. lock ID   14000B27   Reference count          1   DDT address    A0AFD498
Alloc. class            1   Online count             1   VCB address    9F7D9780
Class/Type          01/8D   BOFF                  0000   CRB address    9E719540
Def. buf. size        512   Byte count            0000   PDT address    9E6F2B90
DEVDEPEND        0BCE1055   SVAPTE            00000000   CDDB address   9E71A540
DEVDEPND2        00000000   DEVSTS                4004   SHAD address   9F981D40
FLCK index             34   RWAITCNT              0000   2P_CDDB addr.  9E71F340
DLCK address     00000000                                2P_DDB address 9E71E880
                                                         I/O wait queue    empty

Shadow Device status:   4004 nocnvrt,shd_wlgsta_cha

		----- Shadow Descriptor Block (SHAD) 9F981D40 -----

Virtual Unit status:              0041 normal,merging

Members                2    Act user IRPs          0    VU UCB          9F83F580
Devices                2    SCB LBN         001F5A50    Master FL          empty
Fcpy Targets           0    Generation Num  B106E760    Restart FL         empty
Mcpy Targets           2                    009B1BB2                            
Last Read Index        0    Virtual Unit Id 00000000                            
Master Index           0                    12610459                            

	    ----- SHAD Device summary for Virtual Unit  $1$DUA213  -----

Device $1$DUA113
  Index 0 Device Status    AE merge,cip,master,src,valid
  UCB 9E701300   VCB 9F7D9680   Unit Id. 12A10071 00000001   WLT: 9F8924D8
	Copy LBN FFFFFFFF
Device $1$DUA213
  Index 1 Device Status    A6 merge,cip,src,valid
  UCB 9E70BB00   VCB 9F7D9780   Unit Id. 12A100D5 00000001   WLT: 9F8A8318
	Copy LBN FFFFFFFF

		--- Primary Class Driver Data Block (CDDB) 9E71A540 ---

Status:              0040 alcls_set
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc

Allocation class       1    CDRP Queue      9F7C6160    DDB address     9E719480
System ID       10093920    Restart Queue      empty    CRB address     9E719540
                    4200    DAP Count              5    CDDB link       9E71E1C0
Contrl. ID      54600015    Contr. timeout       200    PDT address     9E6F2B90
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     00000000    Wait UCB Count         0    UCB chain       9E70B740
MSCP Cmd status FFFFFFFF                                                        

		--- Secondary Class Driver Data Block (CDDB) 9E71F340 ---

Status:              0040 alcls_set
Controller Flags:    A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc

Allocation class       1    CDRP Queue      9F78F320    DDB address     9E71E880
System ID       100A3922    Restart Queue      empty    CRB address     9E71E940
                    4200    DAP Count              5    CDDB link       9E721780
Contrl. ID      54411986    Contr. timeout       200    PDT address     9E6F2B90
                01280009    Reinit Count           0    Original UCB    00000000
Response ID     853D0065    Wait UCB Count         0    UCB chain       9E70B880
MSCP Cmd status FFFFFF94                                                        

	*** I/O request queue is empty ***

		--- Volume Control Block (VCB) 9F7D9780 ---

Volume: PR113DISK     (Member of shadow set DSA1113)
Status: 00 
Copy sequence number: 001F  Copy type: 2 mgcpy

Transactions           1    UCB address   9E70BB00    Virtual unit UCB  9F83F580
Relative volume        0    Work area     001F5A51    Virtual unit VCB  9F7D9580
AQB address     9E70F540                  00000000    Shadow member FL  9F7D961C
RVT address     9F83F580                              Shadow member BL  9F7D9680
    
    
399.11. "Thank you" by GREGOR::OPP () Thu Apr 03 1997 12:17, 7 lines
    RE: .8
    
    	Thanks for the explanation.  I had naively assumed that there was
    an I/O time-out somewhere that would prevent infinite waiting.  
    
    Greg
    
399.12. "That is what I was looking for" by CSC32::M_DIFABIO (MOVL #OPINION,EXE$GL_BLAKHOLE) Thu Apr 03 1997 21:05, 19 lines
    re .10
    
      In short, yep. Basically we ensure that the controller is making
    progress on the oldest outstanding command. After 2 timeout periods
    we issue a GCS (Get Command Status) request to the controller. We
    start our 'countdown' at FFFFFFFE (it was FFFFFFFF at the first 
    timeout, FFFFFFFE at the second). If the controller responds that
    it is making progress, we poll each timeout. Yours was something
    like FFFFFFD8, so you were 40 timeout periods into polling. Since the
    controller was responding with a more 'negative' status each time, we
    kept going from FFFFFFFF towards 0.
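
    The arithmetic behind "40 timeout periods" can be checked with a short
    sketch in Python. It assumes only what is described above: the field
    holds FFFFFFFF after the first timeout and drops by one per timeout
    period. The function name is made up, and converting periods to elapsed
    time assumes the "Contr. timeout 200" value in the CDDB is in seconds,
    which the dump itself does not state.

```python
def timeout_periods(cmd_status_hex):
    """Number of timeout periods implied by an MSCP Cmd status countdown value.

    FFFFFFFF (-1) means one period, FFFFFFFE (-2) means two, and so on.
    Non-negative values are treated as "no countdown in progress".
    """
    value = int(cmd_status_hex, 16) & 0xFFFFFFFF
    signed = value - 0x100000000 if value >= 0x80000000 else value
    return -signed if signed < 0 else 0


CONTR_TIMEOUT = 200  # from the CDDB display; unit assumed to be seconds

for status in ("FFFFFFFF", "FFFFFFFE", "FFFFFFD8", "FFFFFF94"):
    periods = timeout_periods(status)
    print(status, periods, "periods, about", periods * CONTR_TIMEOUT, "seconds")
```

    FFFFFFD8 works out to 40 periods, as stated above, and the FFFFFF94
    shown in the secondary CDDB in .10 works out to 108 periods.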
    
       It isn't always a packack that is reported to be 'making progress'.
    In your case it was a read, right? In any event, given the symptom
    you've shown, you do want to get ALPDRIV04 installed. The alternative
    would have been to restart the controller that was reporting that it
    was making progress on the command. 
    
                          Mark d.