| Title: | VAX and Alpha VMS |
| Notice: | This is a new VMSnotes, please read note 2.1 |
| Moderator: | VAXAXP::BERNARDO |
| Created: | Wed Jan 22 1997 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 703 |
| Total number of notes: | 3722 |
Hello,
My customer (Volvo) lost 250 cars because Volume Shadowing did not do what it should do: if a member
of a shadowset has problems, remove it from the shadowset and continue working on 1 member.
Scenario:
+-------+-O-O-O-O               +-------+-O-O-O-O
| H | H |                       | H | H |
| S | S |                       | S | S |
| J | J |                       | J | J |
| 4 | 4 |                       | 4 | 4 |
| 0 | 0 |                       | 0 | 0 |
|   |   |                       |   |   |
| A | B |                       | C | D |
+-------+                       +-------+
    \                               /
     \-------------O---------------/
                  / \
                 /   \
        +-------+     +-------+
        |  VAX  |     |  VAX  |
        +-------+     +-------+
Disks are shadowed by VMS between HSJA/B and HSJC/D.
One disk on HSJA generated errors, resulting in a mount verification on the shadowset. This mount
verification never completed although only one member of the shadowset was "bad". The customer didn't
think of powering off the HSJ and rebooted one node that holds its pagefiles on that disk, resulting in a
SHADDETINCON bugcheck on all other cluster members.
Question: Why did shadowing not remove this physical member from its shadowset?
I've included an SDA output of one of the nodes that had this SHADDETINCON crash. You can see the
DSA1110 device in mount verification, with one member having an MVIRP (special mount verification IRP);
the other member seems to be OK...
thank you for any input...
Rik Caerels
DSA1110 HSX00 UCB address: 9F21D9C0
Device status: 00064810 online,valid,mntverip,lcl_valid,supmvmsg
Characteristics: 1C4D4008 dir,fod,shr,avl,mnt,elg,idv,odv,rnd
00082021 clu,mscp,loc,vrt
Owner UIC [000001,000004] Operation count 852 ORB address 9EB00380
PID 00000000 Error count 0 DDB address 9DC00780
Alloc. lock ID 080002D7 Reference count 1 DDT address 9FCCC018
Alloc. class 0 Online count 1 VCB address 9F1E3500
Class/Type 01/8D BOFF 0000 CRB address 9DC00A00
Def. buf. size 512 Byte count 7E00 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE A6DE6468 CDDB address 9DC44040
DEVDEPND2 00000000 DEVSTS 010C SHAD address 9F389440
FLCK index 34 RWAITCNT 0001 I/O wait queue empty
DLCK address 00000000
Shadow Virtual Unit DEVSTS status: 010C nocnvrt,du_shmv_strtd,mscp_mntverip
----- Shadow Descriptor Block (SHAD) 9F389440 -----
Virtual Unit status: 0000
Members 0 Act user IRPs 1 VU UCB 9F21D9C0
Devices 0 SCB LBN 00000000 MMB 050003EE
Fcpy Targets 192 Generation Num 9F3897C8 Master FL 30313131
Mcpy Targets 0 9FCD5BEC Restart FL 564E4944
Last Read Index 0 Virtual Unit Id 00000001
Master Index 0 050003EB
----- SHAD Device summary for DSA1110 -----
--- Primary Class Driver Data Block (CDDB) 9DC44040 ---
Status: 0000
Status2: 0000
Controller Flags: 00D0 cf_this,cf_misc,cf_attn
Allocation class 0 CDRP Queue empty DDB address 9DC00780
System ID 00000000 Restart Queue empty CRB address 9DC00A00
0000 DAP Count 0 CDDB link 9DC4F7C0
Contrl. ID 00000000 Contr. timeout 0 PDT address 00000000
00000000 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC44180
MSCP Cmd status 00000000
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F1E3500 ---
Volume: PR110DISK Lock name: PR110DISK
Status: A0 extfid,system
Status2: 14 mountver,nohighwater
Status3: 00000000
Shadow status: 01 shadmast
Mount count 1 Rel. volume 0 AQB address 9DCE6E40
Transactions 1 Max. files 410947 RVT address 9F21D9C0
Free blocks 856632 Rsvd. files 10 FCB queue 9F21DB00
OpenVMS (TM) VAX V6.1 -- System Dump Analysis 28-MAR-1997 15:46:33.93 Page 2
I/O data structures
Window size 7 Cluster size 4 Cache blk. 9F368D40
Vol. lock ID 050003F7 Def. extend sz. 5 Shadow mem. FL 9F1E3600
Shadow lock ID 080003FC Record size 0 Shadow mem. BL 9F1E3700
--- Shadow set DSA1110 member summary ---
Volume: PR110DISK
Physical unit Primary path Secondary path Member status
------------- ------------ -------------- -------------
$1$DUA110 HSJ10 HSJ11 Merge copy in progress
$1$DUA210 HSJ20 HSJ21 Merge copy in progress
--- ACP Queue Block (AQB) 9DCE6E40 ---
ACP requests are serviced by the eXtended Qio Processor (XQP)
Status: 14 defsys,xqioproc
Mount count 4 ACP type f11v2 Linkage 9DC44000
ACP class 157 Request queue 00000000
*** ACP request queue is empty ***
NVJ$DUA110 (HSJ11$DUA110) HSX00 UCB address: 9DC52B40
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 639 ORB address 9DC5B500
PID 00000000 Error count 0 DDB address 9BFCE7F0
Alloc. lock ID 050003BF Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3600
Class/Type 01/8D BOFF 0080 CRB address 9DC42F80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE549698 CDDB address 9DC4F7C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9DCE8BC0
DLCK address 00000000 2P_DDB address 9DCE7A80
I/O wait queue empty
I/O request queue
-----------------
STATE CDRP/IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS
C 9EDEF820 9FCCD90A K 0000 000C 9EDEF640 0 9EDEF700 00000000 2102
readpblk func,physio,mvirp
--- Volume Control Block (VCB) 9F1E3600 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC52B40 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E3700
RVT address 9F21D9C0 Shadow member BL 9F1E359C
NVJ$DUA210 (HSJ21$DUA210) HSX00 UCB address: 9DC5CD80
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 634 ORB address 9DCE0D80
PID 00000000 Error count 0 DDB address 9DCE1580
Alloc. lock ID 050003CB Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3700
Class/Type 01/8D BOFF 0080 CRB address 9DCE2C80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE54969C CDDB address 9DCE37C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9F355800
DLCK address 00000000 2P_DDB address 9DDB3F80
I/O wait queue empty
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F1E3700 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC5CD80 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E359C
RVT address 9F21D9C0 Shadow member BL 9F1E3600
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 399.1 | VMSSG::FRIEDRICHS | Ask me about Young Eagles | Fri Mar 28 1997 12:27 | 50 | |
(.0 reformatted)
<<< Note 399.0 by BACHUS::CAERELS >>>
-< Shadowing is not removing a "bad" member of a shadowset... >-
Hello,
My customer (Volvo) lost 250 cars because Volume Shadowing did not do what it
should do: if a member of a shadowset has problems, remove it from the shadowset
and continue working on 1 member.
Scenario:
+-------+-O-O-O-O               +-------+-O-O-O-O
| H | H |                       | H | H |
| S | S |                       | S | S |
| J | J |                       | J | J |
| 4 | 4 |                       | 4 | 4 |
| 0 | 0 |                       | 0 | 0 |
|   |   |                       |   |   |
| A | B |                       | C | D |
+-------+                       +-------+
    \                               /
     \-------------O---------------/
                  / \
                 /   \
        +-------+     +-------+
        |  VAX  |     |  VAX  |
        +-------+     +-------+
Disks are shadowed by VMS between HSJA/B and HSJC/D.
One disk on HSJA generated errors, resulting in a mount verification on the
shadowset. This mount verification never completed although only one member of
the shadowset was "bad". The customer didn't think of powering off the HSJ and
rebooted one node that holds its pagefiles on that disk, resulting in a
SHADDETINCON bugcheck on all other cluster members.
Question: Why did shadowing not remove this physical member from its shadowset?
I've included an SDA output of one of the nodes that had this SHADDETINCON crash.
You can see the DSA1110 device in mount verification, with one member having an
MVIRP (special mount verification IRP); the other member seems to be OK...
thank you for any input...
Rik Caerels
| |||||
| 399.2 | IPMT if you really want it looked at! | VMSSG::FRIEDRICHS | Ask me about Young Eagles | Fri Mar 28 1997 12:37 | 7 |
What version of VMS are you running? patches?
What is SHADOW_MBR_TMO? What is MVTIMOUT?
Cheers,
jeff
| |||||
| 399.3 | Can you post SHO DEV for DUA110 and DUA210 from Crash? | CSC32::M_DIFABIO | MOVL #OPINION,EXE$GL_BLAKHOLE | Fri Mar 28 1997 19:32 | 11 |
Wish you had done a SHO DEV DUA for each of the physical disks. Then
we would have gotten the Primary CDDB info for each. Mount verify, so
we're likely doing PACKACKs to a controller. I won't speculate without
looking at the dump, but if you have the dump could you post the output
of SHO DEV DUA110 and DUA210? And do you have VAXDRIV04_070 installed?
For shadowing to expel a member, we would need to get some response/
end message from the HSJ during the PACKACK. We check if it's time to
expel a member when we get a response from a command.
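To make the point about response-driven expulsion concrete, here is a minimal Python sketch. It is a toy model, not SHDRIVER code: `Member`, `on_end_message`, `simulate`, and all the time values are hypothetical names and numbers chosen for illustration. The only behavior taken from the text above is that the "time to expel?" check runs when a response to a command arrives.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    failing_since: float      # time (seconds) the member first misbehaved
    expelled: bool = False


def on_end_message(member, now, shadow_mbr_tmo):
    """Hypothetical handler: it runs only when the controller answers a command."""
    if now - member.failing_since >= shadow_mbr_tmo:
        member.expelled = True
    return member.expelled


def simulate(response_times, shadow_mbr_tmo):
    """Return the time at which the member is expelled, or None if it never is."""
    member = Member("$1$DUA110", failing_since=0.0)
    for now in sorted(response_times):
        if on_end_message(member, now, shadow_mbr_tmo):
            return now
    return None   # no response ever triggered the check


print(simulate([30, 60, 150], shadow_mbr_tmo=120))  # 150
print(simulate([], shadow_mbr_tmo=120))             # None
print(simulate([], shadow_mbr_tmo=5))               # None
```

In this model the member is expelled at the first response after the timeout has elapsed. With no responses at all it is never expelled, however small the timeout is, because the check never runs.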
Mark d.
| |||||
| 399.4 | Sounds normal to me | RICKS::OPP | Mon Mar 31 1997 20:34 | 13 | |
In my experience with fault-tolerant VAX systems, the behavior
you described for Host-Based Volume Shadowing is *normal* assuming
you've used default SYSGEN parameters. This is because VMS favors
data integrity over response time when handling disk
errors, such as bad block replacements, etc. You need to determine
how aggressive your customer needs/wants to be in the other direction.
For example, the SYSGEN shadow member time-out parameter can be
significantly reduced from the default (formerly 20 seconds).
Regards,
Greg
| |||||
| 399.5 | Still need a response from the I/O | CSC32::M_DIFABIO | MOVL #OPINION,EXE$GL_BLAKHOLE | Tue Apr 01 1997 12:34 | 4 |
...But if there is no response from the I/O a shorter SHADOW_MBR_TMO
buys you nothing.
Mark d.
| |||||
| 399.6 | VAXDRIV04_070 - that is most probably the cure... | BACHUS::CAERELS | Wed Apr 02 1997 01:11 | 351 | |
Thanks for all replies, sorry for responding so late - we had Easter holidays.
re .2
Customer runs version 6.1 with several patches, including VAXSHAD09_061 and
VAXDRIV02_070.
Shadow SYSGEN parameters and MVTIMOUT are at default values.
re .3
I've included an SDA output of your requested info.
>>> looking at the dump, but if you have the dump could you post the output
>>> of SHO DEV DUA110 and DUA210. And do you have VAXDRIV04_070 installed?
Great input. At the time of the 6.1 upgrade we installed VAXDRIV02_070. Due to
time constraints (reboot = downtime, and this is a 24x24 production system) no major
revision of patches has been done since; this is certainly something we have to
review! Anyway, in the release notes of VAXDRIV04_070 I saw:
o A problem exists with HSJ/HSD30,40 and 50 controllers where,
after an event that initiates Mountverfication, a Pack-Ack will
fail to complete. The controller will report that it is making
progress on the command, but will never finish. This causes
all IO to the affected devices to be hung.
which is exactly the problem we had. Can you confirm this?
Thanks again to Mark D. for this swift response,
regards,
Rik Caerels
================================================================================
HSJ10$DUA110 (HSJ11$DUA110) HSX00 UCB address: 9DC52B40
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 639 ORB address 9DC5B500
PID 00000000 Error count 0 DDB address 9BFCE7F0
Alloc. lock ID 050003BF Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3600
Class/Type 01/8D BOFF 0080 CRB address 9DC42F80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE549698 CDDB address 9DC4F7C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9DCE8BC0
DLCK address 00000000 2P_DDB address 9DCE7A80
I/O wait queue empty
Shadow Member Device DEVSTS status: 4004 nocnvrt,mscp_ignsrv
----- Shadow Descriptor Block (SHAD) 9F389440 -----
Virtual Unit status: 0000
Members 0 Act user IRPs 1 VU UCB 9F21D9C0
Devices 0 SCB LBN 00000000 MMB 050003EE
Fcpy Targets 192 Generation Num 9F3897C8 Master FL 30313131
Mcpy Targets 0 9FCD5BEC Restart FL 564E4944
Last Read Index 0 Virtual Unit Id 00000001
Master Index 0 050003EB
----- SHAD Device summary for $1$DUA110 -----
--- Primary Class Driver Data Block (CDDB) 9DC4F7C0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue 9EDF86A0 DDB address 9BFCE7F0
System ID 10073520 Restart Queue empty CRB address 9DC42F80
4200 DAP Count 7 CDDB link 9DCDD0C0
Contrl. ID 54411859 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 1 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC52B40
MSCP Cmd status 00000000
OpenVMS (TM) VAX V6.1 -- System Dump Analysis 1-APR-1997 10:19:38.68 Page 2
I/O data structures
--- Secondary Class Driver Data Block (CDDB) 9DCE8BC0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue empty DDB address 9DCE7A80
System ID 10083522 Restart Queue empty CRB address 9DCE7C80
4200 DAP Count 8 CDDB link 9F355800
Contrl. ID 53210933 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DCE9340
MSCP Cmd status 00000000
I/O request queue
-----------------
STATE CDRP/IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS
C 9EDEF820 9FCCD90A K 0000 000C 9EDEF640 0 9EDEF700 00000000 2102
readpblk func,physio,mvirp
--- Volume Control Block (VCB) 9F1E3600 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC52B40 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E3700
RVT address 9F21D9C0 Shadow member BL 9F1E359C
(HSJ11$DUA110) HSJ10$DUA110 HSX00 UCB address: 9DC52B40
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 639 ORB address 9DC5B500
PID 00000000 Error count 0 DDB address 9BFCE7F0
Alloc. lock ID 050003BF Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3600
Class/Type 01/8D BOFF 0080 CRB address 9DC42F80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE549698 CDDB address 9DC4F7C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9DCE8BC0
DLCK address 00000000 2P_DDB address 9DCE7A80
I/O wait queue empty
Shadow Member Device DEVSTS status: 4004 nocnvrt,mscp_ignsrv
----- Shadow Descriptor Block (SHAD) 9F389440 -----
Virtual Unit status: 0000
Members 0 Act user IRPs 1 VU UCB 9F21D9C0
Devices 0 SCB LBN 00000000 MMB 050003EE
Fcpy Targets 192 Generation Num 9F3897C8 Master FL 30313131
Mcpy Targets 0 9FCD5BEC Restart FL 564E4944
Last Read Index 0 Virtual Unit Id 00000001
Master Index 0 050003EB
----- SHAD Device summary for $1$DUA110 -----
--- Primary Class Driver Data Block (CDDB) 9DC4F7C0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue 9EDF86A0 DDB address 9BFCE7F0
System ID 10073520 Restart Queue empty CRB address 9DC42F80
4200 DAP Count 7 CDDB link 9DCDD0C0
Contrl. ID 54411859 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 1 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC52B40
MSCP Cmd status 00000000
--- Secondary Class Driver Data Block (CDDB) 9DCE8BC0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue empty DDB address 9DCE7A80
System ID 10083522 Restart Queue empty CRB address 9DCE7C80
4200 DAP Count 8 CDDB link 9F355800
Contrl. ID 53210933 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DCE9340
MSCP Cmd status 00000000
I/O request queue
-----------------
STATE CDRP/IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS
C 9EDEF820 9FCCD90A K 0000 000C 9EDEF640 0 9EDEF700 00000000 2102
readpblk func,physio,mvirp
--- Volume Control Block (VCB) 9F1E3600 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC52B40 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E3700
RVT address 9F21D9C0 Shadow member BL 9F1E359C
HSJ20$DUA210 (HSJ21$DUA210) HSX00 UCB address: 9DC5CD80
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 634 ORB address 9DCE0D80
PID 00000000 Error count 0 DDB address 9DCE1580
Alloc. lock ID 050003CB Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3700
Class/Type 01/8D BOFF 0080 CRB address 9DCE2C80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE54969C CDDB address 9DCE37C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9F355800
DLCK address 00000000 2P_DDB address 9DDB3F80
I/O wait queue empty
Shadow Member Device DEVSTS status: 4004 nocnvrt,mscp_ignsrv
----- Shadow Descriptor Block (SHAD) 9F389440 -----
Virtual Unit status: 0000
Members 0 Act user IRPs 1 VU UCB 9F21D9C0
Devices 0 SCB LBN 00000000 MMB 050003EE
Fcpy Targets 192 Generation Num 9F3897C8 Master FL 30313131
Mcpy Targets 0 9FCD5BEC Restart FL 564E4944
Last Read Index 0 Virtual Unit Id 00000001
Master Index 0 050003EB
----- SHAD Device summary for $1$DUA210 -----
--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue 9EDFA1A0 DDB address 9DCE1580
System ID 10093920 Restart Queue empty CRB address 9DCE2C80
4200 DAP Count 8 CDDB link 9DCE53C0
Contrl. ID 54600015 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC5CD80
MSCP Cmd status FFFFFFFF
--- Secondary Class Driver Data Block (CDDB) 9F355800 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue empty DDB address 9DDB3F80
System ID 100A3922 Restart Queue empty CRB address 9DDB4000
4200 DAP Count 8 CDDB link 00000000
Contrl. ID 54411986 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9F209C40
MSCP Cmd status 00000000
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F1E3700 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC5CD80 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E359C
RVT address 9F21D9C0 Shadow member BL 9F1E3600
(HSJ21$DUA210) HSJ20$DUA210 HSX00 UCB address: 9DC5CD80
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 634 ORB address 9DCE0D80
PID 00000000 Error count 0 DDB address 9DCE1580
Alloc. lock ID 050003CB Reference count 1 DDT address 9FD22C98
Alloc. class 1 Online count 1 VCB address 9F1E3700
Class/Type 01/8D BOFF 0080 CRB address 9DCE2C80
Def. buf. size 512 Byte count 0200 PDT address 9DC44990
DEVDEPEND 0BCE1055 SVAPTE CE54969C CDDB address 9DCE37C0
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F389440
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9F355800
DLCK address 00000000 2P_DDB address 9DDB3F80
I/O wait queue empty
Shadow Member Device DEVSTS status: 4004 nocnvrt,mscp_ignsrv
----- Shadow Descriptor Block (SHAD) 9F389440 -----
Virtual Unit status: 0000
Members 0 Act user IRPs 1 VU UCB 9F21D9C0
Devices 0 SCB LBN 00000000 MMB 050003EE
Fcpy Targets 192 Generation Num 9F3897C8 Master FL 30313131
Mcpy Targets 0 9FCD5BEC Restart FL 564E4944
Last Read Index 0 Virtual Unit Id 00000001
Master Index 0 050003EB
----- SHAD Device summary for $1$DUA210 -----
--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue 9EDFA1A0 DDB address 9DCE1580
System ID 10093920 Restart Queue empty CRB address 9DCE2C80
4200 DAP Count 8 CDDB link 9DCE53C0
Contrl. ID 54600015 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC5CD80
MSCP Cmd status FFFFFFFF
--- Secondary Class Driver Data Block (CDDB) 9F355800 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue empty DDB address 9DDB3F80
System ID 100A3922 Restart Queue empty CRB address 9DDB4000
4200 DAP Count 8 CDDB link 00000000
Contrl. ID 54411986 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9F209C40
MSCP Cmd status 00000000
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F1E3700 ---
Volume: PR110DISK (Member of shadow set DSA1110)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9DC5CD80 Virtual unit UCB 9F21D9C0
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F1E3500
AQB address 9DCE6E40 00000000 Shadow member FL 9F1E359C
RVT address 9F21D9C0 Shadow member BL 9F1E3600
================================================================================
| |||||
| 399.7 | Follow-up questions | GREGOR::OPP | Wed Apr 02 1997 07:30 | 10 | |
RE: .5
And if the I/O device is not responding, what's the design
rationale for keeping it in the shadow set (unless it's the last
surviving member)? If the I/O device is a storage controller and
it's not responding, why would HBVS not attempt to fail-over?
Thanks,
Greg
| |||||
| 399.8 | Doesn't look like DUDRIVER is the fix | VMSSPT::JENKINS | Kevin M Jenkins VMS Support Engineering | Wed Apr 02 1997 09:06 | 19 |
RE .0
I don't believe that you are seeing the Pack/ACK hang problem.
The outstanding IO is a read. Also, your version of SDA appears to
be incompatible with your version of SHDRIVER. It is not interpreting
the structures and bits properly. There may/should be a newer version
available somewhere, perhaps in one of the BOOT or DOSD kits?
RE .7
The SHDRIVER code thread responsible for Mountverification is
"stalled" waiting for one of its IOs to complete. If the IO never
completes then the thread is never resumed and hence the
Mountverification becomes hung. Now when/if DUDRIVER resets the
controller, then all SHDRIVER IOs are returned with an error status
and thus the Mountverification thread is resumed and SHDRIVER can
decide what to do about the shadow set's membership.
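The stall described above can be sketched with a Python generator standing in for the code thread. This is only an illustration under stated assumptions: `mount_verification`, `run`, the status strings, and the "expel on error" decision are hypothetical and are not taken from SHDRIVER. The sketch shows only that a thread suspended on an I/O makes no decision until that I/O is returned, with success or with an error.

```python
def mount_verification(members):
    """Hypothetical model of the stalled code thread, written as a generator."""
    for name in members:
        status = yield name          # stall here until this member's I/O completes
        if status != "success":
            return f"expel {name}"   # a returned error lets the thread decide
    return "keep all members"


def run(members, completions):
    """Feed I/O completions to the thread; report its decision or where it hangs."""
    thread = mount_verification(members)
    waiting_on = next(thread)        # thread issues its first I/O and stalls
    for status in completions:
        try:
            waiting_on = thread.send(status)
        except StopIteration as done:
            return done.value
    return f"hung waiting on {waiting_on}"


members = ["$1$DUA110", "$1$DUA210"]
print(run(members, []))                      # hung waiting on $1$DUA110
print(run(members, ["error"]))               # expel $1$DUA110
print(run(members, ["success", "success"]))  # keep all members
```

The first call models the hang: no completion ever arrives, so the thread never resumes. The second models a controller reset that returns the outstanding I/O with an error status, which is what finally lets the thread run.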
| |||||
| 399.9 | Not likely the DRIV04 issue | CSC32::M_DIFABIO | MOVL #OPINION,EXE$GL_BLAKHOLE | Wed Apr 02 1997 10:55 | 12 |
What Kevin looked at was the Response ID and the MSCP Cmd status
fields in the CDDB. What we look for is a non-zero Response ID and an
MSCP Cmd status that is negative, going from FFFFFFFE to a more
negative number every controller timeout seconds. (He also looked at
your Master FL and Restart FL and saw they were not valid addresses,
hence the SDA mismatch he mentioned.)
So yes, that is exactly what I was looking for and no, it doesn't
appear that you had that specific problem. In your info, there was
a read outstanding to DUA110 at the time of the crash.
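As a rough illustration of that check, the following Python sketch applies the two criteria to CDDB field values copied from the SDA output in this topic. The function names are hypothetical, and the threshold (FFFFFFFE or more negative, read as a signed 32-bit number) is taken only from the description above, not from driver source.

```python
def to_signed32(hex_text):
    """Interpret an 8-digit hex field as a signed 32-bit integer."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value & 0x80000000 else value


def looks_like_stuck_command(response_id, mscp_cmd_status):
    """True when Response ID is non-zero and MSCP Cmd status is FFFFFFFE or lower."""
    return int(response_id, 16) != 0 and to_signed32(mscp_cmd_status) <= -2


# (Response ID, MSCP Cmd status) pairs as printed by SDA in this topic
print(looks_like_stuck_command("00000000", "00000000"))  # False
print(looks_like_stuck_command("00000000", "FFFFFFFF"))  # False
print(looks_like_stuck_command("853D0065", "FFFFFF94"))  # True
```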
Mark d.
| |||||
| 399.10 | more info | BACHUS::CAERELS | Thu Apr 03 1997 02:19 | 225 | |
There's a whole story behind this problem and I'll spare you the
details. One important fact is that one disk was replaced (due to
excessive errors on that disk) and this could have caused all the
problems:
access to all disks on the HSJ controller serving that particular
disk became impossible due to the mount verify, mounted status.
Of the 4 nodes in the cluster, 1 system was rebooted and hung
in its boot process for 3 hours (MOUNTV image in STARTUP).
===============================================================================
HSJ20$DUA213 HSX00 UCB address: 9DC5D3C0
Device status: 00020010 online,lcl_valid
Characteristics: 1CC54008 dir,fod,shr,avl,elg,all,idv,odv,rnd
00002221 clu,mscp,nnm,loc
Owner UIC [000001,000004] Operation count 0 ORB address 9DCE19C0
PID 00010004 Error count 0 DDB address 9DCE1580
Alloc. lock ID 030004CE Reference count 2 DDT address 9FD22C98
Alloc. class 1 Online count 1 CRB address 9DCE2C80
Class/Type 01/8D BOFF 0000 PDT address 9DC44990
Def. buf. size 512 Byte count 0000 CDDB address 9DCE37C0
DEVDEPEND 00000000 SVAPTE 00000000 I/O wait queue empty
DEVDEPND2 00000000 DEVSTS 1004
FLCK index 34 RWAITCNT 0001
DLCK address 00000000
Device DEVSTS status: 1004 nocnvrt,mscp_pkack
--- Primary Class Driver Data Block (CDDB) 9DCE37C0 ---
Status: 0040 alcls_set
Status2: 0002 crnset
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_load,cf_replc
Allocation class 1 CDRP Queue 9EDFA1A0 DDB address 9DCE1580
System ID 10093920 Restart Queue empty CRB address 9DCE2C80
4200 DAP Count 8 CDDB link 9DCE53C0
Contrl. ID 54600015 Contr. timeout 200 PDT address 9DC44990
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9DC5CD80
MSCP Cmd status FFFFFFFF
I/O request queue
-----------------
STATE CDRP/IRP PID MODE CHAN FUNC WCB EFN AST IOSB STATUS
C 9EDFA9E0 00010004 E FF30 0808 00000000 26 00000000 7FFE95B8 0101
packack bufio,physio
===============================================================================
SDA output from another node for this device (with the SDA.EXE from DOSD) does
in fact show the non-zero Response ID and the negative MSCP Cmd status field,
although the DSA device is NOT in mount verification. Is this related to the
problem solved in DRIV04? The reason I insist is that the problem description
in DRIV04 closely matches the situation we saw at the customer site
(the customer agrees on this as well). If they don't match, I'll escalate the
problem.
Rik
HSJ20$DUA213 (HSJ21$DUA213) HSX00 UCB address: 9E70BB00
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 104373 ORB address 9E71AAC0
PID 00000000 Error count 2 DDB address 9E719480
Alloc. lock ID 14000B27 Reference count 1 DDT address A0AFD498
Alloc. class 1 Online count 1 VCB address 9F7D9780
Class/Type 01/8D BOFF 0000 CRB address 9E719540
Def. buf. size 512 Byte count 0000 PDT address 9E6F2B90
DEVDEPEND 0BCE1055 SVAPTE 00000000 CDDB address 9E71A540
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F981D40
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9E71F340
DLCK address 00000000 2P_DDB address 9E71E880
I/O wait queue empty
Shadow Device status: 4004 nocnvrt,shd_wlgsta_cha
----- Shadow Descriptor Block (SHAD) 9F981D40 -----
Virtual Unit status: 0041 normal,merging
Members 2 Act user IRPs 0 VU UCB 9F83F580
Devices 2 SCB LBN 001F5A50 Master FL empty
Fcpy Targets 0 Generation Num B106E760 Restart FL empty
Mcpy Targets 2 009B1BB2
Last Read Index 0 Virtual Unit Id 00000000
Master Index 0 12610459
----- SHAD Device summary for Virtual Unit $1$DUA213 -----
Device $1$DUA113
Index 0 Device Status AE merge,cip,master,src,valid
UCB 9E701300 VCB 9F7D9680 Unit Id. 12A10071 00000001 WLT: 9F8924D8
Copy LBN FFFFFFFF
Device $1$DUA213
Index 1 Device Status A6 merge,cip,src,valid
UCB 9E70BB00 VCB 9F7D9780 Unit Id. 12A100D5 00000001 WLT: 9F8A8318
Copy LBN FFFFFFFF
--- Primary Class Driver Data Block (CDDB) 9E71A540 ---
Status: 0040 alcls_set
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc
Allocation class 1 CDRP Queue 9F7C6160 DDB address 9E719480
System ID 10093920 Restart Queue empty CRB address 9E719540
4200 DAP Count 5 CDDB link 9E71E1C0
Contrl. ID 54600015 Contr. timeout 200 PDT address 9E6F2B90
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9E70B740
MSCP Cmd status FFFFFFFF
--- Secondary Class Driver Data Block (CDDB) 9E71F340 ---
Status: 0040 alcls_set
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc
Allocation class 1 CDRP Queue 9F78F320 DDB address 9E71E880
System ID 100A3922 Restart Queue empty CRB address 9E71E940
4200 DAP Count 5 CDDB link 9E721780
Contrl. ID 54411986 Contr. timeout 200 PDT address 9E6F2B90
01280009 Reinit Count 0 Original UCB 00000000
Response ID 853D0065 Wait UCB Count 0 UCB chain 9E70B880
MSCP Cmd status FFFFFF94
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F7D9780 ---
Volume: PR113DISK (Member of shadow set DSA1113)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9E70BB00 Virtual unit UCB 9F83F580
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F7D9580
AQB address 9E70F540 00000000 Shadow member FL 9F7D961C
RVT address 9F83F580 Shadow member BL 9F7D9680
(HSJ21$DUA213) HSJ20$DUA213 HSX00 UCB address: 9E70BB00
Device status: 00020810 online,valid,lcl_valid
Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
02042231 clu,2p,mscp,nnm,loc,shd,wlg
Owner UIC [000001,000004] Operation count 104373 ORB address 9E71AAC0
PID 00000000 Error count 2 DDB address 9E719480
Alloc. lock ID 14000B27 Reference count 1 DDT address A0AFD498
Alloc. class 1 Online count 1 VCB address 9F7D9780
Class/Type 01/8D BOFF 0000 CRB address 9E719540
Def. buf. size 512 Byte count 0000 PDT address 9E6F2B90
DEVDEPEND 0BCE1055 SVAPTE 00000000 CDDB address 9E71A540
DEVDEPND2 00000000 DEVSTS 4004 SHAD address 9F981D40
FLCK index 34 RWAITCNT 0000 2P_CDDB addr. 9E71F340
DLCK address 00000000 2P_DDB address 9E71E880
I/O wait queue empty
Shadow Device status: 4004 nocnvrt,shd_wlgsta_cha
----- Shadow Descriptor Block (SHAD) 9F981D40 -----
Virtual Unit status: 0041 normal,merging
Members 2 Act user IRPs 0 VU UCB 9F83F580
Devices 2 SCB LBN 001F5A50 Master FL empty
Fcpy Targets 0 Generation Num B106E760 Restart FL empty
Mcpy Targets 2 009B1BB2
Last Read Index 0 Virtual Unit Id 00000000
Master Index 0 12610459
----- SHAD Device summary for Virtual Unit $1$DUA213 -----
Device $1$DUA113
Index 0 Device Status AE merge,cip,master,src,valid
UCB 9E701300 VCB 9F7D9680 Unit Id. 12A10071 00000001 WLT: 9F8924D8
Copy LBN FFFFFFFF
Device $1$DUA213
Index 1 Device Status A6 merge,cip,src,valid
UCB 9E70BB00 VCB 9F7D9780 Unit Id. 12A100D5 00000001 WLT: 9F8A8318
Copy LBN FFFFFFFF
--- Primary Class Driver Data Block (CDDB) 9E71A540 ---
Status: 0040 alcls_set
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc
Allocation class 1 CDRP Queue 9F7C6160 DDB address 9E719480
System ID 10093920 Restart Queue empty CRB address 9E719540
4200 DAP Count 5 CDDB link 9E71E1C0
Contrl. ID 54600015 Contr. timeout 200 PDT address 9E6F2B90
01280009 Reinit Count 0 Original UCB 00000000
Response ID 00000000 Wait UCB Count 0 UCB chain 9E70B740
MSCP Cmd status FFFFFFFF
--- Secondary Class Driver Data Block (CDDB) 9E71F340 ---
Status: 0040 alcls_set
Controller Flags: A2DC cf_mlths,cf_ldcd,cf_this,cf_misc,cf_attn,cf_whl,cf_replc
Allocation class 1 CDRP Queue 9F78F320 DDB address 9E71E880
System ID 100A3922 Restart Queue empty CRB address 9E71E940
4200 DAP Count 5 CDDB link 9E721780
Contrl. ID 54411986 Contr. timeout 200 PDT address 9E6F2B90
01280009 Reinit Count 0 Original UCB 00000000
Response ID 853D0065 Wait UCB Count 0 UCB chain 9E70B880
MSCP Cmd status FFFFFF94
*** I/O request queue is empty ***
--- Volume Control Block (VCB) 9F7D9780 ---
Volume: PR113DISK (Member of shadow set DSA1113)
Status: 00
Copy sequence number: 001F Copy type: 2 mgcpy
Transactions 1 UCB address 9E70BB00 Virtual unit UCB 9F83F580
Relative volume 0 Work area 001F5A51 Virtual unit VCB 9F7D9580
AQB address 9E70F540 00000000 Shadow member FL 9F7D961C
RVT address 9F83F580 Shadow member BL 9F7D9680
| |||||
| 399.11 | Thank you | GREGOR::OPP | Thu Apr 03 1997 11:17 | 7 | |
RE: .8
Thanks for the explanation. I had naively assumed that there was
an I/O time-out somewhere that would prevent infinite waiting.
Greg
| |||||
| 399.12 | That is what I was looking for | CSC32::M_DIFABIO | MOVL #OPINION,EXE$GL_BLAKHOLE | Thu Apr 03 1997 20:05 | 19 |
re .10
In short, yep. Basically we ensure that the controller is making
progress on the oldest outstanding command. After 2 timeout periods
we issue a GCS (Get Command Status) request to the controller. We
start our 'countdown' at FFFFFFFE (it was FFFFFFFF at the first
timeout, FFFFFFFE at the second). If the controller responds that
it is making progress, we poll each timeout. Yours was something
like FFFFFFD8, so you were 40 timeout periods into polling. Since the
controller was responding with a more 'negative' status each time, we
go on from FFFFFFFF towards 0.
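The countdown arithmetic can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the status field is read as a signed 32-bit count of elapsed timeout periods as described above, and the "Contr. timeout" value printed by SDA is taken to be in seconds with every period equally long, so the elapsed time is only a rough estimate.

```python
def timeout_periods(mscp_cmd_status):
    """Number of timeout periods implied by a negative MSCP Cmd status field."""
    value = int(mscp_cmd_status, 16)
    signed = value - (1 << 32) if value >= (1 << 31) else value
    return -signed if signed < 0 else 0


def polling_seconds(mscp_cmd_status, contr_timeout):
    """Rough elapsed time, assuming contr_timeout is in seconds per period."""
    return timeout_periods(mscp_cmd_status) * contr_timeout


print(timeout_periods("FFFFFFD8"))        # 40
print(timeout_periods("FFFFFF94"))        # 108
print(polling_seconds("FFFFFF94", 200))   # 21600
```

FFFFFFD8 is the approximate figure quoted above. FFFFFF94 and a Contr. timeout of 200 are the values printed for the secondary CDDB in the SDA output in .10.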
It isn't always a packack that is reported to be 'making progress'.
In your case it was a read, right? In any event, from that symptom
you've shown you do want to get VAXDRIV04_070 installed. The alternative
would have been to restart the controller that was reporting that it
was making progress on the command.
Mark d.
| |||||