| Dump taken on 28-JAN-1995 08:07:04.66
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack
Version of system: VAX/VMS VERSION V5.5-2
VAXcluster node: UKAV06, a VAX 7000-630
CPU 00 reason for Bugcheck: INVEXCEPTN, Exception while above ASTDEL or on
inter
rupt stack
Process currently executing on this CPU: None
Current IPL: 8 (decimal)
CPU database address: 90208000
MPB address: 83696A20
Spinlocks currently owned by CPU 00
IOLOCK8 Address 8142F2F0
Owner CPU ID 00 IPL 08
Ownership Depth 0001 Rank 14
CPUs Waiting 0000 Index 34
CPU 01 reason for Bugcheck: CPUEXIT, Shutdown requested by another CPU
Process currently executing on this CPU: None
Current IPL: 31 (decimal)
CPU database address: 8136A000
No spinlocks currently owned by CPU 01
CPU 02 reason for Bugcheck: CPUEXIT, Shutdown requested by another CPU
Process currently executing on this CPU: OPCOM
Current image file: DSA100:[SYS6.SYSCOMMON.][SYSEXE]OPCOM.EXE;2
Current IPL: 31 (decimal)
CPU database address: 81368000
Spinlocks currently owned by CPU 02
POOL Address 8142EEE0
Owner CPU ID 02 IPL 0B
Ownership Depth 0001 Rank 07
CPUs Waiting 0000 Index 27
SP => 90209DB0 00000004
90209DB4 7FF0FBB4
90209DB8 FFFFFFFD
90209DBC 00000000
90209DC0 00002400 UCB$M_TEMPLATE+00400
90209DC4 00000001
90209DC8 00000003
90209DCC 0000044C %SYSTEM-F-RADRMOD, reserved addrs flt
90209DD0 837D54D8 STDRIVER+017C8
90209DD4 04080008
90209DD8 00000004
90209DDC 81404BEF IO_ROUTINES+013EF
90209DE0 00000001
90209DE4 00000000
IOC$IOPOST+0002B: PUSHL R1
IOC$IOPOST+0002D: MOVZBL #00,R0
IOC$IOPOST+00030: JSB IOC$SIMREQCOM+000C5
IOC$IOPOST+00036: MOVL (SP)+,R1
IOC$IOPOST+00039: JSB (R1)
IOC$IOPOST+0003B: cant easily find where this went to...
code leading to failure PC
STDRIVER+017B9: MOVL R5,R3
STDRIVER+017BC: BICW2 #10,002A(R3)
STDRIVER+017C1: MOVL 004C(R3),002C(R3)
%SDA-E-NOINSTRAN, cannot translate instruction
SDA> ex STDRIVER+017C0;20
000700DC 0006034C 002CC300 4CC3D000 837D54D0
^^^^^^^^ ^^^^^^ MOVL 004C(R3),002C(R3)
^^^^^^^^ What the heck is this ????
A4D05400 10C3D053 55D00000 00070134 837D54E0
^^ ^^^^ MOVL R5,R3
instructions resume at :
STDRIVER+017D6: MOVL R5,R3
STDRIVER+017D9: MOVL 0010(R3),R4
It looks as though something trod on this bit of memory, what did it ??
No match in stars articles.
My medler processes hung before I could get more info.
|
|
I originaly hibernated this call as a on off, but it happened again as call
41010 so had to have another go...
Iwas misled, when told that the memory contents were corrupt, checking on the
running system and using patch this code is in the image as well, including
the latest version 2.0 of striping.
Heres what I found, put it on the notes file and seems it was a know problem
that none of us found!
Told keith allen and nichole this is fixed in 2.1 and recommened customer
upgrade in the short term, in the long term migrate to raid.
--------------------------------------------------------------------------------
<<< COOKIE::DISK$SYSTEM_3:[NOTES$LIBRARY]STRIPING.NOTE;1 >>>
-< >-
================================================================================
Note 488.0 crash @stdriver+17c8 - other node shutdown 4 replies
COMICS::GLEDHILL 168 lines 16-MAR-1995 03:39
--------------------------------------------------------------------------------
Has anyone seen this, crash in stdriver v2.0-5 (but 2.0-7 is the same image with
the same code in it - also checksums the same), just after shutting down
another node in the cluster. The system is post processing a read request with
an end action routine in stdriver, this code doesn't seem to end, it just runs
into data which causes the system to crash with an addressing fault.
This is NOT due to corruption of the image or the loaded driver in memory - I
reinstalled the driver from cscpat_1048013 and the same code was in the file.
This has happened twice under the same cirumstances on a 7630 5.5-2. In both
the irp had iost=2234 (incomplete segmented request posted) and irp$l_ucb
points to a dsa device (The stripe sets here consist of shadow sets).
I guess this must be an obscure bit of code that doesn't get called often,
(maybe when this incomplete request occurs?). I haven't found the sources so
can't see what it is meant to do.
Is this problem known about, fixed in 2.1 maybe. I have two dumps available
if anyone wants to look at them, will ipmt shortly if required.
Here are some gory details. Note that in both dumps there is some corruption
(lock and/or pool) Don't know if this is relevant, may be due to what was
going on the other cpus at the time (in lock code and pool deallocation
routines).
Thanks Dave Gledhill.
----------------------
Current operating stack (INTERRUPT):
SP => 9303DDB0 00000004
9303DDB4 7FE3BC94
9303DDB8 FFFFFFFD
9303DDBC 00000000
9303DDC0 00000800 CPU$M_VIRTCONS
9303DDC4 00000001
9303DDC8 00000003
9303DDCC 0000044C (reserved addr fault)
9303DDD0 83864A38 STDRIVER+017C8 <- failing pc
9303DDD4 04080008
9303DDD8 00000004
9303DDDC 81472BEF IOC$IOPOST+00067 (jmps to irp$l_pid)
9303DDE0 00000001
9303DDE4 00000000
9303DDE8 00000003
9303DDEC 9303C000
9303DDF0 83CD56B0
9303DDF4 9668C000
9303DDF8 8147ACDF SCH$RESCHED+000DF
9303DDFC 04C30001
TDA> ex/inst stdriver + 17a0;40
STDRIVER+017A0: BRW STDRIVER+017D6
STDRIVER+017A3: BBC #04,002A(R5),STDRIVER+0178A
STDRIVER+017A9: TSTW 0046(R5)
STDRIVER+017AD: BEQL STDRIVER+017B5
STDRIVER+017AF: CLRL 003A(R5)
STDRIVER+017B3: BRB STDRIVER+017B9
STDRIVER+017B5: CLRW 003A(R5)
STDRIVER+017B9: MOVL R5,R3
STDRIVER+017BC: BICW2 #10,002A(R3)
STDRIVER+017C1: MOVL 004C(R3),002C(R3)
(no more code for a while now! cant translate 17c8)
TDA> form @r5 (irp being post processed I think.)
8390C518 IRP$L_IOQFL FFFDFB78
8390C51C IRP$L_IOQBL FC6F7F08
8390C520 IRP$W_SIZE 00A8
8390C522 IRP$B_TYPE 0A
8390C523 IRP$B_RMOD 41
8390C524 IRP$L_PID 8390C510
(this is the middle of pool that seems to be related to striping and contains
8390C510: JMP @#STDRIVER+015A3
This code can fall through to the failing pc (there is one rsb, but a branch
around it so it seems possible to get to the failing address from 15a3).
8390C528 IRP$L_AST 838FE0D0
IRP$L_SHD_IOFL
8390C52C IRP$L_ASTPRM 007DCE00
IRP$L_HRB
IRP$L_SHAD
8390C530 IRP$L_MIRP 83CD5EF0
IRP$L_WIND
8390C534 IRP$L_UCB 8381E570
8390C538 IRP$W_FUNC 000C -> readpblk
8390C53A IRP$B_CLN_INDX 1D
IRP$B_EFN
8390C53B IRP$B_PRI 1C
IRP$B_SHD_FLAGS
8390C53C IRP$L_CLN_WLE 007DCE0C
IRP$L_IOSB
8390C540 IRP$W_CHAN FD10
8390C542 IRP$W_STS 0002
8390C544 IRP$L_SVAPTE 95DDDD98
8390C548 IRP$W_BOFF 0000
8390C54A IRP$L_BCNT 0800
IRP$L_DCD_BLK_COUNT
IRP$W_BCNT
8390C54C 0000
8390C54E IRP$W_STS2 0000
8390C550 IRP$L_IOST1 00002234 <-
IRP$L_MEDIA
8390C554 IRP$B_CARCON 00
IRP$L_IOST2
IRP$L_TT_TERM
8390C555 000200
8390C558 IRP$L_ABCNT 0000000B
IRP$Q_NT_PRVMSK
IRP$Q_STATION
IRP$Q_TT_STATE
IRP$W_ABCNT
8390C55C IRP$L_OBCNT 00000800
IRP$W_OBCNT
8390C560 IRP$L_SEGVBN 000B8189
8390C564 IRP$L_DIAGBUF 95DDDD98
IRP$L_SCB_BUF
IRP$W_TT_PRMPT
8390C568 IRP$L_DCD_SRC_UCB 270CD734
IRP$L_SEQNUM
8390C56C IRP$L_EXTEND 00000000
8390C570 IRP$L_ARB 83CC2C58
IRP$L_SHDSPC
8390C574 IRP$B_CPY_MODE 00
IRP$L_KEYDESC
IRP$L_WLE_PTR
8390C575 000000
8390C578 IRP$L_FQFL 8378D780
8390C57C IRP$L_FQBL 8965A280
8390C580 IRP$W_CDRPSIZE FFA0
8390C582 IRP$B_CD_TYPE 39
8390C583 IRP$B_FIPL 34
IRP$B_FLCK
8390C584 IRP$L_FPC 8473ADE5 DUDRIVER+015F5
8390C588 IRP$L_FR3 814A02F0 SCS$GA_LOCALSB+00110
8390C58C IRP$L_FR4 83780BA0
8390C590 IRP$L_SAVD_RTN 8473ADAE DUDRIVER+015BE
8390C594 IRP$L_MSG_BUF FFFF0000 PDT$M_BI_IDR
IRP$L_SHD_LOCK_FR4
8390C598 IRP$L_RSPID 00000000
IRP$L_SHD_LOCK_FR5
8390C59C IRP$L_CDT 8371D2C0
IRP$L_SHD_LOCK_FR0
8390C5A0 IRP$L_RWCPTR 814A035A SCS$GA_LOCALSB+0017A
8390C5A4 IRP$L_ERASE_VBN 892C2580
IRP$L_LBUFH_AD
IRP$L_SHD_PIO_LNK
8390C5A8 IRP$B_SHD_PIO_CNT 00
IRP$B_SUBCMD_STS
IRP$L_LBOFF
IRP$T_LBUFHNDL
8390C5A9 IRP$B_SHD_PIO_ACT 00
8390C5AA IRP$B_SHD_PIO_FLAGS 00
8390C5AB IRP$B_SHD_PIO_ERRCNT 00
8390C5AC IRP$L_CDRPFL 00000000
IRP$L_RBUFH_AD
IRP$L_SHD_PIO_ERROR
8390C5B0 IRP$B_SHD_PIO_ERRINDEX 37
IRP$L_RBOFF
8390C5B1 IRP$B_SHD_PIO_ERRSEV 03
8390C5B2 009D
8390C5B4 IRP$L_SHD_LOCK_FPC 000D0001
IRP$L_UBARSRCE
IRP$L_XCT_LEN
8390C5B8 IRP$L_DUTUFLAGS 00000000
IRP$L_SHD_LOCK_FR1
8390C5BC IRP$L_SHD_LOCK_FR2 00240000
IRP$W_DUTUCNTR
IRP$W_ENDMSGSIZ
IRP$C_LENGTH
================================================================================
Note 488.1 crash @stdriver+17c8 - other node shutdown 1 of 4
CSC32::BARGER 278 lines 16-MAR-1995 11:07
-< fixed in 2.1 >-
--------------------------------------------------------------------------------
Dave,
I think this article explains what your seeing - fixed in 2.1.
***************************************************************
Stdriver may cause bugcheck executing NON-istream data
******************** CAUTION: FOR INTERNAL USE ONLY
*********************
*
*
* THIS INFORMATION IS FOR USE BY DIGITAL EQUIPMENT CORP. AND ITS
*
* EMPLOYEES ONLY. PLEASE USE EXTREME CARE IF YOU MUST DISCUSS ANY
*
* PART OF THIS INFORMATION WITH ANYONE WHO IS NOT A DIGITAL
EMPLOYEE. *
*
*
******************************************************************************
BUGCHECK = ~
VMS VERSION = ~
CPU = ~
SID = ~
PROCESS = ~
IMAGE = ~
MODULE OF CRASH = STDRIVER
MODULE OFFSET = 000017C1 000017C8
FAILED INSTRUCTION = ~
CONDITION_CODE = ~
VIRTUAL ADDRESS = ~
REASON MASK = ~
SIGNAL_ARRAY = ~
PC = ~
PSL = ~
DESCRIPTION:
A reserved addressing mode or other fault may occur
in STDRIVER due to 3 instructions being commented
out of the driver. Basically we get down a code
path that has no exit. We then crash thying to
execute a data area for the next module.
TECHNIQUE(s) for confirmation:
SDA> sho stack
Current operating stack (INTERRUPT):
88E33DA8 80BBC442 EXE$EXCEPTION+00047
88E33DAC 04080009
88E33DB0 00000004
88E33DB4 7FDE9F54
88E33DB8 FFFFFFFD
88E33DBC 00000000
88E33DC0 00002200 UCB$M_TEMPLATE+00200
88E33DC4 00000001
88E33DC8 00000003
88E33DCC 0000044C BUG$_WSLENOVAL+00004 Reserved addressing mode
fault
88E33DD0 8216F398 STDRIVER+017C8 fail "PC"
88E33DD4 04080008
88E33DD8 00000004
88E33DDC 80C249EF IOC$IOPOST+00067 Routine that called stdriver
88E33DE0 00000001
88E33DE4 00000001
88E33DE8 00000007
88E33DEC 88E32000
88E33DF0 821DD700
88E33DF4 8AEEB400
88E33DF8 80C2C6DF SCH$RESCHED+000DF
88E33DFC 04C30001
SDA> e/i 8216F398
%SDA-E-NOINSTRAN, cannot translate instruction
e/i 8216F398-80;120
STDRIVER+01768: HALT
STDRIVER+01769: JSB @#V_IOC$QNXTSEG
STDRIVER+0176F: BLBC @#SMP$GL_FLAGS,STDRIVER+01786
STDRIVER+01776: PUSHL R0
STDRIVER+01778: MOVZBL 000B(R5),R0
STDRIVER+0177D: JSB @#V_SMP$RESTORE
STDRIVER+01783: MOVL (SP)+,R0
STDRIVER+01786: MTPR (SP)+,#12
STDRIVER+01789: RSB
STDRIVER+0178A: MOVL 0044(R5),R1
STDRIVER+0178F: MOVL 004C(R5),R3
STDRIVER+01794: BNEQ STDRIVER+0179B
STDRIVER+01796: MOVL 002C(R5),R3
STDRIVER+0179B: MOVL R3,002C(R5)
STDRIVER+017A0: BRW STDRIVER+017D6
STDRIVER+017A3: BBC #04,002A(R5),STDRIVER+0178A
STDRIVER+017A9: TSTW 0046(R5)
STDRIVER+017AD: BEQL STDRIVER+017B5
STDRIVER+017AF: CLRL 003A(R5)
STDRIVER+017B3: BRB STDRIVER+017B9
STDRIVER+017B5: CLRW 003A(R5)
STDRIVER+017B9: MOVL R5,R3
STDRIVER+017BC: BICW2 #10,002A(R3)
STDRIVER+017C1: MOVL 004C(R3),002C(R3) Last good instruction,
Refernce source listing later.
%SDA-E-NOINSTRAN, cannot translate instruction
Dump the area of code around the failure;
SDA> ex stdriver+17a0;80
GOOD CODE UNTIL,,,
:: :: ::
D4061300 46C5B5E1 002AC504 E1003331 13. . *. F... 8216F370
2AC310AA 5355D000 3AC5B404 11003AC5 :... :. US . * 8216F380
-----------------------------------
000700DC 0006034C 002CC300 4CC3D000 . L. ,.L... ... 8216F390
A4D05400 10C3D053 55D00000 00070134 4..... US ..T 8216F3A0
-----------------------------------
GOOD CODE AFTER,,,
A4003AC3 C0000CC3 D451001C C3D0552C ,U ..Q .. :. 8216F3B0
05136000 38C3B11F 1360B550 D4AFDE66 f P `.. 8.`.. 8216F3C0
D06E50D0 5E10C250 02A03CF0 115004C0 .P. <..P .^ Pn 8216F3D0
38C3B019 1264A4B5 5E10C00C D630505E ^P0 . .^ d.. 8 8216F3E0
The source listings follow. It is interesting in that
the last MOVL instruction is indeed that last
instruction in that code thread. The garbage that
followed is actually definition area for the next
routine.
We basically were walking thru the istream
and kept going into the data area!
Routine we were in when we crashed...
16B3 3700 ; I/O operation ended with an unsuccessful status
16B3 3701 ;
16B3 3702 ; If the request is logical I/O, branch back to
unlock. (70$)
16B3 3703 ;
16B3 3704 ; If the device is a sequential device, then the I/O
packet is
16B3 3705 ; merely sent to the ACP for notification of the
error.
16B3 3706 ;
16B3 3707 ; If the device is a random device, then the virtual
block numb
er
16B3 3708 ; stored in IRP$L_SEGVBN is the block that has an
error.
16B3 3709 ;
16B3 3710
16B3 3711 90$: BBC #IRP$V_VIRTUAL, -
16B9 3712 W^IRP$W_STS(R5), 70$ ; branch if logical
I/O
16B9 3713 TSTW W^IRP$L_OBCNT+2(R5) ; see if byte count
> 64k
16BD 3714 BEQL 100$ ; EQL implies < 64k.
16BF 3715 CLRL W^IRP$L_IOST1+2(R5) ; zero byte count
before recy
cleing irp
16C3 3716 BRB 110$ ; branch around
16C5 3717 100$: CLRW W^IRP$L_IOST1+2(R5) ; zero byte count
before recy
cleing irp
16C9 3718 110$: MOVL R5, R3 ; copy IRP address
16CC 3719 BICW #IRP$M_VIRTUAL, W^IRP$W_STS(R3) ; clear
virtual I/O f
lag
16D1 3720 MOVL W^IRP$L_DIAGBUF(R3), W^IRP$L_SVAPTE(R3) ;
reset page
table address
^
| *** last movl that works.
16D8 3721 ;<--- MOVL W^IRP$L_OBCNT(R3), R2 ; get original byte
count
16D8 3722 ;<--- BSBW IOC$QTOACP ; queue packet to
ACP
16D8 3723 ;<--- BRW IOPOST
16D8 3724 ^
|
EXIT PATH HAS BEEN COMMENTED OUT!!
NEXT ROUTINE
STRIPE V2.0-002 st_reqcom_normal
18-APR-
1991 16:38:33 [STRIPE.V2.BLD]STDRIVER.MAR;1 (65)
16D8 3726 .sbttl st_reqcom_normal
16D8 3727 .DSABL LSB
16D8 3728 ;st_REQCOM_NORMAL
16D8 3729 ;
16D8 3730 ; Collect all the the pieces of the io before return
to
16D8 3731 ; the user, maximize the error severity,
16D8 3732 ; log errors for ss$_ivbuflen and ss$_ivmedia
16D8 3733 ;
16D8 3734 ; If this is a buffered io(read) we must return the
data to th
e users
16D8 3735 ; buffer. we keep a set of sptes
(ucb$l_ix_svapte/addr) availab
le for
16D8 3736 ; this purpose.
16D8 3737 ;
16D8 3738 ; Then dealloate any buffers used for the buffered
transfer (no
npaged pool)
16D8 3739 ; Finally check for any stalled io due to shortages of
memory.
(chk_memq)
16D8 3740 ; or lack of strps
16D8 3741 ;
16D8 3742 ; table to log errors for the stripe driver the 1st
word is the
16D8 3743 ; vms error status the second word is the stripe
equivalent
16D8 3744 ; which gets decode by st$errlog(striping error log
formater)
16D8 3745 log_codes:
\034C 16D8 3746 .word
ss$_ivbuflen
\0006 16DA 3747 .word
stlog$k_ivbuflen
this should look \00DC 16DC 3748 .word
ss$_illblknum
like the area of \0007 16DE 3749 .word
stlog$k_ivmedia
memory following \0134 16E0 3750 .word
ss$_ivaddr
the last MOVL \0007 16E2 3751 .word
stlog$k_ivmedia
that worked. \0000 16E4 3752 .word 0
16E6 3753
16E6 3754 ST_REQCOM_NORMAL:
16E6 3755 MOVL R5, R3 ; Put IRP
address in
usual reg
16E9 3756 MOVL W^IRP$L_AST(R3), R4 ; Set
address of the
STRP for
16EE 3757 ; this IRP
16EE 3758 MOVL STRP$L_UCB(R4), R5 ; Set
address of the
ST UCB
16F2 3759 MOVL W^IRP$L_UCB(R3), R1 ; Point to
member's U
CB
16F7 3760 CLRL W^IRP$L_PID(R3) ; Remember
this PIRP
completed
16FB 3761 ADDL W^IRP$L_IOST1+2(R3), - ; Accumulate
actual b
ytes
SOLUTION:
Stripe V2.1
REFERENCE:
This problem was CLD'd CXO09474
\\ PROD=OPENVMS-VAX SPD=25.01 CAT=OPSYS GRP=OPENVMS-VAX OS=OPENVMS-VAX
SOURCE=CA
NASTA
\\ STDRIVER
\\ 000017C1 000017C8
================================================================================
Note 488.2 crash @stdriver+17c8 - other node shutdown 2 of 4
COMICS::GLEDHILL 1 line 16-MAR-1995 11:21
-< What database is that in? >-
--------------------------------------------------------------------------------
================================================================================
Note 488.3 crash @stdriver+17c8 - other node shutdown 3 of 4
CSC32::BARGER 12 lines 16-MAR-1995 16:33
-< STARS - OPERATING_SYSTE >-
--------------------------------------------------------------------------------
I found it both in CANASTA and STARS (OPERATING_SYSTE) database.
I suspect you searched on STDRIVER+17C8 instead of
STDRIVER+017C8??
Doesn't seem to be a standard, but if people use the log file, it
usually uses 5 character module offsets...
I've been bit by this before.
Craig
|