Title: | AdvFS Support/Info/Questions Notefile |
Notice: | note 187 is Freq Asked Questions;note 7 is support policy |
Moderator: | DECWET::DADDAMIO |
|
Created: | Wed Jun 02 1993 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1077 |
Total number of notes: | 4417 |
1023.0. "mirrored HSZ40 disk -> advfs panic" by ATZIS1::PUTZENLECHNE (wherever is fun, there's always ALPHA) Thu Mar 20 1997 09:30
Hi!
Can somebody have a look at this issue:
I experienced a system-down because a single error on a MIRRORED
disk on the HSZ40 was followed by an AdvFS domain panic.
The customer asked me why a disk error causes a problem when he uses
mirrorsets. I don't know the answer.
Cross-posted in the HSZ40 and ADVFS_SUPPORT notes files.
Please look at the following description:
I have the following configuration:
An Alphaserver 8200 with a HSZ40 (dual redundant) connected to scsi5.
On the HSZ40 there is a unit D100 configured which is a stripeset.
This stripeset (STRIPE3) consists of 6 mirrorsets (MIRR31 - MIRR36).
Mirrorset MIRR32 consists of DISK210 and DISK110.
The unit D100 is the UNIX-device /dev/rza41c which is used by the
ADVFS domain named ora_dat1.
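As a side note on the device name: the number 41 in rza41c is consistent with the common Digital UNIX rz numbering of 8 * bus + target, which gives bus 5, target 1 for this unit. The following is only a minimal Python sketch of that arithmetic, assuming this numbering convention applies to this system; the function name is hypothetical.

```python
def rz_unit_to_bus_target(unit):
    """Split an rz unit number into (bus, target).

    Assumption: unit = 8 * bus + target, as in the usual rz naming scheme.
    """
    return divmod(unit, 8)


bus, target = rz_unit_to_bus_target(41)
print(bus, target)
```

The result agrees with the configuration described above (HSZ40 on scsi5) and with the Bus Number and Target fields in the error log entries that follow.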
On 13 March the unit D100 logged the following error, which
can be decoded as a command timeout to DISK110:
******************************** ENTRY 4 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 10.
Timestamp of occurrence 13-MAR-1997 05:32:38
Host name sapfddi4
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x0000000D
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 199. CAM SCSI Event Type
------- Unit Info -------
Bus Number 5.
Unit Number x0148 Target = 1. <--- this is rza41c, UNIT D100 on the HSZ40
LUN = 0.
------- CAM Data -------
Class x00 Disk
Subsystem x00 Disk
Number of Packets 10.
------ Packet Type ------ 258. Module Name String
Routine Name cdisk_check_sense
------ Packet Type ------ 256. Generic String
Event - Unit Attention
------ Packet Type ------ 262. Info Error String
Error Type Information Message Detected (recovered)
------ Packet Type ------ 257. Device Name String
Device Name DEC HSZ4
------ Packet Type ------ 256. Generic String
Active CCB at time of error
------ Packet Type ------ 256. Generic String
CCB request completed with an error
------ Packet Type ------ 1. SCSIh I/O Request CCB(CCB_SCSIIO)
Packet Revision 37.
CCB Address xFFFFFC005D4B7B28
CCB Lengt x00C0
XPT Function Code x01 Execute requested SCSI I/O
Cam Status x84 CCB Request Completed WITH Error
Autosense Data Valid for Target
Path ID 5.
Target ID 1.
Target LUN 0.
Cam Flags x00000482 SIM Queue Actions are Enabled
Data Direction (10: DATA OUT)
Disable the SIM Queue Frozen State
*pdrv_ptr xFFFFFC005D4B7828
*next_ccb x0000000000000000
*req_map xFFFFFC007B13F400
void (*cam_cbfcnp)() xFFFFFC00004A5460
*data_ptr xFFFFFFFFC6428000
Data Transfer Length 16384.
*sense_ptr xFFFFFC005D4B7850
Auotsense Byte Length 160.
CDB Length 10.
Scatter/Gather Entry Cnt 0.
SCSI Status x02 Check Condition
Autosense Residue Length x00
Transfer Residue Length x00004000
(CDB) Command & Data Buf
15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
0000: 00000000 0000C037 B301002A * *...7... ...*
Timeout Value x0000003C
*msg_ptr x0000000000000000
Message Length 0.
Vendor Unique Flags x4000
Tag Queue Actions x20 Tag for Simple Queue
------ Packet Type ------ 256. Generic String
Error, exception, or abnormal condition
------ Packet Type ------ 256. Generic String
UNIT ATTENTION - Medium changed or target
reset
------ Packet Type ------ 768. SCSI Sense Data
Packet Revision 0.
------- HSZ Data -------
Instance Code x031A4002 Command timeout.
Component ID = Device Services.
Event Number = x0000001A
Repair Action = x00000040
NR Threshold = x00000002
Template Type x51 Disk Transfer Error.
Template Flags x00 HCE = 0, Event did not occur during Host
Command Execution.
Ctrl Serial # ZG60606525
Ctrl Software Revision V30Z
RAIDSET State x00 NORMAL. All members present and
reconstructed, IF LUN is configured as a
RAIDSET.
Error Count 1.
Retry Count 0.
Most Recent ASC xB0
Most Recent ASCQ x00
Next Most Recent ASC x00
Next Most Recent ASCQ x00
Device Locator x000101 Port = 1.
Target = 1.
LUN = 0. <--- DISK110
Drive Software Revision 0007
Drive Product Name RZ29B (C) DEC
Device Type x00 Direct Access Device.
Sense Data Qualifier x00 Buf Mode = 0, The target shall not
report GOOD Status on write
commands until the data
blocks are actually written
on the medium.
UWEUO = 0, not defined.
MSBD = 0, not defined.
FBW = 0, not defined.
IDSD = 0, Valid Device Sense Data
fields.
DSSD = 0, Device Sense Data fields
supplied by the controller.
-- Standard Sense Data --
Error Code x70 Current Error
Segment # x00
Information Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
Sense Key x06 Unit Attention
Additional Sense Length x98
CMD Specific Info Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
ASC & ASCQ xB000 ASC = x00B0
ASCQ = x0000
Command timeout.
FRU Code x00
Sense Key Specific Byte 0 x00 Sense Key Data NOT Valid
Byte 1 x00
Byte 2 x00
-- Device Sense Data --
Error Code x00 Error Code not decoded
Segment # x00
Information Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
Sense Key x04 Hardware Error
Additional Sense Length x00
CMD Specific Info Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
ASC & ASCQ xB000 ASC = x00B0
ASCQ = x0000
Command timeout.
FRU Code x00
Sense Key Specific Byte 0 x00 Sense Key Data NOT Valid
Byte 1 x00
Byte 2 x00
******************************** ENTRY 5 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 11.
Timestamp of occurrence 13-MAR-1997 05:32:40
Host name sapfddi4
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x0000000D
Event validity 1. O/S claims event is valid
Event severity 3. High Priority
Entry type 199. CAM SCSI Event Type
------- Unit Info -------
Bus Number 5.
Unit Number x0148 Target = 1.
LUN = 0.
------- CAM Data -------
Class x00 Disk
Subsystem x00 Disk
Number of Packets 4.
------ Packet Type ------ 258. Module Name String
Routine Name cdisk_reset_rec_err
------ Packet Type ------ 256. Generic String
Recovery failed
------ Packet Type ------ 260. Hardware Error String
Error Type Hard Error Detected
------ Packet Type ------ 257. Device Name String
Device Name DEC HSZ4
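The Instance Code x031A4002 in ENTRY 4 appears to pack the four values listed beneath it into one 32-bit word: Event Number x1A, Repair Action x40 and NR Threshold x02 are its three low bytes, so the top byte (x03) presumably identifies the component (Device Services). A minimal Python sketch of that split, assuming this byte layout; the function and field names are hypothetical and the layout is inferred from the log entry itself, not from controller documentation.

```python
def decode_instance_code(code):
    """Split a 32-bit instance code into four byte fields (assumed layout)."""
    return {
        "component_id": (code >> 24) & 0xFF,   # assumed: x03 = Device Services
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


fields = decode_instance_code(0x031A4002)
for name, value in fields.items():
    print(f"{name} = x{value:02X}")
```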
At the same time the domain "ora_dat1" panicked, and Oracle stopped.
These are the entries from /var/adm/messages:
Mar 13 05:32:40 sapfddi4 vmunix: advfs I/O error: setId 0x3171fd89.000554e0.fffffffe.0000 tag 0xfffffff7.0000u page 474
Mar 13 05:32:40 sapfddi4 vmunix: vd 1 blk 28522432 blkCnt 32
Mar 13 05:32:40 sapfddi4 vmunix: write error = 5
Mar 13 05:32:40 sapfddi4 vmunix:
Mar 13 05:32:40 sapfddi4 vmunix: bs_osf_complete: metadata write failed
Mar 13 05:32:40 sapfddi4 vmunix: AdvFS Domain Panic; Domain ora_dat1 Id 0x3171fd89.000554e0
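The failed AdvFS write can be tied back to the SCSI command in ENTRY 4. Read from right to left, the CDB dump starts with bytes 2A 00 01 B3 37 C0; x2A is the SCSI WRITE(10) opcode, and bytes 2 to 5 hold the big-endian block address. A minimal Python sketch of that cross-check, assuming 512-byte blocks and that the dump's rightmost byte is CDB byte 0. Only the opcode and block address are decoded; the remaining CDB bytes are copied from the dump but not interpreted.

```python
import re

# CDB bytes 0..9 as read from the ENTRY 4 dump (rightmost dump byte = byte 0).
cdb = bytes.fromhex("2A0001B337C000000000")

opcode = cdb[0]                          # 0x2A is WRITE(10)
lba = int.from_bytes(cdb[2:6], "big")    # logical block address, big-endian

# One line from /var/adm/messages, copied from the excerpt above.
msg = "Mar 13 05:32:40 sapfddi4 vmunix: vd 1 blk 28522432 blkCnt 32"
match = re.search(r"vd (\d+) blk (\d+) blkCnt (\d+)", msg)
vd, blk, blk_cnt = (int(g) for g in match.groups())

BLOCK_SIZE = 512  # assumption: 512-byte disk blocks

print(hex(opcode), lba, blk, lba == blk)
print(blk_cnt * BLOCK_SIZE)
```

Under these assumptions the block address in the CDB equals the blk value reported by AdvFS, and blkCnt 32 times 512 bytes equals the Data Transfer Length of 16384 in the CCB, so the metadata write that failed is the same command that the HSZ40 answered with a Check Condition.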
DISK110 was not marked as failed after this. I wonder why such a "soft" error
causes such a severe failure. The system is built with redundant controllers
and mirrored disks to prevent system-down situations in case of hardware
errors in disks or controller boards, but in this case that did not work.
Can anybody help me explain what really happened?
Thanks for any input.
Helmut
T.R | Title | User | Personal Name | Date | Lines |
---|
1023.1 | Domain Panic | NETRIX::"[email protected]" | | Thu Mar 20 1997 14:28 | 19 |
| The HSZ40 notes experts can better answer the hardware question of
why the mirrored RAID disk set reported a problem.
AdvFS put its domain into a "domain panic" state when a metadata write
failed due to the controller reporting an error to AdvFS.
Domain panic really means that AdvFS will no longer issue IO requests to
the disk controller for a specific domain once an IO error is detected.
Beginning in V3.2, this state was added to avoid panicking the entire
system when AdvFS was unable to successfully write its metadata to disk.
You should be able to unmount all of the domain filesets after a domain
panic.
You should also be able to remount those filesets immediately, assuming
there is no fatal hardware problem. The system does not need to be
rebooted.
[Posted by WWW Notes gateway]
|