| Title: | HSZ40 Product Conference |
| Moderator: | SSDEVO::EDMONDS |
| Created: | Mon Apr 11 1994 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 902 |
| Total number of notes: | 3319 |
Hi!
Can somebody have a look at this:
I experienced a system-down caused by a single error on a MIRRORED
disk on the HSZ40, which was followed by an AdvFS domain panic.
The customer asked me why a disk error causes a problem when he uses
mirrorsets. I don't know the answer.
Crossposted in the HSZ40 and ADVFS_SUPPORT notes-files.
Please look at the following description:
I have the following configuration:
An Alphaserver 8200 with a HSZ40 (dual redundant) connected to scsi5.
On the HSZ40 there is a unit D100 configured which is a stripeset.
This stripeset (STRIPE3) consists of 6 mirrorsets (MIRR31 - MIRR36).
Mirrorset MIRR32 consists of DISK210 and DISK110.
The unit D100 is the UNIX-device /dev/rza41c which is used by the
ADVFS domain named ora_dat1.
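As a cross-check on the device name: the Unit Number x0148 reported in the
error log entries appears to pack bus, target and LUN into one value. A
minimal sketch, assuming 3-bit target and LUN fields and the usual rz
numbering of 8 * bus + target (the function names are made up for this note):

```python
def decode_cam_unit(unit_number: int) -> dict:
    """Split a CAM unit number into bus, target and LUN.

    Assumption: LUN is in bits 0-2, target in bits 3-5, bus above that.
    """
    return {
        "bus": unit_number >> 6,
        "target": (unit_number >> 3) & 0x7,
        "lun": unit_number & 0x7,
    }


def rz_number(bus: int, target: int) -> int:
    """Digital UNIX rz unit number for a given bus and target."""
    return 8 * bus + target


fields = decode_cam_unit(0x0148)
print(fields)                                        # bus 5, target 1, LUN 0
print(rz_number(fields["bus"], fields["target"]))    # 41, as in rza41
```

With these assumptions x0148 gives bus 5, target 1, LUN 0, which agrees with
the "Bus Number 5., Target = 1., LUN = 0." lines in the log and with rza41.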
On 13th March the unit D100 logged the following error, which
can be decoded as a command timeout to DISK110:
******************************** ENTRY 4 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 10.
Timestamp of occurrence 13-MAR-1997 05:32:38
Host name sapfddi4
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x0000000D
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 199. CAM SCSI Event Type
------- Unit Info -------
Bus Number 5.
Unit Number x0148 Target = 1. <--- this is rza41c
LUN = 0. UNIT D100 on
------- CAM Data ------- the HSZ40
Class x00 Disk
Subsystem x00 Disk
Number of Packets 10.
------ Packet Type ------ 258. Module Name String
Routine Name cdisk_check_sense
------ Packet Type ------ 256. Generic String
Event - Unit Attention
------ Packet Type ------ 262. Info Error String
Error Type Information Message Detected (recovered)
------ Packet Type ------ 257. Device Name String
Device Name DEC HSZ4
------ Packet Type ------ 256. Generic String
Active CCB at time of error
------ Packet Type ------ 256. Generic String
CCB request completed with an error
------ Packet Type ------ 1. SCSIh I/O Request CCB(CCB_SCSIIO)
Packet Revision 37.
CCB Address xFFFFFC005D4B7B28
CCB Lengt x00C0
XPT Function Code x01 Execute requested SCSI I/O
Cam Status x84 CCB Request Completed WITH Error
Autosense Data Valid for Target
Path ID 5.
Target ID 1.
Target LUN 0.
Cam Flags x00000482 SIM Queue Actions are Enabled
Data Direction (10: DATA OUT)
Disable the SIM Queue Frozen State
*pdrv_ptr xFFFFFC005D4B7828
*next_ccb x0000000000000000
*req_map xFFFFFC007B13F400
void (*cam_cbfcnp)() xFFFFFC00004A5460
*data_ptr xFFFFFFFFC6428000
Data Transfer Length 16384.
*sense_ptr xFFFFFC005D4B7850
Auotsense Byte Length 160.
CDB Length 10.
Scatter/Gather Entry Cnt 0.
SCSI Status x02 Check Condition
Autosense Residue Length x00
Transfer Residue Length x00004000
(CDB) Command & Data Buf
15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
0000: 00000000 0000C037 B301002A * *...7... ...*
Timeout Value x0000003C
*msg_ptr x0000000000000000
Message Length 0.
Vendor Unique Flags x4000
Tag Queue Actions x20 Tag for Simple Queue
------ Packet Type ------ 256. Generic String
Error, exception, or abnormal condition
------ Packet Type ------ 256. Generic String
UNIT ATTENTION - Medium changed or target
reset
------ Packet Type ------ 768. SCSI Sense Data
Packet Revision 0.
------- HSZ Data -------
Instance Code x031A4002 Command timeout.
Component ID = Device Services.
Event Number = x0000001A
Repair Action = x00000040
NR Threshold = x00000002
Template Type x51 Disk Transfer Error.
Template Flags x00 HCE = 0, Event did not occur during Host
Command Execution.
Ctrl Serial # ZG60606525
Ctrl Software Revision V30Z
RAIDSET State x00 NORMAL. All members present and
reconstructed, IF LUN is configured as a
RAIDSET.
Error Count 1.
Retry Count 0.
Most Recent ASC xB0
Most Recent ASCQ x00
Next Most Recent ASC x00
Next Most Recent ASCQ x00
Device Locator x000101 Port = 1.
Target = 1.
LUN = 0. <--- DISK110
Drive Software Revision 0007
Drive Product Name RZ29B (C) DEC
Device Type x00 Direct Access Device.
Sense Data Qualifier x00 Buf Mode = 0, The target shall not
report GOOD Status on write
commands until the data
blocks are actually written
on the medium.
UWEUO = 0, not defined.
MSBD = 0, not defined.
FBW = 0, not defined.
IDSD = 0, Valid Device Sense Data
fields.
DSSD = 0, Device Sense Data fields
supplied by the controller.
-- Standard Sense Data --
Error Code x70 Current Error
Segment # x00
Information Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
Sense Key x06 Unit Attention
Additional Sense Length x98
CMD Specific Info Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
ASC & ASCQ xB000 ASC = x00B0
ASCQ = x0000
Command timeout.
FRU Code x00
Sense Key Specific Byte 0 x00 Sense Key Data NOT Valid
Byte 1 x00
Byte 2 x00
-- Device Sense Data --
Error Code x00 Error Code not decoded
Segment # x00
Information Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
Sense Key x04 Hardware Error
Additional Sense Length x00
CMD Specific Info Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
ASC & ASCQ xB000 ASC = x00B0
ASCQ = x0000
Command timeout.
FRU Code x00
Sense Key Specific Byte 0 x00 Sense Key Data NOT Valid
Byte 1 x00
Byte 2 x00
******************************** ENTRY 5 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 11.
Timestamp of occurrence 13-MAR-1997 05:32:40
Host name sapfddi4
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x0000000D
Event validity 1. O/S claims event is valid
Event severity 3. High Priority
Entry type 199. CAM SCSI Event Type
------- Unit Info -------
Bus Number 5.
Unit Number x0148 Target = 1.
LUN = 0.
------- CAM Data -------
Class x00 Disk
Subsystem x00 Disk
Number of Packets 4.
------ Packet Type ------ 258. Module Name String
Routine Name cdisk_reset_rec_err
------ Packet Type ------ 256. Generic String
Recovery failed
------ Packet Type ------ 260. Hardware Error String
Error Type Hard Error Detected
------ Packet Type ------ 257. Device Name String
Device Name DEC HSZ4
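The Instance Code x031A4002 in ENTRY 4 seems to be the four fields printed
below it packed into one 32-bit value. A minimal sketch, assuming one byte per
field in the order component / event number / repair action / NR threshold
(inferred from the decoded lines in the log, not from HSOF documentation):

```python
def decode_instance_code(code: int) -> dict:
    """Split a 32-bit HSZ instance code into its four one-byte fields.

    Assumption: the field order is inferred from the log output above.
    """
    return {
        "component_id": (code >> 24) & 0xFF,   # x03, shown as Device Services
        "event_number": (code >> 16) & 0xFF,   # x1A
        "repair_action": (code >> 8) & 0xFF,   # x40
        "nr_threshold": code & 0xFF,           # x02
    }


for name, value in decode_instance_code(0x031A4002).items():
    print(f"{name:14s} x{value:02X}")
```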
At the same time the domain "ora_dat1" panicked, and Oracle stopped.
These are the entries from /var/adm/messages:
Mar 13 05:32:40 sapfddi4 vmunix: advfs I/O error: setId 0x3171fd89.000554e0.ffff
fffe.0000 tag 0xfffffff7.0000u page 474
Mar 13 05:32:40 sapfddi4 vmunix: vd 1 blk 28522432 blkCnt 32
Mar 13 05:32:40 sapfddi4 vmunix: write error = 5
Mar 13 05:32:40 sapfddi4 vmunix:
Mar 13 05:32:40 sapfddi4 vmunix: bs_osf_complete: metadata write failed
Mar 13 05:32:40 sapfddi4 vmunix: AdvFS Domain Panic; Domain ora_dat1 Id 0x3171fd
89.000554e0
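The failing AdvFS write can be tied to the CCB in ENTRY 4 by decoding the CDB
dump. A minimal sketch, assuming the three hex words in the dump are bytes
11..00 of the buffer (the header shows four columns, only three are printed)
and that blocks are 512 bytes; only the opcode and LBA are read:

```python
# CDB dump words from ENTRY 4, highest byte offset first as printed.
dump_words = ["00000000", "0000C037", "B301002A"]

# Reverse the printed order so that byte 0 of the CDB comes first.
raw = bytes.fromhex("".join(dump_words))[::-1]
cdb = raw[:10]                       # CDB Length is 10 in the log

opcode = cdb[0]                      # 0x2A is the SCSI WRITE(10) opcode
lba = int.from_bytes(cdb[2:6], "big")
blocks = 16384 // 512                # Data Transfer Length / assumed block size

print(f"opcode x{opcode:02X}, LBA {lba}, {blocks} blocks")
```

Under these assumptions the LBA comes out as 28522432 and the transfer as 32
blocks, the same as "blk 28522432 blkCnt 32" in the AdvFS message, so the
timed-out command and the failed metadata write look like the same I/O.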
DISK110 was not marked as failed after this. I wonder why such a "soft" error
causes such a severe failure. The system is built with redundant controllers
and mirrored disks to prevent system-down situations in case of hardware
errors on disks or controller boards, but in this case that did not work.
Can anybody help me explain what really happened?
Thanks for any input.
Helmut
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 819.1 | not enough information | SSDEVO::RMCLEAN | | Thu Mar 20 1997 16:03 | 3 |
What version of HSOF software are you running, and at what patch level? The error logs don't tell us this, nor do they tell us what configuration you have.
| |||||
| 819.2 | Configuration HSZ40 | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Tue Apr 01 1997 02:33 | 119 |
Hi!
I'm sorry for the delay, I was out of the office last week.
The HSZ40 is connected to an AlphaServer 8200 via a KZPSA in a
DWLPA. The UNIX version was V3.2d-1 and has now been upgraded to 3.2G.
Here is the relevant part of the HSZ40 configuration:
HSZ03> sho this full
Controller:
HSZ40 ZG60606525 Firmware V30Z-2, Hardware B03
Configured for dual-redundancy with ZG60506190
In dual-redundant configuration
SCSI address 6
Time: 20-MAR-1997 17:07:08
Host port:
SCSI target(s) (1, 2, 3, 4), Preferred target(s) (1, 3)
TRANSFER_RATE_REQUESTED = 10MHZ
Cache:
32 megabyte write cache, version 2
Cache is GOOD
Battery is GOOD
Unflushed data in cache
CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
CACHE_POLICY = B
Host Functionality Mode = A
Licensing information:
RAID (RAID Option) is ENABLED, license key is VALID
WBCA (Writeback Cache Option) is ENABLED, license key is VALID
MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
Extended information:
Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
Operation control: 00000004 Security state code: 76193
Configuration backup enabled on 16 devices
HSZ03> sho unit
LUN Uses
--------------------------------------------------------------
D100 STRIPE3
D101 MIRR11
D200 STRIPE2
D300 STRIPE5
D400 STRIPE4
The affected unit was D100:
HSZ03> sho d100
LUN Uses
--------------------------------------------------------------
D100 STRIPE3
Switches:
RUN NOWRITE_PROTECT READ_CACHE
WRITEBACK_CACHE
MAXIMUM_CACHED_TRANSFER_SIZE = 1024
State:
ONLINE to this controller
Not reserved
PREFERRED_PATH = THIS_CONTROLLER
Size: 50265168 blocks
HSZ03> sho stripe3
Name Storageset Uses Used by
------------------------------------------------------------------------------
STRIPE3 stripeset MIRR31 D100
MIRR32
MIRR33
MIRR34
MIRR35
MIRR36
Switches:
CHUNKSIZE = 256 blocks
State:
NORMAL
MIRR31 (member 0) is NORMAL
MIRR32 (member 1) is NORMAL
MIRR33 (member 2) is NORMAL
MIRR34 (member 3) is NORMAL
MIRR35 (member 4) is NORMAL
MIRR36 (member 5) is NORMAL
Size: 50265168 blocks
HSZ03> sho mirr32
Name Storageset Uses Used by
------------------------------------------------------------------------------
MIRR32 mirrorset DISK110 STRIPE3
DISK210
Switches:
NOPOLICY (for replacement)
COPY (priority) = NORMAL
READ_SOURCE = LEAST_BUSY
MEMBERSHIP = 2, 2 members present
State:
NORMAL
DISK210 (member 0) is NORMAL
DISK110 (member 1) is NORMAL <--- disk with error
Size: 8377528 blocks
HSZ03> sho disk110
Name Type Port Targ Lun Used by
------------------------------------------------------------------------------
DISK110 disk 1 1 0 MIRR32
DEC RZ29B (C) DEC 0016
Switches:
NOTRANSPORTABLE
TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
Size: 8377528 blocks
Configuration being backed up on this container
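The sizes and chunk size above allow a rough check of which stripe member the
failing block lands on. A minimal sketch, assuming plain round-robin striping
by chunk across members 0..5 and that rza41c starts at block 0 of D100; the
real HSOF on-disk layout is not confirmed here, and the block number is the
one from the AdvFS message in the base note:

```python
CHUNK_BLOCKS = 256                       # CHUNKSIZE from "sho stripe3"
MEMBER_BLOCKS = 8377528                  # size of each mirrorset
MEMBERS = ["MIRR31", "MIRR32", "MIRR33", "MIRR34", "MIRR35", "MIRR36"]


def locate(lba: int, length: int):
    """Map a unit LBA to (member, offset in chunk, fits in one chunk).

    Assumption: chunk n lives on member n mod 6 (simple round-robin).
    """
    chunk, offset = divmod(lba, CHUNK_BLOCKS)
    member = MEMBERS[chunk % len(MEMBERS)]
    return member, offset, offset + length <= CHUNK_BLOCKS


# Six mirrorsets add up to the unit size reported for D100 and STRIPE3.
print(len(MEMBERS) * MEMBER_BLOCKS)      # 50265168
print(locate(28522432, 32))              # member, chunk offset, single-chunk flag
```

If the round-robin assumption holds, the 32-block write falls entirely inside
one chunk on MIRR32, which is the mirrorset containing DISK110.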
| |||||
| 819.3 | You need -3 patch | SSDEVO::RMCLEAN | | Tue Apr 01 1997 10:47 | 25 |
>> HSZ40 ZG60606525 Firmware V30Z-2, Hardware B03
You should be running V30Z-3. It corrects some problems in this area.
I. Patch Description:
This mirrorset repair/fast buffer problem may be encountered with HSOF
V3.0Z, V5.0Z and V5.0J. Mirroring (with or without striping) must be
in use on the controller. Data transfers greater than the value
specified in the controller parameter MAXIMUM_CACHED_TRANSFER_SIZE
must be taking place. The default parameter value is 32 blocks
(16KB). An unrecoverable error from a device must initiate a Mirror
repair.
When the above conditions take place, the controller improperly
de-allocates buffers, contaminating the Fast Buffer pool and the Cache
Buffer pool. Subsequently, when a mix of transfers greater than the
MAXIMUM_CACHED_TRANSFER_SIZE (using Fast buffers) and less than the
MAXIMUM_CACHED_TRANSFER_SIZE (using Cache Buffers) occurs, the
double-allocated buffers will be used and a data integrity problem is
stimulated.
| |||||
| 819.4 | OK - but....? | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Wed Apr 02 1997 02:19 | 14 |
Thanks!
I did not really understand the Patch Description, but I will
install the patch and hope this helps.
What I do not understand is whether this patch fixes two different
problems:
1.) transfer size > MAXIMUM_CACHED_TRANSFER_SIZE
2.) unrecoverable error from device must initiate mirror repair
Are these things independent of each other, or is there a
relationship between 1.) and 2.)?
Helmut
| |||||
| 819.5 | | KERNEL::LOANE | Comfortably numb!! | Wed Apr 02 1997 07:26 | 3 |
What it really says is that you are susceptible to the problem IF
ALL the points in the reply are valid, i.e. you have Mirror sets
.AND. you have errors .AND. .......etc
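Put another way, the patch description reads as a single conjunction. A
minimal sketch of that reading; the class and field names are made up for
illustration, the example inputs are hypothetical, and treating firmware
"V30Z-2" as HSOF "V3.0Z" is an assumption:

```python
from dataclasses import dataclass

AFFECTED_HSOF = {"V3.0Z", "V5.0Z", "V5.0J"}   # versions named in the patch text


@dataclass
class Exposure:
    hsof_version: str
    mirroring_in_use: bool
    transfer_blocks: int                 # size of a host transfer
    max_cached_transfer_blocks: int      # MAXIMUM_CACHED_TRANSFER_SIZE
    unrecoverable_error_started_repair: bool


def is_susceptible(e: Exposure) -> bool:
    """True only if every condition from the patch description holds."""
    return (
        e.hsof_version in AFFECTED_HSOF
        and e.mirroring_in_use
        and e.transfer_blocks > e.max_cached_transfer_blocks
        and e.unrecoverable_error_started_repair
    )


# Hypothetical 64-block transfer: exposed with the default of 32 blocks,
# not exposed once MAXIMUM_CACHED_TRANSFER_SIZE is raised to 1024.
print(is_susceptible(Exposure("V3.0Z", True, 64, 32, True)))     # True
print(is_susceptible(Exposure("V3.0Z", True, 64, 1024, True)))   # False
```

So 1.) and 2.) are not two separate problems; both have to be true at the
same time, together with the version and mirroring conditions.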
| |||||
| 819.6 | Yes, it's not sure | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Thu Apr 03 1997 07:11 | 6 |
My thoughts exactly. I fear it is not the same problem, because I think
we had already implemented the "early fix" (setting
MAXIMUM_CACHED_TRANSFER_SIZE to 1024 and changing the HSZ40 entry in
cam_data.c) when the error occurred.
Helmut
| |||||