Title: | HSZ40 Product Conference |
Moderator: | SSDEVO::EDMONDS |
Created: | Mon Apr 11 1994 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 902 |
Total number of notes: | 3319 |
Hi! Can somebody have a look at this: I experienced a system-down because of a single error on a MIRRORED disk on the HSZ40, which was followed by an AdvFS domain panic. The customer asked me why a disk error causes a problem when he uses mirrorsets. I don't know the answer.

(Crossposted in the HSZ40 notes-file and the ADVFS_SUPPORT notes-file.)

Please look at the following description. I have the following configuration: an AlphaServer 8200 with a HSZ40 (dual redundant) connected to scsi5. On the HSZ40 there is a unit D100 configured, which is a stripeset. This stripeset (STRIPE3) consists of 6 mirrorsets (MIRR31 - MIRR36). Mirrorset MIRR32 consists of DISK210 and DISK110. The unit D100 is the UNIX device /dev/rza41c, which is used by the AdvFS domain named ora_dat1.

On 13th March the unit D100 logged the following error, which can be decoded as a command timeout to DISK110:

******************************** ENTRY 4 ********************************

Logging OS                  2. Digital UNIX
System Architecture         2. Alpha
Event sequence number       10.
Timestamp of occurrence     13-MAR-1997 05:32:38
Host name                   sapfddi4

System type register        x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)      x00000002
CPU logging event (mperr)   x0000000D

Event validity              1. O/S claims event is valid
Event severity              5. Low Priority
Entry type                  199. CAM SCSI Event Type

------- Unit Info -------
Bus Number                  5.
Unit Number                 x0148  Target = 1.   <--- this is rza41c (UNIT D100 on the HSZ40)
                                   LUN = 0.

------- CAM Data -------
Class                       x00  Disk
Subsystem                   x00  Disk
Number of Packets           10.

------ Packet Type ------   258. Module Name String
Routine Name                cdisk_check_sense

------ Packet Type ------   256. Generic String
                            Event - Unit Attention

------ Packet Type ------   262. Info Error String
Error Type                  Information Message Detected (recovered)

------ Packet Type ------   257. Device Name String
Device Name                 DEC HSZ4

------ Packet Type ------   256. Generic String
                            Active CCB at time of error

------ Packet Type ------   256. Generic String
                            CCB request completed with an error

------ Packet Type ------   1. SCSIh I/O Request CCB(CCB_SCSIIO)
Packet Revision             37.
CCB Address                 xFFFFFC005D4B7B28
CCB Lengt                   x00C0
XPT Function Code           x01  Execute requested SCSI I/O
Cam Status                  x84  CCB Request Completed WITH Error
                                 Autosense Data Valid for Target
Path ID                     5.
Target ID                   1.
Target LUN                  0.
Cam Flags                   x00000482  SIM Queue Actions are Enabled
                                       Data Direction (10: DATA OUT)
                                       Disable the SIM Queue Frozen State
*pdrv_ptr                   xFFFFFC005D4B7828
*next_ccb                   x0000000000000000
*req_map                    xFFFFFC007B13F400
void (*cam_cbfcnp)()        xFFFFFC00004A5460
*data_ptr                   xFFFFFFFFC6428000
Data Transfer Length        16384.
*sense_ptr                  xFFFFFC005D4B7850
Auotsense Byte Length       160.
CDB Length                  10.
Scatter/Gather Entry Cnt    0.
SCSI Status                 x02  Check Condition
Autosense Residue Length    x00
Transfer Residue Length     x00004000
(CDB) Command & Data Buf
          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
   0000:            00000000  0000C037  B301002A   * *...7... ...*
Timeout Value               x0000003C
*msg_ptr                    x0000000000000000
Message Length              0.
Vendor Unique Flags         x4000
Tag Queue Actions           x20  Tag for Simple Queue

------ Packet Type ------   256. Generic String
                            Error, exception, or abnormal condition

------ Packet Type ------   256. Generic String
                            UNIT ATTENTION - Medium changed or target reset

------ Packet Type ------   768. SCSI Sense Data
Packet Revision             0.

------- HSZ Data -------
Instance Code               x031A4002  Command timeout.
                                       Component ID = Device Services.
                                       Event Number = x0000001A
                                       Repair Action = x00000040
                                       NR Threshold = x00000002
Template Type               x51  Disk Transfer Error.
Template Flags              x00  HCE = 0, Event did not occur during Host Command Execution.
Ctrl Serial #               ZG60606525
Ctrl Software Revision      V30Z
RAIDSET State               x00  NORMAL. All members present and reconstructed, IF LUN is configured as a RAIDSET.
Error Count                 1.
Retry Count                 0.
Most Recent ASC             xB0
Most Recent ASCQ            x00
Next Most Recent ASC        x00
Next Most Recent ASCQ       x00
Device Locator              x000101  Port = 1.
                                     Target = 1.
                                     LUN = 0.   <--- DISK110
Drive Software Revision     0007
Drive Product Name          RZ29B (C) DEC
Device Type                 x00  Direct Access Device.
Sense Data Qualifier        x00  Buf Mode = 0, The target shall not report GOOD Status on write commands until the data blocks are actually written on the medium.
                                 UWEUO = 0, not defined.
                                 MSBD = 0, not defined.
                                 FBW = 0, not defined.
                                 IDSD = 0, Valid Device Sense Data fields.
                                 DSSD = 0, Device Sense Data fields supplied by the controller.

-- Standard Sense Data --
Error Code                  x70  Current Error
Segment #                   x00
Information Byte 3          x00
            Byte 2          x00
            Byte 1          x00
            Byte 0          x00
Sense Key                   x06  Unit Attention
Additional Sense Length     x98
CMD Specific Info Byte 3    x00
                  Byte 2    x00
                  Byte 1    x00
                  Byte 0    x00
ASC & ASCQ                  xB000  ASC = x00B0
                                   ASCQ = x0000
                                   Command timeout.
FRU Code                    x00
Sense Key Specific Byte 0   x00  Sense Key Data NOT Valid
                   Byte 1   x00
                   Byte 2   x00

-- Device Sense Data --
Error Code                  x00  Error Code not decoded
Segment #                   x00
Information Byte 3          x00
            Byte 2          x00
            Byte 1          x00
            Byte 0          x00
Sense Key                   x04  Hardware Error
Additional Sense Length     x00
CMD Specific Info Byte 3    x00
                  Byte 2    x00
                  Byte 1    x00
                  Byte 0    x00
ASC & ASCQ                  xB000  ASC = x00B0
                                   ASCQ = x0000
                                   Command timeout.
FRU Code                    x00
Sense Key Specific Byte 0   x00  Sense Key Data NOT Valid
                   Byte 1   x00
                   Byte 2   x00

******************************** ENTRY 5 ********************************

Logging OS                  2. Digital UNIX
System Architecture         2. Alpha
Event sequence number       11.
Timestamp of occurrence     13-MAR-1997 05:32:40
Host name                   sapfddi4

System type register        x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)      x00000002
CPU logging event (mperr)   x0000000D

Event validity              1. O/S claims event is valid
Event severity              3. High Priority
Entry type                  199. CAM SCSI Event Type

------- Unit Info -------
Bus Number                  5.
Unit Number                 x0148  Target = 1.
                                   LUN = 0.

------- CAM Data -------
Class                       x00  Disk
Subsystem                   x00  Disk
Number of Packets           4.

------ Packet Type ------   258. Module Name String
Routine Name                cdisk_reset_rec_err

------ Packet Type ------   256. Generic String
                            Recovery failed

------ Packet Type ------   260. Hardware Error String
Error Type                  Hard Error Detected

------ Packet Type ------   257. Device Name String
Device Name                 DEC HSZ4

At the same time the domain "ora_dat1" panicked, and Oracle stopped. These are the entries from /var/adm/messages:

Mar 13 05:32:40 sapfddi4 vmunix: advfs I/O error: setId 0x3171fd89.000554e0.fffffffe.0000 tag 0xfffffff7.0000u page 474
Mar 13 05:32:40 sapfddi4 vmunix:     vd 1 blk 28522432 blkCnt 32
Mar 13 05:32:40 sapfddi4 vmunix:     write error = 5
Mar 13 05:32:40 sapfddi4 vmunix:
Mar 13 05:32:40 sapfddi4 vmunix: bs_osf_complete: metadata write failed
Mar 13 05:32:40 sapfddi4 vmunix: AdvFS Domain Panic; Domain ora_dat1 Id 0x3171fd89.000554e0

DISK110 was not marked as failed after this. I wonder why such a "soft" error causes such a heavy failure. The system is built with redundant controllers and mirrored disks to prevent system-down situations in case of hardware errors on disks or controller boards, but in this case that did not work.

Can anybody help me explain what really happened? Thanks for every input.

Helmut
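The two log sources above can be cross-checked by hand. The following is a minimal Python sketch, not part of the original note. It assumes the CDB dump prints 32-bit little-endian words with the lowest-addressed bytes in the rightmost column (as the ":Byte Order" header suggests), that the CDB follows the standard SCSI 10-byte read/write layout, and that the HSZ instance code splits into four bytes in the order the decoded fields in the log imply. The function names are illustrative only.

```python
import struct

# Words copied from the CDB dump in ENTRY 4, listed from the rightmost
# column (03--<-00) to the left, i.e. in increasing byte-address order.
CDB_WORDS = [0xB301002A, 0x0000C037, 0x00000000]
CDB_LENGTH = 10  # "CDB Length 10." in the CCB packet


def cdb_bytes(words, length):
    """Rebuild the CDB byte string from the dumped 32-bit words."""
    raw = b"".join(struct.pack("<I", w) for w in words)
    return raw[:length]


def decode_rw10(cdb):
    """Return (opcode, logical block address) of a 10-byte read/write CDB."""
    opcode = cdb[0]
    lba = int.from_bytes(cdb[2:6], "big")  # bytes 2..5, big-endian
    return opcode, lba


def instance_code_fields(code):
    """Split an instance code into four bytes (layout is an assumption)."""
    return {
        "component_id": (code >> 24) & 0xFF,
        "event_number": (code >> 16) & 0xFF,
        "repair_action": (code >> 8) & 0xFF,
        "nr_threshold": code & 0xFF,
    }


opcode, lba = decode_rw10(cdb_bytes(CDB_WORDS, CDB_LENGTH))
blocks = 16384 // 512  # "Data Transfer Length 16384." in 512-byte blocks

print(hex(opcode), lba, blocks)
print(instance_code_fields(0x031A4002))
```

Under these assumptions the opcode is 0x2A (WRITE(10)), the block address is 28522432 and the transfer is 32 blocks, which agrees with "blk 28522432 blkCnt 32" in /var/adm/messages. The failed AdvFS metadata write therefore appears to be the same I/O that the CAM layer logged in ENTRY 4.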
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
819.1 | not enough information | SSDEVO::RMCLEAN | Thu Mar 20 1997 16:03 | 3 | |
What version of HSOF software are you running, and at what patch level? The error logs don't tell us this, nor do they tell us what configuration you have.
819.2 | Configuration HSZ40 | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Tue Apr 01 1997 03:33 | 119 |
Hi! I'm sorry for the delay; I was out of the office last week. The HSZ40 is connected to an AlphaServer 8200 via a KZPSA in a DWLPA. The UNIX version was V3.2d-1 and has now been upgraded to 3.2G. Here is the relevant part of the HSZ40 configuration:

HSZ03> sho this full
Controller:
        HSZ40 ZG60606525 Firmware V30Z-2, Hardware B03
        Configured for dual-redundancy with ZG60506190
        In dual-redundant configuration
        SCSI address 6
        Time: 20-MAR-1997 17:07:08
Host port:
        SCSI target(s) (1, 2, 3, 4), Preferred target(s) (1, 3)
        TRANSFER_RATE_REQUESTED = 10MHZ
Cache:
        32 megabyte write cache, version 2
        Cache is GOOD
        Battery is GOOD
        Unflushed data in cache
        CACHE_FLUSH_TIMER = DEFAULT (10 seconds)
        CACHE_POLICY = B
        Host Functionality Mode = A
Licensing information:
        RAID (RAID Option) is ENABLED, license key is VALID
        WBCA (Writeback Cache Option) is ENABLED, license key is VALID
        MIRR (Disk Mirroring Option) is ENABLED, license key is VALID
Extended information:
        Terminal speed 9600 baud, eight bit, no parity, 1 stop bit
        Operation control: 00000004  Security state code: 76193
        Configuration backup enabled on 16 devices

HSZ03> sho unit
    LUN                                      Uses
--------------------------------------------------------------
  D100                                       STRIPE3
  D101                                       MIRR11
  D200                                       STRIPE2
  D300                                       STRIPE5
  D400                                       STRIPE4

The affected unit was D100:

HSZ03> sho d100
    LUN                                      Uses
--------------------------------------------------------------
  D100                                       STRIPE3
        Switches:
          RUN                    NOWRITE_PROTECT        READ_CACHE
          WRITEBACK_CACHE
          MAXIMUM_CACHED_TRANSFER_SIZE = 1024
        State:
          ONLINE to this controller
          Not reserved
          PREFERRED_PATH = THIS_CONTROLLER
        Size: 50265168 blocks

HSZ03> sho stripe3
Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------
STRIPE3       stripeset                      MIRR31           D100
                                             MIRR32
                                             MIRR33
                                             MIRR34
                                             MIRR35
                                             MIRR36
        Switches:
          CHUNKSIZE = 256 blocks
        State:
          NORMAL
          MIRR31 (member 0) is NORMAL
          MIRR32 (member 1) is NORMAL
          MIRR33 (member 2) is NORMAL
          MIRR34 (member 3) is NORMAL
          MIRR35 (member 4) is NORMAL
          MIRR36 (member 5) is NORMAL
        Size: 50265168 blocks

HSZ03> sho mirr32
Name          Storageset                     Uses             Used by
------------------------------------------------------------------------------
MIRR32        mirrorset                      DISK110          STRIPE3
                                             DISK210
        Switches:
          NOPOLICY (for replacement)
          COPY (priority) = NORMAL
          READ_SOURCE = LEAST_BUSY
          MEMBERSHIP = 2, 2 members present
        State:
          NORMAL
          DISK210 (member 0) is NORMAL
          DISK110 (member 1) is NORMAL     <--- disk with error
        Size: 8377528 blocks

HSZ03> sho disk110
Name          Type         Port Targ  Lun                     Used by
------------------------------------------------------------------------------
DISK110       disk            1    1    0                     MIRR32
          DEC      RZ29B    (C) DEC 0016
        Switches:
          NOTRANSPORTABLE
          TRANSFER_RATE_REQUESTED = 10MHZ (synchronous 10 MHZ negotiated)
        Size: 8377528 blocks
        Configuration being backed up on this container
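The sizes and chunk size shown above allow a rough consistency check against the failing block reported in the base note (blk 28522432, blkCnt 32). This is only a sketch under stated assumptions: it supposes that the stripeset distributes chunks round-robin starting at member 0, with no reserved metadata area, and that the "c" partition starts at block 0 of the unit. None of this is confirmed in this thread, and the function name is hypothetical.

```python
CHUNK_BLOCKS = 256        # CHUNKSIZE from "sho stripe3"
MEMBERS = ["MIRR31", "MIRR32", "MIRR33", "MIRR34", "MIRR35", "MIRR36"]
MEMBER_BLOCKS = 8377528   # size of MIRR32 from "sho mirr32"
UNIT_BLOCKS = 50265168    # size of D100 from "sho d100"


def locate(lba, chunk=CHUNK_BLOCKS, members=MEMBERS):
    """Map a unit block number to (member, offset in chunk).

    Assumes plain round-robin striping; the real HSZ40 layout may differ.
    """
    chunk_index, offset = divmod(lba, chunk)
    return members[chunk_index % len(members)], offset


# Six equal mirrorsets should add up to the size of the unit.
sizes_agree = len(MEMBERS) * MEMBER_BLOCKS == UNIT_BLOCKS

# Failing write from /var/adm/messages: blk 28522432, blkCnt 32.
member, offset = locate(28522432)
fits_in_one_chunk = offset + 32 <= CHUNK_BLOCKS

print(sizes_agree, member, offset, fits_in_one_chunk)
```

With these assumptions the write maps entirely to member 1 of the stripeset, MIRR32, which is the mirrorset that contains DISK110. That is consistent with the device locator in the error log, but it does not prove the layout assumption.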
819.3 | You need -3 patch | SSDEVO::RMCLEAN | Tue Apr 01 1997 11:47 | 25 | |
>> HSZ40 ZG60606525 Firmware V30Z-2, Hardware B03

You should be running V30Z-3. It corrects some problems in this area.

I. Patch Description:

This mirrorset repair/fast buffer problem may be encountered with HSOF V3.0Z, V5.0Z and V5.0J. Mirroring (with or without striping) must be in use on the controller. Data transfers greater than the value specified in the controller parameter MAXIMUM_CACHED_TRANSFER_SIZE must be taking place. The default parameter value is 32 blocks (16KB). An unrecoverable error from a device must initiate a Mirror repair.

When the above conditions take place, the controller improperly de-allocates buffers, contaminating the Fast Buffer pool and the Cache Buffer pool. Subsequently, when a mix of transfers greater than the MAXIMUM_CACHED_TRANSFER_SIZE (using Fast buffers) and less than the MAXIMUM_CACHED_TRANSFER_SIZE (using Cache Buffers) occurs, the double-allocated buffers will be used and a data integrity problem is stimulated.
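The role of MAXIMUM_CACHED_TRANSFER_SIZE in the patch description can be illustrated with a small Python sketch. It is not controller code: the function names are made up, the 512-byte block size is inferred from "32 blocks (16KB)", and the handling of a transfer exactly equal to the limit is an assumption, since the patch text only mentions "greater than" and "less than".

```python
BLOCK_BYTES = 512  # inferred from "32 blocks (16KB)" in the patch text


def blocks_to_kb(blocks):
    """Convert a size in blocks to kilobytes."""
    return blocks * BLOCK_BYTES // 1024


def buffer_pool(transfer_blocks, max_cached_blocks=32):
    """Name the buffer pool a transfer would use, per the patch description.

    Transfers larger than the limit use Fast buffers; smaller ones use
    Cache buffers. Treating "equal" as cached is an assumption.
    """
    return "fast" if transfer_blocks > max_cached_blocks else "cache"


print(blocks_to_kb(32))            # default limit in KB
print(buffer_pool(64))             # above the default limit
print(buffer_pool(16))             # below the default limit
print(buffer_pool(64, 1024))       # same transfer with the limit raised
```

The last call shows why raising the limit changes which pool a given transfer falls into: a 64-block transfer is "fast" under the default of 32 blocks but "cache" when the limit is 1024 blocks.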
819.4 | OK - but....? | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Wed Apr 02 1997 03:19 | 14 |
Thanks! I did not really understand the patch description, but I will install the patch and hope it helps. What I do not understand is whether this patch fixes two different problems:

1.) transfer size > MAXIMUM_CACHED_TRANSFER_SIZE
2.) unrecoverable error from a device must initiate a mirror repair

Are these independent of each other, or is there a relationship between 1.) and 2.)?

Helmut
819.5 | KERNEL::LOANE | Comfortably numb!! | Wed Apr 02 1997 08:26 | 3 | |
What it really says is that you are susceptible to the problem IF ALL the points in the reply are valid, i.e. you have mirrorsets .AND. you have errors .AND. .......etc.
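The ".AND." reading can be written down as a single predicate. The sketch below is illustrative only: the class and field names are hypothetical, the version strings are those listed in the patch description, and whether a mirror repair was actually initiated on this system is not stated anywhere in the thread, so that field is a placeholder input.

```python
from dataclasses import dataclass

# HSOF versions named in the patch description in 819.3.
AFFECTED_HSOF = {"V3.0Z", "V5.0Z", "V5.0J"}


@dataclass
class Conditions:
    hsof_version: str
    mirroring_in_use: bool
    max_cached_transfer_blocks: int
    largest_transfer_blocks: int
    mirror_repair_after_device_error: bool  # not confirmed in this thread


def susceptible(c):
    """True only if every condition from the patch description holds."""
    return (
        c.hsof_version in AFFECTED_HSOF
        and c.mirroring_in_use
        and c.largest_transfer_blocks > c.max_cached_transfer_blocks
        and c.mirror_repair_after_device_error
    )


# Default limit of 32 blocks with a hypothetical 64-block transfer.
default_case = Conditions("V3.0Z", True, 32, 64, True)

# Limit of 1024 blocks as shown by "sho d100", with the 32-block write
# from the error log as the only transfer considered.
logged_write = Conditions("V3.0Z", True, 1024, 32, True)

print(susceptible(default_case), susceptible(logged_write))
```

The second case only examines the single write that was logged. Other transfers on the controller may have exceeded the limit, so this does not settle whether the patch applies to the incident.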
819.6 | Yes, it's not sure | ATZIS2::PUTZENLECHNE | wherever is fun, there's always ALPHA | Thu Apr 03 1997 08:11 | 6 |
That is my point - I fear it is not the same problem, because I think we had already implemented the "early fix" (setting MAXIMUM_CACHED_TRANSFER_SIZE to 1024 and changing the HSZ40 entry in cam_data.c) when the error occurred.

Helmut