
Conference decwet::advfs_support

Title:AdvFS Support/Info/Questions Notefile
Notice:note 187 is Freq Asked Questions;note 7 is support policy
Moderator:DECWET::DADDAMIO
Created:Wed Jun 02 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1077
Total number of notes:4417

1023.0. "mirrored HSZ40 disk -> advfs panic" by ATZIS1::PUTZENLECHNE (wherever is fun, there's always ALPHA) Thu Mar 20 1997 09:30

Hi!

Can somebody have a look at this issue:

I experienced a system-down after a single error on a MIRRORED
disk on the HSZ40 led to a follow-on AdvFS domain panic.

The customer asked me why a disk error is causing a problem if he uses
mirrorsets. I don't know the answer.  

crossposted in HSZ40 notes-file and ADVFS_SUPPORT notes-file

Please look at the following description:

I have the following configuration:

An Alphaserver 8200 with a HSZ40 (dual redundant) connected to scsi5.
On the HSZ40 there is a unit D100 configured which is a stripeset.
This stripeset (STRIPE3) consists of 6 mirrorsets (MIRR31 - MIRR36).
Mirrorset MIRR32 consists of DISK210 and DISK110.

The unit D100 is the UNIX-device /dev/rza41c which is used by the
ADVFS domain named ora_dat1.

On 13th March the unit D100 logged the following error which
can be decoded as a command timeout to DISK110:


******************************** ENTRY    4 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            10.
Timestamp of occurrence              13-MAR-1997 05:32:38
Host name                            sapfddi4

System type register      x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x0000000D

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      199. CAM SCSI Event Type


------- Unit Info -------
Bus Number                        5.
Unit Number                   x0148  Target =   1.  <--- this is rza41c
                                     LUN =   0.          UNIT D100 on
------- CAM Data -------				 the HSZ40
Class                           x00  Disk
Subsystem                       x00  Disk
Number of Packets                10.

------ Packet Type ------       258. Module Name String

Routine Name                         cdisk_check_sense

------ Packet Type ------       256. Generic String

                                     Event - Unit Attention

------ Packet Type ------       262. Info Error String

Error Type                           Information Message Detected (recovered)

------ Packet Type ------       257. Device Name String

Device Name                          DEC     HSZ4

------ Packet Type ------       256. Generic String

                                     Active CCB at time of error

------ Packet Type ------       256. Generic String

                                     CCB request completed with an error

------ Packet Type ------         1. SCSIh I/O Request CCB(CCB_SCSIIO)
Packet Revision                  37.

CCB Address               xFFFFFC005D4B7B28
CCB Lengt                    x00C0
XPT Function Code               x01  Execute requested SCSI I/O
Cam Status                      x84  CCB Request Completed WITH Error
                                     Autosense Data Valid for Target
Path ID                           5.
Target ID                         1.
Target LUN                        0.
Cam Flags                 x00000482  SIM Queue Actions are Enabled
                                     Data Direction (10: DATA OUT)
                                     Disable the SIM Queue Frozen State
*pdrv_ptr                 xFFFFFC005D4B7828
*next_ccb                 x0000000000000000
*req_map                  xFFFFFC007B13F400
void (*cam_cbfcnp)()      xFFFFFC00004A5460
*data_ptr                 xFFFFFFFFC6428000
Data Transfer Length          16384.
*sense_ptr                xFFFFFC005D4B7850
Auotsense Byte Length           160.
CDB Length                       10.
Scatter/Gather Entry Cnt          0.
SCSI Status                     x02  Check Condition
Autosense Residue Length        x00
Transfer Residue Length   x00004000
(CDB) Command & Data Buf

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:              00000000  0000C037  B301002A   *    *...7... ...*

Timeout Value             x0000003C
*msg_ptr                  x0000000000000000
Message Length                    0.
Vendor Unique Flags           x4000
Tag Queue Actions               x20  Tag for Simple Queue

------ Packet Type ------       256. Generic String

                                     Error, exception, or abnormal condition

------ Packet Type ------       256. Generic String

                                     UNIT ATTENTION - Medium changed or target
                                     reset

------ Packet Type ------       768. SCSI Sense Data
Packet Revision                   0.

------- HSZ Data -------
Instance Code             x031A4002  Command timeout.

                                     Component ID =   Device Services.
                                     Event Number =   x0000001A
                                     Repair Action =   x00000040
                                     NR Threshold =   x00000002
Template Type                   x51  Disk Transfer Error.
Template Flags                  x00  HCE =   0, Event did not occur during Host
                                             Command Execution.
Ctrl Serial #                              ZG60606525
Ctrl Software Revision               V30Z
RAIDSET State                   x00  NORMAL. All members present and
                                     reconstructed, IF LUN is configured as a
                                     RAIDSET.

Error Count                       1.
Retry Count                       0.
Most Recent ASC                 xB0
Most Recent ASCQ                x00
Next Most Recent ASC            x00
Next Most Recent ASCQ           x00
Device Locator              x000101  Port    =   1.
                                     Target  =   1.
                                     LUN     =   0.    <--- DISK110
Drive Software Revision              0007
Drive Product Name                   RZ29B    (C) DEC
Device Type                     x00  Direct Access Device.
Sense Data Qualifier            x00  Buf Mode =   0, The target shall not
                                                  report GOOD Status on write
                                                  commands until the data
                                                  blocks are actually written
                                                  on the medium.
                                     UWEUO =   0, not defined.
                                     MSBD =   0, not defined.
                                     FBW =   0, not defined.
                                     IDSD =   0, Valid Device Sense Data
                                              fields.
                                     DSSD =   0, Device Sense Data fields
                                              supplied by the controller.
-- Standard Sense Data --

Error Code                      x70  Current Error
Segment #                       x00
Information Byte 3              x00
            Byte 2              x00
            Byte 1              x00
            Byte 0              x00
Sense Key                       x06  Unit Attention
Additional Sense Length         x98
CMD Specific Info Byte 3        x00
                  Byte 2        x00
                  Byte 1        x00
                  Byte 0        x00

ASC & ASCQ                    xB000  ASC  =   x00B0
                                     ASCQ =   x0000
                                     Command timeout.

FRU Code                        x00
Sense Key Specific Byte 0       x00  Sense Key Data NOT Valid
                   Byte 1       x00
                   Byte 2       x00

-- Device Sense Data --

Error Code                      x00  Error Code not decoded
Segment #                       x00
Information Byte 3              x00
            Byte 2              x00
            Byte 1              x00
            Byte 0              x00
Sense Key                       x04  Hardware Error
Additional Sense Length         x00
CMD Specific Info Byte 3        x00
                  Byte 2        x00
                  Byte 1        x00
                  Byte 0        x00

ASC & ASCQ                    xB000  ASC  =   x00B0
                                     ASCQ =   x0000
                                     Command timeout.

FRU Code                        x00
Sense Key Specific Byte 0       x00  Sense Key Data NOT Valid
                   Byte 1       x00
                   Byte 2       x00


******************************** ENTRY    5 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            11.
Timestamp of occurrence              13-MAR-1997 05:32:40
Host name                            sapfddi4

System type register      x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x0000000D

Event validity                    1. O/S claims event is valid
Event severity                    3. High Priority
Entry type                      199. CAM SCSI Event Type


------- Unit Info -------
Bus Number                        5.
Unit Number                   x0148  Target =   1.
                                     LUN =   0.
------- CAM Data -------
Class                           x00  Disk
Subsystem                       x00  Disk
Number of Packets                 4.

------ Packet Type ------       258. Module Name String

Routine Name                         cdisk_reset_rec_err

------ Packet Type ------       256. Generic String

                                     Recovery failed

------ Packet Type ------       260. Hardware Error String

Error Type                           Hard Error Detected

------ Packet Type ------       257. Device Name String

Device Name                          DEC     HSZ4



At the same time the domain "ora_dat1" panicked, and Oracle stopped.
These are the entries from /var/adm/messages:

Mar 13 05:32:40 sapfddi4 vmunix: advfs I/O error: setId 0x3171fd89.000554e0.fffffffe.0000  tag 0xfffffff7.0000u  page 474
Mar 13 05:32:40 sapfddi4 vmunix:        vd 1  blk 28522432  blkCnt 32
Mar 13 05:32:40 sapfddi4 vmunix:        write error = 5
Mar 13 05:32:40 sapfddi4 vmunix:
Mar 13 05:32:40 sapfddi4 vmunix: bs_osf_complete: metadata write failed
Mar 13 05:32:40 sapfddi4 vmunix: AdvFS Domain Panic; Domain ora_dat1 Id 0x3171fd89.000554e0
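
The failed AdvFS write can be tied to the SCSI command in error-log entry 4 by decoding the CDB dump. This is a small sketch that assumes 512-byte blocks and the standard WRITE(10) layout (opcode in byte 0, logical block address in bytes 2-5); the variable names are illustrative only.

```python
# CDB words exactly as printed in the dump, highest-numbered bytes first
# (columns 11-08, 07-04, 03-00).
dump_words = ["00000000", "0000C037", "B301002A"]

# Reverse the whole dump so that byte 0 comes first.
cdb = bytes.fromhex("".join(dump_words))[::-1]

opcode = cdb[0]                          # 0x2A is WRITE(10)
lba = int.from_bytes(cdb[2:6], "big")    # logical block address, bytes 2-5

BLOCK_SIZE = 512                         # assumption: 512-byte blocks
blocks = 16384 // BLOCK_SIZE             # from "Data Transfer Length 16384."

print(hex(opcode), lba, blocks)          # 0x2a 28522432 32
```

The decoded address and block count agree with "blk 28522432  blkCnt 32" in the AdvFS message, which suggests that the write that timed out on the HSZ40 is the same metadata write that AdvFS reported as failed.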



DISK110 was not failed after this. I wonder why such a "soft" error
causes such a severe failure. The system is built with redundant controllers
and mirrored disks to prevent system-down situations in case of hardware
errors on disks or controller boards, but in this case it did not work.

Can anybody help me explain what really happened?

Thanks for any input.

Helmut

1023.1. "Domain Panic" by NETRIX::"[email protected]" Thu Mar 20 1997 14:28 (19 lines)

Why the mirrored RAID disk set reported a problem is a hardware question
better answered by the experts in the HSZ40 notes file.

AdvFS put the domain into a "domain panic" state because a metadata write
failed when the controller reported an error to AdvFS.
A domain panic means that, once an I/O error is detected, AdvFS no longer
issues I/O requests to the disk controller for that specific domain.
This state was added in V3.2 to avoid panicking the entire
system when AdvFS is unable to write its metadata to disk.

You should be able to unmount all of the domain's filesets after a domain
panic.
You should also be able to remount those filesets immediately, assuming
there is no fatal hardware problem. The system does not need to be
rebooted.
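
The behaviour described above can be pictured with a toy model. This is only a conceptual sketch of the state changes, not AdvFS code: the class, its methods and the rule that a remount clears the panic state are all hypothetical simplifications of the description in this reply.

```python
import errno


class Domain:
    """Toy model of an AdvFS domain; all names here are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.panicked = False
        self.mounted = True

    def write_metadata(self, device_write):
        """Attempt a metadata write through the given callable."""
        if self.panicked or not self.mounted:
            raise OSError(errno.EIO, "domain not accepting I/O")
        try:
            device_write()
        except OSError:
            # A failed metadata write panics only this domain,
            # not the whole system.
            self.panicked = True
            raise

    def unmount(self):
        self.mounted = False

    def mount(self):
        # In this model, remounting clears the panic state.
        self.panicked = False
        self.mounted = True


def failing_write():
    # Stands in for the controller returning an error (errno 5 is EIO).
    raise OSError(errno.EIO, "write error")


dom = Domain("ora_dat1")
try:
    dom.write_metadata(failing_write)
except OSError:
    pass
print(dom.panicked)                 # True: no further I/O for this domain

dom.unmount()
dom.mount()
dom.write_metadata(lambda: None)    # accepted again, no reboot involved
print(dom.panicked)                 # False
```

The point of the model is the scope of the failure: other domains on the same system are untouched, and the affected domain becomes usable again after an unmount and remount, provided the underlying device is healthy.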



[Posted by WWW Notes gateway]