| Title: | DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1) |
| Notice: | Welcome to the Digital UNIX Conference |
| Moderator: | SMURF::DENHAM |
| Created: | Thu Mar 16 1995 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 10068 |
| Total number of notes: | 35879 |
A question about CAM and disk bad block replacement (BBR).
Yesterday, my customer reported that his application was returning an
I/O error while copying a file. Analysis of the error log showed that
the source disk (of the file copy) was reporting an unrecoverable error
(ASC/Q=1100). The disk in question is an RZ26. The history of error
log entries showed that the disk had reported multiple occurrences of
this same error (ASC/Q=1100) at the same block number (2023725) over
several days.
Looking at another system that uses RZ29B drives, I noticed that this
system had performed a few bad block replacements. The data from the
error log for the BBRs was:
----- CAM STRING -----
ROUTINE NAME cdisk_bbr_done
----- CAM STRING -----
cdisk_bbr_read: Bad block ok no BBR
_action bad block number: 253
----- CAM STRING -----
ERROR TYPE Soft Error Detected (recovered)
----- CAM STRING -----
DEVICE NAME DEC RZ29B
Or:
----- CAM STRING -----
ROUTINE NAME cdisk_bbr_done
----- CAM STRING -----
cdisk_bbr_write: BBR complete bad
_block number: 175318
----- CAM STRING -----
ERROR TYPE Soft Error Detected (recovered)
----- CAM STRING -----
DEVICE NAME DEC RZ29B
My question is:
Why didn't CAM perform the BBR for the RZ26 in the first instance, while it
did for the RZ29B?
The major difference between the RZ26 case and the RZ29B case is that
the RZ26 had ASC/Q=1100="Unrecovered read error", while the RZ29B was
showing a recovered soft error. Is this the reason that the RZ26
didn't get the BBR?
For now, I just used /sbin/scu to reassign the block in question, so
that reading the file no longer gives an I/O error, and advised the
user to get the file restored from backups.
Thanks!
Stu
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 9189.1 | Silent data corruption | SSDEVO::ROLLOW | Dr. File System's Home for Wayward Inodes. | Fri Mar 14 1997 11:55 | 15 |
Fred will undoubtedly give a better answer, but I think the reason is that the RZ26 error wasn't recoverable. Had CAM elected to replace the bad block behind your back, you would have no clue (except the previous history) that the block was bad. In the RZ29B case, it was able to read a good copy of the data, reassign the block, and write the good data back to the new block. By not replacing unrecoverable errors, it forces you to reassign the block by hand and realize in the process that the data is corrupt and must be restored.

MSCP disks had a feature for remembering that the data in a replaced block was bad: Force Error. As a result, the block could be replaced whether the error was recoverable or not. SCSI disks don't have that feature.
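To make the policy above concrete, here is a minimal toy model in Python. It is only a sketch of the reasoning in this reply, not the CAM driver's code or any real SCSI interface: `ToyDisk`, `automatic_bbr`, and the zero-filled contents of a freshly reassigned block are all assumptions made for illustration.

```python
class ToyDisk:
    """Toy model of a disk; hypothetical, not the CAM driver or a SCSI API."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block number -> data
        self.unreadable = set()     # blocks that fail with an unrecovered read error
        self.reassigned = []        # blocks that were moved to a spare

    def read(self, lbn):
        if lbn in self.unreadable:
            raise IOError(f"unrecovered read error at block {lbn}")
        return self.blocks[lbn]

    def reassign(self, lbn, data=None):
        # Map the block to a spare. Without a good copy of the data the new
        # block holds filler (zeros here, which is an assumption of this model).
        self.unreadable.discard(lbn)
        self.reassigned.append(lbn)
        self.blocks[lbn] = data if data is not None else b"\x00" * 512


def automatic_bbr(disk, lbn, recovered_data):
    """Replace a block automatically only when a good copy of its data exists."""
    if recovered_data is None:
        # Unrecovered error: leave the block alone so reads keep failing
        # and the data loss stays visible to the user.
        return "left in place"
    disk.reassign(lbn, recovered_data)
    return "replaced"


# Recovered soft error (the RZ29B case): the block is replaced transparently.
rz29b = ToyDisk({175318: b"good data"})
print(automatic_bbr(rz29b, 175318, b"good data"))

# Unrecovered read error (the RZ26 case): no automatic replacement.
rz26 = ToyDisk({2023725: b"original data"})
rz26.unreadable.add(2023725)
print(automatic_bbr(rz26, 2023725, None))
```

In this model, a manual `rz26.reassign(2023725)` makes the block readable again, but what comes back is filler rather than the original data. With no way to mark the replaced block as bad, an automatic replacement in the unrecovered case would hide exactly that loss.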
| 9189.2 | Thanks! | OHFSS1::FULLER | Never confuse a memo with reality | Fri Mar 14 1997 13:58 | 8 |
re: .1
That's pretty much what I'd figured, especially in light of the MSCP
forced error flag.
Thanks for the get-back.
Stu