Title: | DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1) |
Notice: | Welcome to the Digital UNIX Conference |
Moderator: | SMURF::DENHAM |
Created: | Thu Mar 16 1995 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 10068 |
Total number of notes: | 35879 |
A question about CAM and disk bad block replacement (BBR).

Yesterday, my customer reported that his application was returning an I/O error while copying a file. Analysis of the error log showed that the source disk of the file copy was reporting an unrecoverable error (ASC/Q=1100). The disk in question is an RZ26. The history of error log entries showed that the disk had reported multiple occurrences of this same error (ASC/Q=1100) at the same block number (2023725) over several days.

Looking at another system that uses RZ29B drives, I noticed that this system had performed a few bad block replacements. The data from the error log for the BBRs was:

    ----- CAM STRING -----
    ROUTINE NAME cdisk_bbr_done
    ----- CAM STRING -----
    cdisk_bbr_read: Bad block ok no BBR
    _action bad block number: 253
    ----- CAM STRING -----
    ERROR TYPE Soft Error Detected (recovered)
    ----- CAM STRING -----
    DEVICE NAME DEC RZ29B

Or:

    ----- CAM STRING -----
    ROUTINE NAME cdisk_bbr_done
    ----- CAM STRING -----
    cdisk_bbr_write: BBR complete bad
    _block number: 175318
    ----- CAM STRING -----
    ERROR TYPE Soft Error Detected (recovered)
    ----- CAM STRING -----
    DEVICE NAME DEC RZ29B

My question is: why didn't CAM perform the BBR for the RZ26 in the first instance, while it did for the RZ29B? The major difference between the two cases is that the RZ26 had ASC/Q=1100 ("Unrecovered read error"), while the RZ29B was showing a recovered soft error. Is this the reason that the RZ26 didn't get the BBR?

For now, I just used /sbin/scu to reassign the block in question, so that reading the file no longer gives an I/O error, and advised the user to get the file restored from backups.

Thanks!

Stu
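
For readers unfamiliar with the ASC/Q notation used above: the four hex digits are the SCSI additional sense code (ASC) followed by its qualifier (ASCQ). The sketch below splits such a string and names it from a deliberately tiny lookup table. The function name `decode_asc_ascq` and the table are illustrative assumptions, not part of CAM or the error log tooling, and only three standard assignments are listed.

```python
# Illustrative subset of the standard SCSI ASC/ASCQ assignments.
# A real table is far larger; this one only covers codes relevant here.
ASC_ASCQ_NAMES = {
    (0x11, 0x00): "Unrecovered read error",
    (0x17, 0x00): "Recovered data with no error correction applied",
    (0x18, 0x00): "Recovered data with error correction applied",
}


def decode_asc_ascq(code):
    """Split a 4-hex-digit ASC/Q string such as '1100' into (asc, ascq, name)."""
    if len(code) != 4:
        raise ValueError("expected four hex digits, got %r" % (code,))
    asc = int(code[:2], 16)    # first byte: additional sense code
    ascq = int(code[2:], 16)   # second byte: additional sense code qualifier
    name = ASC_ASCQ_NAMES.get((asc, ascq), "not in this illustrative table")
    return asc, ascq, name


asc, ascq, name = decode_asc_ascq("1100")
print("ASC=%02Xh ASCQ=%02Xh: %s" % (asc, ascq, name))
```
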
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
9189.1 | Silent data corruption | SSDEVO::ROLLOW | Dr. File System's Home for Wayward Inodes. | Fri Mar 14 1997 11:55 | 15 |
Fred will undoubtedly give a better answer, but I think the reason is that the RZ26 error wasn't recoverable. Had CAM elected to replace the bad block behind your back, you would have no clue (except the previous history) that the block was bad. In the RZ29B case, it was able to read a good copy of the data, reassign the block and write the good data back to the new block. Because unrecoverable errors are not replaced automatically, you're forced to reassign the block by hand and realize in the process that the data is corrupt and must be restored. MSCP disks had the feature of remembering that the data in a replaced block was bad: Force Error. As a result, a block could be replaced whether or not the error was recoverable. SCSI disks don't have that feature. | |||||
9189.2 | Thanks! | OHFSS1::FULLER | Never confuse a memo with reality | Fri Mar 14 1997 13:58 | 8 |
re: .1 That's pretty much what I'd figured, especially in light of the MSCP forced error flag. Thanks for the get-back. Stu |
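
The policy described in .1 can be sketched as follows. This is a minimal Python model under stated assumptions, not the CAM driver's code: `SimDisk`, `read_with_bbr`, and the three status strings are hypothetical names invented for illustration, and the "disk" is just an in-memory dictionary. The block numbers are borrowed from the base note.

```python
class SimDisk:
    """In-memory stand-in for a disk (hypothetical; not a real SCSI device)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)   # block number -> data
        self.soft_bad = set()        # blocks that read correctly only after recovery
        self.hard_bad = set()        # blocks whose data cannot be read at all
        self.reassigned = []         # blocks that have been mapped to a spare

    def read(self, lbn):
        """Return (status, data); status is 'ok', 'recovered' or 'unrecovered'."""
        if lbn in self.hard_bad:
            return "unrecovered", None
        if lbn in self.soft_bad:
            return "recovered", self.blocks[lbn]
        return "ok", self.blocks[lbn]

    def reassign(self, lbn):
        """Map the block to a spare. The old contents do not follow it."""
        self.soft_bad.discard(lbn)
        self.hard_bad.discard(lbn)
        self.blocks[lbn] = None
        self.reassigned.append(lbn)

    def write(self, lbn, data):
        self.blocks[lbn] = data


def read_with_bbr(disk, lbn):
    """Read a block, replacing it automatically only when a good copy exists."""
    status, data = disk.read(lbn)
    if status == "recovered":
        # A good copy is in hand, so replace the block and write the data back.
        disk.reassign(lbn)
        disk.write(lbn, data)
        return data
    if status == "unrecovered":
        # No good copy: a silent replacement would hide the data loss,
        # so report the error and leave the reassignment to the operator.
        raise OSError("unrecoverable read error at block %d" % lbn)
    return data


disk = SimDisk({175318: b"good data", 2023725: b"lost data"})
disk.soft_bad.add(175318)    # models the RZ29B recovered soft error
disk.hard_bad.add(2023725)   # models the RZ26 unrecovered read error

print(read_with_bbr(disk, 175318))   # replaced automatically, data preserved
try:
    read_with_bbr(disk, 2023725)
except OSError as err:
    print(err)                       # surfaced to the caller as an I/O error
print(disk.reassigned)               # only the recoverable block was replaced
```

In this model, calling `disk.reassign(2023725)` by hand corresponds to the manual reassignment mentioned in the base note: later reads stop failing, but the original contents are gone, which is why the file still has to be restored from backup.
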