
Conference turris::digital_unix

Title:	DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:	Welcome to the Digital UNIX Conference
Moderator:	SMURF::DENHAM

Created:	Thu Mar 16 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	10068
Total number of notes:	35879

9189.0. "SCSI CAM bad block replacement" by OHFSS1::FULLER (Never confuse a memo with reality) Fri Mar 14 1997 11:06

    A question about how CAM handles disk bad block replacement (BBR).
    
    Yesterday, my customer reported that his application was returning an
    I/O error while copying a file.  Analysis of the error log showed that
    the source disk (of the file copy) was having an unrecoverable error
    (ASC/Q=1100).  The disk in question is an RZ26.  Looking over the
    history of error log entries showed that the disk had reported multiple
    occurrences of this same error (ASC/Q=1100) at the same block number
    (2023725) over several days.  
    
    Looking at another system that uses RZ29B drives, I noticed that this
    system had performed a few bad block replacements.  The data from the
    error log for the BBRs was:
    
    ----- CAM STRING -----
    ROUTINE NAME                        cdisk_bbr_done
    ----- CAM STRING -----
                                        cdisk_bbr_read: Bad block ok no BBR
                                             _action bad block number: 253
    ----- CAM STRING -----
    ERROR TYPE                          Soft Error Detected (recovered)
    ----- CAM STRING -----
    DEVICE NAME                         DEC     RZ29B
    
    Or:
    
    ----- CAM STRING -----
    ROUTINE NAME                        cdisk_bbr_done
    ----- CAM STRING -----
                                        cdisk_bbr_write: BBR complete bad
                                             _block number: 175318
    ----- CAM STRING -----
    ERROR TYPE                          Soft Error Detected (recovered)
    ----- CAM STRING -----
    DEVICE NAME                         DEC     RZ29B
    
    My question is:
    
    Why didn't CAM perform the BBR for the RZ26 in the first instance, while
    it did for the RZ29B?
    
    The major difference between the RZ26 case and the RZ29B case is that
    the RZ26 had ASC/Q=1100="Unrecovered read error", while the RZ29B was
    showing a recovered soft error.  Is this the reason that the RZ26
    didn't get the BBR?
    
    For now, I have used /sbin/scu to reassign the block in question, so
    reading the file no longer gives an I/O error, and I advised the user
    to get the file restored from backups.
    
    Thanks!
    
    	Stu

T.R	Title	User	Personal Name	Date	Lines
9189.1	Silent data corruption	SSDEVO::ROLLOW	Dr. File System's Home for Wayward Inodes.	Fri Mar 14 1997 11:55	15
	Fred will undoubtedly give a better answer, but I think the reason is that the RZ26 error wasn't recoverable. Had it elected to replace the bad block behind your back, you would have no clue (except the previous history) that the block was bad. In the RZ29B case, it was able to read a good copy of the data, reassign the block and write the good data back to the new block. By not replacing blocks with unrecoverable errors, it forces you to reassign them by hand and realize in the process that the data is corrupt and must be restored. MSCP disks had the feature of remembering that the data in a replaced block was bad; Force Error. As a result it could replace the block, recoverable or not. SCSI disks don't have that feature.
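	For readers who want the policy in .1 spelled out, here is a minimal sketch in Python. It is only an illustration of the reasoning above, not the CAM disk driver's actual logic: the names `split_asc_ascq`, `choose_bbr_action` and `BbrAction` are made up for this example. The one outside fact it relies on is that an ASC/Q value such as 1100 is the additional sense code (0x11) followed by its qualifier (0x00).

```python
from enum import Enum


class BbrAction(Enum):
    # Hypothetical labels for the two outcomes discussed in this thread.
    REASSIGN_AND_REWRITE = "reassign the block and write the recovered data back"
    LEAVE_FOR_OPERATOR = "return the I/O error and leave reassignment to the operator"


def split_asc_ascq(code: str) -> tuple[int, int]:
    """Split a 4-hex-digit ASC/Q string such as '1100' into (ASC, ASCQ)."""
    if len(code) != 4:
        raise ValueError(f"expected 4 hex digits, got {code!r}")
    return int(code[:2], 16), int(code[2:], 16)


def choose_bbr_action(data_recovered: bool) -> BbrAction:
    """Pick an action based on whether a good copy of the data was read.

    Assumption: automatic replacement is only safe when the data was
    recovered, because SCSI has no equivalent of the MSCP forced-error
    flag to mark a replaced block as containing bad data.
    """
    if data_recovered:
        return BbrAction.REASSIGN_AND_REWRITE
    return BbrAction.LEAVE_FOR_OPERATOR


if __name__ == "__main__":
    asc, ascq = split_asc_ascq("1100")  # the RZ26 error from the base note
    print(f"ASC=0x{asc:02X} ASCQ=0x{ascq:02X}")
    print("RZ26  (unrecovered):", choose_bbr_action(False).value)
    print("RZ29B (recovered):  ", choose_bbr_action(True).value)
```

	The point of keeping the unrecovered case manual is the one made in .1: a silent replacement would hand back a readable block whose contents are wrong.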
9189.2	Thanks!	OHFSS1::FULLER	Never confuse a memo with reality	Fri Mar 14 1997 13:58	8
	re: .1 That's pretty much what I'd figured, especially in light of the MSCP forced error flag. Thanks for the get-back. Stu