T.R | Title | User | Personal Name | Date | Lines |
---|
6736.1 | from the drive side... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Fri May 30 1997 14:49 | 11 |
|
No idea why you have 1 less cylinder reported.
The drive does not use any type of alternate cylinder for revectoring. It
has spare sectors pre-mapped at the factory on each track and each head that it
uses. If the drive has major media/head problems and uses up all of these
sectors (that are NOT included in the capacity of the device), any subsequent
reassign results in a Sense Key of 04, ASC of 19 (Defect List Error). Not an
Illegal Request as your log indicated.
Roger.
|
6736.5 | SCSI Timeout is an Issue for heavy IO | MSAM03::RAHMAN | | Sun Jun 01 1997 03:48 | 152 |
|
Hi Roger,
That is exactly the case: the block is not bad in the sense that you could
detect it during formatting at the factory. It is an "ugly" block, i.e.
bad because difficulties arise when attempting to read the block.
I believe the situation that I am encountering is similar to the 2 notes
I attached below. Please analyse this situation. If there is no logical
explanation, then the customer has the right to change Digital's
hardware.
I have checked /usr/include/sys/disklabel.h for the definition
of alternate sector and alternate cylinder, and it seems that they are not
used in /etc/disktab. Please verify the Rz29-va (is it a Seagate
Barracuda?), and
does the UNIX driver not comply with the SCSI commands from Seagate?
MCS engineers have verified the "suspected" disk is OK at the local Digital
office!!
I would say it is because of heavy IO that the driver marks it as bad,
and the alternate tracks are running out because of so many "UGLY" blocks.
Please look into this matter more seriously. If you need info please ask
for it.
I am very interested in solving this matter once and for all. Otherwise,
tomorrow I walk into the customer selling a different vendor's box.
rahman ibrahim@MSA
SSU Malaysia.
Topic #132: ``Bad RCT causes an err on BBR?''
I believe the term "Good" block and "Bad" block in the RCT should
be clearly understood. The term "bad" generally implies unreadable.
If the block is deemed bad at the factory (PBN entry in the FCT) or
the Formatter "detects" the block as bad, then it will format the
header with header code "11", marking it unusable. If the block
header is still "00" (Good LBN) but difficulties arise attempting to
read the block (continued uncorrectable ECC, smashed header, etc.) the
block is again deemed bad. Alternate copies of the relative block
will be accessed in the RCT during BBR or revector operations.
There is, however, a condition I like to call "ugly". This is a
block that is not bad but contains "bad data" with good ECC, EDC,
etc. Alternate copies of blocks of this type WILL NOT BE ACCESSED
under normal circumstances. Example: K.SDI fails and
"forgets the HOST/RCT boundary" and writes a data pattern into the
first few blocks of the RCT during periodics. This
corrupts the first copy of the RCT control block. The data happens
to get written with good ECC,EDC. This could have a variety of
effects during host mount of that disk. Continuing on, problems
arise and the Field Engineer determines the K.SDI is bad and
replaces it. Good ! The disk is still corrupt but the symptoms may
not be obvious. If the corruption "clobbered" word 4 in the RCT (BBR
control word) the symptoms appear during each attempt to ONLINE the
disk (VMS Mount for example). If the P1 or P2 flags happen to be
set, the system will attempt to finish a BBR that never really
started. If the replaced LBN address field gets filled with this
erroneous pattern, the HSC may attempt a BBR to a "non-existent" LBN
and crash the HSC every time a mount is attempted. If undefined
bits get set in the control word, the HSC will "data safety
write-protect" the disk every time it is mounted. The list is
endless, especially if the descriptor blocks become affected. The point
is this, if blocks in the RCT get written with bad data but good
ECC, then alternate copies of the blocks are NOT ACCESSED because
the block is considered "good" (better term is readable, not
necessarily good). I can produce these symptoms manually, and
they do happen in the field, fortunately infrequently (I hope). We
had two occasions of K.SDI failure in our lab (CSSE lab) that
produced these very same "subtle" but serious problems. I saved the
printout for one and use it during my seminar (DSA troubleshooting)
to teach FEs how to deal with logical failures usually resulting
from hardware failures. Rule of thumb: if you have experienced
any hardware problem that could affect the R/W data path to the disk
(controller, SDI, disk electronics), you may have experienced
corruption on the media, which stays around "after" the HW is
resolved. I call it logical recovery. Mark Himes CX/CSSE
Topic #5752: ``command timeout issue''
Looks like HSJ01$DUA62 and HSJ04$DUA702 are suffering Command
Timeouts; What rev firmware are they running? (If it's running
V007, upgrade to 0016...if it's running 0014, then it should be OK).
I've included a blitz that Roger Patenaude put out in relation to
Command Timeouts. BTW, you should REALLY upgrade HSOF to V2.7 and
SWEAT to X2.7Copyright (c) Digital Equipment Corporation 1995. All
rights reserved. +---------------------------+TM | | |
| | | | | | d | i | g | i | t | a | l | TIME
DEPENDENT CASE | | | | | | | |
+---------------------------+ TITLE: What are SCSI Command
Timeouts Errors? AUTHOR: Roger Patenaude DATE:
August 16, 1995 DTN: 237-3705 TD #:
1904 ENET: BABAGI::Patenaude CROSS REFERENCE
#'s: DEPT: Storage External Products (PRISM/TIME/CLD#'s)
Continuation Engineering INTENDED AUDIENCE: All
PRIORITY LEVEL: 2 (U.S./EUROPE/GIA)
(1=TIME CRITICAL,
2=NON-TIME CRITICAL)
=====================================================================
PROBLEM:
--------
The purpose of this Blitz is to give you some insight as to what a
SCSI "Command Timeout" error is. I've kept this very generic as more
of an informational Blitz for a change.

These errors are telling you that a specific "command" did not
complete in a specified period of time. This can be caused by
multiple sources, and in almost all cases can be recovered by the
host system by reissuing the failed command. Some of the reasons for
"Command Timeouts" are:

1) The SCSI bus is too busy. SCSI bus priority is decided by the
   drive's ID during arbitration, with no regard for how many times
   the device wins the bus. So, if you have a bus with the highest
   priority device doing a VERY heavy workload ("hogging" the bus),
   then other devices on the bus will not be able to arbitrate and
   win the bus. These devices will then have commands outstanding
   that they cannot complete. The host will then log a "command
   timeout" error and sometimes follow it with a bus reset.

2) The host issued a command to a drive that took too long to
   complete. This could be due to a broken device, but more commonly
   the device is doing a long command and does not have time to
   answer the host. Normal convention is that the host will only ask
   "how things are proceeding" (as in the case where you issued a
   rewind to a tape drive and are waiting for it to become ready) via
   a Test Unit Ready command, but if data type (read/write) commands
   are continually issued to the unit, the first command cannot be
   completed and may time out.

3) Operating system driver issues. The drivers may not be allowing
   enough time for the commands to complete. A case in point: VMS
   recently increased the command timeout values in MKDRIVER (TAPE)
   and DKDRIVER (DISK) (from 3 seconds to 10 in MK). This was because
   3 was just too aggressive on a busy bus, and command timeouts and
   bus resets were occurring under heavy load.

4) Device issues. The drive may not have enough horsepower to
   complete the commands it accepted in a reasonable amount of time.
   OR, the drive may not be working on commands it has accepted
   because it is too busy. RZ28B's running version 003 code are one
   such case: the drive will optimize its seeks by working commands
   that are in the local area of the heads. One side effect is that a
   command may time out if it was not in the local area where the
   drive is spending all its time, thus not getting serviced. RZ28B's
   running 006 do not have this issue.

RESOLUTION/WORKAROUND:
-----------------------
For the most part these are just events and should be left alone. In
the rare case where this is disruptive due to resets occurring,
review the four points above and see how they fit into your
environment. You may need to split heavily loaded devices between
multiple busses, or you may need new firmware, or maybe move a device
off to another bus.

ADDITIONAL COMMENTS:
--------------------
None.
**** DIGITAL INTERNAL USE ONLY ****
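
Point 1 of the blitz (a busy device starving the rest of the bus) can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a narrow bus where the higher SCSI ID always wins arbitration, one bus tenure per cycle, and an invented workload. The function name and the numbers are hypothetical, not taken from any driver or firmware.

```python
from collections import Counter

def simulate_bus(pending, cycles):
    """Grant the bus each cycle to the highest-ID device that still has work.

    pending maps SCSI ID -> number of bus tenures the device wants.
    Returns a Counter of tenures granted per ID within `cycles` cycles.
    """
    remaining = dict(pending)
    granted = Counter()
    for _ in range(cycles):
        contenders = [dev for dev, left in remaining.items() if left > 0]
        if not contenders:
            break
        winner = max(contenders)  # fixed priority: no fairness, no aging
        remaining[winner] -= 1
        granted[winner] += 1
    return granted

# Hypothetical load: ID 6 is the "hog", ID 1 has a single command waiting.
granted = simulate_bus({6: 100, 1: 1}, cycles=50)
print(granted[6], granted[1])
```

In this model the low-priority device gets no tenure at all within the window, which is the situation in which a host timer on its outstanding command would expire.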
|
6736.6 | Man you are ALL over the place... | SUBSYS::VIDIOT::PATENAUDE | Ask your boss for ARRAY's... | Mon Jun 02 1997 12:51 | 45 |
|
> That is exactly the case: the block is not bad in the sense that you could
> detect it during formatting at the factory. It is an "ugly" block, i.e.
> bad because difficulties arise when attempting to read the block.
Exactly WHAT case????? You got a failure in the errorlog that said;
----- CAM STRING -----
ILLEGAL REQUEST - Illegal request or
_CDB parameter
The drive also returned status that said it got an invalid request!
How are you equating that with a note about DSDF / RCT / FCT information that
was written about SDI devices (RA81, RA82, RA90, etc...) and a note about
command timeouts???????
> Please verify the Rz29-va (is it a Seagate
> Barracuda?), and
> does the UNIX driver not comply with the SCSI commands from Seagate?
It is a Seagate drive and YOU can dig through UNIX drivers. Not I.
> MCS engineers have verified the "suspected" disk is OK at the local Digital
> office!!
>
So it's probably not the drive ;^)
> Please look into this matter more seriously. If you need info please ask
> for it.
NOTES IS NOT AN ESCALATION PATH!!!!! You need to look at this more seriously and
follow proper escalation to get this looked at. Have you tried any local sales
and service support folk? (Don't answer, rhetorical question)
> I am very interested in solving this matter once and for all. Otherwise,
> tomorrow I walk into the customer selling a different vendor's box.
UNBELIEVABLE!!!! You have what most likely is a SOFTWARE problem and you are
about to condemn our hardware. Unbelievable is all I can say. Glad I only have
250 shares of DEC stock as of today with this mindset.
roger.
|
6736.7 | Help is needed...... | MSAM03::RAHMAN | | Mon Jun 02 1997 21:09 | 8 |
| Thanks for your response to the problem. Oops! Sorry, this is not the
ESCALATION path. I will be more careful next time. However, thanks for
your time in looking into my problem.
I will escalate this problem to our support people.
Rahman
|
6736.8 | Roger is right: escalate it | SUBSYS::BROWN | SCSI and DSSI advice given cheerfully | Tue Jun 03 1997 08:19 | 19 |
| I don't think it's clear whether this is a software problem or a
configuration error. The SCSI sense data is 05/21/00, which means
the software attempted to read a block beyond the drive's capacity.
Now, we know the capacity after the error was smaller than the capacity
before the error. We know the blocks being read (16 blocks, starting at
0x7fd4ac) were within the drive's capacity before the error, and
outside the capacity after the error. We don't know when the capacity
changed, or who changed it.
The obvious candidates are:
- the Informix software
- the HSZ40 controller
- a bus reset, causing the drive to return to the most recently saved
capacity
It may take a fair amount of time and engineering support to find the
cause. Please escalate, so the right people can be identified and
assigned.
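
For anyone reading sense data like 05/21/00 by hand, here is a small decoding sketch. The two tables hold only the codes that appear in this note string, with their standard SCSI meanings; a real decoder needs the full tables from the SCSI specification. The function name is hypothetical, and the ASCQ of 00 for the 04/19 case is an assumption, since the earlier reply gave only the sense key and ASC.

```python
# Only the codes mentioned in this thread; not a complete table.
SENSE_KEYS = {0x04: "HARDWARE ERROR", 0x05: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x19, 0x00): "DEFECT LIST ERROR",
    (0x21, 0x00): "LOGICAL BLOCK ADDRESS OUT OF RANGE",
}

def decode_sense(text):
    """Decode 'key/asc/ascq' written as hex pairs, e.g. '05/21/00'."""
    key, asc, ascq = (int(part, 16) for part in text.split("/"))
    return (SENSE_KEYS.get(key, "unknown sense key"),
            ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ"))

print(decode_sense("05/21/00"))  # the error in the log
print(decode_sense("04/19/00"))  # what running out of spare sectors would give
```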
|
6736.9 | notes collision | WRKSYS::HOUSE | Kenny House, Workstations Engineering | Tue Jun 03 1997 08:23 | 26 |
| So far as I can tell, there are two issues in the basenote.
(1) The error log is quite explicit about the HSZ40's complaining about
an out-of-range logical block address used by a READ(10) command.
The LBA requested was 8377516(decimal), although the number of
sectors claimed in the disklabel was 8378028(decimal).
(2) Writing over the disklabel changed the geometry, so that the number
of sectors is now 8377528(decimal). Note that the flags now have
"dynamic_geometry" set, too.
The whole concept of a simple sector/head/track geometry is an
industry-wide falsehood. Zoned drives (with different number of
sectors per track) and RAID volumes, for example, do not have this
structure. It would be nice, however, if all logical blocks on this
"geometry" were addressable -- this does not seem to be the case in (1)
above.
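
The arithmetic behind (1) can be checked with a few lines. This is a sketch using only numbers already quoted in this note string (the 16-block read at 0x7fd4ac from .8, and the two sector counts above); the helper name is made up for illustration.

```python
def read_in_range(start_lba, block_count, capacity_blocks):
    """True if every block of the transfer lies below the capacity."""
    return start_lba >= 0 and start_lba + block_count <= capacity_blocks

START_LBA = 0x7FD4AC          # 8377516 decimal, from the error log
BLOCKS = 16                   # length of the failing READ(10)
OLD_CAPACITY = 8378028        # sectors claimed by the original disklabel
NEW_CAPACITY = 8377528        # sectors after the disklabel was rewritten

print(START_LBA)
print(read_in_range(START_LBA, BLOCKS, OLD_CAPACITY))
print(read_in_range(START_LBA, BLOCKS, NEW_CAPACITY))
print(START_LBA + BLOCKS - NEW_CAPACITY)  # blocks of the read past the end
```

The read fits under the old sector count but runs 4 blocks past the new one, which is consistent with the controller rejecting it as out of range.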
Do SAP or Informix bypass the normal file structure to get to the raw
drive? Are they likely to be writing the disklabel?
There is no indication of a "retry exhausted" error or "SCSI timeout"
in the information presented in this note string to date. Nor is there
clear evidence of a hardware problem.
-- Kenny House
|
6736.10 | | SSDEVO::ROLLOW | Dr. File System's Home for Wayward Inodes. | Tue Jun 03 1997 10:05 | 13 |
| Many database class applications on UNIX use the raw device;
it avoids any issues of whether the file system buffers the
data (sync, fsync or not) and it avoids a buffer copy. If
you remember that disk reads and writes have to be multiples
of the sector size it is also easy, using the same system calls
as reading and writing files.
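
A rough sketch of what "multiples of the sector size, same system calls" looks like in practice. It assumes a 512-byte sector and uses a temporary file as a stand-in for a raw device so it can run anywhere; the helper name is hypothetical, and a real raw device may impose buffer alignment rules that this does not show.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size

def read_sectors(fd, first_sector, sector_count):
    """Read whole sectors with pread(); offset and length are sector multiples."""
    offset = first_sector * SECTOR_SIZE
    length = sector_count * SECTOR_SIZE
    data = os.pread(fd, length, offset)
    if len(data) != length:
        raise IOError("short read: got %d of %d bytes" % (len(data), length))
    return data

# Stand-in for a raw device: a temp file of 8 sectors, each filled with its index.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(8):
        tmp.write(bytes([i]) * SECTOR_SIZE)
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
try:
    chunk = read_sectors(fd, first_sector=2, sector_count=3)
finally:
    os.close(fd)
    os.unlink(path)

print(len(chunk), chunk[0], chunk[-1])
```

An application doing this on a real disk sees no file system in between, which is why it can end up reading or writing the area where the disklabel lives unless a partition keeps it out.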
Since Digital UNIX disklabels have been around for a few years
most vendors that use raw disks have either figured out where
the label is and don't use it, or require the user to partition
the disk to protect the label. If this is the same disklabel
that got posted to the DIGITAL_UNIX conference this morning,
that's what that 32 sectors is in the A partition.
|
6736.11 | Not broken H/W | SMURF::KNIGHT | Fred Knight | Wed Jun 04 1997 16:06 | 19 |
| What most likely happened, is that some user labeled
this device BEFORE it was put into the HSZ40 (note that
there is NO dynamic geometry in the first disklabel).
Then, after installing it into the HSZ40, they just started
to use it (with the WRONG disklabel). After the error, they
put a NEW disklabel (now a correct one) on the media (now
note that dynamic geometry IS set). And magically, it now
works!
The only other option is the HSZ40 firmware bug that has
been BLITZed about conditions when the firmware would change
the size of a volume (not common, but still possible).
In both cases, NOTHING is broken in the H/W. If it's case
1, then educate your customer, if case 2, use the documented
firmware workaround.
Fred Knight
|
6736.12 | Hmm, did somebody INIT SAVE_CONFIG? | SSDEVO::JACKSON | Jim Jackson | Wed Jun 04 1997 18:46 | 25 |
| Sure, we've seen this type of error a bunch when folks got careless about
reusing disks. Here's a recipe for the problem:
1) Have a direct-connected SCSI disk. Put a filesystem on to it.
2) Move the disk to an HSZ40
3) INIT the disk from the HSZ40 console
4) ADD UNIT
At this point, the host sees a disk that has a valid filesystem on it. The
only problem is that the last few blocks have been lopped off by the HSZ40
to contain its metadata.
One of the rules we have in our lab is if you INIT it on the HSZ, then you
have to put a new filesystem on it (VMS INIT, Unix ??). Our documentation
has stated for eons that you should assume that an HSZ INIT destroys the
user data on the disk.
disklabel value 8378028
new value 8377528
-----------------------
difference 500
500 blocks is exactly the number of blocks consumed by SAVE_CONFIG. So, in
your case, it would appear that you had a JBOD with a filesystem on it, the
disk got an INIT SAVE_CONFIG, and a new filesystem was not put in place.
|