T.R | Title | User | Personal Name | Date | Lines |
---|
6649.1 | | LEFTY::CWILLIAMS | CD or not CD, that's the question | Mon May 05 1997 10:16 | 14 |
| T. Tran owns the libraries...
Joe Smith is our apps engineer for integration issues.
I seem to own global architecture issues, as of the last month or so.
You can start a discussion here, in MAGTAPE, or SCSI, or contact us
directly. Depending on what you want to see, it may or may not be
possible, given the 3rd-party origins of the libraries.
Yes, it's not defined. Yes, it needs to be. It won't happen instantly,
as the behavior seems to truly be undefined. Sigh.
Chris
|
6649.2 | Wobbly stake in the ground | DECWET::KOWALSKI | Official Beer Test Dummie | Mon May 05 1997 13:25 | 18 |
| Knowing there are more people interested, my feeling is that a more
public discussion would be appropriate to get their input.
From a requirements standpoint, I would like to see clear definition of
what a robot does on bus or device reset. Desirable behavior would be
similar to that for a sequential device in terms of completion of
operations in progress and return to a defined state, although I do not
understand the full implications of that w.r.t. robotic instruction
sets.
For example, what are the complications of stating that on reset, the
robot should complete the operation in progress, then return to the
home position? If that means that the robot ends up in home position
with a tape in its jaws, is this a physical, logical, or operational
problem for it?
Mark
|
6649.3 | | NABETH::alan | Dr. File System's Home for Wayward Inodes. | Mon May 05 1997 15:19 | 32 |
| For "operations in progress" do you mean in the sense of the
SCSI-2 Medium Changer spec. or in whatever components the
robot decomposes such commands? From the standpoint of the
SCSI-2 spec. something won't be left in the transport at the
end of an operation unless the destination of the Move Medium
was the transport. Internally, a robot might break up a slot
to slot move as:
Position to X of Slot
Position to Y of slot,
Pick medium
Position to X of destination
Position to Y of destination
Place medium
If you allow the internal subset of commands to be interrupted
by a device reset, then you should require that the robot be
left in a state from which software implementing the SCSI-2
spec. can recover. If the particular robot allows element to
transport moves (not all do), then leaving a tape in the transport
can be corrected. If the robot doesn't allow such moves and the
reset leaves the medium in the transport, manual intervention
is required and not desirable.
My opinion is that the operation should complete or abort as
though nothing had happened; complete the move or put it back
where it came from. The latter is hard to do on a TL820 Move
Medium from the inport. This might be even harder if you
happen to abort an Initialize Element Status. This causes a
physical inventory of most medium changers and would require
saving and restoring the starting inventory to roll it back
to the starting point.
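
The decomposition above can be sketched as a small model. This is a hypothetical illustration, not any robot's firmware: the step names simply mirror the list in this note, and both helper functions are invented here to show where a cartridge sits if a reset lands partway through a slot-to-slot move.

```python
# Hypothetical model of the internal steps of a slot-to-slot Move Medium.
# Step names follow the list in this note; nothing here is real firmware.
STEPS = [
    "position_x_source",
    "position_y_source",
    "pick",
    "position_x_destination",
    "position_y_destination",
    "place",
]


def cartridge_location_if_interrupted(steps_completed):
    """Return 'source', 'transport', or 'destination' for a reset that
    arrives after `steps_completed` internal steps have finished."""
    if not 0 <= steps_completed <= len(STEPS):
        raise ValueError("steps_completed out of range")
    if steps_completed <= STEPS.index("pick"):
        return "source"        # the pick step has not finished yet
    if steps_completed < len(STEPS):
        return "transport"     # picked but not yet placed
    return "destination"       # the place step finished


def needs_manual_intervention(location, transport_is_addressable):
    """A cartridge stranded in the transport is recoverable by software
    only if the robot accepts moves to and from the transport element."""
    return location == "transport" and not transport_is_addressable
```

Under this model, three of the six possible interruption points leave the medium in the transport, which is why the ability to address the transport matters so much for recovery.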
|
6649.4 | | LEFTY::CWILLIAMS | CD or not CD, that's the question | Tue May 06 1997 08:43 | 26 |
| Another issue: If the robot gets a reset during a move from a TZ8x to a
slot, and puts the media back where it came from, the media reloads,
and requires a mount/dismount cycle to get the drive to eject it again.
Most SW is not very good at getting media left in a picker back to a
spot where it can be used again - another issue.
Also, not all Libraries/Jukeboxes have bar code readers, to figure out
exactly what piece of media is where, after a "strange event", such as
reset, happens.
The RW5xx Optical Libraries always undo an operation interrupted by a
Reset. The Read Element Status after the reset is cleared will reflect
this. Because of the first issue above, this behavior is probably not
appropriate for the TL8xx libraries, even if it could be implemented.
Deterministic operation is required for error recovery. Note that this
does not mean that all Libraries have to do it the same way - it just
has to be documented and deterministic. In an ideal world, they would
all work the same, but we are already at a point where they do not.
The control SW is going to have to deal with that, unfortunately.
Good discussion so far... Thanks.
CHris
|
6649.5 | investigate problem, propose solution, take action | TAPE::SENEKER | Head banging causes brain mush | Tue May 06 1997 09:03 | 42 |
| What does "an interpretation" of the SCSI standard imply?
I would suggest that if this problem truly wants to be fixed that a
serious effort be made to characterize the impact of resets on all
operations, determine which operations are impacted, and produce a
report of suggested corrections. This report could then be reviewed
by the manufacturer(s), and software teams using the products and they
could respond.
In the optical space, I have found that the integration of the drive
and robots into a library make the understanding of the intent of the
SCSI spec more difficult. To me it shows that the spec attempts to
define how the devices should act but being created by humans it doesn't
always describe exactly what should take place.
For example, I had a problem where an optical library would complete a
command after a reset, return success to the host, but then "undo"
the operation. In this case the reset happened after the robot
had placed the media into the transport. This destroyed the software
to hardware mapping because the host was informed that the operation
had completed and the media was now in slot xxx, when really it was back
in the original location. We interpreted the SCSI spec as having
two options: 1) complete the command, return success, then execute the
reset, or 2) complete the command, return failure (bus was reset status),
and execute the reset, which causes the operation in progress to be undone.
Either of these would allow the host software to work correctly. After
we provided traces to HP, they found a hole in the firmware and decided
that option two was the best interpretation of the spec.
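
A minimal sketch of why either option keeps the host consistent while the firmware hole does not. It assumes the host updates its element map from the command status alone; the function name, the status strings, and the element names are hypothetical and do not come from OSMS or any real driver.

```python
def apply_move_status(inventory, source, destination, status):
    """Return the inventory the host may assume after a Move Medium.

    `inventory` maps element name -> medium name. `status` is 'success'
    or 'reset' (a stand-in for the "bus was reset" failure status).
    On 'reset' the host assumes the move was undone and changes nothing.
    """
    updated = dict(inventory)
    if status == "success":
        updated[destination] = updated.pop(source)
    elif status != "reset":
        raise ValueError("unknown status: %r" % (status,))
    return updated


before = {"slot_1": "PLATTER_A"}

# Option 1: success reported, medium physically at the destination.
host_view = apply_move_status(before, "slot_1", "drive_0", "success")
option_1_ok = host_view == {"drive_0": "PLATTER_A"}

# Option 2: reset reported, medium physically back at the source.
host_view = apply_move_status(before, "slot_1", "drive_0", "reset")
option_2_ok = host_view == {"slot_1": "PLATTER_A"}

# Firmware hole: success reported, but the medium is back at the source.
host_view = apply_move_status(before, "slot_1", "drive_0", "success")
hole_ok = host_view == {"slot_1": "PLATTER_A"}
```

The first two comparisons hold and the third does not: the reported status and the physical result disagree, so the host's map is wrong without the host having any way to know it.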
I was frustrated that this problem was not found by the people in
storage that verify SCSI operations of our storage devices. To me
the characterization of actions after a SCSI bus reset should be
critical to that process.
I hope that if an effort is made to clean up this issue for tape libraries,
the work is done in a broader, general-purpose robotic mechanism
context. If it cannot be done in that context, I hope the reasons
various actions are requested/required are well documented and recorded
so others in the future don't have to revisit the same problems.
Rob (from the optical jukebox world)
|
6649.6 | | DECWET::KOWALSKI | Official Beer Test Dummie | Wed May 07 1997 10:24 | 14 |
| Good discussion. I need to go back to review the SCSI-2 spec wrt
media changers. I'll be back in a couple weeks after some travel
and vacation.
>I would suggest that if this problem truely wants to be fixed that a
>serious effort be made to characterize the impact of resets on all
>operations, determine which operations are impacted, and produce a
>report of suggested corrections. This report could then be reviewed
>by the manufacture(s), and software teams using the products and they
>could respond.
Good idea. Anyone in SSAG want to form a working group?
Mark
|
6649.7 | | TAPE::SENEKER | Head banging causes brain mush | Thu May 08 1997 08:18 | 3 |
| I'll volunteer to represent software aspects for optical libraries.
Rob
|
6649.8 | Valid Element Address is a legal holding place.. | SUBSYS::TRAN | Straight <Left> Hitter.. | Thu May 08 1997 08:36 | 36 |
|
As Chris stated, I'm responsible for the tape libraries.
In my view, if the transport is a legal element address, then the robot
should be allowed to store a cartridge there when a reset aborts a move,
the same as a Move Medium to the transport. Recovery then means letting
the next command succeed without a hardware failure, and moving from the
transport to any legitimate element is a legal operation.
The transport is not a legal element address in the TL800 series or the
TZ8xx loader, but it is in the TL810 and TL820 series. Below is the list
of DLT tape library series my group owns.
DLT Tape Library Family Table
=============================
-----------------------------------------------------------------------------
Family        Product Vendor     DLT Drive INQUIRY Product Geometry
Name                  Equivalent           PID
-----------------------------------------------------------------------------
TL800 Series  TL891   LXB7110    TZ89      TL800           1 Drive/10 Slots
(1)           TL891   LXB7210    TZ89      TL800           2 Drives/10 Slots
TL810 Series  TL810   ATL 4/52   TZ87      TL810           4 Drives/52 Slots
              TL812   ATL 4/52   TZ88      TL810           4 Drives/52 Slots
              TL894   ATL 4/52   TZ89      TL810           4 Drives/52 Slots
TL820 Series  TL820   ATL 2640   TZ87      TL820           3 Drives/264 Slots
              TL822   ATL 2640   TZ88      TL820           3 Drives/264 Slots
              TL826   ATL 6/176  TZ88      TL820           6 Drives/176 Slots
              TL893   ATL 2640   TZ89      TL820           3 Drives/264 Slots
              TL896   ATL 6/176  TZ89      TL820           6 Drives/176 Slots
-----------------------------------------------------------------------------
NOTE: (1) TL891 + Upgrade Drive
=============================================================================
|
6649.9 | Should go to a slot! | LEFTY::CWILLIAMS | CD or not CD, that's the question | Thu May 08 1997 09:06 | 19 |
| As often is the case, I disagree with T on this one.
If a move is executed with a source and destination, the cartridge
should end up in either the source or destination if a reset occurs.
Any other behavior could cause loss of context, and require manual
intervention to find and put the cartridge back where it is supposed
to be (equivalent to Reset = failure!)
In the case of a DLT or other tape library, I would want the cartridge
to always end up in the storage slot, not the drive, due to the
previously mentioned load/unload issues. I'd be more open to options on
the optical disk side, though most of those have no media ID bar code
reader, so it is really easy to "lose" a piece of media if weird things
happen.
If the Library breaks, all bets are off, of course.
Chris
|
6649.10 | no special jukebox recovery needed by a properly-written application | DECWET::TRESSEL | Pat Tressel | Fri May 09 1997 04:34 | 99 |
| I'm going to play devil's advocate here...
(This is a modified version of a note I just sent to the TCR folks for
comments.)
I'm in the process of making the Unix changer driver behave nicely on a
shared bus. The Unix NetWorker folks and I talked about what sort of
recovery the jukebox and driver should do on a reset (or other disruptive
event), and concluded that...
...for the most part, the software needs *no* special recovery from either
the jukebox or the driver. (There are some small amounts of polite behavior
required, but these tend to fall in the category of not being "broken".)
Ok, *why* not?
-- A reset can mean that someone's been meddling with the media, e.g. they
opened the door and added tapes, or moved some up to "fill in the holes".
So, after a reset, it's unsafe for the application to assume that any
media are where they were before the reset. In particular, if the
changer was about to grab something out of a slot and put it into a
drive, the thing in that slot could be different now. It would be a
Very Bad Thing if someone's tape were to be overwritten because it was
assumed that the media hadn't shifted position.
-- Similarly, real recovery involves checking that the changer is
operating on the same media as it was before the reset. But...the
changer has no way to identify media -- only the application does.
(Bar codes might help here, but because they're not universal, the
application can't rely on their presence, so it'll have to be able
to deal with this problem by itself, anyway.)
-- There are other conditions besides resets that are associated with
disarranged media (e.g. someone pushing the eject button on a drive not
in a jukebox). These should also already be handled by the application.
-- Since there are already ways for media to move around behind the
application's back, it should already be checking that a newly mounted
medium is the right one. For a medium that's already in a drive, upon
receiving an error that could indicate meddling with the drive, the
application should verify that the same medium is still present (and
should reposition if needed).
This led us to conclude that *no matter what* sort of recovery the jukebox
or changer driver attempted, it would *not remove the requirement* that the
application verify the media identity. We also couldn't see that having the
changer do any moving around after a reset would help the application to
recover, and it might make recovery more difficult.
Since the application should already be verifying media identity after *any*
load into a drive (NetWorker does do this), whether there was an error or
not, then the application should already be able to prevent data loss due
to overwriting the wrong medium, or confusion due to trying to read the
wrong medium.
Regarding the "lost tape" syndrome: If the media are disarranged in the
slots, the application's bottom line recovery is to inventory all the slots.
This can be done in a "lazy" manner -- as long as the application is finding
the media it wants, it doesn't have to do an inventory. It's only when the
medium it wants is *not* found in the expected slot that it will have to go
hunting.
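
The "lazy" inventory described above might look like the following sketch. It is not NetWorker code: `find_medium` and the `read_label` callback are hypothetical names, and the callback stands in for whatever the application does to load a medium from a slot, read its label, and put it back.

```python
def find_medium(wanted_label, expected_slot, slots, read_label):
    """Lazy inventory: check the expected slot first, and scan the
    remaining slots only if the wanted medium is not there.

    `slots` is an iterable of slot addresses. `read_label(slot)` is a
    hypothetical callback that returns the label of the medium in
    `slot`, or None if the slot is empty.
    Returns the slot holding the wanted medium, or None if not found.
    """
    if read_label(expected_slot) == wanted_label:
        return expected_slot            # common case: no hunting needed
    for slot in slots:
        if slot == expected_slot:
            continue                    # already checked above
        if read_label(slot) == wanted_label:
            return slot
    return None
```

The expensive scan runs only when the first check fails, so an undisturbed jukebox costs one label read per mount, which the application should be doing anyway.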
In order to reduce the probability of disarrangement (other than by means of
some person opening the jukebox door and moving things around), we *don't*
want the jukebox to go squirreling things away by itself. That is, if a
reset or other failure (e.g. power) occurs while medium is in the picker,
the medium should *stay* in the picker. This allows the application to load
the medium in a drive, find out what it is, and store it in the correct slot,
or complete whatever operation it wanted done with that medium. This is much
less labor-intensive for the application than recovering from a "lost tape"
by doing an inventory. (Recall that the changer can't tell which medium it's
got hold of. Someone could even have pulled the medium out of the picker and
substituted another one, so it can't assume it has the same one as it did
before the disruptive event.)
You may be saying to yourself just now, "But how does the application find
out there's medium in the picker? And *which* picker, if it's a multi-tower,
and was in the middle of passing the medium along the row?" Well, this is
the prime case of "polite behavior" that I mentioned above. The application
does need a way to discover these things, and, because this is a form of
failure that occurs *only* in changers (i.e. doesn't have an analogue in a
bare drive), then applications may not already deal with this case. But, as
long as the medium remains in the picker, and the application does get an
error (e.g. on the next operation involving that picker or the medium in it),
then it can assume the move failed, and check all pickers. The delivery of
an error is likely to be dealt with in the driver -- it may happen "naturally"
as later operations fail, or as the driver becomes aware of the bus reset.
This is the other half of the polite behavior: There must be a way to get
the picker to put the medium into a drive. (I'd consider the jukebox to be
"broken" if there were no way to do this.) This seems to be working now in
the tape jukeboxes we have -- NetWorker is already able (somehow) to get the
picker to relinquish its medium. But it would be good to have a reasonably
small set of actions that (among them) would allow getting the medium out of
the picker and into a drive. I don't know how to do this -- suggestions
would be appreciated.
-- Pat Tressel
|
6649.11 | | LEFTY::CWILLIAMS | CD or not CD, that's the question | Fri May 09 1997 09:02 | 28 |
| Given that all elements in the JB, including the input element, picker,
drives, slots, transfer elements, etc, are supposed to have a unique
element address, and an element type reflecting what they are, it is
indeed possible in the general case for the application to move media
from any element to any other element. A move from a slot to a drive
usually implies moving through a transport element, like a picker.
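
As a sketch of what "any element to any other element" means on the wire, the following builds a 12-byte Move Medium command block. The layout (opcode A5h, then 16-bit transport, source, and destination element addresses, then an invert bit) and the element type codes are given as recalled from the SCSI-2 medium changer command set and should be checked against the spec. The function name is hypothetical, and real element addresses are device-specific.

```python
import struct
from enum import IntEnum


class ElementType(IntEnum):
    """SCSI-2 medium changer element type codes."""
    MEDIUM_TRANSPORT = 1   # picker
    STORAGE = 2            # slot
    IMPORT_EXPORT = 3      # inport/outport
    DATA_TRANSFER = 4      # drive


MOVE_MEDIUM = 0xA5  # operation code


def build_move_medium_cdb(transport, source, destination, invert=False):
    """Pack a Move Medium CDB. All addresses are 16-bit element
    addresses; the source and destination may be any element type the
    device allows, including the transport itself."""
    for address in (transport, source, destination):
        if not 0 <= address <= 0xFFFF:
            raise ValueError("element addresses are 16-bit values")
    return struct.pack(
        ">BBHHHHBB",
        MOVE_MEDIUM,
        0,                      # LUN / reserved
        transport,
        source,
        destination,
        0,                      # bytes 8-9 reserved
        1 if invert else 0,     # byte 10, bit 0: invert
        0,                      # control byte
    )
```

Recovering a medium left in a picker is then just a move whose source address is the transport element, provided the device reports the picker as full and accepts it as a source.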
The problem comes in the fact that some JB's cannot determine whether
there is media in some of their elements, due to lack of sensors, bad
design, etc. This makes it difficult to deal with media left in a
picker after a state change. Not impossible, just harder.
If all JB vendors were religious about implementing all the allowed
fields for media tracking, media detection, etc, it would be easier.
They are not today.
If the applications are written to properly do all the required error
recovery, then .10 has validity. Most of the apps I've seen are not
very good at finding media in pickers. They do fairly well finding
media lost in a slot or drive, but picker issues give them grief, as
they do not seem to understand the concept of a picker as a separate
element. Thus my recommendation to put the media back in the slot it
came from, which almost all JB's have the intelligence to do.
CHris
|
6649.12 | need a common design goal | TAPE::SENEKER | Head banging causes brain mush | Fri May 09 1997 09:45 | 25 |
| RE: .8,.10
I agree, conceptually.
As Chris stated in .11, not all jukeboxes/libraries are created
equal. If the various software drivers and applications that are
used to control these beasts do not take that into account, then
control problems are going to happen.
What I am suggesting is that an effort be made to define what
the perfect jukebox/library world would look like, then design
software systems for use with "real-world" systems while documenting
their limitations and differences from the "perfect world".
Maybe with enough information and a team Digital instead of team
Networker, team MRU, team SEP tapes, team SEP optical, team OSMS, etc.,
we could work together and get the manufacturers to build some real
good jukebox/library systems. If a small company like Perceptics
can get jukebox manufacturers to make changes, then Digital should
have enough pull to get things changed.
Maybe I am a dreamer, but if you don't strive for your dreams
you sure as hell won't ever get to them.
Rob
|
6649.13 | | DECWET::RWALKER | Roger Walker - Media Changers | Fri May 09 1997 10:55 | 29 |
| Assuming that the current move request will not ever report
completion after a bus reset, the cleanest option is for
the hardware to put the media back where it started. This
will eliminate any recovery action by an application that
tracks the locations since they will not have changed their
status yet.
This isn't always possible, so the second preferred action would
be to complete the move. This will cause a mismatch in state
since the application will not receive the completion status.
If the application logs the planned move to disk then it can
quickly verify if the element state matches.
The third is just to stop with the media in a valid location,
including the transport if the device allows moves to and
from the transport. If the device leaves the media somewhere
else then it is broken.
The worst case here is that a move is requested and the node making the
move goes down. The bus is reset and the application restarts
on the other node. If it only had in-memory tracking of the move
request, it will not know where the media was from or going to.
If the jukebox was powered off, it will not know where it came
from either.
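
Logging the planned move to disk, as suggested above, could be sketched like this. All names and the JSON file format are assumptions made for illustration. In a failover setup the log would have to live on storage that both nodes can reach, and `element_status` stands in for the full/empty flags a Read Element Status would return.

```python
import json
import os


def log_planned_move(path, medium, source, destination):
    """Record the intended move durably BEFORE issuing Move Medium."""
    record = {"medium": medium, "source": source, "destination": destination}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)       # atomic rename: never a half-written log


def clear_planned_move(path):
    """Call after the completion status has been received and applied."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def pending_move(path):
    """Return the logged move left by a crashed node, or None."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def resolve_after_restart(record, element_status):
    """Compare a pending move with element full/empty flags.

    `element_status` maps element address -> True if the element is full.
    """
    source_full = element_status.get(record["source"], False)
    destination_full = element_status.get(record["destination"], False)
    if destination_full and not source_full:
        return "completed"
    if source_full and not destination_full:
        return "not_moved"
    return "needs_verification"   # e.g. medium in the picker, or both full
```

Full/empty flags say nothing about which medium is present, so even a "completed" or "not_moved" answer only narrows the search; the label still has to be verified before the medium is used.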
Verifying that the proper media is loaded in the drive before use
is a key safety factor here but it does not lead to easy recovery
for the application without user intervention. It would be better
to avoid this if possible.
|
6649.14 | Need consistency in closed loop. | SUBSYS::TRAN | Straight <Left> Hitter.. | Fri May 09 1997 12:00 | 19 |
|
As Rob stated, consistency is the key word here.
As far as the current DLT tape library implementation goes:
The TZ8xx Loader and TL800 Series will complete the move in case of a
reset once it has started, since the transport (picker) is not a valid
element. The TL810/TL820 Series may end up with the cartridge in the
picker, depending on how far along the move is; it may also complete
the move as well for the same reason.
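
The behavior just described can be captured as a small lookup that control software might consult after a reset. This is only a sketch: the table restates this note (plus the assumption that a move that never started leaves the cartridge at its source), and the function and dictionary names are hypothetical.

```python
# Where a cartridge can be after a reset interrupts a move, per family.
# 'source' is included on the assumption that the move may not have
# started yet; the other entries restate the behavior described above.
POSSIBLE_AFTER_RESET = {
    "TZ8xx Loader": ("source", "destination"),
    "TL800 Series": ("source", "destination"),
    "TL810 Series": ("source", "transport", "destination"),
    "TL820 Series": ("source", "transport", "destination"),
}


def places_to_check(family, source, destination, transport=None):
    """Return the element addresses worth examining after a reset.

    Unknown families fall back to checking every candidate location.
    `transport` may be None when the picker is not an addressable element.
    """
    outcomes = POSSIBLE_AFTER_RESET.get(
        family, ("source", "transport", "destination")
    )
    where = {"source": source, "transport": transport,
             "destination": destination}
    return [where[name] for name in outcomes if where[name] is not None]
```

Even without identical behavior across libraries, a documented table like this is what makes the recovery deterministic from the software's point of view.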
Asking the hardware to decide what to do based on conditions during a
reset is complicated. I'm talking about: if the destination is the drive,
then do one or two things; if it's a slot, then do another; if it's a port,
then yet another. If we can sort these out and come up with some way
that covers all conditions, then I'm with it.
T.
|
6649.15 | accepting our limitation as well | TAPE::SENEKER | Head banging causes brain mush | Fri May 09 1997 16:26 | 42 |
| RE: .13
Roger, good points. I would also like to point out that while we
are critiquing the programmed actions of various jukebox/library
robotic mechanisms, we need to distinguish between limitations
of the hardware and limitations of our own software.
Examples:
1) OSMS could do moves to and from the robot but it doesn't. It
always does moves to and from data transfer elements, data storage
elements, or import/export elements. Due to an "implementation"
limitation there is no way to ask OSMS code to recover from media
being stuck in a robot/picker without human assistance. This
limitation could be removed but then we are back to the "real-world"
again with code changes, project schedules, impact, etc.
2) OSMS could do volume (data set) validation upon media insertion
into a data transfer element, but again it doesn't. The initial
product was designed when the customer demand was driven by
requirements where minimal swap times were more important than data
integrity. Again this feature could be added.
Rarely does the customer base see the impact of these implementation
decisions since the hardware used with OSMS is very reliable. But
these are examples of cases that show a need to document how we
accept the "real world" while dreaming of how we would like to see the
"perfect world".
Time to market considerations and most engineers' desire to just get
something working make it difficult to see that these implementation
decisions get documented. But if it is to be done, the engineers are
really the only ones that will get it done. It would help a lot if
management made this level of documentation a project requirement. I
try from time to time but I am guilty of having more of this
information in my head than down on paper or as part of some project
documentation.
As I ramble on I hear myself saying, "how much quality is good
enough?".
Rob
|
6649.16 | NetWorker for Unix requirements | DECWET::TRESSEL | Pat Tressel | Fri May 09 1997 21:01 | 13 |
| NetWorker for Unix can live with any of the following recovery behaviors:
-- Leave the tape in the picker.
-- Complete the move.
-- Undo the move.
The one thing we *don't* want is to have the medium put in some location
that was not part of the original move, i.e. it should not be parked in
some slot that was neither the source nor the destination.
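
Stated as a check, the requirement above is simply that the medium's final location be one of the three elements named in the original move. A minimal sketch, with a hypothetical function name and arbitrary element addresses:

```python
def classify_reset_outcome(final_location, source, destination, transport):
    """Classify where a medium ended up after a reset interrupted a move.

    Arguments are element addresses. Anything other than the source, the
    destination, or the transport used for the move is unacceptable.
    """
    if final_location == destination:
        return "completed"
    if final_location == source:
        return "undone"
    if final_location == transport:
        return "left_in_picker"
    return "unacceptable"
```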
-- Pat
|
6649.17 | a plea for OSMS mount verification | DECWET::TRESSEL | Pat Tressel | Fri May 09 1997 21:16 | 30 |
| Rob --
> 2) OSMS could do volume (data set) validation upon media insertion
> into a data transfer element, but again it doesn't.
Ouch.
> Again this feature could be added.
Please! People *do* open up their jukeboxes and rearrange media...
(Not necessarily the customer -- it can happen while the jukebox is being
serviced.)
What OSMS might do is to make sure the label name field in the disk label
is filled in when a new platter side is initialized, then read that name
when the platter is loaded.
> Rarely does the customer base see the impact of these implementation
> decisions since the hardware used with OSMS is very reliable.
What's saving OSMS is that the thing that's on the media is a filesystem,
not raw data. So after the thing is mounted, it'll usually be quite
noticeable that the expected files and directories are not present. But
if someone were *writing*, and the pathname didn't already exist, then
they wouldn't get an error, but their data would not be where they want
it. And if they were writing to a pathname that happened to exist on
both the real and incorrectly loaded filesystems, they'd overwrite
something they didn't mean to.
-- Pat
|
6649.18 | see quoted string after my name | TAPE::SENEKER | Head banging causes brain mush | Mon May 12 1997 11:20 | 26 |
| Pat,
Your assessments are all correct, and your pleas have been made
by others in the past. OSMS is 9 years old (based on the age of
its parent product, LaserStar), and problem history, such as IPMT, has
shown that good quality hardware, customer education, and software
error detection equate to a well-performing product.
In OSMS's case, the consideration to improve data integrity involves
these product aspects: product swap performance, product design and code
changes, and customer problem reports. OSMS is an old, stable, and
minimally funded product. The slight degradation in swap performance due
to media validation, rare to non-existent problem reports, and the very
real cost to modify the product have prevented any changes from occurring.
Designs are on the shelf to implement these changes but business
justification does not exist to fund the work.
Also, in OpenVMS and OSMS terminology, validation of media after it has
been placed in a drive is different than mount verification. OSMS
supports the needs of OpenVMS mount verification.
Sigh. If only the technical aspects of product functionality
determined its next project definitions.
Rob
|
6649.19 | | DECWET::TRESSEL | Pat Tressel | Thu May 15 1997 02:40 | 23 |
| Rob --
> Also, in OpenVMS and OSMS terminology, validation of media after it has
> been placed in a drive is different than mount verification.
Right, but checking that the same medium is still in a drive after a bus
reset is the equivalent of mount verification, just as though the drive
had dropped offline. This is not a changer issue, but is relevant to
the impact of bus resets on the jukebox as a whole, including drives.
Resets and offlines both can indicate that media have been meddled with.
(Or sometimes not... I remember, back in the old days, having a TU78 that
would occasionally lose vacuum. After I got the thing going again, VMS
would wind it aaaaall the way back to the beginning to read the label.)
> rare to non-existant problem reports
Right -- OSMS can get away with it because the user can recognize the
media. NetWorker can't (get away with it, I mean), because the media
have no cues to their identity other than their labels. We think we've
got a Plan that will protect NetWorker media across resets and failovers,
without too hideously much overhead... ;-)
-- Pat
|
6649.20 | media verify .vs. VMS mount verify | TAPE::SENEKER | OSDS/OSMS, 1992-1997, R.I.P. | Thu May 15 1997 09:27 | 47 |
| Pat,
RE: .19
I agree. This reply is not meant to be argumentative, but an explanation
of a technical subtlety.
> Right, but checking that the same medium is still in a drive after a bus
> reset is the equivalent of mount verification, just as though the drive
> had dropped offline. This is not a changer issue, but is relevant to
> the the impact of bus resets on the jukebox as a whole, including drives.
> Resets and offlines both can indicate that media have been meddled with.
OSDS/OSMS uses the OpenVMS "medium offline" status return mechanism in
response to SCSI bus resets to allow OpenVMS to initiate mount verification
for a file structured mounted optical disk drive, either standalone or as
part of a jukebox. I agree, in this case it is not a changer issue. Other
"offline" conditions also start the same mechanism.
For technical discussions I believe it is important to distinguish
between the above "OpenVMS mount verification" and a more generic "media
verification". "Media verification" is the process of ensuring that a
particular media still matches the associated software data structures
that are used to identify the media to a software control system. By this
definition "OpenVMS mount verification" is a "media verification" process
but this process is limited to the needs of OpenVMS for file structured
mounted volumes only.
OSMS uses a "trusted" partner/human assistance process to ensure the
association between physical media and software data structures is
maintained. If the "trust" is broken, then the media to software association
is called into question and OSMS disallows access to the media until human
assistance verifies the association is valid. The biggest problem with
this process is the detection of broken "trust".
OSMS's "trust" process works well but could be improved. Areas of weakness
are:
o media transportation induced disassociations
o pro-active detection of media disassociations instead of re-active
detection, for example detecting a person exchanging slot 10 and slot 20
media before the system/application makes the next I/O request to either
of those systems.
Please see the next note for the next step beyond "media verification".
Rob
|
6649.21 | separate processes, verify and correction | TAPE::SENEKER | OSDS/OSMS, 1992-1997, R.I.P. | Thu May 15 1997 09:59 | 21 |
|
The next step beyond "media verification" is "media disassociation
correction". This is the corrective action process started once a
disassociation between the software and the media is detected.
In the OpenVMS mount verification world this process involves the
OPCOM messages and the periodic retry of the mount verification induced
I/O operations. This process relies on external intervention since it
cannot do anything more than put out the OPCOM messages, not being able
to take any corrective action itself.
In a jukebox environment, many software controlled corrective action
opportunities exist.
I hope this summarizes my ideas that mount verification is not the same
as media verification and media disassociation correction, and that some
discussions require the distinction before real communication takes place.
Thank you for your patience, I will step down off my high-horse now.
Rob
|