Title: | DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1) |
Notice: | Welcome to the Digital UNIX Conference |
Moderator: | SMURF::DENHAM |
|
Created: | Thu Mar 16 1995 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 10068 |
Total number of notes: | 35879 |
8778.0. "Getting ctape_strat messages intermittently" by RHETT::HEBERT () Mon Feb 10 1997 15:56
This is information from my customer about the tape errors and
conditions he is getting. Thanks in advance for any help!
Armand Hebert
CSC - Atlanta
___________________________________________________________________
> 1. what is the firmware of the tape drive (from uerf)
from scu mt0 is 9B3C and mt1 is 930A;
we don't know if one or both were contributing to the problem.
note that both are tz877's and using nsr jukebox capability.
> 2. how often are these happening now?
the dramatic loops have occurred twice, once last July on a 2100 3.2d-1
and Sunday on an 8400 v3.2g.
two ctape_strat messages were showing with every reboot of the 2100
last summer, apparently with NSR startup. that stopped sometime before
november (2100A upgrade and v3.2g occurred in october).
i don't have console manager logs before november, i did check both
the 2100A and 8400 logs and Sunday's event was the only time since
november we've seen those messages on the console.
summary: not frequent, not recreatable at will, but ugly when it
happens.
> Have you explored a hardware problem at all
the following all sort of imply it isn't hardware, at least to me:
no errors were logged for either device.
the problem was temporarily recreatable (stopping and restarting
the nsr clone process stopped and recreated it).
the problem stopped "by itself"
(perhaps with the nwadmin mount/dismount?
perhaps by the scu or mt commands i issued?).
the problem occurred last July, different system, earlier OS release
(tz87 fw of 971B or 9B3C).
the cloning job when the errors were being generated seemed to be
working fine (nsr reported both reading and writing... both tapes
active). however, that doesn't at all rule out the device being in
some condition causing these messages that firmware could rectify.
the appearance is that cam_tape is reporting this condition when
it apparently isn't a problem, it could be "fixed" by simply
commenting out the printf (but that could void some other situations
where it really is a problem that needs to be reported).
i don't have the time to start reading cam_tape from scratch to
understand under what conditions this message is displayed.
somebody who is familiar with cam_tape may be able to tell us what
the conditions are to attempt to simulate it?
i had posted the problem to alpha_osf_managers last July before
opening the first SRQ, others had seen it but had no answers.
any existing QAR's or tz87 firmware release notes which imply a
resolution?
in the event it isn't a known problem and we can't concoct how to
simulate it, any suggestions on what we might do for additional
problem isolation (information gathering) when it occurs again
(which may be another 6 months)?
if the messages hadn't stopped after the scu|mt|re-mount i had
planned on doing an scu device reset and then power reset of the
tz877's to see if that cleared the condition, but didn't have to.
in all likelihood something i did cleared the condition, i'm not
sure which, but...
The /nsr/logs/daemonlog shows:
Sat 16:30:26 /dev/nrmt0h label without mount operation in progress
nsrmmdbd: media db is saving its data, this may take a while
nsrmmdbd: media db is open for business
nsrd: Sun Feb 2 07:25:31 1997
nsrmmdbd: media db is cross checking the save sets
nsrmmdbd: media db is open for business
Sun 07:32:23 /dev/nrmt0h mount operation in progress
nsrd: Sun Feb 2 10:30:36 1997
nsrmmdbd: media db is cross checking the save sets
nsrmmdbd: media db is open for business
Sun 10:35:41 /dev/nrmt0h mount operation in progress
Sun 10:43:20 /dev/nrmt0h mount operation in progress
Sun 10:57:59 /dev/nrmt1h mount operation in progress
Sun 10:59:54 /dev/nrmt0h mount operation in progress
Sun 11:04:21 /dev/nrmt0h mount operation in progress
Sun 11:09:33 /dev/nrmt1h mount operation in progress
Sun 11:13:57 /dev/nrmt1h mount operation in progress
Sun 11:18:12 /dev/nrmt1h mount operation in progress
Sun 11:20:14 /dev/nrmt0h mount operation in progress
Sun 16:00:45 /dev/nrmt0h unmount operation in progress
and /nsr/logs/messages shows:
Feb 2 07:25:31 glacier syslog: NetWorker Server: (notice) started
Feb 2 07:26:10 glacier syslog: NetWorker Server: (info) Portions Copyright ) Digital Equipment Corporation 1995. All rights reserve
Feb 2 07:26:21 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for sxclm.sois.alaska.edu
Feb 2 07:26:22 glacier syslog: NetWorker index: (notice) nsrck is compressing index for sxclm.sois.alaska.edu
Feb 2 07:26:25 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for nugget.alaska.edu
Feb 2 07:26:25 glacier syslog: NetWorker index: (notice) nsrck is compressing index for nugget.alaska.edu
Feb 2 07:26:28 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for glacier.alaska.edu
Feb 2 07:26:35 glacier syslog: NetWorker index: (notice) nsrim has finished cross checking the media db
Feb 2 07:27:01 glacier syslog: NetWorker index: (notice) nsrck has completed cross-check
Feb 2 07:30:59 glacier syslog: NetWorker media: (info) tz87 tape Backups.031073 will be needed for a recover
Feb 2 07:31:11 glacier syslog: NetWorker media: (info) suggest mounting Offback.030106 for backup to pool 'Offback'
Feb 2 07:31:11 glacier syslog: NetWorker media: (waiting) backup to pool 'Offback' waiting for 1 writable backup tape
Feb 2 07:31:13 glacier syslog: NetWorker Media: (info) loading volume Offback.030106 into /dev/nrmt0h
Feb 2 10:30:37 glacier syslog: NetWorker Server: (notice) started
Feb 2 10:31:15 glacier syslog: NetWorker Server: (info) Portions Copyright ) Digital Equipment Corporation 1995. All rights reserve
Feb 2 10:31:25 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for sxclm.sois.alaska.edu
Feb 2 10:31:26 glacier syslog: NetWorker index: (notice) nsrck is compressing index for sxclm.sois.alaska.edu
Feb 2 10:31:27 glacier syslog: NetWorker media: (info) tz87 tape Offback.030106 was being written before crash
Feb 2 10:31:30 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for nugget.alaska.edu
Feb 2 10:31:31 glacier syslog: NetWorker index: (notice) nsrck is compressing index for nugget.alaska.edu
Feb 2 10:31:32 glacier syslog: NetWorker index: (notice) nsrim has finished cross checking the media db
Feb 2 10:31:33 glacier syslog: NetWorker index: (notice) nsrck is cross-checking index for glacier.alaska.edu
Feb 2 10:31:46 glacier syslog: NetWorker media: (info) read 395 records in Backups.031073 into /dev/nrmt1h
Feb 2 11:19:04 glacier syslog: NetWorker media: (info) suggest mounting Offback.030109 for backup to pool 'Offback'
Feb 2 11:19:04 glacier syslog: NetWorker media: (waiting) backup to pool 'Offback' waiting for 1 writable backup tape
Feb 2 11:19:05 glacier syslog: NetWorker Media: (info) loading volume Offback.030109 into /dev/nrmt0h
The system was rebooted ~7:25. NSR cloning was started after that.
The system panic'd and restarted around 10:30 and that's when the
ctape_strat messages started. According to the nsr daemonlog,
only mt0 was re-mounted by nsr prior to 10:57. My speculation on
Sunday was that only the dismount|mount of mt1 generated the
messages (see the first work order on the SRQ).
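The claim that only mt0 was re-mounted between the restart and the manual remount of mt1 can be checked mechanically against the daemonlog. The following is a minimal sketch, not part of the original report: it assumes each event line has the shape "<day> HH:MM:SS <device> <mount|unmount> operation in progress", uses a subset of the Sunday lines quoted above, and the function names are illustrative only.

```python
import re
from datetime import time

# Subset of the Sunday lines quoted from /nsr/logs/daemonlog above.
DAEMONLOG = """\
Sun 07:32:23 /dev/nrmt0h mount operation in progress
Sun 10:35:41 /dev/nrmt0h mount operation in progress
Sun 10:43:20 /dev/nrmt0h mount operation in progress
Sun 10:57:59 /dev/nrmt1h mount operation in progress
Sun 10:59:54 /dev/nrmt0h mount operation in progress
Sun 16:00:45 /dev/nrmt0h unmount operation in progress
"""

# Assumed line shape (an assumption, inferred from the excerpt only).
EVENT_RE = re.compile(
    r"^(?P<day>\w{3}) (?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) "
    r"(?P<device>/dev/\S+) (?P<op>mount|unmount) operation in progress$"
)


def mount_events(log_text, day):
    """Return (time, device, operation) tuples for one day of the log."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match and match["day"] == day:
            stamp = time(int(match["h"]), int(match["m"]), int(match["s"]))
            events.append((stamp, match["device"], match["op"]))
    return events


def devices_mounted_between(events, start, end):
    """Return the sorted devices with a mount event in [start, end)."""
    return sorted(
        {dev for stamp, dev, op in events if op == "mount" and start <= stamp < end}
    )


events = mount_events(DAEMONLOG, "Sun")
restart = time(10, 30, 0)         # approximate restart time after the panic
manual_remount = time(10, 57, 0)  # manual remount of mt1
print(devices_mounted_between(events, restart, manual_remount))
```

Run against the excerpt, this lists only /dev/nrmt0h for the window between the restart and 10:57, which matches the reading of the log given above.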
Here's my speculation based on all the above information:
The problem is triggered by a tape remaining mounted across a panic
(or possibly any reboot).
It was mt1 generating the messages, that was the read half of the
clone.
The manual remount of mt1 @10:57 cleared the condition causing the
messages.
It's possible mt0 had the condition as well, but NSR's remount of the
device with start-up cleared that.
Can you follow my reasoning on this... if not we may wish to switch
to phone (verbal)... (907)474-6266. If so, here's what I recommend:
You have somebody confirm that a mount would likely clear the
condition (it's a good bet).
I add dismount + remount to our local nsr startup procedures
(this should prevent us from getting in the infinite looping
condition ever again).
You QAR the problem for either general change to NSR start procedures
to always re-mount on startup, and|or cam_tape fixing this
situation. Note, we can likely prevent the problem with NSR...
but anything else could encounter it.
I think I may have just isolated the problem, what do you think?
Might even be recreatable if somebody can invest the effort.
kurt
_____________________________________________________________________
Kurt Carlson, University of Alaska SOIS/TS, (907)474-6266
[email protected] 910 Yukon Drive #105.63, Fairbanks, AK 99775-6200
__________________________________________________________________________
T.R | Title | User | Personal Name | Date | Lines |
---|
8778.1 | Can we get context here? | DECWET::FARLEE | Insufficient Virtual um...er.... | Tue Feb 11 1997 19:05 | 16 |
| Greetings.
I'm in the NetWorker engineering group, and I'm trying to make sense of this...
It appears as if this is one entry in an ongoing problem report, but we
don't have the full context:
There are references to "the ctape_strat messages", but not the full message
text.
There are references to "the big ugly loop", but not a description (that I can
see) of what is looping, and what the symptoms are.
If you can provide a bit more context, perhaps I can help from an application
front.
Kevin Farlee
|
8778.2 | | DECWET::RWALKER | Roger Walker - Media Changers | Wed Feb 12 1997 16:10 | 10 |
| I would expect that these are the "Read density invalid" errors
reported by the tape driver in function ctape_strat, normally
at system startup.
This was QARed prior to the release of DIGITAL UNIX 4.0 but by
the time the tape driver developer could get to the problem we
could not replicate it and neither could they.
If UEG can be provided a set of steps to reproduce the problem
reliably I'm sure they would be happy to fix it.
|
8778.3 | ctape_strategy messages | FRAIS::KHAN | | Thu Feb 20 1997 11:36 | 12 |
| Yes, we also have a customer getting these messages during startup.
The sequence is like:
...vmunix: Starting CPU...
...vmunix: SuperLAT. C......
...vmunix: fta0: Link...
...vmunix: ctape_strategy: READ case and density info not valid
...vmunix: ctape_strategy: READ case and density info not valid
...
The customer thinks that the tapes are slower than before.
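To see how often the warning appears in a saved console or message log, a short scan is enough. This is a minimal sketch under stated assumptions: the sample lines are shortened stand-ins for the console output quoted above, and the function name is illustrative, not part of any Digital UNIX tool.

```python
# Warning text as it appears in the console output quoted above.
WARNING_TEXT = "ctape_strategy: READ case and density info not valid"


def find_density_warnings(lines):
    """Return (line_number, text) for every line containing the warning."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if WARNING_TEXT in line
    ]


# Stand-in sample; a real log would be read from a file instead.
sample = [
    "vmunix: fta0: Link",
    "vmunix: ctape_strategy: READ case and density info not valid",
    "vmunix: ctape_strategy: READ case and density info not valid",
]
hits = find_density_warnings(sample)
print(len(hits), [number for number, _ in hits])
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which makes it easy to compare logs from different reboots.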
/Azfar
|
8778.4 | after tape added ? | FRAIS::KHAN | | Thu Feb 20 1997 11:54 | 8 |
| I talked to the customer and got some more information.
To my question as to when the messages started, he thinks it is since
the second tape drive (TZL07) was added. The first tape drive is a
TK87 (same bus).
He did not build a new kernel, but just MAKEDEV'd it. (Maybe the
message disappears after a kernel rebuild.) But at present we get the
message on reboot. Does anyone have any troubleshooting ideas?
/Azfar
|
8778.5 | | NETRIX::"[email protected]" | Jan Reimers | Thu Feb 20 1997 22:00 | 41 |
| The ctape_strategy messages are coming from the CAM tape driver.
The message is a warning that something has not been set up correctly
for this particular tape drive inside the tape driver.
This could be an indication that the tape drive is not responding
correctly to a MODE SENSE command, which is used to determine, among
other things, the density at which the tape has been written.
It is unlikely that this is the cause for the message, however.
More likely is that there is some other application which is "stealing"
the UNIT ATTN signal from the tape driver. I think what is happening
here is that some other application is accessing the tape drive through
the user agent (/dev/cam) before the tape driver has opened the device
for the first time. The tape driver will normally do a MODE SENSE if
the tape is either at BOM or has the UNIT ATTN set. According to the
SCSI-2 standard the tape drive will hold the UNIT ATTN for that initiator
until the first command, at which time, the UNIT ATTN is released. If
an application accesses the tape drive through the user agent before
the first open with the tape driver, then the UNIT ATTN is no longer
available to the tape driver and it will not know to do the MODE SENSE
and set up the density.
Are these messages seen ONLY at startup time?
Or are they seen at other times when the tape is being read?
If only at startup time, can they be reproduced if the tape is ejected,
reinserted, and whatever application is accessing the tape drive through
/dev/cam is started?
If this is the correct scenario, a possible workaround would be to open
the tape drive through the tape driver immediately after a tape has been
inserted. If that makes the messages disappear, then we need to figure
out what is "stealing" the UNIT ATTNs.
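The suspected sequence, and why opening the drive through the tape driver first would avoid it, can be illustrated with a toy model. This is only a sketch of the reasoning above, not the CAM tape driver's actual code: the class and function names are hypothetical, a single initiator is assumed, and the SCSI behaviour is reduced to "the first command consumes the pending UNIT ATTN".

```python
class ToyTapeDrive:
    """Toy drive: one pending UNIT ATTN, consumed by the first command."""

    def __init__(self):
        self.unit_attention = True  # pending after power-on, reset, or media change
        self.at_bom = False         # assume the tape is not at beginning of medium

    def command(self, name):
        """Run a command; return True if it saw the pending UNIT ATTN."""
        saw_attention = self.unit_attention
        self.unit_attention = False
        return saw_attention


class ToyTapeDriver:
    """Toy driver: does MODE SENSE on open only if it sees UNIT ATTN or BOM."""

    def __init__(self, drive):
        self.drive = drive
        self.density_valid = False

    def open(self):
        saw_attention = self.drive.command("TEST UNIT READY")
        if saw_attention or self.drive.at_bom:
            self.drive.command("MODE SENSE")
            self.density_valid = True

    def read(self):
        if not self.density_valid:
            return "ctape_strategy: READ case and density info not valid"
        return "ok"


def scenario(user_agent_first):
    """Return the result of the first read for a given access order."""
    drive = ToyTapeDrive()
    driver = ToyTapeDriver(drive)
    if user_agent_first:
        # Hypothetical pass-through access (e.g. via /dev/cam) before first open.
        drive.command("TEST UNIT READY")
    driver.open()
    return driver.read()


print(scenario(user_agent_first=True))
print(scenario(user_agent_first=False))
```

In this model the warning appears only when something else touches the drive before the driver's first open, which is the ordering the proposed workaround is meant to prevent.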
FWIW, this is only a warning. The tape will still be read correctly
because the tape driver will tell the tape drive to use its default
density which will cause the tape drive to read at whatever density the
tape was written at.
Jan Reimers
[Posted by WWW Notes gateway]
|