T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
341.1 | comments... | COOKIE::HOLSINGER | HSM Engineering, DTN 522-2843 | Wed May 28 1997 10:51 | 26 |
| Hello David.
Thank you for providing detailed information about your HSM configuration.
Here are a couple of observations:
1. The error entry was logged for a file fault (auto unshelve). The
root error was %BACKUP-E-OPENOUT, which usually means that Backup
could not locate a specific saveset on the appropriate tape. This
means that the corresponding HSM catalog entry and tape contents
don't agree. One of the two (tape or catalog) has been modified
outside of HSM. The error is not associated with HSM caching.
You can use SMU LOCATE/FULL on the file in question to display the
particular tape and saveset that HSM is trying to find. Then, mount
the tape /FOREIGN, and use BACKUP $1$MUA0:*.*/SAVE/LIST/OUT=TAPE.LIS
to get a list of the saveset files and members on the tape. We can
use this info to try to isolate where and how the discrepancy occurred.
2. The MO cache configuration looks OK. However, the default shelf is
configured to flush only to Archive class 1. I did not see any other
Archive class definitions. HSM should always be configured with
multiple redundant Archive classes. This should be corrected ASAP.
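As a rough aid for the tape-versus-catalog check in observation 1, here is a minimal Python sketch that reads the TAPE.LIS listing and reports whether the saveset named by SMU LOCATE/FULL is actually on the tape. It assumes the BACKUP/LIST output contains a "Save set:   NAME" header line for each saveset; that layout, and the saveset names used when trying it, are illustrative assumptions rather than HSM specifics.

```python
import re

# Assumed layout: BACKUP/LIST prints a header line "Save set:   NAME"
# for each save set it finds on the tape.
SAVE_SET_LINE = re.compile(r"^\s*Save set:\s*(\S+)")


def savesets_on_tape(listing):
    """Return the save set names found in a BACKUP/LIST output file."""
    names = []
    for line in listing.splitlines():
        m = SAVE_SET_LINE.match(line)
        if m:
            names.append(m.group(1))
    return names


def catalog_matches_tape(expected_saveset, listing):
    """True if the save set named by SMU LOCATE/FULL appears on the tape.

    VMS file names are case-insensitive, so compare in upper case.
    """
    wanted = expected_saveset.upper()
    return any(name.upper() == wanted for name in savesets_on_tape(listing))
```

If catalog_matches_tape returns False for the saveset LOCATE reports, the catalog and the tape disagree. That is the kind of discrepancy observation 1 describes.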
Regards,
/Paul
|
341.2 | $2$DKF104: is also the catalog disk | CX3PST::WSC217::SWANK | David | Wed May 28 1997 11:41 | 33 |
| Paul,
>Here are a couple of observations:
>
> 1. The error entry was logged for a file fault (auto unshelve). The
> root error was %BACKUP-E-OPENOUT, which usually means that Backup
> could not locate a specific saveset on the appropriate tape. This
> means that the corresponding HSM catalog entry and tape contents
> don't agree. One of the two (tape or catalog) has been modified
> outside of HSM. The error is not associated with HSM caching.
The catalog is on device $2$DKF104:, and the customer is also having problems
with it from an SLS backup standpoint. I'm working with the customer on that
problem as well, but I suspect the catalog itself could be bad.
> You can use SMU LOCATE/FULL for the file in question, to display the
> particular tape and saveset that HSM is trying to find. Then, mount
> the tape /FOREIGN, and use BACKUP $1$MUA0:*.*/SAVE/LIST/OUT=TAPE.LIS
> to get a list of the saveset files and members on the tape. We can
> use this info to try to isolate where and how the discrepancy occurred.
I'll recommend the above procedure to the customer. Is there a catalog
"health check" procedure to verify its internal structure and functionality?
> 2. The MO cache configuration looks OK. However, the default shelf is
> configured to flush only to Archive class 1. I did not see any other
> Archive class definitions. HSM should always be configured with
> multiple redundant Archive classes. This should be corrected ASAP.
I've already pointed out the lack of redundancy to the customer. Thanks for
your collaboration.
Regards, David
|
341.3 | shelf error log entry w/ SYSTEM-W-ACCONFLICT | CX3PST::WSC217::SWANK | David | Wed May 28 1997 12:02 | 53 |
| Paul,
After my last reply (.2), I went back to the error log that the customer sent,
and I may not have sent you the error log entry that corresponds to the
failing SHELF command. Does the following entry provide any additional insight
into why the SHELF command would fail with a shelf-w-cancel?
** 1455 ** REQUEST ERROR REPORT
Error detected on request number 1455 on node BLUE
Entry logged at 22-MAY-1997 05:25:16.25
Identifier: 20316808
Process: 21A00136
Username: HSM$SERVER
Timestamp: 22-MAY-1997 05:25:11.65
Client Node: BLUE
Source: Application
Type: Shelve file
Flags: FileID Makespace
State: Canceled Original Validated
Status: Error
File: $2$DKF104:[AURORA]BDAURORA_70325_011259_011591.9703706_R;1
Volume: _$2$DKF104:
FileID: (4690,1,0,0)
%HSM-E-FILERROR, file $2$DKF104:[AURORA]BDAURORA_70325_011259_011591.9703706_
%SYSTEM-W-ACCONFLICT, file access conflict
%HSM-I-RECOVERPRESHLV, inconsistent state found, file preshelved
Non-fatal shelf handler error
Fatal request error
Operation was rolled back
Exception Module Line
(SHP_ONLINE_READ_ERROR) SHP_FILE 4239
Platform Status Message Text
00000800 %SYSTEM-W-ACCONFLICT, file access conflict
Exception Module Line
SHP_ONLINE_ERROR SHP_REQUEST 7672
Exception Module Line
SHP_ONLINE_WRITE_ERROR SHP_ONLINE 5567
I'm currently investigating this %SYSTEM-W-ACCONFLICT error from the SLS
backup side of the house. I have not yet received the SLS log, so I don't
know exactly what's happening there.
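For comparing entries like the request error report above against other logs, here is a minimal Python sketch that pulls out the labelled fields (Username, Flags, State, and so on) and the %FACILITY-S-IDENT status codes. The "Label: value" layout is taken from this single entry. Treating it as the general error log format is an assumption.

```python
import re

# "Label: value" lines, e.g. "Client Node: BLUE". Labels must start with a
# letter, which skips message lines such as "%HSM-E-FILERROR, file $2$DKF104:[...]".
FIELD = re.compile(r"^\s*([A-Za-z][A-Za-z ]*):\s+(.*?)\s*$")
# VMS condition messages of the form %FACILITY-S-IDENT,
MESSAGE = re.compile(r"%([A-Z$_]+)-([A-Z])-([A-Z0-9$_]+),")


def parse_request_report(text):
    """Split a request error report into labelled fields and message codes."""
    fields, messages = {}, []
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            fields[m.group(1).strip()] = m.group(2)
        for facility, severity, ident in MESSAGE.findall(line):
            code = f"{facility}-{severity}-{ident}"
            if code not in messages:  # the same status can be logged twice
                messages.append(code)
    return fields, messages
```

For the entry above, this would report the Flags "FileID Makespace" and the State "Canceled Original Validated". It would also return the codes HSM-E-FILERROR, SYSTEM-W-ACCONFLICT and HSM-I-RECOVERPRESHLV.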
David
|
341.4 | correction to .1 | COOKIE::HOLSINGER | HSM Engineering, DTN 522-2843 | Thu May 29 1997 18:05 | 32 |
| Hello David.
I believe I was mistaken in my .1 analysis of the backup error. Since it is
%BACKUP-E-OPENOUT, the error occurred when Backup tried to open the temporary
restore file during the unshelve, not because Backup could not open the file
on the archive tape. The error may indicate a problem with the HSM$MANAGER
device or directory.
Please find and post the corresponding HSM$LOG:HSM$SHELF_HANDLER.LOG file.
Note that a new log file is created each time HSM is started; it should
contain additional error information from Backup.
Regarding the %SYSTEM-W-ACCONFLICT error, this may be the result of a race
condition within HSM. The error you posted shows a shelve command failing on a
preshelved file, probably during file truncation. HSM catches almost all
conflicting requests, with the exception of conflicts due to cache flushing.
That exception was made for performance reasons. If the scenario is as I
suspect, a cache flush was in progress during a makespace shelve. In any case,
the error is not serious: no data is affected, and the makespace operation will
simply continue with the next candidate file to shelve.
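To illustrate why such a conflict is harmless, here is a toy Python model of a makespace pass. If a candidate is held by another operation, such as a cache flush, its shelve attempt fails, that one request is skipped, and the pass moves on to the next candidate until enough space is freed. AccessConflict, shelve and makespace are hypothetical names for illustration only, not HSM's actual shelf handler logic.

```python
class AccessConflict(Exception):
    """Hypothetical stand-in for %SYSTEM-W-ACCONFLICT on a shelve attempt."""


def shelve(name, busy):
    """Hypothetical shelve step: fails if another operation holds the file."""
    if name in busy:
        raise AccessConflict(name)


def makespace(candidates, busy, blocks_needed):
    """Shelve candidate files in order until enough blocks are freed.

    candidates: list of (file name, size in blocks), best candidate first.
    busy: names currently held by another operation, e.g. a cache flush.
    Returns (shelved, skipped, blocks_freed).
    """
    shelved, skipped, freed = [], [], 0
    for name, blocks in candidates:
        if freed >= blocks_needed:
            break
        try:
            shelve(name, busy)
        except AccessConflict:
            # The conflicting request is rolled back (the file stays
            # preshelved and no data is lost); try the next candidate.
            skipped.append(name)
            continue
        shelved.append(name)
        freed += blocks
    return shelved, skipped, freed
```

With candidates A (100 blocks), B (50) and C (80), 120 blocks needed, and A busy, the model skips A and shelves B and C, freeing 130 blocks.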
Please verify the situation by posting the following:
1. SMU LOCATE/FULL output for the file in question
2. DIR/FULL output for the file in question
3. Any cache flush entry in HSM$LOG:HSM$SHP_AUDIT.LOG that matches
the 22-MAY-1997 05:25:16.25 error timestamp
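For item 3, here is a minimal Python sketch that scans a copy of the audit log for cache flush entries within a time window around the error timestamp. It assumes each relevant audit line carries a VMS-style timestamp (DD-MMM-YYYY HH:MM:SS.cc) and that flush entries contain the word "flush". Both are assumptions about the HSM$SHP_AUDIT.LOG layout, so the keyword and the 60-second window are hypothetical parameters to adjust.

```python
import re
from datetime import datetime, timedelta

# VMS absolute time as it appears in the error log, e.g. 22-MAY-1997 05:25:16.25
VMS_TIME = re.compile(r"\b(\d{1,2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2})\b")


def parse_vms_time(text):
    # %b matches month abbreviations case-insensitively; %f reads ".25" as 0.25 s
    return datetime.strptime(text, "%d-%b-%Y %H:%M:%S.%f")


def entries_near(lines, target, window_seconds=60, keyword="flush"):
    """Return lines that mention keyword and carry a timestamp near target."""
    t0 = parse_vms_time(target)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        m = VMS_TIME.search(line)
        if m and keyword.lower() in line.lower():
            if abs(parse_vms_time(m.group(1)) - t0) <= window:
                hits.append(line)
    return hits
```

Called as entries_near(open("HSM_SHP_AUDIT.LOG"), "22-MAY-1997 05:25:16.25") on a copied log, it returns the candidate flush lines. The file name here is a placeholder.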
Also, what version of HSM is the customer running?
Regards,
/Paul
|
341.5 | | CX3PST::WSC217::SWANK | David | Fri May 30 1997 09:52 | 11 |
| Paul,
The customer has deleted old HSM$LOG:*.LOG files, and the shelf command
is now working. The customer is going to disable the flush on the RW500 devices
that had a flush interval. If the problem recurs, they will send the
log files HSM$LOG:HSM$SHP_AUDIT.LOG and HSM$LOG:HSM$SHP_ERROR.LOG that correspond
to the time frame of the incident, as well as the output of SMU LOCATE/FULL
and DIR/FULL for the file in question.
Thanks for your help so far,
David
|