| Title: | File Shelving |
| Moderator: | COOKIE::HOLSINGER |
| Created: | Mon Mar 15 1993 |
| Last Modified: | Thu Jun 05 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 346 |
| Total number of notes: | 1204 |
We're continuing to have some problems with HSM. The H/W configuration is dual
VAX 8400's connected to dual TL820s via HSJ controllers. We were experiencing
"volume not S/W enabled" and "media offline" errors associated with a bug in
SLS version 2.8. I wasn't sure whether the HSM problems were a side effect of
this or not.
Recently we installed SLS version 2.8A and the "volume not S/W enabled" and
"media offline" errors stopped, but HSM continues to act up.
Here's the way HSM is set up:
Version 2.0A of HSM
Version 6.2 of VMS
$ smu sho cache
Cache device _$4$JBA286: is enabled, Cache Flush is held until after
19-SEP-1996 07:05:11.22, Backup is performed at shelving time
Cache files are not held on delte of online file
Blocksize: 0
Highwater mark: 100%
Flush interval: <none>
Cache device _$4$JBB286: is enabled, Cache Flush is held until after
19-SEP-1996 07:31:33.32, Backup is performed at shelving time
Cache files are not held on delte of online file
Blocksize: 0
Highwater mark: 100%
Flush interval: <none>
Cache device _$4$JBA287: is enabled, Cache Flush is held until after
19-SEP-1996 07:04:44.84, Backup is performed at shelving time
Cache files are not held on delte of online file
Blocksize: 0
Highwater mark: 100%
Flush interval: <none>
Cache device _$4$JBB287: is enabled, Cache Flush is held until after
17-SEP-1996 07:31:37.88, Backup is performed at shelving time
Cache files are not held on delte of online file
Blocksize: 0
Highwater mark: 100%
Flush interval: <none>
$ smu sho archive
HSM$ARCHIVE01 has not been used
Identifier: 1
Media type: TK87K
Density: <none>
Label: HS0001
Position: 0
Device refs: 0
Shelf refs: 2
Current pool: <none>
Enabled pools: <none>
HSM$ARCHIVE02 has been used
Identifier: 2
Media type: TK87K
Density: <none>
Label: BEF001
Position: 9
Device refs: 1
Shelf refs: 2
Current pool: XHF_ARCH_A
Enabled pools: XHF_ARCH_A
HSM$ARCHIVE03 has been used
Identifier: 3
Media type: TK87K
Density: <none>
Label: BEF709
Position: 9
Device refs: 1
Shelf refs: 2
Current pool: XHF_ARCH_B
Enabled pools: XHF_ARCH_B
HSM$ARCHIVE04 has been used
Identifier: 4
Media type: TK87K
Density: <none>
Label: BDN300
Position: 9
Device refs: 1
Shelf refs: 2
Current pool: XHF_ARCH_A1
Enabled pools: XHF_ARCH_A1
HSM$ARCHIVE05 has been used
Identifier: 5
Media type: TK87K
Density: <none>
Label: BDN400
Position: 9
Device refs: 1
Shelf refs: 2
Current pool: XHF_ARCH_B1
Enabled pools: XHF_ARCH_B1
$ SMU SHO DEVICE
HSM drive HSM$DEFAULT_DEVICE is enabled.
Shared access: < shelve, unshelve >
MDMS status: Not configured
Enabled archives: <none>
HSM drive _$3$MUA450: is enabled.
Dedicated access: < shelve, unshelve >
MDMS status: Configured
Enabled archives: HSM$ARCHIVE02 id: 2
HSM$ARCHIVE04 id: 4
HSM drive _$3$MUA650: is enabled.
Dedicated access: < shelve, unshelve >
MDMS status: Configured
Enabled archives: HSM$ARCHIVE03 id: 3
HSM$ARCHIVE05 id: 5
HSM drive _$3$MUA530: is enabled.
Shared access: < shelve, unshelve >
MDMS status: Configured
Enabled archives: HSM$ARCHIVE03 id: 3
HSM$ARCHIVE05 id: 5
HSM drive _$3$MUA330: is enabled.
Shared access: < shelve, unshelve >
MDMS status: Configured
Enabled archives: HSM$ARCHIVE02 id: 2
HSM$ARCHIVE04 id: 4
$smu show facility
Polycenter HSM is enabled for Shelving and Unshelving
Facility history:
Created: 2-MAY-1996 17:05:10.04
Revised: 17-SEP-1996 08:16:06.12
Designated servers: HOBBES
CALVIN
Event logging: Audit
Error
Exception
HSM mode: Plus
Remaining license: 999 Gigabytes
$ smu show policy
Policy HSM$DEFAULT_OCCUPANCY is enabled for shelving
Policy HSM$DEFAULT_POLICY is enabled for shelving
Policy HSM$DEFAULT_QUOTA is enabled for shelving
$ smu show shelf
Shelf HSM$DEFAULT_SHELF is enabled for Shelving and Unshelving
Shelf history:
Created: 2-MAY-1996 17:05:09.36
Revised: 3-JUL-1996 16:55:46.64
Backup Verification: Off
Archive Classes:
Archive list: HSM$ARCHIVE01 id: 1
Restore list: HSM$ARCHIVE01 id: 1
Shelf XHF_SHELF_23 is enabled for Shelving and Unshelving
Shelf history:
Created: 11-SEP-1996 13:35:02.10
Revised: 11-SEP-1996 14:11:13.11
Backup Verification: Off
Archive Classes:
Archive list: HSM$ARCHIVE04 id: 4
HSM$ARCHIVE05 id: 5
Restore list: HSM$ARCHIVE04 id: 4
HSM$ARCHIVE02 id: 2
HSM$ARCHIVE05 id: 5
HSM$ARCHIVE03 id: 3
Shelf XHF_SHELF_32 is enabled for Shelving and Unshelving
Shelf history:
Created: 11-SEP-1996 13:35:02.10
Revised: 11-SEP-1996 14:11:13.11
Backup Verification: Off
Archive Classes:
Archive list: HSM$ARCHIVE04 id: 4
HSM$ARCHIVE05 id: 5
Restore list: HSM$ARCHIVE05 id: 5
HSM$ARCHIVE03 id: 3
HSM$ARCHIVE04 id: 4
HSM$ARCHIVE02 id: 2
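The drive and shelf listings above can be cross-referenced to see which drives are eligible for each archive class in a shelf's restore list. The following is only a tabulation sketch in Python, with the data copied by hand from the SMU SHOW DEVICE and SMU SHOW SHELF output; the names `DRIVE_ARCHIVES`, `RESTORE_LISTS`, `drives_for` and `restore_plan` are made up for illustration. It does not model how HSM actually selects a drive or an archive class.

```python
# Drive -> archive class ids, copied from the SMU SHOW DEVICE output above.
DRIVE_ARCHIVES = {
    "_$3$MUA450:": {2, 4},   # dedicated access
    "_$3$MUA650:": {3, 5},   # dedicated access
    "_$3$MUA530:": {3, 5},   # shared access
    "_$3$MUA330:": {2, 4},   # shared access
}

# Shelf -> restore list order, copied from the SMU SHOW SHELF output above.
RESTORE_LISTS = {
    "XHF_SHELF_23": [4, 2, 5, 3],
    "XHF_SHELF_32": [5, 3, 4, 2],
}


def drives_for(archive_id):
    """Return the sorted names of drives enabled for one archive class."""
    return sorted(d for d, ids in DRIVE_ARCHIVES.items() if archive_id in ids)


def restore_plan(shelf):
    """List (archive id, eligible drives) in the shelf's restore-list order."""
    return [(a, drives_for(a)) for a in RESTORE_LISTS[shelf]]


if __name__ == "__main__":
    for shelf in RESTORE_LISTS:
        print(shelf)
        for archive_id, drives in restore_plan(shelf):
            print(f"  HSM$ARCHIVE{archive_id:02d}: {', '.join(drives)}")
```

With this data, the first restore choice for XHF_SHELF_23 (archive 4) maps to the MUA330/MUA450 pair, while the first choice for XHF_SHELF_32 (archive 5) maps to the MUA530/MUA650 pair.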
Our application is a data archive for several types of historical
files. Some file types are extremely large (500K-800K blocks). We keep
1-4 days of data on online storage (depending on data types and file
sizes). This typically keeps our 2 Gbyte disks at 60% full or so. Shelved
data whose file headers still reside on the disk are retained for 90 days.
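To put the file sizes in perspective, here is a rough arithmetic sketch in Python. An OpenVMS disk block is 512 bytes; reading "K" as 1000 and "2 Gbyte" as 2 GiB are assumptions, and the helper names are hypothetical.

```python
BLOCK_BYTES = 512            # size of one OpenVMS disk block
DISK_BYTES = 2 * 1024**3     # assumption: "2 Gbyte" read as 2 GiB


def blocks_to_mb(blocks):
    """Convert a block count to megabytes (1 MB = 1024 * 1024 bytes here)."""
    return blocks * BLOCK_BYTES / 1024**2


def share_of_disk(blocks):
    """Fraction of one disk occupied by a file of the given block count."""
    return blocks * BLOCK_BYTES / DISK_BYTES


if __name__ == "__main__":
    for blocks in (500_000, 800_000):   # assumption: "K" read as 1000
        print(f"{blocks} blocks = {blocks_to_mb(blocks):.0f} MB, "
              f"{share_of_disk(blocks):.1%} of one disk")
```

Under these assumptions the largest files come to roughly 244 to 391 MB each, so a single 800K-block file takes up about a fifth of one disk.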
The archive classes and shelves are set up so that (at least) one copy of
all data is on each TL820. Each TL820 has two tape drives enabled for HSM
use. The reason that there are 4 active archives is that due to some operator
errors we got some HSM volumes wiped in the original pair of classes (2 & 3).
The only recovery method we could come up with was to create 2 new classes
for shelving and retain the old classes for 90 days until all their data is
aged out.
The problems listed below are roughly in priority order:
1) Shelve requests are often cancelled for no reason I can explain. Our
computer OPS people try to keep things tidy by re-shelving files
when the users are done with them, but especially on the largest of files,
shelving requests often get cancelled by the system. The entry in
HSM$SHP_ERROR.LOG always looks the same and is:
** Request Disposition:
Non-fatal shelf handler error
Fatal request error
Operation was rolled back
** Exception information:
Exception Module Line
(SHP_ONLINE_READ_ERROR) SHP_FILE 2736
Exception Module Line
(SHP_ONLINE_READ_ERROR) SHP_FILE 2677
Platform Status Message TExt
00000800 %SYSTEM-W-ACCONFLICT, file access conflict
It looks to me like this might be caused by a second user trying to read
a file while it's being shelved; but I have verified that that isn't happening.
Users MAY be doing DIR or even DIR/FULL on the directories containing the files
though. One of the two TL820s seems to count up errors on its two tape drives
much more quickly than the other. The counts aren't unheard of for tape units
(50-60 on one TL820 and less than 10 on the other). Could these errors be
interfering with shelving? I've tried to correlate the system error log
entries with entries in HSM$SHP_ERROR.LOG, without success. Any idea what's
causing this and what could be done about it?
2) Can you describe the algorithm whereby cartridges are removed from tape
drives and/or tape drives are released for other use once a request completes?
We have occurrences where a request completes, the cartridge is left in the
drive and when a request needing a different cartridge is generated it
stalls sometimes for hours and sometimes indefinitely. OPCOM messages
indicating that the request for volume "blah" is stalled are generated.
Shouldn't the tape get removed? Is there any workaround our operators could
perform when we get into this situation?
3) Similar to two: Describe the algorithm for sending a request to the second
(or subsequent) archive class in the restore list for a shelf. As in #2
above, we might have a case when both tapes in one TL820 are busy, but
there's a free one in the other. The request seems to hang up rather than
get passed on to other archive classes. This is the reason you see the dual
shelf definitions above. We've created two separate shelves and split the
disk volumes used for this activity across them. Then by manually defining
the restore lists in a different order on the two shelves we get some
split up of unshelving activity.
4) Given that some of these files are SOOO big, is there some way we can point
HSM at a different working area for the HSM$xxxxxxxx.RST files created during
an unshelve operation?
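For problem 1, one way to make the correlation attempt more systematic is to tally the exception entries in HSM$SHP_ERROR.LOG and decode the platform status values. The Python sketch below assumes the log lines look exactly like the excerpt quoted above (the real file layout may differ), and the function names are hypothetical. The bit fields used by `decode_status` are the standard OpenVMS condition value layout: severity in bits 0-2, message number in bits 3-15, facility number in bits 16-27.

```python
import re
from collections import Counter

# Severity codes held in the low three bits of an OpenVMS condition value.
SEVERITY = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}

# Matches exception lines such as "(SHP_ONLINE_READ_ERROR) SHP_FILE 2736".
# Assumption: the log uses the layout shown in the excerpt above.
EXCEPTION_RE = re.compile(r"^\((\w+)\)\s+(\w+)\s+(\d+)$")

SAMPLE = """\
** Exception information:
Exception Module Line
(SHP_ONLINE_READ_ERROR) SHP_FILE 2736
Exception Module Line
(SHP_ONLINE_READ_ERROR) SHP_FILE 2677
Platform Status Message TExt
00000800 %SYSTEM-W-ACCONFLICT, file access conflict
"""


def decode_status(hex_text):
    """Split an OpenVMS condition value into its standard bit fields."""
    value = int(hex_text, 16)
    return {
        "severity": SEVERITY.get(value & 0x7, "?"),   # bits 0-2
        "message": (value >> 3) & 0x1FFF,             # bits 3-15
        "facility": (value >> 16) & 0xFFF,            # bits 16-27
    }


def tally_exceptions(log_text):
    """Count (exception, module) pairs found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXCEPTION_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    print(tally_exceptions(SAMPLE))
    print(decode_status("00000800"))
```

Decoding 00000800 gives facility 0 (SYSTEM) with warning severity, which agrees with the %SYSTEM-W-ACCONFLICT text in the log. Tallies taken per day or per file could then be compared against the tape drive error counts.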
Thanks for help or advice you can provide.
-Doug Smith
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 317.1 | Grade up to HSM 2.1 | VNABRW::KARTNER_M | HOUSTON, we have a problem | Thu Feb 13 1997 00:35 | 11 |
Hi!
I would recommend upgrading to HSM V2.1, which is in SSB already
COOKIE::AIM$PUBLIC:[HSM.KITS.V21]
This is a bugfix release. There were several problems with HSM2.0A
including response to OPCOM messages,...
I hope this helps
Michael
| 317.2 | VAX or ALPHA | WOTVAX::SMITHD | | Fri Feb 14 1997 08:36 | 18 |
> I would recommend upgrading to HSM V2.1, which is in SSB already
>
> COOKIE::AIM$PUBLIC:[HSM.KITS.V21]
Is this the kit? One might infer from the filenames that this is a VAX
architecture release, not an Alpha one.
HSM021.A-DCX_VAXEXE;1
929/932 27-JAN-1997 17:17:10.00 (R,RWED,,)
HSM021.B-DCX_VAXEXE;1
6135/6136 27-JAN-1997 17:17:11.00 (R,RWED,,)
HSM021.C-DCX_VAXEXE;1
9760/9760 27-JAN-1997 17:17:15.00 (R,RWED,,)
HSM021.D-DCX_VAXEXE;1
9888/9888 27-JAN-1997 17:17:21.00 (R,RWED,,)
Thanks
Doug
| 317.3 | | COMEUP::SIMMONDS | lock (M); while (not *SOMETHING) { Wait(C,M); } unlock(M) | Sun Feb 16 1997 21:46 | 12 |
Re: .2 (VAX kits?)
.0> We're continuing to have some problems with HSM. The H/W configuration is dual
.0> VAX 8400's connected to dual TL820s via HSJ controllers. We were experiencing
    ~~~~~~~~
Who's confused? :):)
Have you tried decompressing the .DCX_VAXEXE files? (on a VAX..)
and installing the resulting savesets on your Alpha ?
(The HSM kits have typically been dual Arch.)
John.
| 317.4 | SLS-F-MRD_START_FAIL | WOTVAX::SMITHD | | Tue Feb 25 1997 09:20 | 12 |
> Have you tried decompressing the .DCX_VAXEXE files? (on a VAX..)
Yep, installed the 2.1 release on top of 2.8a SLS and this seems to improve
(possibly fix?) the cancel problem, but now SLS is reporting:
SLS-F-MRD_START_FAIL - media robot driver startup failure
We are attempting to work this ongoing fun with Ted Saul in the CSC. Any help
would be appreciated.
Thanks,
Doug
| 317.5 | MRD help in SLS conference | COOKIE::HOLSINGER | HSM Engineering, DTN 522-2843 | Mon Mar 31 1997 10:21 | 20 |
re: <<< Note 317.4 by WOTVAX::SMITHD >>>
> SLS-F-MRD_START_FAIL - media robot driver startup failure
Hello Doug,
This message is indicative of a problem (usually configuration) between the
robot device and the lowest level software (above the SCSI port driver) used
by SLS/MDMS to manage the load/unload operations. HSM is at the top of the
food chain here, and is probably not the most efficient way to troubleshoot
the problem.
If the problem persists, I would recommend you re-post the TL8xx/HSJ config
details in the COOKIE::SLS conference (KP7 if needed). There are several
entries in that conference already which discuss this exact error message.
If this problem has been fixed, please disregard this reply (we can use note
#329 to pursue the dual TL820 problem).
Regards,
/Paul