Title: | File Shelving |
Moderator: | COOKIE::HOLSINGER |
Created: | Mon Mar 15 1993 |
Last Modified: | Thu Jun 05 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 346 |
Total number of notes: | 1204 |
We're continuing to have some problems with HSM. The H/W configuration is dual VAX 8400's connected to dual TL820s via HSJ controllers. We were experiencing "volume not S/W enabled" and "media offline" errors associated with a bug in SLS version 2.8. I wasn't sure whether the HSM problems were a side effect of this or not. Recently we installed SLS version 2.8A and the "volume not S/W enabled" and "media offline" errors stopped, but HSM continues to act up.

Here's the way HSM is set up:

    Version 2.0A of HSM
    Version 6.2 of VMS

    $ smu sho cache
    Cache device _$4$JBA286: is enabled, Cache Flush is held until after 19-SEP-1996 07:05:11.22,
        Backup is performed at shelving time
        Cache files are not held on delte of online file
        Blocksize: 0
        Highwater mark: 100%
        Flush interval: <none>
    Cache device _$4$JBB286: is enabled, Cache Flush is held until after 19-SEP-1996 07:31:33.32,
        Backup is performed at shelving time
        Cache files are not held on delte of online file
        Blocksize: 0
        Highwater mark: 100%
        Flush interval: <none>
    Cache device _$4$JBA287: is enabled, Cache Flush is held until after 19-SEP-1996 07:04:44.84,
        Backup is performed at shelving time
        Cache files are not held on delte of online file
        Blocksize: 0
        Highwater mark: 100%
        Flush interval: <none>
    Cache device _$4$JBB287: is enabled, Cache Flush is held until after 17-SEP-1996 07:31:37.88,
        Backup is performed at shelving time
        Cache files are not held on delte of online file
        Blocksize: 0
        Highwater mark: 100%
        Flush interval: <none>

    $ smu sho archive
    HSM$ARCHIVE01 has not been used
        Identifier: 1
        Media type: TK87K
        Density: <none>
        Label: HS0001
        Position: 0
        Device refs: 0
        Shelf refs: 2
        Current pool: <none>
        Enabled pools: <none>
    HSM$ARCHIVE02 has been used
        Identifier: 2
        Media type: TK87K
        Density: <none>
        Label: BEF001
        Position: 9
        Device refs: 1
        Shelf refs: 2
        Current pool: XHF_ARCH_A
        Enabled pools: XHF_ARCH_A
    HSM$ARCHIVE03 has been used
        Identifier: 3
        Media type: TK87K
        Density: <none>
        Label: BEF709
        Position: 9
        Device refs: 1
        Shelf refs: 2
        Current pool: XHF_ARCH_B
        Enabled pools: XHF_ARCH_B
    HSM$ARCHIVE04 has been used
        Identifier: 4
        Media type: TK87K
        Density: <none>
        Label: BDN300
        Position: 9
        Device refs: 1
        Shelf refs: 2
        Current pool: XHF_ARCH_A1
        Enabled pools: XHF_ARCH_A1
    HSM$ARCHIVE05 has been used
        Identifier: 5
        Media type: TK87K
        Density: <none>
        Label: BDN400
        Position: 9
        Device refs: 1
        Shelf refs: 2
        Current pool: XHF_ARCH_B1
        Enabled pools: XHF_ARCH_B1

    $ SMU SHO DEVICE
    HSM drive HSM$DEFAULT_DEVICE is enabled.
        Shared access: < shelve, unshelve >
        MDMS status: Not configured
        Enabled archives: <none>
    HSM drive _$3$MUA450: is enabled.
        Dedicated access: < shelve, unshelve >
        MDMS status: Configured
        Enabled archives: HSM$ARCHIVE02 id: 2
                          HSM$ARCHIVE04 id: 4
    HSM drive _$3$MUA650: is enabled.
        Dedicated access: < shelve, unshelve >
        MDMS status: Configured
        Enabled archives: HSM$ARCHIVE03 id: 3
                          HSM$ARCHIVE05 id: 5
    HSM drive _$3$MUA530: is enabled.
        Shared access: < shelve, unshelve >
        MDMS status: Configured
        Enabled archives: HSM$ARCHIVE03 id: 3
                          HSM$ARCHIVE05 id: 5
    HSM drive _$3$MUA330: is enabled.
        Shared access: < shelve, unshelve >
        MDMS status: Configured
        Enabled archives: HSM$ARCHIVE02 id: 2
                          HSM$ARCHIVE04 id: 4

    $ smu show facility
    Polycenter HSM is enabled for Shelving and Unshelving
        Facility history:
            Created: 2-MAY-1996 17:05:10.04
            Revised: 17-SEP-1996 08:16:06.12
        Designated servers: HOBBES CALVIN
        Event logging: Audit Error Exception
        HSM mode: Plus
        Remaining license: 999 Gigabytes

    $ smu show policy
    Policy HSM$DEFAULT_OCCUPANCY is enabled for shelving
    Policy HSM$DEFAULT_POLICY is enabled for shelving
    Policy HSM$DEFAULT_QUOTA is enabled for shelving

    $ smu show shelf
    Shelf HSM$DEFAULT_SHELF is enabled for Shelving and Unshelving
        Shelf history:
            Created: 2-MAY-1996 17:05:09.36
            Revised: 3-JUL-1996 16:55:46.64
        Backup Verification: Off
        Archive Classes:
            Archive list: HSM$ARCHIVE01 id: 1
            Restore list: HSM$ARCHIVE01 id: 1
    Shelf XHF_SHELF_23 is enabled for Shelving and Unshelving
        Shelf history:
            Created: 11-SEP-1996 13:35:02.10
            Revised: 11-SEP-1996 14:11:13.11
        Backup Verification: Off
        Archive Classes:
            Archive list: HSM$ARCHIVE04 id: 4
                          HSM$ARCHIVE05 id: 5
            Restore list: HSM$ARCHIVE04 id: 4
                          HSM$ARCHIVE02 id: 2
                          HSM$ARCHIVE05 id: 5
                          HSM$ARCHIVE03 id: 3
    Shelf XHF_SHELF_32 is enabled for Shelving and Unshelving
        Shelf history:
            Created: 11-SEP-1996 13:35:02.10
            Revised: 11-SEP-1996 14:11:13.11
        Backup Verification: Off
        Archive Classes:
            Archive list: HSM$ARCHIVE04 id: 4
                          HSM$ARCHIVE05 id: 5
            Restore list: HSM$ARCHIVE05 id: 5
                          HSM$ARCHIVE03 id: 3
                          HSM$ARCHIVE04 id: 4
                          HSM$ARCHIVE02 id: 2

Our application is a data archive for several types of historical files. Some file types are extremely large (500K-800K blocks). We keep 1-4 days of data on online storage (depending on data types and file sizes). This typically keeps our 2 Gbyte disks at 60% full or so. Shelved data whose file headers still reside on the disk are retained for 90 days. The archive classes and shelves are set up so that (at least) one copy of all data is on each TL820. Each TL820 has two tape drives enabled for HSM use.

The reason that there are 4 active archives is that, due to some operator errors, we got some HSM volumes wiped in the original pair of classes (2 & 3). The only recovery method we could come up with was to create 2 new classes for shelving and retain the old classes for 90 days until all their data is aged out.

The problems listed below are roughly in priority order:

1) Shelve requests are often cancelled for no reason I can explain. Our computer OPS people try to keep things tidy by re-shelving files when the users are done with them, but especially on the largest of files, shelving requests often get cancelled by the system. The entry in HSM$SHP_ERROR.LOG always looks the same and is:

    ** Request Disposition:
       Non-fatal shelf handler error
       Fatal request error
       Operation was rolled back
    ** Exception information:
       Exception                  Module    Line
       (SHP_ONLINE_READ_ERROR)    SHP_FILE  2736
       Exception                  Module    Line
       (SHP_ONLINE_READ_ERROR)    SHP_FILE  2677
       Platform Status  Message TExt
       00000800         %SYSTEM-W-ACCONFLICT, file access conflict

It looks to me like this might be caused by a second user trying to read a file while it's being shelved, but I have verified that that isn't happening. Users MAY be doing DIR or even DIR/FULL on the directories containing the files, though.

One of the two TL820s seems to count up errors on its two tape drives much more quickly than the other. The counts aren't unheard of for tape units (50-60 on one TL820 and less than 10 on the other). Could these errors be interfering with shelving? I've tried to correlate the system error log entries with entries in HSM$SHP_ERROR.LOG, without success. Any idea what's causing this and what could be done about it?

2) Can you describe the algorithm whereby cartridges are removed from tape drives and/or tape drives are released for other use once a request completes? We have occurrences where a request completes, the cartridge is left in the drive, and when a request needing a different cartridge is generated it stalls, sometimes for hours and sometimes indefinitely. OPCOM messages indicating that the request for volume "blah" is stalled are generated. Shouldn't the tape get removed? Is there any workaround our operators could perform when we get into this situation?

3) Similar to #2: describe the algorithm for sending a request to the second (or subsequent) archive class in the restore list for a shelf. As in #2 above, we might have a case when both tape drives in one TL820 are busy, but there's a free one in the other. The request seems to hang up rather than get passed on to other archive classes. This is the reason you see the dual shelf definitions above. We've created two separate shelves and split the disk volumes used for this activity across them. Then, by manually defining the restore lists in a different order on the two shelves, we get some split-up of unshelving activity.

4) Given that some of these files are SOOO big, is there some way we can point HSM at a different working area for the HSM$xxxxxxxx.RST files created during an unshelve operation?

Thanks for any help or advice you can provide.

-Doug Smith
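The restore-list workaround described in problem 3 can be pictured with a small model. This is only a sketch: the shelf names, archive ids and drive names are copied from the SMU output in this note, but the selection rules (`first_choice` and the hypothetical `fallback_choice`) are assumptions made for illustration, not documented HSM behaviour. How HSM really picks an archive class is exactly what the question asks.

```python
# Data copied from the SMU SHOW SHELF / SMU SHOW DEVICE output in this note.
RESTORE_LISTS = {
    "XHF_SHELF_23": [4, 2, 5, 3],
    "XHF_SHELF_32": [5, 3, 4, 2],
}

DRIVES_BY_ARCHIVE = {
    2: ["_$3$MUA450:", "_$3$MUA330:"],
    4: ["_$3$MUA450:", "_$3$MUA330:"],
    3: ["_$3$MUA650:", "_$3$MUA530:"],
    5: ["_$3$MUA650:", "_$3$MUA530:"],
}


def first_choice(shelf):
    """ASSUMPTION: an unshelve request goes to the first archive class listed."""
    archive_id = RESTORE_LISTS[shelf][0]
    return archive_id, DRIVES_BY_ARCHIVE[archive_id]


def fallback_choice(shelf, busy_drives):
    """HYPOTHETICAL policy: walk the restore list until some drive is free.

    This is the behaviour the note hopes for, not behaviour HSM is known to have.
    """
    for archive_id in RESTORE_LISTS[shelf]:
        free = [d for d in DRIVES_BY_ARCHIVE[archive_id] if d not in busy_drives]
        if free:
            return archive_id, free
    return None  # every enabled drive is busy


if __name__ == "__main__":
    for shelf in RESTORE_LISTS:
        print(shelf, first_choice(shelf))
```

Under the first-entry assumption, XHF_SHELF_23 starts on archive class 4 (drives MUA450 and MUA330) and XHF_SHELF_32 starts on archive class 5 (drives MUA650 and MUA530), which is the split of unshelving activity that the two shelf definitions are meant to produce.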
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
317.1 | Grade up to HSM 2.1 | VNABRW::KARTNER_M | HOUSTON, we have a problem | Thu Feb 13 1997 00:35 | 11 |
Hi! I would recommend upgrading to HSM V2.1, which is SSB already COOKIE::AIM$PUBLIC:[HSM.KITS.V21] This is a bugfix release. There were several problems with HSM 2.0A, including response to OPCOM messages,... I hope this helps. Michael | |||||
317.2 | VAX or ALPHA | WOTVAX::SMITHD | Fri Feb 14 1997 08:36 | 18 | |
> I would recommend upgrading to HSM V2.1, which is SSB already > > COOKIE::AIM$PUBLIC:[HSM.KITS.V21] Is this the kit? One might infer from the filenames that this is a VAX architecture release, not Alpha? HSM021.A-DCX_VAXEXE;1 929/932 27-JAN-1997 17:17:10.00 (R,RWED,,) HSM021.B-DCX_VAXEXE;1 6135/6136 27-JAN-1997 17:17:11.00 (R,RWED,,) HSM021.C-DCX_VAXEXE;1 9760/9760 27-JAN-1997 17:17:15.00 (R,RWED,,) HSM021.D-DCX_VAXEXE;1 9888/9888 27-JAN-1997 17:17:21.00 (R,RWED,,) Thanks Doug | |||||
317.3 | COMEUP::SIMMONDS | lock (M); while (not *SOMETHING) { Wait(C,M); } unlock(M) | Sun Feb 16 1997 21:46 | 12 | |
Re: .2 (VAX kits?) .0> We're continuing to have some problems with HSM. The H/W configuration is dual .0> VAX 8400's connected to dual TL820s via HSJ controllers. We were experiencing ~~~~~~~~ Who's confused? :):) Have you tried decompressing the .DCX_VAXEXE files? (on a VAX..) and installing the resulting savesets on your Alpha ? (The HSM kits have typically been dual Arch.) John. | |||||
317.4 | SLS-F-MRD_START_FAIL | WOTVAX::SMITHD | Tue Feb 25 1997 09:20 | 12 | |
| Have you tried decompressing the .DCX_VAXEXE files? (on a VAX..) Yep, installed the 2.1 release on top of 2.8a SLS and this seems to improve (possibly fix?) the cancel problem, but now SLS is reporting: SLS-F-MRD_START_FAIL - media robot driver startup failure We are attempting to work this ongoing fun with Ted Saul in the CSC. Any help would be appreciated. Thanks, Doug | |||||
317.5 | MRD help in SLS conference | COOKIE::HOLSINGER | HSM Engineering, DTN 522-2843 | Mon Mar 31 1997 11:21 | 20 |
re: <<< Note 317.4 by WOTVAX::SMITHD >>> > SLS-F-MRD_START_FAIL - media robot driver startup failure Hello Doug, This message is indicative of a problem (usually configuration) between the robot device and the lowest level software (above the SCSI port driver) used by SLS/MDMS to manage the load/unload operations. HSM is at the top of the food chain here, and is probably not the most efficient place to troubleshoot the problem. If the problem persists, I would recommend you re-post the TL8xx/HSJ config details in the COOKIE::SLS conference (KP7 if needed). There are several entries in that conference already which discuss this exact error message. If this problem has been fixed, please disregard this reply (we can use note #329 to pursue the dual TL820 problem). Regards, /Paul |
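On the attempt in .0 to correlate system error log entries with entries in HSM$SHP_ERROR.LOG: one way to do this off-line is to reduce each log to a list of timestamps and pair up events that fall close together. The sketch below assumes both logs can be reduced to VMS absolute-time strings of the form seen in the SMU output (for example `19-SEP-1996 07:05:11.22`). The function names and the 60-second window are illustrative choices, and the actual layout of either log file is not assumed here. Month-name parsing relies on an English (C) locale.

```python
from datetime import datetime, timedelta

# VMS absolute time, e.g. "19-SEP-1996 07:05:11.22" (hundredths of a second).
VMS_TIME_FORMAT = "%d-%b-%Y %H:%M:%S.%f"


def parse_vms_time(text):
    """Parse a VMS absolute-time string; leading padding spaces are stripped."""
    return datetime.strptime(text.strip(), VMS_TIME_FORMAT)


def correlate(hsm_times, errlog_times, window_seconds=60):
    """Return (hsm_event, errlog_event) pairs that lie within the window.

    Both arguments are lists of VMS time strings that the caller has already
    extracted from the respective logs (the extraction step is not shown).
    """
    window = timedelta(seconds=window_seconds)
    hsm_events = [parse_vms_time(t) for t in hsm_times]
    err_events = [parse_vms_time(t) for t in errlog_times]
    pairs = []
    for hsm_event in hsm_events:
        for err_event in err_events:
            if abs(hsm_event - err_event) <= window:
                pairs.append((hsm_event, err_event))
    return pairs


if __name__ == "__main__":
    # Made-up timestamps, purely to show the call.
    hsm = ["19-SEP-1996 07:05:11.22"]
    err = ["19-SEP-1996 07:05:40.00", "19-SEP-1996 09:00:00.00"]
    for pair in correlate(hsm, err):
        print(pair)
```

If no pairs turn up even with a generous window, that would at least suggest the tape drive error counts and the cancelled shelve requests are unrelated.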