|
Hi,
Customer did have some hardware problems with the jukebox. The jukebox
was sent away and has now come back. When they boot the system up with
the jukebox powered on, the system now crashes. Only if the system is
booted up with the jukebox powered off does the system remain up.
If the jukebox is powered up after startup has completed, the system
does not crash.
I have advised the customer that I think the best way forward is to
start from scratch and reinstall the CAM layered products and then HSM.
What do you think?
At least then we would be starting from a known place.
I am leaving the company tomorrow so if you could reply on the
conference that would be great.
Below is a section of the crash-data file:
_crash_data_collection_time: Thu Feb 13 18:29:34 GMT 1997
_current_directory: /
_crash_kernel: /var/adm/crash/vmunix.15
_crash_core: /var/adm/crash/vmcore.15
_crash_arch: alpha
_crash_os: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_host_version: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_crash_version: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_crashtime: struct {
tv_sec = 855857203
tv_usec = 914609
}
_boottime: struct {
tv_sec = 855857032
tv_usec = 609024
}
_config: struct {
sysname = "OSF1"
nodename = "kofim8"
release = "V3.2"
version = "41"
machine = "alpha"
}
_cpu: 35
_system_string: 0xffffffffff801048 = "AlphaServer 2100 4/200"
_ncpus: 1
_avail_cpus: 1
_partial_dump: 1
_physmem(MBytes): 127
_panic_string: 0xfffffc000067b5a0 = "bread: size 0"
_paniccpu: 0
_panic_thread: 0xfffffc0004dddb80
_preserved_message_buffer_begin:
struct {
msg_magic = 0x63061
msg_bufx = 0x968
msg_bufr = 0x86f
msg_bufc = "PCXAL keyboard, language English (American)
Alpha boot: available memory from 0xa44000 to 0x7ffe000
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
physical memory = 128.00 megabytes.
available memory = 117.78 megabytes.
using 484 buffers containing 3.78 megabytes of memory
Firmware revision: 3.9
PALcode: OSF version 1.35
ibus0 at nexus
AlphaServer 2100 4/200
cpu 0 EV-4s 1mb b-cache
gpc0 at ibus0
pci0 at ibus0 slot 0
tu0: DECchip 21040-AA: Revision: 2.3
tu0 at pci0 slot 0
tu0: DEC TULIP Ethernet Interface, hardware address: 08-00-2B-E2-68-05
tu0: auto sensing: selected AUI (10Base2|5) port
psiop0 at pci0 slot 1
Loading SIOP: script 1001b00, reg 81000000, data 40759a08
scsi0 at psiop0 slot 0
rz0 at scsi0 bus 0 target 0 lun 0 (DEC RZ28 (C) DEC D41C)
rz1 at scsi0 bus 0 target 1 lun 0 (DEC RZ26L (C) DEC 440C)
rz2 at scsi0 bus 0 target 2 lun 0 (DEC RZ26L (C) DEC 440C)
rz3 at scsi0 bus 0 target 3 lun 0 (DEC RZ26L (C) DEC 440C)
rz4 at scsi0 bus 0 target 4 lun 0 (DEC RZ26L (C) DEC 440C)
rz5 at scsi0 bus 0 target 5 lun 0 (DEC RZ29B (C) DEC 0014)
rz6 at scsi0 bus 0 target 6 lun 0 (DEC RRD43 (C) DEC 1084)
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
vga0 at eisa0
1024x768 (QVision )
aha0 at eisa0 slot 3
scsi1 at aha0
lvm0: configured.
lvm1: configured.
dli: configured
SuperLAT. Copyright 1993 Meridian Technology Corp. All rights reserved.
datalink: links=64, macs=6
knbinit: sessions=256, names=64
knbtcp: configured
knbtcpd: configured
knbadm configured
nbeadmin_configure
netbeuid_configure
netbeui_configure
x25_access: configured
x25_ip: configured
x25_relay: configured
wandd_base: configured
wandd_llc2: configured
wan_utilities: configured
ctf_base: configured
Node ID is 08-00-2b-e2-68-05 (from device tu0)
dna_netman: configured
dna_dli: configured
ADVFS: using 1152 buffers containing 9.00 megabytes of memory
vm_swap_init: warning /sbin/swapdefault swap device not found
vm_swap_init: in swap over commitment mode
Node UID is c390bf00-85cb-11d0-8008-08002be26805
dna_base: configured
dna_rfc1006: configured
dna_xti: configured
panic (cpu 0): bread: size 0
syncing disks... done
device string for dump = SCSI 0 1 0 1 100 0 0 .
DUMP.prom: dev SCSI 0 1 0 1 100 0 0 , block 131072
device string for dump = SCSI 0 1 0 1 100 0 0 .
DUMP.prom: dev SCSI 0 1 0 1 100 0 0 , block 131072
"
}
_preserved_message_buffer_end:
_kernel_process_status_begin:
PID COMM
00000 kernel idle
00001 init
00008 kloadsrv
00038 update
01066 knblink
01068 sendmail
01074 dllink
01117 timed
01172 mold
01175 internet_mom
01184 snmp_pe
01190 inetd
01195 cron
01223 pwlic.reg
01229 lpd
01295 jmd
01299 qr
01326 coolsrvr
01334 xdm
01340 jmd
01342 jmd
01343 oss_exec
01345 getty
01352 nbelink
01354 Xdec
01364 pwalrtr
01368 xdm
01372 lmx.ctrl
01383 getty
01384 getty
01385 getty
01386 getty
01387 getty
01389 getty
01392 getty
01394 getty
01395 getty
01397 getty
01398 getty
01399 getty
01401 getty
01402 getty
01405 getty
01406 getty
01407 getty
01409 getty
01410 getty
01411 getty
01412 getty
01413 getty
01414 getty
01415 getty
01416 getty
01424 getty
01441 Xsetup_0
01451 dxconsole
00670 usd
00740 syslogd
00742 binlogd
00769 gated
00839 named
00872 portmap
00874 mountd
00876 nfsd
00878 nfsiod
00879 nfsiod
00880 nfsiod
00881 nfsiod
00882 nfsiod
00883 nfsiod
00884 nfsiod
00890 automount
00912 dnalimd
00915 dnaevld
00922 ctfd
00952 dnascd
00953 dnansd
00954 dnaksd
00958 dnsadv
00962 dtssd
00965 dnanoded
00976 dnamopd
00986 rfc1006d
01000 osaknmd
01005 ftam_listener
01007 ftam_listener
_kernel_process_status_end:
_current_pid: 1340
_current_tid: 0xfffffc0004dddb80
_proc_thread_list_begin:
thread 0xfffffc0004dddb80 stopped at [boot:1730, 0xfffffc00004800cc]
Source not available
_proc_thread_list_end:
_dump_begin:
> 0 boot(0x0, 0x0, 0xfffffc000067b5a0, 0xffffffffffffffff, 0xffffffff88b8f0a0)
["../../../../src/kernel/arch/alpha/machdep.c":1730, 0xfffffc00004800cc]
1 panic(s = 0xfffffc000067b5a0 = "bread: size 0")
["../../../../src/kernel/bsd/subr_prf.c":757, 0xfffffc000043f3b4]
pcpu = 0x2fa
i = 0
bootopt = 7606120
mycpu = 4151676
spl = 0
prevcc = 18446739675667192340
nextcc = 18446739675783402496
timer = 1
limit = 0
2 bread(vp = 0xfffffc0005273c00, blkno = 0, size = 0, cred = 0xffffffffffffffff, bpp = 0xffffffff88b8f0a0)
["../../../../src/kernel/vfs/vfs_bio.c":393, 0xfffffc000044d46c]
bp = (nil)
error = -2144736992
metadatatype = 0
3 blkatoff(0x8052000006c1, 0x0, 0x0, 0xffffffff88b8f1d8, 0xffffffff88b8f250)
["../../../../src/kernel/ufs/ufs_lookup.c":1332, 0xfffffc0000270e34]
4 scandir(0xfffffc0005273c00, 0x0, 0x0, 0xfffffc0000212270, 0x0)
["../../../../src/kernel/ufs/ufs_lookup.c":530, 0xfffffc000026fd34]
5 ufs_lookup(0xfffffc0005273c00, 0xffffffff88b8f550, 0x3d, 0xfffffc0000295610, 0xfffffc0000295988)
["../../../../src/kernel/ufs/ufs_lookup.c":386, 0xfffffc000026fab0]
6 namei(0xfffffc00006d61a0, 0xfffffc0000000001, 0xfffffc0005273c00, 0xfffffc0000000001, 0xfffffc0000299cf4)
["../../../../src/kernel/vfs/vfs_lookup.c":563, 0xfffffc00002954fc]
7 vn_open(0xffffffff88b8f720, 0x1001, 0x0, 0x633231706f2f7665, 0xfffffc000048eaac)
["../../../../src/kernel/vfs/vfs_vnops.c":515, 0xfffffc0000298ee0]
8 copen(0xfffffc0004ddd210, 0xffffffff88b8f8c8, 0xffffffff88b8f8b8, 0x0, 0x40289374bc6a7efa)
["../../../../src/kernel/vfs/vfs_syscalls.c":1824, 0xfffffc0000454f98]
9 open(0xffffffff88b8f8b8, 0x0, 0x40289374bc6a7efa, 0xfffffc000048e4d4, 0xfffffc000048d8d8)
["../../../../src/kernel/vfs/vfs_syscalls.c":1781, 0xfffffc0000454ebc]
10 syscall(0x140019228, 0x1, 0x0, 0x0, 0x2d)
["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc000048d8d4]
11 _Xsyscall(0x8, 0x1200477c8, 0x1400276e0, 0x11ffff258, 0x0)
["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc000047d024]
_dump_end:
Anyway, thanks for all the help and advice you have given me.
Many thanks
Avril
|
| Avril --
> I am leaving the company tomorrow
I wish you well in your future endeavors!
-- Pat
-------------------------------------------------------------------------------
About the case:
> Customer did have some hardware problems with the jukebox. The jukebox
> was sent away and has now come back.
Sounds like it's worse than it was before... What was done to the jukebox?
Is there any way a different jukebox (preferably one known to be working)
could be brought to the customer site for testing?
> When they boot the system up with
> the jukebox powered on, the system now crashes. Only if the system is
> booted up with the jukebox powered off does the system remain up.
> If the jukebox is powered up after startup has completed, the system
> does not crash.
The crash info shows the jmd trying to open something, probably on one of
the optical drives, and probably while trying to inventory the jukebox.
So the fact that the system stays up if the jukebox is turned on later is
only because the jmd decided there was no jukebox. It won't go back and
try to look for, and inventory, a newly turned on jukebox, unless it's
restarted. So if they restart HSM after the jukebox is turned on, I'd
presume the system would crash then.
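(The timestamps in the crash data bear this out: _crashtime minus _boottime
is 855857203 - 855857032 = 171 seconds, so the panic came less than three
minutes after boot -- right about when startup would be finishing and the
jmd would be running its inventory.)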
The crash appears to have happened because function bread was passed a 0
size argument -- it checks for this, and panics if it gets it.
The stack trace shown in the crash info looks wrong -- there are impossible
arguments in some of the calls, for instance, the open and copen calls
should both have the proc pointer in the first argument...but their first
arguments are not the same. Also, the open flag appears to be 1, O_WRONLY,
which the jmd does not use. So maybe the stack wasn't valid at the time
of the dump. Or I'm confused about what it means... I don't have access
to v3.2d-1 sources, but I looked at both v3.2c and v3.2f sources.
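(For whoever picks this up: to look at the trace again directly, kernel-mode
dbx on the files named at the top of the crash-data should do it -- a sketch,
assuming the dump files are still where savecore left them:
dbx -k /var/adm/crash/vmunix.15 /var/adm/crash/vmcore.15
then, at the dbx prompt, "t" prints the stack of the thread that was current
at the panic.)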
There are several possibilities:
1) The jukebox is still sick. Although the stack trace shows the panic
happening before the open gets to the point of trying to access the
device, device-related things may have happened earlier that messed up
internal data structures or some such thing, and this panic is a
consequence of that earlier damage.
2) Their vmunix is corrupted. Why don't they rebuild their kernel, and
compare the new and old versions? They should make sure the rebuild
succeeds. They can rebuild by doing:
doconfig -c XXXX
where XXXX is the name of their machine in uppercase. They should
answer no when doconfig asks them if they want to edit the configuration
file.
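For the record, a sketch of the whole sequence, using their node name
(kofim8, from the crash data, so the configuration name should be KOFIM8).
I believe doconfig leaves the new kernel in /usr/sys/KOFIM8/vmunix, but it
prints the actual path when it finishes:
doconfig -c KOFIM8
ls -l /vmunix /usr/sys/KOFIM8/vmunix
cp /vmunix /vmunix.old
cp /usr/sys/KOFIM8/vmunix /vmunix
Then reboot. Comparing sizes with ls -l is only a gross check -- a rebuilt
kernel won't be byte-identical to the shipped one anyway -- but booting the
new kernel tells us whether the old vmunix was the problem.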
> I have advised the customer that I think the best way forward is to
> start from scratch and reinstall the CAM layered products and then HSM.
I don't see anything HSM-specific in this case, but if they did reinstall,
they'd rebuild the kernel along the way... It might confuse the issue
to do both at once, though.
3) There's a bug in the OS. I didn't see any patches in the v3.2d-1 patch
set that mention a "bread: size 0" panic, but not all patches give their
precise symptoms. If a kernel rebuild doesn't help, it may be necessary
to get help from the Unix group, to look at the crash dump.
4) Something else that talks to the device is incompatible with it.
What SCSI adapter are they using on the bus the jukebox is on? Are
any other devices on the same bus?
But before we get to these second-level possibilities, the first approach
would be to replace or repair the jukebox...after all, it *was* getting
those huge numbers of errors, and the behavior *did* change when it was
"fixed".
-- Pat
|
| In order to prevent the system from crashing, another thing they can do
instead of powering off the jukebox is to take HSM out of the system startup.
To not start HSM during system startup, remove the link to its startup
procedure:
rm /sbin/rc3.d/S70hsm
One of the tests I'll suggest below requires that HSM be running, so it
would have to be started by hand later, but we don't want it trying an
inventory when it starts.
To turn off the inventory, change the file /usr/efs/PolycenterHSM.startup to
set the inventory type to "none" by replacing the line:
/usr/efs/bin/jmd
with
/usr/efs/bin/jmd -i none
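Putting the two changes together (a sketch; the later by-hand start assumes
the modified startup script can simply be run as a command, which is how
we'll start HSM for the test below):
rm /sbin/rc3.d/S70hsm
(edit /usr/efs/PolycenterHSM.startup as above: jmd -> jmd -i none)
Then, once the system is up and the jukebox is powered on:
/usr/efs/PolycenterHSM.startup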
-------------------------------------------------------------------------------
Going back to some previous evidence:
# file /dev/rop13c
/dev/rop13c: character special (54/21506) SCSI #1 RWZ52 disk #104 (SCSI ID #5) errors = 0/176 offline
# file /dev/rop12c
/dev/rop12c: character special (54/20482) SCSI #1 RWZ52 disk #96 (SCSI ID #4) errors = 0/263 offline
There are errors on both drives, so whatever the problem is, it's not
drive-specific, i.e. there isn't a "bad" drive. Anything common to both
drives, and anything on the path between the operating system, which reports
the errors, and the drives is suspect.
This includes the SCSI bus itself, which is why I asked about other devices
on that bus. Going back to the scu output, I see there's a tape drive on
that bus too -- is the tape drive behaving ok? It would be a simple test
to replace all the cables and terminators, and see if that makes a difference.
Another thing to try is to temporarily take the tape drive (and anything
else that may have been put on that bus since) off the bus -- have just one
cable straight from the 2100 to the jukebox, and terminate the bus at the
jukebox.
(I recognize that "testing" is likely to crash the system if the cables
aren't the problem, so they should schedule this and not have users on at
the time.)
The external cables aren't all the cables there are: There are also cables
inside the jukebox. We had one case where a jukebox was shipped from the
factory that didn't have its internal cables connected. So it would be good
to have all the internal connections checked. Since the problems got worse
after the jukebox was serviced, one obvious thing to check is whether
everything was put back together correctly.
Speaking of putting things back together correctly... Where were the platters
while the jukebox was being serviced? Were they all taken out and kept at the
customer site? If so, were they put back in the same slots, with the same
side up, as before? (Since the jmd can't inventory the platters, if they're
moved, HSM will have obsolete info as to what platter is in what slot.)
Were they left in the jukebox? If so, was Field Service aware that the
platters had data on them?
One thing in common between the drives is their firmware revision level.
But my drives are at the same firmware rev level, though with the slight
difference that mine are pre-release models and didn't have their names
and vendor ids changed to RWZ52 and DEC yet. I've used them under v3.2d-1
also, but on a different model of processor. So the problem probably isn't
in the firmware, or the higher levels of Digital Unix.
The lower levels of the drivers (the port drivers) are specific to the type
of SCSI adapter -- a problem here would not be ruled out by the fact that
I've used RWZ52s under v3.2d-1, because I probably don't have the same type
of adapter. On the other hand, neither 2100s nor RWZ52s are new, so it
would be surprising for something to surface now.
So the first thing to rule out is still a problem with the bus, whether
in or out of the jukebox.
-----------------------------------------------------------------------------
During system startup, the drivers are all told to initialize themselves.
Since the crash happened only after the system was fully up, startup got
past the initialization. But that doesn't mean it got through without
errors, so...were there any errors reported during startup? These are
recorded in /var/adm/messages -- look for the section where all the devices
are being listed. Also during startup, when the bus is reset, the jukebox
should display its revision level on its front panel -- did anyone see this
happen?
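A quick way to pull the likely suspects out of the messages file (just a
sketch -- the patterns are guesses, adjust to taste):
egrep -i 'scsi|cam|error' /var/adm/messages | more
The device configuration lines (rz*, tu0, psiop0, etc.) are all in there
too, so the section where the devices are listed is easy to find.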
Did anyone notice if HSM is able to load platters into drives? That would
help us tell whether this was a problem affecting the optical drives only,
or the changer as well. (If it's *all* the devices, the problem is likely
to be a generic bus problem; if the changer works fine, then either internal
connections in the jukebox, or the port driver / adapter interaction with
the optical drives become more likely.) We can check whether platters can
be loaded, without having HSM try to read their filesystems. (If they took
HSM out of the system startup, they'll need to start it by hand, by running
the modified /usr/efs/PolycenterHSM.startup with inventory set to none.)
The following won't work quite right if platters are not in the same slots
as they were originally -- we're going to ask HSM to load platters, so if we
ask it to load something from a slot that used to have a platter in it, and
there's no platter there now, it'll report an error. But that won't mean
there's any problem with loading, just that the platters aren't in their
expected slots.
It would be best to do the following while near the jukebox, to see if it
is actually doing anything.
Before starting HSM:
Do "file" on both raw optical devices, as before -- both should be offline,
since HSM hasn't been asked to load anything yet. If one *isn't* offline,
that means a platter was left in a drive somewhere along the way -- this
is going to complicate things, so use the jukebox front panel to unload
the platter.
After HSM is up, do listm -f to get a list of platter names. Choose some
platter name and use the loadm command to tell HSM to load it, e.g.
loadm 1004a
Does the jukebox do anything?
Do file on both rop devices again. Loadm reported which drive it put the
platter in. That drive should no longer say offline.
Now loadm another platter, which should go in the other drive. Use file to
check that that drive is no longer offline.
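Put together, the whole test might look something like this (1004a is the
example name above; 1005a is just a made-up second platter name -- use
whatever listm actually prints):
file /dev/rop12c /dev/rop13c     (both should say offline)
/usr/efs/PolycenterHSM.startup   (start HSM by hand, inventory off)
listm -f
loadm 1004a
file /dev/rop12c /dev/rop13c     (the drive loadm named should not be offline)
loadm 1005a
file /dev/rop12c /dev/rop13c     (now neither drive should be offline)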
If the changer operates, the problem is less likely to be external SCSI
cables, or other devices on the bus. If the changer does not operate,
the problem is less likely to be specific to the optical drives, e.g.
less likely to be their firmware, but we don't really suspect that anyway.
This test doesn't rule out any of the other possibilities.
-- Pat
|
| Sid --
CLC 3.1a is fine for HSM -- it's what most sites are using now. The changes
between 3.1 and 3.1a were mainly for DUnix v4.0.
> According to the NJA12a SPD, only CLC300 and CLC310 are supported.
That's because the SPD was last updated before CLC 3.1a was released.
In general, the latest version of CLC should be used, since CLC contains
different copies of the driver for each version of DUnix, and will pick
whichever is appropriate.
The exception is OSMS/OSDS, because those products include drivers that
interact with CLC in the kernel -- they currently ship the particular
version of CLC that they need.
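An easy way to check which CLC subsets a system actually has installed (a
sketch -- I don't remember the exact subset names, so the pattern is a
guess):
setld -i | grep -i clc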
* * *
So this problem has not been resolved?
I don't think Avril tried the things in .4 on the repaired jukebox before
he left -- that might be a good starting point. In fact, since the repair
may have left the jukebox with odd settings, the whole configuration should
be checked. So it would also be good to re-do the earlier checks, e.g. to
make sure that the SCSI ids are in the right order, so that platters get
loaded into the correct drives.
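The scu output we already have shows the layout as it was before the repair;
re-running it would show the layout now. Something like (syntax from
memory):
scu show edt
lists every device CAM can see on each bus, with its SCSI id, so it's a
quick way to confirm the ids are where they're supposed to be.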
I think it's most likely that there is still a hardware problem -- that the
repair didn't fix it. To see if it is a hardware problem, the simplest thing
would be to start swapping things one at a time, changing the easiest things
first: SCSI cables + terminators, SCSI adapter, the jukebox itself...
Is anything on the same bus with the jukebox? Does that other thing work?
Could the jukebox be moved to a different bus?
Any way you could get the jukebox connected to a different machine?
How about one with a totally different type of SCSI adapter? (This last
actually isn't a check for a hardware problem -- it's to see if there might
be a SCSI protocol "difference of opinion" between the Unix port driver and
the jukebox firmware.)
Some (other) non-hardware things to try:
Was the jukebox used for something else previously? If so, maybe it needs
to be reset to factory default settings. There should be a "test" through
the front panel that will do this. Do they have the jukebox manual that
lists these "tests"? If they don't, Field Service should have it. (I haven't
found one here, yet.) Or someone in the Optical group should know. (The
optical support folks got downsized, and my contact there transferred to
another group.)
Maybe the firmware is messed up and needs to be reloaded? Field Service,
again, should be able to do this.
-- Pat
|