|
Hi,
Customer did have some hardware problems with the jukebox. The jukebox
was sent away and has now come back. When they boot the system up with
the jukebox powered on, the system now crashes. Only if the system is
booted up with the jukebox powered off does the system remain up.
If the jukebox is powered up after startup has completed, the system
does not crash.
I have advised the customer that I think the best way forward is to
start from scratch and reinstall the CAM layered products and then HSM.
What do you think?
At least then we would be starting from a known place.
I am leaving the company tomorrow so if you could reply on the
conference that would be great.
Below is a section of the crash-data file:
_crash_data_collection_time: Thu Feb 13 18:29:34 GMT 1997
_current_directory: /
_crash_kernel: /var/adm/crash/vmunix.15
_crash_core: /var/adm/crash/vmcore.15
_crash_arch: alpha
_crash_os: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_host_version: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_crash_version: Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 00:00 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
Dec 08 10:09 OSF1 V3.2 (Rev. 214.61)
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
DECnet/OSI for Digital UNIX V3.2A-0 (Rev. 23.19); Fri Sep 15 13:21:53 EDT 1995
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
_crashtime: struct {
tv_sec = 855857203
tv_usec = 914609
}
_boottime: struct {
tv_sec = 855857032
tv_usec = 609024
}
_config: struct {
sysname = "OSF1"
nodename = "kofim8"
release = "V3.2"
version = "41"
machine = "alpha"
}
_cpu: 35
_system_string: 0xffffffffff801048 = "AlphaServer 2100 4/200"
_ncpus: 1
_avail_cpus: 1
_partial_dump: 1
_physmem(MBytes): 127
_panic_string: 0xfffffc000067b5a0 = "bread: size 0"
_paniccpu: 0
_panic_thread: 0xfffffc0004dddb80
_preserved_message_buffer_begin:
struct {
msg_magic = 0x63061
msg_bufx = 0x968
msg_bufr = 0x86f
msg_bufc = "PCXAL keyboard, language English (American)
Alpha boot: available memory from 0xa44000 to 0x7ffe000
Digital UNIX V3.2D-1 (Rev. 41); Wed Jan 15 19:53:18 GMT 1997
physical memory = 128.00 megabytes.
available memory = 117.78 megabytes.
using 484 buffers containing 3.78 megabytes of memory
Firmware revision: 3.9
PALcode: OSF version 1.35
ibus0 at nexus
AlphaServer 2100 4/200
cpu 0 EV-4s 1mb b-cache
gpc0 at ibus0
pci0 at ibus0 slot 0
tu0: DECchip 21040-AA: Revision: 2.3
tu0 at pci0 slot 0
tu0: DEC TULIP Ethernet Interface, hardware address: 08-00-2B-E2-68-05
tu0: auto sensing: selected AUI (10Base2|5) port
psiop0 at pci0 slot 1
Loading SIOP: script 1001b00, reg 81000000, data 40759a08
scsi0 at psiop0 slot 0
rz0 at scsi0 bus 0 target 0 lun 0 (DEC RZ28 (C) DEC D41C)
rz1 at scsi0 bus 0 target 1 lun 0 (DEC RZ26L (C) DEC 440C)
rz2 at scsi0 bus 0 target 2 lun 0 (DEC RZ26L (C) DEC 440C)
rz3 at scsi0 bus 0 target 3 lun 0 (DEC RZ26L (C) DEC 440C)
rz4 at scsi0 bus 0 target 4 lun 0 (DEC RZ26L (C) DEC 440C)
rz5 at scsi0 bus 0 target 5 lun 0 (DEC RZ29B (C) DEC 0014)
rz6 at scsi0 bus 0 target 6 lun 0 (DEC RRD43 (C) DEC 1084)
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
vga0 at eisa0
1024x768 (QVision )
aha0 at eisa0 slot 3
scsi1 at aha0
lvm0: configured.
lvm1: configured.
dli: configured
SuperLAT. Copyright 1993 Meridian Technology Corp. All rights reserved.
datalink: links=64, macs=6
knbinit: sessions=256, names=64
knbtcp: configured
knbtcpd: configured
knbadm configured
nbeadmin_configure
netbeuid_configure
netbeui_configure
x25_access: configured
x25_ip: configured
x25_relay: configured
wandd_base: configured
wandd_llc2: configured
wan_utilities: configured
ctf_base: configured
Node ID is 08-00-2b-e2-68-05 (from device tu0)
dna_netman: configured
dna_dli: configured
ADVFS: using 1152 buffers containing 9.00 megabytes of memory
vm_swap_init: warning /sbin/swapdefault swap device not found
vm_swap_init: in swap over commitment mode
Node UID is c390bf00-85cb-11d0-8008-08002be26805
dna_base: configured
dna_rfc1006: configured
dna_xti: configured
panic (cpu 0): bread: size 0
syncing disks... done
device string for dump = SCSI 0 1 0 1 100 0 0 .
DUMP.prom: dev SCSI 0 1 0 1 100 0 0 , block 131072
device string for dump = SCSI 0 1 0 1 100 0 0 .
DUMP.prom: dev SCSI 0 1 0 1 100 0 0 , block 131072
"
}
_preserved_message_buffer_end:
_kernel_process_status_begin:
PID COMM
00000 kernel idle
00001 init
00008 kloadsrv
00038 update
01066 knblink
01068 sendmail
01074 dllink
01117 timed
01172 mold
01175 internet_mom
01184 snmp_pe
01190 inetd
01195 cron
01223 pwlic.reg
01229 lpd
01295 jmd
01299 qr
01326 coolsrvr
01334 xdm
01340 jmd
01342 jmd
01343 oss_exec
01345 getty
01352 nbelink
01354 Xdec
01364 pwalrtr
01368 xdm
01372 lmx.ctrl
01383 getty
01384 getty
01385 getty
01386 getty
01387 getty
01389 getty
01392 getty
01394 getty
01395 getty
01397 getty
01398 getty
01399 getty
01401 getty
01402 getty
01405 getty
01406 getty
01407 getty
01409 getty
01410 getty
01411 getty
01412 getty
01413 getty
01414 getty
01415 getty
01416 getty
01424 getty
01441 Xsetup_0
01451 dxconsole
00670 usd
00740 syslogd
00742 binlogd
00769 gated
00839 named
00872 portmap
00874 mountd
00876 nfsd
00878 nfsiod
00879 nfsiod
00880 nfsiod
00881 nfsiod
00882 nfsiod
00883 nfsiod
00884 nfsiod
00890 automount
00912 dnalimd
00915 dnaevld
00922 ctfd
00952 dnascd
00953 dnansd
00954 dnaksd
00958 dnsadv
00962 dtssd
00965 dnanoded
00976 dnamopd
00986 rfc1006d
01000 osaknmd
01005 ftam_listener
01007 ftam_listener
_kernel_process_status_end:
_current_pid: 1340
_current_tid: 0xfffffc0004dddb80
_proc_thread_list_begin:
thread 0xfffffc0004dddb80 stopped at [boot:1730, 0xfffffc00004800cc]
Source not available
_proc_thread_list_end:
_dump_begin:
> 0 boot(0x0, 0x0, 0xfffffc000067b5a0, 0xffffffffffffffff, 0xffffffff88b8f0a0)
["../../../../src/kernel/arch/alpha/machdep.c":1730, 0xfffffc00004800cc]
1 panic(s = 0xfffffc000067b5a0 = "bread: size 0")
["../../../../src/kernel/bsd/subr_prf.c":757, 0xfffffc000043f3b4]
pcpu = 0x2fa
i = 0
bootopt = 7606120
mycpu = 4151676
spl = 0
prevcc = 18446739675667192340
nextcc = 18446739675783402496
timer = 1
limit = 0
2 bread(vp = 0xfffffc0005273c00, blkno = 0, size = 0, cred = 0xffffffffffffffff, bpp = 0xffffffff88b8f0a0)
["../../../../src/kernel/vfs/vfs_bio.c":393, 0xfffffc000044d46c]
bp = (nil)
error = -2144736992
metadatatype = 0
3 blkatoff(0x8052000006c1, 0x0, 0x0, 0xffffffff88b8f1d8, 0xffffffff88b8f250)
["../../../../src/kernel/ufs/ufs_lookup.c":1332, 0xfffffc0000270e34]
4 scandir(0xfffffc0005273c00, 0x0, 0x0, 0xfffffc0000212270, 0x0)
["../../../../src/kernel/ufs/ufs_lookup.c":530, 0xfffffc000026fd34]
5 ufs_lookup(0xfffffc0005273c00, 0xffffffff88b8f550, 0x3d, 0xfffffc0000295610, 0xfffffc0000295988)
["../../../../src/kernel/ufs/ufs_lookup.c":386, 0xfffffc000026fab0]
6 namei(0xfffffc00006d61a0, 0xfffffc0000000001, 0xfffffc0005273c00, 0xfffffc0000000001, 0xfffffc0000299cf4)
["../../../../src/kernel/vfs/vfs_lookup.c":563, 0xfffffc00002954fc]
7 vn_open(0xffffffff88b8f720, 0x1001, 0x0, 0x633231706f2f7665, 0xfffffc000048eaac)
["../../../../src/kernel/vfs/vfs_vnops.c":515, 0xfffffc0000298ee0]
8 copen(0xfffffc0004ddd210, 0xffffffff88b8f8c8, 0xffffffff88b8f8b8, 0x0, 0x40289374bc6a7efa)
["../../../../src/kernel/vfs/vfs_syscalls.c":1824, 0xfffffc0000454f98]
9 open(0xffffffff88b8f8b8, 0x0, 0x40289374bc6a7efa, 0xfffffc000048e4d4, 0xfffffc000048d8d8)
["../../../../src/kernel/vfs/vfs_syscalls.c":1781, 0xfffffc0000454ebc]
10 syscall(0x140019228, 0x1, 0x0, 0x0, 0x2d)
["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc000048d8d4]
11 _Xsyscall(0x8, 0x1200477c8, 0x1400276e0, 0x11ffff258, 0x0)
["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc000047d024]
_dump_end:
Anyway, thanks for all the help and advice you have given me.
Many thanks
Avril
|
| Avril --
> I am leaving the company tomorrow
I wish you well in your future endeavors!
-- Pat
-------------------------------------------------------------------------------
About the case:
> Customer did have some hardware problems with the jukebox. The jukebox
> was sent away and has now come back.
Sounds like it's worse than it was before... What was done to the jukebox?
Is there any way a different jukebox (preferably one known to be working)
could be brought to the customer site for testing?
> When they boot the system up with
> the jukebox powered on, the system now crashes. Only if the system is
> booted up with the jukebox powered off does the system remain up.
> If the jukebox is powered up after startup has completed, the system
> does not crash.
The crash info shows the jmd trying to open something, probably on one of
the optical drives, and probably while trying to inventory the jukebox.
So the fact that the system stays up if the jukebox is turned on later is
only because the jmd decided there was no jukebox. It won't go back and
try to look for, and inventory, a newly turned on jukebox, unless it's
restarted. So if they restart HSM after the jukebox is turned on, I'd
presume the system would crash then.
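(The timestamps in the crash data bear this out: _crashtime minus _boottime
is 855857203 - 855857032 = 171 seconds, so the panic came less than three
minutes after boot -- right about when startup would be finishing and the
jmd would be running its inventory.)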
The crash appears to have happened because function bread was passed a 0
size argument -- it checks for this, and panics if it gets it.
The stack trace shown in the crash info looks wrong -- there are impossible
arguments in some of the calls, for instance, the open and copen calls
should both have the proc pointer in the first argument...but their first
arguments are not the same. Also, the open flag appears to be 1, O_WRONLY,
which the jmd does not use. So maybe the stack wasn't valid at the time
of the dump. Or I'm confused about what it means... I don't have access
to v3.2d-1 sources, but I looked at both v3.2c and v3.2f sources.
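(For whoever picks this up: to look at the trace again directly, kernel-mode
dbx on the files named at the top of the crash-data should do it -- a sketch,
assuming the dump files are still where savecore left them:
dbx -k /var/adm/crash/vmunix.15 /var/adm/crash/vmcore.15
then, at the dbx prompt, "t" prints the stack of the thread that was current
at the panic.)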
There are several possibilities:
1) The jukebox is still sick. Although the stack trace shows the panic
happening before the open gets to the point of trying to access the
device, device-related things may have happened earlier that messed up
internal data structures or some such thing, and this panic is a
consequence of that earlier damage.
2) Their vmunix is corrupted. Why don't they rebuild their kernel, and
compare the new and old versions? They should make sure the rebuild
succeeds. They can rebuild by doing:
doconfig -c XXXX
where XXXX is the name of their machine in uppercase. They should
answer no when doconfig asks them if they want to edit the configuration
file.
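For the record, a sketch of the whole sequence, using their node name
(kofim8, from the crash data, so the configuration name should be KOFIM8).
I believe doconfig leaves the new kernel in /usr/sys/KOFIM8/vmunix, but it
prints the actual path when it finishes:
doconfig -c KOFIM8
ls -l /vmunix /usr/sys/KOFIM8/vmunix
cp /vmunix /vmunix.old
cp /usr/sys/KOFIM8/vmunix /vmunix
Then reboot. Comparing sizes with ls -l is only a gross check -- a rebuilt
kernel won't be byte-identical to the shipped one anyway -- but booting the
new kernel tells us whether the old vmunix was the problem.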
> I have advised the customer that I think the best way forward is to
> start from scratch and reinstall the CAM layered products and then HSM.
I don't see anything HSM-specific in this case, but if they did reinstall,
they'd rebuild the kernel along the way... It might confuse the issue
to do both at once, though.
3) There's a bug in the OS. I didn't see any patches in the v3.2d-1 patch
set that mention a "bread: size 0" panic, but not all patches give their
precise symptoms. If a kernel rebuild doesn't help, it may be necessary
to get help from the Unix group, to look at the crash dump.
4) Something else that talks to the device is incompatible with it.
What SCSI adapter are they using on the bus the jukebox is on? Are
any other devices on the same bus?
But before we get to these second-level possibilities, the first approach
would be to replace or repair the jukebox...after all, it *was* getting
those huge numbers of errors, and the behavior *did* change when it was
"fixed".
-- Pat
|
| In order to prevent the system from crashing, another thing they can do
instead of powering off the jukebox is to take HSM out of the system startup.
To not start HSM during system startup, remove the link to its startup
procedure:
rm /sbin/rc3.d/S70hsm
One of the tests I'll suggest below requires that HSM be running, so it
would have to be started by hand later, but we don't want it trying an
inventory when it starts.
To turn off the inventory, change the file /usr/efs/PolycenterHSM.startup to
set the inventory type to "none" by replacing the line:
/usr/efs/bin/jmd
with
/usr/efs/bin/jmd -i none
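Putting the two changes together (a sketch; the later by-hand start assumes
the modified startup script can simply be run as a command, which is how
we'll start HSM for the test below):
rm /sbin/rc3.d/S70hsm
(edit /usr/efs/PolycenterHSM.startup as above: jmd -> jmd -i none)
Then, once the system is up and the jukebox is powered on:
/usr/efs/PolycenterHSM.startup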
-------------------------------------------------------------------------------
Going back to some previous evidence:
# file /dev/rop13c
/dev/rop13c: character special (54/21506) SCSI #1 RWZ52 disk #104 (SCSI ID #5) errors = 0/176 offline
# file /dev/rop12c
/dev/rop12c: character special (54/20482) SCSI #1 RWZ52 disk #96 (SCSI ID #4) errors = 0/263 offline
There are errors on both drives, so whatever the problem is, it's not
drive-specific, i.e. there isn't a "bad" drive. Anything common to both
drives, and anything on the path between the operating system, which reports
the errors, and the drives is suspect.
This includes the SCSI bus itself, which is why I asked about other devices
on that bus. Going back to the scu output, I see there's a tape drive on
that bus too -- is the tape drive behaving ok? It would be a simple test
to replace all the cables and terminators, and see if that makes a difference.
Another thing to try is to temporarily take the tape drive (and anything
else that may have been put on that bus since) off the bus -- have just one
cable straight from the 2100 to the jukebox, and terminate the bus at the
jukebox.
(I recognize that "testing" is likely to crash the system if the cables
aren't the problem, so they should schedule this and not have users on at
the time.)
The external cables aren't all the cables there are: There are also cables
inside the jukebox. We had one case where a jukebox was shipped from the
factory that didn't have its internal cables connected. So it would be good
to have all the internal connections checked. Since the problems got worse
after the jukebox was serviced, one obvious thing to check is whether
everything was put back together correctly.
Speaking of putting things back together correctly... Where were the platters
while the jukebox was being serviced? Were they all taken out and kept at the
customer site? If so, were they put back in the same slots, with the same
side up, as before? (Since the jmd can't inventory the platters, if they're
moved, HSM will have obsolete info as to what platter is in what slot.)
Were they left in the jukebox? If so, was Field Service aware that the
platters had data on them?
One thing in common between the drives is their firmware revision level.
But my drives are at the same firmware rev level, though with the slight
difference that mine are pre-release models and didn't have their names
and vendor ids changed to RWZ52 and DEC yet. I've used them under v3.2d-1
also, but on a different model of processor. So the problem probably isn't
in the firmware, or the higher levels of Digital Unix.
The lower levels of the drivers (the port drivers) are specific to the type
of SCSI adapter -- a problem here would not be ruled out by the fact that
I've used RWZ52s under v3.2d-1, because I probably don't have the same type
of adapter. On the other hand, neither 2100s nor RWZ52s are new, so it
would be surprising for something to surface now.
So the first thing to rule out is still a problem with the bus, whether
in or out of the jukebox.
-----------------------------------------------------------------------------
During system startup, the drivers are all told to initialize themselves.
Since the crash happened only after the system was fully up, startup got
past the initialization. But that doesn't mean it got through without
errors, so...were there any errors reported during startup? These are
recorded in /var/adm/messages -- look for the section where all the devices
are being listed. Also during startup, when the bus is reset, the jukebox
should display its revision level on its front panel -- did anyone see this
happen?
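A quick way to pull the likely suspects out of the messages file (just a
sketch -- the patterns are guesses, adjust to taste):
egrep -i 'scsi|cam|error' /var/adm/messages | more
The device configuration lines (rz*, tu0, psiop0, etc.) are all in there
too, so the section where the devices are listed is easy to find.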
Did anyone notice if HSM is able to load platters into drives? That would
help us tell whether this was a problem affecting the optical drives only,
or the changer as well. (If it's *all* the devices, the problem is likely
to be a generic bus problem; if the changer works fine, then either internal
connections in the jukebox, or the port driver / adapter interaction with
the optical drives become more likely.) We can check whether platters can
be loaded, without having HSM try to read their filesystems. (If they took
HSM out of the system startup, they'll need to start it by hand, by running
the modified /usr/efs/PolycenterHSM.startup with inventory set to none.)
The following won't work quite right if platters are not in the same slots
as they were originally -- we're going to ask HSM to load platters, so if we
ask it to load something from a slot that used to have a platter in it, and
there's no platter there now, it'll report an error. But that won't mean
there's any problem with loading, just that the platters aren't in their
expected slots.
It would be best to do the following while near the jukebox, to see if it
is actually doing anything.
Before starting HSM:
Do "file" on both raw optical devices, as before -- both should be offline,
since HSM hasn't been asked to load anything yet. If one *isn't* offline,
that means a platter was left in a drive somewhere along the way -- this
is going to complicate things, so use the jukebox front panel to unload
the platter.
After HSM is up, do listm -f to get a list of platter names. Choose some
platter name and use the loadm command to tell HSM to load it, e.g.
loadm 1004a
Does the jukebox do anything?
Do file on both rop devices again. Loadm reported which drive it put the
platter in. That drive should no longer say offline.
Now loadm another platter, which should go in the other drive. Use file to
check that that drive is no longer offline.
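Put together, the whole test might look something like this (1004a is the
example name above; 1005a is just a made-up second platter name -- use
whatever listm actually prints):
file /dev/rop12c /dev/rop13c     (both should say offline)
/usr/efs/PolycenterHSM.startup   (start HSM by hand, inventory off)
listm -f
loadm 1004a
file /dev/rop12c /dev/rop13c     (the drive loadm named should not be offline)
loadm 1005a
file /dev/rop12c /dev/rop13c     (now neither drive should be offline)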
If the changer operates, the problem is less likely to be external SCSI
cables, or other devices on the bus. If the changer does not operate,
the problem is less likely to be specific to the optical drives, e.g.
less likely to be their firmware, but we don't really suspect that anyway.
This test doesn't rule out any of the other possibilities.
-- Pat
|
| Sid --
CLC 3.1a is fine for HSM -- it's what most sites are using now. The changes
between 3.1 and 3.1a were mainly for DUnix v4.0.
> According to the NJA12a SPD, only CLC300 and CLC310 are supported.
That's because the SPD was last updated before CLC 3.1a was released.
In general, the latest version of CLC should be used, since CLC contains
different copies of the driver for each version of DUnix, and will pick
whichever is appropriate.
The exception is OSMS/OSDS, because those products include drivers that
interact with CLC in the kernel -- they currently ship the particular
version of CLC that they need.
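An easy way to check which CLC subsets a system actually has installed (a
sketch -- I don't remember the exact subset names, so the pattern is a
guess):
setld -i | grep -i clc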
* * *
So this problem has not been resolved?
I don't think Avril tried the things in .4 on the repaired jukebox before
he left -- that might be a good starting point. In fact, since the repair
may have left the jukebox with odd settings, the whole configuration should
be checked. So it would also be good to re-do the earlier checks, e.g. to
make sure that the SCSI ids are in the right order, so that platters get
loaded into the correct drives.
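The scu output we already have shows the layout as it was before the repair;
re-running it would show the layout now. Something like (syntax from
memory):
scu show edt
lists every device CAM can see on each bus, with its SCSI id, so it's a
quick way to confirm the ids are where they're supposed to be.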
I think it's most likely that there is still a hardware problem -- that the
repair didn't fix it. To see if it is a hardware problem, the simplest thing
would be to start swapping things one at a time, changing the easiest things
first: SCSI cables + terminators, SCSI adapter, the jukebox itself...
Is anything on the same bus with the jukebox? Does that other thing work?
Could the jukebox be moved to a different bus?
Any way you could get the jukebox connected to a different machine?
How about one with a totally different type of SCSI adapter? (This last
actually isn't a check for a hardware problem -- it's to see if there might
be a SCSI protocol "difference of opinion" between the Unix port driver and
the jukebox firmware.)
Some (other) non-hardware things to try:
Was the jukebox used for something else previously? If so, maybe it needs
to be reset to factory default settings. There should be a "test" through
the front panel that will do this. Do they have the jukebox manual that
lists these "tests"? If they don't, Field Service should have it. (I haven't
found one here, yet.) Or someone in the Optical group should know. (The
optical support folks got downsized, and my contact there transferred to
another group.)
Maybe the firmware is messed up and needs to be reloaded? Field Service,
again, should be able to do this.
-- Pat
|