
Conference mvblab::sable

Title:SABLE SYSTEM PUBLIC DISCUSSION
Moderator:COSMIC::PETERSON
Created:Mon Jan 11 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2614
Total number of notes:10244

2576.0. "2100A RM, crash/halt, no dump, no errlog" by OHFSS1::FULLER (Never confuse a memo with reality) Thu Apr 10 1997 15:40

    [Crossposted, when I was reminded that ALPHASTATION is not SABLE...]
    
          <<< WRKSYS::SYS_TOOLS:[NOTES$LIBRARY]ALPHASTATION.NOTE;1 >>>
                       -< Alpha Workstation Conference >-
================================================================================
Note 1919.0         2100A RM, crash/halt, no dump, no errlog           4 replies
OHFSS1::FULLER "Never confuse a memo with reality"   57 lines  10-APR-1997 10:36
--------------------------------------------------------------------------------
    My customer has a 2100A RM system 5/300 with cpus and 1GB of memory.
    On the PCI bus, there are:
    
    	1 PB2GA-JB	S3TRIO 64 VGA video
    	1 DE435		Ethernet
    	1 KZPAA		SCSI, with TZ87 at SCSI target 2
    	1 KZPDA		SCSI (FWSE), with 2 CDROM drives (RRD45)
    	1 KZPDA		SCSI (FWSE), with several disks (RZ28, RZ29)
    
    The system is located about 20 miles from the system manager's desk, so
    he likes to do as much remote system management as possible, which
    includes an occasional reboot (shutdown -r).
    
    Every now and then, when he attempts to reboot the system, it fails to
    come back, so he drives the 20 miles to the system to find out what
    happened, and to reboot the system.  When he gets to the system, he
    finds that it's sitting at the >>> prompt.  So, he types BOOT and away
    it goes...sometimes.
    
    When it fails to reboot, we've noted the following:
    
    Part way through the boot process, the system appears to crash, or at
    least try to, then it halts.  There is NO crash information on the
    screen; it just halts to the >>> prompt.
    
    Now, bear with me while I point out what we see on the screen during a
    boot:
    
    	. Type >>> BOOT
    	. Digital Unix (V3.2F) loads, showing text/data/bss sizes
    	. The screen font changes (take note; this is important)
    	. Unix displays hardware inventory
    	. Unix starts the init process, which boots up everything else
    
    What we're seeing is that at some time between the hardware inventory
    display and the rest of the booting, the screen font changes back to
    the font used by the console, and the screen *contents* changes back to
    that which was there when the screen font changed from the console font
    to the Unix font.  Then, it just halts to the >>> prompt.
    
    Since the screen contents revert back to that prior to the hardware
    inventory, there is no information on the screen to provide a hint as
    to the source of the crash.  Since the error logger process had not yet
    started, there is no error log information.  And, there is no crash
    dump.
    
    I spent a day looking at the hardware configuration, and after a long
    series of "try this" and "try that", I found that if I move the tape
    drive from the KZPAA SCSI controller to one of the KZPDA SCSI
    controllers, this crashing/halting problem no longer occurs.  However,
    with the tape on the same SCSI channel as the RZ28/RZ29 disks, this
    creates another problem having to do with backups, which I won't get
    into at this time.
    
    Any takers for this problem?  Thanks!
    
    	Stu
================================================================================
Note 1919.1         2100A RM, crash/halt, no dump, no errlog              1 of 4
OHFSS1::FULLER "Never confuse a memo with reality"    5 lines  10-APR-1997 10:38
                     -< Only fails with console=graphics >-
--------------------------------------------------------------------------------
    Oh, one more thing.  If we run the system with a serial port as the
    console, we don't have the problem.  Unfortunately, this is not an
    option for us.
    
    	Stu
================================================================================
Note 1919.2*        2100A RM, crash/halt, no dump, no errlog              2 of 4
WRKSYS::HOUSE "Kenny House, Workstations Engineering" 6 lines  10-APR-1997 11:07
                             -< try MVBLAB::SABLE >-
--------------------------------------------------------------------------------
    This .. is .. NOT .. an .. AlphaServer .. conference.
    
    Try MVBLAB::SABLE for AlphaServer 2100A questions (press KP7 to add
    this entry to your notebook).
    
    -- Kenny House
================================================================================
Note 1919.3         2100A RM, crash/halt, no dump, no errlog              3 of 4
UTOPIE::OETTL "hide bug until worst time"             8 lines  10-APR-1997 13:38
--------------------------------------------------------------------------------

You have an S3TRIO, is it in the secondary PCI? If yes, move it to the
primary PCI-BUS. I had some problems with S3TRIO's causing crashes when sitting
in the secondary PCI of Lynxes.

Hope this helps,
Ötzi
    
    =================================================================
    
    Oh, and the S3TRIO is on the primary PCI bus.
    
    Thanks!
    
    	Stu
T.R  Title  User  Personal Name  Date  Lines
2576.1  AFW3::MAZUR  Fri Apr 11 1997 09:03  3 lines
Aside from the underlying problem your customer is experiencing, do you
think the customer would benefit from an RCM (Remote Console Management)
card, to save him some of those 20-mile drives?
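
Independent of an RCM card, the "did it come back after shutdown -r?" check can at least be made from the system manager's desk. The following is only a minimal sketch, not something used on this system: it assumes the server answers on some TCP port once it is fully up (port 23 here is an assumption), and the names tcp_probe and wait_for_reboot are hypothetical. It cannot get the machine off the >>> prompt; it only reports that the reboot appears to be stuck.

```python
import socket
import time


def tcp_probe(host, port=23, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 23 (telnet) is an assumption; use any service that is only
    reachable once the system has finished booting.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_reboot(probe, deadline_s=900.0, interval_s=15.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until the host has gone down and come back up.

    Returns:
      "back"        the host was seen down and then answered again
      "stuck"       the host went down and did not return in time
      "never_down"  the host never stopped answering
    """
    start = clock()
    seen_down = False
    while clock() - start < deadline_s:
        up = probe()
        if not up:
            seen_down = True
        elif seen_down:
            return "back"
        sleep(interval_s)
    return "stuck" if seen_down else "never_down"
```

A call such as wait_for_reboot(lambda: tcp_probe("server-name")) would be started right after issuing the remote reboot; "server-name" is a placeholder. A result of "stuck" is the cue that the system is probably sitting at the console prompt.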

2576.2  Bad Video Card (not just placement)  SOLVIT::BAZARNICK  Contemplating Buoyancy  Fri Apr 11 1997 18:02  4 lines
    The response from support engineering:
    
    "I have seen a malfunctioning video card cause this sort of problem.
    Give that a try first."

2576.3  Already tried a new video card  OHFSS1::FULLER  Never confuse a memo with reality  Fri Apr 11 1997 18:13  8 lines
>    "I have seen a malfunctioning video card cause this sort of problem.
>    Give that a try first."
    
    Well, great minds thinking alike and all, "been there, done that".  
    
    Thanks for the suggestion, though.
    
    	Stu

2576.4  Resolution found!  OHFSS1::FULLER  Never confuse a memo with reality  Thu May 15 1997 16:51  41 lines
    For the benefit of those interested...
    
    The original PCI configuration of the machine was:
    
    	DE450		>  After the PCI bridge on the I/O mother board
    	KZPAA	-> TZ87, (2)RRD45
    	KZPDA	-> (6) RZ28M-VW
    	KZPDA	-> (3) RZ28M-VW
    [PCI bridge on the I/O mother board]
    	[empty] 	>  Before the PCI bridge
    	[empty] 	>
    	[empty] 	>
    	#9 SVGA card	>
    
    In this configuration, at some point between Digital Unix (V3.2F)
    displaying the hardware inventory and the single user mode prompt, the
    screen would revert to the console font (as it was when the boot first
    started) and to the original screen contents (as they were just before
    the loaded kernel was started).  Occasionally, depending on how
    closely we looked, we would see "PCI LOCAL BUS FAULT", and/or some
    message(s) indicating that a dump was attempted and failed (giving up).
    
    As a workaround, we found that by moving the TZ87 from the KZPAA to one
    of the KZPDA channels, this problem stopped.  However, with the tape
    drive on the same channel with the disks, we would get intermittent
    (once a week, on average) device timeouts.  When the SCSI bus was reset
    as a result of the timeout, the tape would rewind, DecNSR would
    complain, etc., etc...
    
    Putting the tape drive back on the KZPAA channel brought back the LOCAL
    BUS FAULTs, however.  
    
    We ended up moving all the SCSI channels to before the PCI bridge on
    the I/O mother board, and with the TZ87 on the KZPAA, we have been
    running for 1.5 weeks without a problem.  I was going to experiment
    with moving just the KZPAA before the bridge, leaving the KZPDAs after
    the bridge, but the customer was more interested in going home.  Maybe
    I'll be able to perform the KZPDA-after-the-bridge experiment some
    other time.  If so, I'll post the results here if anyone's interested.
    
    	Stu
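
For anyone comparing notes, the three adapter placements discussed in 2576.4 can be written down side by side. This is only a bookkeeping sketch: the adapter names and results are taken from the note, the exact slot each card ended up in is not stated and is not modeled, and the names LAYOUTS and scsi_after_bridge are made up for illustration. All three layouts assume the TZ87 is on the KZPAA.

```python
# Which side of the PCI bridge on the I/O mother board each adapter sits on.
BEFORE, AFTER = "before bridge", "after bridge"

LAYOUTS = {
    # Original configuration.
    "original": {
        "adapters": [("DE450", AFTER), ("KZPAA", AFTER),
                     ("KZPDA", AFTER), ("KZPDA", AFTER),
                     ("#9 SVGA", BEFORE)],
        "result": "intermittent PCI LOCAL BUS FAULT during boot",
    },
    # All SCSI channels moved to before the bridge.
    "final": {
        "adapters": [("DE450", AFTER), ("KZPAA", BEFORE),
                     ("KZPDA", BEFORE), ("KZPDA", BEFORE),
                     ("#9 SVGA", BEFORE)],
        "result": "no problem in 1.5 weeks",
    },
    # The experiment that was not run: only the KZPAA before the bridge.
    "untested": {
        "adapters": [("DE450", AFTER), ("KZPAA", BEFORE),
                     ("KZPDA", AFTER), ("KZPDA", AFTER),
                     ("#9 SVGA", BEFORE)],
        "result": None,  # unknown, never tried
    },
}


def scsi_after_bridge(layout):
    """List the SCSI adapters (KZP*) that sit after the PCI bridge."""
    return [name for name, side in layout["adapters"]
            if name.startswith("KZP") and side == AFTER]


for label, layout in LAYOUTS.items():
    print(label, scsi_after_bridge(layout), layout["result"])
```

The only layout with a known-good result is the one with no SCSI adapter after the bridge, so the note does not establish whether the KZPAA alone is responsible.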