
Conference vaxaxp::vmsnotes

Title:VAX and Alpha VMS
Notice:This is a new VMSnotes, please read note 2.1
Moderator:VAXAXP::BERNARDO
Created:Wed Jan 22 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:703
Total number of notes:3722

577.0. ""looping" kernel stack not valid ?????" by GIDDAY::FLAWN () Fri May 09 1997 09:35

Hi,

I'm dealing with a problem on two AlphaServer 4100 systems, each with a single
400 MHz CPU, which have shown intermittent failures about once a day since
being upgraded from 1 GB to 2 GB of memory.

We're seeing the console showing

halted CPU 0
 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004d290

 CPU 0 restarting

 halted CPU 0

 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004bde0

 CPU 0 restarting

etc.

And this loops continuously. With AUTO_ACTION set to HALT it halts at the
console, but we can't get a dump. I've looked over the hardware and everything
seems in order. It looks like a KRNLSTAKNV crash, but what I don't
understand is why I'm not getting a proper crash and instead get this looping
behaviour.
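
A captured copy of this console output can be tallied by halt PC, to see whether the loop keeps landing on the same few addresses. The following is only a minimal sketch in Python, assuming the capture is plain text in the layout quoted above; the names `halt_pcs` and `SAMPLE` are made up for this illustration and are not part of any console or OpenVMS tool.

```python
import re
from collections import Counter

# Assumed record layout, copied from the console text quoted above:
#   halt code = <n>
#   <reason text>
#   PC = <hex address>
HALT_RE = re.compile(
    r"halt code = (?P<code>\d+)\s*\n"
    r"\s*(?P<reason>[^\n]*?)\s*\n"
    r"\s*PC = (?P<pc>[0-9a-fA-F]+)"
)


def halt_pcs(console_text):
    """Count each (halt code, reason, PC) combination found in the text."""
    return Counter(
        (int(m["code"]), m["reason"], int(m["pc"], 16))
        for m in HALT_RE.finditer(console_text)
    )


# Sample input: the two halt records shown in this note.
SAMPLE = """\
halted CPU 0
 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004d290

 CPU 0 restarting

 halted CPU 0

 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004bde0

 CPU 0 restarting
"""

if __name__ == "__main__":
    # Print one line per distinct halt record, with how often it occurred.
    for (code, reason, pc), count in sorted(halt_pcs(SAMPLE).items()):
        print(f"code={code} pc={pc:016x} count={count} ({reason})")
```

If the same handful of PCs recurs across many halts, those addresses are the ones worth looking up in the loaded image list.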

I tried running a program which puts junk (7FFFFF000) in PHD$Q_KSP of a process.
This takes a KRNLSTAKNV bugcheck and also does not write a dump, but the
system reboots when AUTO_ACTION is set to RESTART rather than looping. So I
think my test method is not close enough to the actual problem.

The customer system had KSTACKPAGES set to default 1 - I've now raised it to 6,
we don't know if that will help yet.

Does anyone have any ideas on how to track this problem down - the only
thought I have is to have it halt and then examine the registers hoping one is
a PCB and couple this with a background job running regularly to do something
like SDA>show proc all .... in the hope that I can at least find out the
process.

I'm not sure if this is purely an OpenVMS problem or if it's a combination of 
a software problem causing the crash and hardware stopping it writing a dump.
DUMPBUG is 1 and so is DUMPSTYLE. I'll be trying a program to force an
INVEXCEPTN or SSRVEXCEPTN, but I think these will work OK.

I'm open to any suggestions.

Regards and thanks,
Dave Flawn
CSC Sydney

577.1. "try this" by MILORD::BISHOP (The punishment that brought us peace was upon Him) Fri May 09 1997 10:30
    What version of VMS?
    
    Set AUTO_ACTION to RESTART and set DUMPSTYLE to 3 (I want to see how
    far into BUGCHECK it gets before the second KSTAKNV occurs.)
    
    And use the following hack to force the KSTAKNV...
    
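	.psect	data,rd,noexe,wrt
zero:	.long	0			; empty argument list (count = 0) passed to bar
	.psect	code,rd,exe,nowrt
foo::	.call_entry			; user-mode entry point
	pushab	zero			; arglst: address of the empty argument list
	pushab	bar			; routin: routine to execute in kernel mode
	calls	#2,sys$cmkrnl		; change mode to kernel (needs CMKRNL privilege)
	movl	#ss$_normal,r0		; not reached if the system goes down as intended
	ret
bar::	.call_entry			; runs in kernel mode
	calls	#0,bar			; recurse without limit to exhaust the kernel stack
	ret
	.end	foo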
    
    - Richard.

577.2. "thanks, sounds like a good idea...." by GIDDAY::FLAWN () Fri May 09 1997 10:34
    Sorry, it's 6.2-1H3. Yes, setting DUMPSTYLE to 3 sounds good as we'll
    get more console output; this was discussed briefly this afternoon. The
    customer may go back to 1 GB for a bit (though they really need the
    extra memory). Thanks for the code idea. I was thinking of just
    continually increasing KSP, but doing repeated calls to overflow the
    stack looks much better and more realistic.
    
    Regards and thanks,
    Dave.

577.3. "Questions, Suggestions..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Fri May 09 1997 10:50
   What version of firmware is in use, what version of OpenVMS is in use,
   and what are the settings of the various console environment variables,
   including BOOT_RESET?  Are you getting any odd messages during a system
   bootstrap -- messages possibly associated with a too-low ERLBUFFERPAGES
   -- or anything odd in the error logs?  Is shadowing (with the current
   ECO) and/or the V7.1 compatibility kit (with the current ECO) in use?

   What is the output from the commands:

	>>> SHOW FRU
	>>> INFO 1
	>>> INFO 2
	>>> INFO 3
	>>> INFO 4

   There is a known RMS problem that can crop up on some V6.2-vintage
   systems, see the ALPRMSnn_nnn ECO kit for details.

   Also note, the AlphaServer 4100 (Rawhide) series conference is located at
   MVBLAB::ALPHASERVER_4100.

577.4. "thanks, some good ideas I hadn't been into" by GIDDAY::FLAWN () Fri May 09 1997 11:47
    Thanks Steve,

    Firmware is latest, 4.8-6, OpenVMS 6.2-1H3. BOOT_RESET is OFF.
    I'm not sure which way I should go on that one given what we're
    seeing?
    
    Yes - this system did not have ERLBUFFERPAGES set up right - I have now
    corrected that - the error log was, as I understand it, writing time
    stamps but I think we need this right to ensure we get error info. It
    is now.

    Not sure on VOLSHAD; it hadn't occurred to me. So you're suggesting
    ALPSHAD as a precaution? Also sounds good.
    
    Thanks for the RMS one - that's a mandatory so should be recommended.

    I've found a possible KRNLSTAKNV from NETACP which may be relevant. 

    I can get the INFO/SHOW FRU output (it's been captured).
    I'm not sure what registers to look at for this sort of thing (right
    now I only have hardcopy).
    
    The system has a DEFPA (FDDI), 2 x KZPDA, 1 x CIPCA, S3 VGA, KFPSA.
    Same fault even if we move the DEFPA to the other PCI bus or remove the
    VGA card.....
    
    I've run down the 4100 track but we think it's software. We don't know
    why it doesn't dump. Can't find any h/w issues. 
    
    Regards and thanks,
    Dave.
         
577.5. "see also note 589 in MVBLAB::ALPHASERVER_4100" by GIDDAY::FLAWN () Fri May 09 1997 12:05
    Sorry, I forgot (that's what happens at 1am!): note 589 in the 4100
    conference carries exploration of this, and I've been in contact with
    4100 engineering. This wasn't really a cross-post, but I should have
    mentioned it.

577.6. "First Step..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Fri May 09 1997 12:52
   Turn BOOT_RESET *on*.

577.7. "" by UTRTSC::utoras-198-48-95.uto.dec.com::JurVanDerBurg (Change mode to Panic!) Tue May 13 1997 02:11
The kernel stack not valid crash with RMS occurs when a kernel mode caller
(NETACP, for example) calls RMS while there's heavy kernel stack usage
(for example from host-based RAID). It took me a while to find that one.
Increasing KSTACKPAGES does not help in that case, but you should be
able to get a valid crash dump. If it loops, then I'm starting to think about
a hardware problem.

Jur.

577.8. "thanks, yes, could be hardware or software...." by GIDDAY::FLAWN () Tue May 13 1997 05:51
    Thanks,
    
    Yes, I agree the symptom does look like a hardware problem but I can't
    see anything we haven't covered there unless it's some kind of rev
    level incompatibility. We've now applied the RMS kit along with a
    DECnet Phase IV ECO. It looks like DUMPSTYLE was already set to 3; I
    must have missed that before, but the parameter output I have shows we
    get this same scenario with that DUMPSTYLE value.
    
    If the problem is hardware it's very specific - the kernel stack not
    valid halts are the only problem we appear to have. When the customer
    returns to 2GB memory (which they need to do soon, but may not be for
    about 10 days yet) if the problem recurs I'm probably going to have to
    escalate it as an OpenVMS problem while at the same time (if the
    customer will take the hits) trying more hardware by using 1GB modules
    and, if necessary, subbing the whole machine with a 2100.
    
    We also see this same problem with 1.5 GB, so I suppose that's some
    sort of pointer down the hardware side.... using the extra memory
    slots... The 4100 motherboard does handle row/column address logic for
    memory but I can't see how this could reliably get us a kernel stack
    not valid..... and nothing else going wrong.
    
    Regards,
    Dave.