
Conference vaxaxp::vmsnotes

Title:	VAX and Alpha VMS
Notice:	This is a new VMSnotes, please read note 2.1
Moderator:	VAXAXP::BERNARDO

Created:	Wed Jan 22 1997
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	703
Total number of notes:	3722

577.0. ""looping" kernel stack not valid ?????" by GIDDAY::FLAWN () Fri May 09 1997 08:35

Hi,

I'm dealing with a problem on two AlphaServer 4100 systems (single 400MHz CPU
each) which, since being upgraded from 1GB to 2GB of memory, show intermittent
failures about once a day.

We're seeing the console showing

halted CPU 0
 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004d290

 CPU 0 restarting

 halted CPU 0

 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004bde0

 CPU 0 restarting

etc.

And this loops continuously. With AUTO_ACTION set to HALT it halts at the
console, but we can't get a dump. I've looked over the hardware and everything
seems in order. It looks like this is a KRNLSTAKNV crash, but what I don't
understand is why I'm not getting a crash dump and instead get this looping.

I tried running a program which puts junk (7FFFFF000) in PHD$Q_KSP of a process.
This takes a KRNLSTAKNV bugcheck but also does not write a dump (although the
system reboots when AUTO_ACTION is set to RESTART rather than looping). So I
think my test method is not close enough to the actual problem.

The customer system had KSTACKPAGES set to default 1 - I've now raised it to 6,
we don't know if that will help yet.

Does anyone have any ideas on how to track this problem down - the only
thought I have is to have it halt and then examine the registers hoping one is
a PCB and couple this with a background job running regularly to do something
like SDA>show proc all .... in the hope that I can at least find out the
process.
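
The background job idea could be sketched as a DCL command procedure along
these lines. The procedure name, the output file name and the five-minute
interval are assumptions made for illustration, and ANALYZE/SYSTEM needs
CMKRNL privilege:

```dcl
$! SDA_SNAPSHOT.COM (hypothetical name)
$! Periodically record process information from the running system so that
$! the last snapshot before a halt can be compared with the halt-time registers.
$ SET NOON
$LOOP:
$   ANALYZE/SYSTEM
SET OUTPUT SYS$MANAGER:SDA_SNAPSHOT.LIS
SHOW SUMMARY
SHOW PROCESS ALL
EXIT
$!  Each pass creates a new file version; keep only the most recent few
$   PURGE/KEEP=10 SYS$MANAGER:SDA_SNAPSHOT.LIS
$   WAIT 00:05:00
$   GOTO LOOP
```

Submitted as a batch job (for example with SUBMIT/NOPRINTER), this would leave
a list of PCB addresses on disk to match against whatever the registers show
at the halt. It only helps if the failing process exists at the time of the
last snapshot.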

I'm not sure if this is purely an OpenVMS problem or if it's a combination of 
a software problem causing the crash and hardware stopping it writing a dump.
DUMPBUG is 1 and so is DUMPSTYLE. I'll be trying a program to force an
INVEXCEPTN or SSRVEXCEPTN but I think these will work ok.
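
For reference, the parameters mentioned above can be checked from SYSGEN.
This is only a sketch of the session; it assumes the dump file is the default
SYS$SYSTEM:SYSDUMP.DMP, which the note does not state:

```dcl
$! Display the dump-related parameters as they will be used at the next boot
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SHOW DUMPBUG
SYSGEN> SHOW DUMPSTYLE
SYSGEN> SHOW KSTACKPAGES
SYSGEN> EXIT
$! Assumption: default dump file location; confirm the file exists and check its size
$ DIRECTORY/SIZE=ALL SYS$SYSTEM:SYSDUMP.DMP
```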

I'm open to any suggestions.

Regards and thanks,
Dave Flawn
CSC Sydney

T.R	Title	User	Personal Name	Date	Lines
577.1	try this	MILORD::BISHOP	The punishment that brought us peace was upon Him	`Fri May 09 1997 09:30`	22
	What version of VMS?

	Set AUTO_ACTION to RESTART and set DUMPSTYLE to 3 (I want to see how far
	into BUGCHECK it gets before the second KSTAKNV occurs.)

	And use the following hack to force the KSTAKNV...

		.psect	data,rd,noexe,wrt
	zero:	.long	0
		.psect	code,rd,exe,nowrt
	foo::	.call_entry
		pushab	zero
		pushab	bar
		calls	#2,sys$cmkrnl
		movl	#ss$_normal,r0
		ret
	bar::	.call_entry
		calls	#0,bar
		ret
		.end	foo

	- Richard.
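
A sketch of how the suggestion in 577.1 might be applied, assuming the MACRO
source is saved as KSTAKNV.MAR (a hypothetical file name) and the account holds
CMKRNL. The program is expected to bugcheck the system, so this is only for a
machine that can be crashed:

```dcl
$! DUMPSTYLE 3 is understood here as selective dump plus full console output
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SET DUMPSTYLE 3
SYSGEN> WRITE CURRENT
SYSGEN> EXIT
$! Hypothetical file name; on OpenVMS Alpha the MACRO command invokes the
$! MACRO-32 compiler by default
$ MACRO KSTAKNV.MAR
$ LINK KSTAKNV
$! SYS$CMKRNL requires the CMKRNL privilege
$ SET PROCESS/PRIVILEGES=CMKRNL
$ RUN KSTAKNV
```

WRITE CURRENT stores the value for the next boot; a later AUTOGEN run may
replace it unless the same setting is also placed in MODPARAMS.DAT.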
577.2	thanks, sounds like a good idea....	GIDDAY::FLAWN		`Fri May 09 1997 09:34`	9
	Sorry, it's 6.2-1H3.

	Yes, setting DUMPSTYLE to 3 sounds good as we'll get more console output -
	it was discussed briefly this afternoon. The customer may go back to 1GB
	for a bit (though they really need the extra memory).

	Thanks for the code idea - I was thinking of just continually increasing
	KSP, but doing repeated calls to overflow the stack looks much better and
	more realistic.

	Regards and thanks,
	Dave.
577.3	Questions, Suggestions...	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Fri May 09 1997 09:50`	22
	What version of firmware is in use, what version of OpenVMS is in use,
	and what are the settings of the various console environment variables,
	including BOOT_RESET?

	Are you getting any odd messages during a system bootstrap -- messages
	possibly associated with a too-low ERLBUFFERPAGES -- or anything odd in
	the error logs?

	Is shadowing (with the current ECO) and/or the V7.1 compatibility kit
	(with the current ECO) in use?

	What is the output from the commands:

		>>> SHOW FRU
		>>> INFO 1
		>>> INFO 2
		>>> INFO 3
		>>> INFO 4

	There is a known RMS problem that can crop up on some V6.2-vintage
	systems, see the ALPRMSnn_nnn ECO kit for details.

	Also note, the AlphaServer 4100 (Rawhide) series conference is located
	at MVBLAB::ALPHASERVER_4100.
577.4	thanks, some good ideas I hadn't been into	GIDDAY::FLAWN		`Fri May 09 1997 10:47`	32
	Thanks Steve,

	Firmware is the latest, 4.8-6; OpenVMS is 6.2-1H3. BOOT_RESET is OFF. I'm
	not sure which way I should go on that one given what we're seeing?

	Yes - this system did not have ERLBUFFERPAGES set up right. I have now
	corrected that. The error log was, as I understand it, writing time
	stamps, but I think we need this right to ensure we get error info.

	Not sure on VOLSHAD - it hadn't occurred to me. So you're suggesting
	ALPSHAD as a precaution? That also sounds good.

	Thanks for the RMS one - that's a mandatory kit so it should be
	recommended. I've found a possible KRNLSTAKNV from NETACP which may be
	relevant.

	I can get the INFO/SHOW FRU output (it's been captured). I'm not sure
	what registers to look at for this sort of thing (right now I only have
	hardcopy).

	The system has a DEFPA (FDDI), 2 x KZPDA, 1 x CIPCA, S3 VGA, KFPSA. We
	get the same fault even if we move the DEFPA to the other PCI bus or
	remove the VGA card.

	I've run down the 4100 track but we think it's software. We don't know
	why it doesn't dump. Can't find any h/w issues.

	Regards and thanks,
	Dave.
577.5	see also note 589 in MVBLAB::ALPHASERVER_4100	GIDDAY::FLAWN		`Fri May 09 1997 11:05`	4
	Sorry, I forgot (that's what happens at 1am!): note 589 in the 4100
	conference carries exploration of this, and I've been in contact with
	4100 engineering. Sorry - this wasn't really a cross post but I should
	have mentioned it.
577.6	First Step...	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	`Fri May 09 1997 11:52`	3
	Turn BOOT_RESET on.
577.7		UTRTSC::utoras-198-48-95.uto.dec.com::JurVanDerBurg	Change mode to Panic!	`Tue May 13 1997 01:11`	9
	The kernel stack not valid crash with RMS occurs when a kernel mode
	caller (NETACP for example) calls RMS while there's heavy kernel stack
	usage (for example from host based RAID). It took me a while to find that
	one. Increasing KSTACKPAGES does not help in that case, but you should be
	able to get a valid crash dump. If it loops then I'm starting to think
	about a hardware problem.

	Jur.
577.8	thanks, yes, could be hardware or software....	GIDDAY::FLAWN		`Tue May 13 1997 04:51`	25
	Thanks,

	Yes, I agree the symptom does look like a hardware problem, but I can't
	see anything we haven't covered there unless it's some kind of rev level
	incompatibility.

	We've now applied the RMS kit along with a DECnet Phase IV ECO.

	It looks like DUMPSTYLE was set to 3 for this. I must have missed that
	before, but the parameter output I have shows we get this same scenario
	with that DUMPSTYLE value.

	If the problem is hardware it's very specific - the kernel stack not
	valid halts are the only problem we appear to have.

	When the customer returns to 2GB memory (which they need to do soon, but
	may not be for about 10 days yet), if the problem recurs I'm probably
	going to have to escalate it as an OpenVMS problem. At the same time (if
	the customer will take the hits) I'll try more hardware changes, using
	1GB modules and, if necessary, subbing the whole machine with a 2100.

	We also see this same problem with 1.5GB, so I suppose that's some sort
	of pointer down the hardware side: it appears when the extra memory slots
	are in use. The 4100 motherboard does handle row/column address logic for
	memory, but I can't see how this could reliably get us a kernel stack not
	valid and nothing else going wrong.

	Regards,
	Dave.