
Conference kernel::csguk_systems

Title:	CSGUK_SYSTEMS
Notice:	No restrictions on keyword creation
Moderator:	KERNEL::ADAMS

Created:	Wed Mar 01 1989
Last Modified:	Thu Nov 28 1996
Last Successful Update:	Fri Jun 06 1997
Number of topics:	242
Total number of notes:	1855

41.0. "RH780 causes INT STK INVAL" by KERNEL::WRIGHTON (Pass me a +L-14005) Thu Jun 01 1989 08:10

    This is extracted from the STARS database. We caught a cold last
    week with a system that was going INT STK INVAL during
    autoconfigure when we added the RH780.

    V5 expects to see a BR level of 5 for RHs.
    
    
    
    RH780 at BR6 causes CPU to halt with INT. STK. Invalid


********************   CAUTION:  FOR INTERNAL USE ONLY   *********************
*                                                                            *
*      THIS INFORMATION IS FOR USE BY DIGITAL EQUIPMENT CORP. AND ITS        *
*      EMPLOYEES ONLY.  PLEASE USE EXTREME CARE IF YOU MUST DISCUSS ANY      *
*      PART OF THIS INFORMATION WITH ANYONE WHO IS NOT A DIGITAL EMPLOYEE.   *
*                                                                            *
******************************************************************************

\\None                    
\\Product Set  : VAX3                                    
\\PFE          : SOLEIL::ROOSE       
\\Techn. Editor: TIMA::ARTICLE       
PROBLEM/SOLUTION ARTICLE OUTLINE

-------------------------------------------------------------------------------

TITLE: Interrupt Stack Invalid caused by Massbus at BR6

PRODUCT: Massbus

LAYERED PRODUCT:

DATE: 6-March-1989



SYMPTOMS/PROBLEM:

A tape drive connected to an RH780 massbus adapter with a BR level of 6 will
cause the CPU to halt with an Interrupt Stack Invalid when the on-line button
is pressed.

The tape drive in question was a TM78/TU78.

ANALYSIS:

The incoming interrupt from the tape drive was at IPL 16 (hex). This follows
from the RH780 being at TR9 and BR6. The UCB$B_DIPL field in the device's UCB
was set to 15.

This value is used by the device driver to synchronize access to device
registers and UCB fields. If this value was increased to 16 (by halting the CPU
and depositing 16 hex in the SPL$B_IPL field for the device's spinlock) the
problem could be masked.

The device was continuing to interrupt at IPL 16 and was not being serviced
correctly, so each time the IPL was set to the device IPL (15, which does not
block an interrupt at 16) another interrupt at IPL 16 would occur.
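
The mismatch in the analysis can be sketched in a few lines of Python. This is
only an illustration: it assumes the usual VAX convention that bus request
levels BR4 to BR7 correspond to IPLs 14 to 17 hex (the article does not state
the mapping, although it agrees with its "BR6 gives IPL 16"), and the function
names are hypothetical.

```python
# Assumption: BR4..BR7 map to IPL 0x14..0x17 (usual VAX convention,
# consistent with the article's BR6 -> IPL 16 hex).

def br_to_ipl(br_level):
    """Return the interrupt priority level for a bus request level 4..7."""
    if br_level not in range(4, 8):
        raise ValueError("BR level must be 4..7, got %r" % (br_level,))
    return 0x10 + br_level


def interrupt_blocked(br_level, dipl):
    """True if code running at IPL `dipl` holds off the device's interrupt.

    An interrupt is only taken when its IPL is above the current
    processor IPL, so it is blocked when dipl >= the interrupt's IPL.
    """
    return dipl >= br_to_ipl(br_level)


if __name__ == "__main__":
    # Values from the article: RH780 strapped at BR6, UCB$B_DIPL = 15 hex.
    print(hex(br_to_ipl(6)))            # 0x16
    print(interrupt_blocked(6, 0x15))   # False: the interrupt gets through
    print(interrupt_blocked(5, 0x15))   # True: BR5 is what V5 expects
    print(interrupt_blocked(6, 0x16))   # True: the masking workaround
```

With the adapter at BR6 and the driver synchronizing at 15, the check returns
False, which is the situation the article describes.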



SOLUTION:

The BR level for the RH780 was set to its correct value.

An SPR has been submitted to flag the problem.




        **************************************************************
        *    COPYRIGHT (c) 1988 by Digital Equipment Corporation.    *
        *                    ALL RIGHTS RESERVED.                    *
        *     NO distribution except as provided under contract.     *
        **************************************************************

T.R	Title	User	Personal Name	Date	Lines
41.1	Subj: INTERRUPT STACK INVALID TROUBLESHOOTING	COMICS::ROBB		Thu Jul 20 1989 18:02	183
Interrupt Stack Invalid - troubleshooting notes
-----------------------------------------------

Sources:
--------

VAXNEWS 5 page 39: article by Ian Loughlin.
ESKAA   VAX 11/780 LOCAL CONSOLE  frame B6
ESOAB-1 VAX 11/780 PCS            page 3, frame G11 line 21639
                                          frame G16 line 24449

What is it and how is it handled?
---------------------------------

The 11/780 has five stacks: each process has its own User, Supervisor,
Executive and Kernel stacks; and the system has one Interrupt Stack to handle
system-wide events such as I/O and serious errors.

If any of the first three stacks become invalid, the error is handled by VMS,
and the result will be at worst a bugcheck. If the Kernel Stack is invalid,
the microcode forces an exception (abort) which is handled on the Interrupt
Stack at IPL 1F (i.e. same urgency as a machine check).

(Please note, incidentally, that if an error which would normally cause a
machine check occurs on a reference to the Kernel Stack, the exception will be
Kernel Stack Not Valid, NOT machine check. A bugcheck will happen and only a
bugcheck entry will appear in the error log, reducing the amount of hardware
register information available for diagnosis.)

By contrast, the Interrupt Stack is usually two pages in length, and is
flanked above and below by pages which are set up by VMS to be no-access to
any mode. This guarantees that an attempt to push to a full stack or pop from
an empty stack will cause an access violation, and the microcode is clever
enough to realise that the access violation has happened on an access to the
IS, whereupon it declares an ISI condition.

These violations can be due to software bugs or to runaway devices perhaps
using up the stack faster than it can be cleaned off. A third possibility is
that the ISP is in the "wild blue yonder", i.e. nowhere near where it should
be. This can be due to hardware errors such as picking a bit in the CPU or the
memory controller.
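
The three cases above (stack full, stack empty, ISP in the "wild blue yonder")
can be told apart with a simple range check. The Python sketch below is only an
illustration: the function name and result strings are hypothetical, the page
size is the VAX 512 bytes, and treating "within one page of the stack" as the
guard region is an assumption made for the example, not something taken from
the sources.

```python
PAGE_BYTES = 0x200  # a VAX page is 512 bytes


def classify_isp(isp, base, pages=2):
    """Classify an ISP value against the interrupt stack window.

    `base` is the ISP when the stack is empty (the high end, since the
    stack grows toward lower addresses). The stack covers `pages` pages
    below `base`. Assumption: one no-access guard page on each side.
    """
    limit = base - pages * PAGE_BYTES   # lowest valid ISP value
    if isp == base:
        return "empty"
    if isp == limit:
        return "full"
    if limit < isp < base:
        return "in use"
    if limit - PAGE_BYTES <= isp < limit:
        return "overflow"     # pushed past a full stack
    if base < isp <= base + PAGE_BYTES:
        return "underflow"    # popped past an empty stack
    return "wild"             # nowhere near the stack


if __name__ == "__main__":
    base = 0x80001400  # hypothetical stack base, for illustration only
    for isp in (0x80001400, 0x800013F0, 0x80001000, 0x80000FFC, 0x00000200):
        print("%08X %s" % (isp, classify_isp(isp, base)))
```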
ISI is one of the most serious and immediate error conditions recognised by
the 11/780 CPU. As such, it is handled by aborting the current instruction and
halting the CPU out of hand. The halt is detected by the LSI-11, which prints
out the familiar message:

    ?INT-STK INVLD
    HALTED AT XXXXXXXX

This action (immediate halt, with no attempt to handle the error) is taken
because serious errors are handled on the Kernel or Interrupt Stack, and
machine checks and "Kernel Stack Invalid" exceptions in particular are handled
on the IS. If the IS is no longer valid, then the outcome of such handling
becomes INDETERMINATE (as the handbooks say).

Other error conditions dealt with in the same way (instant microcode halt)
are:

    ?CPU DBLE-ERR HLT
    ?ILL I/E VECTOR
    ?NO USER WCS
    ?CHM ERR

In each case the microcode determines that it is not safe to attempt the
execution of even one more instruction, and halts after leaving the
corresponding code in ID register 2E (ID D.SV). The console software detects
that the CPU has halted, and uses the code from ID D.SV to index into a table
of messages so that it can print out the right one.

It is important to realize that absolutely no changes of the CPU state occur
up till now, and the internal registers hold complete information on the
immediate cause of the halt. However if the system reboots, all this will be
wiped out and no record whatever will be kept.

What happens next?
------------------

This depends purely on the state of the Auto-Restart switch. This is what
tells the console software whether or not to attempt to restart VMS after a
powerfail or a halt. In general, as soon as ISI problems appear, you should
arrange for the Auto-Restart switch to be turned off. This will ensure that
the machine state is preserved until you can look at it.

An alternative approach is to edit RESTAR.CMD to dump the desired registers
and memory locations. However this is far from ideal as it lacks flexibility.
If the console attempts to restart it does so by executing RESTAR.CMD. This
starts the ISP ROM at 20003004, which causes it to look for a valid Restart
Parameter Block in memory. If it finds one, it assumes memory is good and does
a warm restart by handing control back to VMS. In this case VMS immediately
performs a fatal bugcheck, as it refuses to carry on after an ISI failure. If
there is no RPB, a cold boot is done; so in either case the outcome is the
same: a cold boot restart.

What should you examine?
------------------------

Of course, ISI may be a symptom of a fairly solid fault, in which case the
micro and macro diagnostics are indicated. Overnight runs of UETP etc are
useful (RDC will do these for RD systems).

Read no further unless the problem is intermittent and flaky. In that case
start by switching OFF Auto-Restart, and persuading the customer to leave the
machine halted.

If you have a machine which has just had an ISI, still in the halted
condition, several things can be examined. It is assumed that the Interrupt
Stack is the default 2 pages in length; this can be altered with the SYSGEN
parameter INTSTKPAGES. The only case in which this should be necessary is if
there are a large number of devices or suspect user-written device drivers.

It is useful to start as follows:

    ^P
    >>> SHOW
    >>> SHOW V
    >>> E/ID 0/N:3F
    >>> E/G 0/N:F

These commands dump the entire CPU state. For clarity one might then type

    >>> E PSL

One would expect this to be 04XXXXXX. Also

    >>> E SP

This highlights the value of ISP which caused the halt. Assuming VMS was
running, SP should have a value in S0 space, probably 800XXXXX. It should be
longword aligned (end in 0, 4, 8 or C).

The base of the Interrupt Stack (the value of ISP when the stack is empty) is
held in the VMS location EXE$GL_INTSTK. This location is 80002D8C for VMS
V2.X, 80003478 for VMS V3.X.
Be careful to use the command

    >>> E/V/L 80002D8C

Subtract 400 (hex) from the value in this location to get an address NNNNNN00;
then you can dump out the genuine stack by typing

    >>> E/V/L NNNNNN00/N:FF

This dumps out two pages of virtual memory, ending with a location XXXXXXFC
(we hope). Examine one location further, and you will get a console error

    ?MEM-MAN FAULT,CODE=08

Examination of the stack contents will allow you to see if the stack was full,
nearly full, nearly empty etc: and if it was full, why? In this context, it is
useful to get the system up again and use SDA to dump out the I/O database:
often this will reveal which devices were being handled on the stack at the
time. You need to be able to decode exception frames and find your way around
SDA for this; RDC or Support can usually help if needed.

Get everything on hardcopy for reference. Obviously you can dump out other
things as well; e.g. get the KSP and ESP from their ID registers and dump them
out as well. Then after booting VMS you use SDA to examine the running system
or the latest dump file and establish all the addresses in the I/O database
(SDA> SHOW DEVICE command).

Lastly, if other approaches fail, it is worth considering powerfail problems.
If memory fails transiently and then recovers ISI may result; inspection of
the memory configuration registers after an ISI halt may show something. DCLO
and ACLO should be checked out routinely.

Ian Loughlin has some good stuff on this subject in VAXNEWS 5 (page 39) from
which most of this is stolen.
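
The PSL and SP expectations given in this note (PSL of the form 04XXXXXX, SP
in S0 space and longword aligned) are easy to turn into a checklist. The
Python sketch below is a hedged illustration only: the function name is
hypothetical, and the S0 range 80000000 to BFFFFFFF is the standard VAX system
region rather than a figure quoted in the note.

```python
S0_LOW, S0_HIGH = 0x80000000, 0xBFFFFFFF  # VAX system (S0) address region


def check_halt_state(psl, sp):
    """Return a list of complaints about the PSL and SP read at the halt.

    An empty list means both values look the way the note expects.
    """
    problems = []
    if (psl >> 24) != 0x04:
        problems.append("PSL is %08X, expected 04XXXXXX" % psl)
    if not (S0_LOW <= sp <= S0_HIGH):
        problems.append("SP %08X is not in S0 space" % sp)
    if sp & 3:
        problems.append("SP %08X is not longword aligned" % sp)
    return problems


if __name__ == "__main__":
    # Hypothetical values typed in from an E PSL / E SP console printout.
    for line in check_halt_state(0x041F0000, 0x7FFE1232):
        print(line)
```

Anything this reports is a hint that the ISP has gone wild rather than the
stack simply filling up.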
41.2	ps to .1	KERNEL::SCOTT	Eat,sleep,screw. That's a dogs life?	Sun Jul 23 1989 00:29	23
EXE$GL_INTSTK locations for:-

    V4 = 80003f28
    V5 = 8000842c

You may also wish to check EXE$GL_INTSTKLM to see how big the stack is so that
you can be sure of dumping the whole stack. Not all V4 systems use the default
of 2 pages - BUNTY is an example of a V4 system using 4 pages for the
interrupt stack. If we wanted to dump the stack from BUNTY using the procedure
in .1 we would have to change the "/N:FF" to "/N:1FF" to get all 4 pages. V5
has 4 pages for the interrupt stack as default.

EXE$GL_INTSTKLM locations:-

    V4 = 80002c60
    V5 = 80004514

If the number from this location is 400 away from EXE$GL_INTSTK then the stack
is 2 pages. 600 is 3 pages and so on...
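
The arithmetic in .1 and .2 (stack size from the two locations, then the start
address and /N: count for the dump) can be sketched in Python as below. The
function names and the sample values are hypothetical; the inputs are the
contents read from EXE$GL_INTSTK and EXE$GL_INTSTKLM, not the addresses of
those locations.

```python
PAGE_BYTES = 0x200  # a VAX page is 512 bytes


def stack_pages(intstk, intstklm):
    """Number of interrupt stack pages, from the base and limit values."""
    size = intstk - intstklm
    if size <= 0 or size % PAGE_BYTES:
        raise ValueError("base/limit values do not describe whole pages")
    return size // PAGE_BYTES


def dump_command(intstk, pages):
    """Console command that dumps the whole stack as longwords."""
    start = intstk - pages * PAGE_BYTES
    count = pages * PAGE_BYTES // 4 - 1   # /N: is the number of extra longwords
    return ">>> E/V/L %08X/N:%X" % (start, count)


if __name__ == "__main__":
    # Hypothetical contents of the two locations on a 4-page system.
    base, limit = 0x80001400, 0x80000C00
    pages = stack_pages(base, limit)
    print(pages)                        # 4
    print(dump_command(base, pages))    # >>> E/V/L 80000C00/N:1FF
```

For a 2-page stack the same function gives /N:FF, matching the command in .1.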