Interrupt Stack Invalid - troubleshooting notes
-----------------------------------------------
Sources:
--------
VAXNEWS 5 page 39: article by Ian Loughlin.
ESKAA VAX 11/780 LOCAL CONSOLE frame B6
ESOAB-1 VAX 11/780 PCS page 3, frame G11 line 21639 and frame G16 line 24449
What is it and how is it handled?
---------------------------------
The 11/780 has five stacks: each process has its own
User, Supervisor, Executive and Kernel stacks; and the system has one
Interrupt Stack to handle system-wide events such as I/O and serious errors.
If any of the first three stacks becomes invalid, the error is handled by
VMS, and the result will be at worst a bugcheck. If the Kernel Stack is
invalid, the microcode forces an exception (abort) which is handled on the
Interrupt Stack at IPL 1F (i.e. same urgency as a machine check). (Please
note, incidentally, that if an error which would normally cause a machine
check occurs on a reference to the Kernel Stack, the exception will be
Kernel Stack Not Valid, NOT machine check. A bugcheck will happen and
only a bugcheck entry will appear in the error log, reducing the amount
of hardware register information available for diagnosis.)
By contrast, the Interrupt Stack is usually two pages in
length, and is flanked above and below by pages which are set up by VMS
to be no-access to any mode. This guarantees that an attempt to push to
a full stack or pop from an empty stack will cause an access violation, and
the microcode is clever enough to realise that the access violation has
happened on an access to the IS, whereupon it declares an ISI condition.
These violations can be due to software bugs or to runaway devices perhaps
using up the stack faster than it can be cleaned off.
A third possibility is that the ISP is in the "wild blue
yonder", i.e. nowhere near where it should be. This can be due to hardware
errors such as a bit being picked up in the CPU or the memory controller.
ISI is one of the most serious and immediate error
conditions recognised by the 11/780 CPU. As such, it is handled by
aborting the current instruction and halting the CPU out of hand.
The halt is detected by the LSI-11, which prints out the familiar
message:
?INT-STK INVLD
HALTED AT XXXXXXXX
This action (immediate halt, with no attempt to handle
the error) is taken because serious errors are handled on the Kernel or
Interrupt Stack, and machine checks and "Kernel Stack Invalid" exceptions
in particular are handled on the IS. If the IS is no longer valid, then
the outcome of such handling becomes INDETERMINATE (as the handbooks say).
Other error conditions dealt with in the same way (instant
microcode halt) are:
?CPU DBLE-ERR HLT
?ILL I/E VECTOR
?NO USER WCS
?CHM ERR
In each case the microcode determines that it is not safe to
attempt the execution of even one more instruction, and halts after leaving
the corresponding code in ID register 2E (ID D.SV). The console software
detects that the CPU has halted, and uses the code from ID D.SV to index
into a table of messages so that it can print out the right one.
It is important to realize that the CPU state has not changed
at all up to this point, and the internal registers hold complete
information on the immediate cause of the halt. However, if the system reboots,
all this will be wiped out and no record whatever will be kept.
What happens next?
------------------
This depends purely on the state of the Auto-Restart switch.
This is what tells the console software whether or not to attempt to restart
VMS after a powerfail or a halt. In general, as soon as ISI problems appear,
you should arrange for the Auto-Restart switch to be turned off. This will
ensure that the machine state is preserved until you can look at it.
An alternative approach is to edit RESTAR.CMD to dump the
desired registers and memory locations. However, this is far from ideal as
it lacks flexibility.
If the console attempts to restart, it does so by executing
RESTAR.CMD. This starts the ISP ROM at 20003004, which looks
for a valid Restart Parameter Block (RPB) in memory. If it finds one, it assumes
memory is good and does a warm restart by handing control back to VMS. In
this case VMS immediately performs a fatal bugcheck, as it refuses to carry
on after an ISI failure. If there is no RPB, a cold boot is done; so in either
case the outcome is the same: a cold boot.
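The restart path described above can be summarised as a small decision
sketch. This is only an illustration of the prose; the function name and
the wording of the steps are invented here and are not console or VMS
identifiers.

```python
from typing import List


def restart_outcome(auto_restart_on: bool, rpb_valid: bool) -> List[str]:
    """List the events that follow an ISI halt, as described in these notes.

    Hypothetical helper: both arguments are assumptions standing in for the
    Auto-Restart switch position and for whether a valid RPB is found.
    """
    if not auto_restart_on:
        # Nothing is restarted, so the halted CPU state can be examined.
        return ["CPU stays halted; machine state preserved"]

    steps = [
        "console executes RESTAR.CMD",
        "restart ROM looks for a valid RPB in memory",
    ]
    if rpb_valid:
        steps.append("warm restart: control handed back to VMS")
        steps.append("VMS performs a fatal bugcheck")
    # With or without an RPB, the notes say the end result is a cold boot.
    steps.append("cold boot")
    return steps


if __name__ == "__main__":
    for line in restart_outcome(auto_restart_on=True, rpb_valid=True):
        print(line)
```

Note that only the Auto-Restart OFF branch leaves the registers intact.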
What should you examine?
------------------------
Of course, ISI may be a symptom of a fairly solid fault, in
which case the micro and macro diagnostics are indicated. Overnight runs of
UETP etc are useful (RDC will do these for RD systems). Read no further unless
the problem is intermittent and flaky. In that case start by switching OFF
Auto-Restart, and persuading the customer to leave the machine halted.
If you have a machine which has just had an ISI, still in
the halted condition, several things can be examined. It is assumed that the
Interrupt Stack is the default 2 pages in length; this can be altered with
the SYSGEN parameter INTSTKPAGES. The only case in which this should be
necessary is if there are a large number of devices or suspect user-written
device drivers.
It is useful to start as follows:
^P
>>> SHOW
>>> SHOW V
>>> E/ID 0/N:3F
>>> E/G 0/N:F
These commands dump the entire CPU state.
For clarity one might then type
>>> E PSL
One would expect this to be 04XXXXXX.
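For anyone who wants to pick the PSL value apart offline, the following is
a minimal sketch. The field positions (IS in bit 26, current mode in bits
25:24, previous mode in bits 23:22, IPL in bits 20:16) are taken from the
standard VAX PSL layout rather than from these notes, so check them
against the architecture handbook. The function names are made up for the
example.

```python
MODE_NAMES = ["kernel", "executive", "supervisor", "user"]


def decode_psl(psl: int) -> dict:
    """Split a 32-bit PSL value into the fields of interest after a halt."""
    return {
        "interrupt_stack": bool(psl & (1 << 26)),        # IS bit
        "current_mode": MODE_NAMES[(psl >> 24) & 0x3],   # CUR_MOD
        "previous_mode": MODE_NAMES[(psl >> 22) & 0x3],  # PRV_MOD
        "ipl": (psl >> 16) & 0x1F,                       # interrupt priority level
    }


def matches_expected_pattern(psl: int) -> bool:
    """True if the top byte is 04, i.e. the 04XXXXXX pattern mentioned above."""
    return (psl >> 24) & 0xFF == 0x04


if __name__ == "__main__":
    sample = 0x041F0000  # illustrative value only, not taken from a real halt
    print(decode_psl(sample), matches_expected_pattern(sample))
```

A top byte of 04 means the IS bit is set and the current mode is kernel,
which is what one expects when the CPU was running on the Interrupt Stack.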
Also
>>> E SP
This highlights the value of ISP which caused the halt.
Assuming VMS was running, SP should have a value in S0 space, probably
800XXXXX. It should be longword aligned (end in 0, 4, 8 or C).
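The two sanity checks on SP can be written down as a short sketch. The S0
range used here (80000000 to BFFFFFFF) is the standard VAX system region
rather than something stated in these notes, and the function name is
hypothetical.

```python
from typing import List

S0_LOW = 0x80000000   # start of system (S0) space on a VAX
S0_HIGH = 0xBFFFFFFF  # end of S0 space


def check_sp(sp: int) -> List[str]:
    """Return a list of complaints about a halted SP value (empty if it looks sane)."""
    problems = []
    if sp & 0x3:
        # A longword-aligned address ends in 0, 4, 8 or C.
        problems.append("not longword aligned")
    if not (S0_LOW <= sp <= S0_HIGH):
        problems.append("not in S0 space")
    return problems


if __name__ == "__main__":
    for value in (0x80001C40, 0x80001C42, 0x00001C40):  # illustrative values
        print(f"{value:08X}", check_sp(value) or "looks plausible")
```

A value that fails either check points towards the "wild blue yonder" case
rather than a simple overflow or underflow.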
The base of the Interrupt Stack (the value of ISP when
the stack is empty) is held in the VMS location EXE$GL_INTSTK. This
location is 80002D8C for VMS V2.X, 80003478 for VMS V3.X. Be careful to
include the /V qualifier so that the address is treated as virtual, e.g. for V2.X:
>>> E/V/L 80002D8C
Subtract 400 (hex) from the value in this location to get
an address NNNNNN00; then you can dump out the genuine stack by typing
>>> E/V/L NNNNNN00/N:FF
This dumps out two pages of virtual memory, ending with
a location XXXXXXFC (we hope). Examine one location further, and you
will get a console error
?MEM-MAN FAULT,CODE=08
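The arithmetic behind the dump command is easy to get wrong at the
console, so here is a minimal sketch of it. It assumes 512-byte pages and
the default two-page stack. The helper name and the sample base address
are invented for the example.

```python
PAGE_BYTES = 512      # VAX page size
LONGWORD_BYTES = 4


def stack_dump_command(base: int, pages: int = 2) -> str:
    """Build the console command that dumps the whole Interrupt Stack.

    base  -- contents of EXE$GL_INTSTK (the ISP value when the stack is empty)
    pages -- stack length in pages (INTSTKPAGES), default 2
    """
    size = pages * PAGE_BYTES            # 400 hex for two pages
    start = base - size                  # lowest address of the stack
    extra = size // LONGWORD_BYTES - 1   # /N: takes the count of additional locations
    return f">>> E/V/L {start:08X}/N:{extra:X}"


def last_longword(base: int) -> int:
    """Address of the last longword dumped; the next one is in the guard page."""
    return base - LONGWORD_BYTES


if __name__ == "__main__":
    base = 0x80001E00  # hypothetical value read from EXE$GL_INTSTK
    print(stack_dump_command(base))
    print(f"last longword at {last_longword(base):08X}")
```

If INTSTKPAGES has been changed, pass the new page count so that the
start address and the /N: count both follow.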
Examination of the stack contents will allow you to see
if the stack was full, nearly full, nearly empty etc: and if it was
full, why? In this context, it is useful to get the system up again and
use SDA to dump out the I/O database: often this will reveal which devices
were being handled on the stack at the time. You need to be able to decode
exception frames and find your way around SDA for this; RDC or Support
can usually help if needed. Get everything on hardcopy for reference.
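As a rough aid to the "full, nearly full, nearly empty" question, the
halted ISP can be compared with the stack limits. This is a sketch under
the same assumptions as the notes (two 512-byte pages, stack growing
downwards from the value in EXE$GL_INTSTK). The function name and status
strings are hypothetical.

```python
from typing import Tuple

PAGE_BYTES = 512


def stack_usage(isp: int, base: int, pages: int = 2) -> Tuple[str, int]:
    """Classify a halted ISP against the stack limits.

    Returns (status, bytes_in_use). bytes_in_use is base - isp, so it is
    negative if the pointer is above the base of the stack.
    """
    size = pages * PAGE_BYTES
    used = base - isp
    if used < 0:
        status = "above base (popped past empty)"
    elif used > size:
        status = "below limit (pushed past full)"
    elif used == 0:
        status = "empty"
    elif used == size:
        status = "full"
    else:
        status = "in range"
    return status, used


if __name__ == "__main__":
    base = 0x80001E00  # hypothetical EXE$GL_INTSTK contents
    for isp in (0x80001E00, 0x80001C40, 0x80001A00, 0x800019FC):
        status, used = stack_usage(isp, base)
        print(f"ISP={isp:08X} {status}, {used:X} hex bytes in use")
```

A pointer far outside both limits is more likely to be the corrupted-ISP
case than a genuine overflow or underflow.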
Obviously you can dump out other things as well; e.g.
get the KSP and ESP from their ID registers and dump them out as well.
Then after booting VMS you use SDA to examine the running system or the
latest dump file and establish all the addresses in the I/O database
(SDA> SHOW DEVICE command).
Lastly, if other approaches fail, it is worth considering
powerfail problems. If memory fails transiently and then recovers, ISI may
result; inspection of the memory configuration registers after an ISI halt
may show something. DCLO and ACLO should be checked out routinely. Ian
Loughlin has some good stuff on this subject in VAXNEWS 5 (page 39) from
which most of this is stolen.