Title: SABLE SYSTEM PUBLIC DISCUSSION
Moderator: COSMIC::PETERSON
Created: Mon Jan 11 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2614
Total number of notes: 10244
2600.0. "URGENT - Unexplained crashes on AlphaServer 2100 5/250 running OpenVMS V6.2-1H3" by ROBSON::WARNE () Mon May 19 1997 07:30
Customer has two AlphaServer 2100 5/250s running OpenVMS V6.2-1H3 in a cluster, connected to six SW300 cabinets
with HSZ40s, via KZPSA controllers (three in each Alpha). Each system also has a KZESC RAID controller.
A couple of weeks ago (when the system load was increased due to more users being brought online) they had a
problem when one of the Alphas crashed for no apparent reason. Nothing is written to the SYSDUMP file (though
DUMPFILE and DUMPBUG are set to one), no errors are logged, and all that's written to the console is the
following ...
HALTED CPU 0
KERNAL STACK NOT VALID HALT
PC = FFFFFFFF80029050
This happened again at the weekend, on the other cluster node! Unfortunately, they've got AUTO_ACTION set to HALT,
so we haven't got a post-crash dump either. I know this all but rules out a definitive answer to the problem, but
I'd be grateful if anyone could give me a pointer as to what the problem MIGHT be.
CPU and config details are as follows:
$ SHOW CPU/FUL
TSLV13, a AlphaServer 2100 5/250
Multiprocessing is DISABLED. Uniprocessing synchronization image loaded.
Minimum multiprocessing revision levels: CPU = 1
System Page Size = 8192
System Revision Code =
System Serial Number = ay52507111
Default CPU Capabilities:
QUORUM RUN
Default Process Capabilities:
QUORUM RUN
PRIMARY CPU = 00
CPU 00 is in RUN state
Current Process: _RTA2: PID = 206009E7
Serial Number:
Revision:
VAX floating point operations supported.
IEEE floating point operations and data types supported.
Processor is Primary Eligible.
PALCODE: Revision Code = 1.18
PALcode Compatibility = 1
Maximum Shared Processors = 4
Memory Space: Physical address = 00000000 00000000
Length = 0
Scratch Space: Physical address = 00000000 00000000
Length = 0
Capabilities of this CPU:
PRIMARY QUORUM RUN
Processes which can only execute on this CPU:
*** None ***
SDA> clue config
System Configuration:
---------------------
System Information:
System Type AlphaServer 2100 5/250 Primary CPU ID 00
Cycle Time 4.0 nsec (250 MHz) Pagesize 8192 Byte
Memory Configuration:
Cluster PFN Start PFN Count Range (MByte) Usage
#03 0 256 0.0 MB - 2.0 MB Console
#04 256 130815 2.0 MB - 1023.9 MB System
#05 131071 1 1023.9 MB - 1024.0 MB Console
Per-CPU Slot Processor Information:
CPU ID 00 CPU State rc,pa,pp,cv,pv,pmv,pl
CPU Type EV5 Pass 4 (21164) Halt PC 00000000 20000000
PAL Code 1.18-1 Halt PS 00000000 00001F00
CPU Revision .... Halt Code 00000000 00000000
Serial Number .......... Bootstrap or Powerfail
Console Vers V4.5-55
Adapter Configuration:
----------------------
TR Adapter Name (Address) Hose Bus Node Device Name HW-Id/SW
-- ---------------------- ---- ----------- ---- ---------------- --------
1 KA0905 (80D84080) 0 CBUS
0 KA0902_CPU 00000017
4 KA0902_MEM 00000018
5 KA0902_MEM 00000018
8 KA0902_IIO 00000019
2 PCI (80D84480) 0 PCI
EWA: 0 TULIP 00021011
PKA: 1 NCR53C810 00011000
2 MERCURY 04828086
PKB: 6 KZPSA 00081011
PKC: 7 KZPSA 00081011
PKD: 8 KZPSA 00081011
3 EISA (80D84B40) 1 EISA
0 012AA310
GQA: 2 CPQ3011 1130110E
FRA: 4 DEFEA_2 0230A310
DRA: 7 MLX0075 75009835
Adapter Configuration:
----------------------
TR Adapter Name (Address) Hose Bus Node Device Name HW-Id/SW
-- ---------------------- ---- ----------- ---- ---------------- --------
4 XBUS (80D85040) 0 XBUS
0 EISA_SYSTEM_BOAR 00000016
DVA: 1 AHA1742A_FLOPPY 504F4C46
LRA: 2 VTI82C106_PP 00000015
TTA: 3 NS16450 00016450
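As a sanity check on the memory figures in the CLUE CONFIG output above, the PFN start and count columns can be converted to megabyte ranges using the 8192-byte page size reported by SHOW CPU. The following is only a rough sketch: the function and variable names are made up for illustration, and the values are copied by hand from the listing.

```python
# Illustrative only: names below are hypothetical, values are copied from
# the SHOW CPU and CLUE CONFIG output quoted in this note.
PAGE_SIZE = 8192          # bytes, from "System Page Size = 8192"
BYTES_PER_MB = 1024 * 1024

# (cluster, start PFN, PFN count, usage)
CLUSTERS = [
    ("#03", 0, 256, "Console"),
    ("#04", 256, 130815, "System"),
    ("#05", 131071, 1, "Console"),
]


def pfn_range_mb(start_pfn, count, page_size=PAGE_SIZE):
    """Return the (start, end) of a PFN range in megabytes."""
    start = start_pfn * page_size / BYTES_PER_MB
    end = (start_pfn + count) * page_size / BYTES_PER_MB
    return start, end


def total_memory_mb(clusters, page_size=PAGE_SIZE):
    """Sum the page counts of all clusters and convert to megabytes."""
    pages = sum(count for _, _, count, _ in clusters)
    return pages * page_size / BYTES_PER_MB


if __name__ == "__main__":
    for name, start, count, usage in CLUSTERS:
        lo, hi = pfn_range_mb(start, count)
        print(f"{name}  {lo:.3f} MB - {hi:.3f} MB  {usage}")
    print(f"Total: {total_memory_mb(CLUSTERS):.3f} MB")
```

With these inputs the three clusters are contiguous and add up to 131072 pages, i.e. 1 GB, which agrees with the ranges shown in the listing.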
This system serves the customer's sites throughout the UK, so it's critical the problem is sorted ASAP.
Many thanks,
Chris Warne
T.R | Title | User | Personal Name | Date | Lines |
---|
2600.1 | Need crash dumpfile for more info | STAR::jacobi.zko.dec.com::jacobi | Paul A. Jacobi - OpenVMS Systems Group | Mon May 19 1997 14:22 | 13 |
|
I don't see any obvious problems.
Be sure the console environment variable AUTO_ACTION is set to
RESTART, so a crash dump will be generated the next time the problem
occurs.
>>>set auto_action restart
>>>init
-Paul
|
2600.2 | Crash dumpfile | ROBSON::WARNE | | Tue May 20 1997 04:42 | 9 |
| I was afraid you'd say that! As I say, AUTO_ACTION is currently set to HALT, and they don't want to shut down
their systems unless they absolutely have to. So it means it will have to crash twice more before I get any
answers - not good.
So, has nobody come across a similar problem, or got any idea what sort of area the problem might be stemming from? I
need something to take into a customer meeting, and "wait 'til it crashes, then set a console variable, and I
might be able to tell you something when it crashes next time ... " isn't really what I had in mind!
Chris
|