Title: | + OpenVMS Clusters - The best clusters in the world! + |
Notice: | This conference is COMPANY CONFIDENTIAL. See #1.3 |
Moderator: | PROXY::MOORE |
|
Created: | Fri Aug 26 1988 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 5320 |
Total number of notes: | 23384 |
5303.0. "2 nodes in OpenVMS 6.1 VAXcluster BRK_NON, 1 CLUEXIT" by CHOWDA::GLICKMAN (writing from Newport,RI) Mon May 05 1997 14:13
On Saturday evening at 20:59, one node in a VAXcluster running OpenVMS 6.1
was timed out on the other nodes (Virtual Circuit timeout). The last
message on the console for this node was at 20:35. The last message in
the error log was a time stamp at 20:47.
The following error log messages are from one of the other nodes in the
VAXcluster. At 20:59 the CI started giving these messages and at 21:03
this node also had a Virtual Circuit Timeout message on the other nodes.
On this node's console was this message:
%PAA0: Error Bit(s) Set - CNF/PMC/PSR 00000000/00000004/80000440
and that the port is reinitializing. Check the Error log.
Eventually the console showed that this VAX crashed with the CLUEXIT
error message. Unfortunately, no crash dump was saved.
Someone came in at 4:00 Sunday morning and did a SHOW CLUSTER.
Both of these machines were in the BRK_NON status.
Can anyone help me make some sense of what was going on with the information
I have provided? Is there any other information I can try to provide?
Why would one node have no crash information associated with it
and be in BRK_NON while the other one did? Are the two nodes' problems
somehow associated?
If I'm in the wrong notes conference please point me to the correct one.
Thanks.
V A X / V M S SYSTEM ERROR REPORT COMPILED 5-MAY-1997 10:25:00
PAGE 1.
******************************* ENTRY 8132. *******************************
ERROR SEQUENCE 2708. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 20:59:57.80 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:25:51
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMESSAGE KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
CIXCD SUB-SYSTEM, _VAXB$PAA0:
VIRTUAL CIRCUIT TIMEOUT
LOCAL STATION ADDRESS, 000000000007(X)
LOCAL SYSTEM ID, 0000000005C8(X)
REMOTE STATION ADDRESS, 00000000000E(X)
REMOTE SYSTEM ID, 000000000441(X)
UCB$B_ERTCNT 32
50. RETRIES REMAINING
UCB$B_ERTMAX 32
50. RETRIES ALLOWABLE
UCB$W_ERRCNT 0002
2. ERRORS THIS UNIT
PPD$B_PORT 00
REMOTE NODE # 0.
PPD$B_STATUS 00
PPD$B_OPC 00
UNKNOWN OPCODE
PPD$B_FLAGS 00
******************************* ENTRY 8133. *******************************
ERROR SEQUENCE 2709. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 21:00:06.19 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:25:59
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMESSAGE KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
CIXCD SUB-SYSTEM, _VAXB$PAB0:
VIRTUAL CIRCUIT TIMEOUT
LOCAL STATION ADDRESS, 000000000007(X)
LOCAL SYSTEM ID, 0000000005C8(X)
REMOTE STATION ADDRESS, 00000000000E(X)
REMOTE SYSTEM ID, 000000000441(X)
UCB$B_ERTCNT 32
50. RETRIES REMAINING
UCB$B_ERTMAX 32
50. RETRIES ALLOWABLE
UCB$W_ERRCNT 0001
1. ERRORS THIS UNIT
PPD$B_PORT 00
REMOTE NODE # 0.
PPD$B_STATUS 00
PPD$B_OPC 00
UNKNOWN OPCODE
PPD$B_FLAGS 00
******************************* ENTRY 8134. *******************************
ERROR SEQUENCE 2710. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 21:00:07.11 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:26:00
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMSCP KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
MESSAGE TYPE 000B
DATAGRAM FOR NON-EXISTING "UCB"
CLASS DRIVER 4B534944
/DISK/
CDDB$Q_CNTRLID 33700893
01280009
UNIQUE IDENTIFIER, 000933700893(X)
MASS STORAGE CONTROLLER
HSJ40
CDDB$B_SYSTEMID 10083DC2
4200
MSLG$L_CMD_REF 00000000
MSLG$W_SEQ_NUM 0018
SEQUENCE #24.
MSLG$B_FORMAT 00
CONTROLLER LOG
MSLG$B_FLAGS 00
UNRECOVERABLE ERROR
MSLG$W_EVENT 006A
CONTROLLER ERROR
INTERNAL DATA-STRUCTURE ERROR
MSLG$Q_CNT_ID 33700893
01280009
UNIQUE IDENTIFIER, 000933700893(X)
MASS STORAGE CONTROLLER
HSJ40
MSLG$B_CNT_SVR 27
CONTROLLER SOFTWARE VERSION #39.
MSLG$B_CNT_HVR 4C
CONTROLLER HARDWARE REVISION #76.
FIB DEPENDENT DATA
PORT DRIVER PACKET
INSTANCE 4007640A
COMPONENT ID = HOST INTERCONNECT
EVENT NUMBER = 07(X)
UNKNOWN EVENT
REPAIR ACTION = 64(X)
NR THRESHOLD = 0A(X)
NR CLASSIFICATION = SOFT
TEMPL 32
TDISIZE 10
EVENT TIME 048F3CF0
00000000
21248. HRS, 55. MINS, 12. SECS
PORT STATUS 0009
BFLGS = 0001(X)
Exception occurred
PORT A STATUS
NAK retry limit reached
PORT B STATUS
Sucessful transmit
HIS STATUS 000A
UNKNOWN HIS STATUS
ERROR ID 200D7930
HIS ADDRESS = 200D7930(X)
SRC 08
SRC NODE ADDRESS = 08(X)
DST 0E
DST NODE ADDRESS = 0E(X)
INTOPCD 00
OPCODE = 00(X)
RESERVED
VCSTATE 85
UNKNOWN VCSTATE
PPD OPCODE 0000
START
******************************* ENTRY 8135. *******************************
ERROR SEQUENCE 2711. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 21:00:11.38 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:26:04
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMSCP KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
MESSAGE TYPE 000B
DATAGRAM FOR NON-EXISTING "UCB"
CLASS DRIVER 4B534944
/DISK/
CDDB$Q_CNTRLID 41003001
01280009
UNIQUE IDENTIFIER, 000941003001(X)
MASS STORAGE CONTROLLER
HSJ40
CDDB$B_SYSTEMID 100A3DC4
4200
MSLG$L_CMD_REF 00000000
MSLG$W_SEQ_NUM 0006
SEQUENCE #6.
MSLG$B_FORMAT 00
CONTROLLER LOG
MSLG$B_FLAGS 00
UNRECOVERABLE ERROR
MSLG$W_EVENT 006A
CONTROLLER ERROR
INTERNAL DATA-STRUCTURE ERROR
MSLG$Q_CNT_ID 41003001
01280009
UNIQUE IDENTIFIER, 000941003001(X)
MASS STORAGE CONTROLLER
HSJ40
MSLG$B_CNT_SVR 27
CONTROLLER SOFTWARE VERSION #39.
MSLG$B_CNT_HVR 4A
CONTROLLER HARDWARE REVISION #74.
FIB DEPENDENT DATA
PORT DRIVER PACKET
INSTANCE 4007640A
COMPONENT ID = HOST INTERCONNECT
EVENT NUMBER = 07(X)
UNKNOWN EVENT
REPAIR ACTION = 64(X)
NR THRESHOLD = 0A(X)
NR CLASSIFICATION = SOFT
TEMPL 32
TDISIZE 10
EVENT TIME 023ACF09
00000000
10391. HRS, 15. MINS, 21. SECS
PORT STATUS 0009
BFLGS = 0001(X)
Exception occurred
PORT A STATUS
NAK retry limit reached
PORT B STATUS
Sucessful transmit
HIS STATUS 000A
UNKNOWN HIS STATUS
ERROR ID 200D7930
HIS ADDRESS = 200D7930(X)
SRC 0A
SRC NODE ADDRESS = 0A(X)
DST 0E
DST NODE ADDRESS = 0E(X)
INTOPCD 00
OPCODE = 00(X)
RESERVED
VCSTATE 85
UNKNOWN VCSTATE
PPD OPCODE 0000
START
******************************* ENTRY 8136. *******************************
ERROR SEQUENCE 2712. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 21:00:34.40 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:26:27
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMSCP KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
MESSAGE TYPE 000B
DATAGRAM FOR NON-EXISTING "UCB"
CLASS DRIVER 4B534944
/DISK/
CDDB$Q_CNTRLID 41003165
01280009
UNIQUE IDENTIFIER, 000941003165(X)
MASS STORAGE CONTROLLER
HSJ40
CDDB$B_SYSTEMID 100921C2
4200
MSLG$L_CMD_REF 00000000
MSLG$W_SEQ_NUM 0085
SEQUENCE #133.
MSLG$B_FORMAT 00
CONTROLLER LOG
MSLG$B_FLAGS 00
UNRECOVERABLE ERROR
MSLG$W_EVENT 006A
CONTROLLER ERROR
INTERNAL DATA-STRUCTURE ERROR
MSLG$Q_CNT_ID 41003165
01280009
UNIQUE IDENTIFIER, 000941003165(X)
MASS STORAGE CONTROLLER
HSJ40
MSLG$B_CNT_SVR 27
CONTROLLER SOFTWARE VERSION #39.
MSLG$B_CNT_HVR 4A
CONTROLLER HARDWARE REVISION #74.
FIB DEPENDENT DATA
PORT DRIVER PACKET
INSTANCE 4007640A
COMPONENT ID = HOST INTERCONNECT
EVENT NUMBER = 07(X)
UNKNOWN EVENT
REPAIR ACTION = 64(X)
NR THRESHOLD = 0A(X)
NR CLASSIFICATION = SOFT
TEMPL 32
TDISIZE 10
EVENT TIME 0206BE90
00000000
9443. HRS, 27. MINS, 12. SECS
PORT STATUS 0009
BFLGS = 0001(X)
Exception occurred
PORT A STATUS
NAK retry limit reached
PORT B STATUS
Sucessful transmit
HIS STATUS 000A
UNKNOWN HIS STATUS
ERROR ID 200D7930
HIS ADDRESS = 200D7930(X)
SRC 09
SRC NODE ADDRESS = 09(X)
DST 0E
DST NODE ADDRESS = 0E(X)
INTOPCD 00
OPCODE = 00(X)
RESERVED
VCSTATE 85
UNKNOWN VCSTATE
PPD OPCODE 0000
START
******************************* ENTRY 8137. *******************************
ERROR SEQUENCE 2713. LOGGED ON: SID 17000201
DATE/TIME 3-MAY-1997 21:00:51.54 SYS_TYPE 01410201
SYSTEM UPTIME: 10 DAYS 05:26:44
SCS NODE: VAXB VAX/VMS V6.1
ERL$LOGMSCP KA7AA-AA CPU FW REV# 1. CONSOLE FW REV# 4.1
MESSAGE TYPE 000B
DATAGRAM FOR NON-EXISTING "UCB"
CLASS DRIVER 4B534944
/DISK/
CDDB$Q_CNTRLID 40702498
01280009
UNIQUE IDENTIFIER, 000940702498(X)
MASS STORAGE CONTROLLER
HSJ40
CDDB$B_SYSTEMID 100621C4
4200
MSLG$L_CMD_REF 00000000
MSLG$W_SEQ_NUM 0027
SEQUENCE #39.
MSLG$B_FORMAT 00
CONTROLLER LOG
MSLG$B_FLAGS 01
SEQUENCE NUMBER RESET
UNRECOVERABLE ERROR
MSLG$W_EVENT 006A
CONTROLLER ERROR
INTERNAL DATA-STRUCTURE ERROR
MSLG$Q_CNT_ID 40702498
01280009
UNIQUE IDENTIFIER, 000940702498(X)
MASS STORAGE CONTROLLER
HSJ40
MSLG$B_CNT_SVR 27
CONTROLLER SOFTWARE VERSION #39.
MSLG$B_CNT_HVR 49
CONTROLLER HARDWARE REVISION #73.
FIB DEPENDENT DATA
PORT DRIVER PACKET
INSTANCE 4007640A
COMPONENT ID = HOST INTERCONNECT
EVENT NUMBER = 07(X)
UNKNOWN EVENT
REPAIR ACTION = 64(X)
NR THRESHOLD = 0A(X)
NR CLASSIFICATION = SOFT
TEMPL 32
TDISIZE 10
EVENT TIME 02060D05
00000000
9430. HRS, 49. MINS, 41. SECS
PORT STATUS 0009
BFLGS = 0001(X)
Exception occurred
PORT A STATUS
NAK retry limit reached
PORT B STATUS
Sucessful transmit
HIS STATUS 000A
UNKNOWN HIS STATUS
ERROR ID 200D7930
HIS ADDRESS = 200D7930(X)
SRC 06
SRC NODE ADDRESS = 06(X)
DST 0E
DST NODE ADDRESS = 0E(X)
INTOPCD 00
OPCODE = 00(X)
RESERVED
VCSTATE 85
UNKNOWN VCSTATE
PPD OPCODE 0000
START
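A note on the EVENT TIME fields in the HSJ40 entries above: the four raw longwords and their translated hours/minutes/seconds figures are consistent with the longword being a plain count of seconds. What the count is measured from (controller power-on time seems likely) is an assumption, not something the log states. A minimal Python sketch of that arithmetic, with a helper name made up for illustration:

```python
def decode_event_time(hex_seconds: str) -> tuple[int, int, int]:
    """Split a hex count of seconds into (hours, minutes, seconds).

    Assumes the EVENT TIME longword is a simple seconds counter,
    which matches the four entries in this log.
    """
    total = int(hex_seconds, 16)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


# The four EVENT TIME longwords from entries 8134 through 8137.
for raw in ("048F3CF0", "023ACF09", "0206BE90", "02060D05"):
    hours, minutes, seconds = decode_event_time(raw)
    print(f"{raw} -> {hours}. HRS, {minutes}. MINS, {seconds}. SECS")
```

The decoded values differ widely between controllers (roughly 9430 to 21248 hours), so they cannot be lined up against the host's 3-MAY-1997 timestamps without knowing each controller's start time.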
T.R | Title | User | Personal Name | Date | Lines |
---|
5303.1 | Call Hardware Support To Look At This Box... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Mon May 05 1997 14:36 | 18 |
|
Please elevate this through formal channels, and get the organization's
hardware service organization in to look at this configuration.
And in particular, I'd check the firmware revision presently loaded
into the HSJ40 controller at CI node E, and I'd look for any HSJ-related
or storage-related errors or oddities, and for any HSJ-related reboots.
I'd also check the revision of the CIXCD, and for any associated updates.
And I'd look for any "unusual" XMI or BI widgets that might be around...
Also see SPEZKO::CLUSTER 4431.*, 5218.2, and -- if you've got a CIPCA in
the mix -- 4871.*.
Also consider an upgrade to a more recent OpenVMS version.
And see SSDEVO::HSJ40_PRODUCT note 385.*, for more information on the
`DATAGRAM FOR NON-EXISTING "UCB"' errors.
|
5303.2 | MTE, tuning problems - also check cluster parameters | CSC32::B_HIBBERT | When in doubt, PANIC | Wed May 07 1997 11:25 | 21 |
| The node that was showing the port "error bits set" messages was logging
maintenance timer expired (MTE) errors. This error indicates the system was
hanging at or above IPL 8 for an extended period. This is commonly
caused by system resource problems such as running out of non-paged
pool or page file space. Check the system resources on this node and
look for shortages. If you have DECps on the system, use it; it may
help identify resource problems.
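For anyone wanting to look at the console message quoted in .0 (CNF/PMC/PSR 00000000/00000004/80000440) before checking the CIXCD documentation, here is a small Python sketch that only lists which bit positions are set in each register. It deliberately does not name the bits: which PSR bit corresponds to the maintenance timer expiring has to come from the adapter documentation, and the function names here are made up for illustration.

```python
def set_bits(value: int, width: int = 32) -> list[int]:
    """Return the positions of the set bits, highest first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


def parse_port_registers(message: str) -> dict[str, int]:
    """Pair the register names with their hex values from the console line.

    Assumes the line ends with 'NAME/NAME/NAME HEX/HEX/HEX' as in the
    message quoted in .0.
    """
    tokens = message.split()
    names = tokens[-2].split("/")
    values = [int(field, 16) for field in tokens[-1].split("/")]
    return dict(zip(names, values))


console_line = "%PAA0: Error Bit(s) Set - CNF/PMC/PSR 00000000/00000004/80000440"
for name, value in parse_port_registers(console_line).items():
    print(f"{name} = {value:08X}, bits set: {set_bits(value)}")
```

For this message the sketch reports no bits set in CNF, bit 2 in PMC, and bits 31, 10, and 6 in PSR.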
I think you said that one of the other systems actually had the
CLUEXIT. If so, check your cluster parameters. Make sure that
parameters such as RECNXINTERVAL and EXPECTED_VOTES are set consistently
throughout the cluster. The node that gets the MTE errors is commonly
the node that gets the CLUEXIT unless the cluster parameters are set
strangely.
The HSJ errors are just virtual circuit closures to the CI node E. I
suspect this is either the system with the MTE or the system that
crashed.
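The pattern is easy to confirm from the log in .0 by tallying the SRC and DST node addresses in the four HSJ40 entries. A minimal Python sketch, assuming the address lines look exactly like the ones in that log (the function name is made up for illustration):

```python
import re
from collections import Counter

# Matches lines such as "SRC NODE ADDRESS = 08(X)" from the error log.
ADDRESS_LINE = re.compile(r"^(SRC|DST) NODE ADDRESS = ([0-9A-F]{2})\(X\)$")


def tally_ci_addresses(log_text: str) -> dict[str, Counter]:
    """Count how often each CI node address appears as source and as destination."""
    counts = {"SRC": Counter(), "DST": Counter()}
    for line in log_text.splitlines():
        match = ADDRESS_LINE.match(line.strip())
        if match:
            role, address = match.groups()
            counts[role][address] += 1
    return counts


# Address lines copied from entries 8134 through 8137 in .0.
sample = """\
SRC NODE ADDRESS = 08(X)
DST NODE ADDRESS = 0E(X)
SRC NODE ADDRESS = 0A(X)
DST NODE ADDRESS = 0E(X)
SRC NODE ADDRESS = 09(X)
DST NODE ADDRESS = 0E(X)
SRC NODE ADDRESS = 06(X)
DST NODE ADDRESS = 0E(X)
"""

counts = tally_ci_addresses(sample)
print("sources:", sorted(counts["SRC"]))
print("destinations:", dict(counts["DST"]))
```

Four different controllers (CI nodes 06, 08, 09, and 0A) each report a problem talking to the same destination, 0E, which is also the remote station address in the two virtual circuit timeout entries logged by VAXB.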
Brian Hibbert
|