Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5303.0. "2 nodes in OpenVMS 6.1 VAXcluster BRK_NON, 1 CLUEXIT" by CHOWDA::GLICKMAN (writing from Newport,RI) Mon May 05 1997 14:13

On Saturday evening at 20:59, one node in a VAXcluster running OpenVMS 6.1
was timed out by the other nodes (Virtual Circuit Timeout).  The last
message on this node's console was at 20:35.  The last entry in its
error log was a time stamp at 20:47.

The following error log messages are from one of the other nodes in the
VAXcluster.  At 20:59 the CI on this second node started logging these
messages, and at 21:03 the other nodes logged a Virtual Circuit Timeout
for this node as well.

This node's console showed this message:

 %PAA0: Error Bit(s) Set - CNF/PMC/PSR 00000000/00000004/80000440
 along with a notice that the port was reinitializing and to check the
 error log.

Eventually the console shows that this VAX crashed with the CLUEXIT
error message.  Unfortunately, no crash dump was saved.

Someone came in at 04:00 Sunday morning and did a SHOW CLUSTER.
Both of these machines were in the BRK_NON state.

Can anyone help me make some sense of what was going on with the information
I have provided?  Is there any other information I can try to provide?

Why would one node be in BRK_NON with no crash information at all,
while the other one logged a CLUEXIT?  Are the two nodes' problems
somehow related?

If I'm in the wrong notes conference, please point me to the correct one.
Thanks.
                         

 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   1.

 ******************************* ENTRY    8132. *******************************
 ERROR SEQUENCE 2708.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 20:59:57.80                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:25:51
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMESSAGE KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

 CIXCD SUB-SYSTEM, _VAXB$PAA0:

       VIRTUAL CIRCUIT TIMEOUT

       LOCAL STATION ADDRESS, 000000000007(X)
       LOCAL SYSTEM ID, 0000000005C8(X)

       REMOTE STATION ADDRESS, 00000000000E(X)
       REMOTE SYSTEM ID, 000000000441(X)

       UCB$B_ERTCNT          32
                                       50. RETRIES REMAINING
       UCB$B_ERTMAX          32
                                       50. RETRIES ALLOWABLE
       UCB$W_ERRCNT        0002
                                       2. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00

 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   2.

 ******************************* ENTRY    8133. *******************************
 ERROR SEQUENCE 2709.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 21:00:06.19                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:25:59
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMESSAGE KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

 CIXCD SUB-SYSTEM, _VAXB$PAB0:

       VIRTUAL CIRCUIT TIMEOUT

       LOCAL STATION ADDRESS, 000000000007(X)
       LOCAL SYSTEM ID, 0000000005C8(X)

       REMOTE STATION ADDRESS, 00000000000E(X)
       REMOTE SYSTEM ID, 000000000441(X)

       UCB$B_ERTCNT          32
                                       50. RETRIES REMAINING
       UCB$B_ERTMAX          32
                                       50. RETRIES ALLOWABLE
       UCB$W_ERRCNT        0001
                                       1. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00

 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   3.

 ******************************* ENTRY    8134. *******************************
 ERROR SEQUENCE 2710.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 21:00:07.11                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:26:00
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMSCP KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

       MESSAGE TYPE        000B
                                       DATAGRAM FOR NON-EXISTING "UCB"
       CLASS DRIVER    4B534944
                                       /DISK/
       CDDB$Q_CNTRLID  33700893
                       01280009
                                       UNIQUE IDENTIFIER, 000933700893(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       CDDB$B_SYSTEMID 10083DC2
                           4200
       MSLG$L_CMD_REF  00000000
       MSLG$W_SEQ_NUM      0018
                                       SEQUENCE #24.
       MSLG$B_FORMAT         00
                                       CONTROLLER LOG
       MSLG$B_FLAGS          00
                                       UNRECOVERABLE ERROR
       MSLG$W_EVENT        006A
                                       CONTROLLER ERROR
                                       INTERNAL DATA-STRUCTURE ERROR
       MSLG$Q_CNT_ID   33700893
                       01280009
                                       UNIQUE IDENTIFIER, 000933700893(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       MSLG$B_CNT_SVR        27
                                       CONTROLLER SOFTWARE VERSION #39.
       MSLG$B_CNT_HVR        4C
                                       CONTROLLER HARDWARE REVISION #76.

 FIB DEPENDENT DATA

 PORT DRIVER PACKET

       INSTANCE        4007640A
                                       COMPONENT ID = HOST INTERCONNECT

                                       EVENT NUMBER = 07(X)
                                       UNKNOWN EVENT
                                        

                                       REPAIR ACTION = 64(X)

                                       NR THRESHOLD = 0A(X)
                                       NR CLASSIFICATION = SOFT

       TEMPL                 32

       TDISIZE               10
       EVENT TIME      048F3CF0
                       00000000
                                       21248. HRS, 55. MINS, 12. SECS
       PORT STATUS         0009
                                       BFLGS = 0001(X)
                                       Exception occurred

                                       PORT A STATUS
                                       NAK retry limit reached

                                       PORT B STATUS
                                       Sucessful transmit

       HIS STATUS          000A
                                       UNKNOWN HIS STATUS
                                        

       ERROR ID        200D7930
                                       HIS ADDRESS = 200D7930(X)

       SRC                   08
                                       SRC NODE ADDRESS = 08(X)

       DST                   0E
                                       DST NODE ADDRESS = 0E(X)

       INTOPCD               00
                                       OPCODE = 00(X)
                                       RESERVED

       VCSTATE               85
                                       UNKNOWN VCSTATE

       PPD OPCODE          0000
                                       START


 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   5.

 ******************************* ENTRY    8135. *******************************
 ERROR SEQUENCE 2711.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 21:00:11.38                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:26:04
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMSCP KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

       MESSAGE TYPE        000B
                                       DATAGRAM FOR NON-EXISTING "UCB"
       CLASS DRIVER    4B534944
                                       /DISK/
       CDDB$Q_CNTRLID  41003001
                       01280009
                                       UNIQUE IDENTIFIER, 000941003001(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       CDDB$B_SYSTEMID 100A3DC4
                           4200
       MSLG$L_CMD_REF  00000000
       MSLG$W_SEQ_NUM      0006
                                       SEQUENCE #6.
       MSLG$B_FORMAT         00
                                       CONTROLLER LOG
       MSLG$B_FLAGS          00
                                       UNRECOVERABLE ERROR
       MSLG$W_EVENT        006A
                                       CONTROLLER ERROR
                                       INTERNAL DATA-STRUCTURE ERROR
       MSLG$Q_CNT_ID   41003001
                       01280009
                                       UNIQUE IDENTIFIER, 000941003001(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       MSLG$B_CNT_SVR        27
                                       CONTROLLER SOFTWARE VERSION #39.
       MSLG$B_CNT_HVR        4A
                                       CONTROLLER HARDWARE REVISION #74.

 FIB DEPENDENT DATA

 PORT DRIVER PACKET

       INSTANCE        4007640A
                                       COMPONENT ID = HOST INTERCONNECT

                                       EVENT NUMBER = 07(X)
                                       UNKNOWN EVENT
                                        

                                       REPAIR ACTION = 64(X)

                                       NR THRESHOLD = 0A(X)
                                       NR CLASSIFICATION = SOFT

       TEMPL                 32

       TDISIZE               10
       EVENT TIME      023ACF09
                       00000000
                                       10391. HRS, 15. MINS, 21. SECS
       PORT STATUS         0009
                                       BFLGS = 0001(X)
                                       Exception occurred

                                       PORT A STATUS
                                       NAK retry limit reached

                                       PORT B STATUS
                                       Sucessful transmit

       HIS STATUS          000A
                                       UNKNOWN HIS STATUS
                                        

       ERROR ID        200D7930
                                       HIS ADDRESS = 200D7930(X)

       SRC                   0A
                                       SRC NODE ADDRESS = 0A(X)

       DST                   0E
                                       DST NODE ADDRESS = 0E(X)

       INTOPCD               00
                                       OPCODE = 00(X)
                                       RESERVED

       VCSTATE               85
                                       UNKNOWN VCSTATE

       PPD OPCODE          0000
                                       START


 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   7.

 ******************************* ENTRY    8136. *******************************
 ERROR SEQUENCE 2712.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 21:00:34.40                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:26:27
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMSCP KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

       MESSAGE TYPE        000B
                                       DATAGRAM FOR NON-EXISTING "UCB"
       CLASS DRIVER    4B534944
                                       /DISK/
       CDDB$Q_CNTRLID  41003165
                       01280009
                                       UNIQUE IDENTIFIER, 000941003165(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       CDDB$B_SYSTEMID 100921C2
                           4200
       MSLG$L_CMD_REF  00000000
       MSLG$W_SEQ_NUM      0085
                                       SEQUENCE #133.
       MSLG$B_FORMAT         00
                                       CONTROLLER LOG
       MSLG$B_FLAGS          00
                                       UNRECOVERABLE ERROR
       MSLG$W_EVENT        006A
                                       CONTROLLER ERROR
                                       INTERNAL DATA-STRUCTURE ERROR
       MSLG$Q_CNT_ID   41003165
                       01280009
                                       UNIQUE IDENTIFIER, 000941003165(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       MSLG$B_CNT_SVR        27
                                       CONTROLLER SOFTWARE VERSION #39.
       MSLG$B_CNT_HVR        4A
                                       CONTROLLER HARDWARE REVISION #74.

 FIB DEPENDENT DATA

 PORT DRIVER PACKET

       INSTANCE        4007640A
                                       COMPONENT ID = HOST INTERCONNECT

                                       EVENT NUMBER = 07(X)
                                       UNKNOWN EVENT
                                        

                                       REPAIR ACTION = 64(X)

                                       NR THRESHOLD = 0A(X)
                                       NR CLASSIFICATION = SOFT

       TEMPL                 32

       TDISIZE               10
       EVENT TIME      0206BE90
                       00000000
                                       9443. HRS, 27. MINS, 12. SECS
       PORT STATUS         0009
                                       BFLGS = 0001(X)
                                       Exception occurred

                                       PORT A STATUS
                                       NAK retry limit reached

                                       PORT B STATUS
                                       Sucessful transmit

       HIS STATUS          000A
                                       UNKNOWN HIS STATUS
                                        

       ERROR ID        200D7930
                                       HIS ADDRESS = 200D7930(X)

       SRC                   09
                                       SRC NODE ADDRESS = 09(X)

       DST                   0E
                                       DST NODE ADDRESS = 0E(X)

       INTOPCD               00
                                       OPCODE = 00(X)
                                       RESERVED

       VCSTATE               85
                                       UNKNOWN VCSTATE

       PPD OPCODE          0000
                                       START


 V A X / V M S        SYSTEM ERROR REPORT         COMPILED  5-MAY-1997 10:25:00
                                                                      PAGE   9.

 ******************************* ENTRY    8137. *******************************
 ERROR SEQUENCE 2713.                            LOGGED ON:        SID 17000201
 DATE/TIME  3-MAY-1997 21:00:51.54                            SYS_TYPE 01410201
 SYSTEM UPTIME: 10 DAYS 05:26:44
 SCS NODE: VAXB                                                VAX/VMS V6.1

 ERL$LOGMSCP KA7AA-AA  CPU FW REV# 1.  CONSOLE FW REV# 4.1

       MESSAGE TYPE        000B
                                       DATAGRAM FOR NON-EXISTING "UCB"
       CLASS DRIVER    4B534944
                                       /DISK/
       CDDB$Q_CNTRLID  40702498
                       01280009
                                       UNIQUE IDENTIFIER, 000940702498(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       CDDB$B_SYSTEMID 100621C4
                           4200
       MSLG$L_CMD_REF  00000000
       MSLG$W_SEQ_NUM      0027
                                       SEQUENCE #39.
       MSLG$B_FORMAT         00
                                       CONTROLLER LOG
       MSLG$B_FLAGS          01
                                       SEQUENCE NUMBER RESET
                                       UNRECOVERABLE ERROR
       MSLG$W_EVENT        006A
                                       CONTROLLER ERROR
                                       INTERNAL DATA-STRUCTURE ERROR
       MSLG$Q_CNT_ID   40702498
                       01280009
                                       UNIQUE IDENTIFIER, 000940702498(X)
                                       MASS STORAGE CONTROLLER
                                       HSJ40
       MSLG$B_CNT_SVR        27
                                       CONTROLLER SOFTWARE VERSION #39.
       MSLG$B_CNT_HVR        49
                                       CONTROLLER HARDWARE REVISION #73.

 FIB DEPENDENT DATA

 PORT DRIVER PACKET

       INSTANCE        4007640A
                                       COMPONENT ID = HOST INTERCONNECT

                                       EVENT NUMBER = 07(X)
                                       UNKNOWN EVENT
                                        

                                       REPAIR ACTION = 64(X)

                                       NR THRESHOLD = 0A(X)
                                       NR CLASSIFICATION = SOFT


       TEMPL                 32
       TDISIZE               10
       EVENT TIME      02060D05
                       00000000
                                       9430. HRS, 49. MINS, 41. SECS
       PORT STATUS         0009
                                       BFLGS = 0001(X)
                                       Exception occurred

                                       PORT A STATUS
                                       NAK retry limit reached

                                       PORT B STATUS
                                       Sucessful transmit

       HIS STATUS          000A
                                       UNKNOWN HIS STATUS
                                        

       ERROR ID        200D7930
                                       HIS ADDRESS = 200D7930(X)

       SRC                   06
                                       SRC NODE ADDRESS = 06(X)

       DST                   0E
                                       DST NODE ADDRESS = 0E(X)

       INTOPCD               00
                                       OPCODE = 00(X)
                                       RESERVED

       VCSTATE               85
                                       UNKNOWN VCSTATE

       PPD OPCODE          0000
                                       START
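
The hex fields in the entries above can be cross-checked by hand. What
follows is only a minimal Python sketch, not the output of any DEC tool.
It assumes the INSTANCE longword splits into four bytes in the order the
formatter prints them (component ID, event number, repair action, NR
threshold); that layout is inferred from the decoded lines above, and the
function name split_instance is made up for illustration.

```python
def split_instance(instance: int) -> dict:
    """Split a 32-bit instance code into four byte-wide fields.

    Assumption: byte order is inferred from the formatted log above
    (4007640A -> event 07, repair action 64, NR threshold 0A).
    """
    return {
        "component_id": (instance >> 24) & 0xFF,
        "event_number": (instance >> 16) & 0xFF,
        "repair_action": (instance >> 8) & 0xFF,
        "nr_threshold": instance & 0xFF,
    }


fields = split_instance(0x4007640A)
for name, value in fields.items():
    # Print in the same "NN(X)" hex style the error report uses.
    print(f"{name:14s} = {value:02X}(X)")

# UCB$B_ERTCNT is logged as hex 32; the report shows it as 50. retries.
retries = int("32", 16)
print(f"retries remaining = {retries}.")
```

The same instance value appears in all four HSJ40 entries; only the SRC
node address (08, 0A, 09, 06) changes, while DST stays at 0E.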

5303.1. "Call Hardware Support To Look At This Box..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Mon May 05 1997 14:36, 18 lines

   Please elevate this through formal channels, and get the hardware
   service organization in to look at this configuration.

   And in particular, I'd check the firmware revision presently loaded
   into the HSJ40 controller at CI node E, and I'd look for any HSJ-related
   or storage-related errors or oddities, and for any HSJ-related reboots.

   I'd also check the revision of the CIXCD, and for any associated updates.
   And I'd look for any "unusual" XMI or BI widgets that might be around...

   Also see SPEZKO::CLUSTER 4431.*, 5218.2, and -- if you've got a CIPCA in
   the mix -- 4871.*.

   Also consider an upgrade to a more recent OpenVMS version.

   And see SSDEVO::HSJ40_PRODUCT note 385.*, for more information on the
   `DATAGRAM FOR NON-EXISTING "UCB"' errors.

5303.2. "MTE, tuning problems - also check cluster parameters" by CSC32::B_HIBBERT (When in doubt, PANIC) Wed May 07 1997 11:25, 21 lines

    The node that was showing the port error bits set messages was logging
    maintenance timer expired (MTE) errors.  This error indicates the system
    was hanging at or above IPL 8 for an extended period.  This is commonly
    caused by system resource problems such as running out of non-paged
    pool or page file space.  Check the system resources on this node and
    look for shortages.  If you have DECps on the system, use it; it may
    help identify resource problems.
    
    I think you said that one of the other systems actually had the
    CLUEXIT.  If so, check your cluster parameters.  Make sure that such
    things as RECNXINTERVAL and EXPECTED_VOTES are set consistently
    throughout the cluster.  The node that gets the MTE errors is commonly
    the node that gets the CLUEXIT unless the cluster parameters are set
    strangely.
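
    As a rough illustration of that consistency check, here is a small
    Python sketch.  Everything in it is an assumption for the example: the
    node names VAXA and VAXC and all of the parameter values are made up,
    and the values would really have to be collected by hand from each
    node (for instance from SYSGEN) before being compared.

```python
from collections import defaultdict


def find_mismatches(params_by_node):
    """Return {parameter: {value: [nodes]}} for every parameter
    whose value is not identical on all nodes."""
    seen = defaultdict(lambda: defaultdict(list))
    for node, params in params_by_node.items():
        for name, value in params.items():
            seen[name][value].append(node)
    return {
        name: dict(values)
        for name, values in seen.items()
        if len(values) > 1
    }


# Hypothetical values, for illustration only (not from this cluster).
cluster = {
    "VAXA": {"RECNXINTERVAL": 20, "EXPECTED_VOTES": 3},
    "VAXB": {"RECNXINTERVAL": 20, "EXPECTED_VOTES": 3},
    "VAXC": {"RECNXINTERVAL": 60, "EXPECTED_VOTES": 3},
}

for name, values in find_mismatches(cluster).items():
    print(name, values)
```

    With these made-up numbers only the reconnection interval is flagged,
    because VAXC differs from the other two nodes.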
    
    The HSJ errors are just virtual circuit closures to the CI node E.  I
    suspect this is either the system with the MTE or the system that
    crashed. 
    
    Brian Hibbert