Conference 7.286::fddi

Title:FDDI - The Next Generation
Moderator:NETCAD::STEFANI
Created:Thu Apr 27 1989
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2259
Total number of notes:8590

2065.0. "3-XmtTimeout on FDDI" by 50359::MGRUENWALD () Thu Jun 13 1996 07:44

    We have a problem with a crashing cluster under OpenVMS V6.2-1H2.
    
    Configuration:	OpenVMS V6.2-1H2
    			4 * 2100A with FDDI interface
    			directly connected to a Gigaswitch via one FGL-4
    			DSSI connection between 2 of the 2100A
    			approx. 50 satellites

    For some reason, one of the Sables gets an error on the FDDI interface
    and starts to init the FGL-4 card. Then the other 3 machines also get
    errors on the FDDI, such as timeouts. Two of them crash with a CLUEXIT,
    as do all the satellites! The other two, which have the DSSI connection
    between them, survive.
    The timeouts generated by the FGL-4 card in the Gigaswitch will
    hopefully be fixed by the new firmware 3.01.
    The question is, why do we get the first error on the FWA0 interface.
    
    The customer told me that he had crashes approx. 6 months ago with V6.2.
    After crash dump analysis he got a new SYS$FWDRIVER. This driver was
    linked on 13-DEC-1995. I have no idea whether this was the same or a
    similar problem.
    Now, after upgrading to V6.2-1H2, he has a new SYS$FWDRIVER
    with link date/time 27-JAN-1996. 
    Are those changes from the special driver implemented in the V6.2-1H2 one?
    May he use the special one from V6.2 with V6.2-1H2?

    Original V6.2-1H2:	
    		image name: "SYS$FWDRIVER"
    		image file identification: "X-3"
    		image file build identification: "X61Q-SSB-DD00"
    		link date/time:  27-JAN-1996
    
    Special V6.2:
    		image name: "SYS$FWDRIVER"
    		image file identification: "X-3"
    		image file build identification: "X61Q-SSB-0000"
    		link date/time:  13-DEC-1995     <-------------------
    
    Any help will be welcome.
    
    Michael
    
    Here are some extracts from the cluster:
    
    SDA> SHOW LAN /FDDI /ERROR

    LAN Data Structures
    -------------------
                -- FWA Error Information 12-JUN-1996 16:51:47 --

Fatal error count                  6    Last error CSR              00000400
Fatal error code        3-XmtTimeout    Last fatal error     11-JUN 14:54:49
Prev  error code        3-XmtTimeout    Prev fatal error     11-JUN 13:26:07
Transmit timeouts                  6    Last USB time                   None
Control timeouts                   0    Last UUB time        12-JUN 03:58:23
Restart failures                   0    Last CRC time                   None
Power failures                     0    Last CRC srcadr                 None
Bad PTE transmits                  0    Last length erro                None
Loopback failures                  0    Last exc collisi                None
System ID failures                 0    Last carrier fai                None
ReqCounters failures               0    Last late collis                None
    
    And here is a part of the ERRLOG.SYS


 ******************************* ENTRY     759. *******************************
 ERROR SEQUENCE 21121.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:06:59.54                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:55:51
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2

 HW_MODEL: 00000423 Hardware Model = 1059.

 ERL$LOGMESSAGE AlphaServer 2100A 4/200

 NI-SCS SUB-SYSTEM, _AXP601$PEA0:

       PORT HAS CLOSED VIRTUAL CIRCUIT

       LOCAL STATION ADDRESS, FFFFFFFFFF00(X)
       LOCAL SYSTEM ID, 00000000F525(X)

       REMOTE STATION ADDRESS, 0000000000CB(X)
       REMOTE SYSTEM ID, 00000000F5D0(X)

       UCB$L_ERTCNT    00000032
                                       50. RETRIES REMAINING
       UCB$L_ERTMAX    00000032
                                       50. RETRIES ALLOWABLE
       UCB$L_ERRCNT    0000003F
                                       63. ERRORS THIS UNIT
       PPD$B_PORT            00
                                       REMOTE NODE # 0.
       PPD$B_STATUS          00
       PPD$B_OPC             00
                                       UNKNOWN OPCODE
       PPD$B_FLAGS           00





 V M S                SYSTEM ERROR REPORT         COMPILED 12-JUN-1996 16:55:07
                                                                      PAGE  25.

 ******************************* ENTRY     765. *******************************
 ERROR SEQUENCE 21127.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:09:19.72                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:58:12
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2

 HW_MODEL: 00000423 Hardware Model = 1059.

 DEVICE ATTENTION AlphaServer 2100A 4/200

 NI-SCS SUB-SYSTEM, AXP601$PEA0:

       FATAL ERROR DETECTED BY DATALINK

       STATUS          0000045C
                       00001201
       DATALINK UNIT       0001
       DATALINK NAME   41574603
                       00000000
                       00000000
                       00000000
                                       DATALINK NAME = FWA1:
       REMOTE NODE     00000000
                       00000000
                       00000000
                       00000000
       REMOTE ADDR     00000000
                           0000
       LOCAL ADDR      000400AA
                           F525
                                       ETHERNET ADDR = 0E-01-01-00-00-00
       ERROR CNT           0001

                                       1. ERROR OCCURRENCES THIS ENTRY
       UCB$L_ERRCNT    00000040
                                       64. ERRORS THIS UNIT


V M S                SYSTEM ERROR REPORT         COMPILED 12-JUN-1996 16:55:07
                                                                      PAGE  26.

 ******************************* ENTRY     766. *******************************
 ERROR SEQUENCE 21128.                           LOGGED ON:  CPU_TYPE 00000002
 DATE/TIME 11-JUN-1996 11:09:22.86                            SYS_TYPE 00000018
 SYSTEM UPTIME: 5 DAYS 17:58:15
 SCS NODE: AXP601                                           OpenVMS AXP V6.2-1H2

 HW_MODEL: 00000423 Hardware Model = 1059.

 DEVICE ATTENTION AlphaServer 2100A 4/200

 NI-SCS SUB-SYSTEM, AXP601$PEA0:

       FATAL ERROR DETECTED BY DATALINK

       STATUS          8BD4F200
                       00001200
       DATALINK UNIT       0001
       DATALINK NAME   41574603
                       00000000
                       00000000
                       00000000
                                       DATALINK NAME = FWA1:
       REMOTE NODE     00000000
                       00000000
                       00000000
                       00000000
       REMOTE ADDR     00000000
                           0000
       LOCAL ADDR      000400AA
                           F525
                                       ETHERNET ADDR = 0E-01-01-00-00-00
       ERROR CNT           0001
                                       1. ERROR OCCURRENCES THIS ENTRY
       UCB$L_ERRCNT    00000041
                                       65. ERRORS THIS UNIT
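
The hex fields in the error log entries above can be cross-checked by hand. The following is a minimal Python sketch, assuming the DATALINK NAME longword is a counted ASCII string stored little-endian (count byte first); that layout is inferred from the "DATALINK NAME = FWA1:" translation in the log itself, and the function names are only illustrative.

```python
import struct

def decode_counted_name(longword: int) -> str:
    """Decode a longword assumed to hold a counted ASCII string (count byte first)."""
    raw = struct.pack("<I", longword)   # lay the longword out in little-endian byte order
    count = raw[0]                      # first byte is the character count
    return raw[1:1 + count].decode("ascii")

def device_name(name_longword: int, unit: int) -> str:
    """Combine the decoded controller name with the DATALINK UNIT field."""
    return f"{decode_counted_name(name_longword)}{unit}:"

print(device_name(0x41574603, 0x0001))  # DATALINK NAME / DATALINK UNIT from the log
print(int("00000423", 16))              # HW_MODEL in decimal
print(int("0000003F", 16))              # UCB$L_ERRCNT of entry 759 in decimal
```

With the values from the entries above this gives FWA1:, 1059 and 63, which agrees with the translations ERRLOG printed next to the raw fields.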

2065.1. "" by 19584::STOCKDALE () Thu Jun 13 1996 08:45 (14 lines)

I can't answer your question about why the transmit timeout occurred,
but normally it's because the link became unavailable, so rather than
hold on to the outstanding transmit forever, FWDRIVER resets the DEFPA
and returns the transmit with error status.
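
The behaviour described above is a watchdog on outstanding transmits. The sketch below is not FWDRIVER code; it is a hypothetical Python model, assuming a single timeout value (5 seconds is chosen here because a later reply in this topic cites 5-6 seconds) and assuming that a reset completes every outstanding transmit with an error.

```python
from dataclasses import dataclass

XMT_TIMEOUT_S = 5.0  # assumed value; a later reply cites a 5-6 second timeout

@dataclass
class PendingTransmit:
    request_id: int
    handed_to_device_at: float  # time (seconds) the device took ownership

def check_transmit_timeout(pending, now, timeout_s=XMT_TIMEOUT_S):
    """Return (reset_needed, failed_ids) for the outstanding transmits."""
    expired = any(now - p.handed_to_device_at >= timeout_s for p in pending)
    if not expired:
        return False, []
    # Assumption: resetting the adapter returns all outstanding transmits
    # to their callers with error status, not only the one that timed out.
    return True, [p.request_id for p in pending]

queue = [PendingTransmit(1, 0.0), PendingTransmit(2, 4.0)]
print(check_transmit_timeout(queue, now=3.0))
print(check_transmit_timeout(queue, now=5.5))
```

The point of the model is only that the timeout fires on how long the device has held a transmit, not on any link state the driver observes.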

Perhaps a SHOW LAN/FULL would provide more information.

As to the driver version question: the V6.2 driver enabled parity
checking when it shouldn't have.  This caused occasional parity error
crashes.  The new driver disabled parity checking.  This change is
included in the V6.2-1H* versions.  That sounds like a very different
problem from the one you are having, which sounds like a network problem.

Dick

2065.2. "GIGASwitch crashing?" by CSC32::J_SOBECKI (John Sobecki, DTN 592-4101, CXO3-2/D2) Thu Jun 13 1996 13:10 (20 lines)

    Hello,
    
    Usually the transmit timeouts are caused by the loss of the physical
    connection, i.e. is the GIGASwitch crashing?  I've never heard of a
    DEFPA causing an FGL-4 card to go down.
    
    Were the previous crashes the UCB R5 cleared crash?  This crash seems
    to not be checked in the recent LAN driver images.  
    
    The V6.2 driver should work fine under V6.2-1H2.  I'd check the errorlog
    on the GIGASwitch to see what's causing the transmit timeouts.  If you
    have more than one SCP, and the SCPs are crashing, the errorlog is
    contained on the SCP itself.  So if the elected SCP is the secondary
    SCP, you'll need to fail back to the primary SCP to check the errorlog.
    
    Maybe this is a new 2100A related problem.  I'd IPMT the driver issue
    if the crashes have returned.  
    
    Good Day,
    John

2065.3. "Get error log from FGL4 if necessary" by NPSS::RLEBLANC () Thu Jun 13 1996 16:42 (7 lines)

    
      If the SCP reports the FGL-4 in question is crashing, please
    also get the error log from the FGL4.
    
    
    						
    

2065.4. "" by FRSIT::MAYER () Mon Jun 17 1996 08:20 (10 lines)

Hi,

Next we will check the GIGAswitch errorlog to see if there are any problems
with the GIGAswitch SCP or linecard.

Also a sho lan/full is available on FRSIT::GSI_SDA_LAN.TXT

Regards
Juergen Mayer
                                                             

2065.5. "" by 19584::STOCKDALE () Tue Jun 18 1996 12:17 (5 lines)

>>Also a sho lan/full is available on FRSIT::GSI_SDA_LAN.TXT

It doesn't appear to be there.

- Dick

2065.6. "SDA output now available" by FRSIT::MAYER () Thu Jun 20 1996 09:15 (5 lines)

    Sorry,
    
    the sho lan/full is now available on FRSIT::GSI_SDA_LAN.TXT
    
    Regards Juergen

2065.7. "" by 19584::STOCKDALE () Thu Jun 20 1996 16:57 (63 lines)

If I extract the significant information from the counters it shows that
the ring went away and came back a few times, resulting in failed transmits
(either a timeout after the ring went away or transmits while the ring was
not available).  The last error CSR shows the port status register contents
at the time of the transmit timeout, showing 'link available' and nothing
else - this indicates that the FDDI appeared to be ok when the driver
declared a transmit timeout and shut down the adapter.  Note that the
transmit timeout is 5-6 seconds, so the device owned the transmit for
that long before the timeout occurred.

Transmit underrun                  0    Dup tokens detected                7
Ring inits received                5    LEM rejects                        0
DAT test failures                  0    Connections completed             10

No work transmits           59193334    Ring avail transitions            10
Buffer_Addr transmits              0    Ring unavail transitions           7
+00 Device interrupts      296991649    +2C Too many segments              0
+08 Transmits failed            2779    +34 RESETs issued                  3
+0C Receive errors                 0    +38 Fatal errs (soft tmo)          2
+10 Transmit timeouts              2    +3C EEPROM update tmo              0

Fatal error count                  2    Last error CSR              00000400
Fatal error code        3-XmtTimeout    Last fatal error     11-JUN 11:09:38
Prev  error code        3-XmtTimeout    Prev fatal error      7-JUN 16:50:01
Transmit timeouts                  2    Last USB time                   None
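
For comparing counters between two SHOW LAN snapshots, the two-column layout above can be turned into a dictionary. This is a rough Python sketch, assuming each pair is a name followed by at least two spaces and a decimal value, with an optional "+XX " offset prefix; it handles only the integer counter rows, so the error code, timestamp and CSR rows should not be fed to it (the CSR value is hex and would be misread as decimal).

```python
import re

# One "name  value" pair; an optional "+XX " offset prefix is skipped.
PAIR = re.compile(
    r"(?:\+[0-9A-F]{2} )?([A-Za-z_][A-Za-z_ ()]*?)\s{2,}(\d+)(?=\s|$)"
)

def parse_counters(text: str) -> dict:
    """Map counter name to integer value for every pair found in the text."""
    counters = {}
    for line in text.splitlines():
        for name, value in PAIR.findall(line):
            counters[name] = int(value)
    return counters

SAMPLE = """\
Transmit underrun                  0    Dup tokens detected                7
Ring inits received                5    LEM rejects                        0
DAT test failures                  0    Connections completed             10
No work transmits           59193334    Ring avail transitions            10
Buffer_Addr transmits              0    Ring unavail transitions           7
+00 Device interrupts      296991649    +2C Too many segments              0
+08 Transmits failed            2779    +34 RESETs issued                  3
+0C Receive errors                 0    +38 Fatal errs (soft tmo)          2
+10 Transmit timeouts              2    +3C EEPROM update tmo              0
"""

counters = parse_counters(SAMPLE)
for key in ("Ring unavail transitions", "Ring avail transitions",
            "Transmits failed", "Transmit timeouts", "RESETs issued"):
    print(f"{key:28s}{counters[key]:>10d}")
```

Subtracting two such dictionaries taken at different times would show which counters moved during a failure, which is what matters once the current values are known to include deliberate cable moves.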

The driver version is the V6.2-1H2 version.  There is a later version in
V6.2-1H3, but it only has a bug fix for a DEFAA workaround, so although the
version is different, the code is identical (the DEFAA bug fix is in
DEFAA conditional code).

But the driver consists of a port driver plus the LAN common routines.  The
LAN common routines have a couple of fixes in V6.2-1H3: one for when more
than 11 multicast addresses are enabled (this system has exactly 11), and one
that affects shared user applications, causing the first packet received
by a shared user to be lost.  The second matters only if there is actually
a shared user, and in this case there are none (although two users are
started in shared mode, there is only one for each protocol type).  So
neither of these fixes is significant in your case.

So, my guess is that there was a failure of the ring, which is likely
something on the ring and not the DEFPA in the system.  Perhaps a
longer timeout would have allowed the FDDI ring to recover from whatever
was going on.  Given that the driver would have restarted the users
automatically immediately after the error, the cluexits shouldn't have
happened, but apparently the FDDI ring did not come back before the
reconnect interval expired, so the satellites cluexited.  Increasing
the reconnect interval may give the nodes enough time for the ring
to recover.
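
The reasoning above reduces to comparing the ring outage with the reconnect interval (presumably the RECNXINTERVAL system parameter, although the parameter is not named here). The following Python sketch is a deliberately simplified model with made-up durations; the alternate-path case is an assumption based on the base note's remark that the two DSSI-connected nodes survived.

```python
def node_outcome(ring_outage_s: float,
                 reconnect_interval_s: float,
                 has_alternate_path: bool = False) -> str:
    """Simplified model: what happens to one cluster node during a ring outage."""
    if has_alternate_path:
        # Assumption: a second interconnect (e.g. DSSI) keeps the nodes
        # in contact with each other while the FDDI ring is down.
        return "survives via alternate path"
    if ring_outage_s < reconnect_interval_s:
        return "reconnects"
    return "CLUEXIT"

OUTAGE_S = 30.0  # hypothetical outage length, not taken from the logs
for interval in (20.0, 40.0):  # hypothetical reconnect intervals
    print(interval, node_outcome(OUTAGE_S, interval))
print(node_outcome(OUTAGE_S, 20.0, has_alternate_path=True))
```

The model ignores quorum and says nothing about which real values apply to this cluster; it only illustrates why a longer reconnect interval can turn a CLUEXIT into a reconnect for the same outage.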

>>    The question is, why do we get the first error on the FWA0 interface.

Because the FDDI ring became unavailable for more than 5-6 seconds.

>>    Are those changes from the special driver implemented in the V6.2-1H2 one?

Yes.

>>    May he use the special one from V6.2 with V6.2-1H2?

Yes, as long as he doesn't want a couple of additional bug fixes.

- Dick

2065.8. "" by FRSIT::MAYER () Fri Jun 21 1996 07:32 (13 lines)

Hi Dick,

I also saw the Ring Inits and Connections Completed. So I asked the customer
if he had been plugging and unplugging the systems from the Gigaswitch.
He confirmed that he had been moving them from one Gigaswitch port to another,
but didn't remember how often.
So at the moment we don't know how many inits are "homemade" and how many are
real failures.
Since we have the counters as of now, we have to wait until the next failure
occurs.

We are also focusing on the Gigaswitch counters and errorlogs.

regards Juergen