| Title: | FDDI - The Next Generation |
| Moderator: | NETCAD::STEFANI |
| Created: | Thu Apr 27 1989 |
| Last Modified: | Thu Jun 05 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 2259 |
| Total number of notes: | 8590 |
We have a problem with a crashing cluster under OpenVMS V6.2-1H2.
Configuration: OpenVMS V6.2-1H2
4 * 2100A with FDDI interface
direct connected to a Gigaswitch via one FGL-4
DSSI Connection between 2 of the 2100A
approx. 50 satellites
For some reason, one of the Sables gets an error on the FDDI interface
and starts to initialize the FGL-4 card. Then the other 3 machines also get
errors on the FDDI, such as timeouts. Two of them crash with a CLUEXIT, as do
all the satellites! The two with the DSSI connection between them survive.
The timeouts generated by the FGL-4 card in the Gigaswitch will hopefully
be fixed by new firmware 3.01.
The question is: why do we get the first error on the FWA0 interface?
The customer told me that he had crashes approx. 6 months ago with V6.2. After
crash dump analysis he got a new SYS$FWDRIVER. This driver was linked
on 13-DEC-1995. No idea whether this was the same or a similar problem.
Now, after upgrading to V6.2-1H2, he has a new SYS$FWDRIVER
with link date/time 27-JAN-1996.
Are those changes from the special driver implemented in the V6.2-1H2 one?
May he use the special one from V6.2 with V6.2-1H2?
Original V6.2-1H2:
image name: "SYS$FWDRIVER"
image file identification: "X-3"
image file build identification: "X61Q-SSB-DD00"
link date/time: 27-JAN-1996
Special V6.2:
image name: "SYS$FWDRIVER"
image file identification: "X-3"
image file build identification: "X61Q-SSB-0000"
link date/time: 13-DEC-1995 <-------------------
Any help will be welcome.
Michael
Here are some extracts from the cluster:
SDA> SHOW LAN /FDDI /ERROR
LAN Data Structures
-------------------
-- FWA Error Information 12-JUN-1996 16:51:47 --
Fatal error count 6 Last error CSR 00000400
Fatal error code 3-XmtTimeout Last fatal error 11-JUN 14:54:49
Prev error code 3-XmtTimeout Prev fatal error 11-JUN 13:26:07
Transmit timeouts 6 Last USB time None
Control timeouts 0 Last UUB time 12-JUN 03:58:23
Restart failures 0 Last CRC time None
Power failures 0 Last CRC srcadr None
Bad PTE transmits 0 Last length erro None
Loopback failures 0 Last exc collisi None
System ID failures 0 Last carrier fai None
ReqCounters failures 0 Last late collis None
And here is a part of the ERRLOG.SYS
******************************* ENTRY 759. *******************************
ERROR SEQUENCE 21121. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 11-JUN-1996 11:06:59.54 SYS_TYPE 00000018
SYSTEM UPTIME: 5 DAYS 17:55:51
SCS NODE: AXP601 OpenVMS AXP V6.2-1H2
HW_MODEL: 00000423 Hardware Model = 1059.
ERL$LOGMESSAGE AlphaServer 2100A 4/200
NI-SCS SUB-SYSTEM, _AXP601$PEA0:
PORT HAS CLOSED VIRTUAL CIRCUIT
LOCAL STATION ADDRESS, FFFFFFFFFF00(X)
LOCAL SYSTEM ID, 00000000F525(X)
REMOTE STATION ADDRESS, 0000000000CB(X)
REMOTE SYSTEM ID, 00000000F5D0(X)
UCB$L_ERTCNT 00000032
50. RETRIES REMAINING
UCB$L_ERTMAX 00000032
50. RETRIES ALLOWABLE
UCB$L_ERRCNT 0000003F
63. ERRORS THIS UNIT
PPD$B_PORT 00
REMOTE NODE # 0.
PPD$B_STATUS 00
PPD$B_OPC 00
UNKNOWN OPCODE
PPD$B_FLAGS 00
V M S SYSTEM ERROR REPORT COMPILED 12-JUN-1996 16:55:07
PAGE 25.
******************************* ENTRY 765. *******************************
ERROR SEQUENCE 21127. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 11-JUN-1996 11:09:19.72 SYS_TYPE 00000018
SYSTEM UPTIME: 5 DAYS 17:58:12
SCS NODE: AXP601 OpenVMS AXP V6.2-1H2
HW_MODEL: 00000423 Hardware Model = 1059.
DEVICE ATTENTION AlphaServer 2100A 4/200
NI-SCS SUB-SYSTEM, AXP601$PEA0:
FATAL ERROR DETECTED BY DATALINK
STATUS 0000045C
00001201
DATALINK UNIT 0001
DATALINK NAME 41574603
00000000
00000000
00000000
DATALINK NAME = FWA1:
REMOTE NODE 00000000
00000000
00000000
00000000
REMOTE ADDR 00000000
0000
LOCAL ADDR 000400AA
F525
ETHERNET ADDR = 0E-01-01-00-00-00
ERROR CNT 0001
1. ERROR OCCURRENCES THIS ENTRY
UCB$L_ERRCNT 00000040
64. ERRORS THIS UNIT
******************************* ENTRY 766. *******************************
ERROR SEQUENCE 21128. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 11-JUN-1996 11:09:22.86 SYS_TYPE 00000018
SYSTEM UPTIME: 5 DAYS 17:58:15
SCS NODE: AXP601 OpenVMS AXP V6.2-1H2
HW_MODEL: 00000423 Hardware Model = 1059.
DEVICE ATTENTION AlphaServer 2100A 4/200
NI-SCS SUB-SYSTEM, AXP601$PEA0:
FATAL ERROR DETECTED BY DATALINK
STATUS 8BD4F200
00001200
DATALINK UNIT 0001
DATALINK NAME 41574603
00000000
00000000
00000000
DATALINK NAME = FWA1:
REMOTE NODE 00000000
00000000
00000000
00000000
REMOTE ADDR 00000000
0000
LOCAL ADDR 000400AA
F525
ETHERNET ADDR = 0E-01-01-00-00-00
ERROR CNT 0001
1. ERROR OCCURRENCES THIS ENTRY
UCB$L_ERRCNT 00000041
65. ERRORS THIS UNIT
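The raw hex fields in the error log entries above can be cross-checked by hand. What follows is a minimal Python sketch, not part of the original note. It assumes that the errlog prints each longword or word with the most significant byte first, that the values sit in memory in little-endian order, and that DATALINK NAME is a counted ASCII string. The function names are made up for illustration. The last helper additionally assumes the system ID follows the DECnet Phase IV convention (area * 1024 + node), which the AA-00-04-00 address prefix suggests but the note does not state.

```python
import struct


def decode_counted_ascii(longword_hex: str) -> str:
    """Decode a longword, printed as hex, that holds a counted ASCII string."""
    raw = struct.pack("<I", int(longword_hex, 16))  # little-endian memory order
    count = raw[0]                                  # first byte is the length
    return raw[1:1 + count].decode("ascii")


def decode_lan_address(longword_hex: str, word_hex: str) -> str:
    """Join the longword and word of a LOCAL ADDR field into a 6-byte address."""
    raw = struct.pack("<I", int(longword_hex, 16))
    raw += struct.pack("<H", int(word_hex, 16))
    return "-".join(f"{byte:02X}" for byte in raw)


def decnet_address(system_id_hex: str) -> str:
    """Split the low 16 bits into area.node (assumed DECnet Phase IV layout)."""
    value = int(system_id_hex, 16) & 0xFFFF
    return f"{value >> 10}.{value & 0x3FF}"


if __name__ == "__main__":
    print(decode_counted_ascii("41574603"))        # DATALINK NAME
    print(decode_lan_address("000400AA", "F525"))  # LOCAL ADDR
    print(decnet_address("00000000F525"))          # LOCAL SYSTEM ID
```

With the values from entries 765 and 766 this gives FWA for the datalink name (unit 0001 makes it FWA1, matching the line the errlog prints itself), AA-00-04-00-25-F5 for the local address, and 61.293 for the local system ID under the Phase IV assumption.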
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 2065.1 | 19584::STOCKDALE | Thu Jun 13 1996 07:45 | 14 | ||
I can't answer your question about why the transmit timeout occurred, but
normally it's because the link became unavailable, so rather than hold on to
the outstanding transmit forever, FWDRIVER resets the DEFPA and returns the
transmit with error status. Perhaps a SHOW LAN/FULL would provide more
information.
As to the driver version question: the V6.2 driver enabled parity checking
when it shouldn't have. This caused occasional parity error crashes. The new
driver disabled parity checking. This change is included in the V6.2-1H*
versions. This sounds like a much different problem from the one you are
having, which sounds like a network problem.
Dick
| |||||
| 2065.2 | GIGASwitch crashing? | CSC32::J_SOBECKI | John Sobecki, DTN 592-4101, CXO3-2/D2 | Thu Jun 13 1996 12:10 | 20 |
Hello,
Usually the transmit timeouts are caused by the loss of physical
connection, i.e. is the GIGAswitch crashing? I've never heard of a
DEFPA causing an FGL-4 card to go down.
Were the previous crashes the UCB R5 cleared crash? This crash seems
to not be checked in the recent LAN driver images.
The V6.2 driver should work fine under V6.2-1H2. I'd check the errorlog
on the GIGAswitch to see what's causing the transmit timeouts. If you
have more than one SCP, and the SCPs are crashing, the errorlog is
contained on the SCP itself. So if the elected SCP is the secondary
SCP, you'll need to fail back to the primary SCP to check the errorlog.
Maybe this is a new 2100A related problem. I'd IPMT the driver issue
if the crashes have returned.
Good Day,
John
| |||||
| 2065.3 | Get error log from FGL4 if necessary | NPSS::RLEBLANC | Thu Jun 13 1996 15:42 | 7 | |
If the SCP reports the FGL-4 in question is crashing, please
also get the error log from the FGL4.
| |||||
| 2065.4 | FRSIT::MAYER | Mon Jun 17 1996 07:20 | 10 | ||
Hi,
as a next step we will check the GIGAswitch errorlog to see if there are any
problems with the GIGAswitch SCP or linecard.
Also, a SHOW LAN/FULL is available at FRSIT::GSI_SDA_LAN.TXT
Regards
Juergen Mayer
| |||||
| 2065.5 | 19584::STOCKDALE | Tue Jun 18 1996 11:17 | 5 | ||
>> Also a sho lan/full is available on FRSIT::GSI_SDA_LAN.TXT
It doesn't appear to be there.
- Dick
| |||||
| 2065.6 | SDA output now available | FRSIT::MAYER | Thu Jun 20 1996 08:15 | 5 | |
Sorry,
the sho lan/full is now available on FRSIT::GSI_SDA_LAN.TXT
Regards Juergen
| |||||
| 2065.7 | 19584::STOCKDALE | Thu Jun 20 1996 15:57 | 63 | ||
If I extract the significant information from the counters, it shows that the
ring went away and came back a few times, resulting in failed transmits
(either a timeout after the ring went away, or transmits while the ring was
not available). The last error CSR shows the port status register contents at
the time of the transmit timeout, showing 'link available' and nothing else.
This indicates that the FDDI appeared to be OK when the driver declared a
transmit timeout and shut down the adapter. Note that the transmit timeout is
5-6 seconds, so the device owned the transmit for that long before the timeout
occurred.
Transmit underrun                  0   Dup tokens detected                 7
Ring inits received                5   LEM rejects                         0
DAT test failures                  0   Connections completed              10
No work transmits           59193334   Ring avail transitions             10
Buffer_Addr transmits              0   Ring unavail transitions            7
+00 Device interrupts      296991649   +2C Too many segments               0
+08 Transmits failed            2779   +34 RESETs issued                   3
+0C Receive errors                 0   +38 Fatal errs (soft tmo)           2
+10 Transmit timeouts              2   +3C EEPROM update tmo               0
Fatal error count                  2   Last error CSR               00000400
Fatal error code        3-XmtTimeout   Last fatal error      11-JUN 11:09:38
Prev error code         3-XmtTimeout   Prev fatal error       7-JUN 16:50:01
Transmit timeouts                  2   Last USB time                    None
The driver version is the V6.2-1H2 version. There is a later version in
V6.2-1H3, but it only has a bug fix for a DEFAA workaround, so although the
version is different, the code is identical (the DEFAA bug fix is in DEFAA
conditional code). But the driver consists of a port driver plus the LAN
common routines. The LAN common routines have a couple of fixes in V6.2-1H3:
one when more than 11 multicast addresses are enabled (this system has exactly
11), and one which affects shared user applications, causing the first packet
received by a shared user to be lost. In this case there are no shared users
(although two users are started in shared mode, there is only one for each
protocol type). So neither of these fixes is significant in your case.
So, my guess is that there was a failure of the ring, which is likely
something on the ring and not the DEFPA in the system. Perhaps a longer
timeout would have allowed the FDDI ring to recover from whatever was going
on. Given that the driver would have restarted the users automatically
immediately after the error, the cluexits shouldn't have happened, but
apparently the FDDI ring did not come back before the reconnect interval
expired, so the satellites cluexited. Increasing the reconnect interval may
give the nodes enough time for the ring to recover.
>> The question is, why do we get the first error on the FWA0 interface.
Because the FDDI ring became unavailable for more than 5-6 seconds.
>> Are those changes from the special driver implemented in the V6.2-1H2 one?
Yes.
>> May he use the special one from V6.2 with V6.2-1H2?
Yes, as long as he doesn't want a couple of additional bug fixes.
- Dick
| |||||
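The timing argument in reply 2065.7 can be summarized in a small Python sketch. This is only an illustration of the reasoning, not driver or cluster code. The function name and the outcome strings are made up. The 5 second transmit timeout is taken from the "5-6 seconds" mentioned in the reply. The 20 second reconnect interval is a placeholder, since the thread does not give the value configured at this site (on OpenVMS it is presumably the RECNXINTERVAL system parameter).

```python
def classify_outage(outage_s: float,
                    xmt_timeout_s: float = 5.0,          # from "5-6 seconds" in the reply
                    reconnect_interval_s: float = 20.0,  # placeholder, not from the thread
                    ) -> str:
    """Return the expected outcome for a ring outage of the given length."""
    if outage_s < xmt_timeout_s:
        # The ring returns before the driver gives up on the pending transmit.
        return "ride through"
    if outage_s < reconnect_interval_s:
        # The driver declares a transmit timeout and resets the adapter, but
        # the ring returns in time for the cluster connections to re-form.
        return "adapter reset, cluster recovers"
    # The ring is still unavailable when the reconnect interval expires.
    return "CLUEXIT expected"


if __name__ == "__main__":
    for seconds in (2, 8, 30):
        print(seconds, classify_outage(seconds))
```

Under this model, raising the reconnect interval moves outages from the third case into the second, which is the effect the reply suggests trying.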
| 2065.8 | FRSIT::MAYER | Fri Jun 21 1996 06:32 | 13 | ||
Hi Dick,
I also saw the Ring Inits and Connections Completed counters, so I asked the
customer whether he had been plugging and unplugging the systems from the
Gigaswitch. He confirmed that he had been moving them from one Gigaswitch port
to another, but didn't remember how often. So at the moment we don't know how
many inits are "homemade" and how many are real failures. Because we only have
the counters as of now, we have to wait until the next failure occurs.
We are also focusing on the Gigaswitch counters and errorlogs.
Regards
Juergen
| |||||