Conference 7.286::fddi

Title: FDDI - The Next Generation
Moderator: NETCAD::STEFANI
Created: Thu Apr 27 1989
Last Modified: Thu Jun 05 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2259
Total number of notes: 8590

1287.0. "PC_Trace causing reboot??- Questions" by CGOOA::VAOP35::venner (NIS - Networks Team Vancouver) Tue Mar 22 1994 15:50

One of our large FDDI customers is having serious reboot issues with the 
DECconcentrator 500.  While the specific problem has not been 100% confirmed, 
we suspect it involves Digital's implementation of the PC_Trace 
feature.  There is a known problem with Cisco 7000s on the ring.  The 
customer's complaint is that the ring is less stable with the Concentrators on 
the core than without them.

Questions

1. Is a microcode kit available to enable soft switching on/off of the 
PC_Trace feature in the DECconcentrator 500? Or is it possible to implement a 
patch for the DECconcentrator 500 such that the concentrator does not 
perform a complete diagnostic re-boot?

2. Do the DECconcentrator 900 and DECbridge 900 operate in the same manner 
as the current 500?

3. Does each port on the GigaSwitch operate in the same manner as the current 
500?

What I am trying to determine is our ability to function in a very diverse 
multi-vendor FDDI shop.  In many situations our products will sit on the core 
ring.  The inability to switch the PC_Trace feature on or off, and the fact 
that our products require a diagnostic re-boot, are not acceptable to many 
high performance networking customers.  Our customer in particular has a known 
problem with Cisco.  They believe our product should be durable enough to 
handle the Cisco problems without a total re-boot!

Any insight into this issue would be appreciated.

1287.1. by KONING::KONING (Paul Koning, B-16504) Tue Mar 22 1994 16:00 (207 lines)

No.

If Cisco is doing a PC Trace in a situation where the standard does not
call for it, that is an EXTREMELY SERIOUS bug in that implementation.

Our products do exactly what the standard requires.  I expect it would be
possible to change the firmware so it ignores the PC Trace, but in that
mode the product would VIOLATE the standard.  Not only that, but in that
mode, if a PC Trace occurred that is legitimate (i.e., when the situation
described in the standard occurs) the result would be that your network
would be totally down until the problem is corrected manually.

Cisco is causing this problem for several customers.  They really don't
have any competence in FDDI.  Below is a more detailed discussion of
Trace, and why it works the way it does, which I wrote a while ago.

	paul

------------------------------------------------------------------------
From: KONING "Paul Koning, B-16504"	Date: 24-Jan-94 02:19 PM
To: DELNI::ONEILL
cc: LEVERS::B_CRONIN,QUIVER::PARISEAU,DELNI::BRECHLIN,NPSS::JOHNSON,NPSS::TAYLOR,POBOX::DCARROLL
Subject: RE: Fermi and Unsolicited Reset

>I forwarded the questions to Denis Carroll and got a quick reply from Fermi.
>Indeed it is a Trace and the problem is with Cisco.  However Fermi is saying
>that we should not need to do a reset... (see below).

I've put some responses in the note below; you might forward them to Fermi.

>I will try to set up teleconference with Phil for next week when he gets back
>from vacation.  Please send your availability for a meeting.  We should also
>meet this week to discuss what our strategy is.  How is Friday?  Time?

Fine, anytime (after 10 preferred, before ok if you have to...)

	paul
................
>From:	US1RMC::"[email protected]" "The Prince of Insufficient Light" 19-JAN-1994 13:14:14.37
>To:	delni::oneill
>CC:	[email protected], POBOX::DCARROLL, [email protected]
>Subj:	DECconcentrators, PATH TEST, resets, etc...
>
>Hi Sharon,
>
>	I'm Phil DeMar of Fermilab.  Denis Carroll forwarded your e-mail
>on our reported FDDI concentrator problems on to me.  Since I understand
>that you'll be away for the latter part of this week, and I'll be on
>vacation next week (somewhere warm, thank you...), I thought I'd
>shortcut the usual chain of command and respond directly to you.  First
>of all, let me emphasize several key points regarding our specific
>complaint:
>
>    1)  The offending device that is causing our TRACE problems happens to
>be a cisco 7000 router, not a SUN workstation.  We are quite aware there
>is a problem there, and are working with cisco to get it resolved. 
>
>    2)  Our issue (concern...) with the DEC concentrator is how it handles
>PATH TEST.  We recognize that a TRACE condition does indicate a problem
>with the ring, and that if a concentrator is identified to be within the
>TRACE fault zone, diagnostic testing of concentrator components is
>necessary.  However, it doesn't seem obvious to us that this should
>always result in a concentrator reset.  If only concentrator PHY ports
>are part of the fault zone (i.e., the concentrator MAC is not part of the
>fault zone), then it seems logical to us to only reset the concentrator
>if a problem is indicated on those specific PHY ports.  Absent that, the
>concentrator should simply reinitialize the (remaining...) ring, and let
>the other stations in the fault zone recover from their own PATH TEST
>and rejoin the ring.  
>
>    3)  From the correspondence I've seen from your engineers, it
>sounds like the reason the concentrator resets on PATH TEST is simply
>that this seems to be the (convenient?/only?) way to run diags.  The question
>would seem to be whether this is an architectural limitation or simply 
>an implementer's choice.

The issue here is on what exactly is the "fault zone" indicated by a Trace.

Things would be very simple if it were the port on which the trace arrives,
or that port plus some other ports.  Unfortunately it is not.  The "fault zone"
is the port on which the trace arrives, the next active port "upstream"
AND all the internal data paths between these two.  The fact that the internal
data paths are part of the "fault zone" and therefore are part of what needs
to be tested is the problem.  If there were a fault in that data path, then
a concentrator that simply tests its ports would not see any problem (since
the fault is not in the part being tested); the "Path test" would complete
without error, the concentrator would rejoin the ring, and the ring would
once again be down.
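
The point about the internal data path can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Concentrator` class, the two test functions, and the port names "M3" and "M2" are hypothetical and do not correspond to any DEC firmware or to the SMT state machines.

```python
from dataclasses import dataclass, field


@dataclass
class Concentrator:
    """Toy model: per-port hardware plus one shared internal data path."""
    faulty_ports: set = field(default_factory=set)
    internal_path_faulty: bool = False


def port_only_test(conc, trace_port, upstream_port):
    """Shortcut: test only the two ports at the edges of the fault zone."""
    return not ({trace_port, upstream_port} & conc.faulty_ports)


def full_path_test(conc, trace_port, upstream_port):
    """Test both ports and the internal data path that connects them."""
    ports_ok = port_only_test(conc, trace_port, upstream_port)
    return ports_ok and not conc.internal_path_faulty


# A fault in the shared internal data path, with healthy ports.
conc = Concentrator(internal_path_faulty=True)
print(port_only_test(conc, "M3", "M2"))  # True: shortcut passes, fault rejoins the ring
print(full_path_test(conc, "M3", "M2"))  # False: the fault is caught
```

In this model the port-only test reports success even though the shared path is broken, which is the case where the concentrator would rejoin the ring and take it down again.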

The internal data path between the ports is a common resource used by all the
ports.  It is certainly possible in principle to build a concentrator where
the internal data path consists of several pieces, each of which can be
individually bypassed and then tested.  But such a design is not supported
by ANY of the PHY chipsets on the market.  It can be built only by adding
substantial amounts of additional logic into the datapaths.  I know of no
concentrator on the market that does this.  (Note that a "local" path, or
multiple MACs, or other extra hardware implemented in some concentrators
supposedly for greater diagnostic capability, do not address this issue at all;
a concentrator that has these things has the same issue as we do, only more
so since it has more paths to partition!)

>    4)  We see this becoming a more significant vulnerability as FDDI
>deployment extends beyond a few routers/bridges and a handful of
>hosts on an FDDI backbone, to one in which there are 100s of hosts (of
>all types & flavors).  The likelihood of ring problems & TRACE conditions
>will become much greater, and we certainly expect our concentrators to
>"show some prudence" about resetting whenever an attached SA decides (for
>whatever reason) to initiate a TRACE.  

The issue we have is that the right answer DOES require us to make assumptions
about the reason why the station initiated Trace.

The FDDI standard specifies a very precise and limited circumstance under
which Trace is initiated.  The condition prescribed by the standard occurs
ONLY when there is a fault internal to some station or concentrator (e.g.,
inside the PHY chip, or in the data path from one PHY chip to another).
Note in particular that NO fault external to the station, or in any optical
component, is a trigger for Trace.   

If a Trace occurs due to the condition prescribed by the standard, your ring
is down and remains down until the failed module has been identified by
a Path test in one of the stations or concentrators.  If shortcuts are taken
in the Path test (for example if the test were to include only the port
components and not the connecting data paths) the consequence would be that
your ring would be permanently down, until the defective station is removed
by manual action.  This is why the standard spells out, in the strongest
wording it can, what the Path test is required to include.

It is certainly possible to react less strongly to a Trace signal on a port.
If this is done, that essentially means that the network manager has decided
that Trace signals are most likely to be false alarms, and are to be dismissed
in the most efficient way possible.  That is indeed a useful feature if 
connected stations generate Trace signals for reasons other than the one
prescribed in the standard, i.e., due to serious bugs.  However, the price
for doing this is that a Trace that occurs for the correct reason (the type
of fault envisioned in the standard) will NOT be resolved and will take
the ring down until manual removal of the faulty component.
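
The trade-off described above can be written out as a small decision table. This is a rough sketch only: the function name and the outcome strings are made up for illustration and simply restate the argument in this note; they are not taken from the standard.

```python
def ring_outcome(trace_is_legitimate, full_path_test):
    """Summarize what happens to the ring for each combination.

    trace_is_legitimate: True if the Trace was caused by the internal
        fault the standard describes, False if it came from a buggy station.
    full_path_test: True if the receiver runs the complete Path test
        (a reset, in the products discussed here), False if it takes a shortcut.
    """
    if full_path_test:
        # Any faulty component is found and kept out; the cost is the reset.
        return "recovers after reset"
    if trace_is_legitimate:
        # The shortcut never finds the fault, so it rejoins the ring.
        return "down until manual removal"
    # False alarm dismissed cheaply.
    return "recovers quickly"


for legit in (True, False):
    for full in (True, False):
        print(f"legitimate={legit!s:5} full_test={full!s:5} -> "
              f"{ring_outcome(legit, full)}")
```

Only one of the four combinations leaves the ring down, but it is the one the Trace mechanism exists to handle.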

>With regard to your specific questions:
>
>>>>  	1. Can you please confirm that stuck beaconing was happening prior
>>>>  	   to the PC_Trace.  This is the condition that is supposed to
>>>>  	   initiate PC_Trace.
>
>	I've verified that cisco 7000 does initiate a TRACE condition by
>sending out MLS.  For now, I'd just as soon work the cisco problem with
>cisco (i.e., as to whether it's beaconing, etc.).  The relevant point with
>respect to the concentrator is that it does receive a TRACE condition
>from one of its M-port attachments, and will always reset when it sees
>that TRACE condition.
>
>>>>  	2. Need more info on the system that is causing the PC_Trace.
>>>>  	   If there is a REAL reason
>>>>  	   that PC_Trace is happening, the DEC implementation is 
>>>>  	   abiding by the standard.  We are trying to determine if the PC_Trace
>>>>  	   could be from a poor implementation or there really is a problem 
>>>>  	   with the node.
>
>	With all due respect, if there's a TRACE condition, it is *real*. 
>Whether the TRACE-initiating station properly initiated the TRACE
>(claim, beacon, directed beacon, etc.) is irrelevant.  The issue here is
>how the concentrator handles a TRACE, not whether the attached system
>had cause to initiate it...

The issue is not whether the Trace is real, but whether the conditions
given in the standard for initiating Trace are present.  If they are, then
this indicates broken hardware.  If not, it indicates a bug in the station
initiating the trace.

As I mentioned above, if one were to take a shortcut in Trace handling,
this will have positive effects if the Trace signal was caused by a bug, 
but very negative effects if it was caused legitimately by a station fault.

>>>>  	3. Can you please get the error log.  We are trying to determine if
>>>>  	   this PATH test is being initiated by a real PC_Trace or a bug.
>>>>  	   Error log will help to determine this.
>
>	I had a Tekelec analyzer on the line.  The attached station sent
>out MLS.  It's a real TRACE condition, and the concentrator handles it
>properly (passing MLS to the upstream neighbor, then launching into PATH
>TEST when the upstream neighbor drops to QLS).  From our perspective,
>the concentrator does TRACE right;  we're just questioning the PATH TEST
>initiated by that TRACE condition.
>
>	Thanks.  Bye.
>
							-- Phil DeMar

>>>  Has anyone put a monitor on the network to see what was happening?

	More times than I care to think about.  However, as with any
FDDI problem, what's happening on one part of the ring at any instant is
different from everywhere else on the ring.  Additionally, we can't do
physical line tracing and frame grabbing concurrently, so we can only
look at one or the other at any given incident.  Combine all that with
the fact that this is an intermittent (order ~ days) problem, and this
becomes a real SOB to get a handle on...

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: by us1rmc.bb.dec.com; id AA23394; Wed, 19 Jan 94 13:13:18 -0500
% Received: from fnnetd.fnal.gov by inet-gw-1.pa.dec.com (5.65/13Jan94) id AA10621; Wed, 19 Jan 94 09:51:32 -080
% Date: Wed, 19 Jan 1994 11:51:26 -0600 (CST)
% From: The Prince of Insufficient Light <[email protected]>
% To: delni::oneill
% Cc: [email protected], POBOX::DCARROLL, [email protected]
% Message-Id: <[email protected]>
% Subject: DECconcentrators, PATH TEST, resets, etc...

1287.2. "PC_Trace (more options.)" by QUIVER::PARISEAU (Luc Pariseau) Wed Mar 23 1994 16:39 (31 lines)

	I agree with Paul that we are doing the right thing by doing
	a complete self test via a reboot.  But because of what is
	going on we will change this.

	Our new FDDI HUB products will (probably not at FCS) allow
	the customer to pick the Self_Test behavior:

	1-) Complete Self Test (via reboot like today; this will be the default)
	2-) Re-enable Station after Trace.  All ports will be disconnected
	    and then restart.  This will perform very little testing
	    (PCM will have to work at both ends).

	I will get in touch with the groups responsible for older
	products and try to get them to do this also.

	If a "real" trace condition is on the network and we are the
	ones causing it, method 2 will cause major network problems.
	But almost all the trace problems we hear about are not caused
	by our products, and yet by rebooting we are "causing" (as
	perceived by many customers) more problems.

	Also, the network will be down for about 14 secs anyway
	(stuck beacon + trace propagation).  The reboot adds 30 (or so)
	seconds to that.
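
As a rough worked figure for the numbers above, assuming the "about 14 secs" and "30 (or so)" estimates are taken at face value (the constant and function names here are only illustrative):

```python
BASE_OUTAGE_S = 14    # stuck beacon + trace propagation, per the estimate above
REBOOT_EXTRA_S = 30   # extra time added by the reboot, "30 (or so)"


def outage_seconds(reboot):
    """Approximate ring outage for one Trace event.

    Assumption: without the reboot only the base outage applies; the
    time for ports to disconnect and restart is not stated in the note,
    so the no-reboot figure is a lower bound.
    """
    return BASE_OUTAGE_S + (REBOOT_EXTRA_S if reboot else 0)


print(outage_seconds(reboot=True))   # 44
print(outage_seconds(reboot=False))  # 14
```

On these estimates the reboot accounts for roughly two thirds of the total outage per Trace event.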

	Again let me say that we are doing the legal/standard specified
	thing now.  It's not the popular thing so we will give the
	customers a choice.

	Luc

1287.3. by KONING::KONING (Paul Koning, B-16504) Thu Mar 24 1994 18:02 (10 lines)

Let's make perfectly clear that choice (2) explicitly puts the station in
violation of the standard, and should ONLY be used as a workaround for the
gross defects of certain other implementations.

Incidentally, we are not the only ones who handle trace correctly; after 
I posted a discussion of this stuff on Usenet, I got a reaction from
someone at another company saying something like "ok, so I'm not the only
one doing it this way... glad to see someone thinks it's correct to do that."

	paul