No.
If Cisco is doing a PC Trace in a situation where the standard does not
call for it, that is an EXTREMELY SERIOUS bug in that implementation.
Our products do exactly what the standard requires. I expect it would be
possible to change the firmware so it ignores the PC Trace, but in that
mode the product would VIOLATE the standard. Not only that, but in that
mode, if a PC Trace occurred that is legitimate (i.e., when the situation
described in the standard occurs) the result would be that your network
would be totally down until the problem is corrected manually.
Cisco is causing this problem for several customers. They really don't
have any competence in FDDI. Below is a more detailed discussion of
Trace, and why it works the way it does, which I wrote a while ago.
paul
------------------------------------------------------------------------
From: KONING "Paul Koning, B-16504" Date: 24-Jan-94 02:19 PM
To: DELNI::ONEILL
cc: LEVERS::B_CRONIN,QUIVER::PARISEAU,DELNI::BRECHLIN,NPSS::JOHNSON,NPSS::TAYLOR,POBOX::DCARROLL
Subject: RE: Fermi and Unsolicited Reset
>I forwarded the questions to Denis Carroll and got a quick reply from Fermi.
>Indeed it is a Trace and the problem is with Cisco. However Fermi is saying
>that we should not need to do a reset... (see below).
I've put some responses in the note below; you might forward them to Fermi.
>I will try to set up teleconference with Phil for next week when he gets back
>from vacation. Please send your availability for a meeting. We should also
>meet this week to discuss what our strategy is. How is Friday? Time?
Fine, anytime (after 10 preferred, before ok if you have to)...
paul
................
>From: US1RMC::"[email protected]" "The Prince of Insufficient Light" 19-JAN-1994 13:14:14.37
>To: delni::oneill
>CC: [email protected], POBOX::DCARROLL, [email protected]
>Subj: DECconcentrators, PATH TEST, resets, etc...
>
>Hi Sharon,
>
> I'm Phil DeMar of Fermilab. Denis Carroll forwarded your e-mail
>on our reported FDDI concentrator problems on to me. Since I understand
>that you'll be away for the latter part of this week, and I'll be on
>vacation next week (somewhere warm, thank you...), I thought I'd
>shortcut the usual chain of command and respond directly to you. First
>of all, let me emphasize several key points regarding our specific
>complaint:
>
> 1) The offending device that is causing our TRACE problems happens to
>be a cisco 7000 router, not a SUN workstation. We are quite aware there
>is a problem there, and are working with cisco to get it resolved.
>
> 2) Our issue (concern...) with the DEC concentrator is how it handles
>PATH TEST. We recognize that a TRACE condition does indicate a problem
>with the ring, and that if a concentrator is identified to be within the
>TRACE fault zone, diagnostic testing of concentrator components is
>necessary. However, it doesn't seem obvious to us that this should
>always result in a concentrator reset. If only concentrator PHY ports
>are part of the fault zone (i.e., the concentrator MAC is not part of the
>fault zone), then it seems logical to us to only reset the concentrator
>if a problem is indicated on those specific PHY ports. Absent that, the
>concentrator should simply reinitialize the (remaining...) ring, and let
>the other stations in the fault zone recover from their own PATH TEST
>and rejoin the ring.
>
> 3) From the correspondence I've seen from your engineers, it
>sounds like the reason the concentrator resets on PATH TEST is simply
>that this seems to be the (convenient?/only?) way to run diags. The question
>would seem to be whether this is an architectural limitation or simply
>implementer's choice.
The issue here is what exactly constitutes the "fault zone" indicated by a Trace.
Things would be very simple if it were the port on which the trace arrives,
or that port plus some other ports. Unfortunately it is not. The "fault zone"
is the port on which the trace arrives, the next active port "upstream"
AND all the internal data paths between these two. The fact that the internal
data paths are part of the "fault zone" and therefore are part of what needs
to be tested is the problem. If there were a fault in that data path, then
a concentrator that simply tests its ports would not see any problem (since
the fault is not in the part being tested); the "Path test" would complete
without error, the concentrator would rejoin the ring, and the ring would
once again be down.
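To make the point above concrete, here is a minimal sketch in Python. It is
a toy model only, not firmware and not taken from the standard's text: the
Concentrator class, the port names, the "datapath" label and the choice of
which direction counts as "upstream" are all assumptions made for
illustration. It shows a path test that covers only the ports passing even
though the fault sits in the shared internal data path.

```python
from dataclasses import dataclass, field


@dataclass
class Concentrator:
    """Toy model (hypothetical): ports in ring order plus one shared data path."""
    ports: list                                  # port names, in ring order
    active: set                                  # ports currently in the ring
    faulty: set = field(default_factory=set)     # failed components, if any

    def upstream_active(self, port):
        """Next active port upstream of `port` (assumed: walking backwards)."""
        i = self.ports.index(port)
        for step in range(1, len(self.ports) + 1):
            candidate = self.ports[(i - step) % len(self.ports)]
            if candidate in self.active:
                return candidate
        return port

    def fault_zone(self, trace_port):
        """Arrival port, next active upstream port, and the path between them."""
        return {trace_port, self.upstream_active(trace_port), "datapath"}

    def path_test(self, trace_port, include_datapath=True):
        """Return True if nothing in the tested set is faulty (test passes)."""
        tested = self.fault_zone(trace_port)
        if not include_datapath:
            tested.discard("datapath")           # the "ports only" shortcut
        return not (tested & self.faulty)


# A fault in the internal data path, with Trace arriving on port M4.
conc = Concentrator(ports=["M1", "M2", "M3", "M4"],
                    active={"M1", "M2", "M4"},
                    faulty={"datapath"})
ports_only_passes = conc.path_test("M4", include_datapath=False)
full_test_passes = conc.path_test("M4")
```

In this sketch the ports-only test passes, so the concentrator would rejoin
the ring with the fault still in place; only the test that includes the data
path fails and so identifies the concentrator as the broken module.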
The internal data path between the ports is a common resource used by all the
ports. It is certainly possible in principle to build a concentrator where
the internal data path consists of several pieces, each of which can be
individually bypassed and then tested. But such a design is not supported
by ANY of the PHY chipsets on the market. It can be built only by adding
substantial amounts of additional logic into the datapaths. I know of no
concentrator on the market that does this. (Note that a "local" path, or
multiple MACs, or other extra hardware implemented in some concentrators
supposedly for greater diagnostic capability, do not address this issue at all;
a concentrator that has these things has the same issue as we do, only more
so since it has more paths to partition!)
> 4) We see this becoming a more significant vulnerability as FDDI
>deployment extends beyond a few routers/bridges and a handful of
>hosts on an FDDI backbone, to one in which there are 100s of hosts (of
>all types & flavors). The likelihood of ring problems & TRACE conditions
>will become much greater, and we certainly expect our concentrators to
>"show some prudence" about resetting whenever an attached SA decides (for
>whatever reason) to initiate a TRACE.
The issue we have is that the right answer DOES require us to make assumptions
about the reason why the station initiated Trace.
The FDDI standard specifies a very precise and limited circumstance under
which Trace is initiated. The condition prescribed by the standard occurs
ONLY when there is a fault internal to some station or concentrator (e.g.,
inside the PHY chip, or in the data path from one PHY chip to another).
Note in particular that NO fault external to the station, or in any optical
component, is a trigger for Trace.
If a Trace occurs due to the condition prescribed by the standard, your ring
is down and remains down until the failed module has been identified by
a Path test in one of the stations or concentrators. If shortcuts are taken
in the Path test (for example if the test were to include only the port
components and not the connecting data paths) the consequence would be that
your ring would be permanently down, until the defective station is removed
by manual action. This is why the standard states, in the strongest wording
it can, what the path test is required to include.
It is certainly possible to react less strongly to a Trace signal on a port.
If this is done, that essentially means that the network manager has decided
that Trace signals are most likely to be false alarms, and are to be dismissed
in the most efficient way possible. That is indeed a useful feature if
connected stations generate Trace signals for reasons other than the one
prescribed in the standard, i.e., due to serious bugs. However, the price
for doing this is that a Trace that occurs for the correct reason (the type
of fault envisioned in the standard) will NOT be resolved and will take
the ring down until manual removal of the faulty component.
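The trade-off just described can be laid out as a small table in code. This
is an illustrative sketch in Python, not anything from the standard or from
our firmware; the enum names and the outcome() function are hypothetical
labels for the four cases discussed above.

```python
from enum import Enum


class Cause(Enum):
    STATION_BUG = "Trace sent outside the condition given in the standard"
    INTERNAL_FAULT = "Trace sent for the condition given in the standard"


class Policy(Enum):
    FULL_PATH_TEST = "test the ports and the internal data paths"
    SHORTCUT = "dismiss the Trace, or test the ports only"


def outcome(cause, policy):
    """Return (ring_recovers_without_manual_action, note) for one case."""
    if policy is Policy.FULL_PATH_TEST:
        if cause is Cause.INTERNAL_FAULT:
            return True, "failed module is found by the path test"
        return True, "nothing is broken; the full test was disruption only"
    # Shortcut policy: cheap when the Trace was a false alarm...
    if cause is Cause.STATION_BUG:
        return True, "false alarm dismissed with minimal disruption"
    # ...but a genuine internal fault goes undetected.
    return False, "ring stays down until the faulty component is removed"


# Enumerate all four combinations of cause and policy.
table = {(c, p): outcome(c, p) for c in Cause for p in Policy}
```

Only one of the four cases leaves the ring down without manual action: a
legitimate Trace handled with a shortcut. That is the case the standard's
path test requirement exists to prevent.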
>With regard to your specific questions:
>
>>>> 1. Can you please confirm that stuck beaconing was happening prior
>>>> to the PC_Trace. This is the condition that is supposed to
>>>> initiate PC_Trace.
>
> I've verified that cisco 7000 does initiate a TRACE condition by
>sending out MLS. For now, I'd just as soon work the cisco problem with
>cisco (i.e., as to whether it's beaconing, etc.). The relevant point with
>respect to the concentrator is that it does receive a TRACE condition
>from one of its M-port attachments, and will always reset when it sees
>that TRACE condition.
>
>>>> 2. Need more info on the system that is causing the PC_Trace.
>>>> If there is a REAL reason
>>>> that PC_Trace is happening, the DEC implementation is
>>>> abiding by the standard. We are trying to determine if the PC_Trace
>>>> could be from a poor implementation or there really is a problem
>>>> with the node.
>
> With all respects, if there's a TRACE condition, it is *real*.
>Whether the TRACE-initiating station properly initiated the TRACE
>(claim, beacon, directed beacon, etc.) is irrelevant. The issue here is
>how the concentrator handles a TRACE, not whether the attached system
>had cause to initiate it...
The issue is not whether the Trace is real, but whether the conditions
given in the standard for initiating Trace are present. If they are, then
this indicates broken hardware. If not, it indicates a bug in the station
initiating the trace.
As I mentioned above, if one were to take a shortcut in Trace handling,
this would have positive effects if the Trace signal was caused by a bug,
but very negative effects if it was caused legitimately by a station fault.
>>>> 3. Can you please get the error log. We are trying to determine if
>>>> this PATH test is being initiated by a real PC_Trace or a bug.
>>>> Error log will help to determine this.
>
> I had a Tekelec analyzer on the line. The attached station sent
>out MLS. It's a real TRACE condition, and the concentrator handles it
>properly (passing MLS to the upstream neighbor, then launching into PATH
>TEST when the upstream neighbor drops to QLS). From our perspective,
>the concentrator does TRACE right; we're just questioning the PATH TEST
>initiated by that TRACE condition.
>
> Thanks. Bye.
>
> -- Phil DeMar
>>> Has anyone put a monitor on the network to see what was happening?
More times than I care to think about. However, as with any
FDDI problem, what's happening on one part of the ring at any instant is
different from everywhere else on the ring. Additionally, we can't do
physical line tracing and frame grabbing concurrently, so we can only
look at one or the other at any given incident. Combine all that with
the fact that this is an intermittent (order ~ days) problem, and this
becomes a real SOB to get a handle on...
% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: by us1rmc.bb.dec.com; id AA23394; Wed, 19 Jan 94 13:13:18 -0500
% Received: from fnnetd.fnal.gov by inet-gw-1.pa.dec.com (5.65/13Jan94) id AA10621; Wed, 19 Jan 94 09:51:32 -080
% Date: Wed, 19 Jan 1994 11:51:26 -0600 (CST)
% From: The Prince of Insufficient Light <[email protected]>
% To: delni::oneill
% Cc: [email protected], POBOX::DCARROLL, [email protected]
% Message-Id: <[email protected]>
% Subject: DECconcentrators, PATH TEST, resets, etc...