
Conference 7.286::fddi

Title:FDDI - The Next Generation
Moderator:NETCAD::STEFANI
Created:Thu Apr 27 1989
Last Modified:Thu Jun 05 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2259
Total number of notes:8590

421.0. "FDDI Maximum Access Delay" by TOOIS1::MIRGHANE () Mon Dec 16 1991 13:40

    In DIGITAL Technical Journal, volume 3, number 3, Raj Jain
    explains very well how maximum access delay (MAD) is linked to
    the TTRT parameter with the formula: 
    	MAD = Maximum access delay = (n-1)*TTRT + 2*D
    where n is the number of stations and D the ring latency.
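
    As a rough illustration of the formula above, here is a minimal Python
    sketch.  The function name and the sample values (100 stations,
    TTRT = 8 ms, D = 1 ms) are assumptions chosen for the example; they are
    not taken from the Journal article.

```python
def max_access_delay(n_stations: int, ttrt_s: float, ring_latency_s: float) -> float:
    """Steady-state bound: MAD = (n - 1) * TTRT + 2 * D, all times in seconds."""
    if n_stations < 1:
        raise ValueError("need at least one station")
    return (n_stations - 1) * ttrt_s + 2 * ring_latency_s


# Assumed example values, for illustration only.
mad = max_access_delay(n_stations=100, ttrt_s=0.008, ring_latency_s=0.001)
print(f"MAD = {mad * 1000:.1f} ms")
```

    With these assumed values the (n-1)*TTRT term dominates, which is why
    the choice of TTRT matters so much for realtime designs.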
    
    This MAD value is very important for designing distributed 
    FDDI realtime systems.
    
    However, we think this formula needs to be extended to cover the 
    unsteady states of the FDDI ring.
    
    We need to know how the token is managed: for example, what happens
    when the token is lost, or what happens when there is an automatic 
    reconfiguration of the ring (e.g. when a dual attachment station 
    breaks down).
    Our objective is to understand the causes of such unsteady states
    in order to design systems to minimize the probability
    of such events.
    The second objective is to be able to calculate the Maximum access
    Delay in case of such an unsteady state.
    
    Thanks very much for any help (pointer, document, ...).
    
    Soumetty.
    
T.R    Title    User    Personal Name    Date    Lines
421.1    KONING::KONING    Paul Koning, NI1D    Tue Dec 17 1991 11:01    124
There are various levels of fault recovery in FDDI, some of which deal with
very esoteric faults (and take a fair amount of time to do so).

There are basically three levels of recovery.

1. Recovery from problems caused by line errors

This category applies to a normal working network, since some small number of
line errors is normal for a working network.  (Note that on fiber optic
networks, the "typical" bit error rate is extremely low and may approach
zero -- however, it can be non-zero even though nothing is "defective".)

Line errors can result in (a) loss of data packets, (b) loss of the token.

Data packet loss is generally detected by higher layer protocols.  Once
detected, the higher layer protocol normally retransmits.  The delay
caused by a data packet loss is therefore the sum of the higher layer timeout
and the FDDI access delay.  Typically, the timeout is a couple of seconds,
which is far larger than the FDDI access delay in normal configurations.

Packet loss affects only the user whose packet was lost; everything else
continues normally.

Given a bit error rate of 10^-12 (which is "typical"), 100 stations, and
4500-byte packets, you get a packet loss rate of 3.6*10^-6 (36,000 bits per
packet, times 100 links, times 10^-12).  In other words, less than
1 in 100,000 packets is lost.
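
The arithmetic behind that figure can be checked with a short Python sketch.
It assumes that errors are independent and that each packet crosses one link
per station on its way around the ring; the function name is made up for the
example.

```python
import math


def packet_loss_rate(ber: float, packet_bytes: int, n_links: int) -> float:
    """Probability that at least one bit of the packet is hit on any link.

    Assumes independent bit errors and one link per station (assumption).
    Computes 1 - (1 - ber)**total_bits via log1p/expm1 so that very small
    error rates do not vanish in floating-point rounding.
    """
    total_bits = packet_bytes * 8 * n_links
    return -math.expm1(total_bits * math.log1p(-ber))


rate = packet_loss_rate(ber=1e-12, packet_bytes=4500, n_links=100)
print(f"loss rate = {rate:.2e}, about 1 packet in {1 / rate:,.0f}")
```

For error rates this low the result is practically the same as the simple
product bits * links * BER used above.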

Token loss due to bit errors occurs when a bit error hits the token and
changes it to something else.  There are two timers in FDDI that detect
this.  In general, it is the TVX (Valid Transmission) timer that will go off.

Token loss of course affects everyone; transmissions stop until a new token
is generated.  That's why the recovery here is rather quick.

TVX is typically a few milliseconds.  Our stations use as default 2.62 ms.
So that's the delay from token loss until detection.  Once TVX goes off,
Claim is started to recreate the token.  Claim takes at most two ring
delays (i.e., 2*D_Max, or 3.2 ms, worst case); after that it takes two more
ring delays for the ring to return completely to normal.  So the entire
process takes about 9 ms with default TVX.  (There is no reason I know of
ever to set TVX to any other value -- even though the FDDI standard insists
that it's settable.  Then again, the same thing is true for all but one or
two FDDI parameters...)
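
To make the 9 ms figure explicit, the three phases can be added up in a small
Python sketch.  D_Max is taken as 1.6 ms here only because the text gives
2*D_Max as 3.2 ms; the names are illustrative and not from any FDDI software.

```python
TVX_S = 2.62e-3    # default TVX quoted above
D_MAX_S = 1.6e-3   # assumed: inferred from "2*D_Max, or 3.2 ms"


def token_recovery_time(tvx_s: float = TVX_S, d_max_s: float = D_MAX_S) -> float:
    """Worst-case time from token loss until the ring is back to normal."""
    detect = tvx_s          # TVX expires and detects the loss
    claim = 2 * d_max_s     # Claim takes at most two ring delays
    settle = 2 * d_max_s    # two more ring delays to return to normal
    return detect + claim + settle


print(f"recovery = {token_recovery_time() * 1000:.2f} ms")
```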

2. Recovery from problems caused by link failures

This category covers things that are "broken" (as opposed to bit errors,
which are "normal") -- but that are nevertheless fairly common and
expected.

By "link failure" I mean cable breaks, connector problems, and transceiver 
failures.  All these cause loss of signal.  Cable break is probably the most
common, followed by connector problems.  Transceiver failures aren't all
that common but the transmitter in particular does have a finite (though
quite long) life and will degrade over time.

Two mechanisms recover from link failures.  The simplest case is a total
failure, such as an unplugged connector or a backhoe digging right through
your cable.  That is caught by the Signal Detect function of PMD, or
the Quiet Line State detection in PHY.  This happens within a millisecond.
After that, the affected port is removed from the ring, and the ring is
reinitialized as with a token loss.  So the total hit is under 10 ms.

Degraded links generally take longer to detect but are probably less common.
"Degraded" here means a link with a high bit error rate -- perhaps 10^-6 or so.
This can happen due to damaged cable, or partially plugged-in connectors.
The Link Error Monitor detects this situation.  There is a parameter, set by
default to 8 (i.e., 10^-8) which specifies what link error rate is considered
"excessive".  Normally you should leave it at 8.  The only reason for changing
it is that you have a marginal link that is very important, and you prefer to
leave it on -- even though you're getting high packet and token loss rates --
rather than have it taken out of service, while waiting for it to be repaired.
LEM requires about 5-10 errors over an averaging period of a minute or so
before it decides a link is bad.  If it suddenly becomes bad enough to generate
10 errors in one second, it's shut off then, but if it gradually goes bad, it
will probably take up to a minute or so for the problem to be detected.
Note that this isn't serious, since communication does continue in the meantime.
(You may lose a lot more packets than usual, or lose the token a couple of
times, but that's not all that major given that it doesn't last.)  Once the
problem is detected, the port is removed from the ring and the ring 
reinitialized, as above.

3. Recovery from esoteric problems

If you think long enough and hard enough, you can come up with some cases
that aren't covered by any of the above.  There are two that are rather
improbable but are nevertheless covered by the FDDI standard: internal
station faults, and duplicate addresses.

Internal station faults are faults other than link faults.  The typical
example we use is a MAC whose transmit circuit is broken and always sending
Idle.  In that case, clearly you can't get any data through, and the token
disappears.  Claim also fails, and the subsequent Beacon process will not
complete.  (Getting to this point takes 175 ms.)  Thus we speak of a 
"stuck beaconing" condition.

After about 8 seconds of beaconing, the "Trace" process begins.  This 
propagates a signal to each of the stations that might be the source of
the problem.  The signal propagation typically takes less than a second;
the worst case is 7 seconds.  At that point, each of the stations that
received the trace signal performs a self test.  There is no standard
time bound for self test, of course; it depends on the product.  After
successful self test, the station rejoins the ring.  In the meantime,
as soon as trace finishes the other stations (those not suspected of
being broken) remain in an operating ring; for them there was an interruption
of up to 15 seconds (stuck-beaconing time plus trace time).

Duplication of the station address can cause the ring not to operate.
(You'll have to walk through the details of the Claim procedure to see
why; I'll skip that for now.)  If that happens, a duplicate address test
is invoked after about 1 second, which should identify the offenders and
remove them from the ring quickly (within another second or so).  

Note that I said "station address", i.e., "MLA".  In DEC FDDI products,
the station address is ALWAYS the "hardware address" and comes from a ROM.
Thus the probability of duplicate address is close to zero.  The familiar
DECnet Phase 4 address is NOT the MLA in our FDDI products -- instead it
is an "alias" address.  A major reason for this is that duplication of an
alias address isn't a serious problem.  As many people know, duplicate
DECnet node numbers -- and thus duplicate DECnet-style addresses -- happen
fairly frequently.  We wanted to make sure that this situation would not
crash the ring, which is why alias addresses are used for it.
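
To tie this back to the original question, here is a hedged Python sketch
that collects the interruption times quoted in this note and adds them to the
steady-state MAD.  Treating "outage plus steady-state MAD" as the access delay
across a fault is a conservative assumption made for this example, not
something stated in the FDDI standard, and the dictionary keys are made up.

```python
# Approximate outage bounds in seconds, as quoted in this note.
# The duplicate-address value is a rough sum ("about 1 second" to start
# the test plus "another second or so"); the stuck-beaconing value
# excludes self-test time for the suspected stations.
OUTAGE_BOUND_S = {
    "token_loss_bit_error": 0.009,
    "link_failure": 0.010,
    "stuck_beaconing": 15.0,
    "duplicate_address": 2.0,
}


def access_delay_with_fault(fault: str, n_stations: int,
                            ttrt_s: float, ring_latency_s: float) -> float:
    """Assumed estimate: outage bound + (n - 1) * TTRT + 2 * D."""
    steady = (n_stations - 1) * ttrt_s + 2 * ring_latency_s
    return OUTAGE_BOUND_S[fault] + steady


# Assumed ring: 100 stations, TTRT = 8 ms, D = 1 ms.
for fault in OUTAGE_BOUND_S:
    delay = access_delay_with_fault(fault, 100, 0.008, 0.001)
    print(f"{fault:22s} {delay:7.3f} s")
```

The point of the sketch is mainly that the esoteric faults dominate by orders
of magnitude, so a realtime design is governed by whether those can be
excluded rather than by the millisecond-scale recoveries.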

	paul
421.2    Thanks    TOOIS1::MIRGHANE    Tue Dec 17 1991 14:20    4
    Paul, thank you very much for this information. 
    That's exactly what we needed.
    
    Soumetty.