
Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

8678.0. "Interactive network response 'lumpy'" by IOSG::MARSHALL () Sun Feb 02 1997 09:24

Alphastation 255, Digital Unix V4.0A, firmware CD V3.8

Bit of a hard-to-track-down network problem.

Basically, interactive response time is very variable.  For example
    - rlogin to another machine (Unix or VMS)
    - run up LSE and move around, type stuff, etc

Some of the time the screen updates match what I type 'instantaneously',
sometimes there can be a delay of up to ten seconds before the screen reflects
what I've typed.

During that time I can do other things locally, so my Unix machine isn't
hanging, and other people can do things quite happily on the remote machine, so
that isn't hanging either.  The behaviour is the same regardless of which remote
machine I use.

It isn't network congestion.  We've checked that out and the network is very
lightly loaded, and no-one else has this problem.

It isn't the network configuration.  The problem has existed since this AS255
was first plugged into the net; the previous machine plugged into the same bit
of thinwire didn't have the problem.

Since then, the thinwire has been replaced with UTP, and that made no
difference.  The UTP wire from this AS255 has been moved to a different
repeater, which also makes no difference.  Other people on the same repeaters
have no problems.

All of which is making the problem look very much like it's in the AS255.

Oh, also, the AS255's motherboard (which contains the network hardware, for
those not familiar with this machine) has been replaced for other reasons, and
that made no difference, so it's not malfunctioning hardware.

However, bulk network operations seem to run quickly and smoothly; 'cat'ting or
'type'ing large files from a remote machine works fine.  It's only small
"interactive" packets that seem to get held up.

Is there something I can run on the AS255 to monitor what it's doing with the
net to figure out what's causing the delays?  Is there some tuning that can be
done to improve performance for small packets (nb before anyone says rtfm, I
would if I had one!).

Does anyone have any ideas at all what is causing this?

Many thanks,
Scott

[x-posted in wrksys::alphastation]
T.R  Title  User  Personal Name  Date  Lines
8678.1. "" by SMURF::MENNER (it's just a box of Pax..) Mon Feb 03 1997 08:48 (1 line)
    Does netstat show any dropped packets?
8678.2. "" by IOSG::MARSHALL () Mon Feb 03 1997 09:39 (11 lines)
Running netstat in a (local) window while the problem is occurring in a remote
window shows no dropped packets, and no increase in the number of error packets.

While the system is quiescent (ie no explicit network activity), there are about
ten input packets and two output packets per second, presumably related to
keeping links 'alive'.

The total number of error packets is 555 (input), 1 (output); is that reasonable
for a machine that's been up a week or so?

Scott
8678.3. "too high" by SMURF::DUSTIN () Mon Feb 03 1997 16:55 (7 lines)
    No, input errors shouldn't be that high.  I have 6 input errors
    over the last 55 days.
    
    Get us a "netstat -is" so we can see what the input errors are.
    
    John
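
To put the two figures quoted in .2 and .3 on the same footing, they can be
normalised to errors per day.  This is only a rough sketch: "a week or so" is
taken as exactly 7 days, which is an assumption.

```python
# Input-error counts from .2 and .3, expressed as errors per day of uptime.
# Assumption: "a week or so" in .2 is treated as exactly 7 days.

def errors_per_day(errors, days):
    """Average number of errors per day of uptime."""
    return errors / days

as255 = errors_per_day(555, 7)       # figures from .2
reference = errors_per_day(6, 55)    # figures from .3

print(f"AS255:     {as255:.1f} errors/day")
print(f"reference: {reference:.2f} errors/day")
print(f"ratio:     roughly {as255 / reference:.0f}x")
```

Even with a generous allowance for the vague uptime, the AS255 is logging
errors hundreds of times faster than the reference machine.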
    
8678.4. "Output from netstat -is" by IOSG::MARSHALL () Wed Feb 05 1997 12:47 (31 lines)
Well, input errors are up to 710 now, although these errors don't seem to
coincide with the network delays.

Here's the output from netstat -is.  I guess the "Block check error" and
"Framing Error" lines are the significant ones.  What do they mean?

tu0 Ethernet counters at Wed Feb  5 17:21:44 1997

       65535 seconds since last zeroed
   966491707 bytes received
    27988323 bytes sent
    10333496 data blocks received
      393414 data blocks sent
   936096342 multicast bytes received
    10057314 multicast blocks received
     1287953 multicast bytes sent
        9451 multicast blocks sent
           0 blocks sent, initially deferred
           0 blocks sent, single collision
           0 blocks sent, multiple collisions
           1 send failures, reasons include:
           0 collision detect check failure
         710 receive failures, reasons include:
                Block check error
                Framing Error
           0 unrecognized frame destination
           0 data overruns
           0 system buffer unavailable
           0 user buffer unavailable

Scott
8678.5. "related question" by LEXSS1::GINGER (Ron Ginger) Wed Feb 05 1997 15:58 (3 lines)
    In the previous note the 'blocks received' and the 'multicast blocks'
    are nearly the same value. Is it normal to see such a high multicast?
    What causes these?
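
The ratio asked about in .5 can be worked out from the counters posted in .4.
The sketch below parses an excerpt of that output; it assumes that "data
blocks received" already includes the multicast blocks, which the output does
not state explicitly.  The helper name parse_counters is made up for this
example.

```python
import re

# Excerpt of the "netstat -is" output posted in .4.
NETSTAT_IS = """\
   966491707 bytes received
    27988323 bytes sent
    10333496 data blocks received
      393414 data blocks sent
   936096342 multicast bytes received
    10057314 multicast blocks received
         710 receive failures, reasons include:
"""

def parse_counters(text):
    """Map each counter label to its integer value."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters

c = parse_counters(NETSTAT_IS)
received = c["data blocks received"]
multicast = c["multicast blocks received"]
failures = c["receive failures, reasons include:"]

# Assumption: the received total includes the multicast blocks.
multicast_share = multicast / received
unicast_blocks = received - multicast
failure_rate = failures / received

print(f"multicast share of received blocks: {multicast_share:.1%}")
print(f"non-multicast blocks received:      {unicast_blocks}")
print(f"receive failures per block:         {failure_rate:.2e}")
```

On these numbers nearly all received blocks are multicast, while the receive
failures are a very small fraction of the total.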
8678.6. "" by IOSG::MARSHALL () Thu Feb 06 1997 05:16 (28 lines)
re .5

Don't know if this answers the question, but: our local network topology is UTP,
with all workstation nodes (inc. my AS255) on individual UTP wires from a bank
of repeaters that feed into the 'main' network in the machine room.

Most workstation nodes have a very small /etc/hosts (or PC equiv; most of the
nodes are W95 or NT PCs) file, containing just the names/addresses of the two
name servers in our area.

Would the lack of 'complete' address databases cause an increase in multicasts?
Do PCs use multicasts significantly more than Unix (I am led to believe that
NetBEUI uses broadcasts a lot), or specifically more than Unix is tuned to
handle?

I did a test: from my AS255, I pinged another Unix node in the machine room,
while simultaneously pinging my machine from that node, letting it run for about
a minute.

Other node pinging my node gives:  round-trip (ms)  min/avg/max = 0/0/0 ms
My node pinging other node gives:  round-trip (ms)  min/avg/max = 0/1/7 ms

Not particularly conclusive, but repeating the test always gives a longer
round-trip time when starting at my node.

Anything else I can try?

Scott
8678.7. "" by SMURF::MENNER (it's just a box of Pax..) Thu Feb 06 1997 08:10 (5 lines)
    As far as I know UNIX has no problem dealing with multicasts.  The
    710 receive errors (block check errors; framing errors) point to a
    hardware problem with either your network adapter or with a network
    adapter somewhere else on the net.  You really need a sniffer to
    find out more.
8678.8. "How can I force FULL-DUPLEX mode?" by IOSG::MARSHALL () Thu Mar 27 1997 11:55 (30 lines)
This problem hasn't gone away, but I've been trying to get hold of a 'sniffer'
as per .7 to analyse this some more.  But unfortunately it seems we have only
one network protocol analyser, and I'm a very low priority, so no joy yet.

My latest thought concerns full-duplex vs simplex.  I have a UTP port, and at
the console I see:
    ewa0_mode = Full Duplex, Twisted Pair

As the system boots, I see the message:
    tu0: console mode: selecting 10BaseT (UTP) port: full duplex

But after booting, ifconfig -a gives:
    tu0: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST,SIMPLEX>

The wiring and routers I'm connected to support full duplex, and I've been
recommended to try that as a remedy for this problem.  But given the console and
boot-time settings, why does ifconfig claim SIMPLEX, or is it lying?  If it
really is simplex, how can I change it to full duplex?

>> a hardware problem with either your network adapter or with a network
>> adapter somewhere else on the net

The network adapter on my machine is on the motherboard (AS255), which has been
swapped since this problem began (for another reason), without making any
difference, so I think my end of the hardware is 'clean'.

The other end of the wire has been moved from port to port on one router, and to
different routers, without making any difference.  No-one else on the same
routers has the problem, so I don't think it lies there either.

8678.9. "" by netrix.lkg.dec.com::thomas (The Code Warrior) Thu Mar 27 1997 12:28 (3 lines)
SIMPLEX means the device does not listen to its own transmissions and has
absolutely nothing to do with full-duplex.
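
To make the point concrete, the flags word from the ifconfig line in .8 can be
decoded bit by bit.  The bit values below are inferred from that one line
(flags=c63 alongside six flag names) and are assumptions; the authoritative
definitions are in <net/if.h> on the system in question.

```python
# Bit values inferred from "flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,
# MULTICAST,SIMPLEX>" in .8.  Assumptions: check <net/if.h> before relying
# on them.
IFF_BITS = {
    0x001: "UP",
    0x002: "BROADCAST",
    0x020: "NOTRAILERS",
    0x040: "RUNNING",
    0x400: "MULTICAST",
    0x800: "SIMPLEX",
}

def decode_flags(value):
    """Return (names of the known bits set, any leftover unknown bits)."""
    names = [name for bit, name in sorted(IFF_BITS.items()) if value & bit]
    unknown = value & ~sum(IFF_BITS)
    return names, unknown

names, unknown = decode_flags(0xC63)
print(",".join(names))
print("unaccounted bits:", hex(unknown))
```

Every bit in c63 is accounted for by the six names ifconfig printed, and none
of them describes the link's duplex setting.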

8678.10. "" by IOSG::MARSHALL () Tue Apr 01 1997 11:43 (5 lines)
Ahhh... a confusing re-use of terminology.  I take it this means the duplex
thing is a red herring and I should continue looking elsewhere.

Ta,
Scott
8678.11. "It's something in the AS255, either h/w or UNIX..." by IOSG::MARSHALL () Wed Apr 16 1997 12:17 (40 lines)
The problem persists, and a protocol analyser on the net hasn't uncovered
anything much; here is the current situation:

There is nothing wrong with the rest of the network.

It isn't a complete network "hang"; I can have two windows with remote sessions
(to the same remote machine) and one can hang while the other is fine; then a
few seconds later the situation could reverse.  It isn't local loading either;
all local apps work fine, both during and in between the network glitches, and
there's very little happening on the system.

The problem is worse when network traffic increases, but the network is nowhere
near saturated, and the problem persists even when the network is otherwise
quiescent.

Also, I've installed DECnet, and the same symptoms occur over 'dlogin' sessions
as well as 'rlogin' ones.

We notice similar symptoms on all AS255s, but not on other machines (this isn't
conclusive yet, but is a definite trend).  All the affected AS255s run UNIX
V4.0x, whereas most other machines are on UNIX V3.x or non-UNIX operating
systems.

So, assuming it's not hardware-related, could there be something in UNIX V4.0x
causing this problem (I've upgraded from 4.0A to 4.0B with no change)?  Maybe
something in the device driver that doesn't interact properly with the version
of the network chip in these machines?

I'm thinking along the lines of the sound problem in AS255s where the audio
codec manufacturer changed their spec such that the UNIX device driver no longer
did the right thing with the chip.  Could something similar be true for the
network stuff?

Any suggestions on how we can track this down?

Oh, and as an aside, why would netstat start saying "no namelist" and not give
any output?

Thanks,
Scott
8678.12. "Just tests" by KEIKI::WHITE (MIN(2¢,FWIW)) Thu Apr 17 1997 03:56 (11 lines)
    
    	Simplex is the red herring, not Full Duplex.
    
    	try setting ewa0_mode to twisted      
    
    	>>>set ewa0_mode      Return will list the syntax for different settings
    	
    	Also how long a run of twisted pair are you using? If over 60
    Meters try a test and run a short cable to the repeater.
    
    						Bill
8678.13. "Another possibility..." by ADISSW::TENHAVE () Thu Apr 17 1997 09:02 (16 lines)
    
    After you work out any console settings....
    
    Do you have a PCI card in the bottom-most slot (closest to motherboard)?
    
    This PCI slot shares an interrupt with the eisa/isa bus.  This shared
    interrupt has an effect on your embedded network adapter.  If you do
    have a PCI card and an available slot above this lower slot, move this
    card out of the lower slot.  This will most likely cause a kernel
    rebuild depending on your kernel config file (to pick up your new 
    hardware configuration - moved PCI boards).  If you don't have an
    available slot, shuffle your PCI cards around.  Try the graphics card
    in the lower slot...
                                             
    				It is worth a try,  Tim
    
8678.14. "Progress?" by IOSG::MARSHALL () Fri Apr 18 1997 13:56 (18 lines)
re .13: No, there's nothing in the bottom slot; the only card is the graphics
one.  But I'll bear this in mind if I get any more cards.

re .12: At the console, I changed ewa0_mode from "Full Duplex, Twisted Pair", to
just "Twisted Pair".  When UNIX boots, it now claims that tu0 is half-duplex
instead of full-duplex.

The effect of this seems to be that the several-second "hangs" no longer occur. 
The response is still lumpy, but doesn't seem quite so bad.  This isn't
conclusive yet, but the tests so far suggest an improvement.

To verify whether this is the case, can someone please explain what difference
the half/full duplex setting actually makes?  Yes, I know what the words mean in
terms of comms technology, but what is the practical upshot in this case, and
why would it make a difference?

Many thanks for the suggestions so far, I'm glad I'm finally getting somewhere!
Scott
8678.15. "Start" by KEIKI::WHITE (MIN(2¢,FWIW)) Fri Apr 18 1997 22:28 (17 lines)
    
    	If you are attached to a repeater then full duplex should never be
    used. Only Switches and bridges and other computers are capable of
    full duplex and not all of those either. 
    
    	Question: why are we shipping these workstations configured for
    Full Duplex?
    
    	In very rough terms when the card is configured for full duplex
    we assume we can transmit whenever we want and do not monitor for
    collisions. Our transmit packets can easily cause late collisions and/or
    CRC errors, and the retransmissions would take place very slowly since
    upper layers of each protocol would have to time out first.
    
    					Bill
    
    PS - Are you using over 30 meters of 10BaseT cable? 
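
The "upper layers would have to time out first" point in .15 can be
illustrated with a little arithmetic.  The sketch below assumes a
retransmission timer that starts at one second and doubles after each loss;
both numbers are illustrative assumptions, not measured values from the
TCP or DECnet implementations involved.

```python
def stall_seconds(lost_in_a_row, first_timeout=1.0, backoff=2.0):
    """Total wait before the retransmission that finally gets through.

    Each consecutive loss costs one timeout, and every timeout is `backoff`
    times longer than the previous one.  The default values are assumptions
    chosen for illustration only.
    """
    total = 0.0
    timeout = first_timeout
    for _ in range(lost_in_a_row):
        total += timeout
        timeout *= backoff
    return total

for losses in range(1, 5):
    print(losses, "lost in a row ->", stall_seconds(losses), "s stall")
```

Under these assumptions three consecutive losses of the same small packet
already stall a session for 7 seconds, the same order as the delays reported
in .0, while a bulk transfer with many packets in flight would tend to hide
an occasional loss.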
8678.16. "Makes sense..." by IOSG::MARSHALL () Mon Apr 21 1997 05:55 (12 lines)
Bill,

Thanks for the very informative explanation.  I think I understand what's going
on now, and your description would account for the 'hangs', and also for why the
hangs are more frequent when the network is busier.

As for why my machine was configured for full-duplex: I don't think it was
shipped that way, I think it was one of the things I changed at some point while
investigating this problem (having been told by our networks guy that that's how
it should be set!).

Scott
8678.17. "" by IOSG::MARSHALL () Mon Apr 21 1997 06:06 (10 lines)
Oh, just seen your PS: I don't know the exact length, as the cable goes through
ducts and the ceiling space to get to the repeater, but assuming it follows a
"sensible" path, there would be about 90 to 100 feet, so yes, it is knocking
30m.  Would this length cause significant degradation of the signal?

Perhaps more significant than the length, the cable is in conduits with
electricity supply leads, etc; I don't know how resilient they are to
interference.

Scott
8678.18. "" by KEIKI::WHITE (MIN(2¢,FWIW)) Mon Apr 21 1997 21:25 (9 lines)
    
    	Well 10BaseT should never be run in close proximity to and parallel
    with anything inherently noisy. However 30 Meters should be short
    enough that most signal degrading problems should be eliminated as
    a cause for your problems.
    
    	Have the errors you saw earlier gone away?
    
    						Bill 
8678.19. "10BaseT specs" by QUARRY::reeves (Jon Reeves, UNIX compiler group) Tue Apr 22 1997 11:41 (5 lines)
My notes on the 10BaseT specs say 90m is the limit of the run, so that's
probably not your problem, but they also specify a one foot separation from
parallel power conduits, so that may well be your problem.  I'd beat up your
wiring contractor, who is obviously incompetent if they ran 10BaseT wiring
in the same conduit as power.
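
For reference, the cable length estimated in .17 (90 to 100 feet) converts to
metres as follows.  The 90 m limit is simply the figure quoted in .19, taken
as given here rather than checked against the standard.

```python
METRES_PER_FOOT = 0.3048   # exact by definition of the international foot
LIMIT_M = 90               # run limit as quoted in .19 (not verified here)

def feet_to_metres(feet):
    """Convert a length in feet to metres."""
    return feet * METRES_PER_FOOT

for feet in (90, 100):
    metres = feet_to_metres(feet)
    verdict = "within" if metres <= LIMIT_M else "over"
    print(f"{feet} ft = {metres:.1f} m ({verdict} the {LIMIT_M} m limit)")
```

Either end of the estimate is about a third of the quoted limit, so length
alone looks unlikely to be the cause.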
8678.20. "" by IOSG::MARSHALL () Thu Apr 24 1997 07:17 (15 lines)
re .18

I am happy to report (via netstat) no send or receive failures, and no error
packets, compared with the rather large number I used to get.

Things are still lumpy, but it's not as bad as it was.

re .19

Unfortunately the wiring guys are at the mercy of the conduit system available
in our (DEC standard issue) partition walling.  Also, I'd rather not beat them
up as they're also our system managers, and as we all know, it pays to be nice
to your system manager :-)

Scott
8678.21. "Everyone feels any lumps on a shared segment" by KEIKI::WHITE (MIN(2¢,FWIW)) Thu Apr 24 1997 20:33 (7 lines)
    
    	You might be at the mercy of someone else's lumps.
    
    	Check the other systems on your network for incorrect duplex
    settings.
    
    						Bill
8678.22. "Unfortunately I have to deal with inferior operating systems ;-)" by IOSG::MARSHALL () Fri Apr 25 1997 06:52 (7 lines)
Bill,

Yes, that's what we're doing.  Trouble is, a lot of the systems are running
Windows 95/NT, and on such systems it seems hard to find out what mode they're
running in, let alone how to change it if it's wrong!

Scott
8678.23. "" by KITCHE::schott (Eric R. Schott USG Product Management) Fri Apr 25 1997 09:50 (5 lines)
Have you run tcpdump to see what is going on?

Have you run sys_check?