| 15:01:38.349032 elbe.236b740d > mosel.nfs-v2: 1472 reply read
15:01:38.355868 elbe.246b740d > mosel.nfs-v2: 1472 reply read
15:01:38.356845 mosel.256b740d > elbe.nfs-v2: 128 call read
15:01:38.362704 mosel.266b740d > elbe.nfs-v2: 128 call read
15:01:38.364657 elbe.266b740d > mosel.nfs-v2: 1472 reply read
15:01:39.230868 mosel.256b740d > elbe.nfs-v2: 128 call read
15:01:39.233798 elbe.256b740d > mosel.nfs-v2: 1472 reply read
Thank you for the tcpdump traces. Now if the news->usenet gateway
would get fixed and stay fixed I'd see these postings sooner...
The key here is to look at the transaction IDs, or XIDs. XIDs are
sequence numbers attached to RPC messages that allow replies to be
matched to requests. Usually the XID increments with each message.
On little-endian machines they get printed backwards, so the leftmost
byte as printed is the one that increments.
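
To make the byte order point concrete, here is a small sketch in Python. It assumes the printed XID is simply the four bytes of a little-endian counter shown in wire order; the helper name xid_value is made up for illustration.

```python
def xid_value(printed: str) -> int:
    """Treat the printed hex string as 4 little-endian bytes and return the integer."""
    return int.from_bytes(bytes.fromhex(printed), byteorder="little")


# XIDs as tcpdump printed them in the trace above.
printed_xids = ["236b740d", "246b740d", "256b740d", "266b740d"]

values = [xid_value(x) for x in printed_xids]
for printed, value in zip(printed_xids, values):
    print(f"{printed} -> 0x{value:08x}")

# Difference between consecutive XIDs after undoing the byte order.
steps = [b - a for a, b in zip(values, values[1:])]
print("steps:", steps)
```

Under that assumption the four XIDs come out as consecutive integers, which is what you would expect from a counter that increments once per request.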
Look at this one:
15:01:38.356845 mosel.256b740d > elbe.nfs-v2: 128 call read
                      ^^^XID^^
I assume that tcpdump was running on the client, so this says that
we sent the request. We see no reply, so later on we retransmit:
15:01:39.230868 mosel.256b740d > elbe.nfs-v2: 128 call read
15:01:39.233798 elbe.256b740d > mosel.nfs-v2: 1472 reply read
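
If you collect a longer trace, spotting retransmits by eye gets tedious. The following is a rough sketch of how it could be scripted in Python. It assumes the tcpdump lines look exactly like the ones quoted here; the regular expression and the function name find_retransmits are my own and would need adjusting for other tcpdump output formats.

```python
import re
from collections import defaultdict

# Lines copied from the trace in this note.
TRACE = """\
15:01:38.356845 mosel.256b740d > elbe.nfs-v2: 128 call read
15:01:38.362704 mosel.266b740d > elbe.nfs-v2: 128 call read
15:01:38.364657 elbe.266b740d > mosel.nfs-v2: 1472 reply read
15:01:39.230868 mosel.256b740d > elbe.nfs-v2: 128 call read
15:01:39.233798 elbe.256b740d > mosel.nfs-v2: 1472 reply read
"""

# timestamp, source host with XID, destination, length, call or reply
LINE = re.compile(
    r"^(\d+):(\d+):(\d+\.\d+) \S+?\.([0-9a-f]{8}) > \S+ \d+ (call|reply) "
)


def seconds(h: str, m: str, s: str) -> float:
    """Convert hh, mm, ss.ffffff strings to seconds since midnight."""
    return int(h) * 3600 + int(m) * 60 + float(s)


def find_retransmits(trace: str) -> dict:
    """Map each XID whose call appears more than once to the delay (in
    seconds) between its first and last transmission."""
    calls = defaultdict(list)
    for line in trace.splitlines():
        m = LINE.match(line)
        if m and m.group(5) == "call":
            calls[m.group(4)].append(seconds(m.group(1), m.group(2), m.group(3)))
    return {
        xid: round(times[-1] - times[0], 6)
        for xid, times in calls.items()
        if len(times) > 1
    }


print(find_retransmits(TRACE))
```

For the lines above this reports only XID 256b740d, with a delay of a little under 0.9s between the original call and the retransmit.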
The next step is to determine:
1) Did the Sun get the request?
2) Did it reply?
3) Did traffic get corrupted?
Since things ran fine when Dunix was the server, you can probably rule
out corruption, though it would be worth trying writes to see what
happens when the Sun sends long messages. You mention a busy network;
NFS over UDP is extremely sensitive to lost messages. On a busy network
the usual cause is "excessive collisions", which we report via netstat -i -s.
You don't have enough data to determine if the request or reply is getting
lost - I generally have tcpdump record all traffic between the two nodes.
If the first fragment of a reply is missing, but other fragments show
up, then that's proof the server got the request.
I'd look at more tcpdump data, and at nfsstat/netstat/ifconfig data on
client and server. (One way to tell if the Sun got the requests is to
do nfsstat on client and server, read the file, do nfsstat again and see
if the Sun reports as many new READs as Dunix does.)
I'd also look at FTP tcpdump traces, looking for retransmits (also reported
in nfsstat). TCP has several features that make it much less sensitive to
packet loss, so you may be losing packets and not notice.
At any rate, the key question is "Why aren't all my reads being answered?"
[Posted by WWW Notes gateway]
|
| Think of SunOS 4.x kernels as the Sun analog to Ultrix. Never talked V3,
never will. While "upgrading" to Solaris is sometimes possible, it is often
not desirable, especially if you dislike SysV.
[Posted by WWW Notes gateway]
|
| We went on analyzing the problem and found the following pattern:
11:38:57.438876 mosel.67e14976 > elbe.nfs-v2: 132 call read fh
11:38:57.438876 mosel.68e14976 > elbe.nfs-v2: 132 call read fh
11:38:57.441806 elbe.67e14976 > mosel.nfs-v2: 1472 reply read
...
11:38:57.447665 mosel.69e14976 > elbe.nfs-v2: 132 call read
11:38:57.448642 elbe.68e14976 > mosel.nfs-v2: 1472 reply read
...
11:38:57.456454 mosel.6ae14976 > elbe.nfs-v2: 132 call read
11:38:57.459384 elbe.6ae14976 > mosel.nfs-v2: 1472 reply read
...
11:38:58.321689 mosel.69e14976 > elbe.nfs-v2: 132 call read
11:38:58.324618 elbe.69e14976 > mosel.nfs-v2: 1472 reply read
i.e. elbe (Digital UNIX) issues two read requests in a row (very fast,
look at the timestamps). Mosel (SunOS) answers the first (67e14...), elbe
issues the third, mosel answers the second (68e...), elbe issues the
fourth, and mosel answers the fourth (6ae...) and forgets(!) the third
(69e...).
The performance problem comes from the fact that DU waits ~0.9s
before it resends the third request: from 11:38:57.447665 mosel.69e14976
until the resend at 11:38:58.321689 mosel.69e14976. There is no activity
in between once the reply to the fourth request has been received. This
works out to 3 or 4 reads of 8KB per second, i.e. ~30KB/s, exactly what we
measured.
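
The arithmetic can be checked against the timestamps in the trace with a few lines of Python. This is only a sketch: the 8KB read size is taken from the text above, and counting one stall cycle as 3 or 4 completed reads is the same rough estimate made there.

```python
def to_seconds(stamp: str) -> float:
    """Convert an hh:mm:ss.ffffff tcpdump timestamp to seconds since midnight."""
    h, m, s = stamp.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


READ_SIZE_KB = 8  # assumed NFS read size, as stated in the text

first_call = to_seconds("11:38:57.438876")    # call 67e14976
stalled_call = to_seconds("11:38:57.447665")  # call 69e14976, never answered
resend = to_seconds("11:38:58.321689")        # retransmit of 69e14976
last_reply = to_seconds("11:38:58.324618")    # reply to the retransmit

stall = resend - stalled_call    # time spent waiting before the resend
cycle = last_reply - first_call  # one full burst including the stall


def rate_kb_per_s(reads: int) -> float:
    """Throughput if `reads` reads of READ_SIZE_KB complete in one cycle."""
    return reads * READ_SIZE_KB / cycle


print(f"stall before resend: {stall:.3f}s")
print(f"cycle length:        {cycle:.3f}s")
for reads in (3, 4):
    print(f"{reads} reads per cycle:   {rate_kb_per_s(reads):.1f} KB/s")
```

The stall comes out at about 0.87s, and 3 or 4 reads per cycle give roughly 27 and 36 KB/s, which brackets the ~30KB/s we measured.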
The question is: is this normal behaviour? Can this be changed
dynamically? Can I prevent NFS over UDP from sending requests before a
previous one has been answered?
As .2 suggested, we checked whether the requests made it into
SunOS. nfsstat on the Sun did not show the unanswered requests.
A PD version of tcpdump from the Internet did not show them either. So
they got lost, either in SunOS, in the Sun's Ethernet hardware, or on
the wire (we had to insert a repeater, as the customer only had a
yellow cable and our AlphaStation only supported BNC and twisted pair).
So the problem remains unsolved, as we had no time left to do low-level
analysis.
Regards
Hartmut
|