
Conference turris::digital_unix

Title:DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:Welcome to the Digital UNIX Conference
Moderator:SMURF::DENHAM
Created:Thu Mar 16 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:10068
Total number of notes:35879

9041.0. "NFS3 performance problem...caching ???" by PANTER::MARTIN (Be vigilant...) Wed Mar 05 1997 11:08

    A customer complains that one of their Alphas running OSF 3.2C takes
    2/3 of the network bandwidth (which in their case should correspond to
    some 400 KBytes/sec) when creating a file on an NFS-mounted file system.
    
    Both the NFS server and client are running Digital Unix v3.2C, and
    all the NFS patches have already been applied on both systems.
    
    Can somebody explain this NFS cache traffic to me, including why my
    Sniffer refers to it as obsolete?
    
    It seems to me that the two end systems have decided to do some caching
    of a very large file, thus generating a LOT of update traffic even when
    the user concerned is making no specific reference to the file.

    But what does the "Write to cache" procedure mean:
    
    	 NFS:  Proc = 7 (Write to cache)
    	 NFS:  (This procedure is obsolete)
    
    Does it have to do with Unix's buffer cache?
    And why is it referred to as obsolete?
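    For what it's worth, the "obsolete" label is consistent with the
    analyser applying the NFSv2 procedure table to NFSv3 traffic: in
    NFSv2 (RFC 1094) procedure 7 is WRITECACHE, which was never used and
    is marked obsolete, while in NFSv3 (RFC 1813) procedure 7 is plain
    WRITE.  A minimal sketch of the two tables (names from the RFCs):

```python
# NFS procedure-number tables from RFC 1094 (v2) and RFC 1813 (v3).
NFSV2_PROCS = {0: "NULL", 1: "GETATTR", 2: "SETATTR", 3: "ROOT",
               4: "LOOKUP", 5: "READLINK", 6: "READ",
               7: "WRITECACHE (obsolete)", 8: "WRITE", 9: "CREATE",
               10: "REMOVE", 11: "RENAME", 12: "LINK", 13: "SYMLINK",
               14: "MKDIR", 15: "RMDIR", 16: "READDIR", 17: "STATFS"}

NFSV3_PROCS = {0: "NULL", 1: "GETATTR", 2: "SETATTR", 3: "LOOKUP",
               4: "ACCESS", 5: "READLINK", 6: "READ", 7: "WRITE",
               8: "CREATE", 9: "MKDIR", 10: "SYMLINK", 11: "MKNOD",
               12: "REMOVE", 13: "RMDIR", 14: "RENAME", 15: "LINK",
               16: "READDIR", 17: "READDIRPLUS", 18: "FSSTAT",
               19: "FSINFO", 20: "PATHCONF", 21: "COMMIT"}

def decode_proc(vers, proc):
    """Return the procedure name for an NFS call, honoring the version."""
    table = NFSV3_PROCS if vers == 3 else NFSV2_PROCS
    return table.get(proc, "unknown")
```

    So a trace line saying "VERS=3 PROC=7" is an ordinary WRITE; a decoder
    that only knows the v2 table reports it as "Write cache (obsolete)".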
    
    Here is the customer's complaint:
    
>> We have unacceptable problems at the moment due to AXCA47 reading
>> via NFS from ASCA02. The reads are consuming 60-80% of the network
>> bandwidth in Building 892 AND they seem to be "wrong"!
>>
>> I am seeing repeated requests from AXCA47 as follows
>>
>> 12:07:00.3736  [137.138.28.137]  .axca47.cern.ch NFS C F=896C Read 0 at 25165824
>> 12:07:00.3738  [137.138.28.137]  .axca47.cern.ch NFS C F=896C Read 0 at 25165824
>> 12:07:00.3923  [137.138.28.137]  .axca47.cern.ch NFS C F=896C Read 0 at 25165824
>> 12:07:00.3930  [137.138.28.137]  .axca47.cern.ch NFS C F=896C Read 0 at 25165824
>> 
>> to which ASCA02 is each time replying with 8k bytes!
>> ASCA02 has now sent 522068000 bytes !!!!!
>> 
>> I do not know what file corresponds to F=896C : the details of the read
>> request are given by the Sniffer as
>>  
>>> NFS:  Proc = 6 (Read from file)
>>> NFS:  File handle = 000000202C3179FE0A0000000C000000
>>> NFS:                D0030000048000000C00000002000000
>>> NFS:  Offset = 25165824
>>> NFS:  Count = 0
>>> 
>>
>> (and the offset never changes!).
>> 
>> Now, I am not expert enough in NFS to know what is meant when there is
>> a zero value for the count, nor an unchanging value for the offset.
>> Maybe it is some special sort of file.
>> 
>> I can see that the data being returned is slightly different in each
>> 8k block! I can also see that my Sniffer analysis thinks that there
>> are still over 500000000 bytes to come.
>>  
>> If it really is 500000000 bytes, then that can occupy an Ethernet at 75%
>> for over 10 minutes. However, what I saw earlier was SEVERAL simultaneous
>> read requests on different files!
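    A plausible reading of the odd "Read 0" decodes: the READ argument
    layouts differ between versions.  NFSv2 expects a fixed 32-byte file
    handle followed by 32-bit offset, count, and totalcount words; NFSv3
    sends a variable-length handle (a length word plus opaque data), a
    64-bit offset, and a 32-bit count.  A decoder assuming the v2 layout
    pulls its words from the wrong positions, which readily yields a zero
    count (the high half of a 64-bit offset) and a constant "offset".
    Sketch below; the handle contents are hypothetical and real packets
    also carry the RPC header, so this only illustrates the class of
    error, not the exact byte positions in the trace:

```python
import struct

def pack_v3_read(fh: bytes, offset: int, count: int) -> bytes:
    """XDR-encode NFSv3 READ arguments (RFC 1813): variable-length
    opaque file handle, 64-bit offset, 32-bit count."""
    pad = (-len(fh)) % 4                 # XDR pads opaque data to 4 bytes
    return (struct.pack(">I", len(fh)) + fh + b"\0" * pad
            + struct.pack(">QI", offset, count))

def misdecode_as_v2(payload: bytes):
    """Decode the same bytes the NFSv2 way (RFC 1094): skip a fixed
    32-byte handle, then read 32-bit offset and count words.  Both words
    land in the middle of the v3 handle and offset fields."""
    offset, count = struct.unpack_from(">II", payload, 32)
    return offset, count

# A 32-byte handle (contents hypothetical), reading 8 KB at offset 24 MB:
wire = pack_v3_read(bytes(32), 25165824, 8192)
print(misdecode_as_v2(wire))   # -> (0, 0)
```

    With a non-zero tail in the handle, the misdecoded "offset" becomes a
    constant non-zero value, matching the unchanging offset in the trace.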
 

    
    ======================================================================
    Trace below.
    First packet in detail
    
    DLC:  Destination = Station 0A003080D329
    DLC:  Source      = Station 0000F820574A
    DLC:  Ethertype  = 0800 (IP)
    
    IP:   Version = 4, header length = 20 bytes
    IP:   Type of service = 00
    IP:         000. .... = routine
    IP:         ...0 .... = normal delay
    IP:         .... 0... = normal throughput
    IP:         .... .0.. = normal reliability
    IP:   Total length = 1500 bytes
    IP:   Identification = 62737
    IP:   Flags = 2X
    IP:   .0.. .... = may fragment
    IP:   ..1. .... = more fragments
    IP:   Fragment offset = 0 bytes
    IP:   Time to live = 30 seconds/hops
    IP:   Protocol = 17 (UDP)
    IP:   Header checksum = 9BEC (correct)
    IP:   Source address = [128.141.191.114], axca92.cern.ch
    IP:   Destination address = [137.138.28.137]
    IP:   No options
    
    UDP:  Source port = 1017 (Sun RPC)
    UDP:  Destination port = 2049
    UDP:  Length = 8344 (not all data contained in this fragment)
    UDP:  Checksum = 4EFD (Can't verify, packet incomplete)
    
    RPC:  Transaction id = 2270736662
    RPC:  Type = 0 (Call)
    RPC:  RPC version = 2
    RPC:  Program = 100003 (NFS), version = 3
    RPC:  Procedure = 7 (Write to cache)
    RPC:  Credentials: authorization flavor = 1 (Unix)
    RPC:   len = 48, stamp = 857039708
    RPC:   machine = axca92.cern.ch
    RPC:   uid = 659, gid = 1227
    RPC:   3 other group id(s):
    RPC:    gid 33536
    RPC:    gid 32533
    RPC:    gid 1227
    RPC:  Verifier: authorization flavor = 0 (Null)
    RPC:  [Verifier: 0 byte(s) of authorization data]
    
    NFS:  Proc = 7 (Write to cache)
    NFS:  (This procedure is obsolete)
    NFS:  *** 1384 byte(s) of additional data present ***
    
    
    
      1574    0.8286  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=1514 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=1480 ID=62737
            UDP D=2049 S=1017 LEN=8344
            RPC C XID=2270736662 PROG=NFS VERS=3 PROC=7
            NFS C Write cache (obsolete)
      1575    0.0124  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=1514 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=1480 ID=62737
            UDP continuation ID=62737
      1576    0.0026  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=1514 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=1480 ID=62737
            UDP continuation ID=62737
      1577    0.2047  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=1514 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=1480 ID=62737
            UDP continuation ID=62737
      1578    0.0010  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=978 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=944 ID=62737
            UDP continuation ID=62737
      1579    2.3894  [137.138.28.137]   axca92.cern.ch      NFS
    retransmission
            DLC Ethertype=0800, size=1514 bytes
            IP  D=[137.138.28.137] S=[128.141.191.114] LEN=1480 ID=62738
            UDP D=2049 S=1017 LEN=8344
            RPC C XID=2270736662 PROG=NFS VERS=3 PROC=7
            NFS C Write cache (obsolete)
      1580    0.0017  [137.138.28.137]   axca92.cern.ch      DLC
    Ethertype=0800, size=1514 bytes
    .
    .
    .
    ======================================================================
9041.1NFS read with zero count ?PANTER::MARTINBe vigilant...Thu Mar 06 1997 05:4916
    Other question arising from this problem :
    
    Why does a read which specifies a particular offset and a count of zero
    keep getting different chunks (8K?) of data?
    I had understood that a read with a count of zero should simply get
    an OK response, as if it were a keep-alive.
    
    Thanks in advance for your explanations...
    
    
    Cheers,
    				============================
    				Alain MARTIN/SSG Switzerland
    
    
    
9041.2nfsstat and netstat statistics...LEMAN::MARTIN_ABe vigilant...Thu Mar 06 1997 10:02118
    Here is the nfsstat output :
    
    ************************************************************************
    # nfsstat
    
    Server rpc:
    calls      badcalls   nullrecv   badlen     xdrcall
    0          0          0          0          0
    
    Server nfs:
    calls      badcalls
    0          0
    
    Server nfs V2:
    null       getattr    setattr    root       lookup     readlink   read
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    wrcache    write      create     remove     rename     link       symlink
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    mkdir      rmdir      readdir    statfs
    0  0%      0  0%      0  0%      0  0%
    
    Server nfs V3:
    null       getattr    setattr    lookup     access     readlink   read
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    write      create     mkdir      symlink    mknod      remove     rmdir
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    commit
    0  0%
    
    Client rpc:
    calls      badcalls   retrans    badxid     timeout    wait       newcred    badverfs   timers
    2880159    7          6636       8          6643       0          0          0          19980
    
    Client nfs:
    calls      badcalls   nclget     nclsleep
    2880117    2          2880151    0
    
    Client nfs V2:
    null       getattr    setattr    root       lookup     readlink   read
    0  0%      2235  0%   0  0%      0  0%      400  0%    134  0%    0  0%
    wrcache    write      create     remove     rename     link       symlink
    0  0%      0  0%      0  0%      0  0%      0  0%      0  0%      0  0%
    mkdir      rmdir      readdir    statfs
    0  0%      0  0%      17  0%     67  0%
    
    Client nfs V3:
    null       getattr    setattr    lookup     access     readlink   read
    0  0%      3128  0%   22  0%     3613  0%   5022  0%   96  0%     2328091 80%
    write      create     mkdir      symlink    mknod      remove     rmdir
    525130 18% 46  0%     0  0%      0  0%      0  0%      32  0%     2  0%
    rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
    15  0%     0  0%      182  0%    3749  0%   142  0%    65  0%     0  0%
    commit
    7929  0%
    ************************************************************************
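    From the client RPC counters above, the retransmission and timeout
    rates can be computed directly.  A timeout count far above badxid
    usually points at packets being lost in the network rather than a
    slow server (badxid counts replies that arrive after the client has
    already retransmitted):

```python
# Client RPC counters from the nfsstat output above.
calls, badcalls, retrans, badxid, timeouts = 2880159, 7, 6636, 8, 6643

retrans_rate = 100.0 * retrans / calls
timeout_rate = 100.0 * timeouts / calls
print(f"retrans: {retrans_rate:.2f}%  timeouts: {timeout_rate:.2f}%")
# prints: retrans: 0.23%  timeouts: 0.23%
# timeouts (6643) vastly exceed badxid (8): replies are not arriving
# late, the requests or replies are simply being dropped.
```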
    
    # netstat -p udp
    udp:
            2908416 packets sent
            2953433 packets received
            0 incomplete headers
            0 bad data length fields
            692 bad checksums
            0 full sockets
            41565 for no port (41496 broadcasts, 0 multicasts)
    # netstat -p ip
    ip:
            14484817 total packets received
            0 bad header checksums
            0 with size smaller than minimum
            0 with data size < data length
            0 with header length < data size
            0 with data length < header length
            13817112 fragments received
            0 fragments dropped (dup or out of space)
            205 fragments dropped after timeout
            0 packets forwarded
            87 packets not forwardable
            0 redirects sent
    
    # netstat -I tu0 -s
    
    tu0 Ethernet counters at Thu Mar  6 15:59:58 1997
    
           65535 seconds since last zeroed
      2486930378 bytes received
      4294967257 bytes sent
        15227760 data blocks received
         5538214 data blocks sent
       102512090 multicast bytes received
          807449 multicast blocks received
               0 multicast bytes sent
               0 multicast blocks sent
           77530 blocks sent, initially deferred
         1305417 blocks sent, single collision
         2203918 blocks sent, multiple collisions
            5637 send failures, reasons include:
                    Excessive collisions
               0 collision detect check failure
              18 receive failures, reasons include:
                    Block check error
                    Framing Error
               0 unrecognized frame destination
               0 data overruns
               0 system buffer unavailable
               0 user buffer unavailable
    #
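    A quick rate calculation on the tu0 counters above.  (Note that
    "65535 seconds since last zeroed" and ~4294967257 bytes sent look
    like saturated or wrapped counters, so the absolute totals are
    suspect.)  Since an 8 KB NFS-over-UDP datagram spans about six
    Ethernet-sized IP fragments and losing any one fragment loses the
    whole datagram, even a small per-frame drop rate multiplies:

```python
# tu0 counters from the netstat -I output above.
blocks_sent = 5538214
single, multiple = 1305417, 2203918
excessive = 5637                  # send failures: excessive collisions

collided = 100.0 * (single + multiple) / blocks_sent
dropped = 100.0 * excessive / blocks_sent
frags_per_8k = 6                  # ~8 KB datagram / ~1480-byte fragments
datagram_loss = 100.0 * (1.0 - (1.0 - dropped / 100.0) ** frags_per_8k)
print(f"collided sends: {collided:.0f}%")
print(f"frames dropped: {dropped:.2f}%")
print(f"approx 8 KB datagram loss: {datagram_loss:.1f}%")
```

    So a ~0.1% per-frame drop rate becomes roughly 0.6% loss per 8 KB NFS
    transfer, which is the kind of figure that makes UDP-based NFS crawl.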
9041.3Anybody online ???PANTER::MARTINBe vigilant...Fri Mar 07 1997 09:183
    Anybody online?  Should I IPMT this to get an answer?
    
    Alain
9041.4An IPMT would ensure you get some kind of attentionUNIFIX::HARRISJuggling has its ups and downsSun Mar 09 1997 22:3411
    If you have a customer problem, the only reliable way to get official
    support is via an official support channel.  An IPMT is one such
    support channel.
    
    As a side note, the most experienced NFS engineer around here happens
    to be away at a 2 week long connect-a-thon and as such has been unable
    to read and answer any notes.  Whether he has any answers or any time
    to catch up on notes after he gets back is an entirely different
    question.
    
    					Bob Harris
9041.5yes, but could be applic. problem...LEMAN::MARTIN_ABe vigilant...Mon Mar 10 1997 04:1645
    Thanks Bob for your answer.
    
    I understand that for official escalation, IPMT is the way to
    proceed. However I strongly suspect an application problem.
    
    Actually I only expect unofficial answers and experience sharing...
    which can help me narrow down where the problem is.
    I do not want to overload engineering with application problems
    but want to make sure there isn't something wrong with NFS3!
    
    However I can say that the original statements reported in .0 and .1 
    already evolved.
    
    The question :
    
    >> Why a read which specifies a particular offset and a count of zero
    >> keeps getting different chunks (8K?) of data.
    >> I had understood that a read with a count of zero should simply get
    >> an OK response, like it was a keep-alive.
    
    was due to the network analyser, which wasn't able to properly
    decode NFS3 messages!
    
    After a SW upgrade of the net analyser, it decodes the "count 0,
    offset n" reads differently: they become something like
    
    >> NFS C READ FH=21CC at 22798336 for 8192
    
    and the writes turn into
    
    >> NFS C WRITE FH=21CC at 60997632 for 8192 (UNSTABLE)
    
    so it does seem as if it is reading and writing the same file, but
    in different places (treating it as a random access file!) all the
    time.
    
    However I don't know why it says "UNSTABLE"!  Any explanation ???
    
    Thanks for your co-operation,
    
    					============================
    					Alain MARTIN/SSG Switzerland
    
    
    
9041.6See note 2328.0, see http://www.zk3.dec.com/~werme/support.htmlNETRIX::"[email protected]"Ric WermeWed Mar 12 1997 09:51103
PANTER::MARTIN (Be vigilant...) writes in .0 or whatever:
    A customer complains that one of their Alphas running OSF 3.2C takes
    2/3 of the network bandwidth (which in their case should correspond to
    some 400 KBytes/sec) when creating a file on an NFS mounted file system.

What sort of network has 600-700 KB/sec capacity?  Before analyzing any
performance problem, it is vital to know what the underlying fabric is.

    Both the NFS server and client are running Digital Unix v3.2C, and
    all the NFS patches have already been applied on both systems.
    
  Can somebody explain this NFS cache traffic to me, including why my
  Sniffer refers to it as obsolete?

Your traces, as you have figured out, are from obsolete Sniffer code.  You
have not posted new traces.  Sniffers are useful in chasing low-level
performance problems, especially for looking at CRC errors and the like.  If
the network is running well (tested via spray, ttcp, etc.), then dealing with
Sniffer traces is a pain in the neck because they aren't as malleable as
tcpdump traces.  See note 2328.0.  (But note that tcpdump in V3.x doesn't
decode NFS V3 any better than your old Sniffer code.)

I don't understand what problem you are addressing.  In .0 you talk about
data caching and updates, but that may be misled by the Sniffer "write cache"
messages.  In .1 you're confused by old Sniffer code and read messages.  .2
shows a very happy nfsstat (but server nfsstat data discloses very little
about NFS performance problems - I run nfsstat on the client and look at the
number of timeouts).  netstat suggests that you have an Ethernet, which has
considerably higher throughput than 600 KB/sec.  The collision rate suggests
that it's very busy.  The excessive collision rate is pushing 1%, a loss rate
that devastates NFS performance.  .5 asks about unstable writes.  Those are a
key addition to NFS V3 and are what allows us to get 6 MB/sec NFS write rates
to disk on Sandpipers.  I wish I had time to measure performance on current
hardware.  With the sort of loss rate you post, NFS over TCP may help, but
that requires Digital UNIX V4.0.
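To expand on unstable writes: an NFSv3 WRITE carries a stability flag
(UNSTABLE, DATA_SYNC, or FILE_SYNC).  With UNSTABLE the server may reply
before the data reaches disk; the client later sends COMMIT and compares
the write verifier returned by both calls.  If the server rebooted in
between, the verifier differs and the client must resend the uncommitted
writes.  A toy model of that protocol (class names made up, not any real
API):

```python
import os

class Nfs3Server:
    """Toy model of NFSv3 unstable writes and COMMIT (RFC 1813)."""
    def __init__(self):
        self.verifier = os.urandom(8)   # changes on every server restart
        self.unstable = {}              # offset -> data, not yet on "disk"
        self.disk = {}

    def write(self, offset, data, stable=False):
        if stable:
            self.disk[offset] = data      # FILE_SYNC: on disk before reply
        else:
            self.unstable[offset] = data  # UNSTABLE: lost if server crashes
        return self.verifier

    def commit(self):
        self.disk.update(self.unstable)
        self.unstable.clear()
        return self.verifier

class Nfs3Client:
    def __init__(self, server):
        self.server, self.pending = server, {}

    def write(self, offset, data):
        self.pending[offset] = data
        self.verf = self.server.write(offset, data, stable=False)

    def commit(self):
        # A changed verifier means the server restarted and may have lost
        # the uncommitted data: resend every pending write synchronously.
        if self.server.commit() != self.verf:
            for off, data in self.pending.items():
                self.server.write(off, data, stable=True)
        self.pending.clear()
```

This is why UNSTABLE is fast: the server can reply from memory and batch
its disk I/O at COMMIT time, at the cost of the client keeping the data
around until the commit succeeds.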

NFS V3 is well described in a USENIX paper.  I guess we don't have a reference
on our internal Web, but it's on the zk3.dec.com production machines at
/usr/specs/filesystem/nfs/nfsv3_usenix.ps.

The USENIX reference is:
      Author: Brian Pawlowski
      Author: Chet Juszczak
      Author: Peter Staubach
      Author: Carl Smith
      Author: Diane Lebel 
      Author: Dave Hitz
       Title: NFS Version 3: Design and Implementation 
       Pages: 137-151
   Publisher: USENIX
 Proceedings: USENIX Conference Proceedings
        Date: Summer, 1994
    Location: Boston, MA
 Institution: Network Appliance
 Institution: Digital Equipment Corp
 Institution: SunSoft Inc.
 Institution: SunSoft Inc.
 Institution: Digital Equipment Corp.
 Institution: Network Appliance

Odd - USENIX doesn't appear to make that public, but I think I saw it at
netapp.  Look for it with AltaVista.

The protocol itself is RFC 1813, but a PostScript version is available at
http://www.sun.com/sunsoft/connectathon/nfsv3.ps.  I've been putting together
a page of information about analyzing NFS issues at
http://www.zk3.dec.com/~werme/support.html.  It is only the barest beginnings,
as training support is not in my job description.  To the best of my
knowledge, the customer support organization does nothing like it.  (Indeed,
is training customer support part of customer support's charter?)  If anyone
in this silly company is trying to build a WWW support hints page, I'd love
to hear about it.

For that matter, if anyone knows how to fix the NOTES-USENET gateway for
this notes file, I wish he would - this Web interface is only good for
archival access and no one seems to have ported notes to Dunix.  Mutter,
mutter.


Back to the issue at hand.  What is the customer complaining about?  Nothing
in your postings states a problem.  Even the phrase "A customer complains that
one of their Alphas running OSF 3.2C takes 2/3 of the network bandwidth..."
is not a statement of a problem, as I think it should take 100% of an idle
Ethernet.  Unless, of course, CERN is writing the results of a quantum
mechanical simulation of Uranium fission, in which case I'd expect 1% network
bandwidth to be about right.  What does the customer want to see?  How big
a file is he writing, and how many NFS writes does it take?  Do tcpdump
(or Sniffer) traces show gaps?  (Really easy to look for with tcpdump and an
xterm configured to buffer several hundred lines.)  What else is on his
network?  Are there routers between client and server?  I haven't met a
Cisco router that didn't discard more messages than I think it should, and
dropped messages do dreadful things to NFS read and write performance.

I'm still recovering from Connectathon, so I won't have time for further
help until I file expense reports, rebuild my workstation (I had to take it
to the show since I wanted something faster than a Sandpiper), catch up on
mail, review several specs, and dodge the next snowstorm.

[Posted by WWW Notes gateway]