
Conference hydra::axp-developer

Title:	Alpha Developer Support
Notice:	[email protected], 800-332-4786
Moderator:	HYDRA::SYSTEM

Created:	Mon Jun 06 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	3722
Total number of notes:	11359

3057.0. "Scandinavian Softline Technology Oy" by HYDRA::DORHAMER () Mon Jan 20 1997 17:02

T.R	Title	User	Date	Lines
3057.1	sent pointer to online docs	HYDRA::DORHAMER	`Mon Jan 20 1997 17:04`	26
3057.2	more questions	HYDRA::DORHAMER	`Wed Jan 22 1997 10:52`	38
3057.3	use SO_KEEPALIVE ?	HYDRA::DORHAMER	`Wed Jan 22 1997 13:04`	24
3057.4	SO_KEEPALIVE not sufficient	HYDRA::DORHAMER	`Thu Jan 23 1997 09:59`	33
3057.5	checking with engineering	HYDRA::DORHAMER	`Fri Jan 24 1997 10:26`	4
	I have sent Hu Rui's questions to John Dustin in the UNIX networking group and also posted them in the Digital UNIX notes file (note 8578). Karen
3057.6	response from engineering	HYDRA::DORHAMER	`Fri Jan 24 1997 12:09`	42
#1          24-JAN-1997 12:07:16.05                                  NEWMAIL
From:   HYDRA::AXPDEVELOPER "[email protected]"
To:     US3RMC::"[email protected]"
CC:     AXPDEVELOPER
Subj:   RE: Socket Read Return Value

Hu Rui,

I received the following response to your questions from one of our engineers. Please let me know if this resolves your problem.

Karen Dorhamer
Alpha Developer Support

> If I read from a non_blocking TCP socket. How can I know the remote side
> has closed connection?

You'll get back a return value of 0 from read(2) if TCP has determined that the connection has been explicitly closed by the remote side. You will never get back 0 from read(2) on a TCP socket for any other reason.

If the socket is non-blocking and the connection is still open, but there is simply no data, you'll get back a return value of -1, and errno will have been set to EWOULDBLOCK. If there was data, read(2) will return the number of octets copied to your buffer.

If the socket is blocking, you will block as long as the connection is still open and there is no data to read. Otherwise it's the same as for non-blocking (0 will be returned when the connection is closed and you've already read all data received on the socket prior to the close).

You could also get back -1 and errno EINTR if you are playing with signals in the process.

> From read's return value or signal.

If you have turned on async notification for the socket, you should also get a SIGIO when the connection is closed. And if you are select(2)'ing on the socket, the socket will be both readable and writable.
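The return-value rules in the engineer's reply can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the `read_status` enum and the `classify_read` and `read_once` helpers are hypothetical names invented for this sketch, not part of any system API. EAGAIN is checked alongside EWOULDBLOCK because some systems report one and some the other.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not a system API. */
enum read_status {
    RS_DATA,         /* n > 0: n octets were copied to the buffer */
    RS_PEER_CLOSED,  /* n == 0: remote side closed the connection */
    RS_NO_DATA,      /* n == -1, EWOULDBLOCK/EAGAIN: still open, nothing to read */
    RS_INTERRUPTED,  /* n == -1, EINTR: a signal interrupted the call */
    RS_ERROR         /* n == -1, any other errno */
};

/* Map a read(2) return value and the errno seen right after it to a status. */
enum read_status classify_read(ssize_t n, int err)
{
    if (n > 0)
        return RS_DATA;
    if (n == 0)
        return RS_PEER_CLOSED;
    if (err == EWOULDBLOCK || err == EAGAIN)
        return RS_NO_DATA;
    if (err == EINTR)
        return RS_INTERRUPTED;
    return RS_ERROR;
}

/* Perform one read on fd and classify the outcome; *nread gets the raw result. */
enum read_status read_once(int fd, char *buf, size_t len, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);
    int err = errno;            /* only meaningful when n == -1 */

    *nread = n;
    return classify_read(n, err);
}
```

A caller would typically loop on `read_once`, retrying on `RS_INTERRUPTED`, going back to select(2) on `RS_NO_DATA`, and closing its own end on `RS_PEER_CLOSED`.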
3057.7	TCP/IP FAQ	HYDRA::DORHAMER	`Fri Jan 24 1997 12:57`	978
#2          24-JAN-1997 12:48:47.41                                  NEWMAIL
From:   HYDRA::AXPDEVELOPER "[email protected]"
To:     US3RMC::"[email protected]"
CC:     AXPDEVELOPER
Subj:   more socket info

Hu Rui,

One of the engineers from our Digital UNIX engineering group sent me the attached info sheet on TCP/IP. Please see question 5 for more info.

Karen Dorhamer
Alpha Developer Support

Subj: Re: socket question

Karen,

He is probably out of luck in many of the cases he has described. However, there are a few things he can do which may help, and which are outlined in the TCP/IP FAQ. I have enclosed a copy from last March from the comp.protocols.tcp-ip Usenet group. It hasn't changed much recently, so it is as good as the latest version. Question 5 covers connections that have gone away.

John

---------

From nntpd.lkg.dec.com!pa.dec.com!decuac.dec.com!haven.umd.edu!purdue!lerc.nasa.gov!magnus.acs.ohio-state.edu!math.ohio-state.edu!howland.reston.ans.net!newsfeed.internetmci.com!ns.pilot.net!news2.pilot.net!wrs.com!wrs.com!gnn Thu Mar 7 16:57:47 1996
Article 48463 of comp.protocols.tcp-ip:
Path: nntpd.lkg.dec.com!pa.dec.com!decuac.dec.com!haven.umd.edu!purdue!lerc.nasa.gov!magnus.acs.ohio-state.edu!math.ohio-state.edu!howland.reston.ans.net!newsfeed.internetmci.com!ns.pilot.net!news2.pilot.net!wrs.com!wrs.com!gnn
>From [email protected] (George Neville-Neil)
Newsgroups: comp.protocols.tcp-ip
Subject: FAQ for March 1996
Date: 1 Mar 96 16:32:54 GMT
Organization: Wind River Systems, Inc.
Lines: 877
Message-ID: <[email protected]>
NNTP-Posting-Host: loire.wrs.com
Summary: FAQ
Keywords: FAQ

Hi Folks,

Here is the latest FAQ. Not many changes this month.

Later,
George

Archive-name: tcp-ip/FAQ
Last-modified: 1996/3/1

Internet Protocol Frequently Asked Questions

Maintained by: George V. Neville-Neil ([email protected])

Contributions from:
Ran Atkinson
Mark Bergman
Stephane Bortzmeyer
Rodney Brown
Dr. Charles E. Campbell Jr.
Phill Conrad
Alan Cox
Rick Jones
Jon Kay
Jay Kreibrich
William Manning
Barry Margolin
Jim Muchow
Subu Rama
W. Richard Stevens

Version 3.2

**********************************************************************

The following is a list of Frequently Asked Questions, and their answers, for people interested in the Internet Protocols, including TCP, UDP, ICMP and others. Please send all additions, corrections, complaints and kudos to the above address. This FAQ will be posted on or about the first of every month.

This FAQ is available for anonymous ftp from: ftp.netcom.com:/pub/gnn/tcp-ip.faq. You may get it from my home page at ftp://ftp.netcom.com/pub/gnn/gnn.html

You can read the FAQ in HTML format on Netcom or from the mirror site http://web.cnam.fr/Network/TCP-IP/tcp-ip.html

**********************************************************************

Table of Contents:

Glossary
1) Are there any good books on IP?
2) Where can I find example source code for TCP/UDP/IP?
3) Are there any public domain programs to check the performance of an IP link?
4) Where do I find RFCs?
5) How can I detect that the other end of a TCP connection has crashed? Can I use "keepalives" for this?
6) Can the keepalive timeouts be configured?
7) Can I set up a gateway to the Internet that translates IP addresses, so that I don't have to change all our internal addresses to an official network?
8) Are there object-oriented network programming tools?
9) What other FAQs are related to this one?
10) What newsgroups contain information on networks/protocols?
11) Van Jacobson explains TCP congestion avoidance.
12) Can I use a single bit subnet?

Glossary:

I felt this should be first given the plethora of acronyms used in the rest of this FAQ.

IP: Internet Protocol. The lowest layer protocol defined in TCP/IP. This is the base layer on which all other protocols mentioned herein are built. IP is often referred to as TCP/IP as well.

UDP: User Datagram Protocol.
This is a connectionless protocol layered on top of IP. It does not provide any guarantees on the ordering or delivery of messages.

TCP: Transmission Control Protocol. TCP is a connection oriented protocol that guarantees that messages are delivered in the order in which they were sent and that all messages are delivered. If a TCP connection cannot deliver a message it closes the connection and informs the entity that created it. This protocol is layered on top of IP.

ICMP: Internet Control Message Protocol. ICMP is used for diagnostics in the network. The Unix program, ping, uses ICMP messages to detect the status of other hosts in the net. ICMP messages can either be queries (in the case of ping) or error reports, such as when a network is unreachable.

RFC: Request For Comment. RFCs are documents that define the protocols used in the IP Internet. Some are only suggestions, some are even jokes, and others are published standards. Several sites in the Internet store RFCs and make them available for anonymous ftp.

SLIP: Serial Line IP. An implementation of IP for use over a serial link (modem). CSLIP is an optimized (compressed) version of SLIP that gives better throughput.

Bandwidth: The amount of data that can be pushed through a link in unit time. Usually measured in bits or bytes per second.

Latency: The amount of time that a message spends in a network going from point A to point B.

Jitter: The effect seen when latency is not a constant. That is, when messages experience different latencies between two points in a network.

RPC: Remote Procedure Call. RPC is a method of making network access to resources transparent to the application programmer by supplying a "stub" routine that is called in the same way as a regular procedure call. The stub actually performs the call across the network to another computer.
Marshalling: The process of taking arbitrary data (characters, integers, structures) and packing them up for transmission across a network.

MBONE: A virtual network that is a Multicast backBONE. It is still a research prototype, but it extends through most of the core of the Internet (including North America, Europe, and Australia). It uses IP Multicasting, which is defined in RFC-1112. An MBONE FAQ is available via anonymous ftp from: ftp.isi.edu. There are frequent broadcasts of multimedia programs (audio and low bandwidth video) over the MBONE. Though the MBONE is used for multicasting, the long haul parts of the MBONE use point-to-point connections through unicast tunnels to connect the various multicast networks worldwide.

1) Are there any good books on IP?

A) Yes. Please see the following:

Internetworking with TCP/IP Volume I
(Principles, Protocols, and Architecture)
Douglas E. Comer
Prentice Hall 1991 ISBN 0-13-468505-9

This volume covers all of the protocols, including IP, UDP, TCP, and the gateway protocols. It also includes discussions of higher level protocols such as FTP, TELNET, and NFS.

Internetworking with TCP/IP Volume II
(Design, Implementation, and Internals)
Douglas E. Comer / David L. Stevens
Prentice Hall 1991 ISBN 0-13-472242-6

Discusses the implementation of the protocols and gives numerous code examples.

Internetworking with TCP/IP Volume III (BSD Socket Version)
(Client - Server Programming and Applications)
Douglas E. Comer / David L. Stevens
Prentice Hall 1993 ISBN 0-13-474222-2

This book discusses programming applications that use the internet protocols. It includes examples of telnet and ftp clients and servers. Discusses RPC and XDR at length.

TCP/IP Illustrated, Volume 1: The Protocols
W. Richard Stevens
(c) Addison-Wesley, 1994 ISBN 0-201-63346-9

An excellent introduction to the entire TCP/IP protocol suite, covering all the major protocols, plus several important applications.

TCP/IP Illustrated, Volume 2: The Implementation
Gary R. Wright and W. Richard Stevens
(c) Addison-Wesley, 1995 ISBN 0-201-63354-X

This is a complete, and lengthy, discussion of the internals of TCP/IP based on the Net/2 release of BSD.

Unix Network Programming
W. Richard Stevens
Prentice Hall 1990 ISBN 0-13-949876

An excellent introduction to network programming under Unix.

The Design and Implementation of the 4.3 BSD Operating System
Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, John S. Quarterman
Addison-Wesley 1989 ISBN 0-201-06196-1

Though this book is a reference for the entire operating system, the eleventh and twelfth chapters completely explain how the networking protocols are implemented in the kernel.

Stevens, W. Richard, Unix Network Programming. 1990, Prentice-Hall.

An excellent introduction to network programming under Unix. Widely cited on the Usenet bulletin boards as the "best place to start" if you want to actually learn how to write Unix programs that communicate over a network.

Rago, Steven A. Unix System V Network Programming. 1993, Addison-Wesley.

A book that covers the same kinds of topics as W. Richard Stevens' Unix Network Programming, but is more specific to Unix System V Release 4 (SVR4), and so perhaps is more useful and up to date if you are working specifically with that implementation. (Stevens' book covers Unix System V release 3.x.) There is a much more extensive coverage of Streams in Rago's book: 4 chapters, where Stevens only provides a couple of subsections. The design project at the end of the book is an implementation of SLIP.

2) Where can I find example source code for TCP/UDP/IP?
A) Code from the Internetworking with TCP/IP Volume III is available for anonymous ftp from:

arthur.cs.purdue.edu:/pub/dls

Code used in the Net-2 version of Berkeley Unix is available for anonymous ftp from:

ftp.uu.net:systems/unix/bsd-sources/sys/netinet

and

gatekeeper.dec.com:/pub/BSD/net2/sys/netinet

Code from Richard Stevens' book is available on:

ftp.uu.net:/published/books/stevens.*

Example source code and libraries to make coding quicker are available in the Simple Sockets Library written at NASA. The Simple Sockets Library makes sockets easy to use! It has been tested on: Unix (SGI, DecStation, AIX, Sun 3, Sparcstation; version 2.02+: Solaris 2.1, SCO), VMS, and MSDOS (client only since there's no background there). It is provided in source code form, of course, and sits atop Berkeley sockets and tcp/ip.

You can order the "Simple Sockets Library" from

Austin Code Works
11100 Leafwood Lane
Austin, TX 78750-3464 USA
Phone (512) 258-0785

Ask for the "SSL - The Simple Sockets Library". Last I checked, they were asking $20 US for it.

For DOS there is WATTCP.ZIP (numerous sites):

WATTCP is a DOS TCP/IP stack derived from the NCSA Telnet program and much enhanced. It comes with some example programs and complete source code. The interface isn't BSD sockets but is well suited to PC type work. It is also written so that it can be used and memory allocation).

3) Are there any public domain programs to check the performance of an IP link?

A) TTCP: Available for anonymous ftp from

wuarchive.wustl.edu:/graphics/graphics/mirrors/sgi.com/sgi/src/ttcp

On ftp.sgi.com are netperf (from Rick Jones at HP) and nettest (from Dave Borman at Cray). ttcp is also available at ftp.sgi.com.

You can get to the NetPerf home page via:

http://www.cup.hp.com/netperf/NetperfPage.html

There is a suite of Bandwidth Measuring programs from [email protected].
Available for anonymous ftp from ftp.netcom.com in ~ftp/gnn/bwmeas-0.3.tar.Z. These are several programs that measure bandwidth and jitter over several kinds of IPC links, including TCP and UDP.

4) Where do I find RFCs?

A) This is the latest info on obtaining RFCs:

Details on obtaining RFCs via FTP or EMAIL may be obtained by sending an EMAIL message to [email protected] with the message body help: ways_to_get_rfcs. For example:

To: [email protected]
Subject: getting rfcs

help: ways_to_get_rfcs

The response to this mail query is quite long and has been omitted.

RFCs can be obtained via FTP from DS.INTERNIC.NET, NIS.NSF.NET, NISC.JVNC.NET, FTP.ISI.EDU, WUARCHIVE.WUSTL.EDU, SRC.DOC.IC.AC.UK, FTP.CONCERT.NET, or FTP.SESQUI.NET.

Using Web, WAIS, and gopher:

Web:
http://web.nexor.co.uk/rfc-index/rfc-index-search-form.html

WAIS access by keyword:
wais://wais.cnam.fr/RFC

Excellent presentation with a full-text search too:
http://www.cis.ohio-state.edu/hypertext/information/rfc.html

With Gopher:
gopher://r2d2.jvnc.net/11/Internet%20Resources/RFC
gopher://muspin.gsfc.nasa.gov:4320/1g2go4%20ds.internic.net%2070%201%201/.ds/.internetdocs

5) How can I detect that the other end of a TCP connection has crashed? Can I use "keepalives" for this?

A) Detecting crashed systems over TCP/IP is difficult. TCP doesn't require any transmission over a connection if the application isn't sending anything, and many of the media over which TCP/IP is used (e.g. ethernet) don't provide a reliable way to determine whether a particular host is up. If a server doesn't hear from a client, it could be because it has nothing to say, some network between the server and client may be down, the server or client's network interface may be disconnected, or the client may have crashed.
Network failures are often temporary (a thin ethernet will appear down while someone is adding a link to the daisy chain, and it often takes a few minutes for new routes to stabilize when a router goes down), and TCP connections shouldn't be dropped as a result.

Keepalives are a feature of the sockets API that requests that an empty packet be sent periodically over an idle connection; this should evoke an acknowledgement from the remote system if it is still up, a reset if it has rebooted, and a timeout if it is down. These are not normally sent until the connection has been idle for a few hours. The purpose isn't to detect a crash immediately, but to keep unnecessary resources from being allocated forever.

If more rapid detection of remote failures is required, this should be implemented in the application protocol. There is no standard mechanism for this, but an example is requiring clients to send a "no-op" message every minute or two. An example protocol that uses this is X Display Manager Control Protocol (XDMCP), part of the X Window System, Version 11; the XDM server managing a session periodically sends a Sync command to the display server, which should evoke an application-level response, and resets the session if it doesn't get a response (this is actually an example of a poor implementation, as a timeout can occur if another client "grabs" the server for too long).

6) Can the keepalive timeouts be configured?

A) This varies by operating system. There is a program that works on many Unices (though not Linux or Solaris), called netconfig, that allows one to do this and documents many of the variables. It is available by anonymous FTP from

cs.ucsd.edu:pub/csl/Netconfig/netconfig2.2.tar.Z

In addition, Richard Stevens' TCP/IP Illustrated, Volume 1 includes a good discussion of setting the most useful variables on many platforms.
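As a rough illustration of the two approaches in questions 5 and 6, the sketch below turns on keepalives for a socket with the standard SO_KEEPALIVE option and adds a helper for an application-level "no-op" timeout. The function names (`enable_keepalive`, `keepalive_enabled`, `peer_silent_too_long`) are hypothetical and exist only for this example; the keepalive timers themselves are system-wide or platform-specific settings and are not touched here.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/* Ask TCP to probe an idle connection; returns 0 on success, -1 on error. */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Report whether SO_KEEPALIVE is set: 1 if on, 0 if off, -1 on error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) < 0)
        return -1;
    return val != 0;
}

/*
 * Application-level check for faster failure detection (hypothetical helper):
 * returns 1 if nothing, not even a "no-op" message, has been heard from the
 * peer for more than limit_seconds.
 */
int peer_silent_too_long(time_t last_heard, time_t now, double limit_seconds)
{
    return difftime(now, last_heard) > limit_seconds;
}
```

With a no-op message expected every minute, a server might call `peer_silent_too_long(last_heard, time(NULL), 120.0)` from its select(2) loop and drop the session when it returns 1.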
7) Can I set up a gateway to the Internet that translates IP addresses, so that I don't have to change all our internal addresses to an official network?

A) There's no general solution to this. Many protocols include IP addresses in the application-level data (FTP's "PORT" command is the most notable), so it isn't simply a matter of translating addresses in the IP header. Also, if the network number(s) you're using match those assigned to another organization, your gateway won't be able to communicate with that organization (RFC 1597 proposes network numbers that are reserved for private use, to avoid such conflicts, but if you're already using a different network number this won't help you).

However, if you're willing to live with limited access to the Internet from internal hosts, the "proxy" servers developed for firewalls can be used as a substitute for an address-translating gateway. See the firewall FAQ.

8) Are there object-oriented network programming tools?

A) Yes, and one such system is called ACE (ADAPTIVE Communication Environment). Here is how to get more information and the software:

OBTAINING ACE

An HTML version of this README file is available at URL http://www.cs.wustl.edu/~schmidt/ACE.html. All software and documentation is available via both anonymous ftp and the Web.

ACE is available for anonymous ftp from the ics.uci.edu (128.195.1.1) host in the gnu/C++_wrappers.tar.Z file (approximately .5 meg compressed). This release contains the source code, documentation, and example test drivers for C++ wrapper libraries.

9) What other FAQs might you want to look in?

comp.protocols.tcp-ip.ibmpc

Aboba, Bernard D. (1994) "comp.protocols.tcp-ip.ibmpc Frequently Asked Questions (FAQ)" Usenet news.answers, available via file://ftp.netcom.com/pub/ma/mailcom/IBMTCP/ibmtcp.zip, 57 pages.
comp.protocols.ppp

Archive-name: ppp-faq/part[1-8]
URL: http://cs.uni-bonn.de/ppp/part[1-8].html

comp.dcom.lans.ethernet

ftp site: dorm.rutgers.edu, pub/novell/DOCS
Ethernet Network Questions and Answers
Summarized from UseNet group comp.dcom.lans.ethernet

10) What other newsgroups deal with networking?

comp.dcom.cabling  Cabling selection, installation and use.
comp.dcom.isdn  The Integrated Services Digital Network (ISDN).
comp.dcom.lans.ethernet  Discussions of the Ethernet/IEEE 802.3 protocols.
comp.dcom.lans.fddi  Discussions of the FDDI protocol suite.
comp.dcom.lans.misc  Local area network hardware and software.
comp.dcom.lans.token-ring  Installing and using token ring networks.
comp.dcom.servers  Selecting and operating data communications servers.
comp.dcom.sys.cisco  Info on Cisco routers and bridges.
comp.dcom.sys.wellfleet  Wellfleet bridge & router systems hardware & software.
comp.protocols.ibm  Networking with IBM mainframes.
comp.protocols.iso  The ISO protocol stack.
comp.protocols.kerberos  The Kerberos authentication server.
comp.protocols.misc  Various forms and types of protocol.
comp.protocols.nfs  Discussion about the Network File System protocol.
comp.protocols.ppp  Discussion of the Internet Point to Point Protocol.
comp.protocols.smb  SMB file sharing protocol and Samba SMB server/client.
comp.protocols.tcp-ip  TCP and IP network protocols.
comp.protocols.tcp-ip.ibmpc  TCP/IP for IBM(-like) personal computers.
comp.security.misc  Security isuipment for the PC.
comp.os.ms-windows.networking.misc  Windows and other networks.
comp.os.ms-windows.networking.tcp-ip  Windows and TCP/IP networking.
comp.os.ms-windows.networking.windows  Windows' built-in networking.
comp.os.os2.networking.misc  Miscellaneous networking issues of OS/2.
comp.os.os2.networking.tcp-ip  TCP/IP under OS/2.
comp.sys.novell  Discussion of Novell Netware products.

11) Van Jacobson explains TCP congestion avoidance.
I've attached Van J's original posting on it (I seem to repost this every 6 months or so). If you want to see some real examples of this in action, take a look at Chapter 21 of my "TCP/IP Illustrated, Volume 1".

Rich Stevens

---------------------------------------------------------------------------

>From [email protected] Mon Apr 30 01:44:05 1990
To: [email protected]
Subject: modified TCP congestion avoidance algorithm
Date: Mon, 30 Apr 90 01:40:59 PDT
From: Van Jacobson <[email protected]>
Status: RO

This is a description of the modified TCP congestion avoidance algorithm that I promised at the teleconference.

BTW, on re-reading, I noticed there were several errors in Lixia's note besides the problem I noted at the teleconference. I don't know whether that's because I mis-communicated the algorithm at dinner (as I recall, I'd had some wine) or because she's convinced that TCP is ultimately irrelevant :). Either way, you will probably be disappointed if you experiment with what's in that note.

First, I should point out once again that there are two completely independent window adjustment algorithms running in the sender: Slow-start is run when the pipe is empty (i.e., when first starting or re-starting after a timeout). Its goal is to get the "ack clock" started so packets will be metered into the network at a reasonable rate. The other algorithm, congestion avoidance, is run any time but when (re-)starting and is responsible for estimating the (dynamically varying) pipesize. You will cause yourself, or me, no end of confusion if you lump these separate algorithms (as Lixia's message did).

The modifications described here are only to the congestion avoidance algorithm, not to slow-start, and they are intended to apply to large bandwidth-delay product paths (though they don't do any harm on other paths).
Remember that with regular TCP (or with slow-start/c-a TCP), throughput really starts to go to hell when the probability of packet loss is on the order of the bandwidth-delay product. E.g., you might expect a 1% packet loss rate to translate into a 1% lower throughput but for, say, a TCP connection with a 100 packet b-d p. (= window), it results in a 50-75% throughput loss. To make TCP effective on fat pipes, it would be nice if throughput degraded only as a function of loss probability rather than as the product of the loss probability and the b-d p. (Assuming, of course, that we can do this without sacrificing congestion avoidance.)

These mods do two things: (1) prevent the pipe from going empty after a loss (if the pipe doesn't go empty, you won't have to waste round-trip times re-filling it) and (2) correctly account for the amount of data actually in the pipe (since that's what congestion avoidance is supposed to be estimating and adapting to).

For (1), remember that we use a packet loss as a signal that the pipe is overfull (congested) and that packet loss can be detected one of two different ways: (a) via a retransmit timeout or (b) when some small number (3-4) of consecutive duplicate acks has been received (the "fast retransmit" algorithm). In case (a), the pipe is guaranteed to be empty so we must slow-start. In case (b), if the duplicate ack threshold is small compared to the bandwidth-delay product, we will detect the loss with the pipe almost full. I.e., given a threshold of 3 packets and an LBL-MIT bandwidth-delay of around 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75% full when fast-retransmit detects a loss (actually, until gateways start doing some sort of congestion control, the pipe is overfull when the loss is detected so at least 75% of the packets needed for ack clocking are in transit when fast-retransmit happens). Since the pipe is full, there's no need to slow-start after a fast-retransmit.
For (2), consider what a duplicate ack means: either the network duplicated a packet (i.e., the NSFNet braindead IBM token ring adapters) or the receiver got an out-of-order packet. The usual cause of out-of-order packets at the receiver is a missing packet. I.e., if there are W packets in transit and one is dropped, the receiver will get W-1 out-of-order and (4.3-tahoe TCP will) generate W-1 duplicate acks. If the `consecutive duplicates' threshold is set high enough, we can reasonably assume that duplicate acks mean dropped packets.

But there's more information in the ack: The receiver can only generate one in response to a packet arrival. I.e., a duplicate ack means that a packet has left the network (it is now cached at the receiver). If the sender is limited by the congestion window, a packet can now be sent. (The congestion window is a count of how many packets will fit in the pipe. The ack says a packet has left the pipe so a new one can be added to take its place.) To put this another way, say the current congestion window is C (i.e., C packets will fit in the pipe) and D duplicate acks have been received. Then only C-D packets are actually in the pipe and the sender wants to use a window of C+D packets to fill the pipe to its estimated capacity (C+D sent - D received = C in pipe).

So, conceptually, the slow-start/cong.avoid/fast-rexmit changes are:

- The sender's input routine is changed to set `cwnd' to `ssthresh' when the dup ack threshold is reached. [It used to set cwnd to mss to force a slow-start.] Everything else stays the same.

- The sender's output routine is changed to use an effective window of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit threshold is reached and zero except when receiving a sequence of duplicate acks.
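The two conceptual changes can be written out as small C helpers. This is a sketch under stated assumptions, not the BSD code: `effective_window` and `halve_window` are hypothetical names, windows are in bytes, and `dupacks` is a plain count. The halving step follows the arithmetic in the diff attached to the posting (half of min(snd_wnd, cwnd), rounded down to whole segments, never less than two segments).

```c
/* Smaller of two byte counts. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/*
 * Effective send window during fast retransmit:
 *   min(snd_wnd, cwnd + dupacks*mss)
 * Each duplicate ack means one segment has left the network, so the
 * congestion window is inflated by one mss per duplicate ack.
 */
unsigned long effective_window(unsigned long snd_wnd, unsigned long cwnd,
                               unsigned int dupacks, unsigned long mss)
{
    return min_ul(snd_wnd, cwnd + (unsigned long)dupacks * mss);
}

/*
 * New ssthresh (and new cwnd) when the dup ack threshold is reached:
 * half the current window in whole segments, but at least 2 segments.
 */
unsigned long halve_window(unsigned long snd_wnd, unsigned long cwnd,
                           unsigned long mss)
{
    unsigned long win = min_ul(snd_wnd, cwnd) / mss / 2;

    if (win < 2)
        win = 2;
    return win * mss;
}
```

For example, with a 1460-byte mss, a 16-segment cwnd and a large offered window, `halve_window` gives 8 segments, and after 3 duplicate acks `effective_window` allows 11 segments.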
The actual implementation is slightly different than the above because I wanted to avoid the multiply in the output routine (multiplies are expensive on some risc machines). A diff of the old and new fastrexmit code is attached (your line numbers will vary).

Note that we still do congestion avoidance (i.e., the window is reduced by 50% when we detect the packet loss). But, as long as the receiver's offered window is large enough (it needs to be at most twice the bandwidth-delay product), we continue sending packets (at exactly half the rate we were sending before the loss) even after the loss is detected so the pipe stays full at exactly the level we want and a slow-start isn't necessary.

Some algebra might make this last clear: Say U is the sequence number of the first un-acked packet and we are using a window size of W when packet U is dropped. Packets [U..U+W) are in transit. When the loss is detected, we send packet U and pull the window back to W/2. But in the round-trip time it takes the U retransmit to fill the receiver's hole and an ack to get back, W-1 dup acks will arrive (one for each packet in transit). The window is effectively inflated by one packet for each of these acks so packets [U..U+W/2+W-1) are sent. But we don't re-send packets unless we know they've been lost so the amount actually sent between the loss detection and the recovery ack is

(U+W/2+W-1) - (U+W) = W/2-1

which is exactly the amount congestion avoidance allows us to send (if we add in the rexmit of U). The recovery ack is for packet U+W so when the effective window is pulled back from W/2+W-1 to W/2 (which happens because the recovery ack is `new' and sets dupack to zero), we are allowed to send up to packet U+W+W/2 which is exactly the first packet we haven't yet sent. (I.e., there is no sudden burst of packets as the `hole' is filled.)
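The algebra can be checked with a small counting model. The sketch below is a simplification made up for illustration: it counts in whole packets, places packet U at offset 0, assumes W is even, and ignores the dup ack threshold, exactly as the algebra does. `new_packets_during_recovery` is a hypothetical name.

```c
/*
 * Count the new (never before sent) packets released between loss
 * detection and the recovery ack, for an even window of W packets.
 *
 * Offsets are relative to U. Packets [0, W) were already in flight,
 * the window is pulled back to W/2, and each of the W-1 duplicate
 * acks inflates it by one packet.
 */
unsigned int new_packets_during_recovery(unsigned int W)
{
    unsigned int half, next_unsent, sent, d;

    if (W < 2)
        return 0;

    half = W / 2;
    next_unsent = W;   /* first packet not yet sent */
    sent = 0;

    for (d = 1; d <= W - 1; d++) {
        unsigned int right_edge = half + d;  /* window is [0, right_edge) */

        while (next_unsent < right_edge) {
            next_unsent++;
            sent++;
        }
    }
    return sent;
}
```

For W = 16 the first 8 duplicate acks release nothing and the remaining 7 release one packet each, which matches W/2-1.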
Also, when sending packets between the loss detection and the recovery ack, we do nothing for the first W/2 dup acks (because they only allow us to send packets we've already sent) and the bottleneck gateway is given W/2 packet times to clean out its backlog. Thus when we start sending our W/2-1 new packets, the bottleneck queue is as empty as it can be.

[I don't know if you can get the flavor of what happens from this description -- it's hard to see without a picture. But I was delighted by how beautifully it worked -- it was like watching the innards of an engine when all the separate motions of crank, pistons and valves suddenly fit together and everything appears in exactly the right place at just the right time.]

Also note that this algorithm interoperates with old tcp's: Most pre-tahoe tcp's don't generate the dup acks on out-of-order packets. If we don't get the dup acks, fast retransmit never fires and the window is never inflated so everything happens in the old way (via timeouts). Everything works just as it did without the new algorithm (and just as slow).

If you want to simulate this, the intended environment is:

- large bandwidth-delay product (say 20 or more packets)

- receiver advertising window of two b-d p (or, equivalently, advertised window of the unloaded b-d p but two or more connections simultaneously sharing the path).

- average loss rate (from congestion or other source) less than one lost packet per round-trip-time per active connection. (The algorithm works at higher loss rate but the TCP selective ack option has to be implemented otherwise the pipe will go empty waiting to fill the second hole and throughput will once again degrade at the product of the loss rate and b-d p. With selective ack, throughput is insensitive to b-d p at any loss rate.)
And, of course, we should always remember that good engineering practise suggests a b-d p worth of buffer at each bottleneck -- less buffer and your simulation will exhibit the interesting pathologies of a poorly engineered network but will probably tell you little about the workings of the algorithm (unless the algorithm misbehaves badly under these conditions but my simulations and measurements say that it doesn't). In these days of $100/megabyte memory, I dearly hope that this particular example of bad engineering is of historical interest only.

 - Van

-----------------
*** /tmp/,RCSt1a26717	Mon Apr 30 01:35:17 1990
--- tcp_input.c	Mon Apr 30 01:33:30 1990
***************
*** 834,850 ****
                   * Kludge snd_nxt & the congestion
                   * window so we send only this one
!                  * packet.  If this packet fills the
!                  * only hole in the receiver's seq.
!                  * space, the next real ack will fully
!                  * open our window.  This means we
!                  * have to do the usual slow-start to
!                  * not overwhelm an intermediate gateway
!                  * with a burst of packets.  Leave
!                  * here with the congestion window set
!                  * to allow 2 packets on the next real
!                  * ack and the exp-to-linear thresh
!                  * set for half the current window
!                  * size (since we know we're losing at
!                  * the current window size).
                   */
                  if (tp->t_timer[TCPT_REXMT] == 0 ||
--- 834,850 ----
                   * Kludge snd_nxt & the congestion
                   * window so we send only this one
!                  * packet.
!                  *
!                  * We know we're losing at the current
!                  * window size so do congestion avoidance
!                  * (set ssthresh to half the current window
!                  * and pull our congestion window back to
!                  * the new ssthresh).
!                  *
!                  * Dup acks mean that packets have left the
!                  * network (they're now cached at the receiver)
!                  * so bump cwnd by the amount in the receiver
!                  * to keep a constant cwnd packets in the
!                  * network.
                   */
                  if (tp->t_timer[TCPT_REXMT] == 0 ||
***************
*** 853,864 ****
                  else if (++tp->t_dupacks == tcprexmtthresh) {
                      tcp_seq onxt = tp->snd_nxt;
!                     u_int win =
!                         MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
!                             tp->t_maxseg;
  
                      if (win < 2)
                          win = 2;
                      tp->snd_ssthresh = win * tp->t_maxseg;
- 
                      tp->t_timer[TCPT_REXMT] = 0;
                      tp->t_rtt = 0;
--- 853,864 ----
                  else if (++tp->t_dupacks == tcprexmtthresh) {
                      tcp_seq onxt = tp->snd_nxt;
!                     u_int win = MIN(tp->snd_wnd,
!                                     tp->snd_cwnd);
  
+                     win /= tp->t_maxseg;
+                     win >>= 1;
                      if (win < 2)
                          win = 2;
                      tp->snd_ssthresh = win * tp->t_maxseg;
                      tp->t_timer[TCPT_REXMT] = 0;
                      tp->t_rtt = 0;
***************
*** 866,873 ****
                      tp->snd_cwnd = tp->t_maxseg;
                      (void) tcp_output(tp);
! 
                      if (SEQ_GT(onxt, tp->snd_nxt))
                          tp->snd_nxt = onxt;
                      goto drop;
                  }
              } else
--- 866,879 ----
                      tp->snd_cwnd = tp->t_maxseg;
                      (void) tcp_output(tp);
!                     tp->snd_cwnd = tp->snd_ssthresh +
!                                    tp->t_maxseg *
!                                    tp->t_dupacks;
                      if (SEQ_GT(onxt, tp->snd_nxt))
                          tp->snd_nxt = onxt;
                      goto drop;
+                 } else if (tp->t_dupacks > tcprexmtthresh) {
+                     tp->snd_cwnd += tp->t_maxseg;
+                     (void) tcp_output(tp);
+                     goto drop;
                  }
              } else
***************
*** 874,877 ****
--- 880,890 ----
                  tp->t_dupacks = 0;
              break;
+         }
+         if (tp->t_dupacks) {
+             /*
+              * the congestion window was inflated to account for
+              * the other side's cached packets - retract it.
+              */
+             tp->snd_cwnd = tp->snd_ssthresh;
          }
          tp->t_dupacks = 0;
*** /tmp/,RCSt1a26725	Mon Apr 30 01:35:23 1990
--- tcp_timer.c	Mon Apr 30 00:36:29 1990
***************
*** 223,226 ****
--- 223,227 ----
          tp->snd_cwnd = tp->t_maxseg;
          tp->snd_ssthresh = win * tp->t_maxseg;
+         tp->t_dupacks = 0;
          }
          (void) tcp_output(tp);

>From [email protected] Mon Apr 30 10:37:36 1990
To: [email protected]
Subject: modified TCP congestion avoidance algorithm (correction)
Date: Mon, 30 Apr 90 10:36:12 PDT
From: Van Jacobson <[email protected]>
Status: RO

I shouldn't make last minute 'fixes'. The code I sent out last night had a small error:

*** t.c	Mon Apr 30 10:28:52 1990
--- tcp_input.c	Mon Apr 30 10:30:41 1990
***************
*** 885,893 ****
               * the congestion window was inflated to account for
               * the other side's cached packets - retract it.
               */
!             tp->snd_cwnd = tp->snd_ssthresh;
          }
-         tp->t_dupacks = 0;
          if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
              tcpstat.tcps_rcvacktoomuch++;
              goto dropafterack;
--- 885,894 ----
               * the congestion window was inflated to account for
               * the other side's cached packets - retract it.
               */
!             if (tp->snd_cwnd > tp->snd_ssthresh)
!                 tp->snd_cwnd = tp->snd_ssthresh;
!             tp->t_dupacks = 0;
          }
          if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
              tcpstat.tcps_rcvacktoomuch++;
              goto dropafterack;

12) Can I use a single bit subnet?

A) It would seem that the consensus is no. The best citable answer follows.

>From RFC1122:

"3.3.6 Broadcasts

Section 3.2.1.3 defined the four standard IP broadcast address forms:

  Limited Broadcast: {-1, -1}
  Directed Broadcast: {<Network-number>,-1}
  Subnet Directed Broadcast: {<Network-number>,<Subnet-number>,-1}
  All-Subnets Directed Broadcast: {<Network-number>,-1,-1}"

All-Subnets Directed broadcasts are being deprecated in favor of IP multicast, but were very much defined at the time RFC1122 was written. Thus a Subnet Directed Broadcast to a subnet of all ones is not distinguishable from an All-Subnets Directed Broadcast.

For those old systems that used all zeros for broadcast in IP addresses, a similar argument can be made against the subnet of all zeros.

Also, for old routing protocols like RIP, a route to subnet zero is not distinguishable from the route to the entire network number (except possibly by context).

Most of today's systems don't support variable length subnet masks (VLSM), and for such systems the above is true. However, all the major router vendors and some Unix systems (BSD 4.4 based ones) support VLSMs, and in that case the situation is more complicated :-)

With VLSMs (necessary to support CIDR, see RFC 1519), you can utilize the address space more efficiently.
Routing lookups are based on longest match, and this means that you can for instance subnet the class C net with a mask of 255.255.255.224 (27 bits) in addition to the subnet mask of 255.255.255.192 (26 bits) given above. You will then be able to use the addresses x.x.x.33 through x.x.x.62 (first three bits 001) and the addresses x.x.x.193 through x.x.x.222 (first three bits 110) with this new subnet mask. And you can continue with a subnet mask of 28 bits, etc. (Note also, by the way, that non-contiguous subnet masks are deprecated.)

This is all very nicely covered in the paper by Havard Eidnes:

  Practical Considerations for Network Address using a CIDR Block Allocation
  Proceedings of INET '93

This paper is available with anonymous FTP from aun.uninett.no:/pub/misc/eidnes-cidr.ps

The same paper, with minor revisions, is one of the articles in the special Internetworking issue of Communications of the ACM (last month, I believe).

> I have be told that some network equipment (Cisco I think was the vendor
> named) will not correctly handle subnets that violated that standard.

As far as I know cisco is one of the router vendors that do handle VLSMs correctly. Could you substantiate this claim?

Steinar Haug, SINTEF RUNIT, University of Trondheim, NORWAY
Email: [email protected]

--
George V. Neville-Neil
work: [email protected]
home: [email protected]
NIC: GN82
This signature kept blank due to the CDA.
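
The host ranges quoted in the subnet answer above (x.x.x.33 through x.x.x.62 and x.x.x.193 through x.x.x.222 under a 27-bit mask) can be checked with a little mask arithmetic. The following is only a sketch: the names subnet_range, prefix_to_mask and subnet_of are made up for this illustration and do not come from any system header, and it assumes a contiguous mask between 1 and 30 bits long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: all addresses are in host byte order. */
struct subnet_range {
    uint32_t network;    /* host bits all zero */
    uint32_t first_host; /* network + 1 */
    uint32_t last_host;  /* broadcast - 1 */
    uint32_t broadcast;  /* host bits all one */
};

/* Build a contiguous mask from a prefix length, e.g. 27 -> 255.255.255.224. */
static uint32_t prefix_to_mask(unsigned prefix_len)
{
    assert(prefix_len <= 32);
    if (prefix_len == 0)
        return 0;               /* avoid an undefined 32-bit shift */
    return (uint32_t)0xFFFFFFFFu << (32 - prefix_len);
}

/* Compute the subnet that contains addr for the given prefix length. */
static struct subnet_range subnet_of(uint32_t addr, unsigned prefix_len)
{
    struct subnet_range r;
    uint32_t mask;

    /* Sketch only covers masks that leave room for at least two hosts. */
    assert(prefix_len >= 1 && prefix_len <= 30);
    mask = prefix_to_mask(prefix_len);

    r.network = addr & mask;
    r.broadcast = r.network | ~mask;
    r.first_host = r.network + 1;
    r.last_host = r.broadcast - 1;
    return r;
}
```

Calling subnet_of() with an address whose last octet is 40 and a prefix length of 27 gives a host range ending in .33 through .62, which matches the "first three bits 001" case in the text; an address ending in .200 lands in the .193 through .222 range.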
3057.8	more tcp and socket info from UNIX notes file	HYDRA::DORHAMER	`Mon Jan 27 1997 10:26`	27
	#1 27-JAN-1997 10:24:19.80 NEWMAIL
	From: HYDRA::AXPDEVELOPER "[email protected]"
	To: US3RMC::"[email protected]"
	CC: AXPDEVELOPER
	Subj: more tcp and socket info

	Hu Rui,

	Here's another response to some of your questions:

	---------------------------------------------------------------------------
	>We have not found any good detailed material that can answer my question.
	>How the TCP is implemented in Digital Unix?

	Digital follows the RFCs. See SPD, man netintro(7), inet(7), ip(7), tcp(7), udp(7), etc., and also Network and Communications Overview.

	>Tell me currently my socket is in what state?

	You could do a "netstat -a[n]", from the shell, or even from a prog...
	---------------------------------------------------------------------------

	Karen Dorhamer
	Alpha Developer Support
3057.9	need strlen for XDR	HYDRA::DORHAMER	`Thu Jan 30 1997 09:01`	27
	From: SMTP%"[email protected]" 30-JAN-1997 02:51:19.53
	To: [email protected]
	CC:
	Subj: Thanks for TCP socket help!

	Thank you very much for your continuing help with my TCP socket questions. Now we are able to deliver the products with confidence. Your confirmation of the return value of a TCP socket read in non-blocking mode is the most important reply I got. My code depends very much on that.

	I have still another question about XDR. I use XDR to transfer a structure through the socket. My structure is very complicated and of variable length. Inside there are several linked lists. A list can have 1,000 nodes or 1 node. Currently I allocate a big enough buffer to store the XDR string. This is very inefficient when the list has only a few values. Is there any strlen function for XDR so I can know the length of the binary string?

	_________________________________________________
	Hu Rui
	R&D, SMS Unit (ASAP code A60205)
	Scandinavian Softline Technology Oy
	Tulkinkuja 3
	02600 ESPOO
	Finland
	tel. +358-9-5495 6202
	fax. +358-9-512 4629
	home tel. +358-9-2789426
	Internet: [email protected]
	http://www.softline.fi/
	_________________________________________________
3057.10	moved from note 3115.1 (misplaced note)	HYDRA::DORHAMER	`Mon Feb 03 1997 16:41`	14
	#1 30-JAN-1997 13:14:53.92 NEWMAIL
	From: HYDRA::AXPDEVELOPER "[email protected]"
	To: HYDRA::AXPDEVELOPER
	CC: AXPDEVELOPER
	Subj: RE: FWD: Thanks for TCP socket help!

	Hu Rui,

	Can you use the sizeof operator to get the size of your structure? If not, I'll check with engineering to find out the proper way to do this.

	Karen Dorhamer
	Alpha Developer Support
3057.11	resend	HYDRA::DORHAMER	`Mon Feb 03 1997 16:45`	18
	#1 3-FEB-1997 16:44:57.75 NEWMAIL
	From: HYDRA::AXPDEVELOPER "[email protected]"
	To: SMTP%"[email protected]"
	CC: AXPDEVELOPER
	Subj: RE: Thanks for TCP socket help!

	Hu Rui,

	Sorry if you have not yet received a response to your last e-mail regarding XDR and strlen. I may have sent the response to the wrong e-mail address.

	Can you use the sizeof operator to get the size of your structure? If not, I'll check with engineering to find out the proper way to do this.

	Karen Dorhamer
	Alpha Developer Support
3057.12	further clarification	HYDRA::DORHAMER	`Fri Feb 07 1997 12:45`	34
	From: SMTP%"[email protected]" 4-FEB-1997 02:33:02.01
	To: "[email protected]" <[email protected]>
	CC: Kari Kailamaki <[email protected]>
	Subj: Re: Thanks for TCP socket help!

	> Hu Rui,
	>
	> Sorry if you have not yet received a response to your last e-mail regarding
	> XDR and strlen. I may have sent the response to the wrong e-mail address.

	No, I had not received anything; it might have gone to the wrong address.

	> Can you use the sizeof operator to get the size of your structure?

	Yes, I can, and I am using this solution now.

	> If not, I'll check with engineering to find out the proper way to do this.

	But I want to know a simple solution if one exists. Currently I estimate how many bytes I need; for a string I allocate (strlen + 20), but this is not a sharp solution. What I need is a function like

	int xdrlen(xdrstring, ... structure definition)

	I want to know if DEC has coded it. I have read the whole network programming manual, but found nothing.

	Regards.
3057.13	posted in Digital_UNIX notes	HYDRA::DORHAMER	`Fri Feb 07 1997 13:04`	3
	I have posted his questions in Digital_UNIX note 8759. Karen
3057.14	response from Digital_unix note 8759	HYDRA::DORHAMER	`Fri Feb 14 1997 16:46`	34
	#1 14-FEB-1997 16:44:18.94 NEWMAIL
	From: HYDRA::AXPDEVELOPER "[email protected]"
	To: NM%US6RMC::"[email protected]"
	CC: AXPDEVELOPER
	Subj: XDR info

	Hu Rui,

	Attached is a response that I received regarding your questions about calculating the length of XDR data. I hope this helps you out.

	Karen Dorhamer
	Alpha Developer Support

	I don't know of libc routines that let you estimate how long XDR data will be. It would be fairly easy to write a new XDR module that merely counts the length of data to be encoded, and using the public domain RPC code would provide enough hints. (That has a lot of 32/64 bit issues, so it doesn't replace our libc code.)

	Some sizes (I may not be exactly right):

	xdr_bytes, xdr_string: The number of bytes, rounded up to next multiple of 4, plus 4.

	xdr_char, xdr_short, xdr_int, xdr_long, xdr_float, xdr_bool, xdr_enum: 4 bytes.

	xdr_longlong, xdr_hyper, xdr_double: 8 bytes

	Ah - here's an idea - try calling xdr_getpos before and after encoding something. The difference is the number of bytes used.

	Me? I'd just look at the messages sent via tcpdump and figure the length and structure.
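
	The size rules quoted in the reply above could be turned into a small length estimator along the following lines. This is only a sketch and not a Digital UNIX libc routine: est_xdr_bytes, est_xdr_string, demo_node and demo_list_xdr_size are hypothetical names invented for the illustration. It also assumes the linked list is encoded in the usual optional-data style, with a 4-byte "another node follows" flag in front of every node and one more flag terminating the list; a different encoding routine would need different arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Every encoded XDR item is padded to a multiple of 4 bytes. */
static size_t xdr_pad4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Fixed sizes from the list above (int, bool, enum ... and hyper, double). */
enum { EST_XDR_WORD = 4, EST_XDR_DOUBLE_WORD = 8 };

/* xdr_bytes / xdr_string: 4-byte length word plus the padded data. */
static size_t est_xdr_bytes(size_t nbytes)
{
    return EST_XDR_WORD + xdr_pad4(nbytes);
}

static size_t est_xdr_string(const char *s)
{
    assert(s != NULL);
    return est_xdr_bytes(strlen(s));
}

/* Hypothetical list node, standing in for the real (unknown) structure. */
struct demo_node {
    int id;
    const char *name;
    struct demo_node *next;
};

/*
 * Estimated encoded size of a whole list, ASSUMING each node is preceded
 * by a 4-byte "more follows" flag and the list ends with one final flag.
 */
static size_t demo_list_xdr_size(const struct demo_node *head)
{
    size_t total = 0;

    for (; head != NULL; head = head->next)
        total += EST_XDR_WORD            /* "more follows" flag */
               + EST_XDR_WORD            /* id, encoded as an int */
               + est_xdr_string(head->name);
    return total + EST_XDR_WORD;         /* terminating flag */
}
```

	An estimate like this is useful for sizing the buffer before encoding. The xdr_getpos suggestion in the reply measures the real encoded length afterwards, so comparing the two on a few sample structures would be a reasonable way to confirm that the assumptions above match the encoder actually in use.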